Multi-GPU support is disabled. Using a single GPU.
+-----------------------+----------------------------------------------------+
| Parameter             | Value                                              |
+-----------------------+----------------------------------------------------+
| train data pattern    | dev/data/fineweb10B/fineweb_train_*.bin            |
| val data pattern      | dev/data/fineweb10B/fineweb_val_*.bin              |
| output log dir        | log124M                                            |
| checkpoint_every      | 5000                                               |
| resume                | 0                                                  |
| micro batch size B    | 32                                                 |
| sequence length T     | 1024                                               |
| total batch size      | 524288                                             |
| LR scheduler          | cosine                                             |
| learning rate (LR)    | 6.000000e-04                                       |
| warmup iterations     | 700                                                |
| final LR fraction     | 0.000000e+00                                       |
| weight decay          | 1.000000e-01                                       |
| skip update lossz     | 0.000000                                           |
| skip update gradz     | 0.000000                                           |
| max_steps             | -1                                                 |
| val_loss_every        | 250                                                |
| val_max_steps         | 20                                                 |
| sample_every          | 20000                                              |
| genT                  | 64                                                 |
| overfit_single_batch  | 0                                                  |
| use_master_weights    | enabled                                            |
| gelu_fusion           | 0                                                  |
| recompute             | 0                                                  |
+-----------------------+----------------------------------------------------+
| device                | NVIDIA A100-PCIE-40GB                              |
| peak TFlops           | 312.0                                              |
| precision             | BF16                                               |
+-----------------------+----------------------------------------------------+
| weight init method    | d12                                                |
| max_sequence_length T | 1024                                               |
| vocab_size V          | 50257                                              |
| padded_vocab_size Vp  | 50304                                              |
| num_layers L          | 12                                                 |
| num_heads NH          | 12                                                 |
| channels C            | 768                                                |
| num_parameters        | 124475904                                          |
+-----------------------+----------------------------------------------------+
| train_num_batches     | 19560                                              |
| val_num_batches       | 20                                                 |
+-----------------------+----------------------------------------------------+
| run hellaswag         | yes                                                |
+-----------------------+----------------------------------------------------+
| num_processes         | 1                                                  |
| zero_stage            | 1                                                  |
+-----------------------+----------------------------------------------------+
num_parameters: 124475904 => bytes: 248951808
allocated 237 MiB for model parameters
batch_size B=32 * seq_len T=1024 * num_processes=1 and
total_batch_size=524288 => setting grad_accum_steps=16 created directory: log124M --- WARNING: Failed to open the tokenizer file gpt2_tokenizer.bin The Tokenizer is a new feature added April 14 2024. Re-run `python train_gpt2.py` to write it --- allocating 237 MiB for parameter gradients allocating 21918 MiB for activations allocating 474 MiB for AdamW optimizer state m allocating 474 MiB for AdamW optimizer state v allocating 474 MiB for master copy of params device memory usage: 24323 MiB / 40338 MiB memory per sequence: 684 MiB -> estimated maximum batch size: 55 val loss 11.008742 step 1/19560 | loss 11.010252 (+nanz)| norm 15.1374 (+nanz)| lr 8.57e-07 | 4071.60 ms | 33.2% bf16 MFU | 128767 tok/s step 2/19560 | loss 10.956968 (+nanz)| norm 15.2986 (+nanz)| lr 1.71e-06 | 4172.30 ms | 32.4% bf16 MFU | 125659 tok/s step 3/19560 | loss 10.848854 (+nanz)| norm 14.8131 (+nanz)| lr 2.57e-06 | 4078.48 ms | 33.1% bf16 MFU | 127142 tok/s step 4/19560 | loss 10.718035 (+nanz)| norm 12.9272 (+nanz)| lr 3.43e-06 | 4064.06 ms | 33.2% bf16 MFU | 127795 tok/s step 5/19560 | loss 10.560909 (+nanz)| norm 10.5613 (+nanz)| lr 4.29e-06 | 4048.10 ms | 33.4% bf16 MFU | 128259 tok/s step 6/19560 | loss 10.416231 (+nanz)| norm 8.5775 (+nanz)| lr 5.14e-06 | 4053.60 ms | 33.3% bf16 MFU | 128497 tok/s step 7/19560 | loss 10.299289 (+nanz)| norm 7.1370 (+nanz)| lr 6.00e-06 | 4046.89 ms | 33.4% bf16 MFU | 128697 tok/s step 8/19560 | loss 10.185875 (+nanz)| norm 6.2595 (+nanz)| lr 6.86e-06 | 4048.19 ms | 33.4% bf16 MFU | 128832 tok/s step 9/19560 | loss 10.062778 (+nanz)| norm 5.4449 (+nanz)| lr 7.71e-06 | 4041.57 ms | 33.4% bf16 MFU | 128964 tok/s step 10/19560 | loss 10.002686 (+nanz)| norm 4.5098 (+nanz)| lr 8.57e-06 | 4060.28 ms | 33.3% bf16 MFU | 128986 tok/s step 11/19560 | loss 9.922112 (+nanz)| norm 3.8772 (+nanz)| lr 9.43e-06 | 4050.41 ms | 33.3% bf16 MFU | 129043 tok/s step 12/19560 | loss 9.843137 (+nanz)| norm 3.4079 (+nanz)| lr 1.03e-05 | 4055.49 ms | 33.3% bf16 MFU | 129070 
tok/s step 13/19560 | loss 9.813899 (+nanz)| norm 2.9496 (+nanz)| lr 1.11e-05 | 4040.15 ms | 33.4% bf16 MFU | 129146 tok/s step 14/19560 | loss 9.751227 (+nanz)| norm 2.7037 (+nanz)| lr 1.20e-05 | 4045.81 ms | 33.4% bf16 MFU | 129192 tok/s step 15/19560 | loss 9.709184 (+nanz)| norm 2.4934 (+nanz)| lr 1.29e-05 | 4042.56 ms | 33.4% bf16 MFU | 129240 tok/s step 16/19560 | loss 9.699945 (+nanz)| norm 2.2957 (+nanz)| lr 1.37e-05 | 4052.77 ms | 33.3% bf16 MFU | 129252 tok/s step 17/19560 | loss 9.646930 (+nanz)| norm 2.2621 (+nanz)| lr 1.46e-05 | 4053.49 ms | 33.3% bf16 MFU | 129260 tok/s step 18/19560 | loss 9.651306 (+nanz)| norm 2.1829 (+nanz)| lr 1.54e-05 | 4055.86 ms | 33.3% bf16 MFU | 129261 tok/s step 19/19560 | loss 9.606390 (+nanz)| norm 2.1961 (+nanz)| lr 1.63e-05 | 4059.29 ms | 33.3% bf16 MFU | 129252 tok/s step 20/19560 | loss 9.587294 (+nanz)| norm 2.1759 (+nanz)| lr 1.71e-05 | 4056.40 ms | 33.3% bf16 MFU | 129252 tok/s step 21/19560 | loss 9.574850 (+nanz)| norm 2.1448 (+nanz)| lr 1.80e-05 | 4053.00 ms | 33.3% bf16 MFU | 129260 tok/s step 22/19560 | loss 9.529497 (+nanz)| norm 2.1576 (+nanz)| lr 1.89e-05 | 4063.23 ms | 33.2% bf16 MFU | 129243 tok/s step 23/19560 | loss 9.510900 (+nanz)| norm 2.1088 (+nanz)| lr 1.97e-05 | 4062.58 ms | 33.2% bf16 MFU | 129229 tok/s step 24/19560 | loss 9.470356 (+nanz)| norm 2.0936 (+nanz)| lr 2.06e-05 | 4066.38 ms | 33.2% bf16 MFU | 129207 tok/s step 25/19560 | loss 9.478792 (+nanz)| norm 1.9899 (+nanz)| lr 2.14e-05 | 4073.38 ms | 33.1% bf16 MFU | 129172 tok/s step 26/19560 | loss 9.438276 (+nanz)| norm 2.0026 (+nanz)| lr 2.23e-05 | 4078.09 ms | 33.1% bf16 MFU | 129130 tok/s step 27/19560 | loss 9.404680 (+nanz)| norm 1.9771 (+nanz)| lr 2.31e-05 | 4074.55 ms | 33.1% bf16 MFU | 129099 tok/s step 28/19560 | loss 9.371593 (+nanz)| norm 2.0174 (+nanz)| lr 2.40e-05 | 4079.79 ms | 33.1% bf16 MFU | 129060 tok/s step 29/19560 | loss 9.333740 (+nanz)| norm 1.9821 (+nanz)| lr 2.49e-05 | 4081.67 ms | 33.1% bf16 MFU | 129020 tok/s step 
30/19560 | loss 9.274384 (+nanz)| norm 1.9662 (+nanz)| lr 2.57e-05 | 4084.38 ms | 33.1% bf16 MFU | 128977 tok/s step 31/19560 | loss 9.276795 (+nanz)| norm 1.9990 (+nanz)| lr 2.66e-05 | 4092.83 ms | 33.0% bf16 MFU | 128921 tok/s step 32/19560 | loss 9.189783 (+nanz)| norm 2.0196 (+nanz)| lr 2.74e-05 | 4091.85 ms | 33.0% bf16 MFU | 128872 tok/s step 33/19560 | loss 9.177305 (+nanz)| norm 2.0294 (+nanz)| lr 2.83e-05 | 4093.08 ms | 33.0% bf16 MFU | 128823 tok/s step 34/19560 | loss 9.166351 (+nanz)| norm 1.8597 (+nanz)| lr 2.91e-05 | 4116.85 ms | 32.8% bf16 MFU | 128733 tok/s step 35/19560 | loss 9.104334 (+nanz)| norm 1.8565 (+nanz)| lr 3.00e-05 | 4093.19 ms | 33.0% bf16 MFU | 128694 tok/s step 36/19560 | loss 9.059115 (+nanz)| norm 1.9520 (+nanz)| lr 3.09e-05 | 4115.30 ms | 32.8% bf16 MFU | 128616 tok/s step 37/19560 | loss 9.031851 (+nanz)| norm 1.8381 (+nanz)| lr 3.17e-05 | 4106.38 ms | 32.9% bf16 MFU | 128561 tok/s step 38/19560 | loss 8.993953 (+nanz)| norm 1.7589 (+nanz)| lr 3.26e-05 | 4099.23 ms | 32.9% bf16 MFU | 128522 tok/s step 39/19560 | loss 8.993319 (+nanz)| norm 1.7563 (+nanz)| lr 3.34e-05 | 4123.78 ms | 32.7% bf16 MFU | 128441 tok/s step 40/19560 | loss 8.945615 (+nanz)| norm 1.9749 (+nanz)| lr 3.43e-05 | 4107.50 ms | 32.9% bf16 MFU | 128395 tok/s step 41/19560 | loss 8.925097 (+nanz)| norm 1.8322 (+nanz)| lr 3.51e-05 | 4107.45 ms | 32.9% bf16 MFU | 128352 tok/s step 42/19560 | loss 8.840963 (+nanz)| norm 1.6473 (+nanz)| lr 3.60e-05 | 4112.16 ms | 32.8% bf16 MFU | 128303 tok/s step 43/19560 | loss 8.828307 (+nanz)| norm 1.8796 (+nanz)| lr 3.69e-05 | 4118.68 ms | 32.8% bf16 MFU | 128246 tok/s step 44/19560 | loss 8.798753 (+nanz)| norm 2.0346 (+nanz)| lr 3.77e-05 | 4108.38 ms | 32.9% bf16 MFU | 128211 tok/s step 45/19560 | loss 8.754614 (+nanz)| norm 1.7787 (+nanz)| lr 3.86e-05 | 4111.37 ms | 32.8% bf16 MFU | 128172 tok/s step 46/19560 | loss 8.743908 (+nanz)| norm 1.5513 (+nanz)| lr 3.94e-05 | 4115.38 ms | 32.8% bf16 MFU | 128129 tok/s step 47/19560 | 
loss 8.667839 (+nanz)| norm 1.7076 (+nanz)| lr 4.03e-05 | 4107.06 ms | 32.9% bf16 MFU | 128103 tok/s step 48/19560 | loss 8.625178 (+nanz)| norm 1.8846 (+nanz)| lr 4.11e-05 | 4125.62 ms | 32.7% bf16 MFU | 128047 tok/s step 49/19560 | loss 8.604111 (+nanz)| norm 1.6329 (+nanz)| lr 4.20e-05 | 4113.83 ms | 32.8% bf16 MFU | 128014 tok/s step 50/19560 | loss 8.599235 (+nanz)| norm 1.5363 (+nanz)| lr 4.29e-05 | 4117.21 ms | 32.8% bf16 MFU | 127977 tok/s step 51/19560 | loss 8.553820 (+nanz)| norm 1.5498 (+nanz)| lr 4.37e-05 | 4115.13 ms | 32.8% bf16 MFU | 127946 tok/s step 52/19560 | loss 8.454524 (+nanz)| norm 1.6064 (+nanz)| lr 4.46e-05 | 4123.18 ms | 32.7% bf16 MFU | 127904 tok/s step 53/19560 | loss 8.464828 (+nanz)| norm 1.6133 (+nanz)| lr 4.54e-05 | 4114.36 ms | 32.8% bf16 MFU | 127878 tok/s step 54/19560 | loss 8.432864 (+nanz)| norm 1.5460 (+nanz)| lr 4.63e-05 | 4118.08 ms | 32.8% bf16 MFU | 127848 tok/s step 55/19560 | loss 8.429247 (+nanz)| norm 1.4434 (+nanz)| lr 4.71e-05 | 4113.39 ms | 32.8% bf16 MFU | 127827 tok/s step 56/19560 | loss 8.354961 (+nanz)| norm 1.5806 (+nanz)| lr 4.80e-05 | 4118.59 ms | 32.8% bf16 MFU | 127799 tok/s step 57/19560 | loss 8.295547 (+nanz)| norm 1.6289 (+nanz)| lr 4.89e-05 | 4131.72 ms | 32.7% bf16 MFU | 127751 tok/s step 58/19560 | loss 8.285473 (+nanz)| norm 1.5636 (+nanz)| lr 4.97e-05 | 4121.85 ms | 32.8% bf16 MFU | 127722 tok/s step 59/19560 | loss 8.217405 (+nanz)| norm 1.5603 (+nanz)| lr 5.06e-05 | 4124.63 ms | 32.7% bf16 MFU | 127690 tok/s step 60/19560 | loss 8.237095 (+nanz)| norm 1.3580 (+nanz)| lr 5.14e-05 | 4117.85 ms | 32.8% bf16 MFU | 127670 tok/s step 61/19560 | loss 8.181175 (+nanz)| norm 1.5190 (+nanz)| lr 5.23e-05 | 4125.38 ms | 32.7% bf16 MFU | 127640 tok/s step 62/19560 | loss 8.111239 (+nanz)| norm 1.5719 (+nanz)| lr 5.31e-05 | 4115.66 ms | 32.8% bf16 MFU | 127627 tok/s step 63/19560 | loss 8.097012 (+nanz)| norm 1.5858 (+nanz)| lr 5.40e-05 | 4123.13 ms | 32.7% bf16 MFU | 127602 tok/s step 64/19560 | loss 
8.010080 (+nanz)| norm 1.4264 (+nanz)| lr 5.49e-05 | 4119.49 ms | 32.8% bf16 MFU | 127585 tok/s step 65/19560 | loss 8.007753 (+nanz)| norm 1.3238 (+nanz)| lr 5.57e-05 | 4206.50 ms | 32.1% bf16 MFU | 127432 tok/s step 66/19560 | loss 8.009417 (+nanz)| norm 1.4227 (+nanz)| lr 5.66e-05 | 4121.05 ms | 32.8% bf16 MFU | 127421 tok/s step 67/19560 | loss 7.949610 (+nanz)| norm 1.2847 (+nanz)| lr 5.74e-05 | 4133.23 ms | 32.7% bf16 MFU | 127391 tok/s step 68/19560 | loss 7.929121 (+nanz)| norm 1.2304 (+nanz)| lr 5.83e-05 | 4130.16 ms | 32.7% bf16 MFU | 127368 tok/s step 69/19560 | loss 7.881710 (+nanz)| norm 1.3918 (+nanz)| lr 5.91e-05 | 4143.92 ms | 32.6% bf16 MFU | 127324 tok/s step 70/19560 | loss 7.857580 (+nanz)| norm 1.2736 (+nanz)| lr 6.00e-05 | 4135.22 ms | 32.7% bf16 MFU | 127296 tok/s step 71/19560 | loss 7.789918 (+nanz)| norm 1.4670 (+nanz)| lr 6.09e-05 | 4126.50 ms | 32.7% bf16 MFU | 127284 tok/s step 72/19560 | loss 7.779979 (+nanz)| norm 1.2948 (+nanz)| lr 6.17e-05 | 4118.28 ms | 32.8% bf16 MFU | 127285 tok/s step 73/19560 | loss 7.801703 (+nanz)| norm 1.1145 (+nanz)| lr 6.26e-05 | 4117.75 ms | 32.8% bf16 MFU | 127287 tok/s step 74/19560 | loss 7.726159 (+nanz)| norm 1.3455 (+nanz)| lr 6.34e-05 | 4120.99 ms | 32.8% bf16 MFU | 127284 tok/s step 75/19560 | loss 7.745239 (+nanz)| norm 1.0780 (+nanz)| lr 6.43e-05 | 4123.08 ms | 32.7% bf16 MFU | 127278 tok/s step 76/19560 | loss 7.658813 (+nanz)| norm 1.3200 (+nanz)| lr 6.51e-05 | 4117.75 ms | 32.8% bf16 MFU | 127280 tok/s step 77/19560 | loss 7.627581 (+nanz)| norm 1.3540 (+nanz)| lr 6.60e-05 | 4132.92 ms | 32.7% bf16 MFU | 127258 tok/s step 78/19560 | loss 7.599282 (+nanz)| norm 1.2295 (+nanz)| lr 6.69e-05 | 4128.06 ms | 32.7% bf16 MFU | 127245 tok/s step 79/19560 | loss 7.551337 (+nanz)| norm 1.0759 (+nanz)| lr 6.77e-05 | 4122.64 ms | 32.8% bf16 MFU | 127242 tok/s step 80/19560 | loss 7.512386 (+nanz)| norm 1.1204 (+nanz)| lr 6.86e-05 | 4168.03 ms | 32.4% bf16 MFU | 127168 tok/s step 81/19560 | loss 7.509316 
(+nanz)| norm 1.0917 (+nanz)| lr 6.94e-05 | 4124.38 ms | 32.7% bf16 MFU | 127165 tok/s step 82/19560 | loss 7.473096 (+nanz)| norm 1.2228 (+nanz)| lr 7.03e-05 | 4129.65 ms | 32.7% bf16 MFU | 127155 tok/s step 83/19560 | loss 7.487222 (+nanz)| norm 1.0560 (+nanz)| lr 7.11e-05 | 4120.78 ms | 32.8% bf16 MFU | 127159 tok/s step 84/19560 | loss 7.500445 (+nanz)| norm 1.0092 (+nanz)| lr 7.20e-05 | 4125.32 ms | 32.7% bf16 MFU | 127155 tok/s step 85/19560 | loss 7.400823 (+nanz)| norm 1.2149 (+nanz)| lr 7.29e-05 | 4123.47 ms | 32.7% bf16 MFU | 127155 tok/s step 86/19560 | loss 7.387080 (+nanz)| norm 0.9537 (+nanz)| lr 7.37e-05 | 4126.35 ms | 32.7% bf16 MFU | 127150 tok/s step 87/19560 | loss 7.333549 (+nanz)| norm 0.9862 (+nanz)| lr 7.46e-05 | 4126.25 ms | 32.7% bf16 MFU | 127145 tok/s step 88/19560 | loss 7.360038 (+nanz)| norm 0.8684 (+nanz)| lr 7.54e-05 | 4121.73 ms | 32.8% bf16 MFU | 127148 tok/s step 89/19560 | loss 7.374776 (+nanz)| norm 0.9763 (+nanz)| lr 7.63e-05 | 4128.91 ms | 32.7% bf16 MFU | 127140 tok/s step 90/19560 | loss 7.322541 (+nanz)| norm 0.8711 (+nanz)| lr 7.71e-05 | 4148.36 ms | 32.5% bf16 MFU | 127101 tok/s step 91/19560 | loss 7.249904 (+nanz)| norm 1.1522 (+nanz)| lr 7.80e-05 | 4127.99 ms | 32.7% bf16 MFU | 127097 tok/s step 92/19560 | loss 7.254242 (+nanz)| norm 0.8810 (+nanz)| lr 7.89e-05 | 4123.87 ms | 32.7% bf16 MFU | 127099 tok/s step 93/19560 | loss 7.256532 (+nanz)| norm 0.8510 (+nanz)| lr 7.97e-05 | 4131.24 ms | 32.7% bf16 MFU | 127089 tok/s step 94/19560 | loss 7.270530 (+nanz)| norm 0.9430 (+nanz)| lr 8.06e-05 | 4141.09 ms | 32.6% bf16 MFU | 127065 tok/s step 95/19560 | loss 7.200323 (+nanz)| norm 0.8393 (+nanz)| lr 8.14e-05 | 4134.57 ms | 32.7% bf16 MFU | 127052 tok/s step 96/19560 | loss 7.198968 (+nanz)| norm 0.8211 (+nanz)| lr 8.23e-05 | 4139.58 ms | 32.6% bf16 MFU | 127032 tok/s step 97/19560 | loss 7.138680 (+nanz)| norm 0.9847 (+nanz)| lr 8.31e-05 | 4134.13 ms | 32.7% bf16 MFU | 127021 tok/s step 98/19560 | loss 7.168355 (+nanz)| 
norm 1.0598 (+nanz)| lr 8.40e-05 | 4136.78 ms | 32.6% bf16 MFU | 127007 tok/s step 99/19560 | loss 7.150956 (+nanz)| norm 0.9048 (+nanz)| lr 8.49e-05 | 4130.92 ms | 32.7% bf16 MFU | 127002 tok/s step 100/19560 | loss 7.107867 (+nanz)| norm 0.8493 (+nanz)| lr 8.57e-05 | 4129.94 ms | 32.7% bf16 MFU | 126999 tok/s step 101/19560 | loss 7.128820 (+nanz)| norm 0.8135 (+nanz)| lr 8.66e-05 | 4139.27 ms | 32.6% bf16 MFU | 126982 tok/s step 102/19560 | loss 7.098422 (+nanz)| norm 1.0754 (+nanz)| lr 8.74e-05 | 4138.47 ms | 32.6% bf16 MFU | 126968 tok/s step 103/19560 | loss 7.006820 (+nanz)| norm 1.0624 (+nanz)| lr 8.83e-05 | 4126.48 ms | 32.7% bf16 MFU | 126972 tok/s step 104/19560 | loss 7.069552 (+nanz)| norm 1.1098 (+nanz)| lr 8.91e-05 | 4123.41 ms | 32.7% bf16 MFU | 126981 tok/s step 105/19560 | loss 7.045156 (+nanz)| norm 1.3595 (+nanz)| lr 9.00e-05 | 4164.76 ms | 32.4% bf16 MFU | 126926 tok/s step 106/19560 | loss 7.021362 (+nanz)| norm 0.8847 (+nanz)| lr 9.09e-05 | 4146.61 ms | 32.6% bf16 MFU | 126901 tok/s step 107/19560 | loss 7.036779 (+nanz)| norm 0.8794 (+nanz)| lr 9.17e-05 | 4137.46 ms | 32.6% bf16 MFU | 126892 tok/s step 108/19560 | loss 7.005927 (+nanz)| norm 0.9442 (+nanz)| lr 9.26e-05 | 4164.05 ms | 32.4% bf16 MFU | 126843 tok/s step 109/19560 | loss 6.963003 (+nanz)| norm 1.1256 (+nanz)| lr 9.34e-05 | 5562.42 ms | 24.3% bf16 MFU | 125207 tok/s step 110/19560 | loss 7.036716 (+nanz)| norm 1.4282 (+nanz)| lr 9.43e-05 | 4132.98 ms | 32.7% bf16 MFU | 125290 tok/s step 111/19560 | loss 6.987480 (+nanz)| norm 0.6036 (+nanz)| lr 9.51e-05 | 4127.29 ms | 32.7% bf16 MFU | 125377 tok/s step 112/19560 | loss 6.992048 (+nanz)| norm 1.6312 (+nanz)| lr 9.60e-05 | 4134.20 ms | 32.7% bf16 MFU | 125449 tok/s step 113/19560 | loss 6.982272 (+nanz)| norm 1.0781 (+nanz)| lr 9.69e-05 | 4128.14 ms | 32.7% bf16 MFU | 125527 tok/s step 114/19560 | loss 6.868057 (+nanz)| norm 0.8829 (+nanz)| lr 9.77e-05 | 4142.65 ms | 32.6% bf16 MFU | 125579 tok/s step 115/19560 | loss 6.885142 
(+nanz)| norm 1.0602 (+nanz)| lr 9.86e-05 | 4137.60 ms | 32.6% bf16 MFU | 125636 tok/s step 116/19560 | loss 6.908633 (+nanz)| norm 0.7695 (+nanz)| lr 9.94e-05 | 4140.18 ms | 32.6% bf16 MFU | 125686 tok/s step 117/19560 | loss 6.870278 (+nanz)| norm 0.7609 (+nanz)| lr 1.00e-04 | 4132.77 ms | 32.7% bf16 MFU | 125745 tok/s step 118/19560 | loss 6.911804 (+nanz)| norm 1.0809 (+nanz)| lr 1.01e-04 | 4134.71 ms | 32.7% bf16 MFU | 125798 tok/s step 119/19560 | loss 6.867452 (+nanz)| norm 0.8105 (+nanz)| lr 1.02e-04 | 4133.84 ms | 32.7% bf16 MFU | 125849 tok/s step 120/19560 | loss 6.828675 (+nanz)| norm 0.7539 (+nanz)| lr 1.03e-04 | 4143.11 ms | 32.6% bf16 MFU | 125884 tok/s step 121/19560 | loss 6.831336 (+nanz)| norm 0.6586 (+nanz)| lr 1.04e-04 | 4138.32 ms | 32.6% bf16 MFU | 125925 tok/s step 122/19560 | loss 6.853160 (+nanz)| norm 0.6669 (+nanz)| lr 1.05e-04 | 4128.08 ms | 32.7% bf16 MFU | 125979 tok/s step 123/19560 | loss 6.857122 (+nanz)| norm 1.0042 (+nanz)| lr 1.05e-04 | 4138.74 ms | 32.6% bf16 MFU | 126014 tok/s step 124/19560 | loss 6.844392 (+nanz)| norm 1.6934 (+nanz)| lr 1.06e-04 | 4140.37 ms | 32.6% bf16 MFU | 126045 tok/s step 125/19560 | loss 6.855590 (+nanz)| norm 1.3942 (+nanz)| lr 1.07e-04 | 4146.03 ms | 32.6% bf16 MFU | 126065 tok/s step 126/19560 | loss 6.860312 (+nanz)| norm 0.8997 (+nanz)| lr 1.08e-04 | 4147.23 ms | 32.6% bf16 MFU | 126083 tok/s step 127/19560 | loss 6.818297 (+nanz)| norm 1.3468 (+nanz)| lr 1.09e-04 | 4179.45 ms | 32.3% bf16 MFU | 126051 tok/s step 128/19560 | loss 6.865798 (+nanz)| norm 0.8838 (+nanz)| lr 1.10e-04 | 4135.21 ms | 32.7% bf16 MFU | 126088 tok/s step 129/19560 | loss 6.810286 (-1.25z)| norm 1.1167 (-0.39z)| lr 1.11e-04 | 4144.80 ms | 32.6% bf16 MFU | 126108 tok/s step 130/19560 | loss 6.768888 (-1.28z)| norm 0.9466 (-0.48z)| lr 1.11e-04 | 4140.35 ms | 32.6% bf16 MFU | 126134 tok/s step 131/19560 | loss 6.739851 (-1.30z)| norm 0.8086 (-0.59z)| lr 1.12e-04 | 4159.21 ms | 32.5% bf16 MFU | 126130 tok/s step 132/19560 | 
loss 6.775944 (-1.25z)| norm 0.6403 (-0.78z)| lr 1.13e-04 | 4176.41 ms | 32.3% bf16 MFU | 126100 tok/s step 133/19560 | loss 6.807941 (-1.21z)| norm 0.7068 (-0.82z)| lr 1.14e-04 | 4364.14 ms | 30.9% bf16 MFU | 125802 tok/s step 134/19560 | loss 6.744745 (-1.26z)| norm 0.7560 (-0.85z)| lr 1.15e-04 | 4139.33 ms | 32.6% bf16 MFU | 125845 tok/s step 135/19560 | loss 6.710180 (-1.29z)| norm 0.7690 (-0.91z)| lr 1.16e-04 | 4135.25 ms | 32.7% bf16 MFU | 125892 tok/s step 136/19560 | loss 6.668845 (-1.31z)| norm 1.1925 (-0.41z)| lr 1.17e-04 | 4153.64 ms | 32.5% bf16 MFU | 125908 tok/s step 137/19560 | loss 6.702995 (-1.27z)| norm 1.0176 (-0.70z)| lr 1.17e-04 | 4129.79 ms | 32.7% bf16 MFU | 125961 tok/s step 138/19560 | loss 6.676449 (-1.28z)| norm 0.9268 (-0.88z)| lr 1.18e-04 | 4144.52 ms | 32.6% bf16 MFU | 125988 tok/s step 139/19560 | loss 6.593547 (-1.35z)| norm 0.6808 (-1.37z)| lr 1.19e-04 | 4175.58 ms | 32.3% bf16 MFU | 125966 tok/s step 140/19560 | loss 6.619243 (-1.31z)| norm 0.9924 (-0.79z)| lr 1.20e-04 | 4191.81 ms | 32.2% bf16 MFU | 125922 tok/s step 141/19560 | loss 6.619309 (-1.29z)| norm 1.2171 (-0.32z)| lr 1.21e-04 | 4147.42 ms | 32.6% bf16 MFU | 125946 tok/s step 142/19560 | loss 6.630524 (-1.26z)| norm 0.8538 (-1.08z)| lr 1.22e-04 | 4134.88 ms | 32.7% bf16 MFU | 125989 tok/s step 143/19560 | loss 6.661499 (-1.22z)| norm 0.8521 (-1.08z)| lr 1.23e-04 | 4141.60 ms | 32.6% bf16 MFU | 126019 tok/s step 144/19560 | loss 6.573740 (-1.30z)| norm 0.8105 (-1.16z)| lr 1.23e-04 | 4130.51 ms | 32.7% bf16 MFU | 126065 tok/s step 145/19560 | loss 6.594590 (-1.26z)| norm 0.6270 (-1.55z)| lr 1.24e-04 | 4143.68 ms | 32.6% bf16 MFU | 126088 tok/s step 146/19560 | loss 6.599759 (-1.24z)| norm 0.7985 (-1.15z)| lr 1.25e-04 | 4148.28 ms | 32.5% bf16 MFU | 126103 tok/s step 147/19560 | loss 6.587889 (-1.24z)| norm 0.5757 (-1.64z)| lr 1.26e-04 | 4141.34 ms | 32.6% bf16 MFU | 126127 tok/s step 148/19560 | loss 6.622248 (-1.19z)| norm 0.6137 (-1.53z)| lr 1.27e-04 | 4143.46 ms | 32.6% 
bf16 MFU | 126148 tok/s step 149/19560 | loss 6.604148 (-1.20z)| norm 0.9592 (-0.73z)| lr 1.28e-04 | 4147.77 ms | 32.6% bf16 MFU | 126161 tok/s step 150/19560 | loss 6.590884 (-1.20z)| norm 0.9351 (-0.77z)| lr 1.29e-04 | 4149.97 ms | 32.5% bf16 MFU | 126169 tok/s step 151/19560 | loss 6.596620 (-1.18z)| norm 1.3715 (+0.28z)| lr 1.29e-04 | 4155.34 ms | 32.5% bf16 MFU | 126169 tok/s step 152/19560 | loss 6.557407 (-1.21z)| norm 1.0228 (-0.55z)| lr 1.30e-04 | 4130.79 ms | 32.7% bf16 MFU | 126207 tok/s step 153/19560 | loss 6.538983 (-1.22z)| norm 1.4602 (+0.53z)| lr 1.31e-04 | 4138.60 ms | 32.6% bf16 MFU | 126231 tok/s step 154/19560 | loss 6.545948 (-1.20z)| norm 0.7711 (-1.15z)| lr 1.32e-04 | 4138.96 ms | 32.6% bf16 MFU | 126253 tok/s step 155/19560 | loss 6.582598 (-1.15z)| norm 0.8079 (-1.05z)| lr 1.33e-04 | 4139.74 ms | 32.6% bf16 MFU | 126273 tok/s step 156/19560 | loss 6.573215 (-1.15z)| norm 0.7911 (-1.08z)| lr 1.34e-04 | 4146.82 ms | 32.6% bf16 MFU | 126281 tok/s step 157/19560 | loss 6.564999 (-1.15z)| norm 0.9722 (-0.61z)| lr 1.35e-04 | 4143.56 ms | 32.6% bf16 MFU | 126293 tok/s step 158/19560 | loss 6.514930 (-1.20z)| norm 0.9750 (-0.59z)| lr 1.35e-04 | 4156.23 ms | 32.5% bf16 MFU | 126286 tok/s step 159/19560 | loss 6.514951 (-1.19z)| norm 1.4323 (+0.63z)| lr 1.36e-04 | 4153.94 ms | 32.5% bf16 MFU | 126282 tok/s step 160/19560 | loss 6.521049 (-1.17z)| norm 0.7397 (-1.20z)| lr 1.37e-04 | 9832.78 ms | 13.7% bf16 MFU | 122633 tok/s step 161/19560 | loss 6.544691 (-1.13z)| norm 1.2959 (+0.32z)| lr 1.38e-04 | 4134.78 ms | 32.7% bf16 MFU | 122841 tok/s step 162/19560 | loss 6.545934 (-1.12z)| norm 1.3554 (+0.50z)| lr 1.39e-04 | 4131.40 ms | 32.7% bf16 MFU | 123045 tok/s step 163/19560 | loss 6.433540 (-1.26z)| norm 1.0449 (-0.35z)| lr 1.40e-04 | 4142.97 ms | 32.6% bf16 MFU | 123220 tok/s step 164/19560 | loss 6.577137 (-1.06z)| norm 0.9001 (-0.75z)| lr 1.41e-04 | 4135.67 ms | 32.6% bf16 MFU | 123397 tok/s step 165/19560 | loss 6.492131 (-1.16z)| norm 0.9030 
(-0.73z)| lr 1.41e-04 | 4159.07 ms | 32.5% bf16 MFU | 123531 tok/s step 166/19560 | loss 6.480570 (-1.17z)| norm 1.0924 (-0.17z)| lr 1.42e-04 | 4133.99 ms | 32.7% bf16 MFU | 123695 tok/s step 167/19560 | loss 6.597379 (-1.00z)| norm 0.8442 (-0.88z)| lr 1.43e-04 | 4134.77 ms | 32.7% bf16 MFU | 123850 tok/s step 168/19560 | loss 6.407262 (-1.26z)| norm 1.1589 (+0.07z)| lr 1.44e-04 | 4130.40 ms | 32.7% bf16 MFU | 124005 tok/s step 169/19560 | loss 6.451786 (-1.18z)| norm 1.0007 (-0.40z)| lr 1.45e-04 | 4129.23 ms | 32.7% bf16 MFU | 124153 tok/s step 170/19560 | loss 6.512338 (-1.09z)| norm 1.2302 (+0.33z)| lr 1.46e-04 | 4142.69 ms | 32.6% bf16 MFU | 124273 tok/s step 171/19560 | loss 6.479221 (-1.13z)| norm 1.2104 (+0.29z)| lr 1.47e-04 | 4131.64 ms | 32.7% bf16 MFU | 124404 tok/s step 172/19560 | loss 6.466462 (-1.14z)| norm 0.9801 (-0.44z)| lr 1.47e-04 | 4127.49 ms | 32.7% bf16 MFU | 124535 tok/s step 173/19560 | loss 6.480914 (-1.11z)| norm 1.5831 (+1.58z)| lr 1.48e-04 | 4137.46 ms | 32.6% bf16 MFU | 124644 tok/s step 174/19560 | loss 6.425045 (-1.19z)| norm 0.8224 (-0.95z)| lr 1.49e-04 | 4134.28 ms | 32.7% bf16 MFU | 124753 tok/s step 175/19560 | loss 6.458856 (-1.13z)| norm 0.9372 (-0.55z)| lr 1.50e-04 | 4140.16 ms | 32.6% bf16 MFU | 124847 tok/s step 176/19560 | loss 6.416730 (-1.19z)| norm 0.8892 (-0.71z)| lr 1.51e-04 | 4133.70 ms | 32.7% bf16 MFU | 124946 tok/s step 177/19560 | loss 6.399880 (-1.21z)| norm 0.9698 (-0.41z)| lr 1.52e-04 | 4142.29 ms | 32.6% bf16 MFU | 125028 tok/s step 178/19560 | loss 6.431341 (-1.15z)| norm 1.1826 (+0.36z)| lr 1.53e-04 | 4134.65 ms | 32.7% bf16 MFU | 125116 tok/s step 179/19560 | loss 6.490291 (-1.04z)| norm 0.8794 (-0.72z)| lr 1.53e-04 | 4137.53 ms | 32.6% bf16 MFU | 125196 tok/s step 180/19560 | loss 6.424993 (-1.15z)| norm 1.4857 (+1.49z)| lr 1.54e-04 | 4129.93 ms | 32.7% bf16 MFU | 125284 tok/s step 181/19560 | loss 6.439236 (-1.12z)| norm 1.2802 (+0.76z)| lr 1.55e-04 | 4154.58 ms | 32.5% bf16 MFU | 125329 tok/s step 
182/19560 | loss 6.430534 (-1.12z)| norm 0.9441 (-0.47z)| lr 1.56e-04 | 4135.43 ms | 32.6% bf16 MFU | 125402 tok/s step 183/19560 | loss 6.463459 (-1.06z)| norm 0.7323 (-1.24z)| lr 1.57e-04 | 4130.02 ms | 32.7% bf16 MFU | 125479 tok/s step 184/19560 | loss 6.406823 (-1.16z)| norm 0.7956 (-0.99z)| lr 1.58e-04 | 4139.00 ms | 32.6% bf16 MFU | 125539 tok/s step 185/19560 | loss 6.465394 (-1.04z)| norm 0.8709 (-0.70z)| lr 1.59e-04 | 4134.43 ms | 32.7% bf16 MFU | 125602 tok/s step 186/19560 | loss 6.407650 (-1.15z)| norm 1.4774 (+1.65z)| lr 1.59e-04 | 4130.49 ms | 32.7% bf16 MFU | 125669 tok/s step 187/19560 | loss 6.471938 (-1.01z)| norm 0.7420 (-1.18z)| lr 1.60e-04 | 4132.78 ms | 32.7% bf16 MFU | 125728 tok/s step 188/19560 | loss 6.396870 (-1.16z)| norm 0.7474 (-1.14z)| lr 1.61e-04 | 4139.36 ms | 32.6% bf16 MFU | 125775 tok/s step 189/19560 | loss 6.409274 (-1.13z)| norm 1.2253 (+0.74z)| lr 1.62e-04 | 4127.34 ms | 32.7% bf16 MFU | 125838 tok/s step 190/19560 | loss 6.367773 (-1.21z)| norm 1.3390 (+1.21z)| lr 1.63e-04 | 4131.61 ms | 32.7% bf16 MFU | 125891 tok/s step 191/19560 | loss 6.405262 (-1.12z)| norm 1.1014 (+0.28z)| lr 1.64e-04 | 4162.22 ms | 32.4% bf16 MFU | 125894 tok/s step 192/19560 | loss 6.398538 (-1.13z)| norm 0.9216 (-0.44z)| lr 1.65e-04 | 4153.65 ms | 32.5% bf16 MFU | 125911 tok/s step 193/19560 | loss 6.377398 (-1.17z)| norm 0.8735 (-0.63z)| lr 1.65e-04 | 4138.95 ms | 32.6% bf16 MFU | 125949 tok/s step 194/19560 | loss 6.370158 (-1.18z)| norm 0.7674 (-1.05z)| lr 1.66e-04 | 4146.31 ms | 32.6% bf16 MFU | 125974 tok/s step 195/19560 | loss 6.427714 (-1.04z)| norm 0.9130 (-0.43z)| lr 1.67e-04 | 4132.73 ms | 32.7% bf16 MFU | 126018 tok/s step 196/19560 | loss 6.416325 (-1.07z)| norm 0.7803 (-0.97z)| lr 1.68e-04 | 4150.56 ms | 32.5% bf16 MFU | 126033 tok/s step 197/19560 | loss 6.389481 (-1.12z)| norm 0.9087 (-0.43z)| lr 1.69e-04 | 4134.19 ms | 32.7% bf16 MFU | 126072 tok/s step 198/19560 | loss 6.371243 (-1.16z)| norm 0.7209 (-1.20z)| lr 1.70e-04 | 4146.31 
ms | 32.6% bf16 MFU | 126091 tok/s step 199/19560 | loss 6.306969 (-1.32z)| norm 0.8664 (-0.57z)| lr 1.71e-04 | 4143.55 ms | 32.6% bf16 MFU | 126113 tok/s step 200/19560 | loss 6.307517 (-1.31z)| norm 1.0371 (+0.17z)| lr 1.71e-04 | 4138.76 ms | 32.6% bf16 MFU | 126141 tok/s step 201/19560 | loss 6.335395 (-1.24z)| norm 1.1116 (+0.49z)| lr 1.72e-04 | 4134.61 ms | 32.7% bf16 MFU | 126174 tok/s step 202/19560 | loss 6.325506 (-1.26z)| norm 0.8990 (-0.42z)| lr 1.73e-04 | 4134.61 ms | 32.7% bf16 MFU | 126206 tok/s step 203/19560 | loss 6.311151 (-1.29z)| norm 1.1641 (+0.73z)| lr 1.74e-04 | 4138.07 ms | 32.6% bf16 MFU | 126231 tok/s step 204/19560 | loss 6.321631 (-1.26z)| norm 0.8216 (-0.74z)| lr 1.75e-04 | 4142.39 ms | 32.6% bf16 MFU | 126247 tok/s step 205/19560 | loss 6.298928 (-1.31z)| norm 0.7151 (-1.19z)| lr 1.76e-04 | 4133.30 ms | 32.7% bf16 MFU | 126277 tok/s step 206/19560 | loss 6.271302 (-1.39z)| norm 0.5021 (-2.07z)| lr 1.77e-04 | 4134.93 ms | 32.7% bf16 MFU | 126303 tok/s step 207/19560 | loss 6.360522 (-1.12z)| norm 0.6734 (-1.31z)| lr 1.77e-04 | 4130.16 ms | 32.7% bf16 MFU | 126335 tok/s step 208/19560 | loss 6.273656 (-1.37z)| norm 0.6769 (-1.27z)| lr 1.78e-04 | 4139.52 ms | 32.6% bf16 MFU | 126351 tok/s step 209/19560 | loss 6.336602 (-1.17z)| norm 0.8820 (-0.39z)| lr 1.79e-04 | 4134.97 ms | 32.7% bf16 MFU | 126373 tok/s step 210/19560 | loss 6.251688 (-1.42z)| norm 1.3970 (+1.79z)| lr 1.80e-04 | 4137.82 ms | 32.6% bf16 MFU | 126390 tok/s step 211/19560 | loss 6.301167 (-1.26z)| norm 1.2269 (+1.06z)| lr 1.81e-04 | 4184.10 ms | 32.3% bf16 MFU | 126336 tok/s step 212/19560 | loss 6.363885 (-1.05z)| norm 0.7761 (-0.83z)| lr 1.82e-04 | 4169.54 ms | 32.4% bf16 MFU | 126306 tok/s step 213/19560 | loss 6.317413 (-1.20z)| norm 0.9097 (-0.26z)| lr 1.83e-04 | 4135.86 ms | 32.6% bf16 MFU | 126329 tok/s step 214/19560 | loss 6.282655 (-1.30z)| norm 1.0276 (+0.24z)| lr 1.83e-04 | 4133.25 ms | 32.7% bf16 MFU | 126355 tok/s step 215/19560 | loss 6.249884 (-1.40z)| 
norm 0.9145 (-0.24z)| lr 1.84e-04 | 4129.14 ms | 32.7% bf16 MFU | 126386 tok/s step 216/19560 | loss 6.225291 (-1.48z)| norm 1.0427 (+0.30z)| lr 1.85e-04 | 4146.98 ms | 32.6% bf16 MFU | 126388 tok/s step 217/19560 | loss 6.219942 (-1.49z)| norm 0.9248 (-0.20z)| lr 1.86e-04 | 4136.87 ms | 32.6% bf16 MFU | 126405 tok/s step 218/19560 | loss 6.305531 (-1.18z)| norm 0.9313 (-0.17z)| lr 1.87e-04 | 4135.03 ms | 32.7% bf16 MFU | 126424 tok/s step 219/19560 | loss 6.293392 (-1.21z)| norm 0.8648 (-0.45z)| lr 1.88e-04 | 4138.46 ms | 32.6% bf16 MFU | 126438 tok/s step 220/19560 | loss 6.295951 (-1.19z)| norm 0.9172 (-0.23z)| lr 1.89e-04 | 4139.19 ms | 32.6% bf16 MFU | 126449 tok/s step 221/19560 | loss 6.247685 (-1.36z)| norm 0.9700 (-0.01z)| lr 1.89e-04 | 4133.94 ms | 32.7% bf16 MFU | 126468 tok/s step 222/19560 | loss 6.236104 (-1.40z)| norm 0.7239 (-1.04z)| lr 1.90e-04 | 4135.77 ms | 32.6% bf16 MFU | 126483 tok/s step 223/19560 | loss 6.374409 (-0.86z)| norm 0.8027 (-0.70z)| lr 1.91e-04 | 4135.02 ms | 32.7% bf16 MFU | 126498 tok/s step 224/19560 | loss 6.207123 (-1.50z)| norm 0.8344 (-0.57z)| lr 1.92e-04 | 4138.37 ms | 32.6% bf16 MFU | 126508 tok/s step 225/19560 | loss 6.197450 (-1.52z)| norm 1.1094 (+0.59z)| lr 1.93e-04 | 4145.73 ms | 32.6% bf16 MFU | 126506 tok/s step 226/19560 | loss 6.238132 (-1.35z)| norm 0.9592 (-0.05z)| lr 1.94e-04 | 4145.15 ms | 32.6% bf16 MFU | 126504 tok/s step 227/19560 | loss 6.199447 (-1.49z)| norm 0.8002 (-0.71z)| lr 1.95e-04 | 4136.76 ms | 32.6% bf16 MFU | 126516 tok/s step 228/19560 | loss 6.230466 (-1.36z)| norm 0.8063 (-0.68z)| lr 1.95e-04 | 4146.06 ms | 32.6% bf16 MFU | 126513 tok/s step 229/19560 | loss 6.240688 (-1.31z)| norm 0.7209 (-1.04z)| lr 1.96e-04 | 4138.85 ms | 32.6% bf16 MFU | 126521 tok/s step 230/19560 | loss 6.188414 (-1.51z)| norm 0.7409 (-0.94z)| lr 1.97e-04 | 4142.56 ms | 32.6% bf16 MFU | 126523 tok/s step 231/19560 | loss 6.177166 (-1.54z)| norm 0.8721 (-0.38z)| lr 1.98e-04 | 4133.30 ms | 32.7% bf16 MFU | 126539 tok/s 
step 232/19560 | loss 6.166073 (-1.57z)| norm 0.8589 (-0.43z)| lr 1.99e-04 | 4144.34 ms | 32.6% bf16 MFU | 126538 tok/s step 233/19560 | loss 6.256610 (-1.18z)| norm 0.7967 (-0.68z)| lr 2.00e-04 | 4140.27 ms | 32.6% bf16 MFU | 126542 tok/s step 234/19560 | loss 6.180317 (-1.49z)| norm 0.9125 (-0.19z)| lr 2.01e-04 | 4136.34 ms | 32.6% bf16 MFU | 126553 tok/s step 235/19560 | loss 6.184452 (-1.46z)| norm 0.8028 (-0.65z)| lr 2.01e-04 | 4130.95 ms | 32.7% bf16 MFU | 126571 tok/s step 236/19560 | loss 6.190324 (-1.42z)| norm 0.7345 (-0.93z)| lr 2.02e-04 | 4141.87 ms | 32.6% bf16 MFU | 126572 tok/s step 237/19560 | loss 6.192926 (-1.40z)| norm 0.8591 (-0.40z)| lr 2.03e-04 | 4140.20 ms | 32.6% bf16 MFU | 126575 tok/s step 238/19560 | loss 6.152856 (-1.57z)| norm 0.7782 (-0.73z)| lr 2.04e-04 | 4139.39 ms | 32.6% bf16 MFU | 126579 tok/s step 239/19560 | loss 6.143903 (-1.59z)| norm 0.7206 (-0.98z)| lr 2.05e-04 | 4131.51 ms | 32.7% bf16 MFU | 126595 tok/s step 240/19560 | loss 6.183730 (-1.40z)| norm 0.7047 (-1.05z)| lr 2.06e-04 | 4134.02 ms | 32.7% bf16 MFU | 126606 tok/s step 241/19560 | loss 6.154095 (-1.53z)| norm 0.7011 (-1.05z)| lr 2.07e-04 | 4146.73 ms | 32.6% bf16 MFU | 126598 tok/s step 242/19560 | loss 6.170349 (-1.43z)| norm 0.8141 (-0.55z)| lr 2.07e-04 | 4137.98 ms | 32.6% bf16 MFU | 126603 tok/s step 243/19560 | loss 6.151607 (-1.50z)| norm 0.8947 (-0.19z)| lr 2.08e-04 | 4134.91 ms | 32.7% bf16 MFU | 126612 tok/s step 244/19560 | loss 6.213929 (-1.19z)| norm 0.9491 (+0.05z)| lr 2.09e-04 | 4136.51 ms | 32.6% bf16 MFU | 126619 tok/s step 245/19560 | loss 6.145094 (-1.51z)| norm 0.9860 (+0.20z)| lr 2.10e-04 | 4136.82 ms | 32.6% bf16 MFU | 126625 tok/s step 246/19560 | loss 6.250987 (-0.98z)| norm 1.2474 (+1.35z)| lr 2.11e-04 | 4135.08 ms | 32.7% bf16 MFU | 126633 tok/s step 247/19560 | loss 6.213789 (-1.16z)| norm 1.1901 (+1.08z)| lr 2.12e-04 | 4139.85 ms | 32.6% bf16 MFU | 126634 tok/s step 248/19560 | loss 6.220041 (-1.11z)| norm 0.8807 (-0.29z)| lr 2.13e-04 | 
4136.21 ms | 32.6% bf16 MFU | 126640 tok/s step 249/19560 | loss 6.116169 (-1.62z)| norm 0.9613 (+0.06z)| lr 2.13e-04 | 4220.65 ms | 32.0% bf16 MFU | 126519 tok/s step 250/19560 | loss 6.159375 (-1.39z)| norm 0.9089 (-0.18z)| lr 2.14e-04 | 4136.77 ms | 32.6% bf16 MFU | 126530 tok/s val loss 6.208999
HellaSwag: 2500/10042 = 0.248954 step 251/19560 | loss 6.222259 (-1.05z)| norm 0.9454 (-0.02z)| lr 2.15e-04 | 4137.79 ms | 32.6% bf16 MFU | 126539 tok/s step 252/19560 | loss 6.139060 (-1.48z)| norm 1.0016 (+0.27z)| lr 2.16e-04 | 4139.78 ms | 32.6% bf16 MFU | 126544 tok/s step 253/19560 | loss 6.150771 (-1.41z)| norm 0.8651 (-0.36z)| lr 2.17e-04 | 4132.96 ms | 32.7% bf16 MFU | 126560 tok/s step 254/19560 | loss 6.119084 (-1.57z)| norm 0.7477 (-0.90z)| lr 2.18e-04 | 4145.42 ms | 32.6% bf16 MFU | 126555 tok/s step 255/19560 | loss 6.115254 (-1.57z)| norm 0.7818 (-0.73z)| lr 2.19e-04 | 4135.23 ms | 32.7% bf16 MFU | 126567 tok/s step 256/19560 | loss 6.152792 (-1.36z)| norm 0.8496 (-0.41z)| lr 2.19e-04 | 4138.18 ms | 32.6% bf16 MFU | 126573 tok/s step 257/19560 | loss 6.200411 (-1.09z)| norm 0.9634 (+0.15z)| lr 2.20e-04 | 4132.61 ms | 32.7% bf16 MFU | 126588 tok/s step 258/19560 | loss 6.129179 (-1.48z)| norm 0.8171 (-0.55z)| lr 2.21e-04 | 4139.18 ms | 32.6% bf16 MFU | 126592 tok/s step 259/19560 | loss 6.116148 (-1.54z)| norm 1.0287 (+0.46z)| lr 2.22e-04 | 4144.11 ms | 32.6% bf16 MFU | 126588 tok/s step 260/19560 | loss 6.086681 (-1.69z)| norm 1.0254 (+0.43z)| lr 2.23e-04 | 4133.14 ms | 32.7% bf16 MFU | 126601 tok/s step 261/19560 | loss 6.098022 (-1.62z)| norm 0.8488 (-0.43z)| lr 2.24e-04 | 4146.54 ms | 32.6% bf16 MFU | 126593 tok/s step 262/19560 | loss 6.138419 (-1.37z)| norm 0.7298 (-1.01z)| lr 2.25e-04 | 4129.81 ms | 32.7% bf16 MFU | 126611 tok/s step 263/19560 | loss 6.093211 (-1.62z)| norm 0.8611 (-0.38z)| lr 2.25e-04 | 4139.17 ms | 32.6% bf16 MFU | 126614 tok/s step 264/19560 | loss 6.049329 (-1.86z)| norm 0.6349 (-1.45z)| lr 2.26e-04 | 4131.96 ms | 32.7% bf16 MFU | 126627 tok/s step 265/19560 | loss 6.130955 (-1.35z)| norm 0.8878 
(-0.22z)| lr 2.27e-04 | 4139.07 ms | 32.6% bf16 MFU | 126629 tok/s step 266/19560 | loss 6.127565 (-1.36z)| norm 0.8181 (-0.55z)| lr 2.28e-04 | 4131.65 ms | 32.7% bf16 MFU | 126643 tok/s step 267/19560 | loss 6.105062 (-1.48z)| norm 0.7420 (-0.93z)| lr 2.29e-04 | 4130.56 ms | 32.7% bf16 MFU | 126657 tok/s step 268/19560 | loss 6.095211 (-1.51z)| norm 0.6955 (-1.14z)| lr 2.30e-04 | 4139.49 ms | 32.6% bf16 MFU | 126657 tok/s step 269/19560 | loss 6.082567 (-1.57z)| norm 0.6971 (-1.11z)| lr 2.31e-04 | 4145.50 ms | 32.6% bf16 MFU | 126648 tok/s step 270/19560 | loss 6.192090 (-0.88z)| norm 0.8549 (-0.35z)| lr 2.31e-04 | 4131.28 ms | 32.7% bf16 MFU | 126661 tok/s step 271/19560 | loss 6.120798 (-1.31z)| norm 1.0542 (+0.61z)| lr 2.32e-04 | 4143.97 ms | 32.6% bf16 MFU | 126653 tok/s step 272/19560 | loss 6.059509 (-1.68z)| norm 1.2214 (+1.40z)| lr 2.33e-04 | 4157.28 ms | 32.5% bf16 MFU | 126626 tok/s step 273/19560 | loss 6.084623 (-1.49z)| norm 0.7550 (-0.86z)| lr 2.34e-04 | 4316.79 ms | 31.3% bf16 MFU | 126368 tok/s step 274/19560 | loss 6.038417 (-1.76z)| norm 0.7001 (-1.12z)| lr 2.35e-04 | 4129.46 ms | 32.7% bf16 MFU | 126398 tok/s step 275/19560 | loss 6.104932 (-1.32z)| norm 0.8115 (-0.59z)| lr 2.36e-04 | 4137.88 ms | 32.6% bf16 MFU | 126413 tok/s step 276/19560 | loss 6.031767 (-1.77z)| norm 1.0080 (+0.35z)| lr 2.37e-04 | 4134.98 ms | 32.7% bf16 MFU | 126432 tok/s step 277/19560 | loss 6.026074 (-1.78z)| norm 1.0053 (+0.34z)| lr 2.37e-04 | 4129.30 ms | 32.7% bf16 MFU | 126459 tok/s step 278/19560 | loss 6.041319 (-1.66z)| norm 1.1026 (+0.81z)| lr 2.38e-04 | 4146.17 ms | 32.6% bf16 MFU | 126458 tok/s step 279/19560 | loss 6.058398 (-1.53z)| norm 0.9410 (+0.03z)| lr 2.39e-04 | 4142.13 ms | 32.6% bf16 MFU | 126464 tok/s step 280/19560 | loss 6.056643 (-1.52z)| norm 0.7954 (-0.68z)| lr 2.40e-04 | 4136.00 ms | 32.6% bf16 MFU | 126479 tok/s step 281/19560 | loss 6.061846 (-1.46z)| norm 0.8817 (-0.24z)| lr 2.41e-04 | 4134.37 ms | 32.7% bf16 MFU | 126496 tok/s step 
282/19560 | loss 6.049668 (-1.52z)| norm 0.8425 (-0.44z)| lr 2.42e-04 | 4141.66 ms | 32.6% bf16 MFU | 126500 tok/s step 283/19560 | loss 5.998260 (-1.83z)| norm 0.6586 (-1.37z)| lr 2.43e-04 | 4140.15 ms | 32.6% bf16 MFU | 126507 tok/s step 284/19560 | loss 6.057266 (-1.43z)| norm 0.7761 (-0.77z)| lr 2.43e-04 | 4127.92 ms | 32.7% bf16 MFU | 126532 tok/s step 285/19560 | loss 6.047146 (-1.48z)| norm 0.9486 (+0.11z)| lr 2.44e-04 | 4143.32 ms | 32.6% bf16 MFU | 126532 tok/s step 286/19560 | loss 5.971049 (-1.95z)| norm 1.0238 (+0.49z)| lr 2.45e-04 | 4138.39 ms | 32.6% bf16 MFU | 126540 tok/s step 287/19560 | loss 5.988630 (-1.80z)| norm 0.8843 (-0.20z)| lr 2.46e-04 | 4129.19 ms | 32.7% bf16 MFU | 126562 tok/s step 288/19560 | loss 6.034809 (-1.47z)| norm 0.8872 (-0.20z)| lr 2.47e-04 | 4139.73 ms | 32.6% bf16 MFU | 126566 tok/s step 289/19560 | loss 6.044781 (-1.39z)| norm 0.8051 (-0.62z)| lr 2.48e-04 | 4129.37 ms | 32.7% bf16 MFU | 126586 tok/s step 290/19560 | loss 6.014321 (-1.58z)| norm 0.7778 (-0.75z)| lr 2.49e-04 | 4147.70 ms | 32.6% bf16 MFU | 126577 tok/s step 291/19560 | loss 6.011685 (-1.57z)| norm 0.8230 (-0.50z)| lr 2.49e-04 | 4144.87 ms | 32.6% bf16 MFU | 126573 tok/s step 292/19560 | loss 6.011129 (-1.56z)| norm 1.0008 (+0.47z)| lr 2.50e-04 | 4140.38 ms | 32.6% bf16 MFU | 126576 tok/s step 293/19560 | loss 6.034459 (-1.38z)| norm 0.9089 (-0.03z)| lr 2.51e-04 | 4139.50 ms | 32.6% bf16 MFU | 126579 tok/s step 294/19560 | loss 5.989864 (-1.66z)| norm 0.6756 (-1.28z)| lr 2.52e-04 | 4130.19 ms | 32.7% bf16 MFU | 126598 tok/s step 295/19560 | loss 6.018467 (-1.45z)| norm 0.7249 (-1.01z)| lr 2.53e-04 | 4137.64 ms | 32.6% bf16 MFU | 126603 tok/s step 296/19560 | loss 5.999021 (-1.56z)| norm 0.7301 (-0.96z)| lr 2.54e-04 | 4136.16 ms | 32.6% bf16 MFU | 126611 tok/s step 297/19560 | loss 5.942707 (-1.91z)| norm 0.8168 (-0.49z)| lr 2.55e-04 | 4164.33 ms | 32.4% bf16 MFU | 126575 tok/s step 298/19560 | loss 5.999697 (-1.50z)| norm 0.8548 (-0.27z)| lr 2.55e-04 | 4185.24 
ms | 32.3% bf16 MFU | 126510 tok/s step 299/19560 | loss 6.008988 (-1.42z)| norm 0.8642 (-0.20z)| lr 2.56e-04 | 4150.28 ms | 32.5% bf16 MFU | 126501 tok/s step 300/19560 | loss 5.945554 (-1.83z)| norm 1.1915 (+1.59z)| lr 2.57e-04 | 4136.72 ms | 32.6% bf16 MFU | 126513 tok/s step 301/19560 | loss 6.017043 (-1.32z)| norm 1.1073 (+1.21z)| lr 2.58e-04 | 4142.86 ms | 32.6% bf16 MFU | 126515 tok/s step 302/19560 | loss 6.030993 (-1.21z)| norm 0.9232 (+0.14z)| lr 2.59e-04 | 4149.94 ms | 32.5% bf16 MFU | 126506 tok/s step 303/19560 | loss 6.037485 (-1.15z)| norm 1.0218 (+0.71z)| lr 2.60e-04 | 4135.46 ms | 32.6% bf16 MFU | 126520 tok/s step 304/19560 | loss 5.991580 (-1.45z)| norm 0.8482 (-0.30z)| lr 2.61e-04 | 4141.78 ms | 32.6% bf16 MFU | 126523 tok/s step 305/19560 | loss 5.948257 (-1.72z)| norm 0.6842 (-1.23z)| lr 2.61e-04 | 4236.25 ms | 31.9% bf16 MFU | 126385 tok/s step 306/19560 | loss 5.946974 (-1.70z)| norm 0.8227 (-0.42z)| lr 2.62e-04 | 4140.70 ms | 32.6% bf16 MFU | 126396 tok/s step 307/19560 | loss 5.933905 (-1.78z)| norm 1.0714 (+1.02z)| lr 2.63e-04 | 4134.85 ms | 32.7% bf16 MFU | 126417 tok/s step 308/19560 | loss 5.906692 (-1.93z)| norm 1.2264 (+1.99z)| lr 2.64e-04 | 4132.37 ms | 32.7% bf16 MFU | 126439 tok/s step 309/19560 | loss 5.985750 (-1.36z)| norm 0.7578 (-0.81z)| lr 2.65e-04 | 4130.39 ms | 32.7% bf16 MFU | 126464 tok/s step 310/19560 | loss 5.945859 (-1.62z)| norm 1.1342 (+1.47z)| lr 2.66e-04 | 4133.34 ms | 32.7% bf16 MFU | 126483 tok/s step 311/19560 | loss 5.964751 (-1.47z)| norm 1.0535 (+0.97z)| lr 2.67e-04 | 4135.51 ms | 32.6% bf16 MFU | 126498 tok/s step 312/19560 | loss 5.926900 (-1.71z)| norm 0.9312 (+0.22z)| lr 2.67e-04 | 4139.95 ms | 32.6% bf16 MFU | 126505 tok/s step 313/19560 | loss 5.867825 (-2.10z)| norm 0.9951 (+0.60z)| lr 2.68e-04 | 4143.06 ms | 32.6% bf16 MFU | 126507 tok/s step 314/19560 | loss 5.914752 (-1.74z)| norm 1.0708 (+1.13z)| lr 2.69e-04 | 4136.65 ms | 32.6% bf16 MFU | 126519 tok/s step 315/19560 | loss 5.989698 (-1.19z)| 
norm 0.9165 (+0.14z)| lr 2.70e-04 | 4144.23 ms | 32.6% bf16 MFU | 126518 tok/s step 316/19560 | loss 5.854711 (-2.13z)| norm 0.8622 (-0.21z)| lr 2.71e-04 | 4132.07 ms | 32.7% bf16 MFU | 126537 tok/s step 317/19560 | loss 5.948465 (-1.44z)| norm 0.8836 (-0.06z)| lr 2.72e-04 | 4157.42 ms | 32.5% bf16 MFU | 126515 tok/s step 318/19560 | loss 5.962542 (-1.32z)| norm 0.7734 (-0.77z)| lr 2.73e-04 | 4135.28 ms | 32.7% bf16 MFU | 126529 tok/s step 319/19560 | loss 5.924672 (-1.57z)| norm 0.7752 (-0.74z)| lr 2.73e-04 | 4137.38 ms | 32.6% bf16 MFU | 126538 tok/s step 320/19560 | loss 5.881156 (-1.86z)| norm 0.6849 (-1.33z)| lr 2.74e-04 | 4131.58 ms | 32.7% bf16 MFU | 126556 tok/s step 321/19560 | loss 5.943926 (-1.38z)| norm 0.7623 (-0.81z)| lr 2.75e-04 | 4132.99 ms | 32.7% bf16 MFU | 126571 tok/s step 322/19560 | loss 5.962286 (-1.23z)| norm 0.9991 (+0.77z)| lr 2.76e-04 | 4151.60 ms | 32.5% bf16 MFU | 126557 tok/s step 323/19560 | loss 5.941606 (-1.38z)| norm 1.0320 (+0.98z)| lr 2.77e-04 | 4147.59 ms | 32.6% bf16 MFU | 126549 tok/s step 324/19560 | loss 5.929418 (-1.45z)| norm 0.9825 (+0.64z)| lr 2.78e-04 | 4137.88 ms | 32.6% bf16 MFU | 126557 tok/s step 325/19560 | loss 5.956742 (-1.23z)| norm 1.0627 (+1.16z)| lr 2.79e-04 | 4133.35 ms | 32.7% bf16 MFU | 126571 tok/s step 326/19560 | loss 5.867339 (-1.89z)| norm 0.9125 (+0.15z)| lr 2.79e-04 | 4150.19 ms | 32.5% bf16 MFU | 126559 tok/s step 327/19560 | loss 5.920711 (-1.46z)| norm 1.0288 (+0.92z)| lr 2.80e-04 | 4139.65 ms | 32.6% bf16 MFU | 126564 tok/s step 328/19560 | loss 5.909341 (-1.52z)| norm 1.6008 (+4.34z)| lr 2.81e-04 | 4157.54 ms | 32.5% bf16 MFU | 126541 tok/s step 329/19560 | loss 5.871153 (-1.79z)| norm 0.7898 (-0.64z)| lr 2.82e-04 | 4142.97 ms | 32.6% bf16 MFU | 126541 tok/s step 330/19560 | loss 5.838599 (-2.00z)| norm 0.8346 (-0.36z)| lr 2.83e-04 | 4144.52 ms | 32.6% bf16 MFU | 126539 tok/s step 331/19560 | loss 5.887083 (-1.60z)| norm 0.9269 (+0.23z)| lr 2.84e-04 | 4140.79 ms | 32.6% bf16 MFU | 126543 tok/s 
step 332/19560 | loss 5.809031 (-2.16z)| norm 0.9282 (+0.23z)| lr 2.85e-04 | 4131.10 ms | 32.7% bf16 MFU | 126562 tok/s step 333/19560 | loss 5.869812 (-1.67z)| norm 0.7667 (-0.78z)| lr 2.85e-04 | 4135.73 ms | 32.6% bf16 MFU | 126572 tok/s step 334/19560 | loss 5.903934 (-1.39z)| norm 0.8991 (+0.03z)| lr 2.86e-04 | 4149.61 ms | 32.5% bf16 MFU | 126561 tok/s step 335/19560 | loss 5.934385 (-1.14z)| norm 1.0323 (+0.87z)| lr 2.87e-04 | 4143.30 ms | 32.6% bf16 MFU | 126560 tok/s step 336/19560 | loss 5.979350 (-0.78z)| norm 0.9063 (+0.04z)| lr 2.88e-04 | 4133.41 ms | 32.7% bf16 MFU | 126574 tok/s step 337/19560 | loss 5.825984 (-1.94z)| norm 0.9506 (+0.33z)| lr 2.89e-04 | 4151.51 ms | 32.5% bf16 MFU | 126559 tok/s step 338/19560 | loss 5.871199 (-1.56z)| norm 0.9084 (+0.08z)| lr 2.90e-04 | 4150.32 ms | 32.5% bf16 MFU | 126548 tok/s step 339/19560 | loss 5.827404 (-1.87z)| norm 0.9105 (+0.12z)| lr 2.91e-04 | 4138.28 ms | 32.6% bf16 MFU | 126555 tok/s step 340/19560 | loss 5.844449 (-1.72z)| norm 1.1099 (+1.47z)| lr 2.91e-04 | 4130.81 ms | 32.7% bf16 MFU | 126573 tok/s step 341/19560 | loss 5.887079 (-1.38z)| norm 1.2750 (+2.52z)| lr 2.92e-04 | 4143.74 ms | 32.6% bf16 MFU | 126571 tok/s step 342/19560 | loss 5.847695 (-1.66z)| norm 0.9166 (+0.12z)| lr 2.93e-04 | 4133.71 ms | 32.7% bf16 MFU | 126584 tok/s step 343/19560 | loss 5.863029 (-1.51z)| norm 0.9487 (+0.34z)| lr 2.94e-04 | 4137.94 ms | 32.6% bf16 MFU | 126590 tok/s step 344/19560 | loss 5.844218 (-1.63z)| norm 1.0746 (+1.18z)| lr 2.95e-04 | 4132.48 ms | 32.7% bf16 MFU | 126604 tok/s step 345/19560 | loss 5.855651 (-1.52z)| norm 1.0126 (+0.76z)| lr 2.96e-04 | 4130.26 ms | 32.7% bf16 MFU | 126621 tok/s step 346/19560 | loss 5.876893 (-1.34z)| norm 1.2402 (+2.22z)| lr 2.97e-04 | 4132.13 ms | 32.7% bf16 MFU | 126634 tok/s step 347/19560 | loss 5.812848 (-1.82z)| norm 0.8426 (-0.39z)| lr 2.97e-04 | 4145.90 ms | 32.6% bf16 MFU | 126625 tok/s step 348/19560 | loss 5.915021 (-1.00z)| norm 0.8839 (-0.12z)| lr 2.98e-04 | 
4133.27 ms | 32.7% bf16 MFU | 126636 tok/s step 349/19560 | loss 5.785582 (-1.99z)| norm 0.9739 (+0.48z)| lr 2.99e-04 | 4141.45 ms | 32.6% bf16 MFU | 126634 tok/s step 350/19560 | loss 5.838096 (-1.55z)| norm 0.7019 (-1.31z)| lr 3.00e-04 | 4137.81 ms | 32.6% bf16 MFU | 126638 tok/s step 351/19560 | loss 5.844804 (-1.50z)| norm 0.7534 (-0.97z)| lr 3.01e-04 | 4130.12 ms | 32.7% bf16 MFU | 126653 tok/s step 352/19560 | loss 5.854352 (-1.40z)| norm 0.7032 (-1.28z)| lr 3.02e-04 | 4140.52 ms | 32.6% bf16 MFU | 126651 tok/s step 353/19560 | loss 5.774199 (-2.00z)| norm 0.7170 (-1.17z)| lr 3.03e-04 | 4149.30 ms | 32.5% bf16 MFU | 126637 tok/s step 354/19560 | loss 5.828044 (-1.55z)| norm 0.9154 (+0.12z)| lr 3.03e-04 | 4131.44 ms | 32.7% bf16 MFU | 126650 tok/s step 355/19560 | loss 5.823662 (-1.56z)| norm 1.0998 (+1.30z)| lr 3.04e-04 | 4137.58 ms | 32.6% bf16 MFU | 126653 tok/s step 356/19560 | loss 5.854085 (-1.30z)| norm 1.0331 (+0.86z)| lr 3.05e-04 | 4142.31 ms | 32.6% bf16 MFU | 126649 tok/s step 357/19560 | loss 5.833635 (-1.44z)| norm 1.3350 (+2.72z)| lr 3.06e-04 | 4147.12 ms | 32.6% bf16 MFU | 126637 tok/s step 358/19560 | loss 5.778560 (-1.85z)| norm 1.0003 (+0.59z)| lr 3.07e-04 | 4127.58 ms | 32.7% bf16 MFU | 126657 tok/s step 359/19560 | loss 5.841792 (-1.32z)| norm 0.9907 (+0.52z)| lr 3.08e-04 | 4135.35 ms | 32.6% bf16 MFU | 126663 tok/s step 360/19560 | loss 5.836424 (-1.34z)| norm 1.1898 (+1.75z)| lr 3.09e-04 | 4132.85 ms | 32.7% bf16 MFU | 126673 tok/s step 361/19560 | loss 5.794858 (-1.66z)| norm 1.0252 (+0.70z)| lr 3.09e-04 | 4134.04 ms | 32.7% bf16 MFU | 126680 tok/s step 362/19560 | loss 5.879757 (-0.96z)| norm 1.2829 (+2.25z)| lr 3.10e-04 | 4155.45 ms | 32.5% bf16 MFU | 126655 tok/s step 363/19560 | loss 5.773860 (-1.79z)| norm 1.4217 (+2.97z)| lr 3.11e-04 | 4138.63 ms | 32.6% bf16 MFU | 126656 tok/s step 364/19560 | loss 5.778598 (-1.72z)| norm 0.7356 (-1.10z)| lr 3.12e-04 | 4138.78 ms | 32.6% bf16 MFU | 126657 tok/s step 365/19560 | loss 5.816780 
(-1.39z)| norm 0.7652 (-0.91z)| lr 3.13e-04 | 4139.06 ms | 32.6% bf16 MFU | 126658 tok/s step 366/19560 | loss 5.797138 (-1.53z)| norm 0.9044 (-0.10z)| lr 3.14e-04 | 4131.51 ms | 32.7% bf16 MFU | 126670 tok/s step 367/19560 | loss 5.821796 (-1.30z)| norm 0.8860 (-0.21z)| lr 3.15e-04 | 4143.95 ms | 32.6% bf16 MFU | 126662 tok/s step 368/19560 | loss 5.858879 (-0.99z)| norm 0.8723 (-0.31z)| lr 3.15e-04 | 4129.99 ms | 32.7% bf16 MFU | 126676 tok/s step 369/19560 | loss 5.719609 (-2.08z)| norm 1.0037 (+0.47z)| lr 3.16e-04 | 4165.22 ms | 32.4% bf16 MFU | 126636 tok/s step 370/19560 | loss 5.735748 (-1.92z)| norm 1.1087 (+1.09z)| lr 3.17e-04 | 4144.85 ms | 32.6% bf16 MFU | 126629 tok/s step 371/19560 | loss 5.722393 (-1.98z)| norm 0.9770 (+0.29z)| lr 3.18e-04 | 4171.83 ms | 32.4% bf16 MFU | 126581 tok/s step 372/19560 | loss 5.721086 (-1.96z)| norm 1.0445 (+0.69z)| lr 3.19e-04 | 4138.48 ms | 32.6% bf16 MFU | 126586 tok/s step 373/19560 | loss 5.748936 (-1.71z)| norm 0.9551 (+0.16z)| lr 3.20e-04 | 4155.57 ms | 32.5% bf16 MFU | 126565 tok/s step 374/19560 | loss 5.799592 (-1.29z)| norm 0.8327 (-0.57z)| lr 3.21e-04 | 4143.04 ms | 32.6% bf16 MFU | 126564 tok/s step 375/19560 | loss 5.792371 (-1.34z)| norm 0.9317 (+0.05z)| lr 3.21e-04 | 4138.18 ms | 32.6% bf16 MFU | 126571 tok/s step 376/19560 | loss 5.703047 (-2.04z)| norm 1.0897 (+1.01z)| lr 3.22e-04 | 4147.92 ms | 32.6% bf16 MFU | 126562 tok/s step 377/19560 | loss 5.749856 (-1.63z)| norm 1.1239 (+1.20z)| lr 3.23e-04 | 4138.45 ms | 32.6% bf16 MFU | 126569 tok/s step 378/19560 | loss 5.843534 (-0.85z)| norm 1.0015 (+0.45z)| lr 3.24e-04 | 4132.83 ms | 32.7% bf16 MFU | 126583 tok/s step 379/19560 | loss 5.865229 (-0.66z)| norm 0.9839 (+0.34z)| lr 3.25e-04 | 4155.51 ms | 32.5% bf16 MFU | 126562 tok/s step 380/19560 | loss 5.788890 (-1.28z)| norm 1.1460 (+1.31z)| lr 3.26e-04 | 4135.24 ms | 32.7% bf16 MFU | 126573 tok/s step 381/19560 | loss 5.719217 (-1.84z)| norm 0.7203 (-1.25z)| lr 3.27e-04 | 4138.68 ms | 32.6% bf16 MFU | 
126579 tok/s step 382/19560 | loss 5.751314 (-1.55z)| norm 0.7296 (-1.19z)| lr 3.27e-04 | 4142.06 ms | 32.6% bf16 MFU | 126579 tok/s step 383/19560 | loss 5.769798 (-1.37z)| norm 0.8156 (-0.68z)| lr 3.28e-04 | 4127.10 ms | 32.7% bf16 MFU | 126601 tok/s step 384/19560 | loss 5.703270 (-1.90z)| norm 0.9329 (+0.03z)| lr 3.29e-04 | 4129.58 ms | 32.7% bf16 MFU | 126619 tok/s step 385/19560 | loss 5.816855 (-0.93z)| norm 1.2061 (+1.64z)| lr 3.30e-04 | 4153.78 ms | 32.5% bf16 MFU | 126599 tok/s step 386/19560 | loss 5.756639 (-1.43z)| norm 0.9406 (+0.05z)| lr 3.31e-04 | 4131.28 ms | 32.7% bf16 MFU | 126615 tok/s step 387/19560 | loss 5.685158 (-2.01z)| norm 0.9087 (-0.13z)| lr 3.32e-04 | 4129.93 ms | 32.7% bf16 MFU | 126631 tok/s step 388/19560 | loss 5.711809 (-1.75z)| norm 0.8989 (-0.18z)| lr 3.33e-04 | 4158.99 ms | 32.5% bf16 MFU | 126603 tok/s step 389/19560 | loss 5.668904 (-2.07z)| norm 0.7910 (-0.83z)| lr 3.33e-04 | 4137.01 ms | 32.6% bf16 MFU | 126609 tok/s step 390/19560 | loss 5.734026 (-1.50z)| norm 0.6447 (-1.69z)| lr 3.34e-04 | 4163.50 ms | 32.4% bf16 MFU | 126575 tok/s step 391/19560 | loss 5.659719 (-2.09z)| norm 0.9955 (+0.39z)| lr 3.35e-04 | 4147.94 ms | 32.6% bf16 MFU | 126566 tok/s step 392/19560 | loss 5.739859 (-1.39z)| norm 0.8048 (-0.76z)| lr 3.36e-04 | 4128.74 ms | 32.7% bf16 MFU | 126587 tok/s step 393/19560 | loss 5.686283 (-1.82z)| norm 0.8007 (-0.78z)| lr 3.37e-04 | 4133.89 ms | 32.7% bf16 MFU | 126599 tok/s step 394/19560 | loss 5.698113 (-1.69z)| norm 0.6948 (-1.40z)| lr 3.38e-04 | 4142.52 ms | 32.6% bf16 MFU | 126597 tok/s step 395/19560 | loss 5.690879 (-1.73z)| norm 0.9204 (-0.06z)| lr 3.39e-04 | 4141.01 ms | 32.6% bf16 MFU | 126598 tok/s step 396/19560 | loss 5.702210 (-1.61z)| norm 1.0625 (+0.78z)| lr 3.39e-04 | 4128.74 ms | 32.7% bf16 MFU | 126617 tok/s step 397/19560 | loss 5.737134 (-1.29z)| norm 1.0357 (+0.60z)| lr 3.40e-04 | 4153.94 ms | 32.5% bf16 MFU | 126597 tok/s step 398/19560 | loss 5.775587 (-0.96z)| norm 1.1158 (+1.07z)| lr 
3.41e-04 | 4141.89 ms | 32.6% bf16 MFU | 126596 tok/s step 399/19560 | loss 5.755457 (-1.12z)| norm 1.0803 (+0.86z)| lr 3.42e-04 | 4183.98 ms | 32.3% bf16 MFU | 126532 tok/s step 400/19560 | loss 5.698531 (-1.60z)| norm 1.5871 (+3.72z)| lr 3.43e-04 | 4135.66 ms | 32.6% bf16 MFU | 126544 tok/s step 401/19560 | loss 5.710340 (-1.48z)| norm 0.8971 (-0.26z)| lr 3.44e-04 | 4142.87 ms | 32.6% bf16 MFU | 126544 tok/s step 402/19560 | loss 5.651685 (-1.96z)| norm 1.3077 (+2.07z)| lr 3.45e-04 | 4140.74 ms | 32.6% bf16 MFU | 126548 tok/s step 403/19560 | loss 5.673520 (-1.75z)| norm 1.1023 (+0.88z)| lr 3.45e-04 | 4144.38 ms | 32.6% bf16 MFU | 126546 tok/s step 404/19560 | loss 5.696913 (-1.51z)| norm 0.8857 (-0.36z)| lr 3.46e-04 | 4140.40 ms | 32.6% bf16 MFU | 126550 tok/s step 405/19560 | loss 5.743883 (-1.08z)| norm 0.8088 (-0.79z)| lr 3.47e-04 | 4144.07 ms | 32.6% bf16 MFU | 126548 tok/s step 406/19560 | loss 5.683905 (-1.59z)| norm 1.0237 (+0.44z)| lr 3.48e-04 | 4139.32 ms | 32.6% bf16 MFU | 126554 tok/s step 407/19560 | loss 5.710062 (-1.34z)| norm 1.3569 (+2.29z)| lr 3.49e-04 | 4131.52 ms | 32.7% bf16 MFU | 126571 tok/s step 408/19560 | loss 5.681242 (-1.58z)| norm 0.8687 (-0.46z)| lr 3.50e-04 | 4146.90 ms | 32.6% bf16 MFU | 126564 tok/s step 409/19560 | loss 5.701427 (-1.38z)| norm 1.0998 (+0.83z)| lr 3.51e-04 | 4143.16 ms | 32.6% bf16 MFU | 126563 tok/s step 410/19560 | loss 5.675318 (-1.59z)| norm 1.0648 (+0.62z)| lr 3.51e-04 | 4130.22 ms | 32.7% bf16 MFU | 126582 tok/s step 411/19560 | loss 5.690817 (-1.43z)| norm 0.8378 (-0.66z)| lr 3.52e-04 | 4133.67 ms | 32.7% bf16 MFU | 126594 tok/s step 412/19560 | loss 5.713560 (-1.21z)| norm 0.8646 (-0.52z)| lr 3.53e-04 | 4142.67 ms | 32.6% bf16 MFU | 126593 tok/s step 413/19560 | loss 5.618536 (-2.04z)| norm 0.7564 (-1.12z)| lr 3.54e-04 | 4133.54 ms | 32.7% bf16 MFU | 126605 tok/s step 414/19560 | loss 5.685939 (-1.40z)| norm 1.0244 (+0.40z)| lr 3.55e-04 | 4136.68 ms | 32.6% bf16 MFU | 126612 tok/s step 415/19560 | loss 
5.696529 (-1.28z)| norm 0.9870 (+0.18z)| lr 3.56e-04 | 4144.14 ms | 32.6% bf16 MFU | 126607 tok/s step 416/19560 | loss 5.624035 (-1.91z)| norm 0.9421 (-0.07z)| lr 3.57e-04 | 4130.87 ms | 32.7% bf16 MFU | 126622 tok/s step 417/19560 | loss 5.624284 (-1.88z)| norm 0.9892 (+0.18z)| lr 3.57e-04 | 4130.43 ms | 32.7% bf16 MFU | 126638 tok/s step 418/19560 | loss 5.632921 (-1.77z)| norm 1.1758 (+1.22z)| lr 3.58e-04 | 4129.55 ms | 32.7% bf16 MFU | 126654 tok/s step 419/19560 | loss 5.557875 (-2.39z)| norm 0.9226 (-0.22z)| lr 3.59e-04 | 4128.92 ms | 32.7% bf16 MFU | 126670 tok/s step 420/19560 | loss 5.649373 (-1.55z)| norm 0.8698 (-0.51z)| lr 3.60e-04 | 4127.78 ms | 32.7% bf16 MFU | 126687 tok/s step 421/19560 | loss 5.654309 (-1.49z)| norm 0.9097 (-0.28z)| lr 3.61e-04 | 4148.87 ms | 32.5% bf16 MFU | 126672 tok/s step 422/19560 | loss 5.579610 (-2.11z)| norm 1.1049 (+0.81z)| lr 3.62e-04 | 4167.97 ms | 32.4% bf16 MFU | 126627 tok/s step 423/19560 | loss 5.531165 (-2.48z)| norm 1.0401 (+0.43z)| lr 3.63e-04 | 4130.75 ms | 32.7% bf16 MFU | 126642 tok/s step 424/19560 | loss 5.650484 (-1.41z)| norm 1.1009 (+0.77z)| lr 3.63e-04 | 4141.27 ms | 32.6% bf16 MFU | 126640 tok/s step 425/19560 | loss 5.651461 (-1.38z)| norm 1.1028 (+0.76z)| lr 3.64e-04 | 4143.06 ms | 32.6% bf16 MFU | 126635 tok/s step 426/19560 | loss 5.596119 (-1.84z)| norm 0.9742 (+0.02z)| lr 3.65e-04 | 4136.02 ms | 32.6% bf16 MFU | 126642 tok/s step 427/19560 | loss 5.636807 (-1.46z)| norm 1.2303 (+1.47z)| lr 3.66e-04 | 4142.20 ms | 32.6% bf16 MFU | 126638 tok/s step 428/19560 | loss 5.620543 (-1.57z)| norm 0.8357 (-0.78z)| lr 3.67e-04 | 4144.37 ms | 32.6% bf16 MFU | 126632 tok/s step 429/19560 | loss 5.600867 (-1.72z)| norm 0.7679 (-1.16z)| lr 3.68e-04 | 4138.34 ms | 32.6% bf16 MFU | 126635 tok/s step 430/19560 | loss 5.584724 (-1.84z)| norm 0.7441 (-1.28z)| lr 3.69e-04 | 4137.83 ms | 32.6% bf16 MFU | 126638 tok/s step 431/19560 | loss 5.615232 (-1.56z)| norm 0.6870 (-1.57z)| lr 3.69e-04 | 4479.31 ms | 30.1% bf16 
MFU | 126159 tok/s step 432/19560 | loss 5.566201 (-1.96z)| norm 0.6756 (-1.62z)| lr 3.70e-04 | 4131.53 ms | 32.7% bf16 MFU | 126196 tok/s step 433/19560 | loss 5.543412 (-2.12z)| norm 0.7641 (-1.13z)| lr 3.71e-04 | 4129.92 ms | 32.7% bf16 MFU | 126233 tok/s step 434/19560 | loss 5.563705 (-1.90z)| norm 0.8519 (-0.64z)| lr 3.72e-04 | 4140.56 ms | 32.6% bf16 MFU | 126253 tok/s step 435/19560 | loss 5.618038 (-1.40z)| norm 0.9147 (-0.27z)| lr 3.73e-04 | 4140.81 ms | 32.6% bf16 MFU | 126271 tok/s step 436/19560 | loss 5.601065 (-1.52z)| norm 0.8565 (-0.59z)| lr 3.74e-04 | 4129.45 ms | 32.7% bf16 MFU | 126305 tok/s step 437/19560 | loss 5.608268 (-1.44z)| norm 0.9222 (-0.23z)| lr 3.75e-04 | 4140.69 ms | 32.6% bf16 MFU | 126321 tok/s step 438/19560 | loss 5.628381 (-1.25z)| norm 1.0982 (+0.79z)| lr 3.75e-04 | 4137.27 ms | 32.6% bf16 MFU | 126341 tok/s step 439/19560 | loss 5.503274 (-2.30z)| norm 0.8855 (-0.43z)| lr 3.76e-04 | 4133.30 ms | 32.7% bf16 MFU | 126366 tok/s step 440/19560 | loss 5.557974 (-1.79z)| norm 1.0067 (+0.26z)| lr 3.77e-04 | 4134.25 ms | 32.7% bf16 MFU | 126389 tok/s step 441/19560 | loss 5.581739 (-1.55z)| norm 1.1293 (+0.96z)| lr 3.78e-04 | 4147.08 ms | 32.6% bf16 MFU | 126391 tok/s step 442/19560 | loss 5.553272 (-1.77z)| norm 0.7489 (-1.20z)| lr 3.79e-04 | 4144.23 ms | 32.6% bf16 MFU | 126397 tok/s step 443/19560 | loss 5.595390 (-1.39z)| norm 0.7026 (-1.44z)| lr 3.80e-04 | 4130.89 ms | 32.7% bf16 MFU | 126423 tok/s step 444/19560 | loss 5.513729 (-2.05z)| norm 0.6806 (-1.54z)| lr 3.81e-04 | 4134.94 ms | 32.7% bf16 MFU | 126441 tok/s step 445/19560 | loss 5.535677 (-1.83z)| norm 0.6356 (-1.76z)| lr 3.81e-04 | 4141.38 ms | 32.6% bf16 MFU | 126449 tok/s step 446/19560 | loss 5.528336 (-1.86z)| norm 0.6192 (-1.83z)| lr 3.82e-04 | 4138.81 ms | 32.6% bf16 MFU | 126460 tok/s step 447/19560 | loss 5.496508 (-2.09z)| norm 0.5946 (-1.93z)| lr 3.83e-04 | 4148.84 ms | 32.5% bf16 MFU | 126456 tok/s step 448/19560 | loss 5.490743 (-2.09z)| norm 0.6288 
(-1.74z)| lr 3.84e-04 | 4134.48 ms | 32.7% bf16 MFU | 126474 tok/s step 449/19560 | loss 5.504359 (-1.94z)| norm 0.9339 (-0.10z)| lr 3.85e-04 | 4139.71 ms | 32.6% bf16 MFU | 126482 tok/s step 450/19560 | loss 5.514923 (-1.82z)| norm 1.2616 (+1.65z)| lr 3.86e-04 | 4434.59 ms | 30.4% bf16 MFU | 126070 tok/s step 451/19560 | loss 5.533488 (-1.64z)| norm 0.8542 (-0.53z)| lr 3.87e-04 | 4131.95 ms | 32.7% bf16 MFU | 126110 tok/s step 452/19560 | loss 5.493029 (-1.95z)| norm 0.9185 (-0.18z)| lr 3.87e-04 | 4134.19 ms | 32.7% bf16 MFU | 126146 tok/s step 453/19560 | loss 5.506896 (-1.81z)| norm 1.0174 (+0.35z)| lr 3.88e-04 | 4134.22 ms | 32.7% bf16 MFU | 126179 tok/s step 454/19560 | loss 5.549751 (-1.42z)| norm 0.9894 (+0.20z)| lr 3.89e-04 | 4133.26 ms | 32.7% bf16 MFU | 126213 tok/s step 455/19560 | loss 5.549312 (-1.40z)| norm 1.0016 (+0.26z)| lr 3.90e-04 | 4141.95 ms | 32.6% bf16 MFU | 126231 tok/s step 456/19560 | loss 5.558182 (-1.31z)| norm 1.0579 (+0.62z)| lr 3.91e-04 | 4140.88 ms | 32.6% bf16 MFU | 126250 tok/s step 457/19560 | loss 5.508919 (-1.70z)| norm 0.9337 (-0.09z)| lr 3.92e-04 | 4143.72 ms | 32.6% bf16 MFU | 126264 tok/s step 458/19560 | loss 5.512096 (-1.64z)| norm 1.3891 (+2.41z)| lr 3.93e-04 | 4144.20 ms | 32.6% bf16 MFU | 126276 tok/s step 459/19560 | loss 5.502805 (-1.69z)| norm 0.9581 (+0.02z)| lr 3.93e-04 | 4128.79 ms | 32.7% bf16 MFU | 126312 tok/s step 460/19560 | loss 5.467860 (-1.93z)| norm 0.9859 (+0.17z)| lr 3.94e-04 | 4130.99 ms | 32.7% bf16 MFU | 126342 tok/s step 461/19560 | loss 5.537746 (-1.34z)| norm 1.2224 (+1.46z)| lr 3.95e-04 | 4140.42 ms | 32.6% bf16 MFU | 126356 tok/s step 462/19560 | loss 5.543465 (-1.27z)| norm 0.8798 (-0.43z)| lr 3.96e-04 | 4129.99 ms | 32.7% bf16 MFU | 126386 tok/s step 463/19560 | loss 5.475070 (-1.82z)| norm 0.9071 (-0.27z)| lr 3.97e-04 | 4144.56 ms | 32.6% bf16 MFU | 126391 tok/s step 464/19560 | loss 5.484945 (-1.72z)| norm 1.0631 (+0.58z)| lr 3.98e-04 | 4151.71 ms | 32.5% bf16 MFU | 126386 tok/s step 
465/19560 | loss 5.588867 (-0.83z)| norm 0.9272 (-0.17z)| lr 3.99e-04 | 4143.44 ms | 32.6% bf16 MFU | 126393 tok/s step 466/19560 | loss 5.508962 (-1.48z)| norm 0.8836 (-0.41z)| lr 3.99e-04 | 4128.43 ms | 32.7% bf16 MFU | 126423 tok/s step 467/19560 | loss 5.563402 (-1.01z)| norm 0.9827 (+0.13z)| lr 4.00e-04 | 4143.14 ms | 32.6% bf16 MFU | 126429 tok/s step 468/19560 | loss 5.524750 (-1.31z)| norm 0.9170 (-0.22z)| lr 4.01e-04 | 4128.84 ms | 32.7% bf16 MFU | 126457 tok/s step 469/19560 | loss 5.479646 (-1.67z)| norm 0.8794 (-0.41z)| lr 4.02e-04 | 4152.34 ms | 32.5% bf16 MFU | 126447 tok/s step 470/19560 | loss 5.470434 (-1.72z)| norm 0.9673 (+0.07z)| lr 4.03e-04 | 4130.41 ms | 32.7% bf16 MFU | 126472 tok/s step 471/19560 | loss 5.473112 (-1.67z)| norm 0.9706 (+0.09z)| lr 4.04e-04 | 4124.94 ms | 32.7% bf16 MFU | 126503 tok/s step 472/19560 | loss 5.493184 (-1.48z)| norm 0.9061 (-0.26z)| lr 4.05e-04 | 4137.82 ms | 32.6% bf16 MFU | 126513 tok/s step 473/19560 | loss 5.430421 (-1.97z)| norm 0.8452 (-0.59z)| lr 4.05e-04 | 4135.83 ms | 32.6% bf16 MFU | 126526 tok/s step 474/19560 | loss 5.486382 (-1.48z)| norm 0.7797 (-0.95z)| lr 4.06e-04 | 4140.75 ms | 32.6% bf16 MFU | 126531 tok/s step 475/19560 | loss 5.494722 (-1.38z)| norm 0.7643 (-1.03z)| lr 4.07e-04 | 4145.03 ms | 32.6% bf16 MFU | 126528 tok/s step 476/19560 | loss 5.464324 (-1.62z)| norm 0.6804 (-1.48z)| lr 4.08e-04 | 4195.68 ms | 32.2% bf16 MFU | 126450 tok/s step 477/19560 | loss 5.445189 (-1.75z)| norm 0.7491 (-1.08z)| lr 4.09e-04 | 4129.96 ms | 32.7% bf16 MFU | 126475 tok/s step 478/19560 | loss 5.441978 (-1.75z)| norm 0.8121 (-0.74z)| lr 4.10e-04 | 4138.56 ms | 32.6% bf16 MFU | 126485 tok/s step 479/19560 | loss 5.415825 (-1.93z)| norm 0.7801 (-0.92z)| lr 4.11e-04 | 4140.97 ms | 32.6% bf16 MFU | 126491 tok/s step 480/19560 | loss 5.432731 (-1.76z)| norm 0.6882 (-1.43z)| lr 4.11e-04 | 4130.61 ms | 32.7% bf16 MFU | 126513 tok/s step 481/19560 | loss 5.486556 (-1.29z)| norm 0.9121 (-0.19z)| lr 4.12e-04 | 4130.32 
ms | 32.7% bf16 MFU | 126534 tok/s step 482/19560 | loss 5.472397 (-1.39z)| norm 1.3377 (+2.13z)| lr 4.13e-04 | 4142.61 ms | 32.6% bf16 MFU | 126536 tok/s step 483/19560 | loss 5.486857 (-1.25z)| norm 0.9070 (-0.23z)| lr 4.14e-04 | 4138.33 ms | 32.6% bf16 MFU | 126543 tok/s step 484/19560 | loss 5.620652 (-0.11z)| norm 0.9985 (+0.28z)| lr 4.15e-04 | 4128.47 ms | 32.7% bf16 MFU | 126566 tok/s step 485/19560 | loss 5.406143 (-1.90z)| norm 0.8517 (-0.52z)| lr 4.16e-04 | 4140.19 ms | 32.6% bf16 MFU | 126569 tok/s step 486/19560 | loss 5.422742 (-1.73z)| norm 0.7307 (-1.18z)| lr 4.17e-04 | 4146.90 ms | 32.6% bf16 MFU | 126562 tok/s step 487/19560 | loss 5.383490 (-2.02z)| norm 0.7426 (-1.10z)| lr 4.17e-04 | 4134.53 ms | 32.7% bf16 MFU | 126575 tok/s step 488/19560 | loss 5.408220 (-1.79z)| norm 0.9057 (-0.18z)| lr 4.18e-04 | 4179.82 ms | 32.3% bf16 MFU | 126517 tok/s step 489/19560 | loss 5.327270 (-2.40z)| norm 1.0695 (+0.73z)| lr 4.19e-04 | 4152.15 ms | 32.5% bf16 MFU | 126505 tok/s step 490/19560 | loss 5.377940 (-1.96z)| norm 1.2177 (+1.57z)| lr 4.20e-04 | 4155.39 ms | 32.5% bf16 MFU | 126488 tok/s step 491/19560 | loss 5.413152 (-1.63z)| norm 0.9656 (+0.18z)| lr 4.21e-04 | 4152.72 ms | 32.5% bf16 MFU | 126476 tok/s step 492/19560 | loss 5.373709 (-1.92z)| norm 0.9576 (+0.13z)| lr 4.22e-04 | 4159.08 ms | 32.5% bf16 MFU | 126456 tok/s step 493/19560 | loss 5.381718 (-1.83z)| norm 0.9358 (-0.01z)| lr 4.23e-04 | 4165.33 ms | 32.4% bf16 MFU | 126426 tok/s step 494/19560 | loss 5.353516 (-2.02z)| norm 0.9738 (+0.21z)| lr 4.23e-04 | 4142.84 ms | 32.6% bf16 MFU | 126433 tok/s step 495/19560 | loss 5.347912 (-2.02z)| norm 0.9403 (+0.01z)| lr 4.24e-04 | 4143.99 ms | 32.6% bf16 MFU | 126437 tok/s step 496/19560 | loss 5.383267 (-1.72z)| norm 0.7358 (-1.17z)| lr 4.25e-04 | 4149.67 ms | 32.5% bf16 MFU | 126432 tok/s step 497/19560 | loss 5.323201 (-2.15z)| norm 0.5781 (-2.04z)| lr 4.26e-04 | 4138.56 ms | 32.6% bf16 MFU | 126445 tok/s step 498/19560 | loss 5.401671 (-1.49z)| 
norm 0.6681 (-1.50z)| lr 4.27e-04 | 4144.70 ms | 32.6% bf16 MFU | 126447 tok/s step 499/19560 | loss 5.348873 (-1.88z)| norm 0.7071 (-1.25z)| lr 4.28e-04 | 4138.88 ms | 32.6% bf16 MFU | 126459 tok/s step 500/19560 | loss 5.383116 (-1.58z)| norm 0.7560 (-0.96z)| lr 4.29e-04 | 4156.20 ms | 32.5% bf16 MFU | 126443 tok/s val loss 5.421806
HellaSwag: 2441/10042 = 0.243079 step 501/19560 | loss 5.410663 (-1.34z)| norm 0.9233 (-0.01z)| lr 4.29e-04 | 4138.74 ms | 32.6% bf16 MFU | 126455 tok/s step 502/19560 | loss 5.401445 (-1.39z)| norm 1.0233 (+0.54z)| lr 4.30e-04 | 4150.94 ms | 32.5% bf16 MFU | 126447 tok/s step 503/19560 | loss 5.360641 (-1.69z)| norm 1.0116 (+0.47z)| lr 4.31e-04 | 4141.07 ms | 32.6% bf16 MFU | 126455 tok/s step 504/19560 | loss 5.450426 (-0.96z)| norm 1.0988 (+0.96z)| lr 4.32e-04 | 4137.35 ms | 32.6% bf16 MFU | 126469 tok/s step 505/19560 | loss 5.346799 (-1.75z)| norm 1.0571 (+0.73z)| lr 4.33e-04 | 4139.88 ms | 32.6% bf16 MFU | 126477 tok/s step 506/19560 | loss 5.416059 (-1.19z)| norm 1.1029 (+0.99z)| lr 4.34e-04 | 4145.66 ms | 32.6% bf16 MFU | 126477 tok/s step 507/19560 | loss 5.366592 (-1.58z)| norm 1.0992 (+0.96z)| lr 4.35e-04 | 4138.06 ms | 32.6% bf16 MFU | 126488 tok/s step 508/19560 | loss 5.358528 (-1.63z)| norm 0.8968 (-0.17z)| lr 4.35e-04 | 4135.42 ms | 32.6% bf16 MFU | 126503 tok/s step 509/19560 | loss 5.310958 (-1.97z)| norm 0.8816 (-0.27z)| lr 4.36e-04 | 4156.05 ms | 32.5% bf16 MFU | 126485 tok/s step 510/19560 | loss 5.276128 (-2.20z)| norm 1.1841 (+1.43z)| lr 4.37e-04 | 4138.61 ms | 32.6% bf16 MFU | 126495 tok/s step 511/19560 | loss 5.336853 (-1.69z)| norm 1.0057 (+0.41z)| lr 4.38e-04 | 4142.56 ms | 32.6% bf16 MFU | 126498 tok/s step 512/19560 | loss 5.379762 (-1.32z)| norm 0.8869 (-0.26z)| lr 4.39e-04 | 4142.27 ms | 32.6% bf16 MFU | 126502 tok/s step 513/19560 | loss 5.382866 (-1.29z)| norm 0.9484 (+0.10z)| lr 4.40e-04 | 4130.86 ms | 32.7% bf16 MFU | 126523 tok/s step 514/19560 | loss 5.389692 (-1.22z)| norm 1.0149 (+0.48z)| lr 4.41e-04 | 4135.21 ms | 32.7% bf16 MFU | 126536 tok/s step 515/19560 | 
loss 5.355047 (-1.48z)| norm 0.8589 (-0.42z)| lr 4.41e-04 | 4196.96 ms | 32.2% bf16 MFU | 126455 tok/s step 516/19560 | loss 5.340406 (-1.57z)| norm 0.9340 (+0.01z)| lr 4.42e-04 | 4139.65 ms | 32.6% bf16 MFU | 126465 tok/s step 517/19560 | loss 5.354393 (-1.43z)| norm 0.9900 (+0.33z)| lr 4.43e-04 | 4136.19 ms | 32.6% bf16 MFU | 126479 tok/s step 518/19560 | loss 5.274272 (-2.04z)| norm 0.7825 (-0.88z)| lr 4.44e-04 | 4145.32 ms | 32.6% bf16 MFU | 126479 tok/s step 519/19560 | loss 5.256150 (-2.13z)| norm 0.7563 (-1.02z)| lr 4.45e-04 | 4139.60 ms | 32.6% bf16 MFU | 126488 tok/s step 520/19560 | loss 5.343247 (-1.42z)| norm 0.7359 (-1.13z)| lr 4.46e-04 | 4135.77 ms | 32.6% bf16 MFU | 126502 tok/s step 521/19560 | loss 5.268140 (-1.98z)| norm 0.7075 (-1.28z)| lr 4.47e-04 | 4136.47 ms | 32.6% bf16 MFU | 126514 tok/s step 522/19560 | loss 5.275993 (-1.88z)| norm 0.6470 (-1.62z)| lr 4.47e-04 | 4140.49 ms | 32.6% bf16 MFU | 126520 tok/s step 523/19560 | loss 5.360007 (-1.20z)| norm 0.7654 (-0.93z)| lr 4.48e-04 | 4139.73 ms | 32.6% bf16 MFU | 126526 tok/s step 524/19560 | loss 5.307317 (-1.59z)| norm 0.9255 (-0.02z)| lr 4.49e-04 | 4132.94 ms | 32.7% bf16 MFU | 126543 tok/s step 525/19560 | loss 5.286082 (-1.73z)| norm 0.9725 (+0.25z)| lr 4.50e-04 | 4134.36 ms | 32.7% bf16 MFU | 126556 tok/s step 526/19560 | loss 5.267875 (-1.85z)| norm 0.7780 (-0.84z)| lr 4.51e-04 | 4150.05 ms | 32.5% bf16 MFU | 126545 tok/s step 527/19560 | loss 5.272644 (-1.79z)| norm 0.7864 (-0.78z)| lr 4.52e-04 | 4146.42 ms | 32.6% bf16 MFU | 126540 tok/s step 528/19560 | loss 5.309250 (-1.48z)| norm 0.8455 (-0.44z)| lr 4.53e-04 | 4133.65 ms | 32.7% bf16 MFU | 126555 tok/s step 529/19560 | loss 5.274933 (-1.73z)| norm 1.0281 (+0.66z)| lr 4.53e-04 | 4139.96 ms | 32.6% bf16 MFU | 126559 tok/s step 530/19560 | loss 5.264808 (-1.77z)| norm 0.8153 (-0.61z)| lr 4.54e-04 | 4146.88 ms | 32.6% bf16 MFU | 126552 tok/s step 531/19560 | loss 5.208993 (-2.17z)| norm 0.8790 (-0.21z)| lr 4.55e-04 | 4159.46 ms | 32.5% 
bf16 MFU | 126527 tok/s step 532/19560 | loss 5.253896 (-1.78z)| norm 0.9103 (-0.02z)| lr 4.56e-04 | 4146.36 ms | 32.6% bf16 MFU | 126523 tok/s step 533/19560 | loss 5.277305 (-1.58z)| norm 0.6831 (-1.41z)| lr 4.57e-04 | 4144.10 ms | 32.6% bf16 MFU | 126523 tok/s step 534/19560 | loss 5.237673 (-1.86z)| norm 0.6200 (-1.77z)| lr 4.58e-04 | 4132.14 ms | 32.7% bf16 MFU | 126541 tok/s step 535/19560 | loss 5.245411 (-1.77z)| norm 0.6509 (-1.57z)| lr 4.59e-04 | 4232.62 ms | 31.9% bf16 MFU | 126407 tok/s step 536/19560 | loss 5.232684 (-1.84z)| norm 0.6322 (-1.66z)| lr 4.59e-04 | 4147.32 ms | 32.6% bf16 MFU | 126407 tok/s step 537/19560 | loss 5.294469 (-1.34z)| norm 0.5992 (-1.83z)| lr 4.60e-04 | 4135.43 ms | 32.6% bf16 MFU | 126426 tok/s step 538/19560 | loss 5.276689 (-1.46z)| norm 0.5980 (-1.80z)| lr 4.61e-04 | 4144.69 ms | 32.6% bf16 MFU | 126430 tok/s step 539/19560 | loss 5.216685 (-1.91z)| norm 0.7233 (-1.03z)| lr 4.62e-04 | 4137.59 ms | 32.6% bf16 MFU | 126444 tok/s step 540/19560 | loss 5.247838 (-1.64z)| norm 0.7431 (-0.90z)| lr 4.63e-04 | 4150.45 ms | 32.5% bf16 MFU | 126438 tok/s step 541/19560 | loss 5.255838 (-1.55z)| norm 0.8296 (-0.38z)| lr 4.64e-04 | 4140.80 ms | 32.6% bf16 MFU | 126446 tok/s step 542/19560 | loss 5.288829 (-1.27z)| norm 0.9880 (+0.58z)| lr 4.65e-04 | 4142.93 ms | 32.6% bf16 MFU | 126452 tok/s step 543/19560 | loss 5.203354 (-1.93z)| norm 1.0576 (+0.99z)| lr 4.65e-04 | 4136.94 ms | 32.6% bf16 MFU | 126466 tok/s step 544/19560 | loss 5.251774 (-1.52z)| norm 0.8557 (-0.22z)| lr 4.66e-04 | 4132.33 ms | 32.7% bf16 MFU | 126486 tok/s step 545/19560 | loss 5.269773 (-1.35z)| norm 0.8172 (-0.45z)| lr 4.67e-04 | 4182.48 ms | 32.3% bf16 MFU | 126430 tok/s step 546/19560 | loss 5.202909 (-1.86z)| norm 0.7910 (-0.59z)| lr 4.68e-04 | 4142.14 ms | 32.6% bf16 MFU | 126437 tok/s step 547/19560 | loss 5.165549 (-2.11z)| norm 0.7387 (-0.90z)| lr 4.69e-04 | 4147.12 ms | 32.6% bf16 MFU | 126436 tok/s step 548/19560 | loss 5.239280 (-1.50z)| norm 0.7695 
(-0.71z)| lr 4.70e-04 | 4139.71 ms | 32.6% bf16 MFU | 126447 tok/s step 549/19560 | loss 5.234118 (-1.52z)| norm 0.9264 (+0.25z)| lr 4.71e-04 | 4134.89 ms | 32.7% bf16 MFU | 126464 tok/s step 550/19560 | loss 5.191045 (-1.83z)| norm 1.0184 (+0.81z)| lr 4.71e-04 | 4134.84 ms | 32.7% bf16 MFU | 126481 tok/s step 551/19560 | loss 5.155909 (-2.06z)| norm 0.8002 (-0.51z)| lr 4.72e-04 | 4137.59 ms | 32.6% bf16 MFU | 126492 tok/s step 552/19560 | loss 5.238894 (-1.39z)| norm 0.7438 (-0.84z)| lr 4.73e-04 | 4148.08 ms | 32.5% bf16 MFU | 126487 tok/s step 553/19560 | loss 5.282013 (-1.03z)| norm 0.6922 (-1.14z)| lr 4.74e-04 | 4144.27 ms | 32.6% bf16 MFU | 126489 tok/s step 554/19560 | loss 5.207033 (-1.60z)| norm 0.7457 (-0.80z)| lr 4.75e-04 | 4144.99 ms | 32.6% bf16 MFU | 126488 tok/s step 555/19560 | loss 5.139520 (-2.10z)| norm 0.8881 (+0.10z)| lr 4.76e-04 | 4140.81 ms | 32.6% bf16 MFU | 126495 tok/s step 556/19560 | loss 5.251895 (-1.20z)| norm 0.7272 (-0.90z)| lr 4.77e-04 | 4131.34 ms | 32.7% bf16 MFU | 126515 tok/s step 557/19560 | loss 5.268099 (-1.05z)| norm 0.9178 (+0.28z)| lr 4.77e-04 | 4143.43 ms | 32.6% bf16 MFU | 126516 tok/s step 558/19560 | loss 5.243589 (-1.23z)| norm 0.8670 (-0.04z)| lr 4.78e-04 | 4130.53 ms | 32.7% bf16 MFU | 126537 tok/s step 559/19560 | loss 5.272929 (-0.98z)| norm 0.8659 (-0.06z)| lr 4.79e-04 | 4143.34 ms | 32.6% bf16 MFU | 126537 tok/s step 560/19560 | loss 5.187394 (-1.65z)| norm 0.8234 (-0.34z)| lr 4.80e-04 | 4139.06 ms | 32.6% bf16 MFU | 126544 tok/s step 561/19560 | loss 5.242605 (-1.18z)| norm 0.8024 (-0.47z)| lr 4.81e-04 | 4154.58 ms | 32.5% bf16 MFU | 126526 tok/s step 562/19560 | loss 5.275115 (-0.90z)| norm 0.8847 (+0.05z)| lr 4.82e-04 | 4137.62 ms | 32.6% bf16 MFU | 126535 tok/s step 563/19560 | loss 5.202663 (-1.47z)| norm 0.8639 (-0.08z)| lr 4.83e-04 | 4149.35 ms | 32.5% bf16 MFU | 126526 tok/s step 564/19560 | loss 5.167138 (-1.74z)| norm 0.9168 (+0.25z)| lr 4.83e-04 | 4149.93 ms | 32.5% bf16 MFU | 126517 tok/s step 
565/19560 | loss 5.240691 (-1.12z)| norm 0.8090 (-0.43z)| lr 4.84e-04 | 4139.56 ms | 32.6% bf16 MFU | 126524 tok/s step 566/19560 | loss 5.142546 (-1.91z)| norm 0.8005 (-0.47z)| lr 4.85e-04 | 4149.53 ms | 32.5% bf16 MFU | 126515 tok/s step 567/19560 | loss 5.162832 (-1.71z)| norm 0.8737 (-0.00z)| lr 4.86e-04 | 4164.19 ms | 32.4% bf16 MFU | 126484 tok/s step 568/19560 | loss 5.216852 (-1.24z)| norm 0.8371 (-0.23z)| lr 4.87e-04 | 4129.50 ms | 32.7% bf16 MFU | 126508 tok/s step 569/19560 | loss 5.141877 (-1.83z)| norm 0.7728 (-0.63z)| lr 4.88e-04 | 4354.19 ms | 31.0% bf16 MFU | 126203 tok/s step 570/19560 | loss 5.173242 (-1.55z)| norm 0.7977 (-0.47z)| lr 4.89e-04 | 4134.28 ms | 32.7% bf16 MFU | 126234 tok/s step 571/19560 | loss 5.251345 (-0.89z)| norm 0.8008 (-0.46z)| lr 4.89e-04 | 4154.35 ms | 32.5% bf16 MFU | 126232 tok/s step 572/19560 | loss 5.148722 (-1.72z)| norm 0.8110 (-0.40z)| lr 4.90e-04 | 4140.12 ms | 32.6% bf16 MFU | 126252 tok/s step 573/19560 | loss 5.100296 (-2.07z)| norm 0.7786 (-0.62z)| lr 4.91e-04 | 4130.60 ms | 32.7% bf16 MFU | 126286 tok/s step 574/19560 | loss 5.148020 (-1.65z)| norm 0.7199 (-1.02z)| lr 4.92e-04 | 4135.74 ms | 32.6% bf16 MFU | 126310 tok/s step 575/19560 | loss 5.141572 (-1.67z)| norm 0.7570 (-0.79z)| lr 4.93e-04 | 4133.07 ms | 32.7% bf16 MFU | 126338 tok/s step 576/19560 | loss 5.147487 (-1.60z)| norm 0.8092 (-0.46z)| lr 4.94e-04 | 4163.59 ms | 32.4% bf16 MFU | 126317 tok/s step 577/19560 | loss 5.139610 (-1.63z)| norm 0.8601 (-0.11z)| lr 4.95e-04 | 4132.64 ms | 32.7% bf16 MFU | 126344 tok/s step 578/19560 | loss 5.223802 (-0.93z)| norm 1.1401 (+1.83z)| lr 4.95e-04 | 4158.02 ms | 32.5% bf16 MFU | 126332 tok/s step 579/19560 | loss 5.114901 (-1.79z)| norm 0.9632 (+0.60z)| lr 4.96e-04 | 4139.50 ms | 32.6% bf16 MFU | 126348 tok/s step 580/19560 | loss 5.124086 (-1.68z)| norm 0.9472 (+0.49z)| lr 4.97e-04 | 4135.41 ms | 32.6% bf16 MFU | 126369 tok/s step 581/19560 | loss 5.157681 (-1.39z)| norm 0.7855 (-0.61z)| lr 4.98e-04 | 4150.01 
ms | 32.5% bf16 MFU | 126368 tok/s step 582/19560 | loss 5.099780 (-1.83z)| norm 0.7894 (-0.57z)| lr 4.99e-04 | 4162.69 ms | 32.4% bf16 MFU | 126347 tok/s step 583/19560 | loss 5.187353 (-1.11z)| norm 0.6689 (-1.38z)| lr 5.00e-04 | 4155.76 ms | 32.5% bf16 MFU | 126337 tok/s step 584/19560 | loss 5.154626 (-1.36z)| norm 0.6308 (-1.61z)| lr 5.01e-04 | 4132.56 ms | 32.7% bf16 MFU | 126364 tok/s step 585/19560 | loss 5.103097 (-1.75z)| norm 0.7063 (-1.08z)| lr 5.01e-04 | 4135.50 ms | 32.6% bf16 MFU | 126384 tok/s step 586/19560 | loss 5.145481 (-1.38z)| norm 0.6350 (-1.59z)| lr 5.02e-04 | 4148.77 ms | 32.5% bf16 MFU | 126384 tok/s step 587/19560 | loss 5.110574 (-1.64z)| norm 0.6169 (-1.69z)| lr 5.03e-04 | 4132.04 ms | 32.7% bf16 MFU | 126409 tok/s step 588/19560 | loss 5.103407 (-1.67z)| norm 0.6080 (-1.72z)| lr 5.04e-04 | 4160.50 ms | 32.5% bf16 MFU | 126389 tok/s step 589/19560 | loss 5.110651 (-1.59z)| norm 0.4952 (-2.47z)| lr 5.05e-04 | 4141.38 ms | 32.6% bf16 MFU | 126400 tok/s step 590/19560 | loss 5.105122 (-1.61z)| norm 0.5377 (-2.12z)| lr 5.06e-04 | 4149.83 ms | 32.5% bf16 MFU | 126397 tok/s step 591/19560 | loss 5.056837 (-1.97z)| norm 0.6020 (-1.64z)| lr 5.07e-04 | 4135.73 ms | 32.6% bf16 MFU | 126415 tok/s step 592/19560 | loss 5.099702 (-1.59z)| norm 0.6111 (-1.55z)| lr 5.07e-04 | 4166.98 ms | 32.4% bf16 MFU | 126385 tok/s step 593/19560 | loss 5.091959 (-1.65z)| norm 0.7941 (-0.30z)| lr 5.08e-04 | 4156.18 ms | 32.5% bf16 MFU | 126374 tok/s step 594/19560 | loss 5.132559 (-1.29z)| norm 1.1547 (+2.11z)| lr 5.09e-04 | 4142.90 ms | 32.6% bf16 MFU | 126382 tok/s step 595/19560 | loss 5.112273 (-1.45z)| norm 0.8818 (+0.28z)| lr 5.10e-04 | 4144.15 ms | 32.6% bf16 MFU | 126389 tok/s step 596/19560 | loss 5.034700 (-2.07z)| norm 0.7665 (-0.48z)| lr 5.11e-04 | 4137.69 ms | 32.6% bf16 MFU | 126405 tok/s step 597/19560 | loss 5.080562 (-1.66z)| norm 0.8143 (-0.16z)| lr 5.12e-04 | 4129.85 ms | 32.7% bf16 MFU | 126432 tok/s step 598/19560 | loss 5.109977 (-1.39z)| 
norm 1.1115 (+1.81z)| lr 5.13e-04 | 4139.30 ms | 32.6% bf16 MFU | 126444 tok/s step 599/19560 | loss 5.106707 (-1.40z)| norm 1.0822 (+1.60z)| lr 5.13e-04 | 4145.19 ms | 32.6% bf16 MFU | 126446 tok/s step 600/19560 | loss 5.136946 (-1.13z)| norm 0.8952 (+0.37z)| lr 5.14e-04 | 4137.58 ms | 32.6% bf16 MFU | 126459 tok/s step 601/19560 | loss 5.103392 (-1.39z)| norm 0.8615 (+0.14z)| lr 5.15e-04 | 4144.89 ms | 32.6% bf16 MFU | 126461 tok/s step 602/19560 | loss 5.095973 (-1.44z)| norm 0.7912 (-0.32z)| lr 5.16e-04 | 4157.20 ms | 32.5% bf16 MFU | 126443 tok/s step 603/19560 | loss 5.088473 (-1.49z)| norm 0.7502 (-0.59z)| lr 5.17e-04 | 4142.75 ms | 32.6% bf16 MFU | 126449 tok/s step 604/19560 | loss 5.064477 (-1.67z)| norm 0.6691 (-1.13z)| lr 5.18e-04 | 4145.22 ms | 32.6% bf16 MFU | 126450 tok/s step 605/19560 | loss 5.067946 (-1.61z)| norm 0.6535 (-1.22z)| lr 5.19e-04 | 4137.82 ms | 32.6% bf16 MFU | 126463 tok/s step 606/19560 | loss 5.070156 (-1.57z)| norm 0.6588 (-1.17z)| lr 5.19e-04 | 4154.43 ms | 32.5% bf16 MFU | 126450 tok/s step 607/19560 | loss 4.993822 (-2.18z)| norm 0.6801 (-1.02z)| lr 5.20e-04 | 4141.62 ms | 32.6% bf16 MFU | 126457 tok/s step 608/19560 | loss 5.065988 (-1.54z)| norm 0.7729 (-0.42z)| lr 5.21e-04 | 4146.35 ms | 32.6% bf16 MFU | 126457 tok/s step 609/19560 | loss 5.079674 (-1.41z)| norm 0.8407 (+0.02z)| lr 5.22e-04 | 4155.41 ms | 32.5% bf16 MFU | 126442 tok/s step 610/19560 | loss 5.055216 (-1.60z)| norm 0.9870 (+1.04z)| lr 5.23e-04 | 4133.24 ms | 32.7% bf16 MFU | 126462 tok/s step 611/19560 | loss 5.030598 (-1.79z)| norm 0.9823 (+1.00z)| lr 5.24e-04 | 4133.08 ms | 32.7% bf16 MFU | 126482 tok/s step 612/19560 | loss 5.045133 (-1.68z)| norm 0.8848 (+0.35z)| lr 5.25e-04 | 4153.65 ms | 32.5% bf16 MFU | 126469 tok/s step 613/19560 | loss 5.018426 (-1.89z)| norm 0.7939 (-0.27z)| lr 5.25e-04 | 4157.33 ms | 32.5% bf16 MFU | 126451 tok/s step 614/19560 | loss 5.026768 (-1.79z)| norm 0.7220 (-0.76z)| lr 5.26e-04 | 4139.21 ms | 32.6% bf16 MFU | 126462 tok/s 
step 615/19560 | loss 5.093419 (-1.17z)| norm 0.6527 (-1.22z)| lr 5.27e-04 | 4151.24 ms | 32.5% bf16 MFU | 126453 tok/s step 616/19560 | loss 5.094318 (-1.14z)| norm 0.6998 (-0.89z)| lr 5.28e-04 | 4131.89 ms | 32.7% bf16 MFU | 126475 tok/s step 617/19560 | loss 5.049459 (-1.52z)| norm 0.7669 (-0.42z)| lr 5.29e-04 | 4134.17 ms | 32.7% bf16 MFU | 126492 tok/s step 618/19560 | loss 5.048223 (-1.51z)| norm 0.9947 (+1.17z)| lr 5.30e-04 | 4144.95 ms | 32.6% bf16 MFU | 126492 tok/s step 619/19560 | loss 5.013582 (-1.79z)| norm 1.0503 (+1.55z)| lr 5.31e-04 | 4158.72 ms | 32.5% bf16 MFU | 126471 tok/s step 620/19560 | loss 5.133510 (-0.70z)| norm 0.8456 (+0.13z)| lr 5.31e-04 | 4157.58 ms | 32.5% bf16 MFU | 126453 tok/s step 621/19560 | loss 4.989361 (-1.97z)| norm 0.7392 (-0.60z)| lr 5.32e-04 | 4139.35 ms | 32.6% bf16 MFU | 126463 tok/s step 622/19560 | loss 5.034567 (-1.54z)| norm 0.6240 (-1.38z)| lr 5.33e-04 | 4132.35 ms | 32.7% bf16 MFU | 126484 tok/s step 623/19560 | loss 5.030142 (-1.55z)| norm 0.5285 (-2.00z)| lr 5.34e-04 | 4131.41 ms | 32.7% bf16 MFU | 126505 tok/s step 624/19560 | loss 5.021397 (-1.60z)| norm 0.4826 (-2.26z)| lr 5.35e-04 | 4178.47 ms | 32.3% bf16 MFU | 126453 tok/s step 625/19560 | loss 5.029033 (-1.51z)| norm 0.5055 (-2.08z)| lr 5.36e-04 | 4136.42 ms | 32.6% bf16 MFU | 126468 tok/s step 626/19560 | loss 5.017883 (-1.58z)| norm 0.5633 (-1.68z)| lr 5.37e-04 | 4144.67 ms | 32.6% bf16 MFU | 126469 tok/s step 627/19560 | loss 5.026538 (-1.48z)| norm 0.6981 (-0.78z)| lr 5.37e-04 | 4165.95 ms | 32.4% bf16 MFU | 126438 tok/s step 628/19560 | loss 5.025185 (-1.47z)| norm 1.0168 (+1.31z)| lr 5.38e-04 | 4142.43 ms | 32.6% bf16 MFU | 126445 tok/s step 629/19560 | loss 5.075903 (-1.01z)| norm 1.2528 (+2.77z)| lr 5.39e-04 | 4132.83 ms | 32.7% bf16 MFU | 126465 tok/s step 630/19560 | loss 5.034486 (-1.37z)| norm 0.7664 (-0.34z)| lr 5.40e-04 | 4155.88 ms | 32.5% bf16 MFU | 126450 tok/s step 631/19560 | loss 5.025496 (-1.43z)| norm 0.6656 (-0.97z)| lr 5.41e-04 | 
4128.27 ms | 32.7% bf16 MFU | 126477 tok/s step 632/19560 | loss 4.992155 (-1.73z)| norm 0.6397 (-1.12z)| lr 5.42e-04 | 4146.28 ms | 32.6% bf16 MFU | 126476 tok/s step 633/19560 | loss 5.016230 (-1.48z)| norm 0.6266 (-1.19z)| lr 5.43e-04 | 4148.35 ms | 32.5% bf16 MFU | 126471 tok/s step 634/19560 | loss 5.052446 (-1.13z)| norm 0.5933 (-1.40z)| lr 5.43e-04 | 4132.83 ms | 32.7% bf16 MFU | 126491 tok/s step 635/19560 | loss 5.018460 (-1.44z)| norm 0.6086 (-1.28z)| lr 5.44e-04 | 4164.23 ms | 32.4% bf16 MFU | 126461 tok/s step 636/19560 | loss 4.995578 (-1.63z)| norm 0.6546 (-0.96z)| lr 5.45e-04 | 4151.66 ms | 32.5% bf16 MFU | 126452 tok/s step 637/19560 | loss 4.985185 (-1.70z)| norm 0.7320 (-0.44z)| lr 5.46e-04 | 4158.24 ms | 32.5% bf16 MFU | 126434 tok/s step 638/19560 | loss 5.096459 (-0.63z)| norm 0.7688 (-0.18z)| lr 5.47e-04 | 4127.57 ms | 32.7% bf16 MFU | 126463 tok/s step 639/19560 | loss 5.007314 (-1.46z)| norm 0.8240 (+0.21z)| lr 5.48e-04 | 4147.36 ms | 32.6% bf16 MFU | 126461 tok/s step 640/19560 | loss 5.045067 (-1.09z)| norm 0.8134 (+0.14z)| lr 5.49e-04 | 4131.24 ms | 32.7% bf16 MFU | 126483 tok/s step 641/19560 | loss 5.002483 (-1.49z)| norm 0.7414 (-0.34z)| lr 5.49e-04 | 4147.58 ms | 32.6% bf16 MFU | 126480 tok/s step 642/19560 | loss 5.003258 (-1.47z)| norm 0.7036 (-0.59z)| lr 5.50e-04 | 4142.81 ms | 32.6% bf16 MFU | 126483 tok/s step 643/19560 | loss 5.053842 (-0.95z)| norm 0.7737 (-0.10z)| lr 5.51e-04 | 4138.15 ms | 32.6% bf16 MFU | 126494 tok/s step 644/19560 | loss 4.998504 (-1.49z)| norm 0.7278 (-0.41z)| lr 5.52e-04 | 4136.53 ms | 32.6% bf16 MFU | 126507 tok/s step 645/19560 | loss 5.004532 (-1.42z)| norm 0.9010 (+0.81z)| lr 5.53e-04 | 4127.79 ms | 32.7% bf16 MFU | 126532 tok/s step 646/19560 | loss 5.028782 (-1.15z)| norm 0.9274 (+0.98z)| lr 5.54e-04 | 4142.07 ms | 32.6% bf16 MFU | 126534 tok/s step 647/19560 | loss 4.990808 (-1.51z)| norm 0.9332 (+1.01z)| lr 5.55e-04 | 4154.21 ms | 32.5% bf16 MFU | 126518 tok/s step 648/19560 | loss 4.977862 
(-1.63z)| norm 0.8625 (+0.51z)| lr 5.55e-04 | 4135.51 ms | 32.6% bf16 MFU | 126531 tok/s step 649/19560 | loss 4.973782 (-1.64z)| norm 0.6834 (-0.74z)| lr 5.56e-04 | 4145.76 ms | 32.6% bf16 MFU | 126527 tok/s step 650/19560 | loss 4.913610 (-2.20z)| norm 0.7822 (-0.06z)| lr 5.57e-04 | 4152.38 ms | 32.5% bf16 MFU | 126514 tok/s step 651/19560 | loss 4.975988 (-1.56z)| norm 0.8581 (+0.47z)| lr 5.58e-04 | 4138.72 ms | 32.6% bf16 MFU | 126522 tok/s step 652/19560 | loss 4.982138 (-1.48z)| norm 0.7960 (+0.04z)| lr 5.59e-04 | 4145.95 ms | 32.6% bf16 MFU | 126519 tok/s step 653/19560 | loss 5.008202 (-1.19z)| norm 0.7575 (-0.22z)| lr 5.60e-04 | 4144.23 ms | 32.6% bf16 MFU | 126519 tok/s step 654/19560 | loss 4.917525 (-2.08z)| norm 0.6293 (-1.11z)| lr 5.61e-04 | 4127.72 ms | 32.7% bf16 MFU | 126544 tok/s step 655/19560 | loss 4.972147 (-1.50z)| norm 0.6463 (-0.98z)| lr 5.61e-04 | 4134.90 ms | 32.7% bf16 MFU | 126556 tok/s step 656/19560 | loss 4.995691 (-1.25z)| norm 0.6129 (-1.19z)| lr 5.62e-04 | 4133.46 ms | 32.7% bf16 MFU | 126570 tok/s step 657/19560 | loss 4.941581 (-1.78z)| norm 0.6117 (-1.19z)| lr 5.63e-04 | 4154.51 ms | 32.5% bf16 MFU | 126552 tok/s step 658/19560 | loss 4.925938 (-1.90z)| norm 0.7165 (-0.45z)| lr 5.64e-04 | 4140.09 ms | 32.6% bf16 MFU | 126556 tok/s step 659/19560 | loss 4.971005 (-1.41z)| norm 0.6132 (-1.15z)| lr 5.65e-04 | 4142.06 ms | 32.6% bf16 MFU | 126557 tok/s step 660/19560 | loss 4.936940 (-1.73z)| norm 0.6755 (-0.70z)| lr 5.66e-04 | 4133.11 ms | 32.7% bf16 MFU | 126572 tok/s step 661/19560 | loss 4.942402 (-1.65z)| norm 0.7542 (-0.16z)| lr 5.67e-04 | 4135.52 ms | 32.6% bf16 MFU | 126582 tok/s step 662/19560 | loss 4.915818 (-1.88z)| norm 0.6758 (-0.71z)| lr 5.67e-04 | 4143.78 ms | 32.6% bf16 MFU | 126579 tok/s step 663/19560 | loss 4.903691 (-1.96z)| norm 0.6312 (-1.02z)| lr 5.68e-04 | 4134.80 ms | 32.7% bf16 MFU | 126590 tok/s step 664/19560 | loss 4.903735 (-1.92z)| norm 0.6892 (-0.62z)| lr 5.69e-04 | 4255.04 ms | 31.7% bf16 MFU | 
126421 tok/s step 665/19560 | loss 4.877220 (-2.15z)| norm 0.7641 (-0.10z)| lr 5.70e-04 | 4130.30 ms | 32.7% bf16 MFU | 126447 tok/s step 666/19560 | loss 5.002810 (-0.89z)| norm 0.8691 (+0.63z)| lr 5.71e-04 | 4132.24 ms | 32.7% bf16 MFU | 126469 tok/s step 667/19560 | loss 4.918637 (-1.70z)| norm 0.8903 (+0.77z)| lr 5.72e-04 | 4133.33 ms | 32.7% bf16 MFU | 126487 tok/s step 668/19560 | loss 4.881395 (-2.03z)| norm 0.7844 (+0.01z)| lr 5.73e-04 | 4131.50 ms | 32.7% bf16 MFU | 126508 tok/s step 669/19560 | loss 4.961739 (-1.22z)| norm 0.8199 (+0.27z)| lr 5.73e-04 | 4132.14 ms | 32.7% bf16 MFU | 126527 tok/s step 670/19560 | loss 4.903781 (-1.77z)| norm 0.6824 (-0.70z)| lr 5.74e-04 | 4129.84 ms | 32.7% bf16 MFU | 126548 tok/s step 671/19560 | loss 4.956753 (-1.22z)| norm 0.7089 (-0.50z)| lr 5.75e-04 | 4322.96 ms | 31.2% bf16 MFU | 126284 tok/s step 672/19560 | loss 4.947535 (-1.30z)| norm 0.7323 (-0.32z)| lr 5.76e-04 | 4147.81 ms | 32.6% bf16 MFU | 126290 tok/s step 673/19560 | loss 4.904582 (-1.71z)| norm 0.7585 (-0.13z)| lr 5.77e-04 | 4138.57 ms | 32.6% bf16 MFU | 126310 tok/s step 674/19560 | loss 4.943056 (-1.30z)| norm 0.7185 (-0.41z)| lr 5.78e-04 | 4156.47 ms | 32.5% bf16 MFU | 126301 tok/s step 675/19560 | loss 4.897941 (-1.72z)| norm 0.7950 (+0.14z)| lr 5.79e-04 | 4132.10 ms | 32.7% bf16 MFU | 126330 tok/s step 676/19560 | loss 4.948200 (-1.20z)| norm 0.7863 (+0.07z)| lr 5.79e-04 | 4172.43 ms | 32.4% bf16 MFU | 126297 tok/s step 677/19560 | loss 5.000299 (-0.66z)| norm 0.6230 (-1.09z)| lr 5.80e-04 | 4174.07 ms | 32.3% bf16 MFU | 126262 tok/s step 678/19560 | loss 4.923842 (-1.42z)| norm 0.5965 (-1.27z)| lr 5.81e-04 | 4157.81 ms | 32.5% bf16 MFU | 126254 tok/s step 679/19560 | loss 4.910847 (-1.52z)| norm 0.6702 (-0.72z)| lr 5.82e-04 | 4175.98 ms | 32.3% bf16 MFU | 126219 tok/s step 680/19560 | loss 4.933260 (-1.28z)| norm 0.7073 (-0.45z)| lr 5.83e-04 | 4153.07 ms | 32.5% bf16 MFU | 126220 tok/s step 681/19560 | loss 4.883051 (-1.78z)| norm 0.6695 (-0.72z)| lr 
5.84e-04 | 4172.89 ms | 32.4% bf16 MFU | 126191 tok/s step 682/19560 | loss 4.919046 (-1.38z)| norm 0.6140 (-1.11z)| lr 5.85e-04 | 4149.87 ms | 32.5% bf16 MFU | 126198 tok/s step 683/19560 | loss 4.919908 (-1.35z)| norm 0.6781 (-0.64z)| lr 5.85e-04 | 4153.72 ms | 32.5% bf16 MFU | 126199 tok/s step 684/19560 | loss 4.884124 (-1.70z)| norm 0.6514 (-0.83z)| lr 5.86e-04 | 4163.03 ms | 32.4% bf16 MFU | 126186 tok/s step 685/19560 | loss 4.880183 (-1.72z)| norm 0.7320 (-0.23z)| lr 5.87e-04 | 4160.61 ms | 32.5% bf16 MFU | 126178 tok/s step 686/19560 | loss 4.899936 (-1.50z)| norm 0.7562 (-0.05z)| lr 5.88e-04 | 4156.99 ms | 32.5% bf16 MFU | 126175 tok/s step 687/19560 | loss 4.901659 (-1.48z)| norm 0.8740 (+0.81z)| lr 5.89e-04 | 4143.82 ms | 32.6% bf16 MFU | 126192 tok/s step 688/19560 | loss 4.895343 (-1.52z)| norm 0.8994 (+0.98z)| lr 5.90e-04 | 4154.58 ms | 32.5% bf16 MFU | 126192 tok/s step 689/19560 | loss 4.932967 (-1.11z)| norm 0.7969 (+0.24z)| lr 5.91e-04 | 4145.34 ms | 32.6% bf16 MFU | 126207 tok/s step 690/19560 | loss 4.969951 (-0.70z)| norm 0.7651 (+0.02z)| lr 5.91e-04 | 4154.39 ms | 32.5% bf16 MFU | 126206 tok/s step 691/19560 | loss 4.848440 (-2.01z)| norm 0.6370 (-0.90z)| lr 5.92e-04 | 4143.55 ms | 32.6% bf16 MFU | 126223 tok/s step 692/19560 | loss 4.946679 (-0.91z)| norm 0.6596 (-0.73z)| lr 5.93e-04 | 4153.25 ms | 32.5% bf16 MFU | 126223 tok/s step 693/19560 | loss 4.846752 (-2.00z)| norm 0.7311 (-0.20z)| lr 5.94e-04 | 4356.63 ms | 31.0% bf16 MFU | 125929 tok/s step 694/19560 | loss 4.888895 (-1.51z)| norm 0.8193 (+0.44z)| lr 5.95e-04 | 4153.55 ms | 32.5% bf16 MFU | 125944 tok/s step 695/19560 | loss 4.910860 (-1.24z)| norm 0.9138 (+1.13z)| lr 5.96e-04 | 4146.17 ms | 32.6% bf16 MFU | 125969 tok/s step 696/19560 | loss 4.847974 (-1.92z)| norm 0.7997 (+0.30z)| lr 5.97e-04 | 4173.40 ms | 32.4% bf16 MFU | 125952 tok/s step 697/19560 | loss 4.883776 (-1.50z)| norm 0.6299 (-0.93z)| lr 5.97e-04 | 4151.51 ms | 32.5% bf16 MFU | 125969 tok/s step 698/19560 | loss 
4.911618 (-1.17z)| norm 0.6633 (-0.68z)| lr 5.98e-04 | 4163.42 ms | 32.4% bf16 MFU | 125967 tok/s step 699/19560 | loss 4.857482 (-1.78z)| norm 0.6982 (-0.42z)| lr 5.99e-04 | 4152.38 ms | 32.5% bf16 MFU | 125982 tok/s step 700/19560 | loss 4.798713 (-2.39z)| norm 0.8973 (+1.02z)| lr 6.00e-04 | 4147.70 ms | 32.6% bf16 MFU | 126003 tok/s step 701/19560 | loss 4.927189 (-0.91z)| norm 0.9513 (+1.39z)| lr 6.00e-04 | 5476.96 ms | 24.7% bf16 MFU | 124489 tok/s step 702/19560 | loss 4.879541 (-1.44z)| norm 0.9197 (+1.14z)| lr 6.00e-04 | 4384.33 ms | 30.8% bf16 MFU | 124244 tok/s step 703/19560 | loss 4.880314 (-1.41z)| norm 0.8438 (+0.60z)| lr 6.00e-04 | 4157.07 ms | 32.5% bf16 MFU | 124337 tok/s step 704/19560 | loss 4.891588 (-1.26z)| norm 0.7768 (+0.12z)| lr 6.00e-04 | 4150.23 ms | 32.5% bf16 MFU | 124437 tok/s step 705/19560 | loss 4.850619 (-1.70z)| norm 0.7190 (-0.28z)| lr 6.00e-04 | 4152.67 ms | 32.5% bf16 MFU | 124528 tok/s step 706/19560 | loss 4.866300 (-1.52z)| norm 0.8177 (+0.45z)| lr 6.00e-04 | 4145.15 ms | 32.6% bf16 MFU | 124625 tok/s step 707/19560 | loss 4.907849 (-1.02z)| norm 0.7999 (+0.33z)| lr 6.00e-04 | 4147.85 ms | 32.6% bf16 MFU | 124714 tok/s step 708/19560 | loss 4.895057 (-1.15z)| norm 0.6940 (-0.44z)| lr 6.00e-04 | 4168.31 ms | 32.4% bf16 MFU | 124767 tok/s step 709/19560 | loss 4.813187 (-2.08z)| norm 0.6038 (-1.10z)| lr 6.00e-04 | 4142.60 ms | 32.6% bf16 MFU | 124857 tok/s step 710/19560 | loss 4.836961 (-1.76z)| norm 0.6538 (-0.72z)| lr 6.00e-04 | 4274.25 ms | 31.6% bf16 MFU | 124747 tok/s step 711/19560 | loss 4.959348 (-0.32z)| norm 0.5451 (-1.51z)| lr 6.00e-04 | 4138.73 ms | 32.6% bf16 MFU | 124844 tok/s step 712/19560 | loss 4.787807 (-2.32z)| norm 0.5351 (-1.56z)| lr 6.00e-04 | 4154.43 ms | 32.5% bf16 MFU | 124912 tok/s step 713/19560 | loss 4.823283 (-1.86z)| norm 0.6512 (-0.71z)| lr 6.00e-04 | 4143.91 ms | 32.6% bf16 MFU | 124992 tok/s step 714/19560 | loss 4.815512 (-1.92z)| norm 0.6689 (-0.58z)| lr 6.00e-04 | 4152.49 ms | 32.5% bf16 
MFU | 125055 tok/s step 715/19560 | loss 4.850516 (-1.49z)| norm 0.5876 (-1.17z)| lr 6.00e-04 | 4163.04 ms | 32.4% bf16 MFU | 125100 tok/s step 716/19560 | loss 4.814942 (-1.87z)| norm 0.6513 (-0.71z)| lr 6.00e-04 | 4170.76 ms | 32.4% bf16 MFU | 125130 tok/s step 717/19560 | loss 4.794540 (-2.07z)| norm 0.7964 (+0.34z)| lr 6.00e-04 | 4150.11 ms | 32.5% bf16 MFU | 125190 tok/s step 718/19560 | loss 4.818372 (-1.76z)| norm 0.8581 (+0.78z)| lr 6.00e-04 | 4140.87 ms | 32.6% bf16 MFU | 125261 tok/s step 719/19560 | loss 4.758850 (-2.38z)| norm 0.7687 (+0.10z)| lr 6.00e-04 | 4139.97 ms | 32.6% bf16 MFU | 125330 tok/s step 720/19560 | loss 4.784060 (-2.05z)| norm 0.7827 (+0.20z)| lr 6.00e-04 | 4153.35 ms | 32.5% bf16 MFU | 125375 tok/s step 721/19560 | loss 4.787440 (-1.97z)| norm 0.7325 (-0.17z)| lr 6.00e-04 | 4151.98 ms | 32.5% bf16 MFU | 125420 tok/s step 722/19560 | loss 4.783862 (-1.98z)| norm 0.6861 (-0.52z)| lr 6.00e-04 | 4139.27 ms | 32.6% bf16 MFU | 125482 tok/s step 723/19560 | loss 4.832684 (-1.41z)| norm 0.6450 (-0.82z)| lr 6.00e-04 | 4140.29 ms | 32.6% bf16 MFU | 125540 tok/s step 724/19560 | loss 4.790690 (-1.84z)| norm 0.5161 (-1.79z)| lr 6.00e-04 | 4162.01 ms | 32.4% bf16 MFU | 125561 tok/s step 725/19560 | loss 4.827788 (-1.40z)| norm 0.5407 (-1.57z)| lr 6.00e-04 | 4161.67 ms | 32.4% bf16 MFU | 125582 tok/s step 726/19560 | loss 4.793927 (-1.75z)| norm 0.5600 (-1.43z)| lr 6.00e-04 | 4154.01 ms | 32.5% bf16 MFU | 125614 tok/s step 727/19560 | loss 4.748523 (-2.21z)| norm 0.6797 (-0.48z)| lr 6.00e-04 | 4140.52 ms | 32.6% bf16 MFU | 125664 tok/s step 728/19560 | loss 4.819051 (-1.42z)| norm 0.7843 (+0.38z)| lr 6.00e-04 | 4146.71 ms | 32.6% bf16 MFU | 125703 tok/s step 729/19560 | loss 4.798915 (-1.62z)| norm 0.6898 (-0.38z)| lr 6.00e-04 | 4147.36 ms | 32.6% bf16 MFU | 125738 tok/s step 730/19560 | loss 4.737908 (-2.25z)| norm 0.6913 (-0.36z)| lr 6.00e-04 | 4383.54 ms | 30.8% bf16 MFU | 125432 tok/s step 731/19560 | loss 4.807175 (-1.46z)| norm 0.6315 
(-0.84z)| lr 6.00e-04 | 4164.91 ms | 32.4% bf16 MFU | 125454 tok/s step 732/19560 | loss 4.771341 (-1.82z)| norm 0.5860 (-1.20z)| lr 6.00e-04 | 4152.96 ms | 32.5% bf16 MFU | 125494 tok/s step 733/19560 | loss 4.758344 (-1.93z)| norm 0.5912 (-1.15z)| lr 6.00e-04 | 4194.86 ms | 32.2% bf16 MFU | 125468 tok/s step 734/19560 | loss 4.724557 (-2.25z)| norm 0.7533 (+0.15z)| lr 6.00e-04 | 4141.11 ms | 32.6% bf16 MFU | 125525 tok/s step 735/19560 | loss 4.758609 (-1.83z)| norm 0.7742 (+0.32z)| lr 6.00e-04 | 4148.15 ms | 32.5% bf16 MFU | 125568 tok/s step 736/19560 | loss 4.780861 (-1.57z)| norm 0.9800 (+1.94z)| lr 6.00e-04 | 4160.85 ms | 32.4% bf16 MFU | 125590 tok/s step 737/19560 | loss 4.800035 (-1.34z)| norm 0.9728 (+1.85z)| lr 6.00e-04 | 4152.11 ms | 32.5% bf16 MFU | 125624 tok/s step 738/19560 | loss 4.811452 (-1.20z)| norm 0.9794 (+1.91z)| lr 6.00e-04 | 4145.78 ms | 32.6% bf16 MFU | 125666 tok/s step 739/19560 | loss 4.722511 (-2.11z)| norm 0.7551 (+0.15z)| lr 6.00e-04 | 4138.79 ms | 32.6% bf16 MFU | 125717 tok/s step 740/19560 | loss 4.805267 (-1.21z)| norm 0.7192 (-0.12z)| lr 6.00e-04 | 4160.35 ms | 32.5% bf16 MFU | 125732 tok/s step 741/19560 | loss 4.765231 (-1.61z)| norm 0.8173 (+0.66z)| lr 6.00e-04 | 4149.81 ms | 32.5% bf16 MFU | 125762 tok/s step 742/19560 | loss 4.790125 (-1.32z)| norm 0.7323 (-0.02z)| lr 6.00e-04 | 4137.23 ms | 32.6% bf16 MFU | 125810 tok/s step 743/19560 | loss 4.823959 (-0.95z)| norm 0.6183 (-0.93z)| lr 6.00e-04 | 4153.41 ms | 32.5% bf16 MFU | 125831 tok/s step 744/19560 | loss 4.690168 (-2.33z)| norm 0.5617 (-1.37z)| lr 6.00e-04 | 4156.68 ms | 32.5% bf16 MFU | 125846 tok/s step 745/19560 | loss 4.698401 (-2.19z)| norm 0.6016 (-1.04z)| lr 6.00e-04 | 4167.69 ms | 32.4% bf16 MFU | 125844 tok/s step 746/19560 | loss 4.734725 (-1.78z)| norm 0.5501 (-1.43z)| lr 6.00e-04 | 4142.79 ms | 32.6% bf16 MFU | 125879 tok/s step 747/19560 | loss 4.704302 (-2.05z)| norm 0.5657 (-1.30z)| lr 6.00e-04 | 4139.17 ms | 32.6% bf16 MFU | 125919 tok/s step 
748/19560 | loss 4.739213 (-1.67z)| norm 0.6874 (-0.30z)| lr 6.00e-04 | 4144.97 ms | 32.6% bf16 MFU | 125947 tok/s step 749/19560 | loss 4.716482 (-1.87z)| norm 0.6685 (-0.45z)| lr 6.00e-04 | 4167.84 ms | 32.4% bf16 MFU | 125939 tok/s step 750/19560 | loss 4.763922 (-1.36z)| norm 0.5849 (-1.13z)| lr 6.00e-04 | 4159.86 ms | 32.5% bf16 MFU | 125944 tok/s val loss 4.723455 evaluating HellaSwag: 0/1256 evaluating HellaSwag: 10/1256 evaluating HellaSwag: 20/1256 evaluating HellaSwag: 30/1256 evaluating HellaSwag: 40/1256 evaluating HellaSwag: 50/1256 evaluating HellaSwag: 60/1256 evaluating HellaSwag: 70/1256 evaluating HellaSwag: 80/1256 evaluating HellaSwag: 90/1256 evaluating HellaSwag: 100/1256 evaluating HellaSwag: 110/1256 evaluating HellaSwag: 120/1256 evaluating HellaSwag: 130/1256 evaluating HellaSwag: 140/1256 evaluating HellaSwag: 150/1256 evaluating HellaSwag: 160/1256 evaluating HellaSwag: 170/1256 evaluating HellaSwag: 180/1256 evaluating HellaSwag: 190/1256 evaluating HellaSwag: 200/1256 evaluating HellaSwag: 210/1256 evaluating HellaSwag: 220/1256 evaluating HellaSwag: 230/1256 evaluating HellaSwag: 240/1256 evaluating HellaSwag: 250/1256 evaluating HellaSwag: 260/1256 evaluating HellaSwag: 270/1256 evaluating HellaSwag: 280/1256 evaluating HellaSwag: 290/1256 evaluating HellaSwag: 300/1256 evaluating HellaSwag: 310/1256 evaluating HellaSwag: 320/1256 evaluating HellaSwag: 330/1256 evaluating HellaSwag: 340/1256 evaluating HellaSwag: 350/1256 evaluating HellaSwag: 360/1256 evaluating HellaSwag: 370/1256 evaluating HellaSwag: 380/1256 evaluating HellaSwag: 390/1256 evaluating HellaSwag: 400/1256 evaluating HellaSwag: 410/1256 evaluating HellaSwag: 420/1256 evaluating HellaSwag: 430/1256 evaluating HellaSwag: 440/1256 evaluating HellaSwag: 450/1256 evaluating HellaSwag: 460/1256 evaluating HellaSwag: 470/1256 evaluating HellaSwag: 480/1256 evaluating HellaSwag: 490/1256 evaluating HellaSwag: 500/1256 evaluating HellaSwag: 510/1256 evaluating HellaSwag: 
1160/1256 evaluating HellaSwag: 1170/1256 evaluating HellaSwag: 1180/1256 evaluating HellaSwag: 1190/1256 evaluating HellaSwag: 1200/1256 evaluating HellaSwag: 1210/1256 evaluating HellaSwag: 1220/1256 evaluating HellaSwag: 1230/1256 evaluating HellaSwag: 1240/1256 evaluating HellaSwag: 1250/1256 HellaSwag: 2480/10042 = 0.246963 step 751/19560 | loss 4.732341 (-1.66z)| norm 0.6761 (-0.39z)| lr 6.00e-04 | 4161.45 ms | 32.4% bf16 MFU | 125946 tok/s step 752/19560 | loss 4.776701 (-1.18z)| norm 0.7238 (-0.02z)| lr 6.00e-04 | 4156.77 ms | 32.5% bf16 MFU | 125955 tok/s step 753/19560 | loss 4.765312 (-1.28z)| norm 0.9174 (+1.59z)| lr 6.00e-04 | 4144.98 ms | 32.6% bf16 MFU | 125982 tok/s step 754/19560 | loss 4.743116 (-1.48z)| norm 0.7643 (+0.29z)| lr 6.00e-04 | 4149.33 ms | 32.5% bf16 MFU | 126001 tok/s step 755/19560 | loss 4.747171 (-1.42z)| norm 0.6456 (-0.72z)| lr 6.00e-04 | 4151.57 ms | 32.5% bf16 MFU | 126015 tok/s step 756/19560 | loss 4.715487 (-1.72z)| norm 0.5753 (-1.30z)| lr 6.00e-04 | 4142.15 ms | 32.6% bf16 MFU | 126043 tok/s step 757/19560 | loss 4.706964 (-1.78z)| norm 0.7003 (-0.21z)| lr 6.00e-04 | 4158.12 ms | 32.5% bf16 MFU | 126045 tok/s step 758/19560 | loss 4.711347 (-1.71z)| norm 0.7551 (+0.31z)| lr 6.00e-04 | 4153.54 ms | 32.5% bf16 MFU | 126054 tok/s step 759/19560 | loss 4.716322 (-1.63z)| norm 0.6735 (-0.46z)| lr 6.00e-04 | 4146.67 ms | 32.6% bf16 MFU | 126073 tok/s step 760/19560 | loss 4.690375 (-1.86z)| norm 0.7386 (+0.14z)| lr 6.00e-04 | 4142.48 ms | 32.6% bf16 MFU | 126098 tok/s step 761/19560 | loss 4.678017 (-1.95z)| norm 0.7299 (+0.06z)| lr 6.00e-04 | 4156.69 ms | 32.5% bf16 MFU | 126100 tok/s step 762/19560 | loss 4.700541 (-1.69z)| norm 0.5734 (-1.42z)| lr 6.00e-04 | 4154.62 ms | 32.5% bf16 MFU | 126104 tok/s step 763/19560 | loss 4.750545 (-1.17z)| norm 0.5066 (-2.02z)| lr 6.00e-04 | 4152.90 ms | 32.5% bf16 MFU | 126111 tok/s step 764/19560 | loss 4.689342 (-1.76z)| norm 0.4814 (-2.20z)| lr 6.00e-04 | 4173.90 ms | 32.3% bf16 MFU | 
126086 tok/s step 765/19560 | loss 4.662873 (-1.99z)| norm 0.4782 (-2.17z)| lr 6.00e-04 | 4155.22 ms | 32.5% bf16 MFU | 126091 tok/s step 766/19560 | loss 4.667557 (-1.92z)| norm 0.4910 (-2.01z)| lr 6.00e-04 | 4163.00 ms | 32.4% bf16 MFU | 126083 tok/s step 767/19560 | loss 4.686333 (-1.70z)| norm 0.4601 (-2.22z)| lr 6.00e-04 | 4143.01 ms | 32.6% bf16 MFU | 126106 tok/s step 768/19560 | loss 4.711433 (-1.43z)| norm 0.5402 (-1.49z)| lr 6.00e-04 | 4165.59 ms | 32.4% bf16 MFU | 126094 tok/s step 769/19560 | loss 4.659329 (-1.93z)| norm 0.6605 (-0.45z)| lr 6.00e-04 | 4149.10 ms | 32.5% bf16 MFU | 126108 tok/s step 770/19560 | loss 4.652306 (-1.96z)| norm 0.7023 (-0.08z)| lr 6.00e-04 | 4151.33 ms | 32.5% bf16 MFU | 126117 tok/s step 771/19560 | loss 4.604406 (-2.40z)| norm 0.7109 (-0.00z)| lr 6.00e-04 | 4152.65 ms | 32.5% bf16 MFU | 126124 tok/s step 772/19560 | loss 4.642845 (-1.97z)| norm 0.6463 (-0.56z)| lr 6.00e-04 | 4167.66 ms | 32.4% bf16 MFU | 126108 tok/s step 773/19560 | loss 4.640206 (-1.96z)| norm 0.5680 (-1.22z)| lr 6.00e-04 | 4163.80 ms | 32.4% bf16 MFU | 126098 tok/s step 774/19560 | loss 4.681564 (-1.53z)| norm 0.5641 (-1.24z)| lr 6.00e-04 | 4143.87 ms | 32.6% bf16 MFU | 126119 tok/s step 775/19560 | loss 4.636529 (-1.94z)| norm 0.6126 (-0.80z)| lr 6.00e-04 | 4148.23 ms | 32.5% bf16 MFU | 126133 tok/s step 776/19560 | loss 4.633603 (-1.93z)| norm 0.5790 (-1.08z)| lr 6.00e-04 | 4148.69 ms | 32.5% bf16 MFU | 126145 tok/s step 777/19560 | loss 4.581892 (-2.38z)| norm 0.5738 (-1.12z)| lr 6.00e-04 | 4135.58 ms | 32.6% bf16 MFU | 126176 tok/s step 778/19560 | loss 4.606609 (-2.09z)| norm 0.6349 (-0.57z)| lr 6.00e-04 | 4145.11 ms | 32.6% bf16 MFU | 126192 tok/s step 779/19560 | loss 4.617550 (-1.94z)| norm 0.6272 (-0.62z)| lr 6.00e-04 | 4146.04 ms | 32.6% bf16 MFU | 126205 tok/s step 780/19560 | loss 4.565962 (-2.38z)| norm 0.5471 (-1.32z)| lr 6.00e-04 | 4149.50 ms | 32.5% bf16 MFU | 126212 tok/s step 781/19560 | loss 4.644843 (-1.60z)| norm 0.5172 (-1.55z)| lr 
6.00e-04 | 4154.08 ms | 32.5% bf16 MFU | 126212 tok/s step 782/19560 | loss 4.614006 (-1.86z)| norm 0.5892 (-0.91z)| lr 6.00e-04 | 4146.58 ms | 32.6% bf16 MFU | 126223 tok/s step 783/19560 | loss 4.644007 (-1.55z)| norm 0.6883 (-0.04z)| lr 6.00e-04 | 4142.45 ms | 32.6% bf16 MFU | 126240 tok/s step 784/19560 | loss 4.602549 (-1.91z)| norm 0.5949 (-0.86z)| lr 6.00e-04 | 4149.65 ms | 32.5% bf16 MFU | 126246 tok/s step 785/19560 | loss 4.600671 (-1.89z)| norm 0.4968 (-1.70z)| lr 6.00e-04 | 4159.60 ms | 32.5% bf16 MFU | 126235 tok/s step 786/19560 | loss 4.633266 (-1.55z)| norm 0.5209 (-1.47z)| lr 6.00e-04 | 4149.34 ms | 32.5% bf16 MFU | 126241 tok/s step 787/19560 | loss 4.617851 (-1.67z)| norm 0.5961 (-0.81z)| lr 6.00e-04 | 4150.98 ms | 32.5% bf16 MFU | 126245 tok/s step 788/19560 | loss 4.645998 (-1.39z)| norm 0.6736 (-0.14z)| lr 6.00e-04 | 4155.16 ms | 32.5% bf16 MFU | 126241 tok/s step 789/19560 | loss 4.657230 (-1.26z)| norm 0.6090 (-0.69z)| lr 6.00e-04 | 4173.08 ms | 32.4% bf16 MFU | 126211 tok/s step 790/19560 | loss 4.602058 (-1.74z)| norm 0.6370 (-0.45z)| lr 6.00e-04 | 4175.39 ms | 32.3% bf16 MFU | 126179 tok/s step 791/19560 | loss 4.577084 (-1.92z)| norm 0.5920 (-0.83z)| lr 6.00e-04 | 4145.13 ms | 32.6% bf16 MFU | 126194 tok/s step 792/19560 | loss 4.605041 (-1.64z)| norm 0.4950 (-1.64z)| lr 6.00e-04 | 4155.40 ms | 32.5% bf16 MFU | 126193 tok/s step 793/19560 | loss 4.612544 (-1.54z)| norm 0.4627 (-1.87z)| lr 6.00e-04 | 4154.32 ms | 32.5% bf16 MFU | 126193 tok/s step 794/19560 | loss 4.595887 (-1.67z)| norm 0.4993 (-1.54z)| lr 6.00e-04 | 4141.30 ms | 32.6% bf16 MFU | 126214 tok/s step 795/19560 | loss 4.607608 (-1.54z)| norm 0.5895 (-0.77z)| lr 6.00e-04 | 4159.40 ms | 32.5% bf16 MFU | 126205 tok/s step 796/19560 | loss 4.623144 (-1.37z)| norm 0.6850 (+0.05z)| lr 6.00e-04 | 4142.71 ms | 32.6% bf16 MFU | 126223 tok/s step 797/19560 | loss 4.610389 (-1.47z)| norm 0.6489 (-0.24z)| lr 6.00e-04 | 4162.44 ms | 32.4% bf16 MFU | 126210 tok/s step 798/19560 | loss 
4.632446 (-1.25z)| norm 0.6130 (-0.55z)| lr 6.00e-04 | 4176.68 ms | 32.3% bf16 MFU | 126175 tok/s step 799/19560 | loss 4.609378 (-1.44z)| norm 0.6387 (-0.32z)| lr 6.00e-04 | 4165.20 ms | 32.4% bf16 MFU | 126160 tok/s step 800/19560 | loss 4.571755 (-1.74z)| norm 0.6322 (-0.37z)| lr 6.00e-04 | 4152.34 ms | 32.5% bf16 MFU | 126166 tok/s step 801/19560 | loss 4.596229 (-1.50z)| norm 0.6057 (-0.59z)| lr 6.00e-04 | 4130.57 ms | 32.7% bf16 MFU | 126204 tok/s step 802/19560 | loss 4.614043 (-1.32z)| norm 0.5910 (-0.71z)| lr 6.00e-04 | 4140.94 ms | 32.6% bf16 MFU | 126224 tok/s step 803/19560 | loss 4.588289 (-1.53z)| norm 0.5083 (-1.39z)| lr 6.00e-04 | 4157.44 ms | 32.5% bf16 MFU | 126218 tok/s step 804/19560 | loss 4.563710 (-1.72z)| norm 0.5144 (-1.32z)| lr 6.00e-04 | 4157.45 ms | 32.5% bf16 MFU | 126213 tok/s step 805/19560 | loss 4.601615 (-1.37z)| norm 0.6738 (+0.04z)| lr 6.00e-04 | 4404.57 ms | 30.7% bf16 MFU | 125854 tok/s step 806/19560 | loss 4.624617 (-1.14z)| norm 0.7230 (+0.45z)| lr 6.00e-04 | 4142.34 ms | 32.6% bf16 MFU | 125889 tok/s step 807/19560 | loss 4.551342 (-1.78z)| norm 0.6364 (-0.29z)| lr 6.00e-04 | 4160.31 ms | 32.5% bf16 MFU | 125896 tok/s step 808/19560 | loss 4.611180 (-1.22z)| norm 0.7086 (+0.33z)| lr 6.00e-04 | 4178.19 ms | 32.3% bf16 MFU | 125875 tok/s step 809/19560 | loss 4.585300 (-1.43z)| norm 0.6565 (-0.11z)| lr 6.00e-04 | 4159.37 ms | 32.5% bf16 MFU | 125884 tok/s step 810/19560 | loss 4.628616 (-1.02z)| norm 0.6241 (-0.39z)| lr 6.00e-04 | 4140.37 ms | 32.6% bf16 MFU | 125921 tok/s step 811/19560 | loss 4.572625 (-1.51z)| norm 0.6635 (-0.05z)| lr 6.00e-04 | 4144.97 ms | 32.6% bf16 MFU | 125950 tok/s step 812/19560 | loss 4.626138 (-1.01z)| norm 0.5990 (-0.60z)| lr 6.00e-04 | 4150.23 ms | 32.5% bf16 MFU | 125968 tok/s step 813/19560 | loss 4.539516 (-1.77z)| norm 0.6033 (-0.56z)| lr 6.00e-04 | 4143.41 ms | 32.6% bf16 MFU | 125997 tok/s step 814/19560 | loss 4.534060 (-1.79z)| norm 0.6859 (+0.15z)| lr 6.00e-04 | 4176.83 ms | 32.3% bf16 
MFU | 125973 tok/s step 815/19560 | loss 4.589293 (-1.27z)| norm 0.6393 (-0.23z)| lr 6.00e-04 | 4159.00 ms | 32.5% bf16 MFU | 125978 tok/s step 816/19560 | loss 4.522037 (-1.85z)| norm 0.5042 (-1.38z)| lr 6.00e-04 | 4151.68 ms | 32.5% bf16 MFU | 125993 tok/s step 817/19560 | loss 4.507386 (-1.95z)| norm 0.4649 (-1.69z)| lr 6.00e-04 | 4143.90 ms | 32.6% bf16 MFU | 126019 tok/s step 818/19560 | loss 4.594559 (-1.14z)| norm 0.5183 (-1.21z)| lr 6.00e-04 | 4152.82 ms | 32.5% bf16 MFU | 126031 tok/s step 819/19560 | loss 4.557223 (-1.47z)| norm 0.5399 (-1.01z)| lr 6.00e-04 | 4143.03 ms | 32.6% bf16 MFU | 126056 tok/s step 820/19560 | loss 4.618298 (-0.89z)| norm 0.5879 (-0.59z)| lr 6.00e-04 | 4151.50 ms | 32.5% bf16 MFU | 126068 tok/s step 821/19560 | loss 4.493887 (-2.01z)| norm 0.5821 (-0.63z)| lr 6.00e-04 | 4140.78 ms | 32.6% bf16 MFU | 126095 tok/s step 822/19560 | loss 4.565408 (-1.33z)| norm 0.5715 (-0.71z)| lr 6.00e-04 | 4160.42 ms | 32.5% bf16 MFU | 126092 tok/s step 823/19560 | loss 4.540412 (-1.54z)| norm 0.5834 (-0.60z)| lr 6.00e-04 | 4153.02 ms | 32.5% bf16 MFU | 126099 tok/s step 824/19560 | loss 4.566422 (-1.28z)| norm 0.5441 (-0.93z)| lr 6.00e-04 | 4164.14 ms | 32.4% bf16 MFU | 126089 tok/s step 825/19560 | loss 4.676084 (-0.24z)| norm 0.5528 (-0.85z)| lr 6.00e-04 | 4156.94 ms | 32.5% bf16 MFU | 126091 tok/s step 826/19560 | loss 4.581748 (-1.12z)| norm 0.5867 (-0.54z)| lr 6.00e-04 | 4154.91 ms | 32.5% bf16 MFU | 126096 tok/s step 827/19560 | loss 4.548733 (-1.42z)| norm 0.6791 (+0.27z)| lr 6.00e-04 | 4145.00 ms | 32.6% bf16 MFU | 126115 tok/s step 828/19560 | loss 4.544778 (-1.43z)| norm 0.6230 (-0.21z)| lr 6.00e-04 | 4144.20 ms | 32.6% bf16 MFU | 126135 tok/s step 829/19560 | loss 4.528641 (-1.57z)| norm 0.5410 (-0.94z)| lr 6.00e-04 | 4141.37 ms | 32.6% bf16 MFU | 126158 tok/s step 830/19560 | loss 4.499314 (-1.82z)| norm 0.5036 (-1.28z)| lr 6.00e-04 | 4154.30 ms | 32.5% bf16 MFU | 126161 tok/s step 831/19560 | loss 4.507078 (-1.72z)| norm 0.5150 
(-1.16z)| lr 6.00e-04 | 4155.77 ms | 32.5% bf16 MFU | 126161 tok/s step 832/19560 | loss 4.538511 (-1.40z)| norm 0.5297 (-1.00z)| lr 6.00e-04 | 4149.02 ms | 32.5% bf16 MFU | 126171 tok/s step 833/19560 | loss 4.563267 (-1.15z)| norm 0.5260 (-1.02z)| lr 6.00e-04 | 4151.76 ms | 32.5% bf16 MFU | 126176 tok/s step 834/19560 | loss 4.559680 (-1.17z)| norm 0.5904 (-0.40z)| lr 6.00e-04 | 4160.30 ms | 32.5% bf16 MFU | 126168 tok/s step 835/19560 | loss 4.465487 (-2.07z)| norm 0.5805 (-0.48z)| lr 6.00e-04 | 4165.65 ms | 32.4% bf16 MFU | 126153 tok/s step 836/19560 | loss 4.565206 (-1.07z)| norm 0.5510 (-0.76z)| lr 6.00e-04 | 4169.77 ms | 32.4% bf16 MFU | 126132 tok/s step 837/19560 | loss 4.629509 (-0.41z)| norm 0.4780 (-1.44z)| lr 6.00e-04 | 4140.53 ms | 32.6% bf16 MFU | 126157 tok/s step 838/19560 | loss 4.524108 (-1.46z)| norm 0.4555 (-1.63z)| lr 6.00e-04 | 4151.62 ms | 32.5% bf16 MFU | 126163 tok/s step 839/19560 | loss 4.501510 (-1.70z)| norm 0.4769 (-1.41z)| lr 6.00e-04 | 4173.48 ms | 32.4% bf16 MFU | 126136 tok/s step 840/19560 | loss 4.537669 (-1.30z)| norm 0.5931 (-0.32z)| lr 6.00e-04 | 4150.39 ms | 32.5% bf16 MFU | 126145 tok/s step 841/19560 | loss 4.574052 (-0.91z)| norm 0.6413 (+0.14z)| lr 6.00e-04 | 4178.70 ms | 32.3% bf16 MFU | 126112 tok/s step 842/19560 | loss 4.525093 (-1.40z)| norm 0.5450 (-0.76z)| lr 6.00e-04 | 4148.90 ms | 32.5% bf16 MFU | 126124 tok/s step 843/19560 | loss 4.520700 (-1.43z)| norm 0.4498 (-1.64z)| lr 6.00e-04 | 4154.21 ms | 32.5% bf16 MFU | 126128 tok/s step 844/19560 | loss 4.512008 (-1.50z)| norm 0.4186 (-1.89z)| lr 6.00e-04 | 4139.60 ms | 32.6% bf16 MFU | 126155 tok/s step 845/19560 | loss 4.492531 (-1.68z)| norm 0.4048 (-1.98z)| lr 6.00e-04 | 4149.97 ms | 32.5% bf16 MFU | 126164 tok/s step 846/19560 | loss 4.532119 (-1.25z)| norm 0.4287 (-1.74z)| lr 6.00e-04 | 4142.31 ms | 32.6% bf16 MFU | 126184 tok/s step 847/19560 | loss 4.483549 (-1.73z)| norm 0.4726 (-1.31z)| lr 6.00e-04 | 4176.90 ms | 32.3% bf16 MFU | 126151 tok/s step 
848/19560 | loss 4.497542 (-1.56z)| norm 0.5534 (-0.55z)| lr 6.00e-04 | 4135.79 ms | 32.6% bf16 MFU | 126182 tok/s step 849/19560 | loss 4.498920 (-1.52z)| norm 0.6655 (+0.51z)| lr 6.00e-04 | 4145.97 ms | 32.6% bf16 MFU | 126195 tok/s step 850/19560 | loss 4.500949 (-1.47z)| norm 0.8111 (+1.85z)| lr 6.00e-04 | 4244.47 ms | 31.8% bf16 MFU | 126062 tok/s step 851/19560 | loss 4.470072 (-1.78z)| norm 0.7660 (+1.41z)| lr 6.00e-04 | 4218.17 ms | 32.0% bf16 MFU | 125973 tok/s step 852/19560 | loss 4.492456 (-1.52z)| norm 0.7163 (+0.94z)| lr 6.00e-04 | 4896.46 ms | 27.6% bf16 MFU | 125028 tok/s step 853/19560 | loss 4.573354 (-0.64z)| norm 0.6353 (+0.18z)| lr 6.00e-04 | 4147.45 ms | 32.6% bf16 MFU | 125098 tok/s step 854/19560 | loss 4.532670 (-1.07z)| norm 0.5964 (-0.18z)| lr 6.00e-04 | 4169.59 ms | 32.4% bf16 MFU | 125130 tok/s step 855/19560 | loss 4.539366 (-0.98z)| norm 0.6088 (-0.06z)| lr 6.00e-04 | 4150.76 ms | 32.5% bf16 MFU | 125189 tok/s step 856/19560 | loss 4.483072 (-1.58z)| norm 0.5986 (-0.14z)| lr 6.00e-04 | 4159.56 ms | 32.5% bf16 MFU | 125232 tok/s step 857/19560 | loss 4.474562 (-1.65z)| norm 0.4756 (-1.27z)| lr 6.00e-04 | 4148.83 ms | 32.5% bf16 MFU | 125289 tok/s step 858/19560 | loss 4.499742 (-1.35z)| norm 0.4820 (-1.19z)| lr 6.00e-04 | 4162.05 ms | 32.4% bf16 MFU | 125323 tok/s step 859/19560 | loss 4.489024 (-1.45z)| norm 0.5647 (-0.42z)| lr 6.00e-04 | 4156.38 ms | 32.5% bf16 MFU | 125363 tok/s step 860/19560 | loss 4.529162 (-0.99z)| norm 0.6642 (+0.50z)| lr 6.00e-04 | 4295.17 ms | 31.4% bf16 MFU | 125199 tok/s step 861/19560 | loss 4.427960 (-2.09z)| norm 0.6611 (+0.46z)| lr 6.00e-04 | 4210.19 ms | 32.1% bf16 MFU | 125165 tok/s step 862/19560 | loss 4.452559 (-1.78z)| norm 0.5285 (-0.75z)| lr 6.00e-04 | 4141.47 ms | 32.6% bf16 MFU | 125236 tok/s step 863/19560 | loss 4.509327 (-1.13z)| norm 0.5560 (-0.49z)| lr 6.00e-04 | 4146.96 ms | 32.6% bf16 MFU | 125296 tok/s step 864/19560 | loss 4.488204 (-1.35z)| norm 0.4890 (-1.13z)| lr 6.00e-04 | 4142.48 
ms | 32.6% bf16 MFU | 125359 tok/s step 865/19560 | loss 4.499045 (-1.22z)| norm 0.4372 (-1.67z)| lr 6.00e-04 | 4146.96 ms | 32.6% bf16 MFU | 125413 tok/s step 866/19560 | loss 4.482355 (-1.40z)| norm 0.4522 (-1.55z)| lr 6.00e-04 | 4159.64 ms | 32.5% bf16 MFU | 125444 tok/s step 867/19560 | loss 4.475229 (-1.46z)| norm 0.4616 (-1.43z)| lr 6.00e-04 | 4142.74 ms | 32.6% bf16 MFU | 125500 tok/s step 868/19560 | loss 4.521144 (-0.92z)| norm 0.4747 (-1.27z)| lr 6.00e-04 | 4222.76 ms | 32.0% bf16 MFU | 125433 tok/s step 869/19560 | loss 4.471204 (-1.49z)| norm 0.4840 (-1.16z)| lr 6.00e-04 | 4182.48 ms | 32.3% bf16 MFU | 125429 tok/s step 870/19560 | loss 4.407623 (-2.21z)| norm 0.4943 (-1.03z)| lr 6.00e-04 | 4165.11 ms | 32.4% bf16 MFU | 125451 tok/s step 871/19560 | loss 4.488254 (-1.25z)| norm 0.5113 (-0.83z)| lr 6.00e-04 | 4168.26 ms | 32.4% bf16 MFU | 125468 tok/s step 872/19560 | loss 4.439842 (-1.80z)| norm 0.6036 (+0.19z)| lr 6.00e-04 | 4358.68 ms | 31.0% bf16 MFU | 125209 tok/s step 873/19560 | loss 4.504984 (-1.00z)| norm 0.6241 (+0.42z)| lr 6.00e-04 | 6819.82 ms | 19.8% bf16 MFU | 122792 tok/s step 874/19560 | loss 4.424600 (-1.94z)| norm 0.5477 (-0.43z)| lr 6.00e-04 | 4143.09 ms | 32.6% bf16 MFU | 122980 tok/s step 875/19560 | loss 4.528825 (-0.67z)| norm 0.5816 (-0.06z)| lr 6.00e-04 | 4151.26 ms | 32.5% bf16 MFU | 123145 tok/s step 876/19560 | loss 4.448313 (-1.62z)| norm 0.5764 (-0.11z)| lr 6.00e-04 | 4155.83 ms | 32.5% bf16 MFU | 123296 tok/s step 877/19560 | loss 4.468525 (-1.36z)| norm 0.5154 (-0.77z)| lr 6.00e-04 | 4146.78 ms | 32.6% bf16 MFU | 123453 tok/s step 878/19560 | loss 4.425548 (-1.86z)| norm 0.4681 (-1.28z)| lr 6.00e-04 | 4142.55 ms | 32.6% bf16 MFU | 123608 tok/s step 879/19560 | loss 4.458728 (-1.44z)| norm 0.4496 (-1.46z)| lr 6.00e-04 | 4241.62 ms | 31.8% bf16 MFU | 123608 tok/s step 880/19560 | loss 4.428349 (-1.80z)| norm 0.5065 (-0.82z)| lr 6.00e-04 | 4159.29 ms | 32.5% bf16 MFU | 123730 tok/s step 881/19560 | loss 4.429509 (-1.77z)| 
norm 0.5312 (-0.54z)| lr 6.00e-04 | 4154.07 ms | 32.5% bf16 MFU | 123854 tok/s step 882/19560 | loss 4.380317 (-2.35z)| norm 0.5836 (+0.09z)| lr 6.00e-04 | 4146.08 ms | 32.6% bf16 MFU | 123984 tok/s step 883/19560 | loss 4.459574 (-1.34z)| norm 0.6128 (+0.45z)| lr 6.00e-04 | 4155.26 ms | 32.5% bf16 MFU | 124094 tok/s step 884/19560 | loss 4.434773 (-1.64z)| norm 0.5770 (+0.02z)| lr 6.00e-04 | 4152.57 ms | 32.5% bf16 MFU | 124202 tok/s step 885/19560 | loss 4.451157 (-1.41z)| norm 0.4840 (-1.09z)| lr 6.00e-04 | 4170.20 ms | 32.4% bf16 MFU | 124278 tok/s step 886/19560 | loss 4.472383 (-1.12z)| norm 0.5707 (-0.02z)| lr 6.00e-04 | 4152.37 ms | 32.5% bf16 MFU | 124377 tok/s step 887/19560 | loss 4.491097 (-0.87z)| norm 0.6877 (+1.42z)| lr 6.00e-04 | 4142.54 ms | 32.6% bf16 MFU | 124486 tok/s step 888/19560 | loss 4.445339 (-1.45z)| norm 0.6640 (+1.15z)| lr 6.00e-04 | 4155.91 ms | 32.5% bf16 MFU | 124570 tok/s step 889/19560 | loss 4.480494 (-0.97z)| norm 0.6949 (+1.55z)| lr 6.00e-04 | 4147.37 ms | 32.6% bf16 MFU | 124662 tok/s step 890/19560 | loss 4.598138 (+0.61z)| norm 0.7124 (+1.73z)| lr 6.00e-04 | 4146.53 ms | 32.6% bf16 MFU | 124751 tok/s step 891/19560 | loss 4.451771 (-1.36z)| norm 0.6289 (+0.68z)| lr 6.00e-04 | 4152.44 ms | 32.5% bf16 MFU | 124826 tok/s step 892/19560 | loss 4.389336 (-2.18z)| norm 0.5438 (-0.38z)| lr 6.00e-04 | 4164.80 ms | 32.4% bf16 MFU | 124879 tok/s step 893/19560 | loss 4.476994 (-0.96z)| norm 0.4872 (-1.09z)| lr 6.00e-04 | 4158.86 ms | 32.5% bf16 MFU | 124939 tok/s step 894/19560 | loss 4.431361 (-1.57z)| norm 0.4833 (-1.13z)| lr 6.00e-04 | 4157.28 ms | 32.5% bf16 MFU | 124997 tok/s step 895/19560 | loss 4.438078 (-1.46z)| norm 0.4756 (-1.23z)| lr 6.00e-04 | 4181.32 ms | 32.3% bf16 MFU | 125017 tok/s step 896/19560 | loss 4.482390 (-0.83z)| norm 0.5799 (+0.07z)| lr 6.00e-04 | 4140.66 ms | 32.6% bf16 MFU | 125097 tok/s step 897/19560 | loss 4.406796 (-1.87z)| norm 0.6228 (+0.61z)| lr 6.00e-04 | 4154.59 ms | 32.5% bf16 MFU | 125152 tok/s 
step 898/19560 | loss 4.405282 (-1.86z)| norm 0.6083 (+0.44z)| lr 6.00e-04 | 4151.33 ms | 32.5% bf16 MFU | 125209 tok/s step 899/19560 | loss 4.495388 (-0.58z)| norm 0.5886 (+0.20z)| lr 6.00e-04 | 4146.23 ms | 32.6% bf16 MFU | 125271 tok/s step 900/19560 | loss 4.543745 (+0.11z)| norm 0.5559 (-0.21z)| lr 6.00e-04 | 4143.46 ms | 32.6% bf16 MFU | 125334 tok/s step 901/19560 | loss 4.388460 (-2.05z)| norm 0.4833 (-1.13z)| lr 6.00e-04 | 4144.70 ms | 32.6% bf16 MFU | 125392 tok/s step 902/19560 | loss 4.408422 (-1.75z)| norm 0.4244 (-1.84z)| lr 6.00e-04 | 4144.02 ms | 32.6% bf16 MFU | 125449 tok/s step 903/19560 | loss 4.413375 (-1.65z)| norm 0.4385 (-1.63z)| lr 6.00e-04 | 4161.67 ms | 32.4% bf16 MFU | 125475 tok/s step 904/19560 | loss 4.427725 (-1.43z)| norm 0.4412 (-1.57z)| lr 6.00e-04 | 4174.19 ms | 32.3% bf16 MFU | 125481 tok/s step 905/19560 | loss 4.389142 (-1.93z)| norm 0.4623 (-1.29z)| lr 6.00e-04 | 4146.13 ms | 32.6% bf16 MFU | 125530 tok/s step 906/19560 | loss 4.344875 (-2.47z)| norm 0.4880 (-0.96z)| lr 6.00e-04 | 4156.10 ms | 32.5% bf16 MFU | 125561 tok/s step 907/19560 | loss 4.405111 (-1.61z)| norm 0.4545 (-1.34z)| lr 6.00e-04 | 4217.30 ms | 32.0% bf16 MFU | 125499 tok/s step 908/19560 | loss 4.449981 (-0.98z)| norm 0.4818 (-1.00z)| lr 6.00e-04 | 4153.85 ms | 32.5% bf16 MFU | 125535 tok/s step 909/19560 | loss 4.423509 (-1.33z)| norm 0.4615 (-1.24z)| lr 6.00e-04 | 4154.98 ms | 32.5% bf16 MFU | 125567 tok/s step 910/19560 | loss 4.404213 (-1.57z)| norm 0.5100 (-0.64z)| lr 6.00e-04 | 4201.03 ms | 32.1% bf16 MFU | 125529 tok/s step 911/19560 | loss 4.444175 (-1.01z)| norm 0.4985 (-0.77z)| lr 6.00e-04 | 4153.84 ms | 32.5% bf16 MFU | 125563 tok/s step 912/19560 | loss 4.378644 (-1.87z)| norm 0.4317 (-1.55z)| lr 6.00e-04 | 4154.09 ms | 32.5% bf16 MFU | 125596 tok/s step 913/19560 | loss 4.451152 (-0.86z)| norm 0.4856 (-0.90z)| lr 6.00e-04 | 4146.21 ms | 32.6% bf16 MFU | 125638 tok/s step 914/19560 | loss 4.447934 (-0.89z)| norm 0.5694 (+0.11z)| lr 6.00e-04 | 
4142.58 ms | 32.6% bf16 MFU | 125684 tok/s step 915/19560 | loss 4.425172 (-1.19z)| norm 0.6138 (+0.64z)| lr 6.00e-04 | 4161.99 ms | 32.4% bf16 MFU | 125699 tok/s step 916/19560 | loss 4.446879 (-0.88z)| norm 0.6456 (+1.03z)| lr 6.00e-04 | 4151.15 ms | 32.5% bf16 MFU | 125729 tok/s step 917/19560 | loss 4.438625 (-0.98z)| norm 0.6994 (+1.66z)| lr 6.00e-04 | 4159.23 ms | 32.5% bf16 MFU | 125745 tok/s step 918/19560 | loss 4.466086 (-0.58z)| norm 0.5700 (+0.12z)| lr 6.00e-04 | 4147.97 ms | 32.6% bf16 MFU | 125778 tok/s step 919/19560 | loss 4.381656 (-1.75z)| norm 0.4854 (-0.89z)| lr 6.00e-04 | 4152.24 ms | 32.5% bf16 MFU | 125802 tok/s step 920/19560 | loss 4.441757 (-0.89z)| norm 0.4732 (-1.03z)| lr 6.00e-04 | 4148.67 ms | 32.5% bf16 MFU | 125831 tok/s step 921/19560 | loss 4.380440 (-1.73z)| norm 0.4364 (-1.47z)| lr 6.00e-04 | 4185.34 ms | 32.3% bf16 MFU | 125803 tok/s step 922/19560 | loss 4.376863 (-1.74z)| norm 0.4544 (-1.24z)| lr 6.00e-04 | 4143.59 ms | 32.6% bf16 MFU | 125839 tok/s step 923/19560 | loss 4.374866 (-1.74z)| norm 0.4291 (-1.52z)| lr 6.00e-04 | 4155.70 ms | 32.5% bf16 MFU | 125855 tok/s step 924/19560 | loss 4.368805 (-1.80z)| norm 0.4744 (-0.97z)| lr 6.00e-04 | 4165.81 ms | 32.4% bf16 MFU | 125855 tok/s step 925/19560 | loss 4.384296 (-1.56z)| norm 0.4524 (-1.21z)| lr 6.00e-04 | 4147.38 ms | 32.6% bf16 MFU | 125883 tok/s step 926/19560 | loss 4.375625 (-1.66z)| norm 0.5099 (-0.52z)| lr 6.00e-04 | 5204.55 ms | 25.9% bf16 MFU | 124626 tok/s step 927/19560 | loss 4.385227 (-1.50z)| norm 0.5138 (-0.46z)| lr 6.00e-04 | 4144.78 ms | 32.6% bf16 MFU | 124719 tok/s step 928/19560 | loss 4.395167 (-1.34z)| norm 0.4814 (-0.83z)| lr 6.00e-04 | 4155.30 ms | 32.5% bf16 MFU | 124792 tok/s step 929/19560 | loss 4.375902 (-1.58z)| norm 0.5111 (-0.47z)| lr 6.00e-04 | 4141.00 ms | 32.6% bf16 MFU | 124883 tok/s step 930/19560 | loss 4.365022 (-1.71z)| norm 0.5054 (-0.53z)| lr 6.00e-04 | 4144.55 ms | 32.6% bf16 MFU | 124964 tok/s step 931/19560 | loss 4.386584 
(-1.38z)| norm 0.4683 (-0.97z)| lr 6.00e-04 | 4148.73 ms | 32.5% bf16 MFU | 125034 tok/s step 932/19560 | loss 4.365622 (-1.65z)| norm 0.4342 (-1.36z)| lr 6.00e-04 | 4144.05 ms | 32.6% bf16 MFU | 125108 tok/s step 933/19560 | loss 4.403763 (-1.10z)| norm 0.4173 (-1.53z)| lr 6.00e-04 | 4193.63 ms | 32.2% bf16 MFU | 125104 tok/s step 934/19560 | loss 4.332341 (-2.07z)| norm 0.4081 (-1.62z)| lr 6.00e-04 | 4150.02 ms | 32.5% bf16 MFU | 125165 tok/s step 935/19560 | loss 4.366664 (-1.56z)| norm 0.4543 (-1.06z)| lr 6.00e-04 | 4166.87 ms | 32.4% bf16 MFU | 125198 tok/s step 936/19560 | loss 4.374604 (-1.43z)| norm 0.4872 (-0.65z)| lr 6.00e-04 | 4144.85 ms | 32.6% bf16 MFU | 125263 tok/s step 937/19560 | loss 4.373844 (-1.42z)| norm 0.4173 (-1.47z)| lr 6.00e-04 | 4155.09 ms | 32.5% bf16 MFU | 125309 tok/s step 938/19560 | loss 4.379047 (-1.33z)| norm 0.4428 (-1.15z)| lr 6.00e-04 | 4159.04 ms | 32.5% bf16 MFU | 125346 tok/s step 939/19560 | loss 4.367495 (-1.47z)| norm 0.4946 (-0.51z)| lr 6.00e-04 | 4162.72 ms | 32.4% bf16 MFU | 125376 tok/s step 940/19560 | loss 4.396810 (-1.05z)| norm 0.5751 (+0.46z)| lr 6.00e-04 | 4148.24 ms | 32.5% bf16 MFU | 125427 tok/s step 941/19560 | loss 4.374796 (-1.34z)| norm 0.6009 (+0.78z)| lr 6.00e-04 | 4162.60 ms | 32.4% bf16 MFU | 125453 tok/s step 942/19560 | loss 4.372553 (-1.35z)| norm 0.6232 (+1.06z)| lr 6.00e-04 | 4161.84 ms | 32.4% bf16 MFU | 125479 tok/s step 943/19560 | loss 4.331776 (-1.91z)| norm 0.5888 (+0.65z)| lr 6.00e-04 | 4164.70 ms | 32.4% bf16 MFU | 125500 tok/s step 944/19560 | loss 4.368430 (-1.36z)| norm 0.4724 (-0.77z)| lr 6.00e-04 | 4149.84 ms | 32.5% bf16 MFU | 125542 tok/s step 945/19560 | loss 4.358911 (-1.47z)| norm 0.4852 (-0.62z)| lr 6.00e-04 | 4214.14 ms | 32.0% bf16 MFU | 125485 tok/s step 946/19560 | loss 4.375561 (-1.22z)| norm 0.5205 (-0.19z)| lr 6.00e-04 | 4151.73 ms | 32.5% bf16 MFU | 125525 tok/s step 947/19560 | loss 4.370446 (-1.27z)| norm 0.5390 (+0.04z)| lr 6.00e-04 | 4150.54 ms | 32.5% bf16 MFU | 
125565 tok/s step 948/19560 | loss 4.411848 (-0.67z)| norm 0.5001 (-0.43z)| lr 6.00e-04 | 4166.09 ms | 32.4% bf16 MFU | 125579 tok/s step 949/19560 | loss 4.343562 (-1.63z)| norm 0.4502 (-1.03z)| lr 6.00e-04 | 4161.01 ms | 32.4% bf16 MFU | 125600 tok/s step 950/19560 | loss 4.413587 (-0.60z)| norm 0.4375 (-1.17z)| lr 6.00e-04 | 4161.18 ms | 32.4% bf16 MFU | 125620 tok/s step 951/19560 | loss 4.244847 (-2.95z)| norm 0.4509 (-0.99z)| lr 6.00e-04 | 4156.16 ms | 32.5% bf16 MFU | 125646 tok/s step 952/19560 | loss 4.388772 (-0.89z)| norm 0.5299 (-0.02z)| lr 6.00e-04 | 4159.01 ms | 32.5% bf16 MFU | 125667 tok/s step 953/19560 | loss 4.323602 (-1.84z)| norm 0.5431 (+0.14z)| lr 6.00e-04 | 4167.66 ms | 32.4% bf16 MFU | 125673 tok/s step 954/19560 | loss 4.368807 (-1.16z)| norm 0.5925 (+0.74z)| lr 6.00e-04 | 4156.29 ms | 32.5% bf16 MFU | 125697 tok/s step 955/19560 | loss 4.366425 (-1.18z)| norm 0.6721 (+1.71z)| lr 6.00e-04 | 4143.09 ms | 32.6% bf16 MFU | 125739 tok/s step 956/19560 | loss 4.327767 (-1.72z)| norm 0.5734 (+0.51z)| lr 6.00e-04 | 4153.47 ms | 32.5% bf16 MFU | 125764 tok/s step 957/19560 | loss 4.377759 (-0.97z)| norm 0.4806 (-0.61z)| lr 6.00e-04 | 4174.36 ms | 32.3% bf16 MFU | 125755 tok/s step 958/19560 | loss 4.378768 (-0.94z)| norm 0.4505 (-0.97z)| lr 6.00e-04 | 4161.78 ms | 32.4% bf16 MFU | 125766 tok/s step 959/19560 | loss 4.372918 (-1.01z)| norm 0.4301 (-1.21z)| lr 6.00e-04 | 4162.18 ms | 32.4% bf16 MFU | 125776 tok/s step 960/19560 | loss 4.343076 (-1.43z)| norm 0.3984 (-1.56z)| lr 6.00e-04 | 4156.12 ms | 32.5% bf16 MFU | 125795 tok/s step 961/19560 | loss 4.353112 (-1.26z)| norm 0.3614 (-1.96z)| lr 6.00e-04 | 4238.31 ms | 31.9% bf16 MFU | 125690 tok/s step 962/19560 | loss 4.413294 (-0.35z)| norm 0.3694 (-1.83z)| lr 6.00e-04 | 4171.10 ms | 32.4% bf16 MFU | 125691 tok/s step 963/19560 | loss 4.367284 (-1.03z)| norm 0.4249 (-1.16z)| lr 6.00e-04 | 4148.55 ms | 32.5% bf16 MFU | 125725 tok/s step 964/19560 | loss 4.405528 (-0.44z)| norm 0.5037 (-0.24z)| lr 
6.00e-04 | 4151.03 ms | 32.5% bf16 MFU | 125754 tok/s step 965/19560 | loss 4.331620 (-1.58z)| norm 0.5505 (+0.30z)| lr 6.00e-04 | 4146.26 ms | 32.6% bf16 MFU | 125789 tok/s step 966/19560 | loss 4.320589 (-1.73z)| norm 0.5579 (+0.38z)| lr 6.00e-04 | 4173.93 ms | 32.3% bf16 MFU | 125780 tok/s step 967/19560 | loss 4.336472 (-1.45z)| norm 0.5008 (-0.29z)| lr 6.00e-04 | 4142.56 ms | 32.6% bf16 MFU | 125819 tok/s step 968/19560 | loss 4.367531 (-0.95z)| norm 0.5078 (-0.20z)| lr 6.00e-04 | 4152.32 ms | 32.5% bf16 MFU | 125841 tok/s step 969/19560 | loss 4.360922 (-1.05z)| norm 0.5431 (+0.22z)| lr 6.00e-04 | 4153.96 ms | 32.5% bf16 MFU | 125860 tok/s step 970/19560 | loss 4.320029 (-1.68z)| norm 0.5373 (+0.15z)| lr 6.00e-04 | 4153.53 ms | 32.5% bf16 MFU | 125878 tok/s step 971/19560 | loss 4.344875 (-1.26z)| norm 0.5463 (+0.25z)| lr 6.00e-04 | 4160.71 ms | 32.5% bf16 MFU | 125885 tok/s step 972/19560 | loss 4.289953 (-2.10z)| norm 0.4939 (-0.38z)| lr 6.00e-04 | 4147.31 ms | 32.6% bf16 MFU | 125911 tok/s step 973/19560 | loss 4.406199 (-0.23z)| norm 0.4089 (-1.39z)| lr 6.00e-04 | 4153.43 ms | 32.5% bf16 MFU | 125927 tok/s step 974/19560 | loss 4.318318 (-1.62z)| norm 0.4435 (-0.98z)| lr 6.00e-04 | 4163.75 ms | 32.4% bf16 MFU | 125927 tok/s step 975/19560 | loss 4.359049 (-0.95z)| norm 0.4386 (-1.03z)| lr 6.00e-04 | 4154.28 ms | 32.5% bf16 MFU | 125941 tok/s step 976/19560 | loss 4.355531 (-0.99z)| norm 0.3824 (-1.67z)| lr 6.00e-04 | 4162.44 ms | 32.4% bf16 MFU | 125941 tok/s step 977/19560 | loss 4.338413 (-1.25z)| norm 0.3624 (-1.87z)| lr 6.00e-04 | 4153.76 ms | 32.5% bf16 MFU | 125955 tok/s step 978/19560 | loss 4.350200 (-1.04z)| norm 0.4024 (-1.42z)| lr 6.00e-04 | 4160.11 ms | 32.5% bf16 MFU | 125959 tok/s step 979/19560 | loss 4.294776 (-1.89z)| norm 0.4452 (-0.90z)| lr 6.00e-04 | 4158.28 ms | 32.5% bf16 MFU | 125965 tok/s step 980/19560 | loss 4.363134 (-0.79z)| norm 0.4851 (-0.38z)| lr 6.00e-04 | 4183.30 ms | 32.3% bf16 MFU | 125933 tok/s step 981/19560 | loss 
4.375899 (-0.57z)| norm 0.4783 (-0.45z)| lr 6.00e-04 | 4150.35 ms | 32.5% bf16 MFU | 125953 tok/s step 982/19560 | loss 4.275250 (-2.19z)| norm 0.3720 (-1.80z)| lr 6.00e-04 | 4155.40 ms | 32.5% bf16 MFU | 125964 tok/s step 983/19560 | loss 4.339678 (-1.12z)| norm 0.4030 (-1.38z)| lr 6.00e-04 | 4320.28 ms | 31.3% bf16 MFU | 125733 tok/s step 984/19560 | loss 4.285474 (-1.98z)| norm 0.3760 (-1.70z)| lr 6.00e-04 | 4166.01 ms | 32.4% bf16 MFU | 125739 tok/s step 985/19560 | loss 4.259423 (-2.34z)| norm 0.3609 (-1.85z)| lr 6.00e-04 | 4151.92 ms | 32.5% bf16 MFU | 125766 tok/s step 986/19560 | loss 4.282437 (-1.93z)| norm 0.3817 (-1.57z)| lr 6.00e-04 | 4150.34 ms | 32.5% bf16 MFU | 125794 tok/s step 987/19560 | loss 4.394337 (-0.11z)| norm 0.4498 (-0.70z)| lr 6.00e-04 | 4145.71 ms | 32.6% bf16 MFU | 125827 tok/s step 988/19560 | loss 4.373552 (-0.44z)| norm 0.5648 (+0.77z)| lr 6.00e-04 | 4151.91 ms | 32.5% bf16 MFU | 125850 tok/s step 989/19560 | loss 4.308431 (-1.49z)| norm 0.6705 (+2.11z)| lr 6.00e-04 | 4165.26 ms | 32.4% bf16 MFU | 125851 tok/s step 990/19560 | loss 4.306285 (-1.50z)| norm 0.6360 (+1.64z)| lr 6.00e-04 | 4155.74 ms | 32.5% bf16 MFU | 125866 tok/s step 991/19560 | loss 4.342887 (-0.89z)| norm 0.5483 (+0.54z)| lr 6.00e-04 | 4156.19 ms | 32.5% bf16 MFU | 125880 tok/s step 992/19560 | loss 4.339830 (-0.92z)| norm 0.4595 (-0.58z)| lr 6.00e-04 | 4144.06 ms | 32.6% bf16 MFU | 125912 tok/s step 993/19560 | loss 4.349755 (-0.75z)| norm 0.4508 (-0.69z)| lr 6.00e-04 | 4160.57 ms | 32.5% bf16 MFU | 125917 tok/s step 994/19560 | loss 4.367080 (-0.45z)| norm 0.5295 (+0.30z)| lr 6.00e-04 | 4173.45 ms | 32.4% bf16 MFU | 125903 tok/s step 995/19560 | loss 4.318614 (-1.24z)| norm 0.5739 (+0.84z)| lr 6.00e-04 | 4152.27 ms | 32.5% bf16 MFU | 125921 tok/s step 996/19560 | loss 4.345092 (-0.79z)| norm 0.5641 (+0.71z)| lr 6.00e-04 | 4182.54 ms | 32.3% bf16 MFU | 125892 tok/s step 997/19560 | loss 4.381394 (-0.15z)| norm 0.7304 (+2.70z)| lr 6.00e-04 | 4147.58 ms | 32.6% bf16 
MFU | 125918 tok/s step 998/19560 | loss 4.251171 (-2.33z)| norm 0.4972 (-0.15z)| lr 6.00e-04 | 4154.29 ms | 32.5% bf16 MFU | 125932 tok/s step 999/19560 | loss 4.390424 (+0.04z)| norm 0.5087 (-0.01z)| lr 6.00e-04 | 4162.74 ms | 32.4% bf16 MFU | 125933 tok/s step 1000/19560 | loss 4.357683 (-0.51z)| norm 0.6404 (+1.59z)| lr 6.00e-04 | 4154.67 ms | 32.5% bf16 MFU | 125946 tok/s val loss 4.351290 evaluating HellaSwag: 0/1256 evaluating HellaSwag: 10/1256 evaluating HellaSwag: 20/1256 evaluating HellaSwag: 30/1256 evaluating HellaSwag: 40/1256 evaluating HellaSwag: 50/1256 evaluating HellaSwag: 60/1256 evaluating HellaSwag: 70/1256 evaluating HellaSwag: 80/1256 evaluating HellaSwag: 90/1256 evaluating HellaSwag: 100/1256 evaluating HellaSwag: 110/1256 evaluating HellaSwag: 120/1256 evaluating HellaSwag: 130/1256 evaluating HellaSwag: 140/1256 evaluating HellaSwag: 150/1256 evaluating HellaSwag: 160/1256 evaluating HellaSwag: 170/1256 evaluating HellaSwag: 180/1256 evaluating HellaSwag: 190/1256 evaluating HellaSwag: 200/1256 evaluating HellaSwag: 210/1256 evaluating HellaSwag: 220/1256 evaluating HellaSwag: 230/1256 evaluating HellaSwag: 240/1256 evaluating HellaSwag: 250/1256 evaluating HellaSwag: 260/1256 evaluating HellaSwag: 270/1256 evaluating HellaSwag: 280/1256 evaluating HellaSwag: 290/1256 evaluating HellaSwag: 300/1256 evaluating HellaSwag: 310/1256 evaluating HellaSwag: 320/1256 evaluating HellaSwag: 330/1256 evaluating HellaSwag: 340/1256 evaluating HellaSwag: 350/1256 evaluating HellaSwag: 360/1256 evaluating HellaSwag: 370/1256 evaluating HellaSwag: 380/1256 evaluating HellaSwag: 390/1256 evaluating HellaSwag: 400/1256 evaluating HellaSwag: 410/1256 evaluating HellaSwag: 420/1256 evaluating HellaSwag: 430/1256 evaluating HellaSwag: 440/1256 evaluating HellaSwag: 450/1256 evaluating HellaSwag: 460/1256 evaluating HellaSwag: 470/1256 evaluating HellaSwag: 480/1256 evaluating HellaSwag: 490/1256 evaluating HellaSwag: 500/1256 evaluating HellaSwag: 510/1256 
evaluating HellaSwag: 520/1256 evaluating HellaSwag: 530/1256 evaluating HellaSwag: 540/1256 evaluating HellaSwag: 550/1256 evaluating HellaSwag: 560/1256 evaluating HellaSwag: 570/1256 evaluating HellaSwag: 580/1256 evaluating HellaSwag: 590/1256 evaluating HellaSwag: 600/1256 evaluating HellaSwag: 610/1256 evaluating HellaSwag: 620/1256 evaluating HellaSwag: 630/1256 evaluating HellaSwag: 640/1256 evaluating HellaSwag: 650/1256 evaluating HellaSwag: 660/1256 evaluating HellaSwag: 670/1256 evaluating HellaSwag: 680/1256 evaluating HellaSwag: 690/1256 evaluating HellaSwag: 700/1256 evaluating HellaSwag: 710/1256 evaluating HellaSwag: 720/1256 evaluating HellaSwag: 730/1256 evaluating HellaSwag: 740/1256 evaluating HellaSwag: 750/1256 evaluating HellaSwag: 760/1256 evaluating HellaSwag: 770/1256 evaluating HellaSwag: 780/1256 evaluating HellaSwag: 790/1256 evaluating HellaSwag: 800/1256 evaluating HellaSwag: 810/1256 evaluating HellaSwag: 820/1256 evaluating HellaSwag: 830/1256 evaluating HellaSwag: 840/1256 evaluating HellaSwag: 850/1256 evaluating HellaSwag: 860/1256 evaluating HellaSwag: 870/1256 evaluating HellaSwag: 880/1256 evaluating HellaSwag: 890/1256 evaluating HellaSwag: 900/1256 evaluating HellaSwag: 910/1256 evaluating HellaSwag: 920/1256 evaluating HellaSwag: 930/1256 evaluating HellaSwag: 940/1256 evaluating HellaSwag: 950/1256 evaluating HellaSwag: 960/1256 evaluating HellaSwag: 970/1256 evaluating HellaSwag: 980/1256 evaluating HellaSwag: 990/1256 evaluating HellaSwag: 1000/1256 evaluating HellaSwag: 1010/1256 evaluating HellaSwag: 1020/1256 evaluating HellaSwag: 1030/1256 evaluating HellaSwag: 1040/1256 evaluating HellaSwag: 1050/1256 evaluating HellaSwag: 1060/1256 evaluating HellaSwag: 1070/1256 evaluating HellaSwag: 1080/1256 evaluating HellaSwag: 1090/1256 evaluating HellaSwag: 1100/1256 evaluating HellaSwag: 1110/1256 evaluating HellaSwag: 1120/1256 evaluating HellaSwag: 1130/1256 evaluating HellaSwag: 1140/1256 evaluating HellaSwag: 1150/1256 
evaluating HellaSwag: 1160/1256 evaluating HellaSwag: 1170/1256 evaluating HellaSwag: 1180/1256 evaluating HellaSwag: 1190/1256 evaluating HellaSwag: 1200/1256 evaluating HellaSwag: 1210/1256 evaluating HellaSwag: 1220/1256 evaluating HellaSwag: 1230/1256 evaluating HellaSwag: 1240/1256 evaluating HellaSwag: 1250/1256 HellaSwag: 2590/10042 = 0.257917 step 1001/19560 | loss 4.318175 (-1.18z)| norm 0.6123 (+1.25z)| lr 6.00e-04 | 4147.43 ms | 32.6% bf16 MFU | 125969 tok/s step 1002/19560 | loss 4.325842 (-1.03z)| norm 0.5568 (+0.57z)| lr 6.00e-04 | 4152.19 ms | 32.5% bf16 MFU | 125984 tok/s step 1003/19560 | loss 4.359817 (-0.43z)| norm 0.4539 (-0.67z)| lr 6.00e-04 | 4165.42 ms | 32.4% bf16 MFU | 125978 tok/s step 1004/19560 | loss 4.293911 (-1.57z)| norm 0.4792 (-0.35z)| lr 6.00e-04 | 4153.58 ms | 32.5% bf16 MFU | 125991 tok/s step 1005/19560 | loss 4.388281 (+0.11z)| norm 0.5298 (+0.26z)| lr 6.00e-04 | 4178.70 ms | 32.3% bf16 MFU | 125965 tok/s step 1006/19560 | loss 4.350739 (-0.55z)| norm 0.5062 (-0.03z)| lr 6.00e-04 | 4161.93 ms | 32.4% bf16 MFU | 125965 tok/s step 1007/19560 | loss 4.356535 (-0.44z)| norm 0.5395 (+0.37z)| lr 6.00e-04 | 4146.22 ms | 32.6% bf16 MFU | 125989 tok/s step 1008/19560 | loss 4.299978 (-1.42z)| norm 0.4575 (-0.63z)| lr 6.00e-04 | 4149.68 ms | 32.5% bf16 MFU | 126007 tok/s step 1009/19560 | loss 4.252327 (-2.21z)| norm 0.6757 (+2.00z)| lr 6.00e-04 | 4163.20 ms | 32.4% bf16 MFU | 126003 tok/s step 1010/19560 | loss 4.352796 (-0.45z)| norm 0.4158 (-1.11z)| lr 6.00e-04 | 4171.42 ms | 32.4% bf16 MFU | 125987 tok/s step 1011/19560 | loss 4.341992 (-0.62z)| norm 0.6565 (+1.76z)| lr 6.00e-04 | 4166.27 ms | 32.4% bf16 MFU | 125980 tok/s step 1012/19560 | loss 4.332219 (-0.78z)| norm 0.5164 (+0.10z)| lr 6.00e-04 | 4156.92 ms | 32.5% bf16 MFU | 125987 tok/s step 1013/19560 | loss 4.344549 (-0.55z)| norm 0.6364 (+1.50z)| lr 6.00e-04 | 4152.06 ms | 32.5% bf16 MFU | 126002 tok/s step 1014/19560 | loss 4.373199 (-0.03z)| norm 1.4840 (+8.05z)| lr 
6.00e-04 | 4223.97 ms | 32.0% bf16 MFU | 125908 tok/s step 1015/19560 | loss 4.341451 (-0.59z)| norm 1.0486 (+4.13z)| lr 6.00e-04 | 4154.67 ms | 32.5% bf16 MFU | 125922 tok/s step 1016/19560 | loss 4.357483 (-0.29z)| norm 0.7946 (+2.11z)| lr 6.00e-04 | 4164.81 ms | 32.4% bf16 MFU | 125920 tok/s step 1017/19560 | loss 4.378585 (+0.12z)| norm 3.2183 (+9.91z)| lr 6.00e-04 | 4165.30 ms | 32.4% bf16 MFU | 125917 tok/s step 1018/19560 | loss 4.472327 (+2.00z)| norm 0.8849 (+1.26z)| lr 6.00e-04 | 4166.84 ms | 32.4% bf16 MFU | 125913 tok/s step 1019/19560 | loss 4.391287 (+0.41z)| norm 0.9342 (+1.43z)| lr 6.00e-04 | 4155.90 ms | 32.5% bf16 MFU | 125925 tok/s step 1020/19560 | loss 4.408880 (+0.76z)| norm 0.9219 (+1.36z)| lr 6.00e-04 | 4150.49 ms | 32.5% bf16 MFU | 125945 tok/s step 1021/19560 | loss 4.339309 (-0.62z)| norm 0.6916 (+0.52z)| lr 6.00e-04 | 4156.04 ms | 32.5% bf16 MFU | 125955 tok/s step 1022/19560 | loss 4.388470 (+0.39z)| norm 0.4735 (-0.27z)| lr 6.00e-04 | 4164.21 ms | 32.4% bf16 MFU | 125952 tok/s step 1023/19560 | loss 4.327250 (-0.85z)| norm 0.4361 (-0.41z)| lr 6.00e-04 | 4147.98 ms | 32.6% bf16 MFU | 125975 tok/s step 1024/19560 | loss 4.339831 (-0.58z)| norm 0.4685 (-0.29z)| lr 6.00e-04 | 4155.99 ms | 32.5% bf16 MFU | 125983 tok/s step 1025/19560 | loss 4.290837 (-1.58z)| norm 0.4127 (-0.48z)| lr 6.00e-04 | 4144.55 ms | 32.6% bf16 MFU | 126009 tok/s step 1026/19560 | loss 4.303467 (-1.29z)| norm 0.4092 (-0.49z)| lr 6.00e-04 | 4153.30 ms | 32.5% bf16 MFU | 126021 tok/s step 1027/19560 | loss 4.339140 (-0.54z)| norm 0.4489 (-0.34z)| lr 6.00e-04 | 4151.95 ms | 32.5% bf16 MFU | 126033 tok/s step 1028/19560 | loss 4.348336 (-0.33z)| norm 0.4488 (-0.34z)| lr 6.00e-04 | 4161.01 ms | 32.4% bf16 MFU | 126032 tok/s step 1029/19560 | loss 4.286840 (-1.69z)| norm 0.4350 (-0.39z)| lr 6.00e-04 | 4164.17 ms | 32.4% bf16 MFU | 126025 tok/s step 1030/19560 | loss 4.310772 (-1.14z)| norm 0.4282 (-0.41z)| lr 6.00e-04 | 4156.20 ms | 32.5% bf16 MFU | 126031 tok/s step 
1031/19560 | loss 4.325130 (-0.80z)| norm 0.4122 (-0.47z)| lr 6.00e-04 | 4151.35 ms | 32.5% bf16 MFU | 126044 tok/s step 1032/19560 | loss 4.291169 (-1.54z)| norm 0.3685 (-0.62z)| lr 6.00e-04 | 4157.25 ms | 32.5% bf16 MFU | 126048 tok/s step 1033/19560 | loss 4.311072 (-1.08z)| norm 0.3587 (-0.65z)| lr 6.00e-04 | 4161.63 ms | 32.4% bf16 MFU | 126045 tok/s step 1034/19560 | loss 4.354864 (-0.10z)| norm 0.3079 (-0.83z)| lr 6.00e-04 | 4146.56 ms | 32.6% bf16 MFU | 126064 tok/s step 1035/19560 | loss 4.361913 (+0.07z)| norm 0.3084 (-0.82z)| lr 6.00e-04 | 4155.60 ms | 32.5% bf16 MFU | 126069 tok/s step 1036/19560 | loss 4.338464 (-0.45z)| norm 0.3101 (-0.81z)| lr 6.00e-04 | 4142.24 ms | 32.6% bf16 MFU | 126094 tok/s step 1037/19560 | loss 4.268327 (-2.01z)| norm 0.3242 (-0.75z)| lr 6.00e-04 | 4162.85 ms | 32.4% bf16 MFU | 126087 tok/s step 1038/19560 | loss 4.309513 (-1.06z)| norm 0.3855 (-0.53z)| lr 6.00e-04 | 4155.40 ms | 32.5% bf16 MFU | 126091 tok/s step 1039/19560 | loss 4.366868 (+0.26z)| norm 0.4830 (-0.18z)| lr 6.00e-04 | 4159.11 ms | 32.5% bf16 MFU | 126089 tok/s step 1040/19560 | loss 4.354258 (-0.02z)| norm 0.6109 (+0.27z)| lr 6.00e-04 | 4160.50 ms | 32.5% bf16 MFU | 126086 tok/s step 1041/19560 | loss 4.318675 (-0.83z)| norm 0.6315 (+0.34z)| lr 6.00e-04 | 4157.44 ms | 32.5% bf16 MFU | 126087 tok/s step 1042/19560 | loss 4.266089 (-2.04z)| norm 0.6432 (+0.38z)| lr 6.00e-04 | 4158.99 ms | 32.5% bf16 MFU | 126086 tok/s step 1043/19560 | loss 4.341546 (-0.25z)| norm 0.6209 (+0.30z)| lr 6.00e-04 | 4157.11 ms | 32.5% bf16 MFU | 126087 tok/s step 1044/19560 | loss 4.232248 (-2.78z)| norm 0.4859 (-0.18z)| lr 6.00e-04 | 4146.89 ms | 32.6% bf16 MFU | 126104 tok/s step 1045/19560 | loss 4.299061 (-1.20z)| norm 0.4270 (-0.38z)| lr 6.00e-04 | 4161.11 ms | 32.4% bf16 MFU | 126099 tok/s step 1046/19560 | loss 4.299004 (-1.20z)| norm 0.3823 (-0.54z)| lr 6.00e-04 | 4257.42 ms | 31.7% bf16 MFU | 125951 tok/s step 1047/19560 | loss 4.312184 (-0.86z)| norm 0.3678 (-0.58z)| lr 
6.00e-04 | 4182.54 ms | 32.3% bf16 MFU | 125921 tok/s step 1048/19560 | loss 4.277516 (-1.69z)| norm 0.3736 (-0.56z)| lr 5.99e-04 | 4165.75 ms | 32.4% bf16 MFU | 125918 tok/s step 1049/19560 | loss 4.306657 (-0.96z)| norm 0.3655 (-0.59z)| lr 5.99e-04 | 4155.27 ms | 32.5% bf16 MFU | 125931 tok/s step 1050/19560 | loss 4.300330 (-1.10z)| norm 0.3549 (-0.62z)| lr 5.99e-04 | 4168.78 ms | 32.4% bf16 MFU | 125923 tok/s step 1051/19560 | loss 4.230321 (-2.71z)| norm 0.3540 (-0.62z)| lr 5.99e-04 | 4149.11 ms | 32.5% bf16 MFU | 125945 tok/s step 1052/19560 | loss 4.239444 (-2.42z)| norm 0.3190 (-0.74z)| lr 5.99e-04 | 4162.23 ms | 32.4% bf16 MFU | 125945 tok/s step 1053/19560 | loss 4.293702 (-1.13z)| norm 0.3146 (-0.75z)| lr 5.99e-04 | 4201.53 ms | 32.1% bf16 MFU | 125887 tok/s step 1054/19560 | loss 4.183838 (-3.49z)| norm 0.7084 (+0.64z)| lr 5.99e-04 | 4151.91 ms | 32.5% bf16 MFU | 125907 tok/s step 1055/19560 | loss 4.315451 (-0.55z)| norm 0.3882 (-0.49z)| lr 5.99e-04 | 4143.77 ms | 32.6% bf16 MFU | 125938 tok/s step 1056/19560 | loss 4.221774 (-2.56z)| norm 0.4692 (-0.20z)| lr 5.99e-04 | 4153.67 ms | 32.5% bf16 MFU | 125952 tok/s step 1057/19560 | loss 4.244912 (-2.00z)| norm 0.4929 (-0.12z)| lr 5.99e-04 | 4153.14 ms | 32.5% bf16 MFU | 125966 tok/s step 1058/19560 | loss 4.390756 (+1.13z)| norm 0.4462 (-0.28z)| lr 5.99e-04 | 4148.70 ms | 32.5% bf16 MFU | 125987 tok/s step 1059/19560 | loss 4.281883 (-1.19z)| norm 0.4356 (-0.32z)| lr 5.99e-04 | 4174.85 ms | 32.3% bf16 MFU | 125967 tok/s step 1060/19560 | loss 4.262926 (-1.56z)| norm 0.4527 (-0.26z)| lr 5.99e-04 | 4201.59 ms | 32.1% bf16 MFU | 125907 tok/s step 1061/19560 | loss 4.254180 (-1.72z)| norm 0.5099 (-0.06z)| lr 5.99e-04 | 4178.52 ms | 32.3% bf16 MFU | 125886 tok/s step 1062/19560 | loss 4.263987 (-1.49z)| norm 0.4848 (-0.15z)| lr 5.99e-04 | 4156.46 ms | 32.5% bf16 MFU | 125898 tok/s step 1063/19560 | loss 4.232977 (-2.08z)| norm 0.4781 (-0.18z)| lr 5.99e-04 | 4162.67 ms | 32.4% bf16 MFU | 125901 tok/s step 
1064/19560 | loss 4.260969 (-1.48z)| norm 0.4161 (-0.39z)| lr 5.99e-04 | 4171.03 ms | 32.4% bf16 MFU | 125891 tok/s step 1065/19560 | loss 4.281371 (-1.04z)| norm 0.3983 (-0.45z)| lr 5.99e-04 | 4215.77 ms | 32.0% bf16 MFU | 125814 tok/s step 1066/19560 | loss 4.266944 (-1.31z)| norm 0.4192 (-0.38z)| lr 5.99e-04 | 4160.61 ms | 32.5% bf16 MFU | 125824 tok/s step 1067/19560 | loss 4.310589 (-0.41z)| norm 0.3888 (-0.48z)| lr 5.99e-04 | 4157.99 ms | 32.5% bf16 MFU | 125838 tok/s step 1068/19560 | loss 4.261951 (-1.39z)| norm 0.3767 (-0.52z)| lr 5.99e-04 | 4160.71 ms | 32.5% bf16 MFU | 125846 tok/s step 1069/19560 | loss 4.387349 (+1.18z)| norm 0.4207 (-0.36z)| lr 5.99e-04 | 4149.88 ms | 32.5% bf16 MFU | 125871 tok/s step 1070/19560 | loss 4.252552 (-1.55z)| norm 0.4761 (-0.16z)| lr 5.99e-04 | 4151.94 ms | 32.5% bf16 MFU | 125891 tok/s step 1071/19560 | loss 4.311353 (-0.35z)| norm 0.4693 (-0.18z)| lr 5.99e-04 | 4175.63 ms | 32.3% bf16 MFU | 125874 tok/s step 1072/19560 | loss 4.241026 (-1.74z)| norm 0.4855 (-0.13z)| lr 5.99e-04 | 4168.10 ms | 32.4% bf16 MFU | 125870 tok/s step 1073/19560 | loss 4.373951 (+0.93z)| norm 0.4495 (-0.25z)| lr 5.99e-04 | 4173.34 ms | 32.4% bf16 MFU | 125858 tok/s step 1074/19560 | loss 4.345436 (+0.36z)| norm 0.4654 (-0.19z)| lr 5.99e-04 | 4172.98 ms | 32.4% bf16 MFU | 125847 tok/s step 1075/19560 | loss 4.244393 (-1.64z)| norm 0.5074 (-0.05z)| lr 5.99e-04 | 4154.77 ms | 32.5% bf16 MFU | 125864 tok/s step 1076/19560 | loss 4.345016 (+0.38z)| norm 0.5105 (-0.03z)| lr 5.99e-04 | 4149.29 ms | 32.5% bf16 MFU | 125889 tok/s step 1077/19560 | loss 4.287313 (-0.77z)| norm 0.4681 (-0.18z)| lr 5.99e-04 | 4149.91 ms | 32.5% bf16 MFU | 125911 tok/s step 1078/19560 | loss 4.295326 (-0.60z)| norm 0.5118 (-0.03z)| lr 5.99e-04 | 4154.28 ms | 32.5% bf16 MFU | 125926 tok/s step 1079/19560 | loss 4.262974 (-1.27z)| norm 0.5777 (+0.20z)| lr 5.99e-04 | 4162.36 ms | 32.4% bf16 MFU | 125927 tok/s step 1080/19560 | loss 4.340506 (+0.33z)| norm 0.6278 (+0.37z)| lr 
5.99e-04 | 4142.77 ms | 32.6% bf16 MFU | 125959 tok/s step 1081/19560 | loss 4.323197 (-0.03z)| norm 0.6343 (+0.39z)| lr 5.99e-04 | 4160.02 ms | 32.5% bf16 MFU | 125962 tok/s step 1082/19560 | loss 4.308226 (-0.33z)| norm 0.4403 (-0.29z)| lr 5.99e-04 | 4164.04 ms | 32.4% bf16 MFU | 125960 tok/s step 1083/19560 | loss 4.344406 (+0.43z)| norm 0.5084 (-0.04z)| lr 5.99e-04 | 4146.92 ms | 32.6% bf16 MFU | 125983 tok/s step 1084/19560 | loss 4.302060 (-0.45z)| norm 0.4402 (-0.28z)| lr 5.99e-04 | 4150.47 ms | 32.5% bf16 MFU | 126000 tok/s step 1085/19560 | loss 4.300982 (-0.46z)| norm 0.4273 (-0.32z)| lr 5.99e-04 | 4152.80 ms | 32.5% bf16 MFU | 126012 tok/s step 1086/19560 | loss 4.275191 (-0.98z)| norm 0.4166 (-0.36z)| lr 5.99e-04 | 4163.30 ms | 32.4% bf16 MFU | 126008 tok/s step 1087/19560 | loss 4.276183 (-0.94z)| norm 0.3956 (-0.43z)| lr 5.99e-04 | 4169.35 ms | 32.4% bf16 MFU | 125995 tok/s step 1088/19560 | loss 4.225406 (-1.95z)| norm 0.3909 (-0.45z)| lr 5.99e-04 | 4148.18 ms | 32.5% bf16 MFU | 126015 tok/s step 1089/19560 | loss 4.292347 (-0.57z)| norm 0.3937 (-0.44z)| lr 5.99e-04 | 4160.58 ms | 32.5% bf16 MFU | 126015 tok/s step 1090/19560 | loss 4.263203 (-1.15z)| norm 0.3466 (-0.61z)| lr 5.99e-04 | 4168.41 ms | 32.4% bf16 MFU | 126003 tok/s step 1091/19560 | loss 4.305612 (-0.27z)| norm 0.3291 (-0.66z)| lr 5.99e-04 | 4150.48 ms | 32.5% bf16 MFU | 126019 tok/s step 1092/19560 | loss 4.236430 (-1.68z)| norm 0.3260 (-0.67z)| lr 5.99e-04 | 4155.55 ms | 32.5% bf16 MFU | 126026 tok/s step 1093/19560 | loss 4.325904 (+0.18z)| norm 0.3426 (-0.60z)| lr 5.99e-04 | 4150.19 ms | 32.5% bf16 MFU | 126041 tok/s step 1094/19560 | loss 4.247007 (-1.44z)| norm 0.3164 (-0.69z)| lr 5.99e-04 | 4149.24 ms | 32.5% bf16 MFU | 126057 tok/s step 1095/19560 | loss 4.276564 (-0.81z)| norm 0.3373 (-0.61z)| lr 5.99e-04 | 4159.08 ms | 32.5% bf16 MFU | 126057 tok/s step 1096/19560 | loss 4.269342 (-0.95z)| norm 0.3711 (-0.49z)| lr 5.99e-04 | 4158.96 ms | 32.5% bf16 MFU | 126057 tok/s step 
1097/19560 | loss 4.258439 (-1.16z)| norm 0.3888 (-0.42z)| lr 5.99e-04 | 4163.92 ms | 32.4% bf16 MFU | 126050 tok/s step 1098/19560 | loss 4.173323 (-2.80z)| norm 0.4274 (-0.28z)| lr 5.99e-04 | 4156.16 ms | 32.5% bf16 MFU | 126055 tok/s step 1099/19560 | loss 4.249163 (-1.26z)| norm 0.4436 (-0.22z)| lr 5.99e-04 | 4158.07 ms | 32.5% bf16 MFU | 126057 tok/s step 1100/19560 | loss 4.283993 (-0.57z)| norm 0.4388 (-0.24z)| lr 5.99e-04 | 4143.46 ms | 32.6% bf16 MFU | 126081 tok/s step 1101/19560 | loss 4.208506 (-2.03z)| norm 0.3779 (-0.45z)| lr 5.99e-04 | 4153.99 ms | 32.5% bf16 MFU | 126087 tok/s step 1102/19560 | loss 4.290565 (-0.40z)| norm 0.3542 (-0.53z)| lr 5.99e-04 | 4160.13 ms | 32.5% bf16 MFU | 126084 tok/s step 1103/19560 | loss 4.236670 (-1.44z)| norm 0.3712 (-0.47z)| lr 5.99e-04 | 4158.06 ms | 32.5% bf16 MFU | 126084 tok/s step 1104/19560 | loss 4.255854 (-1.05z)| norm 0.3755 (-0.45z)| lr 5.99e-04 | 4156.82 ms | 32.5% bf16 MFU | 126087 tok/s step 1105/19560 | loss 4.256534 (-1.02z)| norm 0.3950 (-0.39z)| lr 5.99e-04 | 4155.90 ms | 32.5% bf16 MFU | 126090 tok/s step 1106/19560 | loss 4.246343 (-1.20z)| norm 0.4729 (-0.12z)| lr 5.99e-04 | 4152.87 ms | 32.5% bf16 MFU | 126098 tok/s step 1107/19560 | loss 4.249442 (-1.13z)| norm 0.5118 (+0.01z)| lr 5.99e-04 | 4152.75 ms | 32.5% bf16 MFU | 126106 tok/s step 1108/19560 | loss 4.192074 (-2.19z)| norm 0.4591 (-0.17z)| lr 5.99e-04 | 4157.25 ms | 32.5% bf16 MFU | 126106 tok/s step 1109/19560 | loss 4.365904 (+1.16z)| norm 0.4933 (-0.05z)| lr 5.99e-04 | 4160.37 ms | 32.5% bf16 MFU | 126102 tok/s step 1110/19560 | loss 4.196320 (-2.06z)| norm 0.5786 (+0.24z)| lr 5.99e-04 | 4164.10 ms | 32.4% bf16 MFU | 126092 tok/s step 1111/19560 | loss 4.329447 (+0.46z)| norm 0.6846 (+0.60z)| lr 5.99e-04 | 4164.08 ms | 32.4% bf16 MFU | 126083 tok/s step 1112/19560 | loss 4.271304 (-0.64z)| norm 0.6024 (+0.31z)| lr 5.99e-04 | 4161.51 ms | 32.4% bf16 MFU | 126078 tok/s step 1113/19560 | loss 4.296364 (-0.17z)| norm 0.5653 (+0.18z)| lr 
5.99e-04 | 4156.24 ms | 32.5% bf16 MFU | 126081 tok/s step 1114/19560 | loss 4.290073 (-0.29z)| norm 0.5242 (+0.03z)| lr 5.99e-04 | 4153.76 ms | 32.5% bf16 MFU | 126088 tok/s step 1115/19560 | loss 4.206369 (-1.85z)| norm 0.4376 (-0.27z)| lr 5.99e-04 | 4156.12 ms | 32.5% bf16 MFU | 126091 tok/s step 1116/19560 | loss 4.239775 (-1.20z)| norm 0.3599 (-0.54z)| lr 5.99e-04 | 4156.10 ms | 32.5% bf16 MFU | 126094 tok/s step 1117/19560 | loss 4.225384 (-1.45z)| norm 0.3548 (-0.55z)| lr 5.99e-04 | 4157.06 ms | 32.5% bf16 MFU | 126095 tok/s step 1118/19560 | loss 4.259481 (-0.80z)| norm 0.3319 (-0.62z)| lr 5.99e-04 | 4151.62 ms | 32.5% bf16 MFU | 126105 tok/s step 1119/19560 | loss 4.260368 (-0.77z)| norm 0.3358 (-0.60z)| lr 5.99e-04 | 4162.77 ms | 32.4% bf16 MFU | 126097 tok/s step 1120/19560 | loss 4.263307 (-0.70z)| norm 0.3028 (-0.70z)| lr 5.99e-04 | 4175.51 ms | 32.3% bf16 MFU | 126070 tok/s step 1121/19560 | loss 4.212974 (-1.61z)| norm 0.3210 (-0.64z)| lr 5.99e-04 | 4161.32 ms | 32.4% bf16 MFU | 126066 tok/s step 1122/19560 | loss 4.267073 (-0.59z)| norm 0.3440 (-0.55z)| lr 5.99e-04 | 4152.61 ms | 32.5% bf16 MFU | 126076 tok/s step 1123/19560 | loss 4.250837 (-0.89z)| norm 0.3533 (-0.51z)| lr 5.99e-04 | 4144.05 ms | 32.6% bf16 MFU | 126098 tok/s step 1124/19560 | loss 4.230604 (-1.24z)| norm 0.3436 (-0.54z)| lr 5.99e-04 | 4161.67 ms | 32.4% bf16 MFU | 126092 tok/s step 1125/19560 | loss 4.181985 (-2.10z)| norm 0.3752 (-0.42z)| lr 5.99e-04 | 4185.20 ms | 32.3% bf16 MFU | 126051 tok/s step 1126/19560 | loss 4.217503 (-1.43z)| norm 0.3742 (-0.42z)| lr 5.99e-04 | 4166.58 ms | 32.4% bf16 MFU | 126040 tok/s step 1127/19560 | loss 4.215940 (-1.44z)| norm 0.3850 (-0.38z)| lr 5.99e-04 | 4153.90 ms | 32.5% bf16 MFU | 126049 tok/s step 1128/19560 | loss 4.281063 (-0.23z)| norm 0.4714 (-0.08z)| lr 5.99e-04 | 4159.00 ms | 32.5% bf16 MFU | 126049 tok/s step 1129/19560 | loss 4.210401 (-1.51z)| norm 0.4633 (-0.10z)| lr 5.99e-04 | 4173.40 ms | 32.4% bf16 MFU | 126028 tok/s step 
1130/19560 | loss 4.323111 (+0.56z)| norm 0.5117 (+0.06z)| lr 5.99e-04 | 4161.61 ms | 32.4% bf16 MFU | 126026 tok/s step 1131/19560 | loss 4.307950 (+0.29z)| norm 0.6448 (+0.52z)| lr 5.99e-04 | 4171.89 ms | 32.4% bf16 MFU | 126008 tok/s step 1132/19560 | loss 4.237323 (-1.00z)| norm 0.6856 (+0.65z)| lr 5.99e-04 | 4153.28 ms | 32.5% bf16 MFU | 126019 tok/s step 1133/19560 | loss 4.283893 (-0.13z)| norm 0.5255 (+0.10z)| lr 5.99e-04 | 4153.68 ms | 32.5% bf16 MFU | 126030 tok/s step 1134/19560 | loss 4.274386 (-0.30z)| norm 0.5144 (+0.06z)| lr 5.99e-04 | 4169.74 ms | 32.4% bf16 MFU | 126015 tok/s step 1135/19560 | loss 4.215639 (-1.38z)| norm 0.3565 (-0.48z)| lr 5.99e-04 | 4153.74 ms | 32.5% bf16 MFU | 126025 tok/s step 1136/19560 | loss 4.250201 (-0.72z)| norm 0.3560 (-0.47z)| lr 5.99e-04 | 4147.68 ms | 32.6% bf16 MFU | 126044 tok/s step 1137/19560 | loss 4.342307 (+0.98z)| norm 0.3495 (-0.49z)| lr 5.99e-04 | 4157.43 ms | 32.5% bf16 MFU | 126047 tok/s step 1138/19560 | loss 4.188517 (-1.85z)| norm 0.3473 (-0.49z)| lr 5.99e-04 | 4155.42 ms | 32.5% bf16 MFU | 126054 tok/s step 1139/19560 | loss 4.193650 (-1.72z)| norm 0.3258 (-0.56z)| lr 5.99e-04 | 4152.68 ms | 32.5% bf16 MFU | 126064 tok/s step 1140/19560 | loss 4.225200 (-1.12z)| norm 0.3375 (-0.51z)| lr 5.99e-04 | 4159.59 ms | 32.5% bf16 MFU | 126062 tok/s step 1141/19560 | loss 4.278828 (-0.13z)| norm 0.3440 (-0.48z)| lr 5.99e-04 | 4149.52 ms | 32.5% bf16 MFU | 126077 tok/s step 1142/19560 | loss 4.192771 (-1.68z)| norm 0.3612 (-0.41z)| lr 5.99e-04 | 4147.28 ms | 32.6% bf16 MFU | 126094 tok/s step 1143/19560 | loss 4.226683 (-1.04z)| norm 0.3611 (-0.40z)| lr 5.99e-04 | 4149.03 ms | 32.5% bf16 MFU | 126107 tok/s step 1144/19560 | loss 4.290435 (+0.14z)| norm 0.4047 (-0.23z)| lr 5.99e-04 | 4156.97 ms | 32.5% bf16 MFU | 126108 tok/s step 1145/19560 | loss 4.243974 (-0.71z)| norm 0.4105 (-0.30z)| lr 5.99e-04 | 4148.22 ms | 32.5% bf16 MFU | 126122 tok/s step 1146/19560 | loss 4.215074 (-1.27z)| norm 0.4173 (-0.22z)| lr 
5.99e-04 | 4149.60 ms | 32.5% bf16 MFU | 126133 tok/s step 1147/19560 | loss 4.336260 (+1.13z)| norm 0.3918 (-0.44z)| lr 5.99e-04 | 4159.20 ms | 32.5% bf16 MFU | 126129 tok/s step 1148/19560 | loss 4.213471 (-1.31z)| norm 0.3741 (-0.63z)| lr 5.99e-04 | 4167.65 ms | 32.4% bf16 MFU | 126113 tok/s step 1149/19560 | loss 4.233368 (-0.89z)| norm 0.3960 (-0.38z)| lr 5.99e-04 | 4164.97 ms | 32.4% bf16 MFU | 126101 tok/s step 1150/19560 | loss 4.192377 (-1.71z)| norm 0.4298 (-0.01z)| lr 5.99e-04 | 4143.37 ms | 32.6% bf16 MFU | 126123 tok/s step 1151/19560 | loss 4.315876 (+0.83z)| norm 0.4334 (+0.03z)| lr 5.99e-04 | 4152.05 ms | 32.5% bf16 MFU | 126131 tok/s step 1152/19560 | loss 4.183195 (-1.86z)| norm 0.4189 (-0.12z)| lr 5.99e-04 | 4151.59 ms | 32.5% bf16 MFU | 126138 tok/s step 1153/19560 | loss 4.193790 (-1.61z)| norm 0.4581 (+0.30z)| lr 5.99e-04 | 4157.95 ms | 32.5% bf16 MFU | 126136 tok/s step 1154/19560 | loss 4.220124 (-1.06z)| norm 0.4419 (+0.12z)| lr 5.99e-04 | 4162.95 ms | 32.4% bf16 MFU | 126126 tok/s step 1155/19560 | loss 4.178442 (-1.87z)| norm 0.4631 (+0.36z)| lr 5.99e-04 | 4163.93 ms | 32.4% bf16 MFU | 126116 tok/s step 1156/19560 | loss 4.222424 (-0.97z)| norm 0.4338 (+0.03z)| lr 5.99e-04 | 4150.96 ms | 32.5% bf16 MFU | 126125 tok/s step 1157/19560 | loss 4.232497 (-0.76z)| norm 0.3811 (-0.54z)| lr 5.99e-04 | 4156.86 ms | 32.5% bf16 MFU | 126125 tok/s step 1158/19560 | loss 4.216930 (-1.05z)| norm 0.3697 (-0.66z)| lr 5.99e-04 | 4149.01 ms | 32.5% bf16 MFU | 126137 tok/s step 1159/19560 | loss 4.232981 (-0.72z)| norm 0.4008 (-0.32z)| lr 5.99e-04 | 4158.41 ms | 32.5% bf16 MFU | 126134 tok/s step 1160/19560 | loss 4.142522 (-2.46z)| norm 0.4214 (-0.10z)| lr 5.99e-04 | 4162.52 ms | 32.4% bf16 MFU | 126125 tok/s step 1161/19560 | loss 4.192179 (-1.46z)| norm 0.4297 (-0.01z)| lr 5.99e-04 | 4157.49 ms | 32.5% bf16 MFU | 126124 tok/s step 1162/19560 | loss 4.163322 (-1.98z)| norm 0.4154 (-0.18z)| lr 5.99e-04 | 4176.65 ms | 32.3% bf16 MFU | 126094 tok/s step 
1163/19560 | loss 4.208247 (-1.10z)| norm 0.3411 (-1.01z)| lr 5.99e-04 | 4159.50 ms | 32.5% bf16 MFU | 126092 tok/s step 1164/19560 | loss 4.163204 (-1.94z)| norm 0.3345 (-1.09z)| lr 5.99e-04 | 4156.37 ms | 32.5% bf16 MFU | 126094 tok/s step 1165/19560 | loss 4.209502 (-1.02z)| norm 0.3343 (-1.10z)| lr 5.99e-04 | 4154.01 ms | 32.5% bf16 MFU | 126100 tok/s step 1166/19560 | loss 4.191047 (-1.36z)| norm 0.3334 (-1.10z)| lr 5.99e-04 | 4147.62 ms | 32.6% bf16 MFU | 126116 tok/s step 1167/19560 | loss 4.202676 (-1.12z)| norm 0.3810 (-0.56z)| lr 5.99e-04 | 4157.55 ms | 32.5% bf16 MFU | 126115 tok/s step 1168/19560 | loss 4.175856 (-1.63z)| norm 0.3865 (-0.48z)| lr 5.99e-04 | 4157.56 ms | 32.5% bf16 MFU | 126115 tok/s step 1169/19560 | loss 4.246909 (-0.22z)| norm 0.4153 (-0.14z)| lr 5.99e-04 | 4140.16 ms | 32.6% bf16 MFU | 126141 tok/s step 1170/19560 | loss 4.257369 (-0.01z)| norm 0.5347 (+1.27z)| lr 5.99e-04 | 4146.21 ms | 32.6% bf16 MFU | 126156 tok/s step 1171/19560 | loss 4.184413 (-1.43z)| norm 0.5961 (+2.01z)| lr 5.99e-04 | 4202.32 ms | 32.1% bf16 MFU | 126086 tok/s step 1172/19560 | loss 4.167616 (-1.74z)| norm 0.5224 (+1.13z)| lr 5.99e-04 | 4170.47 ms | 32.4% bf16 MFU | 126068 tok/s step 1173/19560 | loss 4.177494 (-1.51z)| norm 0.5291 (+1.19z)| lr 5.99e-04 | 4148.47 ms | 32.5% bf16 MFU | 126083 tok/s step 1174/19560 | loss 4.221114 (-0.65z)| norm 0.4592 (+0.36z)| lr 5.99e-04 | 4158.44 ms | 32.5% bf16 MFU | 126083 tok/s step 1175/19560 | loss 4.230139 (-0.47z)| norm 0.4041 (-0.29z)| lr 5.99e-04 | 4186.93 ms | 32.2% bf16 MFU | 126040 tok/s step 1176/19560 | loss 4.223759 (-0.58z)| norm 0.3680 (-0.71z)| lr 5.99e-04 | 4146.67 ms | 32.6% bf16 MFU | 126060 tok/s step 1177/19560 | loss 4.188406 (-1.25z)| norm 0.3384 (-1.05z)| lr 5.99e-04 | 4154.58 ms | 32.5% bf16 MFU | 126067 tok/s step 1178/19560 | loss 4.231542 (-0.40z)| norm 0.3402 (-1.03z)| lr 5.99e-04 | 4154.62 ms | 32.5% bf16 MFU | 126073 tok/s step 1179/19560 | loss 4.169176 (-1.60z)| norm 0.3479 (-0.94z)| lr 
5.99e-04 | 4152.56 ms | 32.5% bf16 MFU | 126082 tok/s step 1180/19560 | loss 4.182320 (-1.33z)| norm 0.3313 (-1.14z)| lr 5.99e-04 | 4152.86 ms | 32.5% bf16 MFU | 126090 tok/s step 1181/19560 | loss 4.254360 (+0.07z)| norm 0.3656 (-0.74z)| lr 5.99e-04 | 4146.06 ms | 32.6% bf16 MFU | 126109 tok/s step 1182/19560 | loss 4.222141 (-0.56z)| norm 0.3988 (-0.34z)| lr 5.99e-04 | 4155.39 ms | 32.5% bf16 MFU | 126112 tok/s step 1183/19560 | loss 4.217433 (-0.64z)| norm 0.4014 (-0.31z)| lr 5.99e-04 | 4159.34 ms | 32.5% bf16 MFU | 126109 tok/s step 1184/19560 | loss 4.141148 (-2.09z)| norm 0.3448 (-0.99z)| lr 5.99e-04 | 4145.50 ms | 32.6% bf16 MFU | 126127 tok/s step 1185/19560 | loss 4.191503 (-1.11z)| norm 0.3341 (-1.10z)| lr 5.99e-04 | 4160.17 ms | 32.5% bf16 MFU | 126122 tok/s step 1186/19560 | loss 4.289211 (+0.80z)| norm 0.3785 (-0.55z)| lr 5.99e-04 | 4151.41 ms | 32.5% bf16 MFU | 126130 tok/s step 1187/19560 | loss 4.176647 (-1.39z)| norm 0.3482 (-0.91z)| lr 5.99e-04 | 4150.99 ms | 32.5% bf16 MFU | 126139 tok/s step 1188/19560 | loss 4.228955 (-0.36z)| norm 0.3626 (-0.73z)| lr 5.99e-04 | 4150.95 ms | 32.5% bf16 MFU | 126147 tok/s step 1189/19560 | loss 4.200817 (-0.90z)| norm 0.3701 (-0.62z)| lr 5.99e-04 | 4148.49 ms | 32.5% bf16 MFU | 126159 tok/s step 1190/19560 | loss 4.161013 (-1.64z)| norm 0.3699 (-0.61z)| lr 5.99e-04 | 4148.07 ms | 32.5% bf16 MFU | 126171 tok/s step 1191/19560 | loss 4.111918 (-2.50z)| norm 0.3635 (-0.68z)| lr 5.99e-04 | 4158.66 ms | 32.5% bf16 MFU | 126166 tok/s step 1192/19560 | loss 4.247333 (+0.04z)| norm 0.3830 (-0.44z)| lr 5.99e-04 | 4169.92 ms | 32.4% bf16 MFU | 126144 tok/s step 1193/19560 | loss 4.186617 (-1.08z)| norm 0.3903 (-0.35z)| lr 5.99e-04 | 4145.17 ms | 32.6% bf16 MFU | 126161 tok/s step 1194/19560 | loss 4.145500 (-1.81z)| norm 0.3433 (-0.91z)| lr 5.99e-04 | 4158.57 ms | 32.5% bf16 MFU | 126156 tok/s step 1195/19560 | loss 4.166458 (-1.40z)| norm 0.3632 (-0.67z)| lr 5.99e-04 | 4152.78 ms | 32.5% bf16 MFU | 126161 tok/s step 
1196/19560 | loss 4.151194 (-1.65z)| norm 0.4104 (-0.10z)| lr 5.99e-04 | 4165.31 ms | 32.4% bf16 MFU | 126147 tok/s step 1197/19560 | loss 4.228669 (-0.22z)| norm 0.4461 (+0.34z)| lr 5.99e-04 | 4149.64 ms | 32.5% bf16 MFU | 126157 tok/s step 1198/19560 | loss 4.154518 (-1.58z)| norm 0.4386 (+0.25z)| lr 5.99e-04 | 4197.30 ms | 32.2% bf16 MFU | 126094 tok/s step 1199/19560 | loss 4.209315 (-0.55z)| norm 0.4614 (+0.53z)| lr 5.99e-04 | 4166.48 ms | 32.4% bf16 MFU | 126081 tok/s step 1200/19560 | loss 4.222900 (-0.29z)| norm 0.4231 (+0.07z)| lr 5.99e-04 | 4161.83 ms | 32.4% bf16 MFU | 126076 tok/s step 1201/19560 | loss 4.325585 (+1.66z)| norm 0.3898 (-0.34z)| lr 5.99e-04 | 4166.85 ms | 32.4% bf16 MFU | 126063 tok/s step 1202/19560 | loss 4.169475 (-1.29z)| norm 0.4362 (+0.24z)| lr 5.99e-04 | 4168.63 ms | 32.4% bf16 MFU | 126049 tok/s step 1203/19560 | loss 4.238844 (+0.04z)| norm 0.4103 (-0.07z)| lr 5.99e-04 | 4155.17 ms | 32.5% bf16 MFU | 126055 tok/s step 1204/19560 | loss 4.214625 (-0.41z)| norm 0.3938 (-0.27z)| lr 5.99e-04 | 4151.85 ms | 32.5% bf16 MFU | 126066 tok/s step 1205/19560 | loss 4.163965 (-1.38z)| norm 0.3720 (-0.53z)| lr 5.99e-04 | 4162.66 ms | 32.4% bf16 MFU | 126060 tok/s step 1206/19560 | loss 4.150769 (-1.61z)| norm 0.4025 (-0.14z)| lr 5.99e-04 | 4149.67 ms | 32.5% bf16 MFU | 126075 tok/s step 1207/19560 | loss 4.219890 (-0.26z)| norm 0.4088 (-0.05z)| lr 5.99e-04 | 4143.06 ms | 32.6% bf16 MFU | 126098 tok/s step 1208/19560 | loss 4.138553 (-1.81z)| norm 0.3967 (-0.18z)| lr 5.99e-04 | 4156.21 ms | 32.5% bf16 MFU | 126101 tok/s step 1209/19560 | loss 4.178210 (-1.03z)| norm 0.3612 (-0.64z)| lr 5.99e-04 | 4163.29 ms | 32.4% bf16 MFU | 126092 tok/s step 1210/19560 | loss 4.137240 (-1.80z)| norm 0.3623 (-0.61z)| lr 5.99e-04 | 4153.57 ms | 32.5% bf16 MFU | 126099 tok/s step 1211/19560 | loss 4.199777 (-0.57z)| norm 0.3588 (-0.65z)| lr 5.99e-04 | 4146.23 ms | 32.6% bf16 MFU | 126116 tok/s step 1212/19560 | loss 4.177480 (-1.00z)| norm 0.3594 (-0.63z)| lr 
5.99e-04 | 4181.96 ms | 32.3% bf16 MFU | 126079 tok/s step 1213/19560 | loss 4.106661 (-2.35z)| norm 0.4343 (+0.38z)| lr 5.99e-04 | 4153.55 ms | 32.5% bf16 MFU | 126086 tok/s step 1214/19560 | loss 4.223362 (-0.04z)| norm 0.4469 (+0.55z)| lr 5.99e-04 | 4164.97 ms | 32.4% bf16 MFU | 126076 tok/s step 1215/19560 | loss 4.176313 (-0.96z)| norm 0.4194 (+0.18z)| lr 5.99e-04 | 4171.34 ms | 32.4% bf16 MFU | 126057 tok/s step 1216/19560 | loss 4.262867 (+0.75z)| norm 0.4702 (+0.85z)| lr 5.99e-04 | 6119.60 ms | 22.1% bf16 MFU | 124037 tok/s step 1217/19560 | loss 4.164916 (-1.17z)| norm 0.4784 (+0.95z)| lr 5.99e-04 | 4158.02 ms | 32.5% bf16 MFU | 124140 tok/s step 1218/19560 | loss 4.202124 (-0.42z)| norm 0.4506 (+0.57z)| lr 5.99e-04 | 4162.21 ms | 32.4% bf16 MFU | 124231 tok/s step 1219/19560 | loss 4.184434 (-0.76z)| norm 0.4565 (+0.63z)| lr 5.99e-04 | 4144.85 ms | 32.6% bf16 MFU | 124344 tok/s step 1220/19560 | loss 4.105036 (-2.28z)| norm 0.5227 (+1.50z)| lr 5.99e-04 | 4167.41 ms | 32.4% bf16 MFU | 124417 tok/s step 1221/19560 | loss 4.137187 (-1.63z)| norm 0.5163 (+1.39z)| lr 5.99e-04 | 4159.82 ms | 32.5% bf16 MFU | 124498 tok/s step 1222/19560 | loss 4.321037 (+1.96z)| norm 0.4107 (-0.03z)| lr 5.99e-04 | 4160.90 ms | 32.4% bf16 MFU | 124574 tok/s step 1223/19560 | loss 4.177799 (-0.82z)| norm 0.4366 (+0.31z)| lr 5.99e-04 | 4157.53 ms | 32.5% bf16 MFU | 124650 tok/s step 1224/19560 | loss 4.247835 (+0.55z)| norm 0.4016 (-0.17z)| lr 5.99e-04 | 4149.69 ms | 32.5% bf16 MFU | 124735 tok/s step 1225/19560 | loss 4.180982 (-0.74z)| norm 0.4216 (+0.10z)| lr 5.99e-04 | 4158.72 ms | 32.5% bf16 MFU | 124802 tok/s step 1226/19560 | loss 4.154353 (-1.26z)| norm 0.3762 (-0.51z)| lr 5.99e-04 | 4154.57 ms | 32.5% bf16 MFU | 124871 tok/s step 1227/19560 | loss 4.153351 (-1.25z)| norm 0.3510 (-0.84z)| lr 5.99e-04 | 4158.69 ms | 32.5% bf16 MFU | 124931 tok/s step 1228/19560 | loss 4.188533 (-0.56z)| norm 0.3623 (-0.68z)| lr 5.99e-04 | 4209.54 ms | 32.1% bf16 MFU | 124912 tok/s step 
1229/19560 | loss 4.139368 (-1.50z)| norm 0.3480 (-0.87z)| lr 5.99e-04 | 4464.09 ms | 30.2% bf16 MFU | 124539 tok/s step 1230/19560 | loss 4.200839 (-0.30z)| norm 0.3288 (-1.12z)| lr 5.99e-04 | 4190.21 ms | 32.2% bf16 MFU | 124568 tok/s step 1231/19560 | loss 4.072086 (-2.70z)| norm 0.3298 (-1.10z)| lr 5.99e-04 | 4141.35 ms | 32.6% bf16 MFU | 124669 tok/s step 1232/19560 | loss 4.107648 (-1.98z)| norm 0.3898 (-0.30z)| lr 5.99e-04 | 4144.21 ms | 32.6% bf16 MFU | 124762 tok/s step 1233/19560 | loss 4.142797 (-1.30z)| norm 0.3870 (-0.34z)| lr 5.99e-04 | 4147.45 ms | 32.6% bf16 MFU | 124844 tok/s step 1234/19560 | loss 4.246367 (+0.63z)| norm 0.3427 (-0.92z)| lr 5.99e-04 | 4144.55 ms | 32.6% bf16 MFU | 124927 tok/s step 1235/19560 | loss 4.125064 (-1.60z)| norm 0.3499 (-0.81z)| lr 5.99e-04 | 4145.81 ms | 32.6% bf16 MFU | 125004 tok/s step 1236/19560 | loss 4.143192 (-1.25z)| norm 0.3581 (-0.68z)| lr 5.99e-04 | 4161.13 ms | 32.4% bf16 MFU | 125053 tok/s step 1237/19560 | loss 4.338872 (+2.37z)| norm 0.3946 (-0.18z)| lr 5.99e-04 | 4151.21 ms | 32.5% bf16 MFU | 125115 tok/s step 1238/19560 | loss 4.127550 (-1.53z)| norm 0.4044 (-0.03z)| lr 5.99e-04 | 4206.93 ms | 32.1% bf16 MFU | 125091 tok/s step 1239/19560 | loss 4.181747 (-0.52z)| norm 0.4390 (+0.50z)| lr 5.99e-04 | 4150.34 ms | 32.5% bf16 MFU | 125153 tok/s step 1240/19560 | loss 4.162191 (-0.87z)| norm 0.4838 (+1.21z)| lr 5.99e-04 | 4153.25 ms | 32.5% bf16 MFU | 125207 tok/s step 1241/19560 | loss 4.164372 (-0.82z)| norm 0.5433 (+2.12z)| lr 5.99e-04 | 4163.49 ms | 32.4% bf16 MFU | 125243 tok/s step 1242/19560 | loss 4.134038 (-1.37z)| norm 0.4223 (+0.30z)| lr 5.99e-04 | 4153.03 ms | 32.5% bf16 MFU | 125293 tok/s step 1243/19560 | loss 4.113739 (-1.72z)| norm 0.3499 (-0.81z)| lr 5.99e-04 | 4147.54 ms | 32.6% bf16 MFU | 125348 tok/s step 1244/19560 | loss 4.179411 (-0.48z)| norm 0.3368 (-1.00z)| lr 5.99e-04 | 4154.25 ms | 32.5% bf16 MFU | 125391 tok/s step 1245/19560 | loss 4.145994 (-1.09z)| norm 0.3095 (-1.41z)| lr 
5.99e-04 | 4141.22 ms | 32.6% bf16 MFU | 125452 tok/s step 1246/19560 | loss 4.152015 (-0.96z)| norm 0.3329 (-1.05z)| lr 5.99e-04 | 4161.79 ms | 32.4% bf16 MFU | 125478 tok/s step 1247/19560 | loss 4.147424 (-1.03z)| norm 0.3685 (-0.51z)| lr 5.99e-04 | 4154.92 ms | 32.5% bf16 MFU | 125513 tok/s step 1248/19560 | loss 4.102256 (-1.84z)| norm 0.3209 (-1.25z)| lr 5.99e-04 | 4145.29 ms | 32.6% bf16 MFU | 125562 tok/s step 1249/19560 | loss 4.209902 (+0.16z)| norm 0.3050 (-1.49z)| lr 5.99e-04 | 4151.70 ms | 32.5% bf16 MFU | 125598 tok/s step 1250/19560 | loss 4.136126 (-1.19z)| norm 0.3413 (-0.93z)| lr 5.99e-04 | 4166.42 ms | 32.4% bf16 MFU | 125610 tok/s val loss 4.133231 evaluating HellaSwag: 0/1256 evaluating HellaSwag: 10/1256 evaluating HellaSwag: 20/1256 evaluating HellaSwag: 30/1256 evaluating HellaSwag: 40/1256 evaluating HellaSwag: 50/1256 evaluating HellaSwag: 60/1256 evaluating HellaSwag: 70/1256 evaluating HellaSwag: 80/1256 evaluating HellaSwag: 90/1256 evaluating HellaSwag: 100/1256 evaluating HellaSwag: 110/1256 evaluating HellaSwag: 120/1256 evaluating HellaSwag: 130/1256 evaluating HellaSwag: 140/1256 evaluating HellaSwag: 150/1256 evaluating HellaSwag: 160/1256 evaluating HellaSwag: 170/1256 evaluating HellaSwag: 180/1256 evaluating HellaSwag: 190/1256 evaluating HellaSwag: 200/1256 evaluating HellaSwag: 210/1256 evaluating HellaSwag: 220/1256 evaluating HellaSwag: 230/1256 evaluating HellaSwag: 240/1256 evaluating HellaSwag: 250/1256 evaluating HellaSwag: 260/1256 evaluating HellaSwag: 270/1256 evaluating HellaSwag: 280/1256 evaluating HellaSwag: 290/1256 evaluating HellaSwag: 300/1256 evaluating HellaSwag: 310/1256 evaluating HellaSwag: 320/1256 evaluating HellaSwag: 330/1256 evaluating HellaSwag: 340/1256 evaluating HellaSwag: 350/1256 evaluating HellaSwag: 360/1256 evaluating HellaSwag: 370/1256 evaluating HellaSwag: 380/1256 evaluating HellaSwag: 390/1256 evaluating HellaSwag: 400/1256 evaluating HellaSwag: 410/1256 evaluating HellaSwag: 420/1256 
evaluating HellaSwag: 430/1256 evaluating HellaSwag: 440/1256 evaluating HellaSwag: 450/1256 evaluating HellaSwag: 460/1256 evaluating HellaSwag: 470/1256 evaluating HellaSwag: 480/1256 evaluating HellaSwag: 490/1256 evaluating HellaSwag: 500/1256 evaluating HellaSwag: 510/1256 evaluating HellaSwag: 520/1256 evaluating HellaSwag: 530/1256 evaluating HellaSwag: 540/1256 evaluating HellaSwag: 550/1256 evaluating HellaSwag: 560/1256 evaluating HellaSwag: 570/1256 evaluating HellaSwag: 580/1256 evaluating HellaSwag: 590/1256 evaluating HellaSwag: 600/1256 evaluating HellaSwag: 610/1256 evaluating HellaSwag: 620/1256 evaluating HellaSwag: 630/1256 evaluating HellaSwag: 640/1256 evaluating HellaSwag: 650/1256 evaluating HellaSwag: 660/1256 evaluating HellaSwag: 670/1256 evaluating HellaSwag: 680/1256 evaluating HellaSwag: 690/1256 evaluating HellaSwag: 700/1256 evaluating HellaSwag: 710/1256 evaluating HellaSwag: 720/1256 evaluating HellaSwag: 730/1256 evaluating HellaSwag: 740/1256 evaluating HellaSwag: 750/1256 evaluating HellaSwag: 760/1256 evaluating HellaSwag: 770/1256 evaluating HellaSwag: 780/1256 evaluating HellaSwag: 790/1256 evaluating HellaSwag: 800/1256 evaluating HellaSwag: 810/1256 evaluating HellaSwag: 820/1256 evaluating HellaSwag: 830/1256 evaluating HellaSwag: 840/1256 evaluating HellaSwag: 850/1256 evaluating HellaSwag: 860/1256 evaluating HellaSwag: 870/1256 evaluating HellaSwag: 880/1256 evaluating HellaSwag: 890/1256 evaluating HellaSwag: 900/1256 evaluating HellaSwag: 910/1256 evaluating HellaSwag: 920/1256 evaluating HellaSwag: 930/1256 evaluating HellaSwag: 940/1256 evaluating HellaSwag: 950/1256 evaluating HellaSwag: 960/1256 evaluating HellaSwag: 970/1256 evaluating HellaSwag: 980/1256 evaluating HellaSwag: 990/1256 evaluating HellaSwag: 1000/1256 evaluating HellaSwag: 1010/1256 evaluating HellaSwag: 1020/1256 evaluating HellaSwag: 1030/1256 evaluating HellaSwag: 1040/1256 evaluating HellaSwag: 1050/1256 evaluating HellaSwag: 1060/1256 
evaluating HellaSwag: 1070/1256 evaluating HellaSwag: 1080/1256 evaluating HellaSwag: 1090/1256 evaluating HellaSwag: 1100/1256 evaluating HellaSwag: 1110/1256 evaluating HellaSwag: 1120/1256 evaluating HellaSwag: 1130/1256 evaluating HellaSwag: 1140/1256 evaluating HellaSwag: 1150/1256 evaluating HellaSwag: 1160/1256 evaluating HellaSwag: 1170/1256 evaluating HellaSwag: 1180/1256 evaluating HellaSwag: 1190/1256 evaluating HellaSwag: 1200/1256 evaluating HellaSwag: 1210/1256 evaluating HellaSwag: 1220/1256 evaluating HellaSwag: 1230/1256 evaluating HellaSwag: 1240/1256 evaluating HellaSwag: 1250/1256 HellaSwag: 2616/10042 = 0.260506 step 1251/19560 | loss 4.138877 (-1.12z)| norm 0.3272 (-1.14z)| lr 5.99e-04 | 4179.27 ms | 32.3% bf16 MFU | 125602 tok/s step 1252/19560 | loss 4.098310 (-1.83z)| norm 0.3178 (-1.28z)| lr 5.99e-04 | 4155.32 ms | 32.5% bf16 MFU | 125630 tok/s step 1253/19560 | loss 4.094078 (-1.87z)| norm 0.3197 (-1.23z)| lr 5.99e-04 | 4175.17 ms | 32.3% bf16 MFU | 125627 tok/s step 1254/19560 | loss 4.183186 (-0.26z)| norm 0.3564 (-0.67z)| lr 5.99e-04 | 4152.00 ms | 32.5% bf16 MFU | 125660 tok/s step 1255/19560 | loss 4.099162 (-1.74z)| norm 0.3870 (-0.21z)| lr 5.99e-04 | 4157.77 ms | 32.5% bf16 MFU | 125682 tok/s step 1256/19560 | loss 4.081024 (-2.02z)| norm 0.4364 (+0.54z)| lr 5.99e-04 | 4145.33 ms | 32.6% bf16 MFU | 125721 tok/s step 1257/19560 | loss 4.095538 (-1.73z)| norm 0.4552 (+0.83z)| lr 5.99e-04 | 4156.80 ms | 32.5% bf16 MFU | 125742 tok/s step 1258/19560 | loss 4.083988 (-1.91z)| norm 0.4492 (+0.75z)| lr 5.99e-04 | 4147.03 ms | 32.6% bf16 MFU | 125776 tok/s step 1259/19560 | loss 4.047972 (-2.48z)| norm 0.4919 (+1.49z)| lr 5.99e-04 | 4150.53 ms | 32.5% bf16 MFU | 125803 tok/s step 1260/19560 | loss 4.110475 (-1.37z)| norm 0.4631 (+1.15z)| lr 5.99e-04 | 4159.09 ms | 32.5% bf16 MFU | 125816 tok/s step 1261/19560 | loss 4.071917 (-2.00z)| norm 0.3706 (-0.46z)| lr 5.99e-04 | 4141.55 ms | 32.6% bf16 MFU | 125855 tok/s step 1262/19560 | loss 
4.178889 (-0.14z)| norm 0.3538 (-0.75z)| lr 5.99e-04 | 4158.33 ms | 32.5% bf16 MFU | 125866 tok/s step 1263/19560 | loss 4.102501 (-1.45z)| norm 0.3581 (-0.67z)| lr 5.99e-04 | 4157.78 ms | 32.5% bf16 MFU | 125877 tok/s step 1264/19560 | loss 4.105432 (-1.37z)| norm 0.4018 (+0.12z)| lr 5.99e-04 | 4143.92 ms | 32.6% bf16 MFU | 125910 tok/s step 1265/19560 | loss 4.116680 (-1.18z)| norm 0.3819 (-0.25z)| lr 5.99e-04 | 4155.48 ms | 32.5% bf16 MFU | 125922 tok/s step 1266/19560 | loss 4.071416 (-1.93z)| norm 0.3335 (-1.13z)| lr 5.99e-04 | 4145.92 ms | 32.6% bf16 MFU | 125949 tok/s step 1267/19560 | loss 4.101641 (-1.38z)| norm 0.3318 (-1.16z)| lr 5.99e-04 | 4150.61 ms | 32.5% bf16 MFU | 125968 tok/s step 1268/19560 | loss 4.125016 (-0.96z)| norm 0.3633 (-0.59z)| lr 5.99e-04 | 4142.32 ms | 32.6% bf16 MFU | 125998 tok/s step 1269/19560 | loss 4.062887 (-2.00z)| norm 0.3820 (-0.26z)| lr 5.99e-04 | 4169.88 ms | 32.4% bf16 MFU | 125984 tok/s step 1270/19560 | loss 4.106933 (-1.22z)| norm 0.3357 (-1.10z)| lr 5.99e-04 | 4153.50 ms | 32.5% bf16 MFU | 125997 tok/s step 1271/19560 | loss 4.090157 (-1.48z)| norm 0.3264 (-1.26z)| lr 5.99e-04 | 4158.26 ms | 32.5% bf16 MFU | 126001 tok/s step 1272/19560 | loss 4.045341 (-2.20z)| norm 0.3709 (-0.44z)| lr 5.99e-04 | 4144.98 ms | 32.6% bf16 MFU | 126025 tok/s step 1273/19560 | loss 4.168099 (-0.11z)| norm 0.4476 (+0.94z)| lr 5.99e-04 | 4146.34 ms | 32.6% bf16 MFU | 126046 tok/s step 1274/19560 | loss 4.157125 (-0.29z)| norm 0.5318 (+2.40z)| lr 5.99e-04 | 4151.83 ms | 32.5% bf16 MFU | 126058 tok/s step 1275/19560 | loss 4.084178 (-1.53z)| norm 0.4801 (+1.46z)| lr 5.99e-04 | 4147.94 ms | 32.6% bf16 MFU | 126075 tok/s step 1276/19560 | loss 4.181385 (+0.16z)| norm 0.4796 (+1.42z)| lr 5.99e-04 | 4168.81 ms | 32.4% bf16 MFU | 126059 tok/s step 1277/19560 | loss 4.092202 (-1.37z)| norm 0.4580 (+1.04z)| lr 5.99e-04 | 4158.89 ms | 32.5% bf16 MFU | 126060 tok/s step 1278/19560 | loss 4.130548 (-0.69z)| norm 0.4177 (+0.34z)| lr 5.99e-04 | 4141.57 
ms | 32.6% bf16 MFU | 126086 tok/s step 1279/19560 | loss 4.116803 (-0.93z)| norm 0.3921 (-0.10z)| lr 5.99e-04 | 4140.75 ms | 32.6% bf16 MFU | 126113 tok/s step 1280/19560 | loss 4.055504 (-1.97z)| norm 0.4019 (+0.07z)| lr 5.99e-04 | 4156.56 ms | 32.5% bf16 MFU | 126114 tok/s step 1281/19560 | loss 4.126050 (-0.72z)| norm 0.3482 (-0.85z)| lr 5.99e-04 | 4155.32 ms | 32.5% bf16 MFU | 126117 tok/s step 1282/19560 | loss 4.149294 (-0.31z)| norm 0.4215 (+0.43z)| lr 5.99e-04 | 4149.44 ms | 32.5% bf16 MFU | 126129 tok/s step 1283/19560 | loss 4.136240 (-0.53z)| norm 0.3614 (-0.61z)| lr 5.99e-04 | 4165.53 ms | 32.4% bf16 MFU | 126115 tok/s step 1284/19560 | loss 4.071015 (-1.64z)| norm 0.3588 (-0.64z)| lr 5.99e-04 | 4158.36 ms | 32.5% bf16 MFU | 126114 tok/s step 1285/19560 | loss 4.096064 (-1.19z)| norm 0.3957 (+0.00z)| lr 5.99e-04 | 4161.84 ms | 32.4% bf16 MFU | 126107 tok/s step 1286/19560 | loss 4.088334 (-1.30z)| norm 0.3758 (-0.35z)| lr 5.99e-04 | 4154.81 ms | 32.5% bf16 MFU | 126111 tok/s step 1287/19560 | loss 4.050595 (-1.91z)| norm 0.3571 (-0.67z)| lr 5.99e-04 | 4140.93 ms | 32.6% bf16 MFU | 126136 tok/s step 1288/19560 | loss 4.060297 (-1.71z)| norm 0.3602 (-0.61z)| lr 5.99e-04 | 4146.13 ms | 32.6% bf16 MFU | 126152 tok/s step 1289/19560 | loss 4.085924 (-1.26z)| norm 0.3554 (-0.68z)| lr 5.99e-04 | 4144.51 ms | 32.6% bf16 MFU | 126169 tok/s step 1290/19560 | loss 4.134763 (-0.43z)| norm 0.4020 (+0.14z)| lr 5.99e-04 | 4162.10 ms | 32.4% bf16 MFU | 126159 tok/s step 1291/19560 | loss 4.096067 (-1.06z)| norm 0.4677 (+1.27z)| lr 5.99e-04 | 4160.58 ms | 32.5% bf16 MFU | 126152 tok/s step 1292/19560 | loss 4.162080 (+0.05z)| norm 0.4288 (+0.58z)| lr 5.99e-04 | 4143.83 ms | 32.6% bf16 MFU | 126170 tok/s step 1293/19560 | loss 4.059417 (-1.65z)| norm 0.4340 (+0.66z)| lr 5.99e-04 | 4154.54 ms | 32.5% bf16 MFU | 126171 tok/s step 1294/19560 | loss 4.132478 (-0.42z)| norm 0.4141 (+0.30z)| lr 5.99e-04 | 4144.93 ms | 32.6% bf16 MFU | 126187 tok/s step 1295/19560 | loss 
4.112230 (-0.75z)| norm 0.3563 (-0.72z)| lr 5.99e-04 | 4147.15 ms | 32.6% bf16 MFU | 126199 tok/s step 1296/19560 | loss 4.119065 (-0.63z)| norm 0.3762 (-0.36z)| lr 5.99e-04 | 4159.54 ms | 32.5% bf16 MFU | 126191 tok/s step 1297/19560 | loss 4.043270 (-1.85z)| norm 0.3517 (-0.79z)| lr 5.99e-04 | 4152.33 ms | 32.5% bf16 MFU | 126195 tok/s step 1298/19560 | loss 4.136993 (-0.29z)| norm 0.3406 (-0.97z)| lr 5.99e-04 | 4159.75 ms | 32.5% bf16 MFU | 126187 tok/s step 1299/19560 | loss 4.106909 (-0.78z)| norm 0.3434 (-0.94z)| lr 5.99e-04 | 4220.02 ms | 32.0% bf16 MFU | 126090 tok/s step 1300/19560 | loss 4.117524 (-0.59z)| norm 0.3238 (-1.30z)| lr 5.99e-04 | 4163.75 ms | 32.4% bf16 MFU | 126081 tok/s step 1301/19560 | loss 4.080853 (-1.19z)| norm 0.3051 (-1.65z)| lr 5.99e-04 | 4153.13 ms | 32.5% bf16 MFU | 126089 tok/s step 1302/19560 | loss 4.093196 (-0.97z)| norm 0.3052 (-1.62z)| lr 5.98e-04 | 4221.20 ms | 32.0% bf16 MFU | 125995 tok/s step 1303/19560 | loss 4.071684 (-1.31z)| norm 0.2865 (-1.94z)| lr 5.98e-04 | 4146.36 ms | 32.6% bf16 MFU | 126017 tok/s step 1304/19560 | loss 4.061430 (-1.45z)| norm 0.3147 (-1.38z)| lr 5.98e-04 | 4152.61 ms | 32.5% bf16 MFU | 126029 tok/s step 1305/19560 | loss 4.207664 (+0.98z)| norm 0.3495 (-0.72z)| lr 5.98e-04 | 4143.63 ms | 32.6% bf16 MFU | 126054 tok/s step 1306/19560 | loss 4.103129 (-0.75z)| norm 0.3700 (-0.33z)| lr 5.98e-04 | 4157.67 ms | 32.5% bf16 MFU | 126056 tok/s step 1307/19560 | loss 4.130148 (-0.29z)| norm 0.3456 (-0.80z)| lr 5.98e-04 | 4150.73 ms | 32.5% bf16 MFU | 126069 tok/s step 1308/19560 | loss 4.106179 (-0.68z)| norm 0.3601 (-0.53z)| lr 5.98e-04 | 4157.14 ms | 32.5% bf16 MFU | 126072 tok/s step 1309/19560 | loss 4.096457 (-0.83z)| norm 0.3623 (-0.49z)| lr 5.98e-04 | 4153.13 ms | 32.5% bf16 MFU | 126080 tok/s step 1310/19560 | loss 4.061640 (-1.40z)| norm 0.3944 (+0.13z)| lr 5.98e-04 | 4154.28 ms | 32.5% bf16 MFU | 126086 tok/s step 1311/19560 | loss 4.060778 (-1.39z)| norm 0.4251 (+0.71z)| lr 5.98e-04 | 4157.96 
ms | 32.5% bf16 MFU | 126087 tok/s step 1312/19560 | loss 4.100513 (-0.71z)| norm 0.4531 (+1.23z)| lr 5.98e-04 | 4147.06 ms | 32.6% bf16 MFU | 126103 tok/s step 1313/19560 | loss 4.077844 (-1.08z)| norm 0.4781 (+1.67z)| lr 5.98e-04 | 4152.54 ms | 32.5% bf16 MFU | 126111 tok/s step 1314/19560 | loss 4.048790 (-1.55z)| norm 0.4768 (+1.62z)| lr 5.98e-04 | 4151.76 ms | 32.5% bf16 MFU | 126120 tok/s step 1315/19560 | loss 4.098862 (-0.69z)| norm 0.4422 (+0.95z)| lr 5.98e-04 | 4211.40 ms | 32.1% bf16 MFU | 126038 tok/s step 1316/19560 | loss 4.114697 (-0.41z)| norm 0.3972 (+0.11z)| lr 5.98e-04 | 4151.43 ms | 32.5% bf16 MFU | 126051 tok/s step 1317/19560 | loss 4.068137 (-1.19z)| norm 0.4089 (+0.32z)| lr 5.98e-04 | 4147.49 ms | 32.6% bf16 MFU | 126069 tok/s step 1318/19560 | loss 4.086067 (-0.87z)| norm 0.3851 (-0.13z)| lr 5.98e-04 | 4149.27 ms | 32.5% bf16 MFU | 126083 tok/s step 1319/19560 | loss 4.077515 (-1.01z)| norm 0.3612 (-0.58z)| lr 5.98e-04 | 4140.73 ms | 32.6% bf16 MFU | 126110 tok/s step 1320/19560 | loss 4.100163 (-0.61z)| norm 0.3299 (-1.15z)| lr 5.98e-04 | 4154.37 ms | 32.5% bf16 MFU | 126115 tok/s step 1321/19560 | loss 4.071874 (-1.08z)| norm 0.3199 (-1.32z)| lr 5.98e-04 | 4144.27 ms | 32.6% bf16 MFU | 126134 tok/s step 1322/19560 | loss 4.102250 (-0.55z)| norm 0.3263 (-1.19z)| lr 5.98e-04 | 4305.71 ms | 31.4% bf16 MFU | 125916 tok/s step 1323/19560 | loss 4.083601 (-0.86z)| norm 0.3362 (-1.00z)| lr 5.98e-04 | 4225.54 ms | 32.0% bf16 MFU | 125824 tok/s step 1324/19560 | loss 4.118218 (-0.26z)| norm 0.3241 (-1.21z)| lr 5.98e-04 | 4154.44 ms | 32.5% bf16 MFU | 125843 tok/s step 1325/19560 | loss 4.062725 (-1.20z)| norm 0.3158 (-1.34z)| lr 5.98e-04 | 4150.00 ms | 32.5% bf16 MFU | 125867 tok/s step 1326/19560 | loss 4.083053 (-0.84z)| norm 0.3470 (-0.76z)| lr 5.98e-04 | 4154.78 ms | 32.5% bf16 MFU | 125883 tok/s step 1327/19560 | loss 4.054746 (-1.30z)| norm 0.3634 (-0.44z)| lr 5.98e-04 | 4151.08 ms | 32.5% bf16 MFU | 125904 tok/s step 1328/19560 | loss 
4.114219 (-0.27z)| norm 0.3456 (-0.76z)| lr 5.98e-04 | 4178.59 ms | 32.3% bf16 MFU | 125883 tok/s step 1329/19560 | loss 4.050230 (-1.40z)| norm 0.3439 (-0.78z)| lr 5.98e-04 | 4152.78 ms | 32.5% bf16 MFU | 125901 tok/s step 1330/19560 | loss 4.056049 (-1.27z)| norm 0.3973 (+0.21z)| lr 5.98e-04 | 4149.71 ms | 32.5% bf16 MFU | 125923 tok/s step 1331/19560 | loss 4.093928 (-0.58z)| norm 0.3963 (+0.19z)| lr 5.98e-04 | 4156.51 ms | 32.5% bf16 MFU | 125934 tok/s step 1332/19560 | loss 4.143180 (+0.34z)| norm 0.3418 (-0.81z)| lr 5.98e-04 | 4137.73 ms | 32.6% bf16 MFU | 125972 tok/s step 1333/19560 | loss 4.073628 (-0.93z)| norm 0.3883 (+0.05z)| lr 5.98e-04 | 4145.00 ms | 32.6% bf16 MFU | 125998 tok/s step 1334/19560 | loss 4.055507 (-1.25z)| norm 0.4068 (+0.39z)| lr 5.98e-04 | 4138.97 ms | 32.6% bf16 MFU | 126032 tok/s step 1335/19560 | loss 4.090287 (-0.60z)| norm 0.3772 (-0.15z)| lr 5.98e-04 | 4144.88 ms | 32.6% bf16 MFU | 126055 tok/s step 1336/19560 | loss 4.061556 (-1.11z)| norm 0.3266 (-1.07z)| lr 5.98e-04 | 4139.00 ms | 32.6% bf16 MFU | 126086 tok/s step 1337/19560 | loss 4.066380 (-1.01z)| norm 0.3066 (-1.42z)| lr 5.98e-04 | 4179.80 ms | 32.3% bf16 MFU | 126053 tok/s step 1338/19560 | loss 4.078762 (-0.77z)| norm 0.3285 (-1.01z)| lr 5.98e-04 | 4153.31 ms | 32.5% bf16 MFU | 126062 tok/s step 1339/19560 | loss 4.026655 (-1.70z)| norm 0.3377 (-0.84z)| lr 5.98e-04 | 4155.06 ms | 32.5% bf16 MFU | 126068 tok/s step 1340/19560 | loss 4.049459 (-1.26z)| norm 0.3496 (-0.62z)| lr 5.98e-04 | 4167.12 ms | 32.4% bf16 MFU | 126055 tok/s step 1341/19560 | loss 4.034089 (-1.52z)| norm 0.3853 (+0.03z)| lr 5.98e-04 | 4154.37 ms | 32.5% bf16 MFU | 126063 tok/s step 1342/19560 | loss 4.063208 (-0.98z)| norm 0.4084 (+0.46z)| lr 5.98e-04 | 4139.68 ms | 32.6% bf16 MFU | 126092 tok/s step 1343/19560 | loss 4.070957 (-0.82z)| norm 0.4219 (+0.70z)| lr 5.98e-04 | 4141.25 ms | 32.6% bf16 MFU | 126117 tok/s step 1344/19560 | loss 4.013413 (-1.88z)| norm 0.3876 (+0.09z)| lr 5.98e-04 | 4141.81 
ms | 32.6% bf16 MFU | 126141 tok/s step 1345/19560 | loss 4.011081 (-1.88z)| norm 0.3757 (-0.12z)| lr 5.98e-04 | 4144.01 ms | 32.6% bf16 MFU | 126160 tok/s step 1346/19560 | loss 4.025114 (-1.59z)| norm 0.3931 (+0.22z)| lr 5.98e-04 | 4141.79 ms | 32.6% bf16 MFU | 126181 tok/s step 1347/19560 | loss 4.084439 (-0.48z)| norm 0.4143 (+0.62z)| lr 5.98e-04 | 4164.31 ms | 32.4% bf16 MFU | 126167 tok/s step 1348/19560 | loss 3.989254 (-2.20z)| norm 0.4599 (+1.52z)| lr 5.98e-04 | 4147.58 ms | 32.6% bf16 MFU | 126179 tok/s step 1349/19560 | loss 4.070383 (-0.70z)| norm 0.4864 (+2.06z)| lr 5.98e-04 | 4140.03 ms | 32.6% bf16 MFU | 126202 tok/s step 1350/19560 | loss 4.004803 (-1.95z)| norm 0.3753 (-0.10z)| lr 5.98e-04 | 4145.73 ms | 32.6% bf16 MFU | 126215 tok/s step 1351/19560 | loss 4.135154 (+0.56z)| norm 0.3379 (-0.81z)| lr 5.98e-04 | 4140.19 ms | 32.6% bf16 MFU | 126236 tok/s step 1352/19560 | loss 4.100020 (-0.10z)| norm 0.3435 (-0.69z)| lr 5.98e-04 | 4167.54 ms | 32.4% bf16 MFU | 126214 tok/s step 1353/19560 | loss 4.084199 (-0.40z)| norm 0.3825 (+0.07z)| lr 5.98e-04 | 4150.08 ms | 32.5% bf16 MFU | 126220 tok/s step 1354/19560 | loss 4.045982 (-1.15z)| norm 0.4183 (+0.77z)| lr 5.98e-04 | 4142.85 ms | 32.6% bf16 MFU | 126237 tok/s step 1355/19560 | loss 4.058801 (-0.88z)| norm 0.3445 (-0.67z)| lr 5.98e-04 | 4158.44 ms | 32.5% bf16 MFU | 126229 tok/s step 1356/19560 | loss 4.065527 (-0.73z)| norm 0.3414 (-0.73z)| lr 5.98e-04 | 4233.75 ms | 31.9% bf16 MFU | 126109 tok/s step 1357/19560 | loss 4.094640 (-0.13z)| norm 0.3518 (-0.53z)| lr 5.98e-04 | 4137.08 ms | 32.6% bf16 MFU | 126140 tok/s step 1358/19560 | loss 4.047607 (-1.08z)| norm 0.3385 (-0.79z)| lr 5.98e-04 | 4209.20 ms | 32.1% bf16 MFU | 126061 tok/s step 1359/19560 | loss 4.110395 (+0.21z)| norm 0.3611 (-0.35z)| lr 5.98e-04 | 4291.12 ms | 31.5% bf16 MFU | 125867 tok/s step 1360/19560 | loss 4.060978 (-0.80z)| norm 0.4000 (+0.41z)| lr 5.98e-04 | 4272.21 ms | 31.6% bf16 MFU | 125710 tok/s step 1361/19560 | loss 
4.085377 (-0.29z)| norm 0.4058 (+0.52z)| lr 5.98e-04 | 4362.61 ms | 30.9% bf16 MFU | 125433 tok/s step 1362/19560 | loss 4.001692 (-2.02z)| norm 0.3878 (+0.16z)| lr 5.98e-04 | 4145.30 ms | 32.6% bf16 MFU | 125485 tok/s step 1363/19560 | loss 4.082251 (-0.32z)| norm 0.3254 (-1.05z)| lr 5.98e-04 | 4203.07 ms | 32.1% bf16 MFU | 125448 tok/s step 1364/19560 | loss 4.106166 (+0.20z)| norm 0.3278 (-1.00z)| lr 5.98e-04 | 4329.11 ms | 31.2% bf16 MFU | 125231 tok/s step 1365/19560 | loss 4.068453 (-0.62z)| norm 0.3563 (-0.44z)| lr 5.98e-04 | 4146.19 ms | 32.6% bf16 MFU | 125292 tok/s step 1366/19560 | loss 4.110487 (+0.38z)| norm 0.3613 (-0.34z)| lr 5.98e-04 | 4331.93 ms | 31.2% bf16 MFU | 125079 tok/s step 1367/19560 | loss 3.983100 (-2.59z)| norm 0.3284 (-0.96z)| lr 5.98e-04 | 4150.60 ms | 32.5% bf16 MFU | 125141 tok/s step 1368/19560 | loss 4.043264 (-1.16z)| norm 0.3150 (-1.21z)| lr 5.98e-04 | 4142.20 ms | 32.6% bf16 MFU | 125212 tok/s step 1369/19560 | loss 4.045465 (-1.09z)| norm 0.3851 (+0.20z)| lr 5.98e-04 | 4141.72 ms | 32.6% bf16 MFU | 125281 tok/s step 1370/19560 | loss 4.018004 (-1.71z)| norm 0.4274 (+1.07z)| lr 5.98e-04 | 4136.01 ms | 32.6% bf16 MFU | 125355 tok/s step 1371/19560 | loss 4.119315 (+0.68z)| norm 0.3913 (+0.32z)| lr 5.98e-04 | 4140.49 ms | 32.6% bf16 MFU | 125419 tok/s step 1372/19560 | loss 4.068673 (-0.50z)| norm 0.3909 (+0.31z)| lr 5.98e-04 | 4138.17 ms | 32.6% bf16 MFU | 125482 tok/s step 1373/19560 | loss 4.162537 (+1.74z)| norm 0.4005 (+0.49z)| lr 5.98e-04 | 4143.18 ms | 32.6% bf16 MFU | 125535 tok/s step 1374/19560 | loss 4.054788 (-0.82z)| norm 0.4213 (+0.91z)| lr 5.98e-04 | 4161.69 ms | 32.4% bf16 MFU | 125558 tok/s step 1375/19560 | loss 4.087248 (-0.03z)| norm 0.4370 (+1.22z)| lr 5.98e-04 | 4345.78 ms | 31.1% bf16 MFU | 125312 tok/s step 1376/19560 | loss 3.986905 (-2.38z)| norm 0.4190 (+0.83z)| lr 5.98e-04 | 4148.96 ms | 32.5% bf16 MFU | 125365 tok/s step 1377/19560 | loss 4.085859 (-0.02z)| norm 0.4004 (+0.44z)| lr 5.98e-04 | 4293.70 
ms | 31.4% bf16 MFU | 125202 tok/s step 1378/19560 | loss 4.022181 (-1.55z)| norm 0.4128 (+0.68z)| lr 5.98e-04 | 4181.32 ms | 32.3% bf16 MFU | 125211 tok/s step 1379/19560 | loss 4.057325 (-0.68z)| norm 0.3674 (-0.27z)| lr 5.98e-04 | 4164.48 ms | 32.4% bf16 MFU | 125245 tok/s step 1380/19560 | loss 4.153331 (+1.65z)| norm 0.3355 (-0.95z)| lr 5.98e-04 | 4153.95 ms | 32.5% bf16 MFU | 125294 tok/s step 1381/19560 | loss 4.099140 (+0.33z)| norm 0.3097 (-1.49z)| lr 5.98e-04 | 4138.27 ms | 32.6% bf16 MFU | 125364 tok/s step 1382/19560 | loss 4.007174 (-1.88z)| norm 0.3169 (-1.32z)| lr 5.98e-04 | 4914.51 ms | 27.5% bf16 MFU | 124429 tok/s step 1383/19560 | loss 4.090917 (+0.17z)| norm 0.3205 (-1.23z)| lr 5.98e-04 | 4155.81 ms | 32.5% bf16 MFU | 124516 tok/s step 1384/19560 | loss 4.068784 (-0.37z)| norm 0.3286 (-1.04z)| lr 5.98e-04 | 4137.97 ms | 32.6% bf16 MFU | 124625 tok/s step 1385/19560 | loss 4.050194 (-0.82z)| norm 0.3455 (-0.68z)| lr 5.98e-04 | 4135.24 ms | 32.7% bf16 MFU | 124733 tok/s step 1386/19560 | loss 4.045074 (-0.93z)| norm 0.3378 (-0.83z)| lr 5.98e-04 | 4136.66 ms | 32.6% bf16 MFU | 124834 tok/s step 1387/19560 | loss 4.061503 (-0.53z)| norm 0.3268 (-1.05z)| lr 5.98e-04 | 4149.87 ms | 32.5% bf16 MFU | 124909 tok/s step 1388/19560 | loss 4.070098 (-0.32z)| norm 0.3146 (-1.30z)| lr 5.98e-04 | 4137.88 ms | 32.6% bf16 MFU | 124999 tok/s step 1389/19560 | loss 4.054092 (-0.70z)| norm 0.2990 (-1.61z)| lr 5.98e-04 | 4138.73 ms | 32.6% bf16 MFU | 125083 tok/s step 1390/19560 | loss 4.061243 (-0.52z)| norm 0.3063 (-1.43z)| lr 5.98e-04 | 4149.49 ms | 32.5% bf16 MFU | 125146 tok/s step 1391/19560 | loss 4.077216 (-0.12z)| norm 0.3458 (-0.59z)| lr 5.98e-04 | 4253.89 ms | 31.7% bf16 MFU | 125051 tok/s step 1392/19560 | loss 4.106256 (+0.61z)| norm 0.3661 (-0.15z)| lr 5.98e-04 | 4147.13 ms | 32.6% bf16 MFU | 125120 tok/s step 1393/19560 | loss 4.039793 (-1.03z)| norm 0.4392 (+1.38z)| lr 5.98e-04 | 4137.81 ms | 32.6% bf16 MFU | 125199 tok/s step 1394/19560 | loss 
4.077171 (-0.10z)| norm 0.5270 (+3.09z)| lr 5.98e-04 | 4147.64 ms | 32.6% bf16 MFU | 125259 tok/s step 1395/19560 | loss 4.147923 (+1.63z)| norm 0.4828 (+2.14z)| lr 5.98e-04 | 4160.90 ms | 32.4% bf16 MFU | 125297 tok/s step 1396/19560 | loss 4.075005 (-0.16z)| norm 0.3828 (+0.12z)| lr 5.98e-04 | 4141.70 ms | 32.6% bf16 MFU | 125361 tok/s step 1397/19560 | loss 4.040260 (-1.01z)| norm 0.3531 (-0.47z)| lr 5.98e-04 | 4156.45 ms | 32.5% bf16 MFU | 125400 tok/s step 1398/19560 | loss 4.019169 (-1.50z)| norm 0.3527 (-0.48z)| lr 5.98e-04 | 4139.91 ms | 32.6% bf16 MFU | 125462 tok/s step 1399/19560 | loss 4.083131 (+0.07z)| norm 0.3598 (-0.34z)| lr 5.98e-04 | 4153.92 ms | 32.5% bf16 MFU | 125500 tok/s step 1400/19560 | loss 4.080296 (-0.01z)| norm 0.3176 (-1.18z)| lr 5.98e-04 | 4156.09 ms | 32.5% bf16 MFU | 125532 tok/s step 1401/19560 | loss 4.056652 (-0.58z)| norm 0.3276 (-0.96z)| lr 5.98e-04 | 4139.45 ms | 32.6% bf16 MFU | 125588 tok/s step 1402/19560 | loss 4.060439 (-0.47z)| norm 0.3367 (-0.78z)| lr 5.98e-04 | 4141.70 ms | 32.6% bf16 MFU | 125638 tok/s step 1403/19560 | loss 4.041421 (-0.94z)| norm 0.3140 (-1.24z)| lr 5.98e-04 | 4147.62 ms | 32.6% bf16 MFU | 125677 tok/s step 1404/19560 | loss 4.030243 (-1.22z)| norm 0.3521 (-0.42z)| lr 5.98e-04 | 4138.10 ms | 32.6% bf16 MFU | 125728 tok/s step 1405/19560 | loss 4.065372 (-0.31z)| norm 0.3873 (+0.36z)| lr 5.98e-04 | 4143.13 ms | 32.6% bf16 MFU | 125769 tok/s step 1406/19560 | loss 4.018030 (-1.50z)| norm 0.3646 (-0.13z)| lr 5.98e-04 | 4148.37 ms | 32.5% bf16 MFU | 125799 tok/s step 1407/19560 | loss 4.065060 (-0.28z)| norm 0.3908 (+0.44z)| lr 5.98e-04 | 4156.04 ms | 32.5% bf16 MFU | 125817 tok/s step 1408/19560 | loss 4.035456 (-1.04z)| norm 0.4309 (+1.32z)| lr 5.98e-04 | 4150.07 ms | 32.5% bf16 MFU | 125843 tok/s step 1409/19560 | loss 4.091438 (+0.41z)| norm 0.4170 (+1.00z)| lr 5.98e-04 | 4152.19 ms | 32.5% bf16 MFU | 125864 tok/s step 1410/19560 | loss 4.037520 (-0.98z)| norm 0.3825 (+0.25z)| lr 5.98e-04 | 4139.31 
ms | 32.6% bf16 MFU | 125904 tok/s step 1411/19560 | loss 4.027384 (-1.22z)| norm 0.3742 (+0.07z)| lr 5.98e-04 | 4149.78 ms | 32.5% bf16 MFU | 125926 tok/s step 1412/19560 | loss 4.052374 (-0.56z)| norm 0.3575 (-0.30z)| lr 5.98e-04 | 4151.94 ms | 32.5% bf16 MFU | 125943 tok/s step 1413/19560 | loss 4.103661 (+0.79z)| norm 0.3702 (-0.02z)| lr 5.98e-04 | 4135.09 ms | 32.7% bf16 MFU | 125986 tok/s step 1414/19560 | loss 4.049316 (-0.63z)| norm 0.3918 (+0.46z)| lr 5.98e-04 | 4243.56 ms | 31.8% bf16 MFU | 125864 tok/s step 1415/19560 | loss 4.113701 (+1.04z)| norm 0.3565 (-0.32z)| lr 5.98e-04 | 4141.95 ms | 32.6% bf16 MFU | 125900 tok/s step 1416/19560 | loss 4.023464 (-1.31z)| norm 0.2969 (-1.60z)| lr 5.98e-04 | 4313.19 ms | 31.3% bf16 MFU | 125682 tok/s step 1417/19560 | loss 3.986540 (-2.21z)| norm 0.2961 (-1.59z)| lr 5.98e-04 | 4145.39 ms | 32.6% bf16 MFU | 125722 tok/s step 1418/19560 | loss 4.046720 (-0.66z)| norm 0.3045 (-1.39z)| lr 5.98e-04 | 4142.53 ms | 32.6% bf16 MFU | 125764 tok/s step 1419/19560 | loss 3.986756 (-2.14z)| norm 0.3330 (-0.77z)| lr 5.98e-04 | 4283.90 ms | 31.5% bf16 MFU | 125595 tok/s step 1420/19560 | loss 4.029109 (-1.06z)| norm 0.3279 (-0.87z)| lr 5.98e-04 | 4160.13 ms | 32.5% bf16 MFU | 125617 tok/s step 1421/19560 | loss 4.053893 (-0.42z)| norm 0.3020 (-1.41z)| lr 5.98e-04 | 4143.65 ms | 32.6% bf16 MFU | 125662 tok/s step 1422/19560 | loss 3.980757 (-2.25z)| norm 0.3020 (-1.38z)| lr 5.98e-04 | 4137.97 ms | 32.6% bf16 MFU | 125714 tok/s step 1423/19560 | loss 4.031435 (-0.94z)| norm 0.3214 (-0.95z)| lr 5.98e-04 | 4175.47 ms | 32.3% bf16 MFU | 125707 tok/s step 1424/19560 | loss 4.016802 (-1.30z)| norm 0.3306 (-0.74z)| lr 5.98e-04 | 4138.04 ms | 32.6% bf16 MFU | 125756 tok/s step 1425/19560 | loss 4.006907 (-1.53z)| norm 0.3641 (-0.02z)| lr 5.98e-04 | 4151.76 ms | 32.5% bf16 MFU | 125783 tok/s step 1426/19560 | loss 4.006615 (-1.52z)| norm 0.3763 (+0.24z)| lr 5.98e-04 | 4147.19 ms | 32.6% bf16 MFU | 125814 tok/s step 1427/19560 | loss 
4.067390 (+0.03z)| norm 0.3992 (+0.72z)| lr 5.98e-04 | 4146.50 ms | 32.6% bf16 MFU | 125846 tok/s step 1428/19560 | loss 4.028804 (-0.93z)| norm 0.3735 (+0.16z)| lr 5.98e-04 | 4134.49 ms | 32.7% bf16 MFU | 125894 tok/s step 1429/19560 | loss 3.996804 (-1.72z)| norm 0.3143 (-1.13z)| lr 5.98e-04 | 4138.60 ms | 32.6% bf16 MFU | 125933 tok/s step 1430/19560 | loss 4.048080 (-0.41z)| norm 0.3554 (-0.24z)| lr 5.98e-04 | 4152.65 ms | 32.5% bf16 MFU | 125949 tok/s step 1431/19560 | loss 4.025758 (-0.96z)| norm 0.3976 (+0.67z)| lr 5.98e-04 | 4136.61 ms | 32.6% bf16 MFU | 125989 tok/s step 1432/19560 | loss 4.002123 (-1.54z)| norm 0.3583 (-0.21z)| lr 5.98e-04 | 4133.25 ms | 32.7% bf16 MFU | 126032 tok/s step 1433/19560 | loss 3.989578 (-1.88z)| norm 0.4052 (+0.82z)| lr 5.98e-04 | 4137.30 ms | 32.6% bf16 MFU | 126066 tok/s step 1434/19560 | loss 4.027915 (-0.87z)| norm 0.3673 (-0.02z)| lr 5.98e-04 | 4280.77 ms | 31.5% bf16 MFU | 125887 tok/s step 1435/19560 | loss 4.037521 (-0.61z)| norm 0.3088 (-1.31z)| lr 5.98e-04 | 4153.23 ms | 32.5% bf16 MFU | 125904 tok/s step 1436/19560 | loss 3.987756 (-1.88z)| norm 0.3193 (-1.06z)| lr 5.98e-04 | 4141.90 ms | 32.6% bf16 MFU | 125938 tok/s step 1437/19560 | loss 4.014803 (-1.15z)| norm 0.3585 (-0.20z)| lr 5.98e-04 | 4138.02 ms | 32.6% bf16 MFU | 125976 tok/s step 1438/19560 | loss 4.003682 (-1.42z)| norm 0.4719 (+2.24z)| lr 5.98e-04 | 4175.12 ms | 32.3% bf16 MFU | 125956 tok/s step 1439/19560 | loss 4.064091 (+0.15z)| norm 0.5159 (+3.07z)| lr 5.98e-04 | 4134.21 ms | 32.7% bf16 MFU | 125999 tok/s step 1440/19560 | loss 3.996700 (-1.57z)| norm 0.3642 (-0.08z)| lr 5.98e-04 | 4145.79 ms | 32.6% bf16 MFU | 126022 tok/s step 1441/19560 | loss 4.075992 (+0.47z)| norm 0.4091 (+0.89z)| lr 5.98e-04 | 4181.10 ms | 32.3% bf16 MFU | 125991 tok/s step 1442/19560 | loss 4.026113 (-0.81z)| norm 0.4180 (+1.11z)| lr 5.98e-04 | 4174.29 ms | 32.3% bf16 MFU | 125971 tok/s step 1443/19560 | loss 4.057787 (+0.02z)| norm 0.4148 (+1.05z)| lr 5.98e-04 | 4155.68 
ms | 32.5% bf16 MFU | 125981 tok/s step 1444/19560 | loss 3.987820 (-1.76z)| norm 0.3770 (+0.22z)| lr 5.98e-04 | 5160.61 ms | 26.2% bf16 MFU | 124762 tok/s step 1445/19560 | loss 4.066071 (+0.26z)| norm 0.3070 (-1.30z)| lr 5.98e-04 | 4160.38 ms | 32.5% bf16 MFU | 124824 tok/s step 1446/19560 | loss 4.029122 (-0.69z)| norm 0.3219 (-0.96z)| lr 5.98e-04 | 4173.84 ms | 32.3% bf16 MFU | 124864 tok/s step 1447/19560 | loss 4.032639 (-0.59z)| norm 0.3207 (-0.98z)| lr 5.98e-04 | 4154.12 ms | 32.5% bf16 MFU | 124931 tok/s step 1448/19560 | loss 4.049228 (-0.15z)| norm 0.3270 (-0.84z)| lr 5.98e-04 | 4161.03 ms | 32.4% bf16 MFU | 124985 tok/s step 1449/19560 | loss 4.011869 (-1.10z)| norm 0.3053 (-1.30z)| lr 5.98e-04 | 4149.92 ms | 32.5% bf16 MFU | 125052 tok/s step 1450/19560 | loss 3.984418 (-1.77z)| norm 0.3254 (-0.86z)| lr 5.98e-04 | 4656.64 ms | 29.0% bf16 MFU | 124429 tok/s step 1451/19560 | loss 4.070199 (+0.43z)| norm 0.3512 (-0.31z)| lr 5.98e-04 | 4181.64 ms | 32.3% bf16 MFU | 124477 tok/s step 1452/19560 | loss 4.086965 (+0.87z)| norm 0.3658 (+0.01z)| lr 5.98e-04 | 4152.70 ms | 32.5% bf16 MFU | 124565 tok/s step 1453/19560 | loss 4.033706 (-0.50z)| norm 0.4048 (+0.85z)| lr 5.98e-04 | 4154.41 ms | 32.5% bf16 MFU | 124647 tok/s step 1454/19560 | loss 4.085656 (+0.84z)| norm 0.4004 (+0.74z)| lr 5.98e-04 | 4178.01 ms | 32.3% bf16 MFU | 124689 tok/s step 1455/19560 | loss 4.048707 (-0.11z)| norm 0.3376 (-0.63z)| lr 5.98e-04 | 4163.33 ms | 32.4% bf16 MFU | 124751 tok/s step 1456/19560 | loss 4.016557 (-0.93z)| norm 0.3158 (-1.10z)| lr 5.98e-04 | 4156.97 ms | 32.5% bf16 MFU | 124820 tok/s step 1457/19560 | loss 4.100817 (+1.25z)| norm 0.3048 (-1.32z)| lr 5.98e-04 | 4165.13 ms | 32.4% bf16 MFU | 124873 tok/s step 1458/19560 | loss 4.086365 (+0.86z)| norm 0.3367 (-0.62z)| lr 5.98e-04 | 4167.82 ms | 32.4% bf16 MFU | 124919 tok/s step 1459/19560 | loss 4.056376 (+0.10z)| norm 0.3628 (-0.05z)| lr 5.98e-04 | 4155.52 ms | 32.5% bf16 MFU | 124981 tok/s step 1460/19560 | loss 
4.055312 (+0.09z)| norm 0.3745 (+0.20z)| lr 5.98e-04 | 4162.36 ms | 32.4% bf16 MFU | 125030 tok/s step 1461/19560 | loss 4.059365 (+0.20z)| norm 0.3696 (+0.09z)| lr 5.98e-04 | 4179.34 ms | 32.3% bf16 MFU | 125051 tok/s step 1462/19560 | loss 4.074407 (+0.60z)| norm 0.3608 (-0.09z)| lr 5.98e-04 | 4288.92 ms | 31.5% bf16 MFU | 124910 tok/s step 1463/19560 | loss 4.080900 (+0.77z)| norm 0.3403 (-0.53z)| lr 5.98e-04 | 4147.74 ms | 32.6% bf16 MFU | 124985 tok/s step 1464/19560 | loss 4.010510 (-1.08z)| norm 0.3528 (-0.26z)| lr 5.98e-04 | 4159.72 ms | 32.5% bf16 MFU | 125038 tok/s step 1465/19560 | loss 4.023858 (-0.72z)| norm 0.3630 (-0.05z)| lr 5.98e-04 | 7801.85 ms | 17.3% bf16 MFU | 122146 tok/s step 1466/19560 | loss 4.033130 (-0.46z)| norm 0.3395 (-0.57z)| lr 5.98e-04 | 4139.04 ms | 32.6% bf16 MFU | 122372 tok/s step 1467/19560 | loss 4.028226 (-0.59z)| norm 0.3100 (-1.21z)| lr 5.98e-04 | 4537.06 ms | 29.8% bf16 MFU | 122031 tok/s step 1468/19560 | loss 4.033018 (-0.46z)| norm 0.2980 (-1.45z)| lr 5.98e-04 | 4138.68 ms | 32.6% bf16 MFU | 122264 tok/s step 1469/19560 | loss 4.053002 (+0.06z)| norm 0.3058 (-1.26z)| lr 5.98e-04 | 4156.09 ms | 32.5% bf16 MFU | 122458 tok/s step 1470/19560 | loss 4.083473 (+0.86z)| norm 0.2980 (-1.41z)| lr 5.98e-04 | 4167.43 ms | 32.4% bf16 MFU | 122625 tok/s step 1471/19560 | loss 3.992651 (-1.51z)| norm 0.2816 (-1.73z)| lr 5.98e-04 | 4174.91 ms | 32.3% bf16 MFU | 122773 tok/s step 1472/19560 | loss 4.070119 (+0.51z)| norm 0.2895 (-1.53z)| lr 5.98e-04 | 4162.78 ms | 32.4% bf16 MFU | 122932 tok/s step 1473/19560 | loss 3.991822 (-1.53z)| norm 0.3104 (-1.07z)| lr 5.98e-04 | 4157.02 ms | 32.5% bf16 MFU | 123091 tok/s step 1474/19560 | loss 4.038320 (-0.32z)| norm 0.3221 (-0.81z)| lr 5.98e-04 | 4181.81 ms | 32.3% bf16 MFU | 123205 tok/s step 1475/19560 | loss 3.983180 (-1.73z)| norm 0.3536 (-0.13z)| lr 5.98e-04 | 4147.82 ms | 32.6% bf16 MFU | 123365 tok/s step 1476/19560 | loss 4.020987 (-0.76z)| norm 0.3492 (-0.21z)| lr 5.98e-04 | 4226.15 
ms | 31.9% bf16 MFU | 123400 tok/s step 1477/19560 | loss 4.002943 (-1.21z)| norm 0.3276 (-0.67z)| lr 5.97e-04 | 4147.36 ms | 32.6% bf16 MFU | 123551 tok/s step 1478/19560 | loss 4.015341 (-0.89z)| norm 0.3422 (-0.34z)| lr 5.97e-04 | 4161.26 ms | 32.4% bf16 MFU | 123673 tok/s step 1479/19560 | loss 3.943166 (-2.70z)| norm 0.3608 (+0.07z)| lr 5.97e-04 | 4157.99 ms | 32.5% bf16 MFU | 123794 tok/s step 1480/19560 | loss 4.076415 (+0.73z)| norm 0.3832 (+0.56z)| lr 5.97e-04 | 4161.13 ms | 32.4% bf16 MFU | 123904 tok/s step 1481/19560 | loss 4.022736 (-0.64z)| norm 0.4719 (+2.46z)| lr 5.97e-04 | 4160.20 ms | 32.5% bf16 MFU | 124010 tok/s step 1482/19560 | loss 4.026414 (-0.54z)| norm 0.4818 (+2.61z)| lr 5.97e-04 | 4167.64 ms | 32.4% bf16 MFU | 124099 tok/s step 1483/19560 | loss 4.079231 (+0.82z)| norm 0.4331 (+1.55z)| lr 5.97e-04 | 4168.61 ms | 32.4% bf16 MFU | 124183 tok/s step 1484/19560 | loss 3.982927 (-1.64z)| norm 0.4186 (+1.22z)| lr 5.97e-04 | 4150.65 ms | 32.5% bf16 MFU | 124289 tok/s step 1485/19560 | loss 4.085460 (+0.99z)| norm 0.3505 (-0.21z)| lr 5.97e-04 | 4275.29 ms | 31.6% bf16 MFU | 124207 tok/s step 1486/19560 | loss 4.019005 (-0.71z)| norm 0.4380 (+1.60z)| lr 5.97e-04 | 4168.77 ms | 32.4% bf16 MFU | 124285 tok/s step 1487/19560 | loss 4.066308 (+0.52z)| norm 0.3847 (+0.48z)| lr 5.97e-04 | 4179.95 ms | 32.3% bf16 MFU | 124342 tok/s step 1488/19560 | loss 3.971329 (-1.89z)| norm 0.3683 (+0.15z)| lr 5.97e-04 | 4168.53 ms | 32.4% bf16 MFU | 124413 tok/s step 1489/19560 | loss 4.037946 (-0.19z)| norm 0.3274 (-0.69z)| lr 5.97e-04 | 4160.63 ms | 32.5% bf16 MFU | 124493 tok/s step 1490/19560 | loss 4.044537 (-0.03z)| norm 0.3194 (-0.85z)| lr 5.97e-04 | 4262.15 ms | 31.7% bf16 MFU | 124419 tok/s step 1491/19560 | loss 4.054058 (+0.22z)| norm 0.3053 (-1.13z)| lr 5.97e-04 | 4159.66 ms | 32.5% bf16 MFU | 124500 tok/s step 1492/19560 | loss 4.006034 (-1.00z)| norm 0.2759 (-1.72z)| lr 5.97e-04 | 4151.98 ms | 32.5% bf16 MFU | 124589 tok/s step 1493/19560 | loss 
3.990182 (-1.38z)| norm 0.2794 (-1.62z)| lr 5.97e-04 | 4170.96 ms | 32.4% bf16 MFU | 124644 tok/s step 1494/19560 | loss 4.074639 (+0.81z)| norm 0.2862 (-1.45z)| lr 5.97e-04 | 4177.50 ms | 32.3% bf16 MFU | 124687 tok/s step 1495/19560 | loss 4.003119 (-1.06z)| norm 0.3163 (-0.85z)| lr 5.97e-04 | 4152.99 ms | 32.5% bf16 MFU | 124765 tok/s step 1496/19560 | loss 4.028099 (-0.41z)| norm 0.3284 (-0.60z)| lr 5.97e-04 | 4186.92 ms | 32.2% bf16 MFU | 124788 tok/s step 1497/19560 | loss 4.058869 (+0.39z)| norm 0.3167 (-0.83z)| lr 5.97e-04 | 4178.64 ms | 32.3% bf16 MFU | 124822 tok/s step 1498/19560 | loss 3.997634 (-1.20z)| norm 0.3501 (-0.14z)| lr 5.97e-04 | 4162.37 ms | 32.4% bf16 MFU | 124879 tok/s step 1499/19560 | loss 3.980770 (-1.62z)| norm 0.3803 (+0.47z)| lr 5.97e-04 | 4166.71 ms | 32.4% bf16 MFU | 124926 tok/s step 1500/19560 | loss 4.041120 (-0.03z)| norm 0.3695 (+0.26z)| lr 5.97e-04 | 4190.84 ms | 32.2% bf16 MFU | 124935 tok/s val loss 4.012183 evaluating HellaSwag: 0/1256 evaluating HellaSwag: 10/1256 evaluating HellaSwag: 20/1256 evaluating HellaSwag: 30/1256 evaluating HellaSwag: 40/1256 evaluating HellaSwag: 50/1256 evaluating HellaSwag: 60/1256 evaluating HellaSwag: 70/1256 evaluating HellaSwag: 80/1256 evaluating HellaSwag: 90/1256 evaluating HellaSwag: 100/1256 evaluating HellaSwag: 110/1256 evaluating HellaSwag: 120/1256 evaluating HellaSwag: 130/1256 evaluating HellaSwag: 140/1256 evaluating HellaSwag: 150/1256 evaluating HellaSwag: 160/1256 evaluating HellaSwag: 170/1256 evaluating HellaSwag: 180/1256 evaluating HellaSwag: 190/1256 evaluating HellaSwag: 200/1256 evaluating HellaSwag: 210/1256 evaluating HellaSwag: 220/1256 evaluating HellaSwag: 230/1256 evaluating HellaSwag: 240/1256 evaluating HellaSwag: 250/1256 evaluating HellaSwag: 260/1256 evaluating HellaSwag: 270/1256 evaluating HellaSwag: 280/1256 evaluating HellaSwag: 290/1256 evaluating HellaSwag: 300/1256 evaluating HellaSwag: 310/1256 evaluating HellaSwag: 320/1256 evaluating HellaSwag: 
330/1256 evaluating HellaSwag: 340/1256 evaluating HellaSwag: 350/1256 evaluating HellaSwag: 360/1256 evaluating HellaSwag: 370/1256 evaluating HellaSwag: 380/1256 evaluating HellaSwag: 390/1256 evaluating HellaSwag: 400/1256 evaluating HellaSwag: 410/1256 evaluating HellaSwag: 420/1256 evaluating HellaSwag: 430/1256 evaluating HellaSwag: 440/1256 evaluating HellaSwag: 450/1256 evaluating HellaSwag: 460/1256 evaluating HellaSwag: 470/1256 evaluating HellaSwag: 480/1256 evaluating HellaSwag: 490/1256 evaluating HellaSwag: 500/1256 evaluating HellaSwag: 510/1256 evaluating HellaSwag: 520/1256 evaluating HellaSwag: 530/1256 evaluating HellaSwag: 540/1256 evaluating HellaSwag: 550/1256 evaluating HellaSwag: 560/1256 evaluating HellaSwag: 570/1256 evaluating HellaSwag: 580/1256 evaluating HellaSwag: 590/1256 evaluating HellaSwag: 600/1256 evaluating HellaSwag: 610/1256 evaluating HellaSwag: 620/1256 evaluating HellaSwag: 630/1256 evaluating HellaSwag: 640/1256 evaluating HellaSwag: 650/1256 evaluating HellaSwag: 660/1256 evaluating HellaSwag: 670/1256 evaluating HellaSwag: 680/1256 evaluating HellaSwag: 690/1256 evaluating HellaSwag: 700/1256 evaluating HellaSwag: 710/1256 evaluating HellaSwag: 720/1256 evaluating HellaSwag: 730/1256 evaluating HellaSwag: 740/1256 evaluating HellaSwag: 750/1256 evaluating HellaSwag: 760/1256 evaluating HellaSwag: 770/1256 evaluating HellaSwag: 780/1256 evaluating HellaSwag: 790/1256 evaluating HellaSwag: 800/1256 evaluating HellaSwag: 810/1256 evaluating HellaSwag: 820/1256 evaluating HellaSwag: 830/1256 evaluating HellaSwag: 840/1256 evaluating HellaSwag: 850/1256 evaluating HellaSwag: 860/1256 evaluating HellaSwag: 870/1256 evaluating HellaSwag: 880/1256 evaluating HellaSwag: 890/1256 evaluating HellaSwag: 900/1256 evaluating HellaSwag: 910/1256 evaluating HellaSwag: 920/1256 evaluating HellaSwag: 930/1256 evaluating HellaSwag: 940/1256 evaluating HellaSwag: 950/1256 evaluating HellaSwag: 960/1256 evaluating HellaSwag: 970/1256 
evaluating HellaSwag: 980/1256 evaluating HellaSwag: 990/1256 evaluating HellaSwag: 1000/1256 evaluating HellaSwag: 1010/1256 evaluating HellaSwag: 1020/1256 evaluating HellaSwag: 1030/1256 evaluating HellaSwag: 1040/1256 evaluating HellaSwag: 1050/1256 evaluating HellaSwag: 1060/1256 evaluating HellaSwag: 1070/1256 evaluating HellaSwag: 1080/1256 evaluating HellaSwag: 1090/1256 evaluating HellaSwag: 1100/1256 evaluating HellaSwag: 1110/1256 evaluating HellaSwag: 1120/1256 evaluating HellaSwag: 1130/1256 evaluating HellaSwag: 1140/1256 evaluating HellaSwag: 1150/1256 evaluating HellaSwag: 1160/1256 evaluating HellaSwag: 1170/1256 evaluating HellaSwag: 1180/1256 evaluating HellaSwag: 1190/1256 evaluating HellaSwag: 1200/1256 evaluating HellaSwag: 1210/1256 evaluating HellaSwag: 1220/1256 evaluating HellaSwag: 1230/1256 evaluating HellaSwag: 1240/1256 evaluating HellaSwag: 1250/1256 HellaSwag: 2590/10042 = 0.257917 step 1501/19560 | loss 4.034241 (-0.19z)| norm 0.3534 (-0.06z)| lr 5.97e-04 | 4173.19 ms | 32.4% bf16 MFU | 124970 tok/s step 1502/19560 | loss 3.969722 (-1.91z)| norm 0.3954 (+0.80z)| lr 5.97e-04 | 4158.01 ms | 32.5% bf16 MFU | 125026 tok/s step 1503/19560 | loss 4.016825 (-0.63z)| norm 0.3927 (+0.76z)| lr 5.97e-04 | 4171.24 ms | 32.4% bf16 MFU | 125059 tok/s step 1504/19560 | loss 4.041062 (+0.01z)| norm 0.3930 (+0.77z)| lr 5.97e-04 | 4154.30 ms | 32.5% bf16 MFU | 125116 tok/s step 1505/19560 | loss 4.027656 (-0.34z)| norm 0.3658 (+0.21z)| lr 5.97e-04 | 4212.74 ms | 32.0% bf16 MFU | 125083 tok/s step 1506/19560 | loss 4.045825 (+0.15z)| norm 0.3814 (+0.55z)| lr 5.97e-04 | 4175.25 ms | 32.3% bf16 MFU | 125108 tok/s step 1507/19560 | loss 3.936978 (-2.72z)| norm 0.3253 (-0.62z)| lr 5.97e-04 | 4164.31 ms | 32.4% bf16 MFU | 125147 tok/s step 1508/19560 | loss 4.017808 (-0.57z)| norm 0.3210 (-0.71z)| lr 5.97e-04 | 4159.71 ms | 32.5% bf16 MFU | 125192 tok/s step 1509/19560 | loss 3.974292 (-1.74z)| norm 0.3219 (-0.69z)| lr 5.97e-04 | 4287.12 ms | 31.5% bf16 
MFU | 125047 tok/s step 1510/19560 | loss 3.988281 (-1.34z)| norm 0.3416 (-0.28z)| lr 5.97e-04 | 4151.56 ms | 32.5% bf16 MFU | 125109 tok/s step 1511/19560 | loss 3.981612 (-1.50z)| norm 0.3346 (-0.43z)| lr 5.97e-04 | 4155.44 ms | 32.5% bf16 MFU | 125162 tok/s step 1512/19560 | loss 4.047326 (+0.31z)| norm 0.3272 (-0.59z)| lr 5.97e-04 | 4168.54 ms | 32.4% bf16 MFU | 125193 tok/s step 1513/19560 | loss 4.021227 (-0.40z)| norm 0.3272 (-0.58z)| lr 5.97e-04 | 4169.30 ms | 32.4% bf16 MFU | 125220 tok/s step 1514/19560 | loss 4.064857 (+0.79z)| norm 0.3071 (-1.00z)| lr 5.97e-04 | 4164.23 ms | 32.4% bf16 MFU | 125255 tok/s step 1515/19560 | loss 4.057930 (+0.60z)| norm 0.3237 (-0.65z)| lr 5.97e-04 | 4365.87 ms | 30.9% bf16 MFU | 124996 tok/s step 1516/19560 | loss 3.968381 (-1.82z)| norm 0.2991 (-1.16z)| lr 5.97e-04 | 4175.76 ms | 32.3% bf16 MFU | 125024 tok/s step 1517/19560 | loss 4.024704 (-0.28z)| norm 0.3246 (-0.63z)| lr 5.97e-04 | 4179.17 ms | 32.3% bf16 MFU | 125046 tok/s step 1518/19560 | loss 3.958471 (-2.03z)| norm 0.3860 (+0.64z)| lr 5.97e-04 | 4158.81 ms | 32.5% bf16 MFU | 125097 tok/s step 1519/19560 | loss 4.023279 (-0.28z)| norm 0.4631 (+2.19z)| lr 5.97e-04 | 4165.88 ms | 32.4% bf16 MFU | 125134 tok/s step 1520/19560 | loss 4.019423 (-0.37z)| norm 0.4853 (+2.56z)| lr 5.97e-04 | 4151.26 ms | 32.5% bf16 MFU | 125193 tok/s step 1521/19560 | loss 4.007163 (-0.70z)| norm 0.4576 (+1.99z)| lr 5.97e-04 | 4165.09 ms | 32.4% bf16 MFU | 125227 tok/s step 1522/19560 | loss 4.072832 (+1.09z)| norm 0.3712 (+0.31z)| lr 5.97e-04 | 4166.14 ms | 32.4% bf16 MFU | 125258 tok/s step 1523/19560 | loss 4.036339 (+0.13z)| norm 0.3347 (-0.44z)| lr 5.97e-04 | 4159.99 ms | 32.5% bf16 MFU | 125296 tok/s step 1524/19560 | loss 4.064939 (+0.94z)| norm 0.2935 (-1.30z)| lr 5.97e-04 | 4226.10 ms | 31.9% bf16 MFU | 125234 tok/s step 1525/19560 | loss 4.109181 (+2.15z)| norm 0.3263 (-0.60z)| lr 5.97e-04 | 4147.65 ms | 32.6% bf16 MFU | 125293 tok/s step 1526/19560 | loss 4.007512 (-0.69z)| 
norm 0.3425 (-0.25z)| lr 5.97e-04 | 4168.68 ms | 32.4% bf16 MFU | 125317 tok/s step 1527/19560 | loss 3.965676 (-1.83z)| norm 0.3292 (-0.53z)| lr 5.97e-04 | 4166.28 ms | 32.4% bf16 MFU | 125343 tok/s step 1528/19560 | loss 3.986236 (-1.24z)| norm 0.3219 (-0.68z)| lr 5.97e-04 | 4165.50 ms | 32.4% bf16 MFU | 125369 tok/s step 1529/19560 | loss 4.026261 (-0.12z)| norm 0.2982 (-1.18z)| lr 5.97e-04 | 4176.44 ms | 32.3% bf16 MFU | 125377 tok/s step 1530/19560 | loss 3.952002 (-2.13z)| norm 0.2812 (-1.52z)| lr 5.97e-04 | 4170.32 ms | 32.4% bf16 MFU | 125394 tok/s step 1531/19560 | loss 4.092696 (+1.71z)| norm 0.2785 (-1.56z)| lr 5.97e-04 | 4301.11 ms | 31.4% bf16 MFU | 125220 tok/s step 1532/19560 | loss 4.009165 (-0.56z)| norm 0.3245 (-0.59z)| lr 5.97e-04 | 4145.07 ms | 32.6% bf16 MFU | 125283 tok/s step 1533/19560 | loss 4.079559 (+1.34z)| norm 0.3623 (+0.20z)| lr 5.97e-04 | 4224.23 ms | 32.0% bf16 MFU | 125224 tok/s step 1534/19560 | loss 4.020166 (-0.26z)| norm 0.3734 (+0.43z)| lr 5.97e-04 | 4151.24 ms | 32.5% bf16 MFU | 125278 tok/s step 1535/19560 | loss 3.970563 (-1.58z)| norm 0.3657 (+0.28z)| lr 5.97e-04 | 4205.89 ms | 32.1% bf16 MFU | 125247 tok/s step 1536/19560 | loss 4.008801 (-0.54z)| norm 0.3553 (+0.07z)| lr 5.97e-04 | 4199.93 ms | 32.1% bf16 MFU | 125226 tok/s step 1537/19560 | loss 4.014627 (-0.37z)| norm 0.3229 (-0.60z)| lr 5.97e-04 | 4157.52 ms | 32.5% bf16 MFU | 125270 tok/s step 1538/19560 | loss 3.985021 (-1.16z)| norm 0.3175 (-0.70z)| lr 5.97e-04 | 4151.52 ms | 32.5% bf16 MFU | 125321 tok/s step 1539/19560 | loss 4.010603 (-0.47z)| norm 0.3381 (-0.26z)| lr 5.97e-04 | 4144.42 ms | 32.6% bf16 MFU | 125380 tok/s step 1540/19560 | loss 4.078638 (+1.36z)| norm 0.3677 (+0.37z)| lr 5.97e-04 | 4163.22 ms | 32.4% bf16 MFU | 125408 tok/s step 1541/19560 | loss 4.077405 (+1.34z)| norm 0.3699 (+0.41z)| lr 5.97e-04 | 4153.74 ms | 32.5% bf16 MFU | 125448 tok/s step 1542/19560 | loss 4.037698 (+0.27z)| norm 0.3515 (+0.03z)| lr 5.97e-04 | 4165.76 ms | 32.4% bf16 MFU 
| 125469 tok/s step 1543/19560 | loss 3.978325 (-1.33z)| norm 0.3548 (+0.10z)| lr 5.97e-04 | 4151.95 ms | 32.5% bf16 MFU | 125509 tok/s step 1544/19560 | loss 4.005136 (-0.59z)| norm 0.3142 (-0.77z)| lr 5.97e-04 | 4203.40 ms | 32.1% bf16 MFU | 125470 tok/s step 1545/19560 | loss 4.097244 (+1.90z)| norm 0.3215 (-0.62z)| lr 5.97e-04 | 4181.52 ms | 32.3% bf16 MFU | 125466 tok/s step 1546/19560 | loss 3.972785 (-1.46z)| norm 0.3222 (-0.61z)| lr 5.97e-04 | 4162.22 ms | 32.4% bf16 MFU | 125491 tok/s step 1547/19560 | loss 4.015722 (-0.31z)| norm 0.3534 (+0.06z)| lr 5.97e-04 | 4165.17 ms | 32.4% bf16 MFU | 125510 tok/s step 1548/19560 | loss 3.982702 (-1.19z)| norm 0.3450 (-0.13z)| lr 5.97e-04 | 4162.93 ms | 32.4% bf16 MFU | 125532 tok/s step 1549/19560 | loss 4.002223 (-0.65z)| norm 0.3542 (+0.06z)| lr 5.97e-04 | 4169.43 ms | 32.4% bf16 MFU | 125542 tok/s step 1550/19560 | loss 3.998457 (-0.76z)| norm 0.3937 (+0.91z)| lr 5.97e-04 | 4159.96 ms | 32.5% bf16 MFU | 125567 tok/s step 1551/19560 | loss 4.003455 (-0.62z)| norm 0.4316 (+1.69z)| lr 5.97e-04 | 4161.20 ms | 32.4% bf16 MFU | 125588 tok/s step 1552/19560 | loss 4.103220 (+2.03z)| norm 0.4029 (+1.06z)| lr 5.97e-04 | 4154.85 ms | 32.5% bf16 MFU | 125618 tok/s step 1553/19560 | loss 4.067199 (+1.06z)| norm 0.3863 (+0.70z)| lr 5.97e-04 | 4157.57 ms | 32.5% bf16 MFU | 125642 tok/s step 1554/19560 | loss 3.998189 (-0.77z)| norm 0.3126 (-0.86z)| lr 5.97e-04 | 4174.69 ms | 32.3% bf16 MFU | 125640 tok/s step 1555/19560 | loss 4.019450 (-0.20z)| norm 0.3277 (-0.53z)| lr 5.97e-04 | 4150.29 ms | 32.5% bf16 MFU | 125674 tok/s step 1556/19560 | loss 4.011826 (-0.40z)| norm 0.3337 (-0.40z)| lr 5.97e-04 | 4165.09 ms | 32.4% bf16 MFU | 125684 tok/s step 1557/19560 | loss 4.005812 (-0.56z)| norm 0.3447 (-0.17z)| lr 5.97e-04 | 4159.34 ms | 32.5% bf16 MFU | 125702 tok/s step 1558/19560 | loss 4.000173 (-0.70z)| norm 0.3637 (+0.24z)| lr 5.97e-04 | 4167.35 ms | 32.4% bf16 MFU | 125708 tok/s step 1559/19560 | loss 3.991338 (-0.93z)| norm 
0.3202 (-0.68z)| lr 5.97e-04 | 4157.52 ms | 32.5% bf16 MFU | 125728 tok/s step 1560/19560 | loss 4.037683 (+0.30z)| norm 0.3123 (-0.84z)| lr 5.97e-04 | 4178.99 ms | 32.3% bf16 MFU | 125714 tok/s step 1561/19560 | loss 4.025635 (-0.03z)| norm 0.3381 (-0.28z)| lr 5.97e-04 | 4358.90 ms | 31.0% bf16 MFU | 125442 tok/s step 1562/19560 | loss 4.061599 (+0.92z)| norm 0.3728 (+0.47z)| lr 5.97e-04 | 4139.16 ms | 32.6% bf16 MFU | 125504 tok/s step 1563/19560 | loss 4.035546 (+0.23z)| norm 0.3790 (+0.59z)| lr 5.97e-04 | 4167.19 ms | 32.4% bf16 MFU | 125519 tok/s step 1564/19560 | loss 4.013919 (-0.36z)| norm 0.3448 (-0.15z)| lr 5.97e-04 | 4245.41 ms | 31.8% bf16 MFU | 125418 tok/s step 1565/19560 | loss 3.985674 (-1.10z)| norm 0.3240 (-0.59z)| lr 5.97e-04 | 4157.81 ms | 32.5% bf16 MFU | 125452 tok/s step 1566/19560 | loss 3.973859 (-1.40z)| norm 0.3412 (-0.21z)| lr 5.97e-04 | 4144.48 ms | 32.6% bf16 MFU | 125504 tok/s step 1567/19560 | loss 3.993880 (-0.86z)| norm 0.3179 (-0.73z)| lr 5.97e-04 | 4152.73 ms | 32.5% bf16 MFU | 125542 tok/s step 1568/19560 | loss 4.008071 (-0.48z)| norm 0.3249 (-0.55z)| lr 5.97e-04 | 4162.52 ms | 32.4% bf16 MFU | 125562 tok/s step 1569/19560 | loss 3.970844 (-1.45z)| norm 0.3289 (-0.45z)| lr 5.97e-04 | 4168.10 ms | 32.4% bf16 MFU | 125573 tok/s step 1570/19560 | loss 3.969525 (-1.46z)| norm 0.3479 (+0.01z)| lr 5.97e-04 | 4288.67 ms | 31.5% bf16 MFU | 125407 tok/s step 1571/19560 | loss 3.996724 (-0.73z)| norm 0.2941 (-1.25z)| lr 5.97e-04 | 4184.31 ms | 32.3% bf16 MFU | 125402 tok/s step 1572/19560 | loss 3.997687 (-0.71z)| norm 0.3096 (-0.87z)| lr 5.97e-04 | 4157.65 ms | 32.5% bf16 MFU | 125437 tok/s step 1573/19560 | loss 4.078168 (+1.40z)| norm 0.3613 (+0.36z)| lr 5.97e-04 | 4420.26 ms | 30.5% bf16 MFU | 125096 tok/s step 1574/19560 | loss 3.996319 (-0.74z)| norm 0.2979 (-1.15z)| lr 5.97e-04 | 4153.38 ms | 32.5% bf16 MFU | 125152 tok/s step 1575/19560 | loss 4.006536 (-0.46z)| norm 0.3246 (-0.52z)| lr 5.97e-04 | 4158.01 ms | 32.5% bf16 MFU | 
125199 tok/s step 1576/19560 | loss 3.993450 (-0.79z)| norm 0.3727 (+0.62z)| lr 5.97e-04 | 4478.48 ms | 30.1% bf16 MFU | 124793 tok/s step 1577/19560 | loss 4.003177 (-0.54z)| norm 0.4294 (+1.93z)| lr 5.97e-04 | 4163.86 ms | 32.4% bf16 MFU | 124849 tok/s step 1578/19560 | loss 3.981374 (-1.11z)| norm 0.3949 (+1.10z)| lr 5.97e-04 | 4163.59 ms | 32.4% bf16 MFU | 124902 tok/s step 1579/19560 | loss 3.952655 (-1.82z)| norm 0.3230 (-0.59z)| lr 5.97e-04 | 4159.65 ms | 32.5% bf16 MFU | 124959 tok/s step 1580/19560 | loss 4.066499 (+1.14z)| norm 0.3125 (-0.82z)| lr 5.97e-04 | 4169.06 ms | 32.4% bf16 MFU | 124999 tok/s step 1581/19560 | loss 3.984880 (-0.97z)| norm 0.3412 (-0.14z)| lr 5.97e-04 | 4207.31 ms | 32.1% bf16 MFU | 124980 tok/s step 1582/19560 | loss 4.033737 (+0.31z)| norm 0.3333 (-0.31z)| lr 5.97e-04 | 4160.81 ms | 32.4% bf16 MFU | 125031 tok/s step 1583/19560 | loss 4.035981 (+0.37z)| norm 0.3293 (-0.41z)| lr 5.97e-04 | 4175.24 ms | 32.3% bf16 MFU | 125058 tok/s step 1584/19560 | loss 4.031797 (+0.26z)| norm 0.3089 (-0.89z)| lr 5.97e-04 | 4500.58 ms | 30.0% bf16 MFU | 124630 tok/s step 1585/19560 | loss 4.061172 (+1.05z)| norm 0.3314 (-0.36z)| lr 5.97e-04 | 4236.95 ms | 31.9% bf16 MFU | 124586 tok/s step 1586/19560 | loss 4.005598 (-0.41z)| norm 0.3236 (-0.54z)| lr 5.97e-04 | 4160.83 ms | 32.4% bf16 MFU | 124657 tok/s step 1587/19560 | loss 4.022540 (+0.05z)| norm 0.3187 (-0.65z)| lr 5.97e-04 | 4148.17 ms | 32.5% bf16 MFU | 124743 tok/s step 1588/19560 | loss 3.984517 (-0.96z)| norm 0.3403 (-0.13z)| lr 5.97e-04 | 4312.68 ms | 31.3% bf16 MFU | 124585 tok/s step 1589/19560 | loss 3.951066 (-1.83z)| norm 0.3195 (-0.62z)| lr 5.97e-04 | 4153.59 ms | 32.5% bf16 MFU | 124667 tok/s step 1590/19560 | loss 3.993626 (-0.67z)| norm 0.3064 (-0.92z)| lr 5.97e-04 | 4288.68 ms | 31.5% bf16 MFU | 124546 tok/s step 1591/19560 | loss 3.978088 (-1.08z)| norm 0.2729 (-1.68z)| lr 5.97e-04 | 4224.47 ms | 32.0% bf16 MFU | 124524 tok/s step 1592/19560 | loss 4.058054 (+1.07z)| norm 
0.2992 (-1.05z)| lr 5.97e-04 | 4152.43 ms | 32.5% bf16 MFU | 124611 tok/s step 1593/19560 | loss 3.974919 (-1.15z)| norm 0.3971 (+1.22z)| lr 5.97e-04 | 4156.37 ms | 32.5% bf16 MFU | 124687 tok/s step 1594/19560 | loss 4.025923 (+0.22z)| norm 0.3977 (+1.22z)| lr 5.97e-04 | 4171.45 ms | 32.4% bf16 MFU | 124737 tok/s step 1595/19560 | loss 4.017559 (-0.00z)| norm 0.4171 (+1.63z)| lr 5.97e-04 | 4170.82 ms | 32.4% bf16 MFU | 124785 tok/s step 1596/19560 | loss 3.919348 (-2.55z)| norm 0.3752 (+0.66z)| lr 5.97e-04 | 4142.24 ms | 32.6% bf16 MFU | 124875 tok/s step 1597/19560 | loss 3.982072 (-0.90z)| norm 0.3448 (-0.04z)| lr 5.97e-04 | 7228.75 ms | 18.7% bf16 MFU | 122257 tok/s step 1598/19560 | loss 3.954803 (-1.59z)| norm 0.3245 (-0.52z)| lr 5.97e-04 | 4142.10 ms | 32.6% bf16 MFU | 122473 tok/s step 1599/19560 | loss 3.967866 (-1.23z)| norm 0.3133 (-0.78z)| lr 5.97e-04 | 4295.68 ms | 31.4% bf16 MFU | 122452 tok/s step 1600/19560 | loss 3.952762 (-1.60z)| norm 0.2878 (-1.38z)| lr 5.97e-04 | 4144.82 ms | 32.6% bf16 MFU | 122654 tok/s step 1601/19560 | loss 3.977406 (-0.95z)| norm 0.2966 (-1.17z)| lr 5.97e-04 | 4182.66 ms | 32.3% bf16 MFU | 122789 tok/s step 1602/19560 | loss 3.972268 (-1.07z)| norm 0.3196 (-0.63z)| lr 5.97e-04 | 4194.34 ms | 32.2% bf16 MFU | 122899 tok/s step 1603/19560 | loss 3.987926 (-0.67z)| norm 0.3264 (-0.47z)| lr 5.97e-04 | 4276.51 ms | 31.6% bf16 MFU | 122884 tok/s step 1604/19560 | loss 3.980524 (-0.85z)| norm 0.3250 (-0.50z)| lr 5.97e-04 | 4151.01 ms | 32.5% bf16 MFU | 123055 tok/s step 1605/19560 | loss 3.938621 (-1.89z)| norm 0.3451 (-0.04z)| lr 5.97e-04 | 4159.93 ms | 32.5% bf16 MFU | 123204 tok/s step 1606/19560 | loss 4.004590 (-0.21z)| norm 0.3502 (+0.08z)| lr 5.97e-04 | 4150.44 ms | 32.5% bf16 MFU | 123360 tok/s step 1607/19560 | loss 4.010300 (-0.08z)| norm 0.3838 (+0.85z)| lr 5.97e-04 | 4229.21 ms | 31.9% bf16 MFU | 123390 tok/s step 1608/19560 | loss 3.995526 (-0.45z)| norm 0.3769 (+0.69z)| lr 5.97e-04 | 4290.74 ms | 31.5% bf16 MFU | 
123330 tok/s step 1609/19560 | loss 4.045016 (+0.84z)| norm 0.3814 (+0.84z)| lr 5.97e-04 | 4288.70 ms | 31.5% bf16 MFU | 123276 tok/s step 1610/19560 | loss 3.959019 (-1.38z)| norm 0.3548 (+0.24z)| lr 5.97e-04 | 4136.91 ms | 32.6% bf16 MFU | 123449 tok/s step 1611/19560 | loss 3.986742 (-0.65z)| norm 0.3566 (+0.31z)| lr 5.97e-04 | 4145.00 ms | 32.6% bf16 MFU | 123601 tok/s step 1612/19560 | loss 3.958377 (-1.38z)| norm 0.3561 (+0.31z)| lr 5.97e-04 | 4149.70 ms | 32.5% bf16 MFU | 123738 tok/s step 1613/19560 | loss 3.993985 (-0.44z)| norm 0.3326 (-0.29z)| lr 5.97e-04 | 4163.14 ms | 32.4% bf16 MFU | 123848 tok/s step 1614/19560 | loss 3.988480 (-0.58z)| norm 0.3140 (-0.76z)| lr 5.97e-04 | 4239.73 ms | 31.8% bf16 MFU | 123839 tok/s step 1615/19560 | loss 3.973841 (-0.95z)| norm 0.3196 (-0.60z)| lr 5.97e-04 | 4140.73 ms | 32.6% bf16 MFU | 123978 tok/s step 1616/19560 | loss 3.950844 (-1.54z)| norm 0.2871 (-1.43z)| lr 5.97e-04 | 4149.06 ms | 32.5% bf16 MFU | 124097 tok/s step 1617/19560 | loss 3.983489 (-0.67z)| norm 0.3166 (-0.66z)| lr 5.97e-04 | 4180.30 ms | 32.3% bf16 MFU | 124163 tok/s step 1618/19560 | loss 3.980319 (-0.74z)| norm 0.3249 (-0.44z)| lr 5.97e-04 | 4279.94 ms | 31.5% bf16 MFU | 124080 tok/s step 1619/19560 | loss 3.979851 (-0.74z)| norm 0.3476 (+0.14z)| lr 5.96e-04 | 4137.38 ms | 32.6% bf16 MFU | 124212 tok/s step 1620/19560 | loss 3.979923 (-0.73z)| norm 0.3386 (-0.11z)| lr 5.96e-04 | 4145.20 ms | 32.6% bf16 MFU | 124325 tok/s step 1621/19560 | loss 3.961355 (-1.21z)| norm 0.3383 (-0.13z)| lr 5.96e-04 | 4152.22 ms | 32.5% bf16 MFU | 124422 tok/s step 1622/19560 | loss 3.958911 (-1.26z)| norm 0.3237 (-0.53z)| lr 5.96e-04 | 4213.95 ms | 32.0% bf16 MFU | 124422 tok/s step 1623/19560 | loss 3.961621 (-1.17z)| norm 0.3206 (-0.62z)| lr 5.96e-04 | 4166.83 ms | 32.4% bf16 MFU | 124492 tok/s step 1624/19560 | loss 3.953599 (-1.36z)| norm 0.3251 (-0.50z)| lr 5.96e-04 | 4154.05 ms | 32.5% bf16 MFU | 124578 tok/s step 1625/19560 | loss 3.969232 (-0.94z)| norm 
0.3402 (-0.09z)| lr 5.96e-04 | 4233.05 ms | 31.9% bf16 MFU | 124542 tok/s step 1626/19560 | loss 3.948458 (-1.46z)| norm 0.3172 (-0.71z)| lr 5.96e-04 | 4149.64 ms | 32.5% bf16 MFU | 124632 tok/s step 1627/19560 | loss 3.911872 (-2.35z)| norm 0.3425 (-0.01z)| lr 5.96e-04 | 5141.63 ms | 26.3% bf16 MFU | 123499 tok/s step 1628/19560 | loss 3.940170 (-1.60z)| norm 0.3195 (-0.63z)| lr 5.96e-04 | 4569.71 ms | 29.5% bf16 MFU | 123061 tok/s step 1629/19560 | loss 3.997046 (-0.15z)| norm 0.2990 (-1.17z)| lr 5.96e-04 | 4180.37 ms | 32.3% bf16 MFU | 123178 tok/s step 1630/19560 | loss 3.931287 (-1.79z)| norm 0.3119 (-0.81z)| lr 5.96e-04 | 4142.11 ms | 32.6% bf16 MFU | 123348 tok/s step 1631/19560 | loss 3.965595 (-0.92z)| norm 0.3143 (-0.73z)| lr 5.96e-04 | 4155.31 ms | 32.5% bf16 MFU | 123489 tok/s step 1632/19560 | loss 3.975179 (-0.67z)| norm 0.2900 (-1.37z)| lr 5.96e-04 | 10421.17 ms | 13.0% bf16 MFU | 119830 tok/s step 1633/19560 | loss 3.939261 (-1.54z)| norm 0.3182 (-0.59z)| lr 5.96e-04 | 4515.82 ms | 29.9% bf16 MFU | 119644 tok/s step 1634/19560 | loss 4.016311 (+0.39z)| norm 0.3432 (+0.10z)| lr 5.96e-04 | 4153.89 ms | 32.5% bf16 MFU | 119973 tok/s step 1635/19560 | loss 3.979282 (-0.55z)| norm 0.4053 (+1.78z)| lr 5.96e-04 | 4283.97 ms | 31.5% bf16 MFU | 120093 tok/s step 1636/19560 | loss 3.925076 (-1.88z)| norm 0.4542 (+2.98z)| lr 5.96e-04 | 4163.76 ms | 32.4% bf16 MFU | 120384 tok/s step 1637/19560 | loss 4.122527 (+2.91z)| norm 0.3677 (+0.69z)| lr 5.96e-04 | 4146.82 ms | 32.6% bf16 MFU | 120687 tok/s step 1638/19560 | loss 3.996544 (-0.12z)| norm 0.3801 (+1.01z)| lr 5.96e-04 | 4235.99 ms | 31.9% bf16 MFU | 120841 tok/s step 1639/19560 | loss 3.939292 (-1.49z)| norm 0.3541 (+0.32z)| lr 5.96e-04 | 4134.57 ms | 32.7% bf16 MFU | 121139 tok/s step 1640/19560 | loss 4.006631 (+0.13z)| norm 0.3432 (+0.03z)| lr 5.96e-04 | 4389.88 ms | 30.8% bf16 MFU | 121054 tok/s step 1641/19560 | loss 3.953623 (-1.12z)| norm 0.3485 (+0.16z)| lr 5.96e-04 | 4276.55 ms | 31.6% bf16 MFU | 
121131 tok/s step 1642/19560 | loss 4.027565 (+0.66z)| norm 0.3082 (-0.89z)| lr 5.96e-04 | 4197.23 ms | 32.2% bf16 MFU | 121320 tok/s step 1643/19560 | loss 3.958876 (-0.98z)| norm 0.2895 (-1.37z)| lr 5.96e-04 | 4231.76 ms | 31.9% bf16 MFU | 121449 tok/s step 1644/19560 | loss 3.933790 (-1.57z)| norm 0.2922 (-1.29z)| lr 5.96e-04 | 4147.72 ms | 32.6% bf16 MFU | 121696 tok/s step 1645/19560 | loss 3.981158 (-0.43z)| norm 0.2753 (-1.71z)| lr 5.96e-04 | 4150.29 ms | 32.5% bf16 MFU | 121928 tok/s step 1646/19560 | loss 4.043295 (+1.05z)| norm 0.2905 (-1.29z)| lr 5.96e-04 | 4180.22 ms | 32.3% bf16 MFU | 122102 tok/s step 1647/19560 | loss 3.971283 (-0.67z)| norm 0.3114 (-0.75z)| lr 5.96e-04 | 4277.63 ms | 31.6% bf16 MFU | 122126 tok/s step 1648/19560 | loss 3.958315 (-0.97z)| norm 0.3206 (-0.50z)| lr 5.96e-04 | 4177.97 ms | 32.3% bf16 MFU | 122294 tok/s step 1649/19560 | loss 3.990814 (-0.18z)| norm 0.3610 (+0.70z)| lr 5.96e-04 | 4172.60 ms | 32.4% bf16 MFU | 122462 tok/s step 1650/19560 | loss 4.032125 (+0.82z)| norm 0.3706 (+0.98z)| lr 5.96e-04 | 4179.98 ms | 32.3% bf16 MFU | 122610 tok/s step 1651/19560 | loss 3.983444 (-0.35z)| norm 0.3588 (+0.63z)| lr 5.96e-04 | 4153.74 ms | 32.5% bf16 MFU | 122790 tok/s step 1652/19560 | loss 4.025780 (+0.69z)| norm 0.3464 (+0.25z)| lr 5.96e-04 | 4262.40 ms | 31.7% bf16 MFU | 122801 tok/s step 1653/19560 | loss 3.972930 (-0.59z)| norm 0.3430 (+0.14z)| lr 5.96e-04 | 4167.59 ms | 32.4% bf16 MFU | 122951 tok/s step 1654/19560 | loss 3.966910 (-0.73z)| norm 0.3883 (+1.48z)| lr 5.96e-04 | 4144.18 ms | 32.6% bf16 MFU | 123129 tok/s step 1655/19560 | loss 3.982316 (-0.35z)| norm 0.4099 (+2.06z)| lr 5.96e-04 | 4147.01 ms | 32.6% bf16 MFU | 123294 tok/s step 1656/19560 | loss 3.984148 (-0.30z)| norm 0.3750 (+1.03z)| lr 5.96e-04 | 4199.06 ms | 32.2% bf16 MFU | 123372 tok/s step 1657/19560 | loss 3.993425 (-0.06z)| norm 0.3231 (-0.49z)| lr 5.96e-04 | 4163.61 ms | 32.4% bf16 MFU | 123500 tok/s step 1658/19560 | loss 3.977388 (-0.48z)| norm 
0.3266 (-0.40z)| lr 5.96e-04 | 4152.22 ms | 32.5% bf16 MFU | 123638 tok/s step 1659/19560 | loss 3.953950 (-1.06z)| norm 0.3379 (-0.08z)| lr 5.96e-04 | 4158.24 ms | 32.5% bf16 MFU | 123760 tok/s step 1660/19560 | loss 3.998734 (+0.10z)| norm 0.3386 (-0.07z)| lr 5.96e-04 | 4162.27 ms | 32.4% bf16 MFU | 123870 tok/s step 1661/19560 | loss 3.958921 (-0.92z)| norm 0.3082 (-0.96z)| lr 5.96e-04 | 4195.54 ms | 32.2% bf16 MFU | 123925 tok/s step 1662/19560 | loss 3.904419 (-2.29z)| norm 0.3017 (-1.14z)| lr 5.96e-04 | 4155.05 ms | 32.5% bf16 MFU | 124038 tok/s step 1663/19560 | loss 3.940779 (-1.34z)| norm 0.2973 (-1.25z)| lr 5.96e-04 | 4151.61 ms | 32.5% bf16 MFU | 124150 tok/s step 1664/19560 | loss 3.978010 (-0.38z)| norm 0.3020 (-1.09z)| lr 5.96e-04 | 4154.70 ms | 32.5% bf16 MFU | 124252 tok/s step 1665/19560 | loss 3.962944 (-0.75z)| norm 0.2805 (-1.70z)| lr 5.96e-04 | 4146.93 ms | 32.6% bf16 MFU | 124361 tok/s step 1666/19560 | loss 3.984955 (-0.19z)| norm 0.2784 (-1.74z)| lr 5.96e-04 | 4168.97 ms | 32.4% bf16 MFU | 124431 tok/s step 1667/19560 | loss 4.043385 (+1.30z)| norm 0.3266 (-0.34z)| lr 5.96e-04 | 4158.85 ms | 32.5% bf16 MFU | 124513 tok/s step 1668/19560 | loss 3.965195 (-0.69z)| norm 0.3936 (+1.60z)| lr 5.96e-04 | 4163.38 ms | 32.4% bf16 MFU | 124583 tok/s step 1669/19560 | loss 4.029250 (+1.00z)| norm 0.3584 (+0.58z)| lr 5.96e-04 | 4175.34 ms | 32.3% bf16 MFU | 124633 tok/s step 1670/19560 | loss 3.938546 (-1.37z)| norm 0.3439 (+0.17z)| lr 5.96e-04 | 4169.32 ms | 32.4% bf16 MFU | 124689 tok/s step 1671/19560 | loss 4.023335 (+0.85z)| norm 0.3788 (+1.16z)| lr 5.96e-04 | 4157.41 ms | 32.5% bf16 MFU | 124760 tok/s step 1672/19560 | loss 3.965181 (-0.66z)| norm 0.3976 (+1.67z)| lr 5.96e-04 | 4165.14 ms | 32.4% bf16 MFU | 124815 tok/s step 1673/19560 | loss 3.932539 (-1.52z)| norm 0.3940 (+1.54z)| lr 5.96e-04 | 4182.15 ms | 32.3% bf16 MFU | 124843 tok/s step 1674/19560 | loss 3.959251 (-0.80z)| norm 0.3677 (+0.78z)| lr 5.96e-04 | 4151.92 ms | 32.5% bf16 MFU | 
124914 tok/s step 1675/19560 | loss 3.992509 (+0.10z)| norm 0.3848 (+1.25z)| lr 5.96e-04 | 4174.45 ms | 32.3% bf16 MFU | 124948 tok/s step 1676/19560 | loss 3.972776 (-0.43z)| norm 0.4086 (+1.88z)| lr 5.96e-04 | 4159.55 ms | 32.5% bf16 MFU | 125003 tok/s step 1677/19560 | loss 4.022384 (+0.89z)| norm 0.3166 (-0.66z)| lr 5.96e-04 | 4161.61 ms | 32.4% bf16 MFU | 125052 tok/s step 1678/19560 | loss 3.912120 (-2.00z)| norm 0.2988 (-1.13z)| lr 5.96e-04 | 4163.45 ms | 32.4% bf16 MFU | 125096 tok/s step 1679/19560 | loss 3.979399 (-0.23z)| norm 0.3028 (-1.02z)| lr 5.96e-04 | 4153.16 ms | 32.5% bf16 MFU | 125153 tok/s step 1680/19560 | loss 4.028877 (+1.12z)| norm 0.2991 (-1.11z)| lr 5.96e-04 | 4166.53 ms | 32.4% bf16 MFU | 125187 tok/s step 1681/19560 | loss 3.965183 (-0.60z)| norm 0.2802 (-1.62z)| lr 5.96e-04 | 4174.71 ms | 32.3% bf16 MFU | 125207 tok/s step 1682/19560 | loss 3.985761 (-0.03z)| norm 0.3381 (+0.03z)| lr 5.96e-04 | 4151.61 ms | 32.5% bf16 MFU | 125261 tok/s step 1683/19560 | loss 3.948493 (-1.04z)| norm 0.3523 (+0.43z)| lr 5.96e-04 | 4165.51 ms | 32.4% bf16 MFU | 125291 tok/s step 1684/19560 | loss 3.954549 (-0.86z)| norm 0.3694 (+0.90z)| lr 5.96e-04 | 4313.51 ms | 31.3% bf16 MFU | 125104 tok/s step 1685/19560 | loss 3.959321 (-0.72z)| norm 0.3331 (-0.13z)| lr 5.96e-04 | 4160.65 ms | 32.5% bf16 MFU | 125149 tok/s step 1686/19560 | loss 3.954930 (-0.83z)| norm 0.3019 (-1.00z)| lr 5.96e-04 | 4171.12 ms | 32.4% bf16 MFU | 125176 tok/s step 1687/19560 | loss 3.986111 (+0.03z)| norm 0.3153 (-0.62z)| lr 5.96e-04 | 4160.69 ms | 32.5% bf16 MFU | 125218 tok/s step 1688/19560 | loss 4.012789 (+0.78z)| norm 0.3059 (-0.88z)| lr 5.96e-04 | 4150.93 ms | 32.5% bf16 MFU | 125272 tok/s step 1689/19560 | loss 3.968760 (-0.43z)| norm 0.2813 (-1.55z)| lr 5.96e-04 | 4156.39 ms | 32.5% bf16 MFU | 125316 tok/s step 1690/19560 | loss 3.960939 (-0.64z)| norm 0.3050 (-0.87z)| lr 5.96e-04 | 4200.56 ms | 32.1% bf16 MFU | 125291 tok/s step 1691/19560 | loss 3.908818 (-2.07z)| norm 
0.3106 (-0.70z)| lr 5.96e-04 | 4165.77 ms | 32.4% bf16 MFU | 125319 tok/s step 1692/19560 | loss 3.872528 (-2.96z)| norm 0.3304 (-0.14z)| lr 5.96e-04 | 4150.05 ms | 32.5% bf16 MFU | 125370 tok/s step 1693/19560 | loss 3.949490 (-0.86z)| norm 0.3881 (+1.46z)| lr 5.96e-04 | 4436.61 ms | 30.4% bf16 MFU | 125010 tok/s step 1694/19560 | loss 3.932321 (-1.30z)| norm 0.3281 (-0.21z)| lr 5.96e-04 | 4203.06 ms | 32.1% bf16 MFU | 124996 tok/s step 1695/19560 | loss 3.972385 (-0.22z)| norm 0.3810 (+1.25z)| lr 5.96e-04 | 4190.82 ms | 32.2% bf16 MFU | 125002 tok/s step 1696/19560 | loss 3.961196 (-0.51z)| norm 0.3685 (+0.89z)| lr 5.96e-04 | 4146.88 ms | 32.6% bf16 MFU | 125073 tok/s step 1697/19560 | loss 3.971133 (-0.25z)| norm 0.3492 (+0.35z)| lr 5.96e-04 | 4253.49 ms | 31.7% bf16 MFU | 124983 tok/s step 1698/19560 | loss 3.936841 (-1.16z)| norm 0.3363 (-0.01z)| lr 5.96e-04 | 4146.45 ms | 32.6% bf16 MFU | 125056 tok/s step 1699/19560 | loss 3.945080 (-0.92z)| norm 0.2859 (-1.41z)| lr 5.96e-04 | 4147.64 ms | 32.6% bf16 MFU | 125123 tok/s step 1700/19560 | loss 4.012111 (+0.86z)| norm 0.2554 (-2.20z)| lr 5.96e-04 | 4159.49 ms | 32.5% bf16 MFU | 125169 tok/s step 1701/19560 | loss 3.934422 (-1.20z)| norm 0.4916 (+3.95z)| lr 5.96e-04 | 4166.32 ms | 32.4% bf16 MFU | 125203 tok/s step 1702/19560 | loss 3.908554 (-1.87z)| norm 0.3117 (-0.66z)| lr 5.96e-04 | 4152.16 ms | 32.5% bf16 MFU | 125256 tok/s step 1703/19560 | loss 3.951151 (-0.71z)| norm 0.3652 (+0.71z)| lr 5.96e-04 | 4153.62 ms | 32.5% bf16 MFU | 125304 tok/s step 1704/19560 | loss 3.922906 (-1.44z)| norm 0.3727 (+0.90z)| lr 5.96e-04 | 4158.31 ms | 32.5% bf16 MFU | 125343 tok/s step 1705/19560 | loss 3.990575 (+0.37z)| norm 0.3936 (+1.46z)| lr 5.96e-04 | 4234.11 ms | 31.9% bf16 MFU | 125267 tok/s step 1706/19560 | loss 4.029683 (+1.39z)| norm 0.4141 (+1.97z)| lr 5.96e-04 | 4178.76 ms | 32.3% bf16 MFU | 125277 tok/s step 1707/19560 | loss 3.964881 (-0.33z)| norm 0.3971 (+1.51z)| lr 5.96e-04 | 4175.19 ms | 32.3% bf16 MFU | 
125292 tok/s step 1708/19560 | loss 3.960980 (-0.42z)| norm 0.4204 (+2.05z)| lr 5.96e-04 | 4162.24 ms | 32.4% bf16 MFU | 125326 tok/s step 1709/19560 | loss 4.044741 (+1.82z)| norm 0.4292 (+2.21z)| lr 5.96e-04 | 4146.40 ms | 32.6% bf16 MFU | 125382 tok/s step 1710/19560 | loss 3.949861 (-0.71z)| norm 0.4543 (+2.73z)| lr 5.96e-04 | 4148.32 ms | 32.5% bf16 MFU | 125432 tok/s step 1711/19560 | loss 3.955071 (-0.56z)| norm 0.3872 (+1.10z)| lr 5.96e-04 | 4146.86 ms | 32.6% bf16 MFU | 125482 tok/s step 1712/19560 | loss 4.003808 (+0.78z)| norm 0.2865 (-1.29z)| lr 5.96e-04 | 4156.56 ms | 32.5% bf16 MFU | 125514 tok/s step 1713/19560 | loss 3.947887 (-0.75z)| norm 0.3306 (-0.24z)| lr 5.96e-04 | 4164.27 ms | 32.4% bf16 MFU | 125534 tok/s step 1714/19560 | loss 3.977344 (+0.08z)| norm 0.3109 (-0.71z)| lr 5.96e-04 | 4157.02 ms | 32.5% bf16 MFU | 125563 tok/s step 1715/19560 | loss 3.986314 (+0.35z)| norm 0.3122 (-0.68z)| lr 5.96e-04 | 4160.91 ms | 32.4% bf16 MFU | 125585 tok/s step 1716/19560 | loss 3.951044 (-0.64z)| norm 0.2861 (-1.28z)| lr 5.96e-04 | 4147.21 ms | 32.6% bf16 MFU | 125627 tok/s step 1717/19560 | loss 3.932635 (-1.15z)| norm 0.3026 (-0.88z)| lr 5.96e-04 | 4171.07 ms | 32.4% bf16 MFU | 125630 tok/s step 1718/19560 | loss 3.962674 (-0.30z)| norm 0.3242 (-0.38z)| lr 5.96e-04 | 4159.98 ms | 32.5% bf16 MFU | 125650 tok/s step 1719/19560 | loss 3.969293 (-0.11z)| norm 0.3276 (-0.31z)| lr 5.96e-04 | 4152.21 ms | 32.5% bf16 MFU | 125681 tok/s step 1720/19560 | loss 3.947496 (-0.72z)| norm 0.3326 (-0.20z)| lr 5.96e-04 | 4185.80 ms | 32.3% bf16 MFU | 125660 tok/s step 1721/19560 | loss 3.951978 (-0.58z)| norm 0.3276 (-0.31z)| lr 5.96e-04 | 4177.93 ms | 32.3% bf16 MFU | 125651 tok/s step 1722/19560 | loss 3.964921 (-0.20z)| norm 0.3153 (-0.59z)| lr 5.96e-04 | 4180.78 ms | 32.3% bf16 MFU | 125639 tok/s step 1723/19560 | loss 3.920582 (-1.46z)| norm 0.2991 (-0.97z)| lr 5.96e-04 | 4176.14 ms | 32.3% bf16 MFU | 125634 tok/s step 1724/19560 | loss 4.026864 (+1.59z)| norm 
0.3202 (-0.45z)| lr 5.96e-04 | 4162.75 ms | 32.4% bf16 MFU | 125650 tok/s step 1725/19560 | loss 3.946159 (-0.74z)| norm 0.3162 (-0.54z)| lr 5.96e-04 | 4145.77 ms | 32.6% bf16 MFU | 125690 tok/s step 1726/19560 | loss 4.010299 (+1.10z)| norm 0.3662 (+0.68z)| lr 5.96e-04 | 4155.44 ms | 32.5% bf16 MFU | 125714 tok/s step 1727/19560 | loss 3.952645 (-0.55z)| norm 0.3273 (-0.28z)| lr 5.96e-04 | 4156.59 ms | 32.5% bf16 MFU | 125735 tok/s step 1728/19560 | loss 3.995789 (+0.68z)| norm 0.3153 (-0.58z)| lr 5.96e-04 | 4147.02 ms | 32.6% bf16 MFU | 125770 tok/s step 1729/19560 | loss 3.903913 (-1.92z)| norm 0.3169 (-0.54z)| lr 5.96e-04 | 4158.11 ms | 32.5% bf16 MFU | 125786 tok/s step 1730/19560 | loss 3.875941 (-2.62z)| norm 0.3417 (+0.06z)| lr 5.96e-04 | 4150.75 ms | 32.5% bf16 MFU | 125812 tok/s step 1731/19560 | loss 3.918693 (-1.41z)| norm 0.3773 (+0.93z)| lr 5.96e-04 | 4178.33 ms | 32.3% bf16 MFU | 125795 tok/s step 1732/19560 | loss 3.974177 (+0.10z)| norm 0.3331 (-0.16z)| lr 5.96e-04 | 4171.22 ms | 32.4% bf16 MFU | 125790 tok/s step 1733/19560 | loss 3.923002 (-1.29z)| norm 0.2899 (-1.21z)| lr 5.96e-04 | 4143.10 ms | 32.6% bf16 MFU | 125828 tok/s step 1734/19560 | loss 3.951641 (-0.50z)| norm 0.2858 (-1.29z)| lr 5.96e-04 | 4293.20 ms | 31.4% bf16 MFU | 125643 tok/s step 1735/19560 | loss 3.964084 (-0.15z)| norm 0.2867 (-1.24z)| lr 5.96e-04 | 4199.64 ms | 32.1% bf16 MFU | 125602 tok/s step 1736/19560 | loss 3.957166 (-0.33z)| norm 0.2896 (-1.15z)| lr 5.96e-04 | 4143.87 ms | 32.6% bf16 MFU | 125648 tok/s step 1737/19560 | loss 3.988881 (+0.56z)| norm 0.2965 (-0.97z)| lr 5.96e-04 | 4159.35 ms | 32.5% bf16 MFU | 125669 tok/s step 1738/19560 | loss 3.905266 (-1.74z)| norm 0.3119 (-0.59z)| lr 5.96e-04 | 4144.67 ms | 32.6% bf16 MFU | 125710 tok/s step 1739/19560 | loss 3.931042 (-1.01z)| norm 0.3086 (-0.66z)| lr 5.96e-04 | 4154.19 ms | 32.5% bf16 MFU | 125735 tok/s step 1740/19560 | loss 3.918213 (-1.35z)| norm 0.2990 (-0.88z)| lr 5.96e-04 | 4204.09 ms | 32.1% bf16 MFU | 
125684 tok/s step 1741/19560 | loss 3.942061 (-0.69z)| norm 0.3134 (-0.53z)| lr 5.96e-04 | 4149.66 ms | 32.5% bf16 MFU | 125717 tok/s step 1742/19560 | loss 3.930836 (-0.98z)| norm 0.3294 (-0.14z)| lr 5.96e-04 | 4246.03 ms | 31.8% bf16 MFU | 125605 tok/s step 1743/19560 | loss 3.948803 (-0.48z)| norm 0.3978 (+1.48z)| lr 5.95e-04 | 4174.27 ms | 32.3% bf16 MFU | 125604 tok/s step 1744/19560 | loss 3.987686 (+0.57z)| norm 0.4353 (+2.31z)| lr 5.95e-04 | 4150.03 ms | 32.5% bf16 MFU | 125641 tok/s step 1745/19560 | loss 4.017683 (+1.37z)| norm 0.3719 (+0.81z)| lr 5.95e-04 | 4155.20 ms | 32.5% bf16 MFU | 125668 tok/s step 1746/19560 | loss 4.012960 (+1.23z)| norm 0.3557 (+0.42z)| lr 5.95e-04 | 4162.18 ms | 32.4% bf16 MFU | 125682 tok/s step 1747/19560 | loss 3.998608 (+0.84z)| norm 0.3447 (+0.16z)| lr 5.95e-04 | 4277.27 ms | 31.6% bf16 MFU | 125527 tok/s step 1748/19560 | loss 3.958799 (-0.23z)| norm 0.3207 (-0.40z)| lr 5.95e-04 | 4161.84 ms | 32.4% bf16 MFU | 125549 tok/s step 1749/19560 | loss 4.019256 (+1.37z)| norm 0.3151 (-0.53z)| lr 5.95e-04 | 4162.03 ms | 32.4% bf16 MFU | 125570 tok/s step 1750/19560 | loss 3.937516 (-0.80z)| norm 0.3176 (-0.47z)| lr 5.95e-04 | 4156.57 ms | 32.5% bf16 MFU | 125599 tok/s val loss 3.929217 
HellaSwag: 2628/10042 = 0.261701 step 1751/19560 | loss 3.888037 (-2.06z)| norm 0.3104 (-0.63z)| lr 5.95e-04 | 4518.23 ms | 29.9% bf16 MFU | 125121 tok/s step 1752/19560 | loss 3.886413 (-2.06z)| norm 0.2896 (-1.11z)| lr 5.95e-04 | 4157.28 ms | 32.5% bf16 MFU | 125170 tok/s step 1753/19560 | loss 3.943183 (-0.59z)| norm 0.2944 (-0.99z)| lr 5.95e-04 | 4153.60 ms | 32.5% bf16 MFU | 125223 tok/s step 1754/19560 | loss 3.942380 (-0.61z)| norm 0.2914 (-1.05z)| lr 5.95e-04 | 4161.02 ms | 32.4% bf16 MFU | 125262 tok/s step 1755/19560 | loss 3.917696 (-1.25z)| norm 0.3300 (-0.15z)| lr 5.95e-04 | 4193.30 ms | 32.2% bf16 MFU | 125250 tok/s step 1756/19560 | loss 3.959876 (-0.17z)| norm 0.4798 (+3.17z)| lr 5.95e-04 | 4143.58 ms | 32.6% bf16 MFU | 125314 tok/s step 1757/19560 | loss 3.983310 (+0.44z)| norm 0.3249 (-0.29z)| lr 
5.95e-04 | 4155.62 ms | 32.5% bf16 MFU | 125357 tok/s step 1758/19560 | loss 3.994614 (+0.72z)| norm 0.3488 (+0.24z)| lr 5.95e-04 | 4349.81 ms | 31.0% bf16 MFU | 125115 tok/s step 1759/19560 | loss 3.930358 (-0.93z)| norm 0.3748 (+0.81z)| lr 5.95e-04 | 4162.21 ms | 32.4% bf16 MFU | 125158 tok/s step 1760/19560 | loss 3.957759 (-0.22z)| norm 0.3880 (+1.08z)| lr 5.95e-04 | 4167.27 ms | 32.4% bf16 MFU | 125191 tok/s step 1761/19560 | loss 3.930505 (-0.92z)| norm 0.3770 (+0.83z)| lr 5.95e-04 | 4300.91 ms | 31.4% bf16 MFU | 125026 tok/s step 1762/19560 | loss 3.989680 (+0.61z)| norm 0.3227 (-0.38z)| lr 5.95e-04 | 4246.55 ms | 31.8% bf16 MFU | 124948 tok/s step 1763/19560 | loss 3.993615 (+0.71z)| norm 0.3245 (-0.33z)| lr 5.95e-04 | 4249.07 ms | 31.8% bf16 MFU | 124870 tok/s step 1764/19560 | loss 4.008709 (+1.08z)| norm 0.3175 (-0.47z)| lr 5.95e-04 | 4203.75 ms | 32.1% bf16 MFU | 124862 tok/s step 1765/19560 | loss 3.978428 (+0.35z)| norm 0.2968 (-0.94z)| lr 5.95e-04 | 4193.01 ms | 32.2% bf16 MFU | 124871 tok/s step 1766/19560 | loss 3.925528 (-1.09z)| norm 0.3121 (-0.57z)| lr 5.95e-04 | 4154.97 ms | 32.5% bf16 MFU | 124937 tok/s step 1767/19560 | loss 3.956080 (-0.26z)| norm 0.3413 (+0.10z)| lr 5.95e-04 | 4226.20 ms | 31.9% bf16 MFU | 124893 tok/s step 1768/19560 | loss 3.958418 (-0.18z)| norm 0.3345 (-0.05z)| lr 5.95e-04 | 4161.14 ms | 32.4% bf16 MFU | 124948 tok/s step 1769/19560 | loss 3.952976 (-0.33z)| norm 0.3422 (+0.13z)| lr 5.95e-04 | 4362.99 ms | 30.9% bf16 MFU | 124709 tok/s step 1770/19560 | loss 3.934230 (-0.84z)| norm 0.3207 (-0.37z)| lr 5.95e-04 | 4204.98 ms | 32.1% bf16 MFU | 124708 tok/s step 1771/19560 | loss 3.886235 (-2.13z)| norm 0.3637 (+0.61z)| lr 5.95e-04 | 4159.85 ms | 32.5% bf16 MFU | 124774 tok/s step 1772/19560 | loss 3.907478 (-1.53z)| norm 0.3863 (+1.12z)| lr 5.95e-04 | 4190.98 ms | 32.2% bf16 MFU | 124790 tok/s step 1773/19560 | loss 3.944726 (-0.50z)| norm 0.3710 (+0.75z)| lr 5.95e-04 | 4141.93 ms | 32.6% bf16 MFU | 124880 tok/s step 
1774/19560 | loss 3.942379 (-0.56z)| norm 0.3370 (-0.05z)| lr 5.95e-04 | 4249.68 ms | 31.8% bf16 MFU | 124804 tok/s step 1775/19560 | loss 3.978445 (+0.44z)| norm 0.3177 (-0.51z)| lr 5.95e-04 | 4251.78 ms | 31.8% bf16 MFU | 124730 tok/s step 1776/19560 | loss 3.916198 (-1.27z)| norm 0.2853 (-1.25z)| lr 5.95e-04 | 4153.85 ms | 32.5% bf16 MFU | 124804 tok/s step 1777/19560 | loss 3.962707 (+0.02z)| norm 0.2990 (-0.92z)| lr 5.95e-04 | 4179.52 ms | 32.3% bf16 MFU | 124836 tok/s step 1778/19560 | loss 3.944798 (-0.46z)| norm 0.3111 (-0.63z)| lr 5.95e-04 | 4156.48 ms | 32.5% bf16 MFU | 124901 tok/s step 1779/19560 | loss 3.920120 (-1.14z)| norm 0.3265 (-0.26z)| lr 5.95e-04 | 4152.64 ms | 32.5% bf16 MFU | 124969 tok/s step 1780/19560 | loss 3.995140 (+0.98z)| norm 0.3179 (-0.46z)| lr 5.95e-04 | 4166.19 ms | 32.4% bf16 MFU | 125012 tok/s step 1781/19560 | loss 3.924067 (-1.01z)| norm 0.2717 (-1.51z)| lr 5.95e-04 | 4157.79 ms | 32.5% bf16 MFU | 125067 tok/s step 1782/19560 | loss 3.938289 (-0.61z)| norm 0.2853 (-1.18z)| lr 5.95e-04 | 4160.90 ms | 32.4% bf16 MFU | 125113 tok/s step 1783/19560 | loss 3.952399 (-0.20z)| norm 0.2824 (-1.23z)| lr 5.95e-04 | 4173.18 ms | 32.4% bf16 MFU | 125139 tok/s step 1784/19560 | loss 3.911668 (-1.33z)| norm 0.2806 (-1.25z)| lr 5.95e-04 | 4192.95 ms | 32.2% bf16 MFU | 125134 tok/s step 1785/19560 | loss 3.968156 (+0.26z)| norm 0.2801 (-1.24z)| lr 5.95e-04 | 4166.63 ms | 32.4% bf16 MFU | 125169 tok/s step 1786/19560 | loss 3.983905 (+0.70z)| norm 0.2881 (-1.05z)| lr 5.95e-04 | 4158.04 ms | 32.5% bf16 MFU | 125215 tok/s step 1787/19560 | loss 3.910109 (-1.35z)| norm 0.2995 (-0.78z)| lr 5.95e-04 | 4172.85 ms | 32.4% bf16 MFU | 125237 tok/s step 1788/19560 | loss 3.929329 (-0.80z)| norm 0.3244 (-0.21z)| lr 5.95e-04 | 4167.79 ms | 32.4% bf16 MFU | 125265 tok/s step 1789/19560 | loss 3.883992 (-2.02z)| norm 0.3375 (+0.09z)| lr 5.95e-04 | 4178.25 ms | 32.3% bf16 MFU | 125275 tok/s step 1790/19560 | loss 3.886873 (-1.92z)| norm 0.3109 (-0.52z)| lr 
5.95e-04 | 4164.45 ms | 32.4% bf16 MFU | 125306 tok/s step 1791/19560 | loss 3.944897 (-0.34z)| norm 0.3233 (-0.24z)| lr 5.95e-04 | 4152.53 ms | 32.5% bf16 MFU | 125354 tok/s step 1792/19560 | loss 3.898477 (-1.58z)| norm 0.3348 (+0.01z)| lr 5.95e-04 | 4155.25 ms | 32.5% bf16 MFU | 125395 tok/s step 1793/19560 | loss 3.894263 (-1.66z)| norm 0.3554 (+0.48z)| lr 5.95e-04 | 4154.85 ms | 32.5% bf16 MFU | 125435 tok/s step 1794/19560 | loss 3.907652 (-1.28z)| norm 0.3434 (+0.19z)| lr 5.95e-04 | 4174.26 ms | 32.3% bf16 MFU | 125443 tok/s step 1795/19560 | loss 3.931591 (-0.63z)| norm 0.3417 (+0.15z)| lr 5.95e-04 | 4161.92 ms | 32.4% bf16 MFU | 125469 tok/s step 1796/19560 | loss 3.970122 (+0.42z)| norm 0.3681 (+0.77z)| lr 5.95e-04 | 4158.73 ms | 32.5% bf16 MFU | 125499 tok/s step 1797/19560 | loss 3.963978 (+0.27z)| norm 0.3526 (+0.41z)| lr 5.95e-04 | 4165.95 ms | 32.4% bf16 MFU | 125517 tok/s step 1798/19560 | loss 3.957798 (+0.09z)| norm 0.3603 (+0.58z)| lr 5.95e-04 | 4289.55 ms | 31.5% bf16 MFU | 125352 tok/s step 1799/19560 | loss 3.919679 (-0.95z)| norm 0.3999 (+1.50z)| lr 5.95e-04 | 4278.67 ms | 31.6% bf16 MFU | 125211 tok/s step 1800/19560 | loss 3.972484 (+0.53z)| norm 0.3196 (-0.36z)| lr 5.95e-04 | 4213.44 ms | 32.0% bf16 MFU | 125172 tok/s step 1801/19560 | loss 3.945802 (-0.22z)| norm 0.3029 (-0.74z)| lr 5.95e-04 | 4154.41 ms | 32.5% bf16 MFU | 125224 tok/s step 1802/19560 | loss 3.973099 (+0.54z)| norm 0.2965 (-0.87z)| lr 5.95e-04 | 4159.83 ms | 32.5% bf16 MFU | 125264 tok/s step 1803/19560 | loss 3.914977 (-1.07z)| norm 0.3039 (-0.69z)| lr 5.95e-04 | 4203.61 ms | 32.1% bf16 MFU | 125237 tok/s step 1804/19560 | loss 3.902676 (-1.39z)| norm 0.3077 (-0.58z)| lr 5.95e-04 | 4152.30 ms | 32.5% bf16 MFU | 125289 tok/s step 1805/19560 | loss 3.911478 (-1.13z)| norm 0.3076 (-0.59z)| lr 5.95e-04 | 4152.16 ms | 32.5% bf16 MFU | 125338 tok/s step 1806/19560 | loss 3.975297 (+0.65z)| norm 0.3190 (-0.32z)| lr 5.95e-04 | 4380.75 ms | 30.8% bf16 MFU | 125055 tok/s step 
1807/19560 | loss 3.906882 (-1.26z)| norm 0.3143 (-0.43z)| lr 5.95e-04 | 4157.42 ms | 32.5% bf16 MFU | 125108 tok/s step 1808/19560 | loss 3.939861 (-0.32z)| norm 0.3302 (-0.06z)| lr 5.95e-04 | 4287.48 ms | 31.5% bf16 MFU | 124966 tok/s step 1809/19560 | loss 3.964920 (+0.40z)| norm 0.3686 (+0.85z)| lr 5.95e-04 | 4156.46 ms | 32.5% bf16 MFU | 125025 tok/s step 1810/19560 | loss 3.898650 (-1.47z)| norm 0.3507 (+0.42z)| lr 5.95e-04 | 4259.86 ms | 31.7% bf16 MFU | 124928 tok/s step 1811/19560 | loss 3.999707 (+1.38z)| norm 0.3337 (+0.01z)| lr 5.95e-04 | 4161.44 ms | 32.4% bf16 MFU | 124981 tok/s step 1812/19560 | loss 3.943436 (-0.20z)| norm 0.3060 (-0.65z)| lr 5.95e-04 | 4224.87 ms | 32.0% bf16 MFU | 124936 tok/s step 1813/19560 | loss 3.997723 (+1.31z)| norm 0.3812 (+1.16z)| lr 5.95e-04 | 4211.40 ms | 32.1% bf16 MFU | 124914 tok/s step 1814/19560 | loss 3.927725 (-0.65z)| norm 0.3517 (+0.44z)| lr 5.95e-04 | 4165.44 ms | 32.4% bf16 MFU | 124962 tok/s step 1815/19560 | loss 3.954565 (+0.11z)| norm 0.3443 (+0.25z)| lr 5.95e-04 | 4306.64 ms | 31.4% bf16 MFU | 124801 tok/s step 1816/19560 | loss 3.960309 (+0.29z)| norm 0.4525 (+2.75z)| lr 5.95e-04 | 4212.48 ms | 32.1% bf16 MFU | 124784 tok/s step 1817/19560 | loss 3.882025 (-1.89z)| norm 0.3742 (+0.91z)| lr 5.95e-04 | 4140.80 ms | 32.6% bf16 MFU | 124875 tok/s step 1818/19560 | loss 3.903895 (-1.26z)| norm 0.3573 (+0.50z)| lr 5.95e-04 | 4156.88 ms | 32.5% bf16 MFU | 124938 tok/s step 1819/19560 | loss 3.939852 (-0.26z)| norm 0.2949 (-0.96z)| lr 5.95e-04 | 4144.19 ms | 32.6% bf16 MFU | 125016 tok/s step 1820/19560 | loss 3.900950 (-1.37z)| norm 0.2993 (-0.85z)| lr 5.95e-04 | 5970.28 ms | 22.6% bf16 MFU | 123156 tok/s step 1821/19560 | loss 4.069125 (+3.22z)| norm 0.3122 (-0.54z)| lr 5.95e-04 | 4165.64 ms | 32.4% bf16 MFU | 123292 tok/s step 1822/19560 | loss 3.913929 (-0.98z)| norm 0.3263 (-0.21z)| lr 5.95e-04 | 4350.02 ms | 31.0% bf16 MFU | 123153 tok/s step 1823/19560 | loss 3.870585 (-2.10z)| norm 0.3396 (+0.12z)| lr 
5.95e-04 | 4189.54 ms | 32.2% bf16 MFU | 123253 tok/s step 1824/19560 | loss 3.852088 (-2.51z)| norm 0.3303 (-0.10z)| lr 5.95e-04 | 4181.68 ms | 32.3% bf16 MFU | 123359 tok/s step 1825/19560 | loss 3.817025 (-3.25z)| norm 0.2831 (-1.20z)| lr 5.95e-04 | 4170.24 ms | 32.4% bf16 MFU | 123477 tok/s step 1826/19560 | loss 3.796564 (-3.54z)| norm 0.2935 (-0.94z)| lr 5.95e-04 | 4179.68 ms | 32.3% bf16 MFU | 123575 tok/s step 1827/19560 | loss 3.872019 (-1.72z)| norm 0.2856 (-1.12z)| lr 5.95e-04 | 4169.14 ms | 32.4% bf16 MFU | 123684 tok/s step 1828/19560 | loss 3.878104 (-1.56z)| norm 0.2994 (-0.82z)| lr 5.95e-04 | 4237.90 ms | 31.9% bf16 MFU | 123686 tok/s step 1829/19560 | loss 3.889158 (-1.28z)| norm 0.3159 (-0.42z)| lr 5.95e-04 | 4167.55 ms | 32.4% bf16 MFU | 123791 tok/s step 1830/19560 | loss 3.964956 (+0.47z)| norm 0.2915 (-1.02z)| lr 5.95e-04 | 4165.46 ms | 32.4% bf16 MFU | 123895 tok/s step 1831/19560 | loss 3.834503 (-2.48z)| norm 0.2927 (-0.98z)| lr 5.95e-04 | 4171.90 ms | 32.4% bf16 MFU | 123984 tok/s step 1832/19560 | loss 3.808453 (-2.95z)| norm 0.2996 (-0.79z)| lr 5.95e-04 | 4165.69 ms | 32.4% bf16 MFU | 124078 tok/s step 1833/19560 | loss 3.926520 (-0.35z)| norm 0.3231 (-0.19z)| lr 5.95e-04 | 4164.68 ms | 32.4% bf16 MFU | 124168 tok/s step 1834/19560 | loss 3.789089 (-3.24z)| norm 0.3029 (-0.69z)| lr 5.95e-04 | 4193.09 ms | 32.2% bf16 MFU | 124212 tok/s step 1835/19560 | loss 3.901413 (-0.83z)| norm 0.2985 (-0.79z)| lr 5.95e-04 | 4171.73 ms | 32.4% bf16 MFU | 124285 tok/s step 1836/19560 | loss 3.960675 (+0.44z)| norm 0.2971 (-0.82z)| lr 5.95e-04 | 4158.83 ms | 32.5% bf16 MFU | 124374 tok/s step 1837/19560 | loss 3.977913 (+0.83z)| norm 0.2808 (-1.25z)| lr 5.95e-04 | 4181.87 ms | 32.3% bf16 MFU | 124424 tok/s step 1838/19560 | loss 3.832726 (-2.26z)| norm 0.2938 (-0.90z)| lr 5.95e-04 | 4178.86 ms | 32.3% bf16 MFU | 124476 tok/s step 1839/19560 | loss 3.882315 (-1.18z)| norm 0.3128 (-0.35z)| lr 5.95e-04 | 4206.85 ms | 32.1% bf16 MFU | 124483 tok/s step 
1840/19560 | loss 3.919460 (-0.38z)| norm 0.2751 (-1.43z)| lr 5.95e-04 | 4159.44 ms | 32.5% bf16 MFU | 124561 tok/s step 1841/19560 | loss 3.850638 (-1.81z)| norm 0.3080 (-0.48z)| lr 5.95e-04 | 4160.64 ms | 32.5% bf16 MFU | 124634 tok/s step 1842/19560 | loss 3.880308 (-1.17z)| norm 0.3650 (+1.13z)| lr 5.95e-04 | 4173.77 ms | 32.3% bf16 MFU | 124683 tok/s step 1843/19560 | loss 3.932827 (-0.06z)| norm 0.4752 (+3.96z)| lr 5.95e-04 | 4186.42 ms | 32.3% bf16 MFU | 124711 tok/s step 1844/19560 | loss 3.892700 (-0.89z)| norm 0.4371 (+2.83z)| lr 5.95e-04 | 4172.91 ms | 32.4% bf16 MFU | 124757 tok/s step 1845/19560 | loss 3.898599 (-0.76z)| norm 0.3812 (+1.36z)| lr 5.95e-04 | 4162.53 ms | 32.4% bf16 MFU | 124817 tok/s step 1846/19560 | loss 3.905204 (-0.61z)| norm 0.3887 (+1.53z)| lr 5.95e-04 | 4177.06 ms | 32.3% bf16 MFU | 124852 tok/s step 1847/19560 | loss 3.889998 (-0.92z)| norm 0.3646 (+0.90z)| lr 5.95e-04 | 4178.12 ms | 32.3% bf16 MFU | 124884 tok/s step 1848/19560 | loss 3.907832 (-0.54z)| norm 0.3285 (-0.01z)| lr 5.95e-04 | 4155.58 ms | 32.5% bf16 MFU | 124948 tok/s step 1849/19560 | loss 3.872543 (-1.26z)| norm 0.3004 (-0.72z)| lr 5.95e-04 | 4173.33 ms | 32.4% bf16 MFU | 124982 tok/s step 1850/19560 | loss 3.871926 (-1.25z)| norm 0.2839 (-1.13z)| lr 5.95e-04 | 4180.60 ms | 32.3% bf16 MFU | 125003 tok/s step 1851/19560 | loss 3.925881 (-0.13z)| norm 0.3002 (-0.72z)| lr 5.95e-04 | 4199.00 ms | 32.2% bf16 MFU | 124996 tok/s step 1852/19560 | loss 3.989197 (+1.20z)| norm 0.2994 (-0.73z)| lr 5.95e-04 | 4155.79 ms | 32.5% bf16 MFU | 125054 tok/s step 1853/19560 | loss 3.836870 (-1.94z)| norm 0.2917 (-0.92z)| lr 5.94e-04 | 4168.27 ms | 32.4% bf16 MFU | 125090 tok/s step 1854/19560 | loss 3.924479 (-0.12z)| norm 0.3136 (-0.36z)| lr 5.94e-04 | 4160.82 ms | 32.4% bf16 MFU | 125136 tok/s step 1855/19560 | loss 3.962477 (+0.67z)| norm 0.3384 (+0.26z)| lr 5.94e-04 | 4170.19 ms | 32.4% bf16 MFU | 125165 tok/s step 1856/19560 | loss 3.841833 (-1.81z)| norm 0.3300 (+0.05z)| lr 
5.94e-04 | 4189.43 ms | 32.2% bf16 MFU | 125164 tok/s step 1857/19560 | loss 3.940444 (+0.23z)| norm 0.3011 (-0.68z)| lr 5.94e-04 | 4172.42 ms | 32.4% bf16 MFU | 125189 tok/s step 1858/19560 | loss 3.934788 (+0.10z)| norm 0.3306 (+0.07z)| lr 5.94e-04 | 4174.30 ms | 32.3% bf16 MFU | 125210 tok/s step 1859/19560 | loss 3.867156 (-1.29z)| norm 0.3432 (+0.40z)| lr 5.94e-04 | 4174.94 ms | 32.3% bf16 MFU | 125228 tok/s step 1860/19560 | loss 3.889593 (-0.81z)| norm 0.3591 (+0.79z)| lr 5.94e-04 | 4161.59 ms | 32.4% bf16 MFU | 125266 tok/s step 1861/19560 | loss 3.905509 (-0.48z)| norm 0.3995 (+1.77z)| lr 5.94e-04 | 4184.03 ms | 32.3% bf16 MFU | 125268 tok/s step 1862/19560 | loss 3.856431 (-1.47z)| norm 0.3996 (+1.74z)| lr 5.94e-04 | 4196.77 ms | 32.2% bf16 MFU | 125251 tok/s step 1863/19560 | loss 3.806340 (-2.42z)| norm 0.3065 (-0.58z)| lr 5.94e-04 | 4179.08 ms | 32.3% bf16 MFU | 125261 tok/s step 1864/19560 | loss 3.926324 (-0.00z)| norm 0.3031 (-0.67z)| lr 5.94e-04 | 4169.68 ms | 32.4% bf16 MFU | 125285 tok/s step 1865/19560 | loss 3.900133 (-0.52z)| norm 0.2778 (-1.29z)| lr 5.94e-04 | 4166.46 ms | 32.4% bf16 MFU | 125312 tok/s step 1866/19560 | loss 3.912429 (-0.27z)| norm 0.2744 (-1.36z)| lr 5.94e-04 | 4208.35 ms | 32.1% bf16 MFU | 125276 tok/s step 1867/19560 | loss 4.024815 (+1.96z)| norm 0.2949 (-0.85z)| lr 5.94e-04 | 4203.87 ms | 32.1% bf16 MFU | 125248 tok/s step 1868/19560 | loss 3.837057 (-1.75z)| norm 0.3078 (-0.53z)| lr 5.94e-04 | 4188.43 ms | 32.2% bf16 MFU | 125244 tok/s step 1869/19560 | loss 3.828139 (-1.88z)| norm 0.3171 (-0.30z)| lr 5.94e-04 | 4170.22 ms | 32.4% bf16 MFU | 125268 tok/s step 1870/19560 | loss 3.864013 (-1.17z)| norm 0.3566 (+0.67z)| lr 5.94e-04 | 4181.29 ms | 32.3% bf16 MFU | 125274 tok/s step 1871/19560 | loss 3.915440 (-0.17z)| norm 0.3744 (+1.12z)| lr 5.94e-04 | 4167.96 ms | 32.4% bf16 MFU | 125300 tok/s step 1872/19560 | loss 3.855126 (-1.32z)| norm 0.3536 (+0.63z)| lr 5.94e-04 | 4204.71 ms | 32.1% bf16 MFU | 125269 tok/s step 
1873/19560 | loss 3.869307 (-1.03z)| norm 0.3078 (-0.52z)| lr 5.94e-04 | 4304.69 ms | 31.4% bf16 MFU | 125096 tok/s step 1874/19560 | loss 3.918834 (-0.05z)| norm 0.3091 (-0.48z)| lr 5.94e-04 | 4174.50 ms | 32.3% bf16 MFU | 125121 tok/s step 1875/19560 | loss 3.881515 (-0.77z)| norm 0.3141 (-0.35z)| lr 5.94e-04 | 4195.34 ms | 32.2% bf16 MFU | 125113 tok/s step 1876/19560 | loss 3.852758 (-1.32z)| norm 0.3376 (+0.25z)| lr 5.94e-04 | 4189.73 ms | 32.2% bf16 MFU | 125114 tok/s step 1877/19560 | loss 3.869260 (-0.99z)| norm 0.3154 (-0.32z)| lr 5.94e-04 | 4202.31 ms | 32.1% bf16 MFU | 125097 tok/s step 1878/19560 | loss 3.832567 (-1.69z)| norm 0.2851 (-1.08z)| lr 5.94e-04 | 4167.35 ms | 32.4% bf16 MFU | 125132 tok/s step 1879/19560 | loss 3.810633 (-2.08z)| norm 0.2703 (-1.44z)| lr 5.94e-04 | 4190.25 ms | 32.2% bf16 MFU | 125132 tok/s step 1880/19560 | loss 4.021516 (+1.99z)| norm 0.2904 (-0.93z)| lr 5.94e-04 | 4187.50 ms | 32.2% bf16 MFU | 125135 tok/s step 1881/19560 | loss 3.860485 (-1.09z)| norm 0.2931 (-0.86z)| lr 5.94e-04 | 4271.14 ms | 31.6% bf16 MFU | 125016 tok/s step 1882/19560 | loss 3.862667 (-1.03z)| norm 0.2586 (-1.72z)| lr 5.94e-04 | 4244.77 ms | 31.8% bf16 MFU | 124941 tok/s step 1883/19560 | loss 3.816898 (-1.87z)| norm 0.2667 (-1.49z)| lr 5.94e-04 | 4194.70 ms | 32.2% bf16 MFU | 124943 tok/s step 1884/19560 | loss 4.015663 (+1.85z)| norm 0.3101 (-0.40z)| lr 5.94e-04 | 4193.84 ms | 32.2% bf16 MFU | 124947 tok/s step 1885/19560 | loss 3.911748 (-0.08z)| norm 0.3688 (+1.14z)| lr 5.94e-04 | 4175.04 ms | 32.3% bf16 MFU | 124978 tok/s step 1886/19560 | loss 3.832978 (-1.53z)| norm 0.3367 (+0.30z)| lr 5.94e-04 | 4189.56 ms | 32.2% bf16 MFU | 124986 tok/s step 1887/19560 | loss 3.870859 (-0.81z)| norm 0.3184 (-0.17z)| lr 5.94e-04 | 4158.09 ms | 32.5% bf16 MFU | 125042 tok/s step 1888/19560 | loss 3.860990 (-0.98z)| norm 0.3114 (-0.35z)| lr 5.94e-04 | 4418.37 ms | 30.6% bf16 MFU | 124722 tok/s step 1889/19560 | loss 3.937169 (+0.44z)| norm 0.3534 (+0.79z)| lr 
5.94e-04 | 4207.48 ms | 32.1% bf16 MFU | 124717 tok/s step 1890/19560 | loss 3.943696 (+0.58z)| norm 0.3204 (-0.10z)| lr 5.94e-04 | 4158.59 ms | 32.5% bf16 MFU | 124785 tok/s step 1891/19560 | loss 3.874239 (-0.72z)| norm 0.2969 (-0.73z)| lr 5.94e-04 | 4184.75 ms | 32.3% bf16 MFU | 124810 tok/s step 1892/19560 | loss 3.895683 (-0.30z)| norm 0.3164 (-0.20z)| lr 5.94e-04 | 4175.77 ms | 32.3% bf16 MFU | 124847 tok/s step 1893/19560 | loss 3.915534 (+0.09z)| norm 0.3237 (-0.01z)| lr 5.94e-04 | 4178.04 ms | 32.3% bf16 MFU | 124879 tok/s step 1894/19560 | loss 3.892411 (-0.35z)| norm 0.3049 (-0.52z)| lr 5.94e-04 | 4169.36 ms | 32.4% bf16 MFU | 124922 tok/s step 1895/19560 | loss 3.881522 (-0.55z)| norm 0.3069 (-0.45z)| lr 5.94e-04 | 4157.32 ms | 32.5% bf16 MFU | 124982 tok/s step 1896/19560 | loss 3.893287 (-0.31z)| norm 0.2985 (-0.67z)| lr 5.94e-04 | 4191.15 ms | 32.2% bf16 MFU | 124987 tok/s step 1897/19560 | loss 3.828008 (-1.55z)| norm 0.2782 (-1.20z)| lr 5.94e-04 | 4182.28 ms | 32.3% bf16 MFU | 125006 tok/s step 1898/19560 | loss 3.827787 (-1.52z)| norm 0.2826 (-1.07z)| lr 5.94e-04 | 4169.44 ms | 32.4% bf16 MFU | 125043 tok/s step 1899/19560 | loss 3.910977 (+0.06z)| norm 0.3291 (+0.18z)| lr 5.94e-04 | 4225.80 ms | 32.0% bf16 MFU | 124994 tok/s step 1900/19560 | loss 3.881515 (-0.50z)| norm 0.3281 (+0.16z)| lr 5.94e-04 | 4174.77 ms | 32.3% bf16 MFU | 125024 tok/s step 1901/19560 | loss 3.885159 (-0.42z)| norm 0.3362 (+0.40z)| lr 5.94e-04 | 4170.31 ms | 32.4% bf16 MFU | 125059 tok/s step 1902/19560 | loss 3.905007 (-0.04z)| norm 0.3256 (+0.11z)| lr 5.94e-04 | 4174.43 ms | 32.3% bf16 MFU | 125085 tok/s step 1903/19560 | loss 3.906779 (+0.01z)| norm 0.3194 (-0.06z)| lr 5.94e-04 | 4214.83 ms | 32.0% bf16 MFU | 125051 tok/s step 1904/19560 | loss 3.854552 (-0.99z)| norm 0.3184 (-0.10z)| lr 5.94e-04 | 4186.11 ms | 32.3% bf16 MFU | 125060 tok/s step 1905/19560 | loss 3.892251 (-0.25z)| norm 0.3450 (+0.62z)| lr 5.94e-04 | 4194.64 ms | 32.2% bf16 MFU | 125057 tok/s step 
1906/19560 | loss 3.919702 (+0.28z)| norm 0.3370 (+0.40z)| lr 5.94e-04 | 4163.77 ms | 32.4% bf16 MFU | 125100 tok/s step 1907/19560 | loss 3.790597 (-2.16z)| norm 0.3355 (+0.35z)| lr 5.94e-04 | 4172.45 ms | 32.4% bf16 MFU | 125128 tok/s step 1908/19560 | loss 3.916015 (+0.24z)| norm 0.3074 (-0.41z)| lr 5.94e-04 | 4175.42 ms | 32.3% bf16 MFU | 125149 tok/s step 1909/19560 | loss 3.833577 (-1.32z)| norm 0.2885 (-0.94z)| lr 5.94e-04 | 4198.33 ms | 32.2% bf16 MFU | 125136 tok/s step 1910/19560 | loss 3.946486 (+0.83z)| norm 0.2937 (-0.80z)| lr 5.94e-04 | 4233.49 ms | 31.9% bf16 MFU | 125071 tok/s step 1911/19560 | loss 3.874346 (-0.53z)| norm 0.3038 (-0.52z)| lr 5.94e-04 | 4181.29 ms | 32.3% bf16 MFU | 125087 tok/s step 1912/19560 | loss 3.878399 (-0.45z)| norm 0.3639 (+1.12z)| lr 5.94e-04 | 4213.69 ms | 32.0% bf16 MFU | 125054 tok/s step 1913/19560 | loss 3.855781 (-0.87z)| norm 0.3705 (+1.28z)| lr 5.94e-04 | 4174.90 ms | 32.3% bf16 MFU | 125080 tok/s step 1914/19560 | loss 3.861341 (-0.75z)| norm 0.3892 (+1.76z)| lr 5.94e-04 | 4199.29 ms | 32.2% bf16 MFU | 125069 tok/s step 1915/19560 | loss 3.874539 (-0.49z)| norm 0.3802 (+1.48z)| lr 5.94e-04 | 4182.82 ms | 32.3% bf16 MFU | 125083 tok/s step 1916/19560 | loss 3.906678 (+0.14z)| norm 0.3448 (+0.52z)| lr 5.94e-04 | 4186.13 ms | 32.3% bf16 MFU | 125091 tok/s step 1917/19560 | loss 3.908998 (+0.18z)| norm 0.2884 (-1.00z)| lr 5.94e-04 | 4159.50 ms | 32.5% bf16 MFU | 125139 tok/s step 1918/19560 | loss 3.952305 (+1.00z)| norm 0.3089 (-0.44z)| lr 5.94e-04 | 4194.14 ms | 32.2% bf16 MFU | 125132 tok/s step 1919/19560 | loss 3.888005 (-0.23z)| norm 0.3333 (+0.21z)| lr 5.94e-04 | 4166.77 ms | 32.4% bf16 MFU | 125167 tok/s step 1920/19560 | loss 3.863029 (-0.71z)| norm 0.3313 (+0.16z)| lr 5.94e-04 | 4281.85 ms | 31.5% bf16 MFU | 125030 tok/s step 1921/19560 | loss 3.953647 (+1.03z)| norm 0.3365 (+0.31z)| lr 5.94e-04 | 4181.38 ms | 32.3% bf16 MFU | 125048 tok/s step 1922/19560 | loss 3.877468 (-0.43z)| norm 0.2920 (-0.89z)| lr 
5.94e-04 | 4192.81 ms | 32.2% bf16 MFU | 125048 tok/s step 1923/19560 | loss 3.873847 (-0.49z)| norm 0.2902 (-0.92z)| lr 5.94e-04 | 4202.78 ms | 32.1% bf16 MFU | 125033 tok/s step 1924/19560 | loss 3.871996 (-0.51z)| norm 0.4254 (+2.65z)| lr 5.94e-04 | 4168.36 ms | 32.4% bf16 MFU | 125070 tok/s step 1925/19560 | loss 3.852249 (-0.88z)| norm 0.3179 (-0.18z)| lr 5.94e-04 | 4190.17 ms | 32.2% bf16 MFU | 125073 tok/s step 1926/19560 | loss 3.846191 (-0.98z)| norm 0.2930 (-0.82z)| lr 5.94e-04 | 4159.62 ms | 32.5% bf16 MFU | 125121 tok/s step 1927/19560 | loss 3.928673 (+0.62z)| norm 0.2952 (-0.75z)| lr 5.94e-04 | 4155.15 ms | 32.5% bf16 MFU | 125174 tok/s step 1928/19560 | loss 3.906506 (+0.20z)| norm 0.2777 (-1.21z)| lr 5.94e-04 | 4168.27 ms | 32.4% bf16 MFU | 125205 tok/s step 1929/19560 | loss 3.882669 (-0.26z)| norm 0.2607 (-1.64z)| lr 5.94e-04 | 4181.44 ms | 32.3% bf16 MFU | 125214 tok/s step 1930/19560 | loss 3.855949 (-0.77z)| norm 0.2625 (-1.57z)| lr 5.94e-04 | 4182.66 ms | 32.3% bf16 MFU | 125220 tok/s step 1931/19560 | loss 3.904994 (+0.20z)| norm 0.2908 (-0.82z)| lr 5.94e-04 | 4172.44 ms | 32.4% bf16 MFU | 125242 tok/s step 1932/19560 | loss 3.857356 (-0.73z)| norm 0.3049 (-0.45z)| lr 5.94e-04 | 4176.19 ms | 32.3% bf16 MFU | 125257 tok/s step 1933/19560 | loss 3.842141 (-1.02z)| norm 0.3502 (+0.72z)| lr 5.94e-04 | 4170.37 ms | 32.4% bf16 MFU | 125280 tok/s step 1934/19560 | loss 3.887086 (-0.12z)| norm 0.4027 (+2.04z)| lr 5.94e-04 | 4189.81 ms | 32.2% bf16 MFU | 125273 tok/s step 1935/19560 | loss 3.786901 (-2.06z)| norm 0.3381 (+0.38z)| lr 5.94e-04 | 4183.38 ms | 32.3% bf16 MFU | 125275 tok/s step 1936/19560 | loss 3.870741 (-0.41z)| norm 0.3176 (-0.15z)| lr 5.94e-04 | 4174.78 ms | 32.3% bf16 MFU | 125291 tok/s step 1937/19560 | loss 3.908588 (+0.34z)| norm 0.3348 (+0.30z)| lr 5.94e-04 | 4179.73 ms | 32.3% bf16 MFU | 125298 tok/s step 1938/19560 | loss 3.864605 (-0.52z)| norm 0.3123 (-0.27z)| lr 5.94e-04 | 4173.80 ms | 32.3% bf16 MFU | 125314 tok/s step 
1939/19560 | loss 3.839710 (-1.00z)| norm 0.2612 (-1.56z)| lr 5.94e-04 | 4167.46 ms | 32.4% bf16 MFU | 125339 tok/s step 1940/19560 | loss 3.923608 (+0.68z)| norm 0.2800 (-1.07z)| lr 5.94e-04 | 4177.35 ms | 32.3% bf16 MFU | 125347 tok/s step 1941/19560 | loss 3.866475 (-0.45z)| norm 0.2970 (-0.63z)| lr 5.94e-04 | 4177.18 ms | 32.3% bf16 MFU | 125355 tok/s step 1942/19560 | loss 3.904890 (+0.33z)| norm 0.3076 (-0.34z)| lr 5.94e-04 | 4161.65 ms | 32.4% bf16 MFU | 125386 tok/s step 1943/19560 | loss 3.909858 (+0.45z)| norm 0.3426 (+0.56z)| lr 5.94e-04 | 4170.42 ms | 32.4% bf16 MFU | 125403 tok/s step 1944/19560 | loss 3.903655 (+0.33z)| norm 0.3388 (+0.50z)| lr 5.94e-04 | 4179.64 ms | 32.3% bf16 MFU | 125405 tok/s step 1945/19560 | loss 3.794672 (-1.89z)| norm 0.3461 (+0.71z)| lr 5.94e-04 | 4239.34 ms | 31.8% bf16 MFU | 125318 tok/s step 1946/19560 | loss 3.856158 (-0.63z)| norm 0.3089 (-0.29z)| lr 5.94e-04 | 4204.64 ms | 32.1% bf16 MFU | 125287 tok/s step 1947/19560 | loss 3.869642 (-0.34z)| norm 0.3044 (-0.41z)| lr 5.94e-04 | 4191.29 ms | 32.2% bf16 MFU | 125277 tok/s step 1948/19560 | loss 3.875520 (-0.21z)| norm 0.3071 (-0.34z)| lr 5.94e-04 | 4289.42 ms | 31.5% bf16 MFU | 125125 tok/s step 1949/19560 | loss 3.866848 (-0.38z)| norm 0.2954 (-0.65z)| lr 5.94e-04 | 4186.66 ms | 32.2% bf16 MFU | 125130 tok/s step 1950/19560 | loss 3.995312 (+2.35z)| norm 0.3255 (+0.17z)| lr 5.94e-04 | 4241.98 ms | 31.8% bf16 MFU | 125053 tok/s step 1951/19560 | loss 3.880366 (-0.10z)| norm 0.3521 (+0.88z)| lr 5.94e-04 | 4193.24 ms | 32.2% bf16 MFU | 125052 tok/s step 1952/19560 | loss 3.895746 (+0.22z)| norm 0.3592 (+1.07z)| lr 5.94e-04 | 4164.02 ms | 32.4% bf16 MFU | 125095 tok/s step 1953/19560 | loss 3.813733 (-1.53z)| norm 0.3592 (+1.05z)| lr 5.93e-04 | 4225.97 ms | 31.9% bf16 MFU | 125043 tok/s step 1954/19560 | loss 3.907605 (+0.46z)| norm 0.3255 (+0.13z)| lr 5.93e-04 | 4192.31 ms | 32.2% bf16 MFU | 125044 tok/s step 1955/19560 | loss 3.942956 (+1.21z)| norm 0.3172 (-0.10z)| lr 
5.93e-04 | 4173.38 ms | 32.4% bf16 MFU | 125073 tok/s step 1956/19560 | loss 3.872712 (-0.30z)| norm 0.3039 (-0.46z)| lr 5.93e-04 | 4206.92 ms | 32.1% bf16 MFU | 125051 tok/s step 1957/19560 | loss 3.885135 (-0.04z)| norm 0.3251 (+0.11z)| lr 5.93e-04 | 4166.44 ms | 32.4% bf16 MFU | 125090 tok/s step 1958/19560 | loss 3.819891 (-1.42z)| norm 0.3056 (-0.42z)| lr 5.93e-04 | 4180.67 ms | 32.3% bf16 MFU | 125106 tok/s step 1959/19560 | loss 3.888351 (+0.05z)| norm 0.2998 (-0.58z)| lr 5.93e-04 | 4174.99 ms | 32.3% bf16 MFU | 125130 tok/s step 1960/19560 | loss 3.866688 (-0.43z)| norm 0.3020 (-0.52z)| lr 5.93e-04 | 4188.51 ms | 32.2% bf16 MFU | 125132 tok/s step 1961/19560 | loss 3.881565 (-0.10z)| norm 0.3130 (-0.22z)| lr 5.93e-04 | 4197.61 ms | 32.2% bf16 MFU | 125120 tok/s step 1962/19560 | loss 3.902560 (+0.35z)| norm 0.3328 (+0.31z)| lr 5.93e-04 | 4170.42 ms | 32.4% bf16 MFU | 125150 tok/s step 1963/19560 | loss 3.815059 (-1.58z)| norm 0.3244 (+0.08z)| lr 5.93e-04 | 4237.62 ms | 31.9% bf16 MFU | 125079 tok/s step 1964/19560 | loss 3.797211 (-1.94z)| norm 0.3249 (+0.09z)| lr 5.93e-04 | 4160.57 ms | 32.5% bf16 MFU | 125125 tok/s step 1965/19560 | loss 3.865869 (-0.41z)| norm 0.3586 (+0.99z)| lr 5.93e-04 | 4198.96 ms | 32.2% bf16 MFU | 125112 tok/s step 1966/19560 | loss 3.830916 (-1.20z)| norm 0.3078 (-0.40z)| lr 5.93e-04 | 4166.78 ms | 32.4% bf16 MFU | 125148 tok/s step 1967/19560 | loss 3.888575 (+0.10z)| norm 0.2738 (-1.32z)| lr 5.93e-04 | 4178.65 ms | 32.3% bf16 MFU | 125164 tok/s step 1968/19560 | loss 3.894933 (+0.24z)| norm 0.3005 (-0.60z)| lr 5.93e-04 | 4174.02 ms | 32.3% bf16 MFU | 125186 tok/s step 1969/19560 | loss 3.824430 (-1.33z)| norm 0.2639 (-1.58z)| lr 5.93e-04 | 4165.03 ms | 32.4% bf16 MFU | 125221 tok/s step 1970/19560 | loss 3.815546 (-1.50z)| norm 0.2723 (-1.33z)| lr 5.93e-04 | 4185.29 ms | 32.3% bf16 MFU | 125223 tok/s step 1971/19560 | loss 3.847948 (-0.77z)| norm 0.2909 (-0.85z)| lr 5.93e-04 | 4193.07 ms | 32.2% bf16 MFU | 125214 tok/s step 
1972/19560 | loss 3.881422 (-0.03z)| norm 0.3093 (-0.29z)| lr 5.93e-04 | 4208.55 ms | 32.1% bf16 MFU | 125182 tok/s step 1973/19560 | loss 3.837891 (-0.98z)| norm 0.3442 (+0.80z)| lr 5.93e-04 | 4188.39 ms | 32.2% bf16 MFU | 125182 tok/s step 1974/19560 | loss 3.860161 (-0.48z)| norm 0.3927 (+2.29z)| lr 5.93e-04 | 4177.64 ms | 32.3% bf16 MFU | 125198 tok/s step 1975/19560 | loss 3.864649 (-0.38z)| norm 0.3301 (+0.37z)| lr 5.93e-04 | 4205.62 ms | 32.1% bf16 MFU | 125171 tok/s step 1976/19560 | loss 3.910681 (+0.64z)| norm 0.2991 (-0.59z)| lr 5.93e-04 | 4154.88 ms | 32.5% bf16 MFU | 125222 tok/s step 1977/19560 | loss 3.890062 (+0.18z)| norm 0.3234 (+0.16z)| lr 5.93e-04 | 4226.45 ms | 31.9% bf16 MFU | 125163 tok/s step 1978/19560 | loss 3.829589 (-1.14z)| norm 0.3194 (+0.03z)| lr 5.93e-04 | 4183.81 ms | 32.3% bf16 MFU | 125170 tok/s step 1979/19560 | loss 3.957283 (+1.66z)| norm 0.3182 (-0.01z)| lr 5.93e-04 | 4167.71 ms | 32.4% bf16 MFU | 125202 tok/s step 1980/19560 | loss 3.909628 (+0.64z)| norm 0.3401 (+0.66z)| lr 5.93e-04 | 4176.53 ms | 32.3% bf16 MFU | 125218 tok/s step 1981/19560 | loss 3.847551 (-0.75z)| norm 0.3244 (+0.16z)| lr 5.93e-04 | 4214.52 ms | 32.0% bf16 MFU | 125177 tok/s step 1982/19560 | loss 3.880563 (-0.00z)| norm 0.3050 (-0.45z)| lr 5.93e-04 | 4168.02 ms | 32.4% bf16 MFU | 125208 tok/s step 1983/19560 | loss 3.849854 (-0.68z)| norm 0.2808 (-1.19z)| lr 5.93e-04 | 4179.69 ms | 32.3% bf16 MFU | 125219 tok/s step 1984/19560 | loss 3.847778 (-0.73z)| norm 0.3356 (+0.53z)| lr 5.93e-04 | 4179.22 ms | 32.3% bf16 MFU | 125231 tok/s step 1985/19560 | loss 3.888304 (+0.20z)| norm 0.2891 (-0.92z)| lr 5.93e-04 | 4184.31 ms | 32.3% bf16 MFU | 125234 tok/s step 1986/19560 | loss 3.915785 (+0.84z)| norm 0.3244 (+0.18z)| lr 5.93e-04 | 4243.43 ms | 31.8% bf16 MFU | 125150 tok/s step 1987/19560 | loss 3.909277 (+0.68z)| norm 0.3178 (-0.02z)| lr 5.93e-04 | 4199.26 ms | 32.2% bf16 MFU | 125135 tok/s step 1988/19560 | loss 3.858608 (-0.48z)| norm 0.3212 (+0.10z)| lr 
5.93e-04 | 4170.11 ms | 32.4% bf16 MFU | 125165 tok/s step 1989/19560 | loss 3.904093 (+0.57z)| norm 0.3185 (+0.03z)| lr 5.93e-04 | 4188.90 ms | 32.2% bf16 MFU | 125165 tok/s step 1990/19560 | loss 3.900060 (+0.47z)| norm 0.2862 (-1.01z)| lr 5.93e-04 | 4339.00 ms | 31.1% bf16 MFU | 124948 tok/s step 1991/19560 | loss 3.854287 (-0.60z)| norm 0.3018 (-0.49z)| lr 5.93e-04 | 4176.20 ms | 32.3% bf16 MFU | 124978 tok/s step 1992/19560 | loss 3.808729 (-1.63z)| norm 0.3415 (+0.81z)| lr 5.93e-04 | 4155.69 ms | 32.5% bf16 MFU | 125037 tok/s step 1993/19560 | loss 3.845676 (-0.76z)| norm 0.3217 (+0.15z)| lr 5.93e-04 | 4162.78 ms | 32.4% bf16 MFU | 125082 tok/s step 1994/19560 | loss 3.836782 (-0.95z)| norm 0.3123 (-0.18z)| lr 5.93e-04 | 4230.12 ms | 31.9% bf16 MFU | 125025 tok/s step 1995/19560 | loss 3.892962 (+0.38z)| norm 0.3663 (+1.60z)| lr 5.93e-04 | 4157.62 ms | 32.5% bf16 MFU | 125079 tok/s step 1996/19560 | loss 3.875401 (-0.05z)| norm 0.3934 (+2.42z)| lr 5.93e-04 | 4173.61 ms | 32.4% bf16 MFU | 125106 tok/s step 1997/19560 | loss 3.966629 (+2.10z)| norm 0.3825 (+2.02z)| lr 5.93e-04 | 4172.38 ms | 32.4% bf16 MFU | 125134 tok/s step 1998/19560 | loss 3.859981 (-0.44z)| norm 0.3265 (+0.24z)| lr 5.93e-04 | 4170.44 ms | 32.4% bf16 MFU | 125163 tok/s step 1999/19560 | loss 3.851682 (-0.63z)| norm 0.2932 (-0.82z)| lr 5.93e-04 | 4332.15 ms | 31.2% bf16 MFU | 124956 tok/s step 2000/19560 | loss 3.854443 (-0.56z)| norm 0.2946 (-0.76z)| lr 5.93e-04 | 4204.10 ms | 32.1% bf16 MFU | 124944 tok/s val loss 3.871290 
HellaSwag: 2623/10042 = 0.261203 step 2001/19560 | loss 3.883659 (+0.13z)| norm 0.3102 (-0.25z)| lr 5.93e-04 | 4649.53 ms | 29.0% bf16 MFU | 124334 tok/s step 2002/19560 | loss 3.808556 (-1.63z)| norm 0.3485 (+0.98z)| lr 5.93e-04 | 4420.44 ms | 30.5% bf16 MFU | 124048 tok/s step 2003/19560 | loss 3.822787 (-1.28z)| norm 0.3551 (+1.18z)| lr 5.93e-04 | 4555.52 ms | 29.6% bf16 MFU | 123600 tok/s step 2004/19560 | loss 3.865980 (-0.26z)| norm 0.2935 (-0.80z)| lr 5.93e-04 | 4399.33 ms | 30.7% bf16 MFU | 123379 tok/s step 2005/19560 | loss 
3.780046 (-2.23z)| norm 0.2856 (-1.04z)| lr 5.93e-04 | 4440.56 ms | 30.4% bf16 MFU | 123113 tok/s step 2006/19560 | loss 3.827246 (-1.13z)| norm 0.2656 (-1.66z)| lr 5.93e-04 | 4707.97 ms | 28.7% bf16 MFU | 122526 tok/s step 2007/19560 | loss 3.879881 (+0.07z)| norm 0.2954 (-0.73z)| lr 5.93e-04 | 4211.30 ms | 32.1% bf16 MFU | 122624 tok/s step 2008/19560 | loss 3.858625 (-0.41z)| norm 0.2776 (-1.29z)| lr 5.93e-04 | 4177.96 ms | 32.3% bf16 MFU | 122767 tok/s step 2009/19560 | loss 3.884988 (+0.23z)| norm 0.3189 (+0.02z)| lr 5.93e-04 | 4690.66 ms | 28.8% bf16 MFU | 122218 tok/s step 2010/19560 | loss 3.859571 (-0.39z)| norm 0.3037 (-0.48z)| lr 5.93e-04 | 4379.54 ms | 30.8% bf16 MFU | 122092 tok/s step 2011/19560 | loss 3.790921 (-2.05z)| norm 0.3182 (-0.02z)| lr 5.93e-04 | 4287.06 ms | 31.5% bf16 MFU | 122103 tok/s step 2012/19560 | loss 3.849776 (-0.62z)| norm 0.2902 (-0.94z)| lr 5.93e-04 | 4245.60 ms | 31.8% bf16 MFU | 122172 tok/s step 2013/19560 | loss 3.875588 (+0.04z)| norm 0.3053 (-0.43z)| lr 5.93e-04 | 4320.29 ms | 31.3% bf16 MFU | 122131 tok/s step 2014/19560 | loss 3.823895 (-1.27z)| norm 0.2804 (-1.23z)| lr 5.93e-04 | 4172.76 ms | 32.4% bf16 MFU | 122307 tok/s step 2015/19560 | loss 3.787682 (-2.13z)| norm 0.2838 (-1.11z)| lr 5.93e-04 | 4165.80 ms | 32.4% bf16 MFU | 122484 tok/s step 2016/19560 | loss 3.839936 (-0.82z)| norm 0.2753 (-1.36z)| lr 5.93e-04 | 4175.35 ms | 32.3% bf16 MFU | 122638 tok/s step 2017/19560 | loss 3.851200 (-0.53z)| norm 0.2887 (-0.92z)| lr 5.93e-04 | 4154.84 ms | 32.5% bf16 MFU | 122816 tok/s step 2018/19560 | loss 3.824729 (-1.18z)| norm 0.3092 (-0.24z)| lr 5.93e-04 | 4155.61 ms | 32.5% bf16 MFU | 122983 tok/s step 2019/19560 | loss 3.789091 (-2.03z)| norm 0.3067 (-0.33z)| lr 5.93e-04 | 4159.72 ms | 32.5% bf16 MFU | 123136 tok/s step 2020/19560 | loss 3.826842 (-1.08z)| norm 0.3306 (+0.45z)| lr 5.93e-04 | 4163.92 ms | 32.4% bf16 MFU | 123275 tok/s step 2021/19560 | loss 3.809158 (-1.49z)| norm 0.3444 (+0.89z)| lr 5.93e-04 | 4159.50 
ms | 32.5% bf16 MFU | 123413 tok/s step 2022/19560 | loss 3.862594 (-0.16z)| norm 0.3352 (+0.58z)| lr 5.93e-04 | 4188.69 ms | 32.2% bf16 MFU | 123501 tok/s step 2023/19560 | loss 3.887064 (+0.44z)| norm 0.3284 (+0.36z)| lr 5.93e-04 | 4166.78 ms | 32.4% bf16 MFU | 123617 tok/s step 2024/19560 | loss 3.833235 (-0.88z)| norm 0.3488 (+1.00z)| lr 5.93e-04 | 4159.71 ms | 32.5% bf16 MFU | 123738 tok/s step 2025/19560 | loss 3.890955 (+0.54z)| norm 0.3845 (+2.11z)| lr 5.93e-04 | 4165.06 ms | 32.4% bf16 MFU | 123845 tok/s step 2026/19560 | loss 3.914544 (+1.11z)| norm 0.3573 (+1.22z)| lr 5.93e-04 | 4167.84 ms | 32.4% bf16 MFU | 123943 tok/s step 2027/19560 | loss 3.769826 (-2.40z)| norm 0.3422 (+0.73z)| lr 5.93e-04 | 4169.95 ms | 32.4% bf16 MFU | 124032 tok/s step 2028/19560 | loss 3.820539 (-1.15z)| norm 0.3023 (-0.54z)| lr 5.93e-04 | 4208.00 ms | 32.1% bf16 MFU | 124060 tok/s step 2029/19560 | loss 3.814246 (-1.29z)| norm 0.2707 (-1.52z)| lr 5.93e-04 | 4162.74 ms | 32.4% bf16 MFU | 124155 tok/s step 2030/19560 | loss 3.840096 (-0.65z)| norm 0.2763 (-1.32z)| lr 5.93e-04 | 4172.62 ms | 32.4% bf16 MFU | 124229 tok/s step 2031/19560 | loss 3.829024 (-0.91z)| norm 0.2843 (-1.06z)| lr 5.93e-04 | 4157.26 ms | 32.5% bf16 MFU | 124324 tok/s step 2032/19560 | loss 3.888328 (+0.52z)| norm 0.3119 (-0.19z)| lr 5.93e-04 | 4275.98 ms | 31.6% bf16 MFU | 124238 tok/s step 2033/19560 | loss 3.823607 (-1.03z)| norm 0.3309 (+0.41z)| lr 5.93e-04 | 4274.79 ms | 31.6% bf16 MFU | 124158 tok/s step 2034/19560 | loss 3.865003 (-0.02z)| norm 0.3597 (+1.30z)| lr 5.93e-04 | 4174.60 ms | 32.3% bf16 MFU | 124230 tok/s step 2035/19560 | loss 3.817558 (-1.18z)| norm 0.3988 (+2.45z)| lr 5.93e-04 | 4169.72 ms | 32.4% bf16 MFU | 124305 tok/s step 2036/19560 | loss 3.816824 (-1.18z)| norm 0.3484 (+0.90z)| lr 5.93e-04 | 4162.59 ms | 32.4% bf16 MFU | 124388 tok/s step 2037/19560 | loss 3.899629 (+0.82z)| norm 0.3162 (-0.09z)| lr 5.93e-04 | 4156.73 ms | 32.5% bf16 MFU | 124475 tok/s step 2038/19560 | loss 
3.842626 (-0.55z)| norm 0.2780 (-1.25z)| lr 5.93e-04 | 4173.55 ms | 32.4% bf16 MFU | 124532 tok/s step 2039/19560 | loss 3.876303 (+0.28z)| norm 0.2791 (-1.20z)| lr 5.93e-04 | 4157.35 ms | 32.5% bf16 MFU | 124611 tok/s step 2040/19560 | loss 3.852894 (-0.30z)| norm 0.2813 (-1.12z)| lr 5.93e-04 | 4176.00 ms | 32.3% bf16 MFU | 124658 tok/s step 2041/19560 | loss 3.873553 (+0.21z)| norm 0.2830 (-1.05z)| lr 5.93e-04 | 4166.15 ms | 32.4% bf16 MFU | 124717 tok/s step 2042/19560 | loss 3.764449 (-2.41z)| norm 0.2665 (-1.54z)| lr 5.93e-04 | 4160.71 ms | 32.5% bf16 MFU | 124782 tok/s step 2043/19560 | loss 3.861589 (-0.06z)| norm 0.2611 (-1.68z)| lr 5.93e-04 | 4174.02 ms | 32.3% bf16 MFU | 124823 tok/s step 2044/19560 | loss 3.850447 (-0.32z)| norm 0.2952 (-0.62z)| lr 5.93e-04 | 4171.43 ms | 32.4% bf16 MFU | 124866 tok/s step 2045/19560 | loss 3.859848 (-0.08z)| norm 0.2767 (-1.18z)| lr 5.93e-04 | 4175.65 ms | 32.3% bf16 MFU | 124901 tok/s step 2046/19560 | loss 3.825215 (-0.92z)| norm 0.2692 (-1.40z)| lr 5.93e-04 | 4156.70 ms | 32.5% bf16 MFU | 124962 tok/s step 2047/19560 | loss 3.845064 (-0.42z)| norm 0.2992 (-0.47z)| lr 5.92e-04 | 4157.17 ms | 32.5% bf16 MFU | 125020 tok/s step 2048/19560 | loss 3.884164 (+0.54z)| norm 0.2843 (-0.91z)| lr 5.92e-04 | 4162.71 ms | 32.4% bf16 MFU | 125066 tok/s step 2049/19560 | loss 3.832813 (-0.72z)| norm 0.3135 (-0.01z)| lr 5.92e-04 | 4207.49 ms | 32.1% bf16 MFU | 125044 tok/s step 2050/19560 | loss 3.806545 (-1.35z)| norm 0.3322 (+0.55z)| lr 5.92e-04 | 4165.97 ms | 32.4% bf16 MFU | 125084 tok/s step 2051/19560 | loss 3.815812 (-1.11z)| norm 0.3160 (+0.05z)| lr 5.92e-04 | 4174.65 ms | 32.3% bf16 MFU | 125109 tok/s step 2052/19560 | loss 3.818576 (-1.02z)| norm 0.3368 (+0.74z)| lr 5.92e-04 | 4244.19 ms | 31.8% bf16 MFU | 125030 tok/s step 2053/19560 | loss 3.809189 (-1.24z)| norm 0.3388 (+0.80z)| lr 5.92e-04 | 4170.92 ms | 32.4% bf16 MFU | 125064 tok/s step 2054/19560 | loss 3.869471 (+0.24z)| norm 0.3113 (-0.09z)| lr 5.92e-04 | 4165.02 
ms | 32.4% bf16 MFU | 125104 tok/s step 2055/19560 | loss 3.771035 (-2.14z)| norm 0.2873 (-0.86z)| lr 5.92e-04 | 4222.96 ms | 32.0% bf16 MFU | 125057 tok/s step 2056/19560 | loss 3.838859 (-0.47z)| norm 0.2838 (-0.97z)| lr 5.92e-04 | 4159.42 ms | 32.5% bf16 MFU | 125106 tok/s step 2057/19560 | loss 3.866023 (+0.20z)| norm 0.2886 (-0.83z)| lr 5.92e-04 | 4160.90 ms | 32.4% bf16 MFU | 125151 tok/s step 2058/19560 | loss 3.815365 (-1.03z)| norm 0.3073 (-0.24z)| lr 5.92e-04 | 4166.56 ms | 32.4% bf16 MFU | 125185 tok/s step 2059/19560 | loss 3.847209 (-0.24z)| norm 0.2923 (-0.73z)| lr 5.92e-04 | 4162.26 ms | 32.4% bf16 MFU | 125224 tok/s step 2060/19560 | loss 3.849405 (-0.19z)| norm 0.3101 (-0.15z)| lr 5.92e-04 | 4158.14 ms | 32.5% bf16 MFU | 125267 tok/s step 2061/19560 | loss 3.797892 (-1.44z)| norm 0.3352 (+0.68z)| lr 5.92e-04 | 4160.59 ms | 32.5% bf16 MFU | 125305 tok/s step 2062/19560 | loss 3.869664 (+0.32z)| norm 0.3067 (-0.24z)| lr 5.92e-04 | 4161.76 ms | 32.4% bf16 MFU | 125338 tok/s step 2063/19560 | loss 3.785708 (-1.73z)| norm 0.3061 (-0.25z)| lr 5.92e-04 | 4180.58 ms | 32.3% bf16 MFU | 125342 tok/s step 2064/19560 | loss 3.836103 (-0.49z)| norm 0.2893 (-0.82z)| lr 5.92e-04 | 4167.98 ms | 32.4% bf16 MFU | 125364 tok/s step 2065/19560 | loss 3.818445 (-0.91z)| norm 0.2938 (-0.65z)| lr 5.92e-04 | 4202.50 ms | 32.1% bf16 MFU | 125334 tok/s step 2066/19560 | loss 3.825160 (-0.74z)| norm 0.2878 (-0.85z)| lr 5.92e-04 | 4173.10 ms | 32.4% bf16 MFU | 125349 tok/s step 2067/19560 | loss 3.852355 (-0.07z)| norm 0.2827 (-1.04z)| lr 5.92e-04 | 4160.37 ms | 32.5% bf16 MFU | 125382 tok/s step 2068/19560 | loss 3.824536 (-0.74z)| norm 0.2664 (-1.59z)| lr 5.92e-04 | 4183.84 ms | 32.3% bf16 MFU | 125379 tok/s step 2069/19560 | loss 3.864123 (+0.24z)| norm 0.2862 (-0.90z)| lr 5.92e-04 | 4167.51 ms | 32.4% bf16 MFU | 125400 tok/s step 2070/19560 | loss 3.803810 (-1.24z)| norm 0.2771 (-1.20z)| lr 5.92e-04 | 4168.50 ms | 32.4% bf16 MFU | 125419 tok/s step 2071/19560 | loss 
3.767607 (-2.08z)| norm 0.2946 (-0.60z)| lr 5.92e-04 | 4164.02 ms | 32.4% bf16 MFU | 125443 tok/s step 2072/19560 | loss 3.833011 (-0.47z)| norm 0.3171 (+0.18z)| lr 5.92e-04 | 4184.33 ms | 32.3% bf16 MFU | 125436 tok/s step 2073/19560 | loss 3.861849 (+0.23z)| norm 0.3866 (+2.49z)| lr 5.92e-04 | 4154.62 ms | 32.5% bf16 MFU | 125474 tok/s step 2074/19560 | loss 3.847047 (-0.13z)| norm 0.3749 (+2.05z)| lr 5.92e-04 | 4169.25 ms | 32.4% bf16 MFU | 125488 tok/s step 2075/19560 | loss 3.803124 (-1.21z)| norm 0.3370 (+0.79z)| lr 5.92e-04 | 4170.14 ms | 32.4% bf16 MFU | 125500 tok/s step 2076/19560 | loss 3.829048 (-0.56z)| norm 0.3078 (-0.17z)| lr 5.92e-04 | 4155.41 ms | 32.5% bf16 MFU | 125533 tok/s step 2077/19560 | loss 3.817661 (-0.83z)| norm 0.2812 (-1.04z)| lr 5.92e-04 | 4163.96 ms | 32.4% bf16 MFU | 125552 tok/s step 2078/19560 | loss 3.889370 (+1.01z)| norm 0.2902 (-0.74z)| lr 5.92e-04 | 4179.45 ms | 32.3% bf16 MFU | 125547 tok/s step 2079/19560 | loss 3.803310 (-1.20z)| norm 0.3021 (-0.33z)| lr 5.92e-04 | 4158.54 ms | 32.5% bf16 MFU | 125573 tok/s step 2080/19560 | loss 3.814608 (-0.89z)| norm 0.3543 (+1.39z)| lr 5.92e-04 | 4169.10 ms | 32.4% bf16 MFU | 125582 tok/s step 2081/19560 | loss 3.845936 (-0.09z)| norm 0.3616 (+1.63z)| lr 5.92e-04 | 4179.90 ms | 32.3% bf16 MFU | 125575 tok/s step 2082/19560 | loss 3.771083 (-1.99z)| norm 0.2829 (-0.95z)| lr 5.92e-04 | 4169.98 ms | 32.4% bf16 MFU | 125582 tok/s step 2083/19560 | loss 3.827517 (-0.52z)| norm 0.3120 (+0.00z)| lr 5.92e-04 | 4181.72 ms | 32.3% bf16 MFU | 125572 tok/s step 2084/19560 | loss 3.865252 (+0.47z)| norm 0.3267 (+0.48z)| lr 5.92e-04 | 4165.32 ms | 32.4% bf16 MFU | 125587 tok/s step 2085/19560 | loss 3.845052 (-0.05z)| norm 0.3200 (+0.26z)| lr 5.92e-04 | 8547.72 ms | 15.8% bf16 MFU | 122374 tok/s step 2086/19560 | loss 3.825233 (-0.58z)| norm 0.2980 (-0.46z)| lr 5.92e-04 | 4160.23 ms | 32.5% bf16 MFU | 122557 tok/s step 2087/19560 | loss 3.772466 (-1.94z)| norm 0.2874 (-0.80z)| lr 5.92e-04 | 4160.97 
ms | 32.4% bf16 MFU | 122729 tok/s step 2088/19560 | loss 3.836587 (-0.25z)| norm 0.2924 (-0.63z)| lr 5.92e-04 | 4147.82 ms | 32.6% bf16 MFU | 122913 tok/s step 2089/19560 | loss 3.826365 (-0.50z)| norm 0.2766 (-1.14z)| lr 5.92e-04 | 4173.14 ms | 32.4% bf16 MFU | 123049 tok/s step 2090/19560 | loss 3.801134 (-1.15z)| norm 0.2458 (-2.08z)| lr 5.92e-04 | 4185.70 ms | 32.3% bf16 MFU | 123159 tok/s step 2091/19560 | loss 3.802986 (-1.10z)| norm 0.2643 (-1.47z)| lr 5.92e-04 | 4162.97 ms | 32.4% bf16 MFU | 123298 tok/s step 2092/19560 | loss 4.009490 (+4.05z)| norm 0.2855 (-0.78z)| lr 5.92e-04 | 4167.65 ms | 32.4% bf16 MFU | 123423 tok/s step 2093/19560 | loss 3.769201 (-1.87z)| norm 0.3230 (+0.43z)| lr 5.92e-04 | 4156.79 ms | 32.5% bf16 MFU | 123559 tok/s step 2094/19560 | loss 3.842071 (-0.09z)| norm 0.3371 (+0.87z)| lr 5.92e-04 | 4155.50 ms | 32.5% bf16 MFU | 123689 tok/s step 2095/19560 | loss 3.796568 (-1.18z)| norm 0.3767 (+2.08z)| lr 5.92e-04 | 4178.96 ms | 32.3% bf16 MFU | 123777 tok/s step 2096/19560 | loss 3.907322 (+1.52z)| norm 0.3924 (+2.49z)| lr 5.92e-04 | 4150.74 ms | 32.5% bf16 MFU | 123904 tok/s step 2097/19560 | loss 3.794637 (-1.22z)| norm 0.3685 (+1.72z)| lr 5.92e-04 | 4153.44 ms | 32.5% bf16 MFU | 124020 tok/s step 2098/19560 | loss 3.885318 (+0.97z)| norm 0.3670 (+1.65z)| lr 5.92e-04 | 4155.13 ms | 32.5% bf16 MFU | 124128 tok/s step 2099/19560 | loss 3.773688 (-1.70z)| norm 0.3429 (+0.90z)| lr 5.92e-04 | 4157.91 ms | 32.5% bf16 MFU | 124227 tok/s step 2100/19560 | loss 3.843929 (-0.01z)| norm 0.3236 (+0.31z)| lr 5.92e-04 | 4188.70 ms | 32.2% bf16 MFU | 124274 tok/s step 2101/19560 | loss 3.863286 (+0.45z)| norm 0.3285 (+0.46z)| lr 5.92e-04 | 4156.15 ms | 32.5% bf16 MFU | 124367 tok/s step 2102/19560 | loss 3.777018 (-1.59z)| norm 0.3078 (-0.16z)| lr 5.92e-04 | 4168.48 ms | 32.4% bf16 MFU | 124438 tok/s step 2103/19560 | loss 3.785147 (-1.37z)| norm 0.3078 (-0.15z)| lr 5.92e-04 | 4169.25 ms | 32.4% bf16 MFU | 124503 tok/s step 2104/19560 | loss 
3.798125 (-1.05z)| norm 0.3047 (-0.25z)| lr 5.92e-04 | 4162.23 ms | 32.4% bf16 MFU | 124576 tok/s step 2105/19560 | loss 3.808367 (-0.80z)| norm 0.2980 (-0.45z)| lr 5.92e-04 | 4146.34 ms | 32.6% bf16 MFU | 124670 tok/s step 2106/19560 | loss 3.904806 (+1.47z)| norm 0.3031 (-0.29z)| lr 5.92e-04 | 4173.69 ms | 32.3% bf16 MFU | 124717 tok/s step 2107/19560 | loss 3.855515 (+0.34z)| norm 0.2937 (-0.58z)| lr 5.92e-04 | 4157.43 ms | 32.5% bf16 MFU | 124787 tok/s step 2108/19560 | loss 3.824358 (-0.41z)| norm 0.2870 (-0.78z)| lr 5.92e-04 | 4153.25 ms | 32.5% bf16 MFU | 124859 tok/s step 2109/19560 | loss 3.824087 (-0.41z)| norm 0.2923 (-0.60z)| lr 5.92e-04 | 4161.73 ms | 32.4% bf16 MFU | 124915 tok/s step 2110/19560 | loss 3.826676 (-0.34z)| norm 0.2978 (-0.42z)| lr 5.92e-04 | 4161.44 ms | 32.4% bf16 MFU | 124969 tok/s step 2111/19560 | loss 3.815216 (-0.61z)| norm 0.2733 (-1.19z)| lr 5.92e-04 | 4221.50 ms | 32.0% bf16 MFU | 124930 tok/s step 2112/19560 | loss 3.990911 (+3.50z)| norm 0.2700 (-1.27z)| lr 5.92e-04 | 4177.41 ms | 32.3% bf16 MFU | 124959 tok/s step 2113/19560 | loss 3.839569 (-0.03z)| norm 0.2940 (-0.52z)| lr 5.92e-04 | 4164.79 ms | 32.4% bf16 MFU | 125005 tok/s step 2114/19560 | loss 3.808885 (-0.74z)| norm 0.2968 (-0.43z)| lr 5.92e-04 | 4176.64 ms | 32.3% bf16 MFU | 125031 tok/s step 2115/19560 | loss 3.740693 (-2.30z)| norm 0.3152 (+0.14z)| lr 5.92e-04 | 4164.93 ms | 32.4% bf16 MFU | 125074 tok/s step 2116/19560 | loss 3.905822 (+1.56z)| norm 0.3216 (+0.34z)| lr 5.92e-04 | 4159.93 ms | 32.5% bf16 MFU | 125122 tok/s step 2117/19560 | loss 3.824531 (-0.33z)| norm 0.3041 (-0.20z)| lr 5.92e-04 | 4154.62 ms | 32.5% bf16 MFU | 125175 tok/s step 2118/19560 | loss 3.854062 (+0.38z)| norm 0.2930 (-0.55z)| lr 5.92e-04 | 4230.30 ms | 31.9% bf16 MFU | 125113 tok/s step 2119/19560 | loss 3.864662 (+0.63z)| norm 0.2962 (-0.45z)| lr 5.92e-04 | 4159.58 ms | 32.5% bf16 MFU | 125160 tok/s step 2120/19560 | loss 3.818780 (-0.46z)| norm 0.2649 (-1.40z)| lr 5.92e-04 | 4210.70 
ms | 32.1% bf16 MFU | 125128 tok/s step 2121/19560 | loss 3.844501 (+0.15z)| norm 0.2906 (-0.59z)| lr 5.92e-04 | 4169.15 ms | 32.4% bf16 MFU | 125159 tok/s step 2122/19560 | loss 3.822162 (-0.38z)| norm 0.3280 (+0.56z)| lr 5.92e-04 | 4164.95 ms | 32.4% bf16 MFU | 125195 tok/s step 2123/19560 | loss 3.741710 (-2.22z)| norm 0.3282 (+0.58z)| lr 5.92e-04 | 4148.69 ms | 32.5% bf16 MFU | 125254 tok/s step 2124/19560 | loss 3.738187 (-2.24z)| norm 0.3425 (+1.07z)| lr 5.92e-04 | 4167.92 ms | 32.4% bf16 MFU | 125281 tok/s step 2125/19560 | loss 3.811287 (-0.56z)| norm 0.3313 (+0.74z)| lr 5.92e-04 | 4156.93 ms | 32.5% bf16 MFU | 125323 tok/s step 2126/19560 | loss 3.896590 (+1.46z)| norm 0.3524 (+1.41z)| lr 5.92e-04 | 4193.50 ms | 32.2% bf16 MFU | 125308 tok/s step 2127/19560 | loss 3.760221 (-1.73z)| norm 0.3389 (+0.96z)| lr 5.92e-04 | 4163.22 ms | 32.4% bf16 MFU | 125339 tok/s step 2128/19560 | loss 3.811620 (-0.52z)| norm 0.3088 (-0.02z)| lr 5.92e-04 | 4156.28 ms | 32.5% bf16 MFU | 125380 tok/s step 2129/19560 | loss 3.755973 (-1.78z)| norm 0.3252 (+0.51z)| lr 5.92e-04 | 4165.84 ms | 32.4% bf16 MFU | 125403 tok/s step 2130/19560 | loss 3.784103 (-1.12z)| norm 0.3031 (-0.20z)| lr 5.92e-04 | 4150.84 ms | 32.5% bf16 MFU | 125449 tok/s step 2131/19560 | loss 3.801190 (-0.72z)| norm 0.2787 (-0.98z)| lr 5.92e-04 | 4243.19 ms | 31.8% bf16 MFU | 125354 tok/s step 2132/19560 | loss 3.819638 (-0.29z)| norm 0.2627 (-1.48z)| lr 5.92e-04 | 4166.82 ms | 32.4% bf16 MFU | 125378 tok/s step 2133/19560 | loss 3.842049 (+0.22z)| norm 0.2820 (-0.85z)| lr 5.92e-04 | 4160.63 ms | 32.5% bf16 MFU | 125409 tok/s step 2134/19560 | loss 3.803946 (-0.66z)| norm 0.2621 (-1.50z)| lr 5.91e-04 | 4158.63 ms | 32.5% bf16 MFU | 125442 tok/s step 2135/19560 | loss 3.861807 (+0.69z)| norm 0.2838 (-0.79z)| lr 5.91e-04 | 4154.43 ms | 32.5% bf16 MFU | 125480 tok/s step 2136/19560 | loss 3.801723 (-0.70z)| norm 0.3006 (-0.25z)| lr 5.91e-04 | 4150.96 ms | 32.5% bf16 MFU | 125522 tok/s step 2137/19560 | loss 
3.850316 (+0.44z)| norm 0.3238 (+0.50z)| lr 5.91e-04 | 4199.01 ms | 32.2% bf16 MFU | 125489 tok/s step 2138/19560 | loss 3.833700 (+0.05z)| norm 0.2987 (-0.32z)| lr 5.91e-04 | 4149.59 ms | 32.5% bf16 MFU | 125531 tok/s step 2139/19560 | loss 3.810527 (-0.49z)| norm 0.2976 (-0.35z)| lr 5.91e-04 | 4170.24 ms | 32.4% bf16 MFU | 125541 tok/s step 2140/19560 | loss 3.804290 (-0.63z)| norm 0.2871 (-0.68z)| lr 5.91e-04 | 4166.69 ms | 32.4% bf16 MFU | 125555 tok/s step 2141/19560 | loss 3.876768 (+1.07z)| norm 0.3182 (+0.32z)| lr 5.91e-04 | 4175.02 ms | 32.3% bf16 MFU | 125556 tok/s step 2142/19560 | loss 3.834624 (+0.08z)| norm 0.3177 (+0.30z)| lr 5.91e-04 | 4178.86 ms | 32.3% bf16 MFU | 125552 tok/s step 2143/19560 | loss 3.793650 (-0.88z)| norm 0.3008 (-0.26z)| lr 5.91e-04 | 4167.28 ms | 32.4% bf16 MFU | 125565 tok/s step 2144/19560 | loss 3.781776 (-1.15z)| norm 0.2833 (-0.83z)| lr 5.91e-04 | 4153.16 ms | 32.5% bf16 MFU | 125598 tok/s step 2145/19560 | loss 3.759766 (-1.63z)| norm 0.2708 (-1.23z)| lr 5.91e-04 | 4153.44 ms | 32.5% bf16 MFU | 125630 tok/s step 2146/19560 | loss 3.734694 (-2.15z)| norm 0.2532 (-1.77z)| lr 5.91e-04 | 4145.26 ms | 32.6% bf16 MFU | 125672 tok/s step 2147/19560 | loss 3.823218 (-0.15z)| norm 0.2516 (-1.78z)| lr 5.91e-04 | 4157.25 ms | 32.5% bf16 MFU | 125694 tok/s step 2148/19560 | loss 3.814749 (-0.34z)| norm 0.2551 (-1.64z)| lr 5.91e-04 | 4157.39 ms | 32.5% bf16 MFU | 125715 tok/s step 2149/19560 | loss 3.763582 (-1.48z)| norm 0.2522 (-1.70z)| lr 5.91e-04 | 4145.73 ms | 32.6% bf16 MFU | 125753 tok/s step 2150/19560 | loss 3.721054 (-2.37z)| norm 0.2564 (-1.54z)| lr 5.91e-04 | 4155.28 ms | 32.5% bf16 MFU | 125774 tok/s step 2151/19560 | loss 3.809956 (-0.39z)| norm 0.2566 (-1.50z)| lr 5.91e-04 | 4150.96 ms | 32.5% bf16 MFU | 125800 tok/s step 2152/19560 | loss 3.834912 (+0.16z)| norm 0.2872 (-0.55z)| lr 5.91e-04 | 4159.38 ms | 32.5% bf16 MFU | 125813 tok/s step 2153/19560 | loss 3.792039 (-0.78z)| norm 0.2727 (-0.99z)| lr 5.91e-04 | 4196.14 
ms | 32.2% bf16 MFU | 125769 tok/s step 2154/19560 | loss 3.716851 (-2.41z)| norm 0.2868 (-0.53z)| lr 5.91e-04 | 4161.55 ms | 32.4% bf16 MFU | 125780 tok/s step 2155/19560 | loss 3.802588 (-0.51z)| norm 0.3311 (+0.89z)| lr 5.91e-04 | 4158.04 ms | 32.5% bf16 MFU | 125796 tok/s step 2156/19560 | loss 3.784278 (-0.91z)| norm 0.4009 (+2.99z)| lr 5.91e-04 | 4162.29 ms | 32.4% bf16 MFU | 125804 tok/s step 2157/19560 | loss 3.853589 (+0.62z)| norm 0.4135 (+3.22z)| lr 5.91e-04 | 4153.31 ms | 32.5% bf16 MFU | 125825 tok/s step 2158/19560 | loss 3.916226 (+1.97z)| norm 0.4281 (+3.46z)| lr 5.91e-04 | 4155.32 ms | 32.5% bf16 MFU | 125843 tok/s step 2159/19560 | loss 3.897673 (+1.54z)| norm 0.3675 (+1.70z)| lr 5.91e-04 | 4151.27 ms | 32.5% bf16 MFU | 125865 tok/s step 2160/19560 | loss 3.876594 (+1.09z)| norm 0.3569 (+1.38z)| lr 5.91e-04 | 4163.79 ms | 32.4% bf16 MFU | 125868 tok/s step 2161/19560 | loss 3.838990 (+0.27z)| norm 0.3384 (+0.86z)| lr 5.91e-04 | 4162.16 ms | 32.4% bf16 MFU | 125873 tok/s step 2162/19560 | loss 3.849175 (+0.49z)| norm 0.3087 (+0.05z)| lr 5.91e-04 | 4153.51 ms | 32.5% bf16 MFU | 125891 tok/s step 2163/19560 | loss 3.824310 (-0.05z)| norm 0.2844 (-0.63z)| lr 5.91e-04 | 4148.84 ms | 32.5% bf16 MFU | 125914 tok/s step 2164/19560 | loss 3.804903 (-0.47z)| norm 0.2931 (-0.36z)| lr 5.91e-04 | 4161.15 ms | 32.4% bf16 MFU | 125919 tok/s step 2165/19560 | loss 3.913257 (+1.88z)| norm 0.3057 (+0.00z)| lr 5.91e-04 | 4159.03 ms | 32.5% bf16 MFU | 125926 tok/s step 2166/19560 | loss 3.814297 (-0.26z)| norm 0.2810 (-0.71z)| lr 5.91e-04 | 4164.46 ms | 32.4% bf16 MFU | 125924 tok/s step 2167/19560 | loss 3.788776 (-0.80z)| norm 0.2721 (-0.97z)| lr 5.91e-04 | 4165.91 ms | 32.4% bf16 MFU | 125921 tok/s step 2168/19560 | loss 3.894123 (+1.47z)| norm 0.2888 (-0.49z)| lr 5.91e-04 | 4155.51 ms | 32.5% bf16 MFU | 125933 tok/s step 2169/19560 | loss 3.818042 (-0.16z)| norm 0.3087 (+0.08z)| lr 5.91e-04 | 4156.80 ms | 32.5% bf16 MFU | 125943 tok/s step 2170/19560 | loss 
3.744581 (-1.74z)| norm 0.3029 (-0.10z)| lr 5.91e-04 | 4164.72 ms | 32.4% bf16 MFU | 125940 tok/s step 2171/19560 | loss 3.804596 (-0.44z)| norm 0.2973 (-0.27z)| lr 5.91e-04 | 4187.91 ms | 32.2% bf16 MFU | 125902 tok/s step 2172/19560 | loss 3.828866 (+0.09z)| norm 0.3112 (+0.14z)| lr 5.91e-04 | 4154.01 ms | 32.5% bf16 MFU | 125918 tok/s step 2173/19560 | loss 3.820192 (-0.09z)| norm 0.2874 (-0.57z)| lr 5.91e-04 | 4162.93 ms | 32.4% bf16 MFU | 125919 tok/s step 2174/19560 | loss 3.802448 (-0.47z)| norm 0.3020 (-0.15z)| lr 5.91e-04 | 4158.69 ms | 32.5% bf16 MFU | 125927 tok/s step 2175/19560 | loss 3.737890 (-1.83z)| norm 0.3216 (+0.43z)| lr 5.91e-04 | 4167.18 ms | 32.4% bf16 MFU | 125921 tok/s step 2176/19560 | loss 3.857131 (+0.73z)| norm 0.3278 (+0.60z)| lr 5.91e-04 | 4160.16 ms | 32.5% bf16 MFU | 125926 tok/s step 2177/19560 | loss 3.808117 (-0.32z)| norm 0.3142 (+0.20z)| lr 5.91e-04 | 4183.91 ms | 32.3% bf16 MFU | 125895 tok/s step 2178/19560 | loss 3.797050 (-0.56z)| norm 0.3936 (+2.48z)| lr 5.91e-04 | 4154.81 ms | 32.5% bf16 MFU | 125910 tok/s step 2179/19560 | loss 3.834428 (+0.24z)| norm 0.3676 (+1.70z)| lr 5.91e-04 | 4160.71 ms | 32.5% bf16 MFU | 125915 tok/s step 2180/19560 | loss 3.825300 (+0.04z)| norm 0.3157 (+0.22z)| lr 5.91e-04 | 4178.16 ms | 32.3% bf16 MFU | 125893 tok/s step 2181/19560 | loss 3.797801 (-0.54z)| norm 0.2793 (-0.81z)| lr 5.91e-04 | 4161.32 ms | 32.4% bf16 MFU | 125898 tok/s step 2182/19560 | loss 3.788971 (-0.72z)| norm 0.2839 (-0.67z)| lr 5.91e-04 | 4523.29 ms | 29.8% bf16 MFU | 125399 tok/s step 2183/19560 | loss 3.830750 (+0.17z)| norm 0.2843 (-0.66z)| lr 5.91e-04 | 4153.88 ms | 32.5% bf16 MFU | 125440 tok/s step 2184/19560 | loss 3.832784 (+0.21z)| norm 0.2595 (-1.36z)| lr 5.91e-04 | 4160.12 ms | 32.5% bf16 MFU | 125469 tok/s step 2185/19560 | loss 3.820331 (-0.05z)| norm 0.2506 (-1.59z)| lr 5.91e-04 | 4160.67 ms | 32.5% bf16 MFU | 125496 tok/s step 2186/19560 | loss 3.839090 (+0.35z)| norm 0.2755 (-0.88z)| lr 5.91e-04 | 4224.13 
ms | 32.0% bf16 MFU | 125427 tok/s step 2187/19560 | loss 3.763432 (-1.26z)| norm 0.2822 (-0.68z)| lr 5.91e-04 | 4326.26 ms | 31.2% bf16 MFU | 125215 tok/s step 2188/19560 | loss 3.728955 (-1.96z)| norm 0.2990 (-0.21z)| lr 5.91e-04 | 4158.61 ms | 32.5% bf16 MFU | 125258 tok/s step 2189/19560 | loss 3.808696 (-0.27z)| norm 0.2787 (-0.77z)| lr 5.91e-04 | 4163.58 ms | 32.4% bf16 MFU | 125291 tok/s step 2190/19560 | loss 3.776823 (-0.93z)| norm 0.2721 (-0.94z)| lr 5.91e-04 | 4467.38 ms | 30.2% bf16 MFU | 124895 tok/s step 2191/19560 | loss 3.857473 (+0.77z)| norm 0.2552 (-1.39z)| lr 5.91e-04 | 4177.52 ms | 32.3% bf16 MFU | 124925 tok/s step 2192/19560 | loss 3.856820 (+0.75z)| norm 0.2451 (-1.65z)| lr 5.91e-04 | 4200.84 ms | 32.1% bf16 MFU | 124919 tok/s step 2193/19560 | loss 3.801160 (-0.42z)| norm 0.2850 (-0.55z)| lr 5.91e-04 | 4162.79 ms | 32.4% bf16 MFU | 124970 tok/s step 2194/19560 | loss 3.805058 (-0.34z)| norm 0.2740 (-0.85z)| lr 5.91e-04 | 4163.75 ms | 32.4% bf16 MFU | 125018 tok/s step 2195/19560 | loss 3.783479 (-0.78z)| norm 0.2509 (-1.46z)| lr 5.91e-04 | 4197.53 ms | 32.2% bf16 MFU | 125012 tok/s step 2196/19560 | loss 3.880540 (+1.26z)| norm 0.2846 (-0.55z)| lr 5.91e-04 | 4153.12 ms | 32.5% bf16 MFU | 125073 tok/s step 2197/19560 | loss 3.944809 (+2.54z)| norm 0.3014 (-0.10z)| lr 5.91e-04 | 4161.29 ms | 32.4% bf16 MFU | 125119 tok/s step 2198/19560 | loss 3.786709 (-0.71z)| norm 0.3006 (-0.12z)| lr 5.91e-04 | 4353.35 ms | 31.0% bf16 MFU | 124885 tok/s step 2199/19560 | loss 3.858927 (+0.76z)| norm 0.3152 (+0.27z)| lr 5.91e-04 | 4179.36 ms | 32.3% bf16 MFU | 124913 tok/s step 2200/19560 | loss 3.852872 (+0.63z)| norm 0.3180 (+0.35z)| lr 5.91e-04 | 4156.72 ms | 32.5% bf16 MFU | 124974 tok/s step 2201/19560 | loss 3.783269 (-0.79z)| norm 0.3499 (+1.25z)| lr 5.91e-04 | 4185.61 ms | 32.3% bf16 MFU | 124988 tok/s step 2202/19560 | loss 3.756287 (-1.32z)| norm 0.3329 (+0.79z)| lr 5.91e-04 | 4152.88 ms | 32.5% bf16 MFU | 125051 tok/s step 2203/19560 | loss 
3.806162 (-0.30z)| norm 0.3164 (+0.34z)| lr 5.91e-04 | 4159.70 ms | 32.5% bf16 MFU | 125101 tok/s step 2204/19560 | loss 3.798590 (-0.45z)| norm 0.3285 (+0.67z)| lr 5.91e-04 | 4192.49 ms | 32.2% bf16 MFU | 125098 tok/s step 2205/19560 | loss 3.810689 (-0.20z)| norm 0.3379 (+0.92z)| lr 5.91e-04 | 4191.67 ms | 32.2% bf16 MFU | 125097 tok/s step 2206/19560 | loss 3.827519 (+0.15z)| norm 0.3305 (+0.71z)| lr 5.91e-04 | 4173.22 ms | 32.4% bf16 MFU | 125124 tok/s step 2207/19560 | loss 3.795441 (-0.51z)| norm 0.3467 (+1.15z)| lr 5.91e-04 | 4189.69 ms | 32.2% bf16 MFU | 125125 tok/s step 2208/19560 | loss 3.781320 (-0.79z)| norm 0.3155 (+0.28z)| lr 5.91e-04 | 4162.32 ms | 32.4% bf16 MFU | 125167 tok/s step 2209/19560 | loss 3.842386 (+0.46z)| norm 0.3300 (+0.70z)| lr 5.91e-04 | 4165.93 ms | 32.4% bf16 MFU | 125201 tok/s step 2210/19560 | loss 3.786338 (-0.69z)| norm 0.3871 (+2.26z)| lr 5.91e-04 | 4156.57 ms | 32.5% bf16 MFU | 125247 tok/s step 2211/19560 | loss 3.831232 (+0.23z)| norm 0.3765 (+1.92z)| lr 5.91e-04 | 4183.02 ms | 32.3% bf16 MFU | 125252 tok/s step 2212/19560 | loss 3.858774 (+0.80z)| norm 0.3454 (+1.06z)| lr 5.91e-04 | 4172.36 ms | 32.4% bf16 MFU | 125272 tok/s step 2213/19560 | loss 3.761256 (-1.19z)| norm 0.3254 (+0.51z)| lr 5.91e-04 | 4156.50 ms | 32.5% bf16 MFU | 125315 tok/s step 2214/19560 | loss 3.766122 (-1.08z)| norm 0.3350 (+0.76z)| lr 5.91e-04 | 4182.81 ms | 32.3% bf16 MFU | 125317 tok/s step 2215/19560 | loss 3.781798 (-0.76z)| norm 0.3217 (+0.39z)| lr 5.91e-04 | 4168.71 ms | 32.4% bf16 MFU | 125339 tok/s step 2216/19560 | loss 3.798418 (-0.41z)| norm 0.2727 (-0.94z)| lr 5.90e-04 | 6044.04 ms | 22.3% bf16 MFU | 123410 tok/s step 2217/19560 | loss 3.797201 (-0.43z)| norm 0.3091 (+0.05z)| lr 5.90e-04 | 4165.67 ms | 32.4% bf16 MFU | 123532 tok/s step 2218/19560 | loss 3.851327 (+0.67z)| norm 0.2865 (-0.58z)| lr 5.90e-04 | 4169.01 ms | 32.4% bf16 MFU | 123643 tok/s step 2219/19560 | loss 3.817162 (-0.03z)| norm 0.2826 (-0.70z)| lr 5.90e-04 | 4155.83 
ms | 32.5% bf16 MFU | 123769 tok/s step 2220/19560 | loss 3.830940 (+0.29z)| norm 0.2571 (-1.39z)| lr 5.90e-04 | 4156.27 ms | 32.5% bf16 MFU | 123888 tok/s step 2221/19560 | loss 3.761483 (-1.21z)| norm 0.2720 (-0.97z)| lr 5.90e-04 | 4240.75 ms | 31.8% bf16 MFU | 123875 tok/s step 2222/19560 | loss 3.805115 (-0.26z)| norm 0.2929 (-0.38z)| lr 5.90e-04 | 4164.09 ms | 32.4% bf16 MFU | 123977 tok/s step 2223/19560 | loss 3.789645 (-0.59z)| norm 0.2931 (-0.36z)| lr 5.90e-04 | 4153.12 ms | 32.5% bf16 MFU | 124090 tok/s step 2224/19560 | loss 3.828843 (+0.27z)| norm 0.3157 (+0.29z)| lr 5.90e-04 | 4191.04 ms | 32.2% bf16 MFU | 124140 tok/s step 2225/19560 | loss 3.760070 (-1.23z)| norm 0.3147 (+0.27z)| lr 5.90e-04 | 4195.03 ms | 32.2% bf16 MFU | 124182 tok/s step 2226/19560 | loss 3.806163 (-0.21z)| norm 0.2982 (-0.19z)| lr 5.90e-04 | 4159.74 ms | 32.5% bf16 MFU | 124275 tok/s step 2227/19560 | loss 3.815717 (-0.00z)| norm 0.3153 (+0.32z)| lr 5.90e-04 | 4167.90 ms | 32.4% bf16 MFU | 124351 tok/s step 2228/19560 | loss 3.765381 (-1.10z)| norm 0.3180 (+0.40z)| lr 5.90e-04 | 4172.03 ms | 32.4% bf16 MFU | 124417 tok/s step 2229/19560 | loss 3.820091 (+0.12z)| norm 0.2876 (-0.48z)| lr 5.90e-04 | 4161.18 ms | 32.4% bf16 MFU | 124495 tok/s step 2230/19560 | loss 3.773927 (-0.91z)| norm 0.2884 (-0.46z)| lr 5.90e-04 | 4154.65 ms | 32.5% bf16 MFU | 124580 tok/s step 2231/19560 | loss 3.754441 (-1.33z)| norm 0.2670 (-1.07z)| lr 5.90e-04 | 4157.96 ms | 32.5% bf16 MFU | 124656 tok/s step 2232/19560 | loss 3.801570 (-0.29z)| norm 0.2655 (-1.10z)| lr 5.90e-04 | 4173.80 ms | 32.3% bf16 MFU | 124704 tok/s step 2233/19560 | loss 3.792702 (-0.48z)| norm 0.2656 (-1.08z)| lr 5.90e-04 | 4159.57 ms | 32.5% bf16 MFU | 124771 tok/s step 2234/19560 | loss 3.816765 (+0.07z)| norm 0.2897 (-0.38z)| lr 5.90e-04 | 4152.43 ms | 32.5% bf16 MFU | 124845 tok/s step 2235/19560 | loss 3.776451 (-0.83z)| norm 0.3031 (+0.00z)| lr 5.90e-04 | 4162.15 ms | 32.4% bf16 MFU | 124901 tok/s step 2236/19560 | loss 
3.809942 (-0.07z)| norm 0.3075 (+0.13z)| lr 5.90e-04 | 4177.64 ms | 32.3% bf16 MFU | 124931 tok/s step 2237/19560 | loss 3.799430 (-0.30z)| norm 0.2845 (-0.54z)| lr 5.90e-04 | 4168.96 ms | 32.4% bf16 MFU | 124973 tok/s step 2238/19560 | loss 3.789474 (-0.52z)| norm 0.3059 (+0.08z)| lr 5.90e-04 | 4176.29 ms | 32.3% bf16 MFU | 125001 tok/s step 2239/19560 | loss 3.807614 (-0.11z)| norm 0.3342 (+0.89z)| lr 5.90e-04 | 4172.34 ms | 32.4% bf16 MFU | 125034 tok/s step 2240/19560 | loss 3.801998 (-0.22z)| norm 0.2924 (-0.33z)| lr 5.90e-04 | 4198.72 ms | 32.2% bf16 MFU | 125026 tok/s step 2241/19560 | loss 3.844355 (+0.80z)| norm 0.2938 (-0.29z)| lr 5.90e-04 | 4175.58 ms | 32.3% bf16 MFU | 125052 tok/s step 2242/19560 | loss 3.817542 (+0.15z)| norm 0.3052 (+0.04z)| lr 5.90e-04 | 4180.36 ms | 32.3% bf16 MFU | 125071 tok/s step 2243/19560 | loss 3.764951 (-1.12z)| norm 0.2960 (-0.22z)| lr 5.90e-04 | 4170.08 ms | 32.4% bf16 MFU | 125103 tok/s step 2244/19560 | loss 3.710368 (-2.40z)| norm 0.3085 (+0.14z)| lr 5.90e-04 | 4165.99 ms | 32.4% bf16 MFU | 125141 tok/s step 2245/19560 | loss 3.790611 (-0.46z)| norm 0.2973 (-0.18z)| lr 5.90e-04 | 4169.37 ms | 32.4% bf16 MFU | 125171 tok/s step 2246/19560 | loss 3.843119 (+0.81z)| norm 0.2959 (-0.22z)| lr 5.90e-04 | 4163.55 ms | 32.4% bf16 MFU | 125209 tok/s step 2247/19560 | loss 3.782902 (-0.63z)| norm 0.3003 (-0.10z)| lr 5.90e-04 | 4166.69 ms | 32.4% bf16 MFU | 125240 tok/s step 2248/19560 | loss 3.832277 (+0.56z)| norm 0.2887 (-0.44z)| lr 5.90e-04 | 4174.89 ms | 32.3% bf16 MFU | 125257 tok/s step 2249/19560 | loss 3.787800 (-0.50z)| norm 0.3009 (-0.09z)| lr 5.90e-04 | 4148.89 ms | 32.5% bf16 MFU | 125312 tok/s step 2250/19560 | loss 3.795244 (-0.32z)| norm 0.3164 (+0.37z)| lr 5.90e-04 | 4167.96 ms | 32.4% bf16 MFU | 125336 tok/s val loss 3.815366 evaluating HellaSwag: 0/1256 evaluating HellaSwag: 10/1256 evaluating HellaSwag: 20/1256 evaluating HellaSwag: 30/1256 evaluating HellaSwag: 40/1256 evaluating HellaSwag: 50/1256 evaluating 
HellaSwag: 60/1256 evaluating HellaSwag: 70/1256 evaluating HellaSwag: 80/1256 evaluating HellaSwag: 90/1256 evaluating HellaSwag: 100/1256 evaluating HellaSwag: 110/1256 evaluating HellaSwag: 120/1256 evaluating HellaSwag: 130/1256 evaluating HellaSwag: 140/1256 evaluating HellaSwag: 150/1256 evaluating HellaSwag: 160/1256 evaluating HellaSwag: 170/1256 evaluating HellaSwag: 180/1256 evaluating HellaSwag: 190/1256 evaluating HellaSwag: 200/1256 evaluating HellaSwag: 210/1256 evaluating HellaSwag: 220/1256 evaluating HellaSwag: 230/1256 evaluating HellaSwag: 240/1256 evaluating HellaSwag: 250/1256 evaluating HellaSwag: 260/1256 evaluating HellaSwag: 270/1256 evaluating HellaSwag: 280/1256 evaluating HellaSwag: 290/1256 evaluating HellaSwag: 300/1256 evaluating HellaSwag: 310/1256 evaluating HellaSwag: 320/1256 evaluating HellaSwag: 330/1256 evaluating HellaSwag: 340/1256 evaluating HellaSwag: 350/1256 evaluating HellaSwag: 360/1256 evaluating HellaSwag: 370/1256 evaluating HellaSwag: 380/1256 evaluating HellaSwag: 390/1256 evaluating HellaSwag: 400/1256 evaluating HellaSwag: 410/1256 evaluating HellaSwag: 420/1256 evaluating HellaSwag: 430/1256 evaluating HellaSwag: 440/1256 evaluating HellaSwag: 450/1256 evaluating HellaSwag: 460/1256 evaluating HellaSwag: 470/1256 evaluating HellaSwag: 480/1256 evaluating HellaSwag: 490/1256 evaluating HellaSwag: 500/1256 evaluating HellaSwag: 510/1256 evaluating HellaSwag: 520/1256 evaluating HellaSwag: 530/1256 evaluating HellaSwag: 540/1256 evaluating HellaSwag: 550/1256 evaluating HellaSwag: 560/1256 evaluating HellaSwag: 570/1256 evaluating HellaSwag: 580/1256 evaluating HellaSwag: 590/1256 evaluating HellaSwag: 600/1256 evaluating HellaSwag: 610/1256 evaluating HellaSwag: 620/1256 evaluating HellaSwag: 630/1256 evaluating HellaSwag: 640/1256 evaluating HellaSwag: 650/1256 evaluating HellaSwag: 660/1256 evaluating HellaSwag: 670/1256 evaluating HellaSwag: 680/1256 evaluating HellaSwag: 690/1256 evaluating HellaSwag: 700/1256 
evaluating HellaSwag: 710/1256 evaluating HellaSwag: 720/1256 evaluating HellaSwag: 730/1256 evaluating HellaSwag: 740/1256 evaluating HellaSwag: 750/1256 evaluating HellaSwag: 760/1256 evaluating HellaSwag: 770/1256 evaluating HellaSwag: 780/1256 evaluating HellaSwag: 790/1256 evaluating HellaSwag: 800/1256 evaluating HellaSwag: 810/1256 evaluating HellaSwag: 820/1256 evaluating HellaSwag: 830/1256 evaluating HellaSwag: 840/1256 evaluating HellaSwag: 850/1256 evaluating HellaSwag: 860/1256 evaluating HellaSwag: 870/1256 evaluating HellaSwag: 880/1256 evaluating HellaSwag: 890/1256 evaluating HellaSwag: 900/1256 evaluating HellaSwag: 910/1256 evaluating HellaSwag: 920/1256 evaluating HellaSwag: 930/1256 evaluating HellaSwag: 940/1256 evaluating HellaSwag: 950/1256 evaluating HellaSwag: 960/1256 evaluating HellaSwag: 970/1256 evaluating HellaSwag: 980/1256 evaluating HellaSwag: 990/1256 evaluating HellaSwag: 1000/1256 evaluating HellaSwag: 1010/1256 evaluating HellaSwag: 1020/1256 evaluating HellaSwag: 1030/1256 evaluating HellaSwag: 1040/1256 evaluating HellaSwag: 1050/1256 evaluating HellaSwag: 1060/1256 evaluating HellaSwag: 1070/1256 evaluating HellaSwag: 1080/1256 evaluating HellaSwag: 1090/1256 evaluating HellaSwag: 1100/1256 evaluating HellaSwag: 1110/1256 evaluating HellaSwag: 1120/1256 evaluating HellaSwag: 1130/1256 evaluating HellaSwag: 1140/1256 evaluating HellaSwag: 1150/1256 evaluating HellaSwag: 1160/1256 evaluating HellaSwag: 1170/1256 evaluating HellaSwag: 1180/1256 evaluating HellaSwag: 1190/1256 evaluating HellaSwag: 1200/1256 evaluating HellaSwag: 1210/1256 evaluating HellaSwag: 1220/1256 evaluating HellaSwag: 1230/1256 evaluating HellaSwag: 1240/1256 evaluating HellaSwag: 1250/1256 HellaSwag: 2619/10042 = 0.260805 step 2251/19560 | loss 3.828815 (+0.49z)| norm 0.3436 (+1.16z)| lr 5.90e-04 | 4170.16 ms | 32.4% bf16 MFU | 125356 tok/s step 2252/19560 | loss 3.816211 (+0.16z)| norm 0.3179 (+0.42z)| lr 5.90e-04 | 4174.17 ms | 32.3% bf16 MFU | 125368 
tok/s step 2253/19560 | loss 3.863175 (+1.31z)| norm 0.3471 (+1.26z)| lr 5.90e-04 | 4160.71 ms | 32.5% bf16 MFU | 125400 tok/s step 2254/19560 | loss 3.761638 (-1.18z)| norm 0.3643 (+1.76z)| lr 5.90e-04 | 4159.01 ms | 32.5% bf16 MFU | 125433 tok/s step 2255/19560 | loss 3.734898 (-1.83z)| norm 0.3698 (+1.89z)| lr 5.90e-04 | 4308.99 ms | 31.3% bf16 MFU | 125245 tok/s step 2256/19560 | loss 3.755124 (-1.31z)| norm 0.3209 (+0.48z)| lr 5.90e-04 | 4187.36 ms | 32.2% bf16 MFU | 125243 tok/s step 2257/19560 | loss 3.758832 (-1.22z)| norm 0.3359 (+0.91z)| lr 5.90e-04 | 4159.85 ms | 32.5% bf16 MFU | 125283 tok/s step 2258/19560 | loss 3.846058 (+0.91z)| norm 0.3433 (+1.10z)| lr 5.90e-04 | 4160.82 ms | 32.4% bf16 MFU | 125319 tok/s step 2259/19560 | loss 3.903278 (+2.25z)| norm 0.3356 (+0.87z)| lr 5.90e-04 | 4357.48 ms | 31.0% bf16 MFU | 125069 tok/s step 2260/19560 | loss 3.791344 (-0.44z)| norm 0.3031 (-0.07z)| lr 5.90e-04 | 4168.56 ms | 32.4% bf16 MFU | 125104 tok/s step 2261/19560 | loss 3.726915 (-1.94z)| norm 0.3060 (+0.01z)| lr 5.90e-04 | 4159.44 ms | 32.5% bf16 MFU | 125151 tok/s step 2262/19560 | loss 3.827939 (+0.46z)| norm 0.3082 (+0.06z)| lr 5.90e-04 | 4176.62 ms | 32.3% bf16 MFU | 125170 tok/s step 2263/19560 | loss 3.803275 (-0.12z)| norm 0.2869 (-0.55z)| lr 5.90e-04 | 4159.10 ms | 32.5% bf16 MFU | 125215 tok/s step 2264/19560 | loss 3.855441 (+1.11z)| norm 0.3071 (+0.03z)| lr 5.90e-04 | 4152.11 ms | 32.5% bf16 MFU | 125267 tok/s step 2265/19560 | loss 3.806826 (-0.04z)| norm 0.2848 (-0.61z)| lr 5.90e-04 | 4178.08 ms | 32.3% bf16 MFU | 125278 tok/s step 2266/19560 | loss 3.780943 (-0.64z)| norm 0.2789 (-0.77z)| lr 5.90e-04 | 4157.49 ms | 32.5% bf16 MFU | 125320 tok/s step 2267/19560 | loss 3.788132 (-0.47z)| norm 0.2610 (-1.27z)| lr 5.90e-04 | 4188.21 ms | 32.2% bf16 MFU | 125313 tok/s step 2268/19560 | loss 3.750813 (-1.34z)| norm 0.2880 (-0.50z)| lr 5.90e-04 | 4160.94 ms | 32.4% bf16 MFU | 125347 tok/s step 2269/19560 | loss 3.832976 (+0.62z)| norm 0.3110 
(+0.16z)| lr 5.90e-04 | 4165.95 ms | 32.4% bf16 MFU | 125372 tok/s step 2270/19560 | loss 3.845807 (+0.92z)| norm 0.2912 (-0.40z)| lr 5.90e-04 | 4158.02 ms | 32.5% bf16 MFU | 125408 tok/s step 2271/19560 | loss 3.781963 (-0.60z)| norm 0.2891 (-0.46z)| lr 5.90e-04 | 4168.91 ms | 32.4% bf16 MFU | 125426 tok/s step 2272/19560 | loss 3.777178 (-0.71z)| norm 0.3072 (+0.06z)| lr 5.90e-04 | 4158.14 ms | 32.5% bf16 MFU | 125459 tok/s step 2273/19560 | loss 3.784269 (-0.55z)| norm 0.3098 (+0.13z)| lr 5.90e-04 | 4445.99 ms | 30.4% bf16 MFU | 125082 tok/s step 2274/19560 | loss 3.788478 (-0.46z)| norm 0.2785 (-0.79z)| lr 5.90e-04 | 4153.30 ms | 32.5% bf16 MFU | 125140 tok/s step 2275/19560 | loss 3.829581 (+0.53z)| norm 0.2769 (-0.85z)| lr 5.90e-04 | 4160.19 ms | 32.5% bf16 MFU | 125184 tok/s step 2276/19560 | loss 3.804536 (-0.07z)| norm 0.3078 (+0.05z)| lr 5.90e-04 | 4168.16 ms | 32.4% bf16 MFU | 125214 tok/s step 2277/19560 | loss 3.706252 (-2.39z)| norm 0.3219 (+0.45z)| lr 5.90e-04 | 4186.50 ms | 32.3% bf16 MFU | 125215 tok/s step 2278/19560 | loss 3.809130 (+0.03z)| norm 0.2681 (-1.16z)| lr 5.90e-04 | 4177.21 ms | 32.3% bf16 MFU | 125230 tok/s step 2279/19560 | loss 3.831012 (+0.56z)| norm 0.2694 (-1.13z)| lr 5.90e-04 | 4146.84 ms | 32.6% bf16 MFU | 125290 tok/s step 2280/19560 | loss 3.782355 (-0.61z)| norm 0.3001 (-0.21z)| lr 5.90e-04 | 4154.96 ms | 32.5% bf16 MFU | 125335 tok/s step 2281/19560 | loss 3.815214 (+0.18z)| norm 0.3196 (+0.36z)| lr 5.90e-04 | 4195.13 ms | 32.2% bf16 MFU | 125317 tok/s step 2282/19560 | loss 3.806558 (-0.05z)| norm 0.3176 (+0.30z)| lr 5.90e-04 | 4172.02 ms | 32.4% bf16 MFU | 125334 tok/s step 2283/19560 | loss 3.739012 (-1.67z)| norm 0.3250 (+0.52z)| lr 5.90e-04 | 4160.29 ms | 32.5% bf16 MFU | 125369 tok/s step 2284/19560 | loss 3.778595 (-0.71z)| norm 0.2767 (-0.93z)| lr 5.90e-04 | 4286.83 ms | 31.5% bf16 MFU | 125215 tok/s step 2285/19560 | loss 3.818471 (+0.27z)| norm 0.2878 (-0.58z)| lr 5.90e-04 | 4171.02 ms | 32.4% bf16 MFU | 125239 
tok/s step 2286/19560 | loss 3.756441 (-1.25z)| norm 0.3198 (+0.52z)| lr 5.90e-04 | 4163.93 ms | 32.4% bf16 MFU | 125273 tok/s step 2287/19560 | loss 3.794613 (-0.28z)| norm 0.2984 (-0.21z)| lr 5.90e-04 | 4158.89 ms | 32.5% bf16 MFU | 125313 tok/s step 2288/19560 | loss 3.852985 (+1.22z)| norm 0.2663 (-1.33z)| lr 5.90e-04 | 4195.91 ms | 32.2% bf16 MFU | 125295 tok/s step 2289/19560 | loss 3.783355 (-0.55z)| norm 0.3056 (+0.08z)| lr 5.90e-04 | 4162.85 ms | 32.4% bf16 MFU | 125327 tok/s step 2290/19560 | loss 3.786164 (-0.47z)| norm 0.3271 (+0.84z)| lr 5.90e-04 | 4163.71 ms | 32.4% bf16 MFU | 125357 tok/s step 2291/19560 | loss 3.786309 (-0.46z)| norm 0.3176 (+0.50z)| lr 5.90e-04 | 4179.63 ms | 32.3% bf16 MFU | 125361 tok/s step 2292/19560 | loss 3.749980 (-1.38z)| norm 0.2868 (-0.61z)| lr 5.90e-04 | 4165.55 ms | 32.4% bf16 MFU | 125386 tok/s step 2293/19560 | loss 3.757661 (-1.18z)| norm 0.3012 (-0.09z)| lr 5.90e-04 | 4155.09 ms | 32.5% bf16 MFU | 125425 tok/s step 2294/19560 | loss 3.804875 (+0.06z)| norm 0.2783 (-0.90z)| lr 5.90e-04 | 4225.69 ms | 32.0% bf16 MFU | 125358 tok/s step 2295/19560 | loss 3.886129 (+2.14z)| norm 0.3060 (+0.07z)| lr 5.89e-04 | 4161.02 ms | 32.4% bf16 MFU | 125390 tok/s step 2296/19560 | loss 3.774753 (-0.72z)| norm 0.3238 (+0.70z)| lr 5.89e-04 | 4171.94 ms | 32.4% bf16 MFU | 125404 tok/s step 2297/19560 | loss 3.788806 (-0.35z)| norm 0.3171 (+0.46z)| lr 5.89e-04 | 4157.94 ms | 32.5% bf16 MFU | 125438 tok/s step 2298/19560 | loss 3.800770 (-0.04z)| norm 0.3019 (-0.09z)| lr 5.89e-04 | 4162.60 ms | 32.4% bf16 MFU | 125464 tok/s step 2299/19560 | loss 3.762851 (-1.04z)| norm 0.2933 (-0.39z)| lr 5.89e-04 | 4164.57 ms | 32.4% bf16 MFU | 125485 tok/s step 2300/19560 | loss 3.753711 (-1.26z)| norm 0.2988 (-0.19z)| lr 5.89e-04 | 4184.86 ms | 32.3% bf16 MFU | 125475 tok/s step 2301/19560 | loss 3.747724 (-1.40z)| norm 0.3063 (+0.07z)| lr 5.89e-04 | 4156.56 ms | 32.5% bf16 MFU | 125508 tok/s step 2302/19560 | loss 3.820301 (+0.50z)| norm 0.2906 
(-0.49z)| lr 5.89e-04 | 4167.58 ms | 32.4% bf16 MFU | 125523 tok/s step 2303/19560 | loss 3.794014 (-0.20z)| norm 0.2826 (-0.76z)| lr 5.89e-04 | 4172.89 ms | 32.4% bf16 MFU | 125529 tok/s step 2304/19560 | loss 3.770817 (-0.80z)| norm 0.3120 (+0.30z)| lr 5.89e-04 | 4179.01 ms | 32.3% bf16 MFU | 125525 tok/s step 2305/19560 | loss 3.782331 (-0.49z)| norm 0.3032 (-0.02z)| lr 5.89e-04 | 4167.56 ms | 32.4% bf16 MFU | 125539 tok/s step 2306/19560 | loss 3.854100 (+1.40z)| norm 0.2624 (-1.50z)| lr 5.89e-04 | 4156.68 ms | 32.5% bf16 MFU | 125569 tok/s step 2307/19560 | loss 3.805388 (+0.12z)| norm 0.2937 (-0.32z)| lr 5.89e-04 | 4166.87 ms | 32.4% bf16 MFU | 125581 tok/s step 2308/19560 | loss 3.792697 (-0.21z)| norm 0.2994 (-0.10z)| lr 5.89e-04 | 4165.41 ms | 32.4% bf16 MFU | 125596 tok/s step 2309/19560 | loss 3.794610 (-0.16z)| norm 0.2726 (-1.11z)| lr 5.89e-04 | 4167.59 ms | 32.4% bf16 MFU | 125606 tok/s step 2310/19560 | loss 3.801817 (+0.03z)| norm 0.2994 (-0.10z)| lr 5.89e-04 | 4176.60 ms | 32.3% bf16 MFU | 125602 tok/s step 2311/19560 | loss 3.755339 (-1.18z)| norm 0.2881 (-0.53z)| lr 5.89e-04 | 4179.33 ms | 32.3% bf16 MFU | 125594 tok/s step 2312/19560 | loss 3.861110 (+1.60z)| norm 0.3039 (+0.06z)| lr 5.89e-04 | 4170.18 ms | 32.4% bf16 MFU | 125601 tok/s step 2313/19560 | loss 3.831297 (+0.81z)| norm 0.2930 (-0.38z)| lr 5.89e-04 | 4186.81 ms | 32.2% bf16 MFU | 125582 tok/s step 2314/19560 | loss 3.761031 (-1.02z)| norm 0.2996 (-0.13z)| lr 5.89e-04 | 4173.90 ms | 32.3% bf16 MFU | 125584 tok/s step 2315/19560 | loss 3.831473 (+0.82z)| norm 0.2983 (-0.18z)| lr 5.89e-04 | 4157.27 ms | 32.5% bf16 MFU | 125610 tok/s step 2316/19560 | loss 3.821130 (+0.53z)| norm 0.2674 (-1.38z)| lr 5.89e-04 | 4158.73 ms | 32.5% bf16 MFU | 125633 tok/s step 2317/19560 | loss 3.807215 (+0.16z)| norm 0.3005 (-0.10z)| lr 5.89e-04 | 4175.17 ms | 32.3% bf16 MFU | 125630 tok/s step 2318/19560 | loss 3.795362 (-0.16z)| norm 0.2960 (-0.28z)| lr 5.89e-04 | 4179.92 ms | 32.3% bf16 MFU | 125620 
tok/s step 2319/19560 | loss 3.798991 (-0.05z)| norm 0.2771 (-1.04z)| lr 5.89e-04 | 4188.05 ms | 32.2% bf16 MFU | 125598 tok/s step 2320/19560 | loss 3.762397 (-1.02z)| norm 0.2662 (-1.51z)| lr 5.89e-04 | 4165.09 ms | 32.4% bf16 MFU | 125612 tok/s step 2321/19560 | loss 3.843575 (+1.16z)| norm 0.2941 (-0.38z)| lr 5.89e-04 | 4162.21 ms | 32.4% bf16 MFU | 125630 tok/s step 2322/19560 | loss 3.787609 (-0.34z)| norm 0.3354 (+1.26z)| lr 5.89e-04 | 4166.28 ms | 32.4% bf16 MFU | 125640 tok/s step 2323/19560 | loss 3.800735 (+0.01z)| norm 0.3725 (+2.70z)| lr 5.89e-04 | 4173.73 ms | 32.3% bf16 MFU | 125639 tok/s step 2324/19560 | loss 3.779675 (-0.54z)| norm 0.3741 (+2.66z)| lr 5.89e-04 | 4166.48 ms | 32.4% bf16 MFU | 125649 tok/s step 2325/19560 | loss 3.727517 (-2.02z)| norm 0.3013 (-0.17z)| lr 5.89e-04 | 4149.49 ms | 32.5% bf16 MFU | 125684 tok/s step 2326/19560 | loss 3.773665 (-0.69z)| norm 0.2906 (-0.58z)| lr 5.89e-04 | 4171.97 ms | 32.4% bf16 MFU | 125683 tok/s step 2327/19560 | loss 3.764640 (-0.94z)| norm 0.2626 (-1.64z)| lr 5.89e-04 | 4170.33 ms | 32.4% bf16 MFU | 125685 tok/s step 2328/19560 | loss 3.790701 (-0.17z)| norm 0.2643 (-1.54z)| lr 5.89e-04 | 4167.04 ms | 32.4% bf16 MFU | 125692 tok/s step 2329/19560 | loss 3.738214 (-1.68z)| norm 0.2742 (-1.15z)| lr 5.89e-04 | 4162.08 ms | 32.4% bf16 MFU | 125705 tok/s step 2330/19560 | loss 3.839397 (+1.23z)| norm 0.2961 (-0.30z)| lr 5.89e-04 | 4178.20 ms | 32.3% bf16 MFU | 125694 tok/s step 2331/19560 | loss 3.848393 (+1.47z)| norm 0.2901 (-0.53z)| lr 5.89e-04 | 4162.24 ms | 32.4% bf16 MFU | 125708 tok/s step 2332/19560 | loss 3.790867 (-0.18z)| norm 0.2820 (-0.82z)| lr 5.89e-04 | 4158.67 ms | 32.5% bf16 MFU | 125726 tok/s step 2333/19560 | loss 3.812850 (+0.45z)| norm 0.2907 (-0.48z)| lr 5.89e-04 | 4167.81 ms | 32.4% bf16 MFU | 125729 tok/s step 2334/19560 | loss 3.780983 (-0.45z)| norm 0.3245 (+0.84z)| lr 5.89e-04 | 4179.37 ms | 32.3% bf16 MFU | 125715 tok/s step 2335/19560 | loss 3.742201 (-1.54z)| norm 0.3152 
(+0.49z)| lr 5.89e-04 | 4158.95 ms | 32.5% bf16 MFU | 125733 tok/s step 2336/19560 | loss 3.785901 (-0.30z)| norm 0.3422 (+1.53z)| lr 5.89e-04 | 4202.17 ms | 32.1% bf16 MFU | 125684 tok/s step 2337/19560 | loss 3.743852 (-1.47z)| norm 0.3539 (+1.96z)| lr 5.89e-04 | 4161.64 ms | 32.4% bf16 MFU | 125699 tok/s step 2338/19560 | loss 3.764090 (-0.89z)| norm 0.3172 (+0.59z)| lr 5.89e-04 | 4161.93 ms | 32.4% bf16 MFU | 125713 tok/s step 2339/19560 | loss 3.811101 (+0.45z)| norm 0.3249 (+0.95z)| lr 5.89e-04 | 4172.09 ms | 32.4% bf16 MFU | 125710 tok/s step 2340/19560 | loss 3.673433 (-3.32z)| norm 0.3068 (+0.21z)| lr 5.89e-04 | 4154.16 ms | 32.5% bf16 MFU | 125735 tok/s step 2341/19560 | loss 3.753554 (-1.11z)| norm 0.3422 (+1.68z)| lr 5.89e-04 | 4162.05 ms | 32.4% bf16 MFU | 125747 tok/s step 2342/19560 | loss 3.748023 (-1.25z)| norm 0.3344 (+1.36z)| lr 5.89e-04 | 4508.02 ms | 30.0% bf16 MFU | 125275 tok/s step 2343/19560 | loss 3.784241 (-0.26z)| norm 0.2871 (-0.61z)| lr 5.89e-04 | 4160.13 ms | 32.5% bf16 MFU | 125312 tok/s step 2344/19560 | loss 3.898784 (+2.77z)| norm 0.2865 (-0.64z)| lr 5.89e-04 | 4217.73 ms | 32.0% bf16 MFU | 125262 tok/s step 2345/19560 | loss 3.814713 (+0.53z)| norm 0.2869 (-0.62z)| lr 5.89e-04 | 4158.32 ms | 32.5% bf16 MFU | 125303 tok/s step 2346/19560 | loss 3.726066 (-1.78z)| norm 0.2857 (-0.67z)| lr 5.89e-04 | 4189.55 ms | 32.2% bf16 MFU | 125295 tok/s step 2347/19560 | loss 3.762712 (-0.80z)| norm 0.3043 (+0.11z)| lr 5.89e-04 | 4358.21 ms | 31.0% bf16 MFU | 125045 tok/s step 2348/19560 | loss 3.788749 (-0.11z)| norm 0.3174 (+0.65z)| lr 5.89e-04 | 4150.10 ms | 32.5% bf16 MFU | 125109 tok/s step 2349/19560 | loss 3.819755 (+0.70z)| norm 0.2977 (-0.20z)| lr 5.89e-04 | 4166.79 ms | 32.4% bf16 MFU | 125145 tok/s step 2350/19560 | loss 3.761162 (-0.84z)| norm 0.2852 (-0.73z)| lr 5.89e-04 | 4164.29 ms | 32.4% bf16 MFU | 125183 tok/s step 2351/19560 | loss 3.823880 (+0.81z)| norm 0.2794 (-0.97z)| lr 5.89e-04 | 4161.17 ms | 32.4% bf16 MFU | 125224 
tok/s step 2352/19560 | loss 3.785263 (-0.20z)| norm 0.2808 (-0.90z)| lr 5.89e-04 | 4171.76 ms | 32.4% bf16 MFU | 125246 tok/s step 2353/19560 | loss 3.784281 (-0.23z)| norm 0.2734 (-1.20z)| lr 5.89e-04 | 4175.42 ms | 32.3% bf16 MFU | 125262 tok/s step 2354/19560 | loss 3.798655 (+0.15z)| norm 0.3115 (+0.41z)| lr 5.89e-04 | 4176.60 ms | 32.3% bf16 MFU | 125275 tok/s step 2355/19560 | loss 3.785623 (-0.19z)| norm 0.3559 (+2.24z)| lr 5.89e-04 | 4171.08 ms | 32.4% bf16 MFU | 125297 tok/s step 2356/19560 | loss 3.761051 (-0.84z)| norm 0.2922 (-0.40z)| lr 5.89e-04 | 4168.04 ms | 32.4% bf16 MFU | 125321 tok/s step 2357/19560 | loss 3.779246 (-0.35z)| norm 0.2819 (-0.83z)| lr 5.89e-04 | 4168.39 ms | 32.4% bf16 MFU | 125344 tok/s step 2358/19560 | loss 3.758196 (-0.91z)| norm 0.2709 (-1.27z)| lr 5.89e-04 | 4177.71 ms | 32.3% bf16 MFU | 125352 tok/s step 2359/19560 | loss 3.762730 (-0.79z)| norm 0.2681 (-1.39z)| lr 5.89e-04 | 4150.53 ms | 32.5% bf16 MFU | 125400 tok/s step 2360/19560 | loss 3.786999 (-0.14z)| norm 0.2771 (-1.03z)| lr 5.89e-04 | 4185.71 ms | 32.3% bf16 MFU | 125393 tok/s step 2361/19560 | loss 3.739388 (-1.38z)| norm 0.3116 (+0.39z)| lr 5.89e-04 | 4163.73 ms | 32.4% bf16 MFU | 125419 tok/s step 2362/19560 | loss 3.747529 (-1.15z)| norm 0.2889 (-0.56z)| lr 5.89e-04 | 4163.16 ms | 32.4% bf16 MFU | 125445 tok/s step 2363/19560 | loss 3.832663 (+1.07z)| norm 0.2899 (-0.51z)| lr 5.89e-04 | 4155.66 ms | 32.5% bf16 MFU | 125481 tok/s step 2364/19560 | loss 3.758629 (-0.85z)| norm 0.3143 (+0.51z)| lr 5.89e-04 | 4183.20 ms | 32.3% bf16 MFU | 125473 tok/s step 2365/19560 | loss 3.731577 (-1.53z)| norm 0.2803 (-0.91z)| lr 5.89e-04 | 4241.50 ms | 31.8% bf16 MFU | 125380 tok/s step 2366/19560 | loss 3.770467 (-0.52z)| norm 0.2839 (-0.75z)| lr 5.89e-04 | 4195.56 ms | 32.2% bf16 MFU | 125359 tok/s step 2367/19560 | loss 3.797311 (+0.17z)| norm 0.2809 (-0.86z)| lr 5.89e-04 | 4152.52 ms | 32.5% bf16 MFU | 125404 tok/s step 2368/19560 | loss 3.754236 (-0.93z)| norm 0.2799 
(-0.90z)| lr 5.89e-04 | 4153.66 ms | 32.5% bf16 MFU | 125445 tok/s step 2369/19560 | loss 3.725885 (-1.63z)| norm 0.2707 (-1.26z)| lr 5.88e-04 | 4157.98 ms | 32.5% bf16 MFU | 125477 tok/s step 2370/19560 | loss 3.785188 (-0.10z)| norm 0.2779 (-0.95z)| lr 5.88e-04 | 4240.84 ms | 31.8% bf16 MFU | 125385 tok/s step 2371/19560 | loss 3.762949 (-0.67z)| norm 0.2964 (-0.19z)| lr 5.88e-04 | 4223.50 ms | 32.0% bf16 MFU | 125322 tok/s step 2372/19560 | loss 3.745398 (-1.14z)| norm 0.2526 (-1.95z)| lr 5.88e-04 | 4229.43 ms | 31.9% bf16 MFU | 125254 tok/s step 2373/19560 | loss 3.786098 (-0.08z)| norm 0.2712 (-1.18z)| lr 5.88e-04 | 4166.79 ms | 32.4% bf16 MFU | 125283 tok/s step 2374/19560 | loss 3.761192 (-0.72z)| norm 0.2671 (-1.33z)| lr 5.88e-04 | 4281.23 ms | 31.5% bf16 MFU | 125142 tok/s step 2375/19560 | loss 3.870653 (+2.10z)| norm 0.3098 (+0.39z)| lr 5.88e-04 | 4159.71 ms | 32.5% bf16 MFU | 125187 tok/s step 2376/19560 | loss 3.738028 (-1.30z)| norm 0.3255 (+1.00z)| lr 5.88e-04 | 4318.77 ms | 31.3% bf16 MFU | 124997 tok/s step 2377/19560 | loss 3.781843 (-0.17z)| norm 0.3269 (+1.05z)| lr 5.88e-04 | 4157.86 ms | 32.5% bf16 MFU | 125052 tok/s step 2378/19560 | loss 3.736204 (-1.32z)| norm 0.3238 (+0.92z)| lr 5.88e-04 | 4152.10 ms | 32.5% bf16 MFU | 125113 tok/s step 2379/19560 | loss 3.779176 (-0.22z)| norm 0.2705 (-1.19z)| lr 5.88e-04 | 4162.51 ms | 32.4% bf16 MFU | 125155 tok/s step 2380/19560 | loss 3.795963 (+0.22z)| norm 0.2797 (-0.81z)| lr 5.88e-04 | 4169.11 ms | 32.4% bf16 MFU | 125185 tok/s step 2381/19560 | loss 3.792587 (+0.15z)| norm 0.2894 (-0.41z)| lr 5.88e-04 | 4159.20 ms | 32.5% bf16 MFU | 125229 tok/s step 2382/19560 | loss 3.768316 (-0.49z)| norm 0.2561 (-1.76z)| lr 5.88e-04 | 4152.31 ms | 32.5% bf16 MFU | 125280 tok/s step 2383/19560 | loss 3.732563 (-1.42z)| norm 0.2825 (-0.66z)| lr 5.88e-04 | 4155.43 ms | 32.5% bf16 MFU | 125325 tok/s step 2384/19560 | loss 3.711522 (-1.94z)| norm 0.3125 (+0.63z)| lr 5.88e-04 | 4185.12 ms | 32.3% bf16 MFU | 125322 
tok/s step 2385/19560 | loss 3.733853 (-1.35z)| norm 0.3170 (+0.83z)| lr 5.88e-04 | 4174.99 ms | 32.3% bf16 MFU | 125335 tok/s step 2386/19560 | loss 3.786518 (+0.01z)| norm 0.3029 (+0.24z)| lr 5.88e-04 | 4231.73 ms | 31.9% bf16 MFU | 125263 tok/s step 2387/19560 | loss 3.795521 (+0.28z)| norm 0.3232 (+1.14z)| lr 5.88e-04 | 4157.90 ms | 32.5% bf16 MFU | 125305 tok/s step 2388/19560 | loss 3.757138 (-0.74z)| norm 0.3180 (+0.90z)| lr 5.88e-04 | 4152.42 ms | 32.5% bf16 MFU | 125353 tok/s step 2389/19560 | loss 3.766619 (-0.50z)| norm 0.3400 (+1.84z)| lr 5.88e-04 | 4217.53 ms | 32.0% bf16 MFU | 125300 tok/s step 2390/19560 | loss 3.821172 (+0.98z)| norm 0.3133 (+0.67z)| lr 5.88e-04 | 4158.23 ms | 32.5% bf16 MFU | 125340 tok/s step 2391/19560 | loss 3.750800 (-0.92z)| norm 0.3000 (+0.10z)| lr 5.88e-04 | 4158.21 ms | 32.5% bf16 MFU | 125377 tok/s step 2392/19560 | loss 3.765438 (-0.51z)| norm 0.3125 (+0.63z)| lr 5.88e-04 | 4169.41 ms | 32.4% bf16 MFU | 125395 tok/s step 2393/19560 | loss 3.750750 (-0.90z)| norm 0.3061 (+0.35z)| lr 5.88e-04 | 4274.26 ms | 31.6% bf16 MFU | 125259 tok/s step 2394/19560 | loss 3.775150 (-0.23z)| norm 0.2661 (-1.37z)| lr 5.88e-04 | 4235.12 ms | 31.9% bf16 MFU | 125186 tok/s step 2395/19560 | loss 3.814198 (+0.83z)| norm 0.2966 (-0.07z)| lr 5.88e-04 | 4273.92 ms | 31.6% bf16 MFU | 125060 tok/s step 2396/19560 | loss 3.832321 (+1.30z)| norm 0.2912 (-0.31z)| lr 5.88e-04 | 4173.56 ms | 32.4% bf16 MFU | 125088 tok/s step 2397/19560 | loss 3.834146 (+1.35z)| norm 0.2919 (-0.27z)| lr 5.88e-04 | 4157.98 ms | 32.5% bf16 MFU | 125138 tok/s step 2398/19560 | loss 3.807935 (+0.65z)| norm 0.3162 (+0.78z)| lr 5.88e-04 | 4163.12 ms | 32.4% bf16 MFU | 125178 tok/s step 2399/19560 | loss 3.777694 (-0.17z)| norm 0.3447 (+1.97z)| lr 5.88e-04 | 4168.46 ms | 32.4% bf16 MFU | 125208 tok/s step 2400/19560 | loss 3.771201 (-0.35z)| norm 0.3449 (+1.94z)| lr 5.88e-04 | 7547.24 ms | 17.9% bf16 MFU | 122421 tok/s step 2401/19560 | loss 3.806616 (+0.61z)| norm 0.3121 
(+0.55z)| lr 5.88e-04 | 4151.58 ms | 32.5% bf16 MFU | 122614 tok/s step 2402/19560 | loss 3.803791 (+0.53z)| norm 0.2946 (-0.19z)| lr 5.88e-04 | 4158.93 ms | 32.5% bf16 MFU | 122787 tok/s step 2403/19560 | loss 3.844665 (+1.64z)| norm 0.2928 (-0.27z)| lr 5.88e-04 | 4152.11 ms | 32.5% bf16 MFU | 122961 tok/s step 2404/19560 | loss 3.847775 (+1.70z)| norm 0.3000 (+0.03z)| lr 5.88e-04 | 4147.96 ms | 32.6% bf16 MFU | 123133 tok/s step 2405/19560 | loss 3.828688 (+1.17z)| norm 0.2849 (-0.60z)| lr 5.88e-04 | 4140.82 ms | 32.6% bf16 MFU | 123307 tok/s step 2406/19560 | loss 3.716203 (-1.85z)| norm 0.2651 (-1.44z)| lr 5.88e-04 | 4157.20 ms | 32.5% bf16 MFU | 123447 tok/s step 2407/19560 | loss 3.772524 (-0.33z)| norm 0.2617 (-1.57z)| lr 5.88e-04 | 4154.92 ms | 32.5% bf16 MFU | 123584 tok/s step 2408/19560 | loss 3.744818 (-1.06z)| norm 0.2518 (-1.95z)| lr 5.88e-04 | 4154.09 ms | 32.5% bf16 MFU | 123715 tok/s step 2409/19560 | loss 3.810502 (+0.71z)| norm 0.2735 (-1.03z)| lr 5.88e-04 | 4146.84 ms | 32.6% bf16 MFU | 123851 tok/s step 2410/19560 | loss 3.756102 (-0.75z)| norm 0.2594 (-1.58z)| lr 5.88e-04 | 4161.96 ms | 32.4% bf16 MFU | 123957 tok/s step 2411/19560 | loss 3.822209 (+1.02z)| norm 0.2349 (-2.52z)| lr 5.88e-04 | 4159.57 ms | 32.5% bf16 MFU | 124061 tok/s step 2412/19560 | loss 3.783305 (-0.03z)| norm 0.2583 (-1.55z)| lr 5.88e-04 | 4148.07 ms | 32.5% bf16 MFU | 124178 tok/s step 2413/19560 | loss 3.807946 (+0.64z)| norm 0.2858 (-0.44z)| lr 5.88e-04 | 4146.38 ms | 32.6% bf16 MFU | 124291 tok/s step 2414/19560 | loss 3.776537 (-0.22z)| norm 0.3576 (+2.39z)| lr 5.88e-04 | 4149.58 ms | 32.5% bf16 MFU | 124394 tok/s step 2415/19560 | loss 3.764639 (-0.53z)| norm 0.3227 (+1.00z)| lr 5.88e-04 | 4147.65 ms | 32.6% bf16 MFU | 124495 tok/s step 2416/19560 | loss 3.738760 (-1.22z)| norm 0.3353 (+1.47z)| lr 5.88e-04 | 4154.21 ms | 32.5% bf16 MFU | 124580 tok/s step 2417/19560 | loss 3.795812 (+0.34z)| norm 0.3730 (+2.84z)| lr 5.88e-04 | 4155.32 ms | 32.5% bf16 MFU | 124660 
tok/s step 2418/19560 | loss 3.783272 (-0.01z)| norm 0.3602 (+2.30z)| lr 5.88e-04 | 4145.22 ms | 32.6% bf16 MFU | 124751 tok/s step 2419/19560 | loss 3.774156 (-0.25z)| norm 0.3454 (+1.73z)| lr 5.88e-04 | 4166.89 ms | 32.4% bf16 MFU | 124804 tok/s step 2420/19560 | loss 3.803159 (+0.53z)| norm 0.3205 (+0.79z)| lr 5.88e-04 | 4183.45 ms | 32.3% bf16 MFU | 124830 tok/s step 2421/19560 | loss 3.807349 (+0.63z)| norm 0.2912 (-0.29z)| lr 5.88e-04 | 4441.48 ms | 30.4% bf16 MFU | 124491 tok/s step 2422/19560 | loss 3.756263 (-0.75z)| norm 0.3095 (+0.38z)| lr 5.88e-04 | 4148.34 ms | 32.5% bf16 MFU | 124586 tok/s step 2423/19560 | loss 3.793063 (+0.28z)| norm 0.3165 (+0.63z)| lr 5.88e-04 | 4154.69 ms | 32.5% bf16 MFU | 124666 tok/s step 2424/19560 | loss 3.797352 (+0.40z)| norm 0.3103 (+0.41z)| lr 5.88e-04 | 4147.97 ms | 32.6% bf16 MFU | 124753 tok/s step 2425/19560 | loss 3.872832 (+2.45z)| norm 0.2962 (-0.11z)| lr 5.88e-04 | 4156.72 ms | 32.5% bf16 MFU | 124822 tok/s step 2426/19560 | loss 3.806631 (+0.62z)| norm 0.3434 (+1.62z)| lr 5.88e-04 | 4148.06 ms | 32.5% bf16 MFU | 124900 tok/s step 2427/19560 | loss 3.775798 (-0.23z)| norm 0.3458 (+1.67z)| lr 5.88e-04 | 4145.83 ms | 32.6% bf16 MFU | 124978 tok/s step 2428/19560 | loss 3.805420 (+0.58z)| norm 0.3000 (+0.01z)| lr 5.88e-04 | 4169.40 ms | 32.4% bf16 MFU | 125017 tok/s step 2429/19560 | loss 3.827290 (+1.16z)| norm 0.3347 (+1.25z)| lr 5.88e-04 | 4150.92 ms | 32.5% bf16 MFU | 125081 tok/s step 2430/19560 | loss 3.754469 (-0.83z)| norm 0.3161 (+0.57z)| lr 5.88e-04 | 4152.28 ms | 32.5% bf16 MFU | 125140 tok/s step 2431/19560 | loss 3.753775 (-0.84z)| norm 0.3414 (+1.46z)| lr 5.88e-04 | 4149.76 ms | 32.5% bf16 MFU | 125200 tok/s step 2432/19560 | loss 3.742276 (-1.14z)| norm 0.3170 (+0.58z)| lr 5.88e-04 | 4153.86 ms | 32.5% bf16 MFU | 125251 tok/s step 2433/19560 | loss 3.777026 (-0.19z)| norm 0.2877 (-0.46z)| lr 5.88e-04 | 4152.65 ms | 32.5% bf16 MFU | 125301 tok/s step 2434/19560 | loss 3.769999 (-0.37z)| norm 0.2821 
(-0.67z)| lr 5.88e-04 | 4144.82 ms | 32.6% bf16 MFU | 125361 tok/s step 2435/19560 | loss 3.797754 (+0.40z)| norm 0.2969 (-0.14z)| lr 5.88e-04 | 4162.25 ms | 32.4% bf16 MFU | 125391 tok/s step 2436/19560 | loss 3.812005 (+0.79z)| norm 0.2962 (-0.16z)| lr 5.88e-04 | 4160.26 ms | 32.5% bf16 MFU | 125423 tok/s step 2437/19560 | loss 3.782564 (-0.02z)| norm 0.3366 (+1.27z)| lr 5.88e-04 | 4156.32 ms | 32.5% bf16 MFU | 125459 tok/s step 2438/19560 | loss 3.770438 (-0.35z)| norm 0.2933 (-0.28z)| lr 5.88e-04 | 4153.76 ms | 32.5% bf16 MFU | 125497 tok/s step 2439/19560 | loss 3.772655 (-0.29z)| norm 0.2906 (-0.38z)| lr 5.88e-04 | 4162.03 ms | 32.4% bf16 MFU | 125520 tok/s step 2440/19560 | loss 3.731129 (-1.44z)| norm 0.2900 (-0.40z)| lr 5.88e-04 | 4152.17 ms | 32.5% bf16 MFU | 125558 tok/s step 2441/19560 | loss 3.808369 (+0.74z)| norm 0.3318 (+1.09z)| lr 5.87e-04 | 4162.02 ms | 32.4% bf16 MFU | 125578 tok/s step 2442/19560 | loss 3.785342 (+0.09z)| norm 0.3152 (+0.49z)| lr 5.87e-04 | 4162.38 ms | 32.4% bf16 MFU | 125597 tok/s step 2443/19560 | loss 3.760212 (-0.61z)| norm 0.2799 (-0.77z)| lr 5.87e-04 | 4160.01 ms | 32.5% bf16 MFU | 125619 tok/s step 2444/19560 | loss 3.804090 (+0.64z)| norm 0.3051 (+0.12z)| lr 5.87e-04 | 4152.99 ms | 32.5% bf16 MFU | 125650 tok/s step 2445/19560 | loss 3.772203 (-0.26z)| norm 0.3002 (-0.05z)| lr 5.87e-04 | 4154.09 ms | 32.5% bf16 MFU | 125678 tok/s step 2446/19560 | loss 3.839270 (+1.64z)| norm 0.3029 (+0.04z)| lr 5.87e-04 | 4151.29 ms | 32.5% bf16 MFU | 125709 tok/s step 2447/19560 | loss 3.749500 (-0.90z)| norm 0.2984 (-0.12z)| lr 5.87e-04 | 4149.95 ms | 32.5% bf16 MFU | 125740 tok/s step 2448/19560 | loss 3.830410 (+1.37z)| norm 0.3404 (+1.36z)| lr 5.87e-04 | 4147.59 ms | 32.6% bf16 MFU | 125774 tok/s step 2449/19560 | loss 3.836019 (+1.53z)| norm 0.3061 (+0.13z)| lr 5.87e-04 | 4146.24 ms | 32.6% bf16 MFU | 125807 tok/s step 2450/19560 | loss 3.796076 (+0.40z)| norm 0.3194 (+0.61z)| lr 5.87e-04 | 4152.15 ms | 32.5% bf16 MFU | 125831 
tok/s step 2451/19560 | loss 3.757674 (-0.67z)| norm 0.3606 (+2.12z)| lr 5.87e-04 | 4160.36 ms | 32.5% bf16 MFU | 125840 tok/s step 2452/19560 | loss 3.792943 (+0.32z)| norm 0.3651 (+2.30z)| lr 5.87e-04 | 4163.85 ms | 32.4% bf16 MFU | 125844 tok/s step 2453/19560 | loss 3.793671 (+0.33z)| norm 0.3608 (+2.09z)| lr 5.87e-04 | 4158.87 ms | 32.5% bf16 MFU | 125855 tok/s step 2454/19560 | loss 3.756634 (-0.72z)| norm 0.3311 (+1.00z)| lr 5.87e-04 | 4151.14 ms | 32.5% bf16 MFU | 125877 tok/s step 2455/19560 | loss 3.769232 (-0.36z)| norm 0.3014 (-0.07z)| lr 5.87e-04 | 4149.97 ms | 32.5% bf16 MFU | 125900 tok/s step 2456/19560 | loss 3.804572 (+0.64z)| norm 0.3104 (+0.24z)| lr 5.87e-04 | 4150.91 ms | 32.5% bf16 MFU | 125920 tok/s step 2457/19560 | loss 3.722924 (-1.66z)| norm 0.2813 (-0.82z)| lr 5.87e-04 | 4153.71 ms | 32.5% bf16 MFU | 125935 tok/s step 2458/19560 | loss 3.732591 (-1.37z)| norm 0.3016 (-0.08z)| lr 5.87e-04 | 4140.65 ms | 32.6% bf16 MFU | 125970 tok/s step 2459/19560 | loss 3.705143 (-2.11z)| norm 0.2717 (-1.16z)| lr 5.87e-04 | 4155.81 ms | 32.5% bf16 MFU | 125979 tok/s step 2460/19560 | loss 3.807949 (+0.78z)| norm 0.2884 (-0.56z)| lr 5.87e-04 | 4156.40 ms | 32.5% bf16 MFU | 125987 tok/s step 2461/19560 | loss 3.715025 (-1.79z)| norm 0.2786 (-0.91z)| lr 5.87e-04 | 4149.97 ms | 32.5% bf16 MFU | 126004 tok/s step 2462/19560 | loss 3.762683 (-0.46z)| norm 0.2958 (-0.28z)| lr 5.87e-04 | 4150.01 ms | 32.5% bf16 MFU | 126021 tok/s step 2463/19560 | loss 3.759280 (-0.56z)| norm 0.2820 (-0.77z)| lr 5.87e-04 | 4145.88 ms | 32.6% bf16 MFU | 126043 tok/s step 2464/19560 | loss 3.812566 (+0.92z)| norm 0.2863 (-0.60z)| lr 5.87e-04 | 4140.96 ms | 32.6% bf16 MFU | 126071 tok/s step 2465/19560 | loss 3.765520 (-0.40z)| norm 0.3021 (-0.01z)| lr 5.87e-04 | 4146.66 ms | 32.6% bf16 MFU | 126090 tok/s step 2466/19560 | loss 3.763994 (-0.44z)| norm 0.2972 (-0.18z)| lr 5.87e-04 | 4145.64 ms | 32.6% bf16 MFU | 126108 tok/s step 2467/19560 | loss 3.873159 (+2.54z)| norm 0.3145 
(+0.46z)| lr 5.87e-04 | 4161.57 ms | 32.4% bf16 MFU | 126102 tok/s step 2468/19560 | loss 3.807089 (+0.73z)| norm 0.3019 (-0.01z)| lr 5.87e-04 | 4159.14 ms | 32.5% bf16 MFU | 126100 tok/s step 2469/19560 | loss 3.799603 (+0.51z)| norm 0.2809 (-0.77z)| lr 5.87e-04 | 4161.93 ms | 32.4% bf16 MFU | 126094 tok/s step 2470/19560 | loss 3.747480 (-0.96z)| norm 0.2853 (-0.60z)| lr 5.87e-04 | 4143.85 ms | 32.6% bf16 MFU | 126115 tok/s step 2471/19560 | loss 3.871449 (+2.46z)| norm 0.3164 (+0.56z)| lr 5.87e-04 | 4451.08 ms | 30.3% bf16 MFU | 125699 tok/s step 2472/19560 | loss 3.797242 (+0.45z)| norm 0.3159 (+0.53z)| lr 5.87e-04 | 4152.66 ms | 32.5% bf16 MFU | 125726 tok/s step 2473/19560 | loss 3.805020 (+0.68z)| norm 0.2942 (-0.28z)| lr 5.87e-04 | 4153.10 ms | 32.5% bf16 MFU | 125752 tok/s step 2474/19560 | loss 3.767687 (-0.41z)| norm 0.2743 (-1.02z)| lr 5.87e-04 | 4151.13 ms | 32.5% bf16 MFU | 125779 tok/s step 2475/19560 | loss 3.797975 (+0.46z)| norm 0.2916 (-0.37z)| lr 5.87e-04 | 4149.40 ms | 32.5% bf16 MFU | 125808 tok/s step 2476/19560 | loss 3.781383 (-0.02z)| norm 0.2431 (-2.13z)| lr 5.87e-04 | 4158.49 ms | 32.5% bf16 MFU | 125822 tok/s step 2477/19560 | loss 3.763824 (-0.52z)| norm 0.2450 (-2.01z)| lr 5.87e-04 | 4154.28 ms | 32.5% bf16 MFU | 125841 tok/s step 2478/19560 | loss 3.730308 (-1.48z)| norm 0.2745 (-0.94z)| lr 5.87e-04 | 4155.68 ms | 32.5% bf16 MFU | 125857 tok/s step 2479/19560 | loss 3.777827 (-0.09z)| norm 0.2811 (-0.70z)| lr 5.87e-04 | 4156.04 ms | 32.5% bf16 MFU | 125871 tok/s step 2480/19560 | loss 3.804615 (+0.68z)| norm 0.2697 (-1.11z)| lr 5.87e-04 | 4153.20 ms | 32.5% bf16 MFU | 125890 tok/s step 2481/19560 | loss 3.753533 (-0.79z)| norm 0.2742 (-0.94z)| lr 5.87e-04 | 4174.20 ms | 32.3% bf16 MFU | 125875 tok/s step 2482/19560 | loss 3.796103 (+0.44z)| norm 0.3198 (+0.70z)| lr 5.87e-04 | 4147.34 ms | 32.6% bf16 MFU | 125902 tok/s step 2483/19560 | loss 3.802091 (+0.61z)| norm 0.3205 (+0.74z)| lr 5.87e-04 | 4161.56 ms | 32.4% bf16 MFU | 125906 
tok/s step 2484/19560 | loss 3.820403 (+1.12z)| norm 0.3053 (+0.19z)| lr 5.87e-04 | 4163.42 ms | 32.4% bf16 MFU | 125907 tok/s step 2485/19560 | loss 3.765054 (-0.47z)| norm 0.2842 (-0.59z)| lr 5.87e-04 | 4149.39 ms | 32.5% bf16 MFU | 125930 tok/s step 2486/19560 | loss 3.743718 (-1.08z)| norm 0.3023 (+0.07z)| lr 5.87e-04 | 4160.74 ms | 32.5% bf16 MFU | 125934 tok/s step 2487/19560 | loss 3.727987 (-1.51z)| norm 0.3271 (+0.96z)| lr 5.87e-04 | 4160.63 ms | 32.5% bf16 MFU | 125938 tok/s step 2488/19560 | loss 3.813332 (+0.92z)| norm 0.3183 (+0.63z)| lr 5.87e-04 | 4147.31 ms | 32.6% bf16 MFU | 125962 tok/s step 2489/19560 | loss 3.755209 (-0.74z)| norm 0.2969 (-0.16z)| lr 5.87e-04 | 4206.83 ms | 32.1% bf16 MFU | 125895 tok/s step 2490/19560 | loss 3.810985 (+0.83z)| norm 0.3226 (+0.78z)| lr 5.87e-04 | 4147.32 ms | 32.6% bf16 MFU | 125921 tok/s step 2491/19560 | loss 3.753160 (-0.80z)| norm 0.2716 (-1.09z)| lr 5.87e-04 | 4165.41 ms | 32.4% bf16 MFU | 125918 tok/s step 2492/19560 | loss 3.759172 (-0.63z)| norm 0.2995 (-0.06z)| lr 5.87e-04 | 4173.06 ms | 32.4% bf16 MFU | 125904 tok/s step 2493/19560 | loss 3.823452 (+1.20z)| norm 0.3832 (+2.89z)| lr 5.87e-04 | 4154.38 ms | 32.5% bf16 MFU | 125919 tok/s step 2494/19560 | loss 3.790632 (+0.25z)| norm 0.3563 (+1.89z)| lr 5.87e-04 | 4156.10 ms | 32.5% bf16 MFU | 125930 tok/s step 2495/19560 | loss 3.806661 (+0.71z)| norm 0.3611 (+2.01z)| lr 5.87e-04 | 4144.65 ms | 32.6% bf16 MFU | 125959 tok/s step 2496/19560 | loss 3.850909 (+1.93z)| norm 0.3289 (+0.88z)| lr 5.87e-04 | 4148.20 ms | 32.5% bf16 MFU | 125980 tok/s step 2497/19560 | loss 3.748372 (-0.99z)| norm 0.2953 (-0.29z)| lr 5.87e-04 | 4154.33 ms | 32.5% bf16 MFU | 125991 tok/s step 2498/19560 | loss 3.721453 (-1.73z)| norm 0.2948 (-0.32z)| lr 5.87e-04 | 4146.24 ms | 32.6% bf16 MFU | 126014 tok/s step 2499/19560 | loss 3.841586 (+1.64z)| norm 0.3077 (+0.13z)| lr 5.87e-04 | 4158.43 ms | 32.5% bf16 MFU | 126018 tok/s step 2500/19560 | loss 3.841106 (+1.59z)| norm 0.2800 
(-0.85z)| lr 5.87e-04 | 4149.03 ms | 32.5% bf16 MFU | 126035 tok/s val loss 3.768145 evaluating HellaSwag: 
1250/1256 HellaSwag: 2644/10042 = 0.263294 step 2501/19560 | loss 3.796995 (+0.36z)| norm 0.3107 (+0.22z)| lr 5.87e-04 | 4157.98 ms | 32.5% bf16 MFU | 126038 tok/s step 2502/19560 | loss 3.780154 (-0.11z)| norm 0.3206 (+0.56z)| lr 5.87e-04 | 4147.76 ms | 32.6% bf16 MFU | 126056 tok/s step 2503/19560 | loss 3.761770 (-0.61z)| norm 0.3443 (+1.38z)| lr 5.87e-04 | 4314.03 ms | 31.3% bf16 MFU | 125830 tok/s step 2504/19560 | loss 3.754834 (-0.82z)| norm 0.3189 (+0.49z)| lr 5.87e-04 | 4211.25 ms | 32.1% bf16 MFU | 125763 tok/s step 2505/19560 | loss 3.826024 (+1.20z)| norm 0.2930 (-0.42z)| lr 5.87e-04 | 4254.71 ms | 31.7% bf16 MFU | 125636 tok/s step 2506/19560 | loss 3.762351 (-0.62z)| norm 0.2788 (-0.91z)| lr 5.87e-04 | 4157.39 ms | 32.5% bf16 MFU | 125660 tok/s step 2507/19560 | loss 3.834777 (+1.43z)| norm 0.2789 (-0.91z)| lr 5.87e-04 | 4231.47 ms | 31.9% bf16 MFU | 125572 tok/s step 2508/19560 | loss 3.842393 (+1.62z)| norm 0.2944 (-0.37z)| lr 5.87e-04 | 4192.33 ms | 32.2% bf16 MFU | 125546 tok/s step 2509/19560 | loss 3.731391 (-1.47z)| norm 0.2766 (-0.99z)| lr 5.86e-04 | 4158.35 ms | 32.5% bf16 MFU | 125573 tok/s step 2510/19560 | loss 3.748350 (-0.99z)| norm 0.2617 (-1.53z)| lr 5.86e-04 | 4166.78 ms | 32.4% bf16 MFU | 125586 tok/s step 2511/19560 | loss 3.739204 (-1.25z)| norm 0.2643 (-1.42z)| lr 5.86e-04 | 4165.46 ms | 32.4% bf16 MFU | 125600 tok/s step 2512/19560 | loss 3.755510 (-0.82z)| norm 0.2866 (-0.63z)| lr 5.86e-04 | 4155.42 ms | 32.5% bf16 MFU | 125628 tok/s step 2513/19560 | loss 3.725683 (-1.65z)| norm 0.2384 (-2.26z)| lr 5.86e-04 | 4159.70 ms | 32.5% bf16 MFU | 125649 tok/s step 2514/19560 | loss 3.729830 (-1.51z)| norm 0.2556 (-1.64z)| lr 5.86e-04 | 4146.05 ms | 32.6% bf16 MFU | 125689 tok/s step 2515/19560 | loss 3.775392 (-0.24z)| norm 0.2567 (-1.57z)| lr 5.86e-04 | 4159.98 ms | 32.5% bf16 MFU | 125706 tok/s step 2516/19560 | loss 3.759293 (-0.68z)| norm 0.2749 (-0.93z)| lr 5.86e-04 | 4152.93 ms | 32.5% bf16 MFU | 125733 tok/s step 2517/19560 | 
loss 3.808752 (+0.68z)| norm 0.2768 (-0.86z)| lr 5.86e-04 | 4151.33 ms | 32.5% bf16 MFU | 125761 tok/s step 2518/19560 | loss 3.760805 (-0.64z)| norm 0.2742 (-0.93z)| lr 5.86e-04 | 4157.51 ms | 32.5% bf16 MFU | 125778 tok/s step 2519/19560 | loss 3.793454 (+0.26z)| norm 0.3332 (+1.06z)| lr 5.86e-04 | 4149.79 ms | 32.5% bf16 MFU | 125807 tok/s step 2520/19560 | loss 3.733294 (-1.40z)| norm 0.3220 (+0.67z)| lr 5.86e-04 | 4149.57 ms | 32.5% bf16 MFU | 125834 tok/s step 2521/19560 | loss 3.747337 (-1.01z)| norm 0.2635 (-1.28z)| lr 5.86e-04 | 4230.69 ms | 31.9% bf16 MFU | 125738 tok/s step 2522/19560 | loss 3.710277 (-2.00z)| norm 0.2669 (-1.17z)| lr 5.86e-04 | 4149.61 ms | 32.5% bf16 MFU | 125769 tok/s step 2523/19560 | loss 3.740621 (-1.15z)| norm 0.2983 (-0.11z)| lr 5.86e-04 | 4166.18 ms | 32.4% bf16 MFU | 125772 tok/s step 2524/19560 | loss 3.741303 (-1.11z)| norm 0.3265 (+0.82z)| lr 5.86e-04 | 4151.25 ms | 32.5% bf16 MFU | 125799 tok/s step 2525/19560 | loss 3.784306 (+0.08z)| norm 0.3197 (+0.59z)| lr 5.86e-04 | 4148.96 ms | 32.5% bf16 MFU | 125827 tok/s step 2526/19560 | loss 3.699690 (-2.20z)| norm 0.2968 (-0.17z)| lr 5.86e-04 | 4160.82 ms | 32.4% bf16 MFU | 125836 tok/s step 2527/19560 | loss 3.810040 (+0.79z)| norm 0.2930 (-0.29z)| lr 5.86e-04 | 4151.55 ms | 32.5% bf16 MFU | 125858 tok/s step 2528/19560 | loss 3.744117 (-0.99z)| norm 0.3135 (+0.41z)| lr 5.86e-04 | 4153.34 ms | 32.5% bf16 MFU | 125877 tok/s step 2529/19560 | loss 3.708193 (-1.91z)| norm 0.3217 (+0.69z)| lr 5.86e-04 | 4153.96 ms | 32.5% bf16 MFU | 125894 tok/s step 2530/19560 | loss 3.769279 (-0.28z)| norm 0.3461 (+1.49z)| lr 5.86e-04 | 4165.67 ms | 32.4% bf16 MFU | 125892 tok/s step 2531/19560 | loss 3.779615 (+0.01z)| norm 0.3130 (+0.37z)| lr 5.86e-04 | 4157.15 ms | 32.5% bf16 MFU | 125904 tok/s step 2532/19560 | loss 3.780107 (+0.04z)| norm 0.2979 (-0.14z)| lr 5.86e-04 | 4147.57 ms | 32.6% bf16 MFU | 125929 tok/s step 2533/19560 | loss 3.832903 (+1.48z)| norm 0.3453 (+1.43z)| lr 5.86e-04 | 
4144.53 ms | 32.6% bf16 MFU | 125957 tok/s step 2534/19560 | loss 3.768870 (-0.28z)| norm 0.3196 (+0.56z)| lr 5.86e-04 | 4158.66 ms | 32.5% bf16 MFU | 125963 tok/s step 2535/19560 | loss 3.790684 (+0.31z)| norm 0.2768 (-0.89z)| lr 5.86e-04 | 4151.45 ms | 32.5% bf16 MFU | 125979 tok/s step 2536/19560 | loss 3.835346 (+1.52z)| norm 0.2769 (-0.90z)| lr 5.86e-04 | 4213.07 ms | 32.0% bf16 MFU | 125903 tok/s step 2537/19560 | loss 3.801274 (+0.59z)| norm 0.3116 (+0.28z)| lr 5.86e-04 | 4150.63 ms | 32.5% bf16 MFU | 125923 tok/s step 2538/19560 | loss 3.759964 (-0.55z)| norm 0.3174 (+0.46z)| lr 5.86e-04 | 4152.74 ms | 32.5% bf16 MFU | 125940 tok/s step 2539/19560 | loss 3.725907 (-1.46z)| norm 0.2917 (-0.45z)| lr 5.86e-04 | 4185.47 ms | 32.3% bf16 MFU | 125906 tok/s step 2540/19560 | loss 3.801493 (+0.61z)| norm 0.2901 (-0.52z)| lr 5.86e-04 | 4154.34 ms | 32.5% bf16 MFU | 125921 tok/s step 2541/19560 | loss 3.736515 (-1.15z)| norm 0.2727 (-1.13z)| lr 5.86e-04 | 4152.49 ms | 32.5% bf16 MFU | 125938 tok/s step 2542/19560 | loss 3.719219 (-1.60z)| norm 0.3504 (+1.63z)| lr 5.86e-04 | 4155.40 ms | 32.5% bf16 MFU | 125949 tok/s step 2543/19560 | loss 3.773944 (-0.12z)| norm 0.3326 (+0.99z)| lr 5.86e-04 | 4154.41 ms | 32.5% bf16 MFU | 125962 tok/s step 2544/19560 | loss 3.756202 (-0.61z)| norm 0.3037 (-0.02z)| lr 5.86e-04 | 4159.63 ms | 32.5% bf16 MFU | 125966 tok/s step 2545/19560 | loss 3.784210 (+0.16z)| norm 0.2986 (-0.19z)| lr 5.86e-04 | 4149.15 ms | 32.5% bf16 MFU | 125986 tok/s step 2546/19560 | loss 3.742074 (-0.98z)| norm 0.2997 (-0.13z)| lr 5.86e-04 | 4145.57 ms | 32.6% bf16 MFU | 126010 tok/s step 2547/19560 | loss 3.822895 (+1.19z)| norm 0.2965 (-0.24z)| lr 5.86e-04 | 4147.91 ms | 32.6% bf16 MFU | 126029 tok/s step 2548/19560 | loss 3.800064 (+0.58z)| norm 0.3002 (-0.09z)| lr 5.86e-04 | 4146.58 ms | 32.6% bf16 MFU | 126050 tok/s step 2549/19560 | loss 3.789092 (+0.29z)| norm 0.2958 (-0.26z)| lr 5.86e-04 | 4155.42 ms | 32.5% bf16 MFU | 126056 tok/s step 2550/19560 | 
loss 3.744349 (-0.91z)| norm 0.2785 (-0.90z)| lr 5.86e-04 | 4186.82 ms | 32.2% bf16 MFU | 126014 tok/s step 2551/19560 | loss 3.784589 (+0.17z)| norm 0.3090 (+0.24z)| lr 5.86e-04 | 4138.92 ms | 32.6% bf16 MFU | 126047 tok/s step 2552/19560 | loss 3.787467 (+0.25z)| norm 0.4083 (+3.72z)| lr 5.86e-04 | 4219.41 ms | 32.0% bf16 MFU | 125957 tok/s step 2553/19560 | loss 3.825733 (+1.32z)| norm 0.3385 (+1.23z)| lr 5.86e-04 | 4150.99 ms | 32.5% bf16 MFU | 125975 tok/s step 2554/19560 | loss 3.725457 (-1.41z)| norm 0.3543 (+1.78z)| lr 5.86e-04 | 4146.48 ms | 32.6% bf16 MFU | 125998 tok/s step 2555/19560 | loss 3.843966 (+1.79z)| norm 0.2696 (-1.18z)| lr 5.86e-04 | 4153.33 ms | 32.5% bf16 MFU | 126010 tok/s step 2556/19560 | loss 3.801118 (+0.63z)| norm 0.2818 (-0.74z)| lr 5.86e-04 | 4143.73 ms | 32.6% bf16 MFU | 126036 tok/s step 2557/19560 | loss 3.794719 (+0.47z)| norm 0.2758 (-0.94z)| lr 5.86e-04 | 4149.66 ms | 32.5% bf16 MFU | 126051 tok/s step 2558/19560 | loss 3.845999 (+1.82z)| norm 0.2674 (-1.21z)| lr 5.86e-04 | 4172.88 ms | 32.4% bf16 MFU | 126031 tok/s step 2559/19560 | loss 3.734958 (-1.15z)| norm 0.2709 (-1.07z)| lr 5.86e-04 | 4153.93 ms | 32.5% bf16 MFU | 126040 tok/s step 2560/19560 | loss 3.745439 (-0.87z)| norm 0.2713 (-1.04z)| lr 5.86e-04 | 4155.94 ms | 32.5% bf16 MFU | 126046 tok/s step 2561/19560 | loss 3.740975 (-0.98z)| norm 0.2598 (-1.43z)| lr 5.86e-04 | 4195.44 ms | 32.2% bf16 MFU | 125992 tok/s step 2562/19560 | loss 3.798945 (+0.56z)| norm 0.2732 (-0.96z)| lr 5.86e-04 | 4154.81 ms | 32.5% bf16 MFU | 126001 tok/s step 2563/19560 | loss 3.773080 (-0.12z)| norm 0.2885 (-0.43z)| lr 5.86e-04 | 4145.06 ms | 32.6% bf16 MFU | 126026 tok/s step 2564/19560 | loss 3.758907 (-0.49z)| norm 0.2961 (-0.16z)| lr 5.86e-04 | 4162.73 ms | 32.4% bf16 MFU | 126022 tok/s step 2565/19560 | loss 3.698387 (-2.06z)| norm 0.2892 (-0.39z)| lr 5.86e-04 | 4156.03 ms | 32.5% bf16 MFU | 126028 tok/s step 2566/19560 | loss 3.755541 (-0.55z)| norm 0.2694 (-1.07z)| lr 5.86e-04 | 
4155.68 ms | 32.5% bf16 MFU | 126035 tok/s step 2567/19560 | loss 3.743956 (-0.85z)| norm 0.2693 (-1.06z)| lr 5.86e-04 | 4193.56 ms | 32.2% bf16 MFU | 125984 tok/s step 2568/19560 | loss 3.752082 (-0.64z)| norm 0.2656 (-1.18z)| lr 5.86e-04 | 4150.27 ms | 32.5% bf16 MFU | 126001 tok/s step 2569/19560 | loss 3.753612 (-0.59z)| norm 0.2912 (-0.29z)| lr 5.86e-04 | 4155.15 ms | 32.5% bf16 MFU | 126010 tok/s step 2570/19560 | loss 3.698977 (-1.98z)| norm 0.2814 (-0.62z)| lr 5.86e-04 | 4222.41 ms | 32.0% bf16 MFU | 125918 tok/s step 2571/19560 | loss 3.769574 (-0.15z)| norm 0.2991 (-0.01z)| lr 5.86e-04 | 4198.23 ms | 32.2% bf16 MFU | 125866 tok/s step 2572/19560 | loss 3.819480 (+1.14z)| norm 0.3299 (+1.04z)| lr 5.86e-04 | 4216.49 ms | 32.0% bf16 MFU | 125790 tok/s step 2573/19560 | loss 3.772478 (-0.08z)| norm 0.3380 (+1.30z)| lr 5.86e-04 | 4147.88 ms | 32.6% bf16 MFU | 125821 tok/s step 2574/19560 | loss 3.725852 (-1.27z)| norm 0.3052 (+0.18z)| lr 5.86e-04 | 4148.14 ms | 32.5% bf16 MFU | 125849 tok/s step 2575/19560 | loss 3.723874 (-1.31z)| norm 0.3049 (+0.17z)| lr 5.86e-04 | 4143.02 ms | 32.6% bf16 MFU | 125884 tok/s step 2576/19560 | loss 3.756873 (-0.44z)| norm 0.3114 (+0.40z)| lr 5.85e-04 | 4150.19 ms | 32.5% bf16 MFU | 125906 tok/s step 2577/19560 | loss 3.751544 (-0.57z)| norm 0.3130 (+0.45z)| lr 5.85e-04 | 4156.80 ms | 32.5% bf16 MFU | 125917 tok/s step 2578/19560 | loss 3.767495 (-0.14z)| norm 0.3342 (+1.17z)| lr 5.85e-04 | 4154.02 ms | 32.5% bf16 MFU | 125932 tok/s step 2579/19560 | loss 3.718856 (-1.41z)| norm 0.3491 (+1.70z)| lr 5.85e-04 | 4312.51 ms | 31.3% bf16 MFU | 125714 tok/s step 2580/19560 | loss 3.717452 (-1.42z)| norm 0.3149 (+0.54z)| lr 5.85e-04 | 4155.19 ms | 32.5% bf16 MFU | 125737 tok/s step 2581/19560 | loss 3.815552 (+1.12z)| norm 0.3056 (+0.23z)| lr 5.85e-04 | 4149.59 ms | 32.5% bf16 MFU | 125768 tok/s step 2582/19560 | loss 3.760214 (-0.31z)| norm 0.3173 (+0.66z)| lr 5.85e-04 | 4464.62 ms | 30.2% bf16 MFU | 125351 tok/s step 2583/19560 | 
loss 3.761499 (-0.28z)| norm 0.2907 (-0.29z)| lr 5.85e-04 | 4146.07 ms | 32.6% bf16 MFU | 125406 tok/s step 2584/19560 | loss 3.795827 (+0.61z)| norm 0.3117 (+0.46z)| lr 5.85e-04 | 4150.09 ms | 32.5% bf16 MFU | 125452 tok/s step 2585/19560 | loss 3.795473 (+0.59z)| norm 0.2591 (-1.41z)| lr 5.85e-04 | 8313.15 ms | 16.2% bf16 MFU | 122333 tok/s step 2586/19560 | loss 3.797670 (+0.64z)| norm 0.2775 (-0.75z)| lr 5.85e-04 | 4155.20 ms | 32.5% bf16 MFU | 122525 tok/s step 2587/19560 | loss 3.772325 (-0.04z)| norm 0.2794 (-0.68z)| lr 5.85e-04 | 4153.83 ms | 32.5% bf16 MFU | 122710 tok/s step 2588/19560 | loss 3.767243 (-0.16z)| norm 0.3006 (+0.07z)| lr 5.85e-04 | 4145.86 ms | 32.6% bf16 MFU | 122897 tok/s step 2589/19560 | loss 3.750681 (-0.62z)| norm 0.2772 (-0.76z)| lr 5.85e-04 | 4156.21 ms | 32.5% bf16 MFU | 123060 tok/s step 2590/19560 | loss 3.681415 (-2.39z)| norm 0.2798 (-0.66z)| lr 5.85e-04 | 4166.99 ms | 32.4% bf16 MFU | 123198 tok/s step 2591/19560 | loss 3.793796 (+0.53z)| norm 0.2729 (-0.91z)| lr 5.85e-04 | 4169.92 ms | 32.4% bf16 MFU | 123324 tok/s step 2592/19560 | loss 3.742934 (-0.78z)| norm 0.2634 (-1.23z)| lr 5.85e-04 | 4149.86 ms | 32.5% bf16 MFU | 123475 tok/s step 2593/19560 | loss 3.752333 (-0.53z)| norm 0.2912 (-0.25z)| lr 5.85e-04 | 4145.85 ms | 32.6% bf16 MFU | 123624 tok/s step 2594/19560 | loss 3.839297 (+1.70z)| norm 0.2776 (-0.72z)| lr 5.85e-04 | 4152.46 ms | 32.5% bf16 MFU | 123756 tok/s step 2595/19560 | loss 3.767055 (-0.14z)| norm 0.2881 (-0.34z)| lr 5.85e-04 | 4153.17 ms | 32.5% bf16 MFU | 123880 tok/s step 2596/19560 | loss 3.775079 (+0.08z)| norm 0.3162 (+0.64z)| lr 5.85e-04 | 4151.45 ms | 32.5% bf16 MFU | 124001 tok/s step 2597/19560 | loss 3.767295 (-0.12z)| norm 0.3204 (+0.78z)| lr 5.85e-04 | 4149.16 ms | 32.5% bf16 MFU | 124119 tok/s step 2598/19560 | loss 3.785121 (+0.34z)| norm 0.2953 (-0.10z)| lr 5.85e-04 | 4153.43 ms | 32.5% bf16 MFU | 124224 tok/s step 2599/19560 | loss 3.739038 (-0.88z)| norm 0.2795 (-0.65z)| lr 5.85e-04 | 
4172.68 ms | 32.4% bf16 MFU | 124296 tok/s step 2600/19560 | loss 3.775636 (+0.13z)| norm 0.3069 (+0.31z)| lr 5.85e-04 | 4164.05 ms | 32.4% bf16 MFU | 124376 tok/s step 2601/19560 | loss 3.835792 (+1.75z)| norm 0.2967 (-0.04z)| lr 5.85e-04 | 4159.61 ms | 32.5% bf16 MFU | 124459 tok/s step 2602/19560 | loss 3.797231 (+0.70z)| norm 0.3062 (+0.28z)| lr 5.85e-04 | 4164.10 ms | 32.4% bf16 MFU | 124532 tok/s step 2603/19560 | loss 3.755957 (-0.41z)| norm 0.2776 (-0.72z)| lr 5.85e-04 | 4167.86 ms | 32.4% bf16 MFU | 124595 tok/s step 2604/19560 | loss 3.756305 (-0.40z)| norm 0.2676 (-1.09z)| lr 5.85e-04 | 4160.35 ms | 32.5% bf16 MFU | 124666 tok/s step 2605/19560 | loss 3.768208 (-0.08z)| norm 0.2702 (-1.02z)| lr 5.85e-04 | 4163.86 ms | 32.4% bf16 MFU | 124729 tok/s step 2606/19560 | loss 3.749904 (-0.58z)| norm 0.2601 (-1.37z)| lr 5.85e-04 | 4150.40 ms | 32.5% bf16 MFU | 124808 tok/s step 2607/19560 | loss 3.808420 (+1.00z)| norm 0.2909 (-0.27z)| lr 5.85e-04 | 4156.93 ms | 32.5% bf16 MFU | 124874 tok/s step 2608/19560 | loss 3.791620 (+0.55z)| norm 0.2933 (-0.19z)| lr 5.85e-04 | 4158.27 ms | 32.5% bf16 MFU | 124934 tok/s step 2609/19560 | loss 3.770007 (-0.04z)| norm 0.3015 (+0.09z)| lr 5.85e-04 | 4145.29 ms | 32.6% bf16 MFU | 125012 tok/s step 2610/19560 | loss 3.734961 (-0.98z)| norm 0.3250 (+0.94z)| lr 5.85e-04 | 4209.89 ms | 32.1% bf16 MFU | 124988 tok/s step 2611/19560 | loss 3.794773 (+0.65z)| norm 0.2782 (-0.74z)| lr 5.85e-04 | 4150.94 ms | 32.5% bf16 MFU | 125054 tok/s step 2612/19560 | loss 3.798336 (+0.75z)| norm 0.2740 (-0.87z)| lr 5.85e-04 | 4152.20 ms | 32.5% bf16 MFU | 125114 tok/s step 2613/19560 | loss 3.798340 (+0.75z)| norm 0.3235 (+0.89z)| lr 5.85e-04 | 4146.00 ms | 32.6% bf16 MFU | 125182 tok/s step 2614/19560 | loss 3.741915 (-0.79z)| norm 0.2654 (-1.18z)| lr 5.85e-04 | 4155.48 ms | 32.5% bf16 MFU | 125231 tok/s step 2615/19560 | loss 3.745340 (-0.71z)| norm 0.2744 (-0.84z)| lr 5.85e-04 | 4150.43 ms | 32.5% bf16 MFU | 125285 tok/s step 2616/19560 | 
loss 3.736901 (-0.92z)| norm 0.2887 (-0.32z)| lr 5.85e-04 | 4158.89 ms | 32.5% bf16 MFU | 125324 tok/s step 2617/19560 | loss 3.761566 (-0.25z)| norm 0.2865 (-0.40z)| lr 5.85e-04 | 4153.76 ms | 32.5% bf16 MFU | 125369 tok/s step 2618/19560 | loss 3.756826 (-0.37z)| norm 0.2834 (-0.50z)| lr 5.85e-04 | 4155.51 ms | 32.5% bf16 MFU | 125409 tok/s step 2619/19560 | loss 3.743284 (-0.74z)| norm 0.2773 (-0.72z)| lr 5.85e-04 | 4158.43 ms | 32.5% bf16 MFU | 125443 tok/s step 2620/19560 | loss 3.764385 (-0.16z)| norm 0.2943 (-0.11z)| lr 5.85e-04 | 4149.65 ms | 32.5% bf16 MFU | 125488 tok/s step 2621/19560 | loss 3.793868 (+0.66z)| norm 0.3287 (+1.18z)| lr 5.85e-04 | 4153.32 ms | 32.5% bf16 MFU | 125525 tok/s step 2622/19560 | loss 3.785529 (+0.43z)| norm 0.3064 (+0.37z)| lr 5.85e-04 | 4168.36 ms | 32.4% bf16 MFU | 125538 tok/s step 2623/19560 | loss 3.766035 (-0.10z)| norm 0.3139 (+0.69z)| lr 5.85e-04 | 4169.27 ms | 32.4% bf16 MFU | 125548 tok/s step 2624/19560 | loss 3.793410 (+0.69z)| norm 0.2795 (-0.63z)| lr 5.85e-04 | 4170.82 ms | 32.4% bf16 MFU | 125556 tok/s step 2625/19560 | loss 3.710116 (-1.65z)| norm 0.2631 (-1.25z)| lr 5.85e-04 | 4148.16 ms | 32.5% bf16 MFU | 125598 tok/s step 2626/19560 | loss 3.729907 (-1.10z)| norm 0.2533 (-1.61z)| lr 5.85e-04 | 4163.05 ms | 32.4% bf16 MFU | 125615 tok/s step 2627/19560 | loss 3.778060 (+0.28z)| norm 0.2700 (-0.95z)| lr 5.85e-04 | 4146.83 ms | 32.6% bf16 MFU | 125656 tok/s step 2628/19560 | loss 3.817866 (+1.44z)| norm 0.2807 (-0.54z)| lr 5.85e-04 | 4151.09 ms | 32.5% bf16 MFU | 125688 tok/s step 2629/19560 | loss 3.739626 (-0.81z)| norm 0.2900 (-0.18z)| lr 5.85e-04 | 4181.93 ms | 32.3% bf16 MFU | 125672 tok/s step 2630/19560 | loss 3.748079 (-0.56z)| norm 0.2935 (-0.04z)| lr 5.85e-04 | 4147.60 ms | 32.6% bf16 MFU | 125709 tok/s step 2631/19560 | loss 3.793169 (+0.73z)| norm 0.2785 (-0.60z)| lr 5.85e-04 | 4151.20 ms | 32.5% bf16 MFU | 125738 tok/s step 2632/19560 | loss 3.710553 (-1.62z)| norm 0.2828 (-0.43z)| lr 5.85e-04 | 
4153.85 ms | 32.5% bf16 MFU | 125762 tok/s step 2633/19560 | loss 3.745827 (-0.60z)| norm 0.2984 (+0.18z)| lr 5.85e-04 | 4186.59 ms | 32.2% bf16 MFU | 125736 tok/s step 2634/19560 | loss 3.774413 (+0.22z)| norm 0.2934 (-0.02z)| lr 5.85e-04 | 4153.37 ms | 32.5% bf16 MFU | 125760 tok/s step 2635/19560 | loss 3.832029 (+1.88z)| norm 0.2955 (+0.06z)| lr 5.85e-04 | 4161.77 ms | 32.4% bf16 MFU | 125771 tok/s step 2636/19560 | loss 3.797324 (+0.90z)| norm 0.3035 (+0.37z)| lr 5.85e-04 | 4158.37 ms | 32.5% bf16 MFU | 125787 tok/s step 2637/19560 | loss 3.790395 (+0.69z)| norm 0.2959 (+0.06z)| lr 5.85e-04 | 4153.30 ms | 32.5% bf16 MFU | 125809 tok/s step 2638/19560 | loss 3.773463 (+0.19z)| norm 0.2966 (+0.08z)| lr 5.85e-04 | 4151.61 ms | 32.5% bf16 MFU | 125833 tok/s step 2639/19560 | loss 3.758400 (-0.26z)| norm 0.2981 (+0.13z)| lr 5.85e-04 | 4161.52 ms | 32.4% bf16 MFU | 125840 tok/s step 2640/19560 | loss 3.724328 (-1.25z)| norm 0.2938 (-0.04z)| lr 5.84e-04 | 4150.69 ms | 32.5% bf16 MFU | 125864 tok/s step 2641/19560 | loss 3.777611 (+0.30z)| norm 0.2865 (-0.35z)| lr 5.84e-04 | 4167.95 ms | 32.4% bf16 MFU | 125860 tok/s step 2642/19560 | loss 3.741458 (-0.77z)| norm 0.2991 (+0.14z)| lr 5.84e-04 | 4154.60 ms | 32.5% bf16 MFU | 125877 tok/s step 2643/19560 | loss 3.721652 (-1.33z)| norm 0.2992 (+0.14z)| lr 5.84e-04 | 4157.67 ms | 32.5% bf16 MFU | 125888 tok/s step 2644/19560 | loss 3.757528 (-0.28z)| norm 0.2914 (-0.19z)| lr 5.84e-04 | 4153.86 ms | 32.5% bf16 MFU | 125905 tok/s step 2645/19560 | loss 3.787187 (+0.60z)| norm 0.3032 (+0.29z)| lr 5.84e-04 | 4159.67 ms | 32.5% bf16 MFU | 125912 tok/s step 2646/19560 | loss 3.881979 (+3.22z)| norm 0.2806 (-0.65z)| lr 5.84e-04 | 4176.96 ms | 32.3% bf16 MFU | 125892 tok/s step 2647/19560 | loss 3.760130 (-0.21z)| norm 0.2609 (-1.45z)| lr 5.84e-04 | 4145.42 ms | 32.6% bf16 MFU | 125921 tok/s step 2648/19560 | loss 3.744734 (-0.65z)| norm 0.2627 (-1.35z)| lr 5.84e-04 | 4149.19 ms | 32.5% bf16 MFU | 125943 tok/s step 2649/19560 | 
loss 3.747664 (-0.57z)| norm 0.2760 (-0.80z)| lr 5.84e-04 | 4158.27 ms | 32.5% bf16 MFU | 125950 tok/s step 2650/19560 | loss 3.743838 (-0.69z)| norm 0.2986 (+0.13z)| lr 5.84e-04 | 4156.36 ms | 32.5% bf16 MFU | 125959 tok/s step 2651/19560 | loss 3.742946 (-0.72z)| norm 0.2892 (-0.26z)| lr 5.84e-04 | 4151.25 ms | 32.5% bf16 MFU | 125976 tok/s step 2652/19560 | loss 3.813971 (+1.29z)| norm 0.2874 (-0.33z)| lr 5.84e-04 | 4162.55 ms | 32.4% bf16 MFU | 125975 tok/s step 2653/19560 | loss 3.768337 (-0.00z)| norm 0.2885 (-0.27z)| lr 5.84e-04 | 4179.65 ms | 32.3% bf16 MFU | 125948 tok/s step 2654/19560 | loss 3.756208 (-0.37z)| norm 0.2786 (-0.69z)| lr 5.84e-04 | 4162.39 ms | 32.4% bf16 MFU | 125949 tok/s step 2655/19560 | loss 3.763445 (-0.15z)| norm 0.2791 (-0.66z)| lr 5.84e-04 | 4163.24 ms | 32.4% bf16 MFU | 125948 tok/s step 2656/19560 | loss 3.750852 (-0.52z)| norm 0.2806 (-0.59z)| lr 5.84e-04 | 4162.20 ms | 32.4% bf16 MFU | 125949 tok/s step 2657/19560 | loss 3.736747 (-0.94z)| norm 0.2790 (-0.64z)| lr 5.84e-04 | 4176.35 ms | 32.3% bf16 MFU | 125928 tok/s step 2658/19560 | loss 3.797542 (+0.84z)| norm 0.2800 (-0.58z)| lr 5.84e-04 | 4165.38 ms | 32.4% bf16 MFU | 125925 tok/s step 2659/19560 | loss 3.741165 (-0.81z)| norm 0.2884 (-0.21z)| lr 5.84e-04 | 4148.73 ms | 32.5% bf16 MFU | 125948 tok/s step 2660/19560 | loss 3.759740 (-0.26z)| norm 0.2845 (-0.38z)| lr 5.84e-04 | 4156.19 ms | 32.5% bf16 MFU | 125958 tok/s step 2661/19560 | loss 3.758476 (-0.28z)| norm 0.2865 (-0.28z)| lr 5.84e-04 | 4146.10 ms | 32.6% bf16 MFU | 125982 tok/s step 2662/19560 | loss 3.788050 (+0.59z)| norm 0.2573 (-1.54z)| lr 5.84e-04 | 4156.79 ms | 32.5% bf16 MFU | 125990 tok/s step 2663/19560 | loss 3.717480 (-1.47z)| norm 0.2477 (-1.93z)| lr 5.84e-04 | 4153.50 ms | 32.5% bf16 MFU | 126002 tok/s step 2664/19560 | loss 3.773542 (+0.19z)| norm 0.2695 (-0.97z)| lr 5.84e-04 | 4161.76 ms | 32.4% bf16 MFU | 126000 tok/s step 2665/19560 | loss 3.774771 (+0.24z)| norm 0.2832 (-0.38z)| lr 5.84e-04 | 
4152.35 ms | 32.5% bf16 MFU | 126013 tok/s step 2666/19560 | loss 3.799298 (+0.96z)| norm 0.2990 (+0.32z)| lr 5.84e-04 | 4156.76 ms | 32.5% bf16 MFU | 126019 tok/s step 2667/19560 | loss 3.769311 (+0.05z)| norm 0.3471 (+2.34z)| lr 5.84e-04 | 4158.72 ms | 32.5% bf16 MFU | 126022 tok/s step 2668/19560 | loss 3.801223 (+1.01z)| norm 0.3888 (+3.84z)| lr 5.84e-04 | 4159.68 ms | 32.5% bf16 MFU | 126023 tok/s step 2669/19560 | loss 3.773985 (+0.19z)| norm 0.4099 (+4.30z)| lr 5.84e-04 | 4154.91 ms | 32.5% bf16 MFU | 126031 tok/s step 2670/19560 | loss 3.744819 (-0.70z)| norm 0.3708 (+2.80z)| lr 5.84e-04 | 4168.45 ms | 32.4% bf16 MFU | 126018 tok/s step 2671/19560 | loss 3.795583 (+0.83z)| norm 0.3630 (+2.47z)| lr 5.84e-04 | 4161.97 ms | 32.4% bf16 MFU | 126016 tok/s step 2672/19560 | loss 3.763610 (-0.14z)| norm 0.3731 (+2.73z)| lr 5.84e-04 | 4153.77 ms | 32.5% bf16 MFU | 126026 tok/s step 2673/19560 | loss 3.733975 (-1.02z)| norm 0.3147 (+0.69z)| lr 5.84e-04 | 4160.17 ms | 32.5% bf16 MFU | 126026 tok/s step 2674/19560 | loss 3.735275 (-0.98z)| norm 0.2870 (-0.28z)| lr 5.84e-04 | 4152.17 ms | 32.5% bf16 MFU | 126038 tok/s step 2675/19560 | loss 3.767326 (-0.00z)| norm 0.3034 (+0.29z)| lr 5.84e-04 | 4156.99 ms | 32.5% bf16 MFU | 126042 tok/s step 2676/19560 | loss 3.693449 (-2.19z)| norm 0.2763 (-0.64z)| lr 5.84e-04 | 4158.77 ms | 32.5% bf16 MFU | 126044 tok/s step 2677/19560 | loss 3.769338 (+0.09z)| norm 0.2677 (-0.93z)| lr 5.84e-04 | 4154.78 ms | 32.5% bf16 MFU | 126051 tok/s step 2678/19560 | loss 3.711890 (-1.62z)| norm 0.2552 (-1.35z)| lr 5.84e-04 | 4161.22 ms | 32.4% bf16 MFU | 126048 tok/s step 2679/19560 | loss 3.796959 (+0.92z)| norm 0.2707 (-0.81z)| lr 5.84e-04 | 4164.08 ms | 32.4% bf16 MFU | 126041 tok/s step 2680/19560 | loss 3.734141 (-0.94z)| norm 0.2589 (-1.24z)| lr 5.84e-04 | 4171.20 ms | 32.4% bf16 MFU | 126023 tok/s step 2681/19560 | loss 3.803036 (+1.12z)| norm 0.2788 (-0.51z)| lr 5.84e-04 | 4152.63 ms | 32.5% bf16 MFU | 126035 tok/s step 2682/19560 | 
loss 3.734439 (-0.94z)| norm 0.2784 (-0.51z)| lr 5.84e-04 | 4158.16 ms | 32.5% bf16 MFU | 126038 tok/s step 2683/19560 | loss 3.716558 (-1.46z)| norm 0.2789 (-0.49z)| lr 5.84e-04 | 4147.42 ms | 32.6% bf16 MFU | 126056 tok/s step 2684/19560 | loss 3.752410 (-0.36z)| norm 0.3009 (+0.33z)| lr 5.84e-04 | 4159.72 ms | 32.5% bf16 MFU | 126056 tok/s step 2685/19560 | loss 3.806306 (+1.28z)| norm 0.3021 (+0.37z)| lr 5.84e-04 | 4151.56 ms | 32.5% bf16 MFU | 126067 tok/s step 2686/19560 | loss 3.704646 (-1.81z)| norm 0.2899 (-0.10z)| lr 5.84e-04 | 4153.86 ms | 32.5% bf16 MFU | 126075 tok/s step 2687/19560 | loss 3.717178 (-1.41z)| norm 0.2607 (-1.19z)| lr 5.84e-04 | 4208.42 ms | 32.1% bf16 MFU | 126000 tok/s step 2688/19560 | loss 3.746719 (-0.50z)| norm 0.2734 (-0.72z)| lr 5.84e-04 | 4161.29 ms | 32.4% bf16 MFU | 125999 tok/s step 2689/19560 | loss 3.755057 (-0.25z)| norm 0.2789 (-0.52z)| lr 5.84e-04 | 4163.77 ms | 32.4% bf16 MFU | 125995 tok/s step 2690/19560 | loss 3.828991 (+1.99z)| norm 0.2844 (-0.31z)| lr 5.84e-04 | 4154.80 ms | 32.5% bf16 MFU | 126005 tok/s step 2691/19560 | loss 3.779949 (+0.50z)| norm 0.2685 (-0.91z)| lr 5.84e-04 | 4154.89 ms | 32.5% bf16 MFU | 126014 tok/s step 2692/19560 | loss 3.787392 (+0.72z)| norm 0.2849 (-0.28z)| lr 5.84e-04 | 4246.59 ms | 31.8% bf16 MFU | 125886 tok/s step 2693/19560 | loss 3.758039 (-0.19z)| norm 0.2878 (-0.17z)| lr 5.84e-04 | 4202.21 ms | 32.1% bf16 MFU | 125830 tok/s step 2694/19560 | loss 3.770265 (+0.18z)| norm 0.2930 (+0.01z)| lr 5.84e-04 | 4148.85 ms | 32.5% bf16 MFU | 125857 tok/s step 2695/19560 | loss 3.698345 (-1.99z)| norm 0.2816 (-0.42z)| lr 5.84e-04 | 4155.60 ms | 32.5% bf16 MFU | 125873 tok/s step 2696/19560 | loss 3.762516 (-0.05z)| norm 0.2846 (-0.32z)| lr 5.84e-04 | 4157.99 ms | 32.5% bf16 MFU | 125884 tok/s step 2697/19560 | loss 3.714601 (-1.48z)| norm 0.2824 (-0.40z)| lr 5.84e-04 | 4145.76 ms | 32.6% bf16 MFU | 125913 tok/s step 2698/19560 | loss 3.711446 (-1.59z)| norm 0.2554 (-1.41z)| lr 5.84e-04 | 
4149.10 ms | 32.5% bf16 MFU | 125935 tok/s step 2699/19560 | loss 3.778875 (+0.45z)| norm 0.2860 (-0.24z)| lr 5.84e-04 | 4159.59 ms | 32.5% bf16 MFU | 125940 tok/s step 2700/19560 | loss 3.767964 (+0.14z)| norm 0.3090 (+0.64z)| lr 5.84e-04 | 4150.39 ms | 32.5% bf16 MFU | 125960 tok/s step 2701/19560 | loss 3.732239 (-0.95z)| norm 0.2943 (+0.09z)| lr 5.84e-04 | 4147.82 ms | 32.6% bf16 MFU | 125982 tok/s step 2702/19560 | loss 3.850734 (+2.58z)| norm 0.2800 (-0.45z)| lr 5.83e-04 | 4160.33 ms | 32.5% bf16 MFU | 125984 tok/s step 2703/19560 | loss 3.759539 (-0.15z)| norm 0.2656 (-1.00z)| lr 5.83e-04 | 4154.25 ms | 32.5% bf16 MFU | 125995 tok/s step 2704/19560 | loss 3.732508 (-0.95z)| norm 0.3151 (+0.91z)| lr 5.83e-04 | 4170.75 ms | 32.4% bf16 MFU | 125980 tok/s step 2705/19560 | loss 3.829930 (+1.92z)| norm 0.3048 (+0.52z)| lr 5.83e-04 | 4167.33 ms | 32.4% bf16 MFU | 125972 tok/s step 2706/19560 | loss 3.723904 (-1.19z)| norm 0.2682 (-0.88z)| lr 5.83e-04 | 4156.06 ms | 32.5% bf16 MFU | 125981 tok/s step 2707/19560 | loss 3.715124 (-1.45z)| norm 0.2569 (-1.31z)| lr 5.83e-04 | 4155.92 ms | 32.5% bf16 MFU | 125989 tok/s step 2708/19560 | loss 3.772282 (+0.22z)| norm 0.3050 (+0.59z)| lr 5.83e-04 | 4156.52 ms | 32.5% bf16 MFU | 125997 tok/s step 2709/19560 | loss 3.767523 (+0.09z)| norm 0.2777 (-0.48z)| lr 5.83e-04 | 4154.81 ms | 32.5% bf16 MFU | 126006 tok/s step 2710/19560 | loss 3.759234 (-0.16z)| norm 0.2496 (-1.56z)| lr 5.83e-04 | 4186.79 ms | 32.2% bf16 MFU | 125967 tok/s step 2711/19560 | loss 3.730705 (-1.00z)| norm 0.2670 (-0.87z)| lr 5.83e-04 | 4155.98 ms | 32.5% bf16 MFU | 125976 tok/s step 2712/19560 | loss 3.791086 (+0.79z)| norm 0.3185 (+1.15z)| lr 5.83e-04 | 4152.21 ms | 32.5% bf16 MFU | 125991 tok/s step 2713/19560 | loss 3.822123 (+1.70z)| norm 0.3231 (+1.30z)| lr 5.83e-04 | 4162.42 ms | 32.4% bf16 MFU | 125989 tok/s step 2714/19560 | loss 3.777244 (+0.38z)| norm 0.2942 (+0.17z)| lr 5.83e-04 | 4148.87 ms | 32.5% bf16 MFU | 126008 tok/s step 2715/19560 | 
loss 3.763927 (-0.01z)| norm 0.3048 (+0.58z)| lr 5.83e-04 | 4156.70 ms | 32.5% bf16 MFU | 126014 tok/s step 2716/19560 | loss 3.748743 (-0.45z)| norm 0.3065 (+0.64z)| lr 5.83e-04 | 4147.98 ms | 32.6% bf16 MFU | 126033 tok/s step 2717/19560 | loss 3.720535 (-1.27z)| norm 0.3094 (+0.74z)| lr 5.83e-04 | 4163.26 ms | 32.4% bf16 MFU | 126028 tok/s step 2718/19560 | loss 3.700925 (-1.87z)| norm 0.3046 (+0.55z)| lr 5.83e-04 | 4151.45 ms | 32.5% bf16 MFU | 126041 tok/s step 2719/19560 | loss 3.724831 (-1.14z)| norm 0.2735 (-0.66z)| lr 5.83e-04 | 4165.26 ms | 32.4% bf16 MFU | 126033 tok/s step 2720/19560 | loss 3.725094 (-1.13z)| norm 0.2806 (-0.39z)| lr 5.83e-04 | 4174.50 ms | 32.3% bf16 MFU | 126011 tok/s step 2721/19560 | loss 3.781649 (+0.53z)| norm 0.2925 (+0.07z)| lr 5.83e-04 | 4175.66 ms | 32.3% bf16 MFU | 125988 tok/s step 2722/19560 | loss 3.749808 (-0.39z)| norm 0.2857 (-0.20z)| lr 5.83e-04 | 4168.07 ms | 32.4% bf16 MFU | 125978 tok/s step 2723/19560 | loss 3.733681 (-0.87z)| norm 0.3005 (+0.38z)| lr 5.83e-04 | 4158.96 ms | 32.5% bf16 MFU | 125982 tok/s step 2724/19560 | loss 3.745656 (-0.50z)| norm 0.2980 (+0.29z)| lr 5.83e-04 | 4157.53 ms | 32.5% bf16 MFU | 125989 tok/s step 2725/19560 | loss 3.788218 (+0.77z)| norm 0.3138 (+0.91z)| lr 5.83e-04 | 4158.52 ms | 32.5% bf16 MFU | 125993 tok/s step 2726/19560 | loss 3.791379 (+0.86z)| norm 0.3016 (+0.43z)| lr 5.83e-04 | 4149.37 ms | 32.5% bf16 MFU | 126011 tok/s step 2727/19560 | loss 3.768706 (+0.17z)| norm 0.2968 (+0.24z)| lr 5.83e-04 | 4147.58 ms | 32.6% bf16 MFU | 126031 tok/s step 2728/19560 | loss 3.725971 (-1.09z)| norm 0.3479 (+2.19z)| lr 5.83e-04 | 4138.24 ms | 32.6% bf16 MFU | 126064 tok/s step 2729/19560 | loss 3.736502 (-0.76z)| norm 0.3713 (+2.96z)| lr 5.83e-04 | 4141.15 ms | 32.6% bf16 MFU | 126091 tok/s step 2730/19560 | loss 3.776308 (+0.45z)| norm 0.3180 (+0.97z)| lr 5.83e-04 | 4152.54 ms | 32.5% bf16 MFU | 126099 tok/s step 2731/19560 | loss 3.723437 (-1.14z)| norm 0.2827 (-0.34z)| lr 5.83e-04 | 
4151.67 ms | 32.5% bf16 MFU | 126108 tok/s step 2732/19560 | loss 3.738600 (-0.68z)| norm 0.2781 (-0.52z)| lr 5.83e-04 | 4183.07 ms | 32.3% bf16 MFU | 126070 tok/s step 2733/19560 | loss 3.766018 (+0.15z)| norm 0.2607 (-1.16z)| lr 5.83e-04 | 4153.51 ms | 32.5% bf16 MFU | 126078 tok/s step 2734/19560 | loss 3.853724 (+2.69z)| norm 0.2789 (-0.49z)| lr 5.83e-04 | 4206.38 ms | 32.1% bf16 MFU | 126006 tok/s step 2735/19560 | loss 3.772389 (+0.32z)| norm 0.3107 (+0.69z)| lr 5.83e-04 | 4160.51 ms | 32.5% bf16 MFU | 126006 tok/s step 2736/19560 | loss 3.740745 (-0.61z)| norm 0.3277 (+1.31z)| lr 5.83e-04 | 4159.23 ms | 32.5% bf16 MFU | 126009 tok/s step 2737/19560 | loss 3.745016 (-0.47z)| norm 0.3251 (+1.20z)| lr 5.83e-04 | 4224.43 ms | 32.0% bf16 MFU | 125914 tok/s step 2738/19560 | loss 3.800848 (+1.16z)| norm 0.3176 (+0.92z)| lr 5.83e-04 | 4144.82 ms | 32.6% bf16 MFU | 125943 tok/s step 2739/19560 | loss 3.705200 (-1.63z)| norm 0.2985 (+0.21z)| lr 5.83e-04 | 4151.31 ms | 32.5% bf16 MFU | 125960 tok/s step 2740/19560 | loss 3.731265 (-0.85z)| norm 0.2833 (-0.35z)| lr 5.83e-04 | 4157.16 ms | 32.5% bf16 MFU | 125968 tok/s step 2741/19560 | loss 3.735515 (-0.72z)| norm 0.2546 (-1.39z)| lr 5.83e-04 | 4163.48 ms | 32.4% bf16 MFU | 125966 tok/s step 2742/19560 | loss 3.709242 (-1.47z)| norm 0.2613 (-1.14z)| lr 5.83e-04 | 4162.97 ms | 32.4% bf16 MFU | 125965 tok/s step 2743/19560 | loss 3.773966 (+0.41z)| norm 0.2776 (-0.54z)| lr 5.83e-04 | 4151.65 ms | 32.5% bf16 MFU | 125981 tok/s step 2744/19560 | loss 3.716909 (-1.24z)| norm 0.3039 (+0.42z)| lr 5.83e-04 | 4152.67 ms | 32.5% bf16 MFU | 125994 tok/s step 2745/19560 | loss 3.776452 (+0.48z)| norm 0.2862 (-0.23z)| lr 5.83e-04 | 4157.47 ms | 32.5% bf16 MFU | 126000 tok/s step 2746/19560 | loss 3.740172 (-0.57z)| norm 0.2936 (+0.04z)| lr 5.83e-04 | 4169.92 ms | 32.4% bf16 MFU | 125987 tok/s step 2747/19560 | loss 3.768452 (+0.25z)| norm 0.3000 (+0.27z)| lr 5.83e-04 | 4168.05 ms | 32.4% bf16 MFU | 125977 tok/s step 2748/19560 | 
loss 3.716292 (-1.25z)| norm 0.2612 (-1.15z)| lr 5.83e-04 | 4171.58 ms | 32.4% bf16 MFU | 125962 tok/s step 2749/19560 | loss 3.697348 (-1.76z)| norm 0.2736 (-0.68z)| lr 5.83e-04 | 4152.95 ms | 32.5% bf16 MFU | 125976 tok/s step 2750/19560 | loss 3.748284 (-0.29z)| norm 0.2822 (-0.35z)| lr 5.83e-04 | 4147.39 ms | 32.6% bf16 MFU | 125998 tok/s val loss 3.726779 evaluating HellaSwag: 0/1256 evaluating HellaSwag: 10/1256 evaluating HellaSwag: 20/1256 evaluating HellaSwag: 30/1256 evaluating HellaSwag: 40/1256 evaluating HellaSwag: 50/1256 evaluating HellaSwag: 60/1256 evaluating HellaSwag: 70/1256 evaluating HellaSwag: 80/1256 evaluating HellaSwag: 90/1256 evaluating HellaSwag: 100/1256 evaluating HellaSwag: 110/1256 evaluating HellaSwag: 120/1256 evaluating HellaSwag: 130/1256 evaluating HellaSwag: 140/1256 evaluating HellaSwag: 150/1256 evaluating HellaSwag: 160/1256 evaluating HellaSwag: 170/1256 evaluating HellaSwag: 180/1256 evaluating HellaSwag: 190/1256 evaluating HellaSwag: 200/1256 evaluating HellaSwag: 210/1256 evaluating HellaSwag: 220/1256 evaluating HellaSwag: 230/1256 evaluating HellaSwag: 240/1256 evaluating HellaSwag: 250/1256 evaluating HellaSwag: 260/1256 evaluating HellaSwag: 270/1256 evaluating HellaSwag: 280/1256 evaluating HellaSwag: 290/1256 evaluating HellaSwag: 300/1256 evaluating HellaSwag: 310/1256 evaluating HellaSwag: 320/1256 evaluating HellaSwag: 330/1256 evaluating HellaSwag: 340/1256 evaluating HellaSwag: 350/1256 evaluating HellaSwag: 360/1256 evaluating HellaSwag: 370/1256 evaluating HellaSwag: 380/1256 evaluating HellaSwag: 390/1256 evaluating HellaSwag: 400/1256 evaluating HellaSwag: 410/1256 evaluating HellaSwag: 420/1256 evaluating HellaSwag: 430/1256 evaluating HellaSwag: 440/1256 evaluating HellaSwag: 450/1256 evaluating HellaSwag: 460/1256 evaluating HellaSwag: 470/1256 evaluating HellaSwag: 480/1256 evaluating HellaSwag: 490/1256 evaluating HellaSwag: 500/1256 evaluating HellaSwag: 510/1256 evaluating HellaSwag: 520/1256 
evaluating HellaSwag: 530/1256 evaluating HellaSwag: 540/1256 evaluating HellaSwag: 550/1256 evaluating HellaSwag: 560/1256 evaluating HellaSwag: 570/1256 evaluating HellaSwag: 580/1256 evaluating HellaSwag: 590/1256 evaluating HellaSwag: 600/1256 evaluating HellaSwag: 610/1256 evaluating HellaSwag: 620/1256 evaluating HellaSwag: 630/1256 evaluating HellaSwag: 640/1256 evaluating HellaSwag: 650/1256 evaluating HellaSwag: 660/1256 evaluating HellaSwag: 670/1256 evaluating HellaSwag: 680/1256 evaluating HellaSwag: 690/1256 evaluating HellaSwag: 700/1256 evaluating HellaSwag: 710/1256 evaluating HellaSwag: 720/1256 evaluating HellaSwag: 730/1256 evaluating HellaSwag: 740/1256 evaluating HellaSwag: 750/1256 evaluating HellaSwag: 760/1256 evaluating HellaSwag: 770/1256 evaluating HellaSwag: 780/1256 evaluating HellaSwag: 790/1256 evaluating HellaSwag: 800/1256 evaluating HellaSwag: 810/1256 evaluating HellaSwag: 820/1256 evaluating HellaSwag: 830/1256 evaluating HellaSwag: 840/1256 evaluating HellaSwag: 850/1256 evaluating HellaSwag: 860/1256 evaluating HellaSwag: 870/1256 evaluating HellaSwag: 880/1256 evaluating HellaSwag: 890/1256 evaluating HellaSwag: 900/1256 evaluating HellaSwag: 910/1256 evaluating HellaSwag: 920/1256 evaluating HellaSwag: 930/1256 evaluating HellaSwag: 940/1256 evaluating HellaSwag: 950/1256 evaluating HellaSwag: 960/1256 evaluating HellaSwag: 970/1256 evaluating HellaSwag: 980/1256 evaluating HellaSwag: 990/1256 evaluating HellaSwag: 1000/1256 evaluating HellaSwag: 1010/1256 evaluating HellaSwag: 1020/1256 evaluating HellaSwag: 1030/1256 evaluating HellaSwag: 1040/1256 evaluating HellaSwag: 1050/1256 evaluating HellaSwag: 1060/1256 evaluating HellaSwag: 1070/1256 evaluating HellaSwag: 1080/1256 evaluating HellaSwag: 1090/1256 evaluating HellaSwag: 1100/1256 evaluating HellaSwag: 1110/1256 evaluating HellaSwag: 1120/1256 evaluating HellaSwag: 1130/1256 evaluating HellaSwag: 1140/1256 evaluating HellaSwag: 1150/1256 evaluating HellaSwag: 
1160/1256 evaluating HellaSwag: 1170/1256 evaluating HellaSwag: 1180/1256 evaluating HellaSwag: 1190/1256 evaluating HellaSwag: 1200/1256 evaluating HellaSwag: 1210/1256 evaluating HellaSwag: 1220/1256 evaluating HellaSwag: 1230/1256 evaluating HellaSwag: 1240/1256 evaluating HellaSwag: 1250/1256 HellaSwag: 2645/10042 = 0.263394 step 2751/19560 | loss 3.754551 (-0.11z)| norm 0.2797 (-0.44z)| lr 5.83e-04 | 4232.12 ms | 31.9% bf16 MFU | 125892 tok/s step 2752/19560 | loss 3.750414 (-0.22z)| norm 0.3031 (+0.42z)| lr 5.83e-04 | 4336.65 ms | 31.1% bf16 MFU | 125642 tok/s step 2753/19560 | loss 3.735695 (-0.65z)| norm 0.2869 (-0.18z)| lr 5.83e-04 | 4274.87 ms | 31.6% bf16 MFU | 125492 tok/s step 2754/19560 | loss 3.747024 (-0.33z)| norm 0.2864 (-0.22z)| lr 5.83e-04 | 4228.46 ms | 31.9% bf16 MFU | 125417 tok/s step 2755/19560 | loss 3.737821 (-0.59z)| norm 0.2844 (-0.29z)| lr 5.83e-04 | 4194.27 ms | 32.2% bf16 MFU | 125396 tok/s step 2756/19560 | loss 3.738487 (-0.55z)| norm 0.2726 (-0.74z)| lr 5.83e-04 | 4200.61 ms | 32.1% bf16 MFU | 125367 tok/s step 2757/19560 | loss 3.767868 (+0.30z)| norm 0.3099 (+0.66z)| lr 5.83e-04 | 4242.76 ms | 31.8% bf16 MFU | 125278 tok/s step 2758/19560 | loss 3.735718 (-0.64z)| norm 0.2898 (-0.09z)| lr 5.83e-04 | 4176.63 ms | 32.3% bf16 MFU | 125290 tok/s step 2759/19560 | loss 3.663784 (-2.65z)| norm 0.2866 (-0.21z)| lr 5.83e-04 | 4195.08 ms | 32.2% bf16 MFU | 125274 tok/s step 2760/19560 | loss 3.724780 (-0.91z)| norm 0.3113 (+0.70z)| lr 5.83e-04 | 4283.46 ms | 31.5% bf16 MFU | 125131 tok/s step 2761/19560 | loss 3.768682 (+0.34z)| norm 0.2841 (-0.31z)| lr 5.83e-04 | 4160.99 ms | 32.4% bf16 MFU | 125174 tok/s step 2762/19560 | loss 3.724757 (-0.91z)| norm 0.3036 (+0.41z)| lr 5.82e-04 | 8784.82 ms | 15.4% bf16 MFU | 121900 tok/s step 2763/19560 | loss 3.729901 (-0.75z)| norm 0.3245 (+1.18z)| lr 5.82e-04 | 4177.25 ms | 32.3% bf16 MFU | 122080 tok/s step 2764/19560 | loss 3.767300 (+0.35z)| norm 0.3000 (+0.27z)| lr 5.82e-04 | 4181.95 ms | 
32.3% bf16 MFU | 122245 tok/s step 2765/19560 | loss 3.777088 (+0.64z)| norm 0.2718 (-0.77z)| lr 5.82e-04 | 4333.92 ms | 31.2% bf16 MFU | 122181 tok/s step 2766/19560 | loss 3.721629 (-0.97z)| norm 0.2657 (-0.99z)| lr 5.82e-04 | 4239.55 ms | 31.8% bf16 MFU | 122255 tok/s step 2767/19560 | loss 3.714537 (-1.16z)| norm 0.2960 (+0.13z)| lr 5.82e-04 | 4152.91 ms | 32.5% bf16 MFU | 122455 tok/s step 2768/19560 | loss 3.755652 (+0.03z)| norm 0.2830 (-0.34z)| lr 5.82e-04 | 4214.97 ms | 32.0% bf16 MFU | 122551 tok/s step 2769/19560 | loss 3.779603 (+0.72z)| norm 0.2602 (-1.17z)| lr 5.82e-04 | 4150.86 ms | 32.5% bf16 MFU | 122739 tok/s step 2770/19560 | loss 3.687644 (-1.92z)| norm 0.2730 (-0.69z)| lr 5.82e-04 | 4200.39 ms | 32.1% bf16 MFU | 122843 tok/s step 2771/19560 | loss 3.858288 (+2.87z)| norm 0.2637 (-1.02z)| lr 5.82e-04 | 4152.76 ms | 32.5% bf16 MFU | 123014 tok/s step 2772/19560 | loss 3.762592 (+0.20z)| norm 0.3005 (+0.33z)| lr 5.82e-04 | 4244.85 ms | 31.8% bf16 MFU | 123038 tok/s step 2773/19560 | loss 3.737510 (-0.49z)| norm 0.3296 (+1.37z)| lr 5.82e-04 | 4153.40 ms | 32.5% bf16 MFU | 123198 tok/s step 2774/19560 | loss 3.730653 (-0.68z)| norm 0.3474 (+1.97z)| lr 5.82e-04 | 4186.63 ms | 32.2% bf16 MFU | 123300 tok/s step 2775/19560 | loss 3.714126 (-1.15z)| norm 0.3334 (+1.44z)| lr 5.82e-04 | 4203.89 ms | 32.1% bf16 MFU | 123370 tok/s step 2776/19560 | loss 3.692660 (-1.75z)| norm 0.2981 (+0.18z)| lr 5.82e-04 | 4192.26 ms | 32.2% bf16 MFU | 123455 tok/s step 2777/19560 | loss 3.750920 (-0.07z)| norm 0.2904 (-0.10z)| lr 5.82e-04 | 4196.35 ms | 32.2% bf16 MFU | 123529 tok/s step 2778/19560 | loss 3.704642 (-1.39z)| norm 0.3401 (+1.65z)| lr 5.82e-04 | 4146.67 ms | 32.6% bf16 MFU | 123674 tok/s step 2779/19560 | loss 3.698616 (-1.53z)| norm 0.3226 (+1.02z)| lr 5.82e-04 | 4150.20 ms | 32.5% bf16 MFU | 123807 tok/s step 2780/19560 | loss 3.681229 (-1.99z)| norm 0.2925 (-0.05z)| lr 5.82e-04 | 4150.85 ms | 32.5% bf16 MFU | 123932 tok/s step 2781/19560 | loss 3.725610 
(-0.72z)| norm 0.2890 (-0.17z)| lr 5.82e-04 | 4153.50 ms | 32.5% bf16 MFU | 124047 tok/s step 2782/19560 | loss 3.650354 (-2.75z)| norm 0.2907 (-0.12z)| lr 5.82e-04 | 4146.40 ms | 32.6% bf16 MFU | 124167 tok/s step 2783/19560 | loss 3.710938 (-1.07z)| norm 0.2794 (-0.52z)| lr 5.82e-04 | 4149.75 ms | 32.5% bf16 MFU | 124276 tok/s step 2784/19560 | loss 3.690773 (-1.59z)| norm 0.2742 (-0.70z)| lr 5.82e-04 | 4145.09 ms | 32.6% bf16 MFU | 124386 tok/s step 2785/19560 | loss 3.719422 (-0.81z)| norm 0.2845 (-0.34z)| lr 5.82e-04 | 4146.71 ms | 32.6% bf16 MFU | 124488 tok/s step 2786/19560 | loss 3.698087 (-1.36z)| norm 0.3270 (+1.15z)| lr 5.82e-04 | 4140.18 ms | 32.6% bf16 MFU | 124596 tok/s step 2787/19560 | loss 3.690805 (-1.53z)| norm 0.2948 (+0.01z)| lr 5.82e-04 | 4164.08 ms | 32.4% bf16 MFU | 124661 tok/s step 2788/19560 | loss 3.664567 (-2.18z)| norm 0.2952 (+0.02z)| lr 5.82e-04 | 4162.01 ms | 32.4% bf16 MFU | 124727 tok/s step 2789/19560 | loss 3.686690 (-1.57z)| norm 0.3337 (+1.36z)| lr 5.82e-04 | 4166.19 ms | 32.4% bf16 MFU | 124783 tok/s step 2790/19560 | loss 3.688541 (-1.49z)| norm 0.3060 (+0.38z)| lr 5.82e-04 | 4162.35 ms | 32.4% bf16 MFU | 124841 tok/s step 2791/19560 | loss 3.691864 (-1.39z)| norm 0.2788 (-0.59z)| lr 5.82e-04 | 4148.10 ms | 32.5% bf16 MFU | 124919 tok/s step 2792/19560 | loss 3.713899 (-0.81z)| norm 0.2519 (-1.54z)| lr 5.82e-04 | 4158.01 ms | 32.5% bf16 MFU | 124978 tok/s step 2793/19560 | loss 3.726622 (-0.47z)| norm 0.2581 (-1.30z)| lr 5.82e-04 | 4145.48 ms | 32.6% bf16 MFU | 125052 tok/s step 2794/19560 | loss 3.722413 (-0.57z)| norm 0.2802 (-0.52z)| lr 5.82e-04 | 4148.55 ms | 32.5% bf16 MFU | 125119 tok/s step 2795/19560 | loss 3.733320 (-0.28z)| norm 0.2628 (-1.12z)| lr 5.82e-04 | 4168.15 ms | 32.4% bf16 MFU | 125152 tok/s step 2796/19560 | loss 3.755594 (+0.31z)| norm 0.2845 (-0.34z)| lr 5.82e-04 | 4150.42 ms | 32.5% bf16 MFU | 125210 tok/s step 2797/19560 | loss 3.626181 (-2.95z)| norm 0.2760 (-0.66z)| lr 5.82e-04 | 4150.93 ms | 
32.5% bf16 MFU | 125265 tok/s step 2798/19560 | loss 3.670563 (-1.79z)| norm 0.2958 (+0.16z)| lr 5.82e-04 | 4156.78 ms | 32.5% bf16 MFU | 125308 tok/s step 2799/19560 | loss 3.683458 (-1.44z)| norm 0.3129 (+0.92z)| lr 5.82e-04 | 4147.98 ms | 32.6% bf16 MFU | 125363 tok/s step 2800/19560 | loss 3.716855 (-0.60z)| norm 0.3526 (+2.70z)| lr 5.82e-04 | 4149.99 ms | 32.5% bf16 MFU | 125411 tok/s step 2801/19560 | loss 3.792025 (+1.26z)| norm 0.3486 (+2.46z)| lr 5.82e-04 | 4159.79 ms | 32.5% bf16 MFU | 125443 tok/s step 2802/19560 | loss 3.636726 (-2.51z)| norm 0.3304 (+1.64z)| lr 5.82e-04 | 4153.84 ms | 32.5% bf16 MFU | 125481 tok/s step 2803/19560 | loss 3.693110 (-1.13z)| norm 0.3215 (+1.25z)| lr 5.82e-04 | 4153.88 ms | 32.5% bf16 MFU | 125518 tok/s step 2804/19560 | loss 3.770251 (+0.72z)| norm 0.2852 (-0.30z)| lr 5.82e-04 | 4146.41 ms | 32.6% bf16 MFU | 125564 tok/s step 2805/19560 | loss 3.720232 (-0.48z)| norm 0.3000 (+0.32z)| lr 5.82e-04 | 4149.28 ms | 32.5% bf16 MFU | 125604 tok/s step 2806/19560 | loss 3.800433 (+1.44z)| norm 0.3593 (+2.76z)| lr 5.82e-04 | 4149.18 ms | 32.5% bf16 MFU | 125642 tok/s step 2807/19560 | loss 3.702784 (-0.90z)| norm 0.3395 (+1.89z)| lr 5.82e-04 | 4146.69 ms | 32.6% bf16 MFU | 125681 tok/s step 2808/19560 | loss 3.729662 (-0.25z)| norm 0.2910 (-0.13z)| lr 5.82e-04 | 4144.75 ms | 32.6% bf16 MFU | 125722 tok/s step 2809/19560 | loss 3.663352 (-1.82z)| norm 0.2864 (-0.32z)| lr 5.82e-04 | 4151.79 ms | 32.5% bf16 MFU | 125750 tok/s step 2810/19560 | loss 3.704176 (-0.83z)| norm 0.2814 (-0.53z)| lr 5.82e-04 | 4161.87 ms | 32.4% bf16 MFU | 125761 tok/s step 2811/19560 | loss 3.691148 (-1.13z)| norm 0.2541 (-1.65z)| lr 5.82e-04 | 4152.07 ms | 32.5% bf16 MFU | 125787 tok/s step 2812/19560 | loss 3.657157 (-1.90z)| norm 0.2578 (-1.47z)| lr 5.82e-04 | 4151.23 ms | 32.5% bf16 MFU | 125812 tok/s step 2813/19560 | loss 3.679516 (-1.36z)| norm 0.2583 (-1.43z)| lr 5.82e-04 | 4151.82 ms | 32.5% bf16 MFU | 125836 tok/s step 2814/19560 | loss 3.676904 
(-1.40z)| norm 0.2494 (-1.75z)| lr 5.82e-04 | 4150.34 ms | 32.5% bf16 MFU | 125860 tok/s step 2815/19560 | loss 3.666151 (-1.63z)| norm 0.2714 (-0.87z)| lr 5.82e-04 | 4158.46 ms | 32.5% bf16 MFU | 125871 tok/s step 2816/19560 | loss 3.676789 (-1.36z)| norm 0.2878 (-0.22z)| lr 5.82e-04 | 4157.54 ms | 32.5% bf16 MFU | 125883 tok/s step 2817/19560 | loss 3.705776 (-0.68z)| norm 0.2613 (-1.28z)| lr 5.82e-04 | 4159.11 ms | 32.5% bf16 MFU | 125891 tok/s step 2818/19560 | loss 3.673500 (-1.41z)| norm 0.3071 (+0.56z)| lr 5.82e-04 | 4158.86 ms | 32.5% bf16 MFU | 125900 tok/s step 2819/19560 | loss 3.640778 (-2.12z)| norm 0.2462 (-1.86z)| lr 5.82e-04 | 4144.59 ms | 32.6% bf16 MFU | 125930 tok/s step 2820/19560 | loss 3.710467 (-0.51z)| norm 0.2696 (-0.93z)| lr 5.82e-04 | 4151.84 ms | 32.5% bf16 MFU | 125947 tok/s step 2821/19560 | loss 3.738336 (+0.14z)| norm 0.2902 (-0.11z)| lr 5.81e-04 | 4149.06 ms | 32.5% bf16 MFU | 125968 tok/s step 2822/19560 | loss 3.681783 (-1.15z)| norm 0.2763 (-0.65z)| lr 5.81e-04 | 4687.63 ms | 28.8% bf16 MFU | 125262 tok/s step 2823/19560 | loss 3.726648 (-0.12z)| norm 0.2721 (-0.82z)| lr 5.81e-04 | 4148.38 ms | 32.5% bf16 MFU | 125318 tok/s step 2824/19560 | loss 3.683109 (-1.11z)| norm 0.2765 (-0.64z)| lr 5.81e-04 | 4148.19 ms | 32.5% bf16 MFU | 125372 tok/s step 2825/19560 | loss 3.676476 (-1.25z)| norm 0.2704 (-0.87z)| lr 5.81e-04 | 4147.33 ms | 32.6% bf16 MFU | 125424 tok/s step 2826/19560 | loss 3.706776 (-0.55z)| norm 0.2576 (-1.38z)| lr 5.81e-04 | 4158.85 ms | 32.5% bf16 MFU | 125456 tok/s step 2827/19560 | loss 3.782743 (+1.19z)| norm 0.2534 (-1.52z)| lr 5.81e-04 | 4159.73 ms | 32.5% bf16 MFU | 125485 tok/s step 2828/19560 | loss 3.665414 (-1.48z)| norm 0.2648 (-1.06z)| lr 5.81e-04 | 4154.65 ms | 32.5% bf16 MFU | 125521 tok/s step 2829/19560 | loss 3.666249 (-1.43z)| norm 0.2572 (-1.34z)| lr 5.81e-04 | 4148.65 ms | 32.5% bf16 MFU | 125563 tok/s step 2830/19560 | loss 3.691654 (-0.85z)| norm 0.2703 (-0.82z)| lr 5.81e-04 | 4156.87 ms | 
32.5% bf16 MFU | 125591 tok/s step 2831/19560 | loss 3.672364 (-1.28z)| norm 0.2862 (-0.22z)| lr 5.81e-04 | 4150.92 ms | 32.5% bf16 MFU | 125627 tok/s step 2832/19560 | loss 3.762263 (+0.80z)| norm 0.2849 (-0.26z)| lr 5.81e-04 | 4158.61 ms | 32.5% bf16 MFU | 125649 tok/s step 2833/19560 | loss 3.702345 (-0.58z)| norm 0.3304 (+1.50z)| lr 5.81e-04 | 4208.36 ms | 32.1% bf16 MFU | 125596 tok/s step 2834/19560 | loss 3.760050 (+0.78z)| norm 0.3147 (+0.88z)| lr 5.81e-04 | 4151.69 ms | 32.5% bf16 MFU | 125630 tok/s step 2835/19560 | loss 3.748184 (+0.49z)| norm 0.3199 (+1.06z)| lr 5.81e-04 | 4150.85 ms | 32.5% bf16 MFU | 125664 tok/s step 2836/19560 | loss 3.704457 (-0.53z)| norm 0.3383 (+1.74z)| lr 5.81e-04 | 4154.03 ms | 32.5% bf16 MFU | 125692 tok/s step 2837/19560 | loss 3.740572 (+0.33z)| norm 0.3616 (+2.55z)| lr 5.81e-04 | 4156.07 ms | 32.5% bf16 MFU | 125715 tok/s step 2838/19560 | loss 3.778508 (+1.22z)| norm 0.3421 (+1.79z)| lr 5.81e-04 | 4150.49 ms | 32.5% bf16 MFU | 125745 tok/s step 2839/19560 | loss 3.811409 (+1.96z)| norm 0.3050 (+0.39z)| lr 5.81e-04 | 4156.82 ms | 32.5% bf16 MFU | 125764 tok/s step 2840/19560 | loss 3.714733 (-0.28z)| norm 0.2949 (+0.02z)| lr 5.81e-04 | 4161.96 ms | 32.4% bf16 MFU | 125774 tok/s step 2841/19560 | loss 3.672415 (-1.27z)| norm 0.2746 (-0.73z)| lr 5.81e-04 | 4151.88 ms | 32.5% bf16 MFU | 125800 tok/s step 2842/19560 | loss 3.715068 (-0.24z)| norm 0.2977 (+0.14z)| lr 5.81e-04 | 4167.90 ms | 32.4% bf16 MFU | 125799 tok/s step 2843/19560 | loss 3.667111 (-1.37z)| norm 0.2923 (-0.06z)| lr 5.81e-04 | 4143.68 ms | 32.6% bf16 MFU | 125836 tok/s step 2844/19560 | loss 3.724030 (-0.00z)| norm 0.2909 (-0.11z)| lr 5.81e-04 | 4170.31 ms | 32.4% bf16 MFU | 125830 tok/s step 2845/19560 | loss 3.707138 (-0.40z)| norm 0.2796 (-0.52z)| lr 5.81e-04 | 4164.26 ms | 32.4% bf16 MFU | 125833 tok/s step 2846/19560 | loss 3.727016 (+0.07z)| norm 0.3085 (+0.56z)| lr 5.81e-04 | 4146.58 ms | 32.6% bf16 MFU | 125864 tok/s step 2847/19560 | loss 3.643922 
(-1.88z)| norm 0.3338 (+1.49z)| lr 5.81e-04 | 4147.94 ms | 32.6% bf16 MFU | 125890 tok/s step 2848/19560 | loss 3.643949 (-1.84z)| norm 0.2774 (-0.62z)| lr 5.81e-04 | 4147.81 ms | 32.6% bf16 MFU | 125916 tok/s step 2849/19560 | loss 3.703074 (-0.45z)| norm 0.2767 (-0.64z)| lr 5.81e-04 | 4154.24 ms | 32.5% bf16 MFU | 125930 tok/s step 2850/19560 | loss 3.750200 (+0.65z)| norm 0.2977 (+0.14z)| lr 5.81e-04 | 4151.05 ms | 32.5% bf16 MFU | 125949 tok/s step 2851/19560 | loss 3.721381 (-0.02z)| norm 0.2946 (+0.03z)| lr 5.81e-04 | 4146.74 ms | 32.6% bf16 MFU | 125973 tok/s step 2852/19560 | loss 3.676557 (-1.05z)| norm 0.2675 (-0.98z)| lr 5.81e-04 | 4152.28 ms | 32.5% bf16 MFU | 125988 tok/s step 2853/19560 | loss 3.669572 (-1.20z)| norm 0.2850 (-0.32z)| lr 5.81e-04 | 4161.60 ms | 32.4% bf16 MFU | 125987 tok/s step 2854/19560 | loss 3.787117 (+1.55z)| norm 0.3190 (+0.95z)| lr 5.81e-04 | 4167.73 ms | 32.4% bf16 MFU | 125978 tok/s step 2855/19560 | loss 3.741283 (+0.49z)| norm 0.3305 (+1.35z)| lr 5.81e-04 | 4155.32 ms | 32.5% bf16 MFU | 125988 tok/s step 2856/19560 | loss 3.711887 (-0.20z)| norm 0.2701 (-0.87z)| lr 5.81e-04 | 4142.89 ms | 32.6% bf16 MFU | 126016 tok/s step 2857/19560 | loss 3.686819 (-0.78z)| norm 0.2734 (-0.74z)| lr 5.81e-04 | 4149.91 ms | 32.5% bf16 MFU | 126032 tok/s step 2858/19560 | loss 3.744036 (+0.57z)| norm 0.2778 (-0.56z)| lr 5.81e-04 | 4151.87 ms | 32.5% bf16 MFU | 126044 tok/s step 2859/19560 | loss 3.728859 (+0.21z)| norm 0.2718 (-0.78z)| lr 5.81e-04 | 4163.14 ms | 32.4% bf16 MFU | 126039 tok/s step 2860/19560 | loss 3.707164 (-0.29z)| norm 0.2768 (-0.59z)| lr 5.81e-04 | 4145.02 ms | 32.6% bf16 MFU | 126061 tok/s step 2861/19560 | loss 3.664099 (-1.29z)| norm 0.2677 (-0.95z)| lr 5.81e-04 | 4153.01 ms | 32.5% bf16 MFU | 126070 tok/s step 2862/19560 | loss 3.681025 (-0.89z)| norm 0.2514 (-1.56z)| lr 5.81e-04 | 4151.40 ms | 32.5% bf16 MFU | 126081 tok/s step 2863/19560 | loss 3.743517 (+0.65z)| norm 0.2751 (-0.64z)| lr 5.81e-04 | 4143.17 ms | 
32.6% bf16 MFU | 126104 tok/s step 2864/19560 | loss 3.746936 (+0.73z)| norm 0.3192 (+1.07z)| lr 5.81e-04 | 4153.01 ms | 32.5% bf16 MFU | 126111 tok/s step 2865/19560 | loss 3.728667 (+0.28z)| norm 0.3287 (+1.43z)| lr 5.81e-04 | 4170.74 ms | 32.4% bf16 MFU | 126091 tok/s step 2866/19560 | loss 3.643418 (-1.80z)| norm 0.2882 (-0.12z)| lr 5.81e-04 | 4146.71 ms | 32.6% bf16 MFU | 126108 tok/s step 2867/19560 | loss 3.735040 (+0.47z)| norm 0.2881 (-0.12z)| lr 5.81e-04 | 4155.87 ms | 32.5% bf16 MFU | 126111 tok/s step 2868/19560 | loss 3.717445 (+0.03z)| norm 0.2930 (+0.07z)| lr 5.81e-04 | 4170.96 ms | 32.4% bf16 MFU | 126090 tok/s step 2869/19560 | loss 3.712333 (-0.09z)| norm 0.3114 (+0.76z)| lr 5.81e-04 | 4154.57 ms | 32.5% bf16 MFU | 126095 tok/s step 2870/19560 | loss 3.710393 (-0.14z)| norm 0.2824 (-0.38z)| lr 5.81e-04 | 4151.65 ms | 32.5% bf16 MFU | 126105 tok/s step 2871/19560 | loss 3.698633 (-0.42z)| norm 0.2895 (-0.10z)| lr 5.81e-04 | 4150.37 ms | 32.5% bf16 MFU | 126116 tok/s step 2872/19560 | loss 3.705183 (-0.25z)| norm 0.3019 (+0.39z)| lr 5.81e-04 | 4147.28 ms | 32.6% bf16 MFU | 126131 tok/s step 2873/19560 | loss 3.673907 (-1.02z)| norm 0.2671 (-0.97z)| lr 5.81e-04 | 4146.94 ms | 32.6% bf16 MFU | 126146 tok/s step 2874/19560 | loss 3.684531 (-0.74z)| norm 0.2692 (-0.87z)| lr 5.81e-04 | 4160.93 ms | 32.4% bf16 MFU | 126138 tok/s step 2875/19560 | loss 3.669718 (-1.09z)| norm 0.2703 (-0.82z)| lr 5.81e-04 | 4273.75 ms | 31.6% bf16 MFU | 125965 tok/s step 2876/19560 | loss 3.652950 (-1.49z)| norm 0.2605 (-1.20z)| lr 5.81e-04 | 4155.70 ms | 32.5% bf16 MFU | 125975 tok/s step 2877/19560 | loss 3.669156 (-1.08z)| norm 0.2747 (-0.65z)| lr 5.81e-04 | 4166.56 ms | 32.4% bf16 MFU | 125968 tok/s step 2878/19560 | loss 3.742649 (+0.75z)| norm 0.2866 (-0.19z)| lr 5.80e-04 | 4153.84 ms | 32.5% bf16 MFU | 125981 tok/s step 2879/19560 | loss 3.626133 (-2.09z)| norm 0.2573 (-1.32z)| lr 5.80e-04 | 4161.10 ms | 32.4% bf16 MFU | 125981 tok/s step 2880/19560 | loss 3.680289 
(-0.75z)| norm 0.3192 (+1.07z)| lr 5.80e-04 | 4154.74 ms | 32.5% bf16 MFU | 125992 tok/s step 2881/19560 | loss 3.683895 (-0.65z)| norm 0.3541 (+2.34z)| lr 5.80e-04 | 4179.76 ms | 32.3% bf16 MFU | 125964 tok/s step 2882/19560 | loss 3.632660 (-1.87z)| norm 0.3363 (+1.64z)| lr 5.80e-04 | 4152.69 ms | 32.5% bf16 MFU | 125978 tok/s step 2883/19560 | loss 3.766098 (+1.36z)| norm 0.2957 (+0.12z)| lr 5.80e-04 | 4157.15 ms | 32.5% bf16 MFU | 125985 tok/s step 2884/19560 | loss 3.674756 (-0.83z)| norm 0.2899 (-0.10z)| lr 5.80e-04 | 4169.05 ms | 32.4% bf16 MFU | 125974 tok/s step 2885/19560 | loss 3.713647 (+0.12z)| norm 0.2704 (-0.82z)| lr 5.80e-04 | 4161.88 ms | 32.4% bf16 MFU | 125974 tok/s step 2886/19560 | loss 3.730879 (+0.54z)| norm 0.2812 (-0.41z)| lr 5.80e-04 | 4150.67 ms | 32.5% bf16 MFU | 125991 tok/s step 2887/19560 | loss 3.716778 (+0.18z)| norm 0.2629 (-1.08z)| lr 5.80e-04 | 4146.96 ms | 32.6% bf16 MFU | 126013 tok/s step 2888/19560 | loss 3.647857 (-1.48z)| norm 0.2570 (-1.28z)| lr 5.80e-04 | 4153.31 ms | 32.5% bf16 MFU | 126024 tok/s step 2889/19560 | loss 3.699318 (-0.22z)| norm 0.2696 (-0.81z)| lr 5.80e-04 | 4154.29 ms | 32.5% bf16 MFU | 126033 tok/s step 2890/19560 | loss 3.700816 (-0.17z)| norm 0.2856 (-0.21z)| lr 5.80e-04 | 4142.41 ms | 32.6% bf16 MFU | 126059 tok/s step 2891/19560 | loss 3.663370 (-1.07z)| norm 0.3020 (+0.40z)| lr 5.80e-04 | 4153.49 ms | 32.5% bf16 MFU | 126068 tok/s step 2892/19560 | loss 3.676060 (-0.75z)| norm 0.2688 (-0.82z)| lr 5.80e-04 | 4156.47 ms | 32.5% bf16 MFU | 126071 tok/s step 2893/19560 | loss 3.645318 (-1.49z)| norm 0.2491 (-1.53z)| lr 5.80e-04 | 4146.15 ms | 32.6% bf16 MFU | 126090 tok/s step 2894/19560 | loss 3.643101 (-1.51z)| norm 0.2569 (-1.24z)| lr 5.80e-04 | 4154.08 ms | 32.5% bf16 MFU | 126096 tok/s step 2895/19560 | loss 3.670660 (-0.83z)| norm 0.2863 (-0.16z)| lr 5.80e-04 | 4146.13 ms | 32.6% bf16 MFU | 126114 tok/s step 2896/19560 | loss 3.753974 (+1.20z)| norm 0.2703 (-0.74z)| lr 5.80e-04 | 4144.73 ms | 
32.6% bf16 MFU | 126133 tok/s step 2897/19560 | loss 3.696980 (-0.18z)| norm 0.2805 (-0.37z)| lr 5.80e-04 | 4163.99 ms | 32.4% bf16 MFU | 126122 tok/s step 2898/19560 | loss 3.781215 (+1.86z)| norm 0.2943 (+0.13z)| lr 5.80e-04 | 4161.62 ms | 32.4% bf16 MFU | 126115 tok/s step 2899/19560 | loss 3.736779 (+0.85z)| norm 0.2834 (-0.28z)| lr 5.80e-04 | 4142.28 ms | 32.6% bf16 MFU | 126138 tok/s step 2900/19560 | loss 3.687497 (-0.41z)| norm 0.3052 (+0.52z)| lr 5.80e-04 | 4148.60 ms | 32.5% bf16 MFU | 126150 tok/s step 2901/19560 | loss 3.728443 (+0.66z)| norm 0.3113 (+0.76z)| lr 5.80e-04 | 4159.30 ms | 32.5% bf16 MFU | 126145 tok/s step 2902/19560 | loss 3.707522 (+0.12z)| norm 0.3201 (+1.11z)| lr 5.80e-04 | 4154.99 ms | 32.5% bf16 MFU | 126147 tok/s step 2903/19560 | loss 3.659966 (-1.11z)| norm 0.3039 (+0.51z)| lr 5.80e-04 | 4160.16 ms | 32.5% bf16 MFU | 126141 tok/s step 2904/19560 | loss 3.657620 (-1.15z)| norm 0.2828 (-0.29z)| lr 5.80e-04 | 4146.88 ms | 32.6% bf16 MFU | 126155 tok/s step 2905/19560 | loss 3.696865 (-0.13z)| norm 0.2927 (+0.09z)| lr 5.80e-04 | 4157.14 ms | 32.5% bf16 MFU | 126153 tok/s step 2906/19560 | loss 3.713369 (+0.30z)| norm 0.2852 (-0.18z)| lr 5.80e-04 | 4155.47 ms | 32.5% bf16 MFU | 126154 tok/s step 2907/19560 | loss 3.640132 (-1.58z)| norm 0.2773 (-0.48z)| lr 5.80e-04 | 4146.03 ms | 32.6% bf16 MFU | 126169 tok/s step 2908/19560 | loss 3.672695 (-0.74z)| norm 0.2862 (-0.13z)| lr 5.80e-04 | 4147.72 ms | 32.6% bf16 MFU | 126181 tok/s step 2909/19560 | loss 3.724302 (+0.59z)| norm 0.2966 (+0.27z)| lr 5.80e-04 | 4152.61 ms | 32.5% bf16 MFU | 126185 tok/s step 2910/19560 | loss 3.745757 (+1.12z)| norm 0.3064 (+0.64z)| lr 5.80e-04 | 4168.36 ms | 32.4% bf16 MFU | 126164 tok/s step 2911/19560 | loss 3.688841 (-0.34z)| norm 0.2812 (-0.33z)| lr 5.80e-04 | 4146.19 ms | 32.6% bf16 MFU | 126179 tok/s step 2912/19560 | loss 3.752544 (+1.28z)| norm 0.2892 (-0.03z)| lr 5.80e-04 | 4158.17 ms | 32.5% bf16 MFU | 126174 tok/s step 2913/19560 | loss 3.712280 
(+0.25z)| norm 0.2664 (-0.90z)| lr 5.80e-04 | 4152.40 ms | 32.5% bf16 MFU | 126178 tok/s step 2914/19560 | loss 3.700150 (-0.06z)| norm 0.3125 (+0.89z)| lr 5.80e-04 | 4157.78 ms | 32.5% bf16 MFU | 126174 tok/s step 2915/19560 | loss 3.693412 (-0.23z)| norm 0.2957 (+0.24z)| lr 5.80e-04 | 4157.43 ms | 32.5% bf16 MFU | 126171 tok/s step 2916/19560 | loss 3.649379 (-1.35z)| norm 0.3058 (+0.62z)| lr 5.80e-04 | 4154.73 ms | 32.5% bf16 MFU | 126172 tok/s step 2917/19560 | loss 3.644958 (-1.44z)| norm 0.2984 (+0.35z)| lr 5.80e-04 | 4150.15 ms | 32.5% bf16 MFU | 126180 tok/s step 2918/19560 | loss 3.650378 (-1.29z)| norm 0.2730 (-0.63z)| lr 5.80e-04 | 4178.55 ms | 32.3% bf16 MFU | 126144 tok/s step 2919/19560 | loss 3.700123 (-0.04z)| norm 0.2831 (-0.24z)| lr 5.80e-04 | 4156.42 ms | 32.5% bf16 MFU | 126144 tok/s step 2920/19560 | loss 3.724874 (+0.58z)| norm 0.2746 (-0.58z)| lr 5.80e-04 | 4140.59 ms | 32.6% bf16 MFU | 126168 tok/s step 2921/19560 | loss 3.712332 (+0.27z)| norm 0.2743 (-0.60z)| lr 5.80e-04 | 4156.60 ms | 32.5% bf16 MFU | 126166 tok/s step 2922/19560 | loss 3.723520 (+0.55z)| norm 0.2708 (-0.73z)| lr 5.80e-04 | 4144.75 ms | 32.6% bf16 MFU | 126183 tok/s step 2923/19560 | loss 3.678490 (-0.58z)| norm 0.2848 (-0.19z)| lr 5.80e-04 | 4387.75 ms | 30.8% bf16 MFU | 125848 tok/s step 2924/19560 | loss 3.702454 (+0.04z)| norm 0.2912 (+0.06z)| lr 5.80e-04 | 4151.76 ms | 32.5% bf16 MFU | 125870 tok/s step 2925/19560 | loss 3.694163 (-0.19z)| norm 0.2762 (-0.53z)| lr 5.80e-04 | 4183.52 ms | 32.3% bf16 MFU | 125842 tok/s step 2926/19560 | loss 3.659347 (-1.08z)| norm 0.2554 (-1.34z)| lr 5.80e-04 | 4149.32 ms | 32.5% bf16 MFU | 125868 tok/s step 2927/19560 | loss 3.746372 (+1.14z)| norm 0.2686 (-0.80z)| lr 5.80e-04 | 4149.97 ms | 32.5% bf16 MFU | 125891 tok/s step 2928/19560 | loss 3.710639 (+0.23z)| norm 0.2567 (-1.27z)| lr 5.80e-04 | 4152.51 ms | 32.5% bf16 MFU | 125910 tok/s step 2929/19560 | loss 3.707306 (+0.16z)| norm 0.2763 (-0.47z)| lr 5.80e-04 | 4263.84 ms | 
31.7% bf16 MFU | 125762 tok/s step 2930/19560 | loss 3.642853 (-1.53z)| norm 0.2857 (-0.07z)| lr 5.80e-04 | 4162.82 ms | 32.4% bf16 MFU | 125771 tok/s step 2931/19560 | loss 3.662759 (-1.00z)| norm 0.2740 (-0.54z)| lr 5.80e-04 | 4146.81 ms | 32.6% bf16 MFU | 125804 tok/s step 2932/19560 | loss 3.651944 (-1.26z)| norm 0.2552 (-1.31z)| lr 5.80e-04 | 4141.43 ms | 32.6% bf16 MFU | 125844 tok/s step 2933/19560 | loss 3.654423 (-1.18z)| norm 0.2472 (-1.61z)| lr 5.80e-04 | 4149.77 ms | 32.5% bf16 MFU | 125869 tok/s step 2934/19560 | loss 3.630387 (-1.80z)| norm 0.3149 (+1.24z)| lr 5.79e-04 | 4161.69 ms | 32.4% bf16 MFU | 125874 tok/s step 2935/19560 | loss 3.639411 (-1.53z)| norm 0.3442 (+2.47z)| lr 5.79e-04 | 4149.39 ms | 32.5% bf16 MFU | 125898 tok/s step 2936/19560 | loss 3.708302 (+0.28z)| norm 0.3297 (+1.82z)| lr 5.79e-04 | 4150.83 ms | 32.5% bf16 MFU | 125919 tok/s step 2937/19560 | loss 3.670753 (-0.71z)| norm 0.3002 (+0.58z)| lr 5.79e-04 | 4147.32 ms | 32.6% bf16 MFU | 125944 tok/s step 2938/19560 | loss 3.739478 (+1.09z)| norm 0.3199 (+1.38z)| lr 5.79e-04 | 4154.90 ms | 32.5% bf16 MFU | 125956 tok/s step 2939/19560 | loss 3.715414 (+0.46z)| norm 0.3247 (+1.55z)| lr 5.79e-04 | 4138.66 ms | 32.6% bf16 MFU | 125992 tok/s step 2940/19560 | loss 3.715828 (+0.46z)| norm 0.3162 (+1.18z)| lr 5.79e-04 | 4144.75 ms | 32.6% bf16 MFU | 126017 tok/s step 2941/19560 | loss 3.668649 (-0.78z)| norm 0.3398 (+2.11z)| lr 5.79e-04 | 4157.38 ms | 32.5% bf16 MFU | 126022 tok/s step 2942/19560 | loss 3.813974 (+2.91z)| norm 0.3128 (+0.99z)| lr 5.79e-04 | 4266.78 ms | 31.6% bf16 MFU | 125865 tok/s step 2943/19560 | loss 3.697913 (-0.05z)| norm 0.2932 (+0.17z)| lr 5.79e-04 | 4152.64 ms | 32.5% bf16 MFU | 125884 tok/s step 2944/19560 | loss 3.678296 (-0.55z)| norm 0.2988 (+0.40z)| lr 5.79e-04 | 4174.83 ms | 32.3% bf16 MFU | 125869 tok/s step 2945/19560 | loss 3.670141 (-0.75z)| norm 0.3101 (+0.86z)| lr 5.79e-04 | 4143.65 ms | 32.6% bf16 MFU | 125902 tok/s step 2946/19560 | loss 3.662068 
(-0.95z)| norm 0.2862 (-0.13z)| lr 5.79e-04 | 4140.87 ms | 32.6% bf16 MFU | 125937 tok/s step 2947/19560 | loss 3.717019 (+0.44z)| norm 0.2916 (+0.08z)| lr 5.79e-04 | 4153.00 ms | 32.5% bf16 MFU | 125953 tok/s step 2948/19560 | loss 3.692848 (-0.18z)| norm 0.2866 (-0.13z)| lr 5.79e-04 | 4191.11 ms | 32.2% bf16 MFU | 125910 tok/s step 2949/19560 | loss 3.705164 (+0.14z)| norm 0.3033 (+0.57z)| lr 5.79e-04 | 4150.84 ms | 32.5% bf16 MFU | 125930 tok/s step 2950/19560 | loss 3.699661 (-0.00z)| norm 0.3005 (+0.44z)| lr 5.79e-04 | 4140.87 ms | 32.6% bf16 MFU | 125964 tok/s step 2951/19560 | loss 3.742622 (+1.10z)| norm 0.2792 (-0.46z)| lr 5.79e-04 | 4150.17 ms | 32.5% bf16 MFU | 125982 tok/s step 2952/19560 | loss 3.718810 (+0.48z)| norm 0.2716 (-0.78z)| lr 5.79e-04 | 4176.96 ms | 32.3% bf16 MFU | 125959 tok/s step 2953/19560 | loss 3.643510 (-1.44z)| norm 0.3020 (+0.49z)| lr 5.79e-04 | 4212.97 ms | 32.0% bf16 MFU | 125883 tok/s step 2954/19560 | loss 3.703442 (+0.09z)| norm 0.2858 (-0.20z)| lr 5.79e-04 | 4159.37 ms | 32.5% bf16 MFU | 125892 tok/s step 2955/19560 | loss 3.761252 (+1.58z)| norm 0.2739 (-0.72z)| lr 5.79e-04 | 4263.09 ms | 31.7% bf16 MFU | 125746 tok/s step 2956/19560 | loss 3.643876 (-1.42z)| norm 0.2832 (-0.33z)| lr 5.79e-04 | 4631.42 ms | 29.2% bf16 MFU | 125119 tok/s step 2957/19560 | loss 3.636231 (-1.60z)| norm 0.2893 (-0.08z)| lr 5.79e-04 | 4150.29 ms | 32.5% bf16 MFU | 125179 tok/s step 2958/19560 | loss 3.691540 (-0.20z)| norm 0.2845 (-0.29z)| lr 5.79e-04 | 4145.73 ms | 32.6% bf16 MFU | 125244 tok/s step 2959/19560 | loss 3.647076 (-1.31z)| norm 0.2764 (-0.64z)| lr 5.79e-04 | 4145.61 ms | 32.6% bf16 MFU | 125305 tok/s step 2960/19560 | loss 3.697050 (-0.04z)| norm 0.2759 (-0.66z)| lr 5.79e-04 | 4241.05 ms | 31.8% bf16 MFU | 125221 tok/s step 2961/19560 | loss 3.777876 (+1.98z)| norm 0.3029 (+0.53z)| lr 5.79e-04 | 4143.35 ms | 32.6% bf16 MFU | 125287 tok/s step 2962/19560 | loss 3.646023 (-1.32z)| norm 0.3022 (+0.50z)| lr 5.79e-04 | 4140.09 ms | 
32.6% bf16 MFU | 125354 tok/s step 2963/19560 | loss 3.705692 (+0.20z)| norm 0.2925 (+0.08z)| lr 5.79e-04 | 4143.12 ms | 32.6% bf16 MFU | 125414 tok/s step 2964/19560 | loss 3.687098 (-0.27z)| norm 0.3238 (+1.49z)| lr 5.79e-04 | 4158.04 ms | 32.5% bf16 MFU | 125447 tok/s step 2965/19560 | loss 3.669601 (-0.70z)| norm 0.3390 (+2.22z)| lr 5.79e-04 | 4201.57 ms | 32.1% bf16 MFU | 125414 tok/s step 2966/19560 | loss 3.682076 (-0.37z)| norm 0.2779 (-0.55z)| lr 5.79e-04 | 4405.15 ms | 30.6% bf16 MFU | 125094 tok/s step 2967/19560 | loss 3.679373 (-0.43z)| norm 0.3448 (+2.49z)| lr 5.79e-04 | 4197.95 ms | 32.2% bf16 MFU | 125084 tok/s step 2968/19560 | loss 3.739415 (+1.17z)| norm 0.3199 (+1.34z)| lr 5.79e-04 | 4202.76 ms | 32.1% bf16 MFU | 125067 tok/s step 2969/19560 | loss 3.770508 (+1.95z)| norm 0.2963 (+0.27z)| lr 5.79e-04 | 4183.65 ms | 32.3% bf16 MFU | 125080 tok/s step 2970/19560 | loss 3.755328 (+1.53z)| norm 0.2846 (-0.26z)| lr 5.79e-04 | 4161.29 ms | 32.4% bf16 MFU | 125126 tok/s step 2971/19560 | loss 3.729979 (+0.85z)| norm 0.2924 (+0.09z)| lr 5.79e-04 | 5231.44 ms | 25.8% bf16 MFU | 123880 tok/s step 2972/19560 | loss 3.809783 (+2.83z)| norm 0.2717 (-0.83z)| lr 5.79e-04 | 4147.71 ms | 32.6% bf16 MFU | 124006 tok/s step 2973/19560 | loss 3.689957 (-0.20z)| norm 0.2794 (-0.49z)| lr 5.79e-04 | 4159.54 ms | 32.5% bf16 MFU | 124108 tok/s step 2974/19560 | loss 3.692003 (-0.14z)| norm 0.2776 (-0.56z)| lr 5.79e-04 | 4170.92 ms | 32.4% bf16 MFU | 124188 tok/s step 2975/19560 | loss 3.740378 (+1.07z)| norm 0.2509 (-1.74z)| lr 5.79e-04 | 4148.82 ms | 32.5% bf16 MFU | 124297 tok/s step 2976/19560 | loss 3.638681 (-1.51z)| norm 0.2949 (+0.25z)| lr 5.79e-04 | 4140.33 ms | 32.6% bf16 MFU | 124414 tok/s step 2977/19560 | loss 3.680345 (-0.45z)| norm 0.3012 (+0.52z)| lr 5.79e-04 | 4172.05 ms | 32.4% bf16 MFU | 124476 tok/s step 2978/19560 | loss 3.801218 (+2.56z)| norm 0.2660 (-1.06z)| lr 5.79e-04 | 4160.83 ms | 32.4% bf16 MFU | 124553 tok/s step 2979/19560 | loss 3.679888 
(-0.45z)| norm 0.2629 (-1.18z)| lr 5.79e-04 | 4189.47 ms | 32.2% bf16 MFU | 124582 tok/s step 2980/19560 | loss 3.699639 (+0.03z)| norm 0.2734 (-0.71z)| lr 5.79e-04 | 4162.36 ms | 32.4% bf16 MFU | 124651 tok/s step 2981/19560 | loss 3.680284 (-0.45z)| norm 0.2413 (-2.11z)| lr 5.79e-04 | 4153.18 ms | 32.5% bf16 MFU | 124731 tok/s step 2982/19560 | loss 3.729270 (+0.80z)| norm 0.2587 (-1.32z)| lr 5.79e-04 | 4193.61 ms | 32.2% bf16 MFU | 124745 tok/s step 2983/19560 | loss 3.774725 (+1.92z)| norm 0.2414 (-2.04z)| lr 5.79e-04 | 4179.95 ms | 32.3% bf16 MFU | 124779 tok/s step 2984/19560 | loss 3.716283 (+0.45z)| norm 0.2421 (-1.98z)| lr 5.79e-04 | 4162.52 ms | 32.4% bf16 MFU | 124838 tok/s step 2985/19560 | loss 3.678604 (-0.49z)| norm 0.2379 (-2.12z)| lr 5.79e-04 | 4166.77 ms | 32.4% bf16 MFU | 124887 tok/s step 2986/19560 | loss 3.750342 (+1.31z)| norm 0.2577 (-1.25z)| lr 5.79e-04 | 4169.15 ms | 32.4% bf16 MFU | 124931 tok/s step 2987/19560 | loss 3.729815 (+0.79z)| norm 0.2836 (-0.15z)| lr 5.79e-04 | 4150.84 ms | 32.5% bf16 MFU | 125000 tok/s step 2988/19560 | loss 3.705149 (+0.17z)| norm 0.3221 (+1.47z)| lr 5.78e-04 | 4157.56 ms | 32.5% bf16 MFU | 125055 tok/s step 2989/19560 | loss 3.722432 (+0.60z)| norm 0.3111 (+0.99z)| lr 5.78e-04 | 4162.70 ms | 32.4% bf16 MFU | 125100 tok/s step 2990/19560 | loss 3.776172 (+1.90z)| norm 0.2992 (+0.47z)| lr 5.78e-04 | 4175.15 ms | 32.3% bf16 MFU | 125123 tok/s step 2991/19560 | loss 3.713453 (+0.36z)| norm 0.3158 (+1.16z)| lr 5.78e-04 | 4157.63 ms | 32.5% bf16 MFU | 125172 tok/s step 2992/19560 | loss 3.736135 (+0.92z)| norm 0.3336 (+1.90z)| lr 5.78e-04 | 4178.12 ms | 32.3% bf16 MFU | 125188 tok/s step 2993/19560 | loss 3.684166 (-0.36z)| norm 0.3252 (+1.55z)| lr 5.78e-04 | 4161.37 ms | 32.4% bf16 MFU | 125228 tok/s step 2994/19560 | loss 3.738506 (+0.98z)| norm 0.2689 (-0.82z)| lr 5.78e-04 | 4156.22 ms | 32.5% bf16 MFU | 125274 tok/s step 2995/19560 | loss 3.737755 (+0.96z)| norm 0.2701 (-0.77z)| lr 5.78e-04 | 4156.37 ms | 
32.5% bf16 MFU | 125317 tok/s step 2996/19560 | loss 3.710236 (+0.27z)| norm 0.2933 (+0.21z)| lr 5.78e-04 | 4184.67 ms | 32.3% bf16 MFU | 125316 tok/s step 2997/19560 | loss 3.734047 (+0.86z)| norm 0.2842 (-0.16z)| lr 5.78e-04 | 4172.44 ms | 32.4% bf16 MFU | 125333 tok/s step 2998/19560 | loss 3.689672 (-0.24z)| norm 0.2688 (-0.81z)| lr 5.78e-04 | 4179.34 ms | 32.3% bf16 MFU | 125338 tok/s step 2999/19560 | loss 3.694832 (-0.11z)| norm 0.2836 (-0.18z)| lr 5.78e-04 | 4165.74 ms | 32.4% bf16 MFU | 125364 tok/s step 3000/19560 | loss 3.787072 (+2.13z)| norm 0.2771 (-0.45z)| lr 5.78e-04 | 4162.41 ms | 32.4% bf16 MFU | 125394 tok/s val loss 3.697759 evaluating HellaSwag: 0/1256 evaluating HellaSwag: 10/1256 evaluating HellaSwag: 20/1256 evaluating HellaSwag: 30/1256 evaluating HellaSwag: 40/1256 evaluating HellaSwag: 50/1256 evaluating HellaSwag: 60/1256 evaluating HellaSwag: 70/1256 evaluating HellaSwag: 80/1256 evaluating HellaSwag: 90/1256 evaluating HellaSwag: 100/1256 evaluating HellaSwag: 110/1256 evaluating HellaSwag: 120/1256 evaluating HellaSwag: 130/1256 evaluating HellaSwag: 140/1256 evaluating HellaSwag: 150/1256 evaluating HellaSwag: 160/1256 evaluating HellaSwag: 170/1256 evaluating HellaSwag: 180/1256 evaluating HellaSwag: 190/1256 evaluating HellaSwag: 200/1256 evaluating HellaSwag: 210/1256 evaluating HellaSwag: 220/1256 evaluating HellaSwag: 230/1256 evaluating HellaSwag: 240/1256 evaluating HellaSwag: 250/1256 evaluating HellaSwag: 260/1256 evaluating HellaSwag: 270/1256 evaluating HellaSwag: 280/1256 evaluating HellaSwag: 290/1256 evaluating HellaSwag: 300/1256 evaluating HellaSwag: 310/1256 evaluating HellaSwag: 320/1256 evaluating HellaSwag: 330/1256 evaluating HellaSwag: 340/1256 evaluating HellaSwag: 350/1256 evaluating HellaSwag: 360/1256 evaluating HellaSwag: 370/1256 evaluating HellaSwag: 380/1256 evaluating HellaSwag: 390/1256 evaluating HellaSwag: 400/1256 evaluating HellaSwag: 410/1256 evaluating HellaSwag: 420/1256 evaluating HellaSwag: 
430/1256 ... 1060/1256 evaluating HellaSwag: 
1070/1256 evaluating HellaSwag: 1080/1256 evaluating HellaSwag: 1090/1256 evaluating HellaSwag: 1100/1256 evaluating HellaSwag: 1110/1256 evaluating HellaSwag: 1120/1256 evaluating HellaSwag: 1130/1256 evaluating HellaSwag: 1140/1256 evaluating HellaSwag: 1150/1256 evaluating HellaSwag: 1160/1256 evaluating HellaSwag: 1170/1256 evaluating HellaSwag: 1180/1256 evaluating HellaSwag: 1190/1256 evaluating HellaSwag: 1200/1256 evaluating HellaSwag: 1210/1256 evaluating HellaSwag: 1220/1256 evaluating HellaSwag: 1230/1256 evaluating HellaSwag: 1240/1256 evaluating HellaSwag: 1250/1256 HellaSwag: 2676/10042 = 0.266481 step 3001/19560 | loss 3.746683 (+1.13z)| norm 0.2920 (+0.17z)| lr 5.78e-04 | 4157.35 ms | 32.5% bf16 MFU | 125430 tok/s step 3002/19560 | loss 3.703994 (+0.08z)| norm 0.3105 (+0.94z)| lr 5.78e-04 | 4188.72 ms | 32.2% bf16 MFU | 125417 tok/s step 3003/19560 | loss 3.711279 (+0.25z)| norm 0.2955 (+0.30z)| lr 5.78e-04 | 4161.15 ms | 32.4% bf16 MFU | 125446 tok/s step 3004/19560 | loss 3.668239 (-0.81z)| norm 0.2974 (+0.37z)| lr 5.78e-04 | 4170.29 ms | 32.4% bf16 MFU | 125459 tok/s step 3005/19560 | loss 3.693234 (-0.20z)| norm 0.2755 (-0.56z)| lr 5.78e-04 | 4155.23 ms | 32.5% bf16 MFU | 125495 tok/s step 3006/19560 | loss 3.734728 (+0.82z)| norm 0.2998 (+0.47z)| lr 5.78e-04 | 4170.54 ms | 32.4% bf16 MFU | 125506 tok/s step 3007/19560 | loss 3.643012 (-1.45z)| norm 0.2942 (+0.22z)| lr 5.78e-04 | 4170.35 ms | 32.4% bf16 MFU | 125517 tok/s step 3008/19560 | loss 3.673544 (-0.69z)| norm 0.2665 (-0.96z)| lr 5.78e-04 | 4183.22 ms | 32.3% bf16 MFU | 125507 tok/s step 3009/19560 | loss 3.692655 (-0.22z)| norm 0.2513 (-1.61z)| lr 5.78e-04 | 4169.53 ms | 32.4% bf16 MFU | 125519 tok/s step 3010/19560 | loss 3.707956 (+0.15z)| norm 0.2584 (-1.28z)| lr 5.78e-04 | 4180.62 ms | 32.3% bf16 MFU | 125514 tok/s step 3011/19560 | loss 3.701615 (+0.00z)| norm 0.2825 (-0.21z)| lr 5.78e-04 | 4158.64 ms | 32.5% bf16 MFU | 125541 tok/s step 3012/19560 | loss 3.779802 (+1.94z)| norm 
0.2878 (+0.03z)| lr 5.78e-04 | 4185.79 ms | 32.3% bf16 MFU | 125527 tok/s step 3013/19560 | loss 3.640783 (-1.51z)| norm 0.3270 (+1.73z)| lr 5.78e-04 | 4188.74 ms | 32.2% bf16 MFU | 125509 tok/s step 3014/19560 | loss 3.772334 (+1.72z)| norm 0.3712 (+3.47z)| lr 5.78e-04 | 4156.09 ms | 32.5% bf16 MFU | 125541 tok/s step 3015/19560 | loss 3.743761 (+1.01z)| norm 0.4035 (+4.41z)| lr 5.78e-04 | 4169.18 ms | 32.4% bf16 MFU | 125552 tok/s step 3016/19560 | loss 3.696070 (-0.16z)| norm 0.3445 (+2.08z)| lr 5.78e-04 | 4165.95 ms | 32.4% bf16 MFU | 125567 tok/s step 3017/19560 | loss 3.710372 (+0.19z)| norm 0.2827 (-0.29z)| lr 5.78e-04 | 4185.66 ms | 32.3% bf16 MFU | 125551 tok/s step 3018/19560 | loss 3.726554 (+0.58z)| norm 0.2933 (+0.12z)| lr 5.78e-04 | 4248.50 ms | 31.8% bf16 MFU | 125444 tok/s step 3019/19560 | loss 3.789738 (+2.08z)| norm 0.2701 (-0.76z)| lr 5.78e-04 | 4164.03 ms | 32.4% bf16 MFU | 125467 tok/s step 3020/19560 | loss 3.687488 (-0.40z)| norm 0.2699 (-0.77z)| lr 5.78e-04 | 4202.58 ms | 32.1% bf16 MFU | 125431 tok/s step 3021/19560 | loss 3.702846 (-0.04z)| norm 0.2663 (-0.92z)| lr 5.78e-04 | 4155.01 ms | 32.5% bf16 MFU | 125469 tok/s step 3022/19560 | loss 3.685870 (-0.47z)| norm 0.2639 (-1.02z)| lr 5.78e-04 | 4168.84 ms | 32.4% bf16 MFU | 125484 tok/s step 3023/19560 | loss 3.735857 (+0.75z)| norm 0.2766 (-0.52z)| lr 5.78e-04 | 4158.98 ms | 32.5% bf16 MFU | 125513 tok/s step 3024/19560 | loss 3.742589 (+0.92z)| norm 0.2698 (-0.78z)| lr 5.78e-04 | 4175.16 ms | 32.3% bf16 MFU | 125516 tok/s step 3025/19560 | loss 3.702855 (-0.06z)| norm 0.2798 (-0.40z)| lr 5.78e-04 | 4166.36 ms | 32.4% bf16 MFU | 125532 tok/s step 3026/19560 | loss 3.713154 (+0.21z)| norm 0.2616 (-1.09z)| lr 5.78e-04 | 4156.02 ms | 32.5% bf16 MFU | 125563 tok/s step 3027/19560 | loss 3.680406 (-0.60z)| norm 0.2490 (-1.55z)| lr 5.78e-04 | 4162.59 ms | 32.4% bf16 MFU | 125582 tok/s step 3028/19560 | loss 3.745030 (+1.01z)| norm 0.2490 (-1.52z)| lr 5.78e-04 | 4164.41 ms | 32.4% bf16 MFU | 
125598 tok/s step 3029/19560 | loss 3.844062 (+3.32z)| norm 0.2488 (-1.50z)| lr 5.78e-04 | 4164.23 ms | 32.4% bf16 MFU | 125613 tok/s step 3030/19560 | loss 3.751371 (+1.08z)| norm 0.2508 (-1.40z)| lr 5.78e-04 | 4163.80 ms | 32.4% bf16 MFU | 125628 tok/s step 3031/19560 | loss 3.674841 (-0.75z)| norm 0.2536 (-1.27z)| lr 5.78e-04 | 4171.24 ms | 32.4% bf16 MFU | 125632 tok/s step 3032/19560 | loss 3.741548 (+0.83z)| norm 0.2597 (-1.03z)| lr 5.78e-04 | 4165.16 ms | 32.4% bf16 MFU | 125644 tok/s step 3033/19560 | loss 3.709106 (+0.05z)| norm 0.2517 (-1.31z)| lr 5.78e-04 | 4165.01 ms | 32.4% bf16 MFU | 125655 tok/s step 3034/19560 | loss 3.676675 (-0.72z)| norm 0.2634 (-0.87z)| lr 5.78e-04 | 4155.37 ms | 32.5% bf16 MFU | 125681 tok/s step 3035/19560 | loss 3.765329 (+1.39z)| norm 0.3452 (+2.08z)| lr 5.78e-04 | 4158.85 ms | 32.5% bf16 MFU | 125700 tok/s step 3036/19560 | loss 3.682812 (-0.60z)| norm 0.3551 (+2.37z)| lr 5.78e-04 | 4179.88 ms | 32.3% bf16 MFU | 125687 tok/s step 3037/19560 | loss 3.732759 (+0.60z)| norm 0.3276 (+1.38z)| lr 5.78e-04 | 4158.38 ms | 32.5% bf16 MFU | 125707 tok/s step 3038/19560 | loss 3.717760 (+0.25z)| norm 0.3145 (+0.91z)| lr 5.78e-04 | 4152.05 ms | 32.5% bf16 MFU | 125735 tok/s step 3039/19560 | loss 3.727076 (+0.46z)| norm 0.3315 (+1.48z)| lr 5.78e-04 | 4165.03 ms | 32.4% bf16 MFU | 125742 tok/s step 3040/19560 | loss 3.724602 (+0.41z)| norm 0.3302 (+1.41z)| lr 5.78e-04 | 4161.39 ms | 32.4% bf16 MFU | 125754 tok/s step 3041/19560 | loss 3.728942 (+0.51z)| norm 0.3067 (+0.59z)| lr 5.77e-04 | 4165.65 ms | 32.4% bf16 MFU | 125760 tok/s step 3042/19560 | loss 3.701617 (-0.15z)| norm 0.3459 (+1.91z)| lr 5.77e-04 | 4167.87 ms | 32.4% bf16 MFU | 125761 tok/s step 3043/19560 | loss 3.700990 (-0.17z)| norm 0.3081 (+0.62z)| lr 5.77e-04 | 4174.79 ms | 32.3% bf16 MFU | 125752 tok/s step 3044/19560 | loss 3.700108 (-0.20z)| norm 0.3239 (+1.15z)| lr 5.77e-04 | 4155.46 ms | 32.5% bf16 MFU | 125773 tok/s step 3045/19560 | loss 3.691226 (-0.43z)| norm 
0.3092 (+0.65z)| lr 5.77e-04 | 4179.09 ms | 32.3% bf16 MFU | 125757 tok/s step 3046/19560 | loss 3.703900 (-0.13z)| norm 0.2882 (-0.07z)| lr 5.77e-04 | 4151.73 ms | 32.5% bf16 MFU | 125784 tok/s step 3047/19560 | loss 3.732520 (+0.58z)| norm 0.2951 (+0.16z)| lr 5.77e-04 | 4200.85 ms | 32.1% bf16 MFU | 125735 tok/s step 3048/19560 | loss 3.767731 (+1.43z)| norm 0.3313 (+1.37z)| lr 5.77e-04 | 4173.84 ms | 32.3% bf16 MFU | 125729 tok/s step 3049/19560 | loss 3.699188 (-0.25z)| norm 0.3128 (+0.73z)| lr 5.77e-04 | 4165.50 ms | 32.4% bf16 MFU | 125735 tok/s step 3050/19560 | loss 3.678258 (-0.76z)| norm 0.3219 (+1.02z)| lr 5.77e-04 | 4165.76 ms | 32.4% bf16 MFU | 125741 tok/s step 3051/19560 | loss 3.654629 (-1.33z)| norm 0.2829 (-0.29z)| lr 5.77e-04 | 4151.57 ms | 32.5% bf16 MFU | 125769 tok/s step 3052/19560 | loss 3.756562 (+1.15z)| norm 0.2777 (-0.46z)| lr 5.77e-04 | 4159.03 ms | 32.5% bf16 MFU | 125783 tok/s step 3053/19560 | loss 3.673193 (-0.87z)| norm 0.2978 (+0.21z)| lr 5.77e-04 | 4166.13 ms | 32.4% bf16 MFU | 125786 tok/s step 3054/19560 | loss 3.723318 (+0.33z)| norm 0.2553 (-1.21z)| lr 5.77e-04 | 4173.77 ms | 32.3% bf16 MFU | 125778 tok/s step 3055/19560 | loss 3.769218 (+1.44z)| norm 0.2634 (-0.94z)| lr 5.77e-04 | 4165.61 ms | 32.4% bf16 MFU | 125782 tok/s step 3056/19560 | loss 3.696182 (-0.33z)| norm 0.2693 (-0.75z)| lr 5.77e-04 | 4170.06 ms | 32.4% bf16 MFU | 125779 tok/s step 3057/19560 | loss 3.772535 (+1.49z)| norm 0.2758 (-0.53z)| lr 5.77e-04 | 4159.48 ms | 32.5% bf16 MFU | 125793 tok/s step 3058/19560 | loss 3.737481 (+0.64z)| norm 0.2941 (+0.08z)| lr 5.77e-04 | 4170.50 ms | 32.4% bf16 MFU | 125789 tok/s step 3059/19560 | loss 3.725264 (+0.33z)| norm 0.3008 (+0.30z)| lr 5.77e-04 | 4159.95 ms | 32.5% bf16 MFU | 125801 tok/s step 3060/19560 | loss 3.708945 (-0.07z)| norm 0.2822 (-0.33z)| lr 5.77e-04 | 4169.84 ms | 32.4% bf16 MFU | 125797 tok/s step 3061/19560 | loss 3.706614 (-0.14z)| norm 0.2893 (-0.10z)| lr 5.77e-04 | 4229.31 ms | 31.9% bf16 MFU | 
125706 tok/s step 3062/19560 | loss 3.691498 (-0.54z)| norm 0.3092 (+0.58z)| lr 5.77e-04 | 4164.72 ms | 32.4% bf16 MFU | 125715 tok/s step 3063/19560 | loss 3.738605 (+0.63z)| norm 0.2827 (-0.32z)| lr 5.77e-04 | 4166.89 ms | 32.4% bf16 MFU | 125720 tok/s step 3064/19560 | loss 3.684328 (-0.74z)| norm 0.3214 (+1.03z)| lr 5.77e-04 | 4151.91 ms | 32.5% bf16 MFU | 125748 tok/s step 3065/19560 | loss 3.684927 (-0.73z)| norm 0.2838 (-0.27z)| lr 5.77e-04 | 4156.13 ms | 32.5% bf16 MFU | 125768 tok/s step 3066/19560 | loss 3.779815 (+1.66z)| norm 0.2994 (+0.28z)| lr 5.77e-04 | 4167.42 ms | 32.4% bf16 MFU | 125770 tok/s step 3067/19560 | loss 3.677106 (-0.92z)| norm 0.2999 (+0.30z)| lr 5.77e-04 | 4163.18 ms | 32.4% bf16 MFU | 125778 tok/s step 3068/19560 | loss 3.751170 (+0.93z)| norm 0.2843 (-0.23z)| lr 5.77e-04 | 4189.58 ms | 32.2% bf16 MFU | 125746 tok/s step 3069/19560 | loss 3.750244 (+0.90z)| norm 0.2544 (-1.27z)| lr 5.77e-04 | 4161.93 ms | 32.4% bf16 MFU | 125758 tok/s step 3070/19560 | loss 3.730003 (+0.42z)| norm 0.2757 (-0.51z)| lr 5.77e-04 | 4163.52 ms | 32.4% bf16 MFU | 125766 tok/s step 3071/19560 | loss 3.687871 (-0.66z)| norm 0.2949 (+0.17z)| lr 5.77e-04 | 4167.73 ms | 32.4% bf16 MFU | 125767 tok/s step 3072/19560 | loss 3.799169 (+2.14z)| norm 0.2745 (-0.54z)| lr 5.77e-04 | 4164.22 ms | 32.4% bf16 MFU | 125774 tok/s step 3073/19560 | loss 3.690397 (-0.62z)| norm 0.3098 (+0.70z)| lr 5.77e-04 | 4155.00 ms | 32.5% bf16 MFU | 125795 tok/s step 3074/19560 | loss 3.740276 (+0.63z)| norm 0.2829 (-0.25z)| lr 5.77e-04 | 4166.96 ms | 32.4% bf16 MFU | 125796 tok/s step 3075/19560 | loss 3.739488 (+0.61z)| norm 0.2893 (-0.02z)| lr 5.77e-04 | 4146.02 ms | 32.6% bf16 MFU | 125829 tok/s step 3076/19560 | loss 3.751273 (+0.89z)| norm 0.3155 (+0.89z)| lr 5.77e-04 | 4160.17 ms | 32.5% bf16 MFU | 125839 tok/s step 3077/19560 | loss 3.683509 (-0.82z)| norm 0.3155 (+0.89z)| lr 5.77e-04 | 4159.27 ms | 32.5% bf16 MFU | 125849 tok/s step 3078/19560 | loss 3.685209 (-0.77z)| norm 
0.2851 (-0.17z)| lr 5.77e-04 | 4171.18 ms | 32.4% bf16 MFU | 125842 tok/s step 3079/19560 | loss 3.712527 (-0.08z)| norm 0.2740 (-0.56z)| lr 5.77e-04 | 4171.55 ms | 32.4% bf16 MFU | 125834 tok/s step 3080/19560 | loss 3.714642 (-0.02z)| norm 0.3014 (+0.39z)| lr 5.77e-04 | 4154.05 ms | 32.5% bf16 MFU | 125853 tok/s step 3081/19560 | loss 3.733251 (+0.44z)| norm 0.2867 (-0.12z)| lr 5.77e-04 | 4171.98 ms | 32.4% bf16 MFU | 125843 tok/s step 3082/19560 | loss 3.797978 (+2.05z)| norm 0.2860 (-0.15z)| lr 5.77e-04 | 4162.42 ms | 32.4% bf16 MFU | 125849 tok/s step 3083/19560 | loss 3.741791 (+0.63z)| norm 0.2915 (+0.05z)| lr 5.77e-04 | 4160.96 ms | 32.4% bf16 MFU | 125857 tok/s step 3084/19560 | loss 3.715695 (-0.04z)| norm 0.3197 (+1.03z)| lr 5.77e-04 | 4168.00 ms | 32.4% bf16 MFU | 125853 tok/s step 3085/19560 | loss 3.715968 (-0.05z)| norm 0.3138 (+0.81z)| lr 5.77e-04 | 4161.61 ms | 32.4% bf16 MFU | 125860 tok/s step 3086/19560 | loss 3.680279 (-0.98z)| norm 0.3033 (+0.43z)| lr 5.77e-04 | 4151.36 ms | 32.5% bf16 MFU | 125881 tok/s step 3087/19560 | loss 3.745232 (+0.70z)| norm 0.3103 (+0.67z)| lr 5.77e-04 | 4159.52 ms | 32.5% bf16 MFU | 125890 tok/s step 3088/19560 | loss 3.695193 (-0.62z)| norm 0.3008 (+0.33z)| lr 5.77e-04 | 4161.39 ms | 32.4% bf16 MFU | 125895 tok/s step 3089/19560 | loss 3.732340 (+0.37z)| norm 0.2686 (-0.78z)| lr 5.77e-04 | 4162.99 ms | 32.4% bf16 MFU | 125897 tok/s step 3090/19560 | loss 3.732571 (+0.37z)| norm 0.2813 (-0.33z)| lr 5.77e-04 | 4178.58 ms | 32.3% bf16 MFU | 125876 tok/s step 3091/19560 | loss 3.646847 (-1.91z)| norm 0.2588 (-1.11z)| lr 5.77e-04 | 4157.36 ms | 32.5% bf16 MFU | 125887 tok/s step 3092/19560 | loss 3.716993 (-0.05z)| norm 0.2638 (-0.92z)| lr 5.77e-04 | 4155.43 ms | 32.5% bf16 MFU | 125901 tok/s step 3093/19560 | loss 3.689160 (-0.80z)| norm 0.2721 (-0.62z)| lr 5.76e-04 | 4164.16 ms | 32.4% bf16 MFU | 125902 tok/s step 3094/19560 | loss 3.699687 (-0.52z)| norm 0.2677 (-0.77z)| lr 5.76e-04 | 7246.34 ms | 18.6% bf16 MFU | 
123224 tok/s step 3095/19560 | loss 3.669752 (-1.32z)| norm 0.2515 (-1.32z)| lr 5.76e-04 | 4154.56 ms | 32.5% bf16 MFU | 123373 tok/s step 3096/19560 | loss 3.705956 (-0.34z)| norm 0.2709 (-0.62z)| lr 5.76e-04 | 4168.05 ms | 32.4% bf16 MFU | 123493 tok/s step 3097/19560 | loss 3.753075 (+0.93z)| norm 0.2690 (-0.68z)| lr 5.76e-04 | 4158.28 ms | 32.5% bf16 MFU | 123623 tok/s step 3098/19560 | loss 3.641671 (-2.03z)| norm 0.2451 (-1.51z)| lr 5.76e-04 | 4157.55 ms | 32.5% bf16 MFU | 123747 tok/s step 3099/19560 | loss 3.732541 (+0.40z)| norm 0.2783 (-0.33z)| lr 5.76e-04 | 4160.65 ms | 32.5% bf16 MFU | 123860 tok/s step 3100/19560 | loss 3.744522 (+0.74z)| norm 0.2699 (-0.63z)| lr 5.76e-04 | 4150.45 ms | 32.5% bf16 MFU | 123983 tok/s step 3101/19560 | loss 3.709254 (-0.22z)| norm 0.2788 (-0.31z)| lr 5.76e-04 | 4164.07 ms | 32.4% bf16 MFU | 124079 tok/s step 3102/19560 | loss 3.698635 (-0.51z)| norm 0.2664 (-0.75z)| lr 5.76e-04 | 4154.38 ms | 32.5% bf16 MFU | 124186 tok/s step 3103/19560 | loss 3.689614 (-0.75z)| norm 0.2609 (-0.95z)| lr 5.76e-04 | 4173.76 ms | 32.3% bf16 MFU | 124257 tok/s step 3104/19560 | loss 3.692585 (-0.69z)| norm 0.2767 (-0.38z)| lr 5.76e-04 | 4151.86 ms | 32.5% bf16 MFU | 124358 tok/s step 3105/19560 | loss 3.718050 (+0.01z)| norm 0.2975 (+0.35z)| lr 5.76e-04 | 4152.13 ms | 32.5% bf16 MFU | 124454 tok/s step 3106/19560 | loss 3.714039 (-0.09z)| norm 0.2875 (-0.01z)| lr 5.76e-04 | 4171.99 ms | 32.4% bf16 MFU | 124514 tok/s step 3107/19560 | loss 3.702587 (-0.42z)| norm 0.2937 (+0.20z)| lr 5.76e-04 | 4171.25 ms | 32.4% bf16 MFU | 124573 tok/s step 3108/19560 | loss 3.702946 (-0.41z)| norm 0.2591 (-1.01z)| lr 5.76e-04 | 4188.69 ms | 32.2% bf16 MFU | 124603 tok/s step 3109/19560 | loss 3.687993 (-0.84z)| norm 0.2855 (-0.10z)| lr 5.76e-04 | 4214.41 ms | 32.0% bf16 MFU | 124593 tok/s step 3110/19560 | loss 3.737068 (+0.56z)| norm 0.3105 (+0.78z)| lr 5.76e-04 | 4191.85 ms | 32.2% bf16 MFU | 124617 tok/s step 3111/19560 | loss 3.709283 (-0.22z)| norm 
0.3163 (+0.97z)| lr 5.76e-04 | 4167.11 ms | 32.4% bf16 MFU | 124677 tok/s step 3112/19560 | loss 3.687437 (-0.84z)| norm 0.3013 (+0.42z)| lr 5.76e-04 | 4168.77 ms | 32.4% bf16 MFU | 124731 tok/s step 3113/19560 | loss 3.692221 (-0.71z)| norm 0.2886 (-0.06z)| lr 5.76e-04 | 4204.25 ms | 32.1% bf16 MFU | 124730 tok/s step 3114/19560 | loss 3.682552 (-0.97z)| norm 0.3004 (+0.37z)| lr 5.76e-04 | 4167.93 ms | 32.4% bf16 MFU | 124783 tok/s step 3115/19560 | loss 3.767326 (+1.46z)| norm 0.2737 (-0.62z)| lr 5.76e-04 | 4164.02 ms | 32.4% bf16 MFU | 124839 tok/s step 3116/19560 | loss 3.729176 (+0.36z)| norm 0.2630 (-1.00z)| lr 5.76e-04 | 4160.52 ms | 32.5% bf16 MFU | 124898 tok/s step 3117/19560 | loss 3.734562 (+0.51z)| norm 0.2390 (-1.85z)| lr 5.76e-04 | 4161.95 ms | 32.4% bf16 MFU | 124952 tok/s step 3118/19560 | loss 3.691062 (-0.73z)| norm 0.2376 (-1.85z)| lr 5.76e-04 | 4156.86 ms | 32.5% bf16 MFU | 125011 tok/s step 3119/19560 | loss 3.694281 (-0.63z)| norm 0.2449 (-1.56z)| lr 5.76e-04 | 4170.23 ms | 32.4% bf16 MFU | 125046 tok/s step 3120/19560 | loss 3.679435 (-1.04z)| norm 0.2673 (-0.74z)| lr 5.76e-04 | 4163.64 ms | 32.4% bf16 MFU | 125090 tok/s step 3121/19560 | loss 3.684895 (-0.88z)| norm 0.2730 (-0.52z)| lr 5.76e-04 | 4161.60 ms | 32.4% bf16 MFU | 125134 tok/s step 3122/19560 | loss 3.732733 (+0.50z)| norm 0.2769 (-0.39z)| lr 5.76e-04 | 4164.40 ms | 32.4% bf16 MFU | 125173 tok/s step 3123/19560 | loss 3.777109 (+1.75z)| norm 0.2890 (+0.05z)| lr 5.76e-04 | 4171.75 ms | 32.4% bf16 MFU | 125198 tok/s step 3124/19560 | loss 3.766656 (+1.43z)| norm 0.2908 (+0.12z)| lr 5.76e-04 | 4179.47 ms | 32.3% bf16 MFU | 125210 tok/s step 3125/19560 | loss 3.692655 (-0.66z)| norm 0.2833 (-0.16z)| lr 5.76e-04 | 4165.59 ms | 32.4% bf16 MFU | 125243 tok/s step 3126/19560 | loss 3.704526 (-0.33z)| norm 0.3109 (+0.84z)| lr 5.76e-04 | 4212.87 ms | 32.0% bf16 MFU | 125203 tok/s step 3127/19560 | loss 3.696662 (-0.55z)| norm 0.3085 (+0.75z)| lr 5.76e-04 | 4200.11 ms | 32.1% bf16 MFU | 
125184 tok/s step 3128/19560 | loss 3.709254 (-0.18z)| norm 0.3277 (+1.42z)| lr 5.76e-04 | 4161.52 ms | 32.4% bf16 MFU | 125224 tok/s step 3129/19560 | loss 3.715449 (+0.01z)| norm 0.4126 (+4.15z)| lr 5.76e-04 | 4384.16 ms | 30.8% bf16 MFU | 124942 tok/s step 3130/19560 | loss 3.693181 (-0.63z)| norm 0.3568 (+2.22z)| lr 5.76e-04 | 4175.03 ms | 32.3% bf16 MFU | 124974 tok/s step 3131/19560 | loss 3.695754 (-0.55z)| norm 0.3138 (+0.79z)| lr 5.76e-04 | 4163.87 ms | 32.4% bf16 MFU | 125021 tok/s step 3132/19560 | loss 3.683495 (-0.92z)| norm 0.3210 (+1.02z)| lr 5.76e-04 | 4316.04 ms | 31.3% bf16 MFU | 124844 tok/s step 3133/19560 | loss 3.701531 (-0.40z)| norm 0.3040 (+0.45z)| lr 5.76e-04 | 4316.09 ms | 31.3% bf16 MFU | 124675 tok/s step 3134/19560 | loss 3.641174 (-2.09z)| norm 0.2788 (-0.37z)| lr 5.76e-04 | 4169.71 ms | 32.4% bf16 MFU | 124728 tok/s step 3135/19560 | loss 3.683226 (-0.91z)| norm 0.2895 (-0.02z)| lr 5.76e-04 | 4158.98 ms | 32.5% bf16 MFU | 124795 tok/s step 3136/19560 | loss 3.700603 (-0.42z)| norm 0.2754 (-0.49z)| lr 5.76e-04 | 4166.45 ms | 32.4% bf16 MFU | 124847 tok/s step 3137/19560 | loss 3.661654 (-1.53z)| norm 0.2718 (-0.61z)| lr 5.76e-04 | 4295.59 ms | 31.4% bf16 MFU | 124707 tok/s step 3138/19560 | loss 3.696863 (-0.51z)| norm 0.2856 (-0.16z)| lr 5.76e-04 | 4198.80 ms | 32.2% bf16 MFU | 124715 tok/s step 3139/19560 | loss 3.698290 (-0.47z)| norm 0.3205 (+0.98z)| lr 5.76e-04 | 4180.31 ms | 32.3% bf16 MFU | 124750 tok/s step 3140/19560 | loss 3.703465 (-0.31z)| norm 0.2934 (+0.08z)| lr 5.76e-04 | 4173.38 ms | 32.4% bf16 MFU | 124794 tok/s step 3141/19560 | loss 3.683635 (-0.91z)| norm 0.2834 (-0.24z)| lr 5.76e-04 | 4179.06 ms | 32.3% bf16 MFU | 124827 tok/s step 3142/19560 | loss 3.675297 (-1.14z)| norm 0.3170 (+0.92z)| lr 5.76e-04 | 4172.74 ms | 32.4% bf16 MFU | 124868 tok/s step 3143/19560 | loss 3.690238 (-0.68z)| norm 0.3096 (+0.73z)| lr 5.76e-04 | 4415.86 ms | 30.6% bf16 MFU | 124561 tok/s step 3144/19560 | loss 3.734024 (+0.61z)| norm 
0.2839 (-0.19z)| lr 5.76e-04 | 4210.65 ms | 32.1% bf16 MFU | 124559 tok/s step 3145/19560 | loss 3.715917 (+0.07z)| norm 0.2729 (-0.59z)| lr 5.75e-04 | 4162.66 ms | 32.4% bf16 MFU | 124628 tok/s step 3146/19560 | loss 3.679960 (-0.99z)| norm 0.2895 (+0.02z)| lr 5.75e-04 | 4259.54 ms | 31.7% bf16 MFU | 124551 tok/s step 3147/19560 | loss 3.665587 (-1.40z)| norm 0.2699 (-0.70z)| lr 5.75e-04 | 8734.54 ms | 15.5% bf16 MFU | 121325 tok/s step 3148/19560 | loss 3.727273 (+0.45z)| norm 0.2899 (+0.03z)| lr 5.75e-04 | 4132.49 ms | 32.7% bf16 MFU | 121602 tok/s step 3149/19560 | loss 3.665473 (-1.40z)| norm 0.2630 (-0.96z)| lr 5.75e-04 | 4174.32 ms | 32.3% bf16 MFU | 121802 tok/s step 3150/19560 | loss 3.727701 (+0.46z)| norm 0.2566 (-1.19z)| lr 5.75e-04 | 4204.91 ms | 32.1% bf16 MFU | 121946 tok/s step 3151/19560 | loss 3.670253 (-1.25z)| norm 0.2624 (-0.97z)| lr 5.75e-04 | 4166.42 ms | 32.4% bf16 MFU | 122141 tok/s step 3152/19560 | loss 3.624789 (-2.52z)| norm 0.2857 (-0.12z)| lr 5.75e-04 | 4154.16 ms | 32.5% bf16 MFU | 122344 tok/s step 3153/19560 | loss 3.609762 (-2.85z)| norm 0.3215 (+1.18z)| lr 5.75e-04 | 4169.25 ms | 32.4% bf16 MFU | 122514 tok/s step 3154/19560 | loss 3.692978 (-0.49z)| norm 0.3178 (+1.03z)| lr 5.75e-04 | 4175.11 ms | 32.3% bf16 MFU | 122667 tok/s step 3155/19560 | loss 3.694478 (-0.45z)| norm 0.3282 (+1.39z)| lr 5.75e-04 | 4147.51 ms | 32.6% bf16 MFU | 122855 tok/s step 3156/19560 | loss 3.701077 (-0.25z)| norm 0.3298 (+1.42z)| lr 5.75e-04 | 4177.90 ms | 32.3% bf16 MFU | 122986 tok/s step 3157/19560 | loss 3.745464 (+1.09z)| norm 0.2846 (-0.24z)| lr 5.75e-04 | 4196.86 ms | 32.2% bf16 MFU | 123083 tok/s step 3158/19560 | loss 3.705308 (-0.11z)| norm 0.2924 (+0.03z)| lr 5.75e-04 | 4199.28 ms | 32.2% bf16 MFU | 123172 tok/s step 3159/19560 | loss 3.751035 (+1.26z)| norm 0.3075 (+0.58z)| lr 5.75e-04 | 4159.58 ms | 32.5% bf16 MFU | 123315 tok/s step 3160/19560 | loss 3.697226 (-0.36z)| norm 0.2870 (-0.20z)| lr 5.75e-04 | 4176.08 ms | 32.3% bf16 MFU | 
123427 tok/s step 3161/19560 | loss 3.654745 (-1.62z)| norm 0.2650 (-1.04z)| lr 5.75e-04 | 4166.07 ms | 32.4% bf16 MFU | 123548 tok/s step 3162/19560 | loss 3.687810 (-0.63z)| norm 0.2866 (-0.23z)| lr 5.75e-04 | 4231.09 ms | 31.9% bf16 MFU | 123566 tok/s step 3163/19560 | loss 3.738661 (+0.91z)| norm 0.2934 (+0.05z)| lr 5.75e-04 | 4190.69 ms | 32.2% bf16 MFU | 123643 tok/s step 3164/19560 | loss 3.684510 (-0.73z)| norm 0.3524 (+2.35z)| lr 5.75e-04 | 4173.05 ms | 32.4% bf16 MFU | 123743 tok/s step 3165/19560 | loss 3.704219 (-0.12z)| norm 0.3318 (+1.54z)| lr 5.75e-04 | 4163.03 ms | 32.4% bf16 MFU | 123853 tok/s step 3166/19560 | loss 3.740334 (+0.96z)| norm 0.2734 (-0.71z)| lr 5.75e-04 | 4165.53 ms | 32.4% bf16 MFU | 123953 tok/s step 3167/19560 | loss 3.691115 (-0.52z)| norm 0.2712 (-0.78z)| lr 5.75e-04 | 4302.90 ms | 31.4% bf16 MFU | 123848 tok/s step 3168/19560 | loss 3.734648 (+0.80z)| norm 0.2968 (+0.23z)| lr 5.75e-04 | 4189.95 ms | 32.2% bf16 MFU | 123912 tok/s step 3169/19560 | loss 3.649078 (-1.75z)| norm 0.2595 (-1.22z)| lr 5.75e-04 | 4159.16 ms | 32.5% bf16 MFU | 124019 tok/s step 3170/19560 | loss 3.638196 (-2.03z)| norm 0.2718 (-0.73z)| lr 5.75e-04 | 4170.60 ms | 32.4% bf16 MFU | 124104 tok/s step 3171/19560 | loss 3.624889 (-2.35z)| norm 0.2414 (-1.90z)| lr 5.75e-04 | 4245.11 ms | 31.8% bf16 MFU | 124074 tok/s step 3172/19560 | loss 3.650874 (-1.58z)| norm 0.2647 (-0.97z)| lr 5.75e-04 | 4172.40 ms | 32.4% bf16 MFU | 124153 tok/s step 3173/19560 | loss 3.680146 (-0.74z)| norm 0.2854 (-0.14z)| lr 5.75e-04 | 4182.52 ms | 32.3% bf16 MFU | 124213 tok/s step 3174/19560 | loss 3.733323 (+0.77z)| norm 0.2794 (-0.37z)| lr 5.75e-04 | 4173.04 ms | 32.4% bf16 MFU | 124284 tok/s step 3175/19560 | loss 3.716550 (+0.29z)| norm 0.2898 (+0.04z)| lr 5.75e-04 | 4166.82 ms | 32.4% bf16 MFU | 124361 tok/s step 3176/19560 | loss 3.646619 (-1.67z)| norm 0.2537 (-1.37z)| lr 5.75e-04 | 4339.35 ms | 31.1% bf16 MFU | 124184 tok/s step 3177/19560 | loss 3.657785 (-1.33z)| norm 
0.2532 (-1.37z)| lr 5.75e-04 | 4162.96 ms | 32.4% bf16 MFU | 124272 tok/s step 3178/19560 | loss 3.725214 (+0.56z)| norm 0.2577 (-1.17z)| lr 5.75e-04 | 4161.60 ms | 32.4% bf16 MFU | 124357 tok/s step 3179/19560 | loss 3.651136 (-1.53z)| norm 0.2539 (-1.31z)| lr 5.75e-04 | 4153.34 ms | 32.5% bf16 MFU | 124451 tok/s step 3180/19560 | loss 3.645177 (-1.67z)| norm 0.2595 (-1.08z)| lr 5.75e-04 | 4163.21 ms | 32.4% bf16 MFU | 124525 tok/s step 3181/19560 | loss 3.647091 (-1.60z)| norm 0.2743 (-0.49z)| lr 5.75e-04 | 4159.88 ms | 32.5% bf16 MFU | 124601 tok/s step 3182/19560 | loss 3.772449 (+1.87z)| norm 0.2802 (-0.26z)| lr 5.75e-04 | 4175.56 ms | 32.3% bf16 MFU | 124649 tok/s step 3183/19560 | loss 3.729850 (+0.71z)| norm 0.3115 (+0.96z)| lr 5.75e-04 | 4153.06 ms | 32.5% bf16 MFU | 124728 tok/s step 3184/19560 | loss 3.676089 (-0.78z)| norm 0.3292 (+1.63z)| lr 5.75e-04 | 4148.63 ms | 32.5% bf16 MFU | 124811 tok/s step 3185/19560 | loss 3.752483 (+1.36z)| norm 0.3612 (+2.77z)| lr 5.75e-04 | 4166.49 ms | 32.4% bf16 MFU | 124862 tok/s step 3186/19560 | loss 3.704153 (+0.01z)| norm 0.3411 (+1.96z)| lr 5.75e-04 | 4389.38 ms | 30.8% bf16 MFU | 124591 tok/s step 3187/19560 | loss 3.661648 (-1.17z)| norm 0.2807 (-0.30z)| lr 5.75e-04 | 4168.37 ms | 32.4% bf16 MFU | 124650 tok/s step 3188/19560 | loss 3.792588 (+2.43z)| norm 0.3207 (+1.19z)| lr 5.75e-04 | 4176.78 ms | 32.3% bf16 MFU | 124694 tok/s step 3189/19560 | loss 3.725285 (+0.58z)| norm 0.3046 (+0.58z)| lr 5.75e-04 | 4155.28 ms | 32.5% bf16 MFU | 124768 tok/s step 3190/19560 | loss 3.681808 (-0.61z)| norm 0.2720 (-0.62z)| lr 5.75e-04 | 4161.00 ms | 32.4% bf16 MFU | 124830 tok/s step 3191/19560 | loss 3.641127 (-1.68z)| norm 0.2710 (-0.65z)| lr 5.75e-04 | 4473.98 ms | 30.2% bf16 MFU | 124448 tok/s step 3192/19560 | loss 3.625844 (-2.05z)| norm 0.2813 (-0.26z)| lr 5.75e-04 | 4158.57 ms | 32.5% bf16 MFU | 124529 tok/s step 3193/19560 | loss 3.660984 (-1.10z)| norm 0.2843 (-0.15z)| lr 5.75e-04 | 4176.46 ms | 32.3% bf16 MFU | 
124579 tok/s step 3194/19560 | loss 3.662531 (-1.05z)| norm 0.2987 (+0.39z)| lr 5.75e-04 | 4167.74 ms | 32.4% bf16 MFU | 124640 tok/s step 3195/19560 | loss 3.645807 (-1.49z)| norm 0.2841 (-0.15z)| lr 5.74e-04 | 4163.57 ms | 32.4% bf16 MFU | 124704 tok/s step 3196/19560 | loss 3.701216 (+0.01z)| norm 0.2828 (-0.20z)| lr 5.74e-04 | 4144.07 ms | 32.6% bf16 MFU | 124795 tok/s step 3197/19560 | loss 3.706316 (+0.15z)| norm 0.3214 (+1.23z)| lr 5.74e-04 | 4156.81 ms | 32.5% bf16 MFU | 124861 tok/s step 3198/19560 | loss 3.695915 (-0.12z)| norm 0.3495 (+2.22z)| lr 5.74e-04 | 4145.45 ms | 32.6% bf16 MFU | 124942 tok/s step 3199/19560 | loss 3.677554 (-0.62z)| norm 0.2898 (+0.02z)| lr 5.74e-04 | 4153.33 ms | 32.5% bf16 MFU | 125006 tok/s step 3200/19560 | loss 3.760641 (+1.68z)| norm 0.2676 (-0.79z)| lr 5.74e-04 | 4179.21 ms | 32.3% bf16 MFU | 125029 tok/s step 3201/19560 | loss 3.648368 (-1.41z)| norm 0.3052 (+0.59z)| lr 5.74e-04 | 4207.36 ms | 32.1% bf16 MFU | 125008 tok/s step 3202/19560 | loss 3.729063 (+0.81z)| norm 0.2947 (+0.20z)| lr 5.74e-04 | 4164.64 ms | 32.4% bf16 MFU | 125052 tok/s step 3203/19560 | loss 3.678915 (-0.56z)| norm 0.3019 (+0.46z)| lr 5.74e-04 | 4151.41 ms | 32.5% bf16 MFU | 125114 tok/s step 3204/19560 | loss 3.634965 (-1.74z)| norm 0.3181 (+1.05z)| lr 5.74e-04 | 4169.18 ms | 32.4% bf16 MFU | 125146 tok/s step 3205/19560 | loss 3.621120 (-2.07z)| norm 0.2858 (-0.12z)| lr 5.74e-04 | 4144.64 ms | 32.6% bf16 MFU | 125214 tok/s step 3206/19560 | loss 3.636576 (-1.63z)| norm 0.2989 (+0.35z)| lr 5.74e-04 | 4168.17 ms | 32.4% bf16 MFU | 125242 tok/s step 3207/19560 | loss 3.637957 (-1.56z)| norm 0.2711 (-0.66z)| lr 5.74e-04 | 4160.85 ms | 32.4% bf16 MFU | 125280 tok/s step 3208/19560 | loss 3.692665 (-0.10z)| norm 0.2563 (-1.19z)| lr 5.74e-04 | 4146.77 ms | 32.6% bf16 MFU | 125338 tok/s step 3209/19560 | loss 3.641727 (-1.43z)| norm 0.2613 (-1.00z)| lr 5.74e-04 | 4223.37 ms | 32.0% bf16 MFU | 125278 tok/s step 3210/19560 | loss 3.647534 (-1.28z)| norm 
0.2511 (-1.35z)| lr 5.74e-04 | 4193.22 ms | 32.2% bf16 MFU | 125266 tok/s step 3211/19560 | loss 3.690423 (-0.11z)| norm 0.2739 (-0.52z)| lr 5.74e-04 | 4177.62 ms | 32.3% bf16 MFU | 125277 tok/s step 3212/19560 | loss 3.720715 (+0.72z)| norm 0.2794 (-0.31z)| lr 5.74e-04 | 4170.19 ms | 32.4% bf16 MFU | 125300 tok/s step 3213/19560 | loss 3.626612 (-1.80z)| norm 0.2608 (-0.97z)| lr 5.74e-04 | 4163.72 ms | 32.4% bf16 MFU | 125331 tok/s step 3214/19560 | loss 3.683114 (-0.28z)| norm 0.3117 (+0.88z)| lr 5.74e-04 | 4175.65 ms | 32.3% bf16 MFU | 125342 tok/s step 3215/19560 | loss 3.605425 (-2.31z)| norm 0.3229 (+1.27z)| lr 5.74e-04 | 4146.88 ms | 32.6% bf16 MFU | 125396 tok/s step 3216/19560 | loss 3.673503 (-0.50z)| norm 0.3029 (+0.55z)| lr 5.74e-04 | 4164.68 ms | 32.4% bf16 MFU | 125421 tok/s step 3217/19560 | loss 3.673546 (-0.49z)| norm 0.2922 (+0.16z)| lr 5.74e-04 | 4182.44 ms | 32.3% bf16 MFU | 125418 tok/s step 3218/19560 | loss 3.669558 (-0.58z)| norm 0.2800 (-0.28z)| lr 5.74e-04 | 4200.34 ms | 32.1% bf16 MFU | 125388 tok/s step 3219/19560 | loss 3.626305 (-1.72z)| norm 0.2795 (-0.31z)| lr 5.74e-04 | 4158.60 ms | 32.5% bf16 MFU | 125422 tok/s step 3220/19560 | loss 3.674491 (-0.44z)| norm 0.2549 (-1.20z)| lr 5.74e-04 | 4173.40 ms | 32.4% bf16 MFU | 125432 tok/s step 3221/19560 | loss 3.684062 (-0.18z)| norm 0.2794 (-0.31z)| lr 5.74e-04 | 4379.55 ms | 30.8% bf16 MFU | 125146 tok/s step 3222/19560 | loss 3.689009 (-0.05z)| norm 0.2777 (-0.38z)| lr 5.74e-04 | 4161.75 ms | 32.4% bf16 MFU | 125188 tok/s step 3223/19560 | loss 3.689774 (-0.03z)| norm 0.2836 (-0.18z)| lr 5.74e-04 | 4256.47 ms | 31.7% bf16 MFU | 125087 tok/s step 3224/19560 | loss 3.698631 (+0.20z)| norm 0.2690 (-0.71z)| lr 5.74e-04 | 4150.18 ms | 32.5% bf16 MFU | 125149 tok/s step 3225/19560 | loss 3.695108 (+0.12z)| norm 0.2743 (-0.52z)| lr 5.74e-04 | 4187.38 ms | 32.2% bf16 MFU | 125152 tok/s step 3226/19560 | loss 3.726251 (+0.95z)| norm 0.3114 (+0.83z)| lr 5.74e-04 | 4223.27 ms | 32.0% bf16 MFU | 
125102 tok/s step 3227/19560 | loss 3.696258 (+0.15z)| norm 0.3455 (+2.03z)| lr 5.74e-04 | 4261.75 ms | 31.7% bf16 MFU | 124998 tok/s step 3228/19560 | loss 3.687286 (-0.09z)| norm 0.3324 (+1.53z)| lr 5.74e-04 | 4186.61 ms | 32.2% bf16 MFU | 125009 tok/s step 3229/19560 | loss 3.668797 (-0.58z)| norm 0.2784 (-0.41z)| lr 5.74e-04 | 4162.18 ms | 32.4% bf16 MFU | 125057 tok/s step 3230/19560 | loss 3.736063 (+1.24z)| norm 0.2583 (-1.13z)| lr 5.74e-04 | 4189.00 ms | 32.2% bf16 MFU | 125062 tok/s step 3231/19560 | loss 3.663979 (-0.71z)| norm 0.2798 (-0.37z)| lr 5.74e-04 | 4155.87 ms | 32.5% bf16 MFU | 125117 tok/s step 3232/19560 | loss 3.789757 (+2.60z)| norm 0.3052 (+0.54z)| lr 5.74e-04 | 4308.74 ms | 31.3% bf16 MFU | 124945 tok/s step 3233/19560 | loss 3.689303 (-0.04z)| norm 0.3203 (+1.07z)| lr 5.74e-04 | 4377.68 ms | 30.8% bf16 MFU | 124686 tok/s step 3234/19560 | loss 3.697158 (+0.17z)| norm 0.3182 (+0.99z)| lr 5.74e-04 | 4247.36 ms | 31.8% bf16 MFU | 124624 tok/s step 3235/19560 | loss 3.655939 (-0.90z)| norm 0.2823 (-0.30z)| lr 5.74e-04 | 4189.02 ms | 32.2% bf16 MFU | 124650 tok/s step 3236/19560 | loss 3.692737 (+0.07z)| norm 0.2638 (-0.96z)| lr 5.74e-04 | 4174.43 ms | 32.3% bf16 MFU | 124697 tok/s step 3237/19560 | loss 3.653755 (-0.95z)| norm 0.2508 (-1.40z)| lr 5.74e-04 | 4188.31 ms | 32.2% bf16 MFU | 124722 tok/s step 3238/19560 | loss 3.641401 (-1.25z)| norm 0.2589 (-1.10z)| lr 5.74e-04 | 4305.96 ms | 31.4% bf16 MFU | 124573 tok/s step 3239/19560 | loss 3.688343 (-0.02z)| norm 0.2589 (-1.08z)| lr 5.74e-04 | 4271.37 ms | 31.6% bf16 MFU | 124482 tok/s step 3240/19560 | loss 3.681983 (-0.18z)| norm 0.2840 (-0.19z)| lr 5.74e-04 | 4152.40 ms | 32.5% bf16 MFU | 124571 tok/s step 3241/19560 | loss 3.673090 (-0.41z)| norm 0.2661 (-0.81z)| lr 5.74e-04 | 4159.43 ms | 32.5% bf16 MFU | 124645 tok/s step 3242/19560 | loss 3.699919 (+0.29z)| norm 0.2658 (-0.81z)| lr 5.74e-04 | 4171.45 ms | 32.4% bf16 MFU | 124697 tok/s step 3243/19560 | loss 3.635669 (-1.39z)| norm 
0.2768 (-0.43z)| lr 5.74e-04 | 4207.52 ms | 32.1% bf16 MFU | 124692 tok/s step 3244/19560 | loss 3.664725 (-0.60z)| norm 0.2583 (-1.08z)| lr 5.73e-04 | 4162.53 ms | 32.4% bf16 MFU | 124755 tok/s step 3245/19560 | loss 3.709312 (+0.59z)| norm 0.2803 (-0.32z)| lr 5.73e-04 | 4158.74 ms | 32.5% bf16 MFU | 124821 tok/s step 3246/19560 | loss 3.711497 (+0.65z)| norm 0.2889 (-0.02z)| lr 5.73e-04 | 4177.02 ms | 32.3% bf16 MFU | 124856 tok/s step 3247/19560 | loss 3.695371 (+0.21z)| norm 0.2994 (+0.34z)| lr 5.73e-04 | 4183.75 ms | 32.3% bf16 MFU | 124879 tok/s step 3248/19560 | loss 3.682116 (-0.14z)| norm 0.2736 (-0.60z)| lr 5.73e-04 | 4216.34 ms | 32.0% bf16 MFU | 124852 tok/s step 3249/19560 | loss 3.636348 (-1.34z)| norm 0.2934 (+0.12z)| lr 5.73e-04 | 4167.45 ms | 32.4% bf16 MFU | 124900 tok/s step 3250/19560 | loss 3.704329 (+0.47z)| norm 0.2787 (-0.42z)| lr 5.73e-04 | 4240.45 ms | 31.8% bf16 MFU | 124837 tok/s val loss 3.668579 evaluating HellaSwag: 0/1256 evaluating HellaSwag: 10/1256 evaluating HellaSwag: 20/1256 evaluating HellaSwag: 30/1256 evaluating HellaSwag: 40/1256 evaluating HellaSwag: 50/1256 evaluating HellaSwag: 60/1256 evaluating HellaSwag: 70/1256 evaluating HellaSwag: 80/1256 evaluating HellaSwag: 90/1256 evaluating HellaSwag: 100/1256 evaluating HellaSwag: 110/1256 evaluating HellaSwag: 120/1256 evaluating HellaSwag: 130/1256 evaluating HellaSwag: 140/1256 evaluating HellaSwag: 150/1256 evaluating HellaSwag: 160/1256 evaluating HellaSwag: 170/1256 evaluating HellaSwag: 180/1256 evaluating HellaSwag: 190/1256 evaluating HellaSwag: 200/1256 evaluating HellaSwag: 210/1256 evaluating HellaSwag: 220/1256 evaluating HellaSwag: 230/1256 evaluating HellaSwag: 240/1256 evaluating HellaSwag: 250/1256 evaluating HellaSwag: 260/1256 evaluating HellaSwag: 270/1256 evaluating HellaSwag: 280/1256 evaluating HellaSwag: 290/1256 evaluating HellaSwag: 300/1256 evaluating HellaSwag: 310/1256 evaluating HellaSwag: 320/1256 evaluating HellaSwag: 330/1256 evaluating 
HellaSwag: 340/1256 evaluating HellaSwag: 350/1256 evaluating HellaSwag: 360/1256 evaluating HellaSwag: 370/1256 evaluating HellaSwag: 380/1256 evaluating HellaSwag: 390/1256 evaluating HellaSwag: 400/1256 evaluating HellaSwag: 410/1256 evaluating HellaSwag: 420/1256 evaluating HellaSwag: 430/1256 evaluating HellaSwag: 440/1256 evaluating HellaSwag: 450/1256 evaluating HellaSwag: 460/1256 evaluating HellaSwag: 470/1256 evaluating HellaSwag: 480/1256 evaluating HellaSwag: 490/1256 evaluating HellaSwag: 500/1256 evaluating HellaSwag: 510/1256 evaluating HellaSwag: 520/1256 evaluating HellaSwag: 530/1256 evaluating HellaSwag: 540/1256 evaluating HellaSwag: 550/1256 evaluating HellaSwag: 560/1256 evaluating HellaSwag: 570/1256 evaluating HellaSwag: 580/1256 evaluating HellaSwag: 590/1256 evaluating HellaSwag: 600/1256 evaluating HellaSwag: 610/1256 evaluating HellaSwag: 620/1256 evaluating HellaSwag: 630/1256 evaluating HellaSwag: 640/1256 evaluating HellaSwag: 650/1256 evaluating HellaSwag: 660/1256 evaluating HellaSwag: 670/1256 evaluating HellaSwag: 680/1256 evaluating HellaSwag: 690/1256 evaluating HellaSwag: 700/1256 evaluating HellaSwag: 710/1256 evaluating HellaSwag: 720/1256 evaluating HellaSwag: 730/1256 evaluating HellaSwag: 740/1256 evaluating HellaSwag: 750/1256 evaluating HellaSwag: 760/1256 evaluating HellaSwag: 770/1256 evaluating HellaSwag: 780/1256 evaluating HellaSwag: 790/1256 evaluating HellaSwag: 800/1256 evaluating HellaSwag: 810/1256 evaluating HellaSwag: 820/1256 evaluating HellaSwag: 830/1256 evaluating HellaSwag: 840/1256 evaluating HellaSwag: 850/1256 evaluating HellaSwag: 860/1256 evaluating HellaSwag: 870/1256 evaluating HellaSwag: 880/1256 evaluating HellaSwag: 890/1256 evaluating HellaSwag: 900/1256 evaluating HellaSwag: 910/1256 evaluating HellaSwag: 920/1256 evaluating HellaSwag: 930/1256 evaluating HellaSwag: 940/1256 evaluating HellaSwag: 950/1256 evaluating HellaSwag: 960/1256 evaluating HellaSwag: 970/1256 evaluating HellaSwag: 
980/1256 evaluating HellaSwag: 990/1256 evaluating HellaSwag: 1000/1256 evaluating HellaSwag: 1010/1256 evaluating HellaSwag: 1020/1256 evaluating HellaSwag: 1030/1256 evaluating HellaSwag: 1040/1256 evaluating HellaSwag: 1050/1256 evaluating HellaSwag: 1060/1256 evaluating HellaSwag: 1070/1256 evaluating HellaSwag: 1080/1256 evaluating HellaSwag: 1090/1256 evaluating HellaSwag: 1100/1256 evaluating HellaSwag: 1110/1256 evaluating HellaSwag: 1120/1256 evaluating HellaSwag: 1130/1256 evaluating HellaSwag: 1140/1256 evaluating HellaSwag: 1150/1256 evaluating HellaSwag: 1160/1256 evaluating HellaSwag: 1170/1256 evaluating HellaSwag: 1180/1256 evaluating HellaSwag: 1190/1256 evaluating HellaSwag: 1200/1256 evaluating HellaSwag: 1210/1256 evaluating HellaSwag: 1220/1256 evaluating HellaSwag: 1230/1256 evaluating HellaSwag: 1240/1256 evaluating HellaSwag: 1250/1256 HellaSwag: 2692/10042 = 0.268074 step 3251/19560 | loss 3.712543 (+0.72z)| norm 0.2490 (-1.49z)| lr 5.73e-04 | 4317.94 ms | 31.3% bf16 MFU | 124666 tok/s step 3252/19560 | loss 3.718064 (+0.89z)| norm 0.2718 (-0.65z)| lr 5.73e-04 | 4159.29 ms | 32.5% bf16 MFU | 124735 tok/s step 3253/19560 | loss 3.696992 (+0.31z)| norm 0.2626 (-0.98z)| lr 5.73e-04 | 4166.62 ms | 32.4% bf16 MFU | 124790 tok/s step 3254/19560 | loss 3.652910 (-0.90z)| norm 0.2848 (-0.17z)| lr 5.73e-04 | 4172.13 ms | 32.4% bf16 MFU | 124834 tok/s step 3255/19560 | loss 3.645715 (-1.08z)| norm 0.2931 (+0.14z)| lr 5.73e-04 | 4161.12 ms | 32.4% bf16 MFU | 124892 tok/s step 3256/19560 | loss 3.651761 (-0.90z)| norm 0.2596 (-1.06z)| lr 5.73e-04 | 4155.58 ms | 32.5% bf16 MFU | 124956 tok/s step 3257/19560 | loss 3.709968 (+0.70z)| norm 0.2823 (-0.22z)| lr 5.73e-04 | 4164.68 ms | 32.4% bf16 MFU | 125002 tok/s step 3258/19560 | loss 3.749101 (+1.74z)| norm 0.3157 (+1.15z)| lr 5.73e-04 | 4224.16 ms | 32.0% bf16 MFU | 124958 tok/s step 3259/19560 | loss 3.632013 (-1.42z)| norm 0.3422 (+2.19z)| lr 5.73e-04 | 4252.73 ms | 31.7% bf16 MFU | 124874 tok/s step 
3260/19560 | loss 3.632186 (-1.39z)| norm 0.3070 (+0.79z)| lr 5.73e-04 | 4178.51 ms | 32.3% bf16 MFU | 124904 tok/s step 3261/19560 | loss 3.721510 (+0.99z)| norm 0.2908 (+0.13z)| lr 5.73e-04 | 4510.84 ms | 29.9% bf16 MFU | 124470 tok/s step 3262/19560 | loss 3.654681 (-0.80z)| norm 0.2790 (-0.34z)| lr 5.73e-04 | 4178.92 ms | 32.3% bf16 MFU | 124520 tok/s step 3263/19560 | loss 3.663338 (-0.56z)| norm 0.3127 (+1.01z)| lr 5.73e-04 | 4160.47 ms | 32.5% bf16 MFU | 124595 tok/s step 3264/19560 | loss 3.697395 (+0.35z)| norm 0.2935 (+0.23z)| lr 5.73e-04 | 4597.52 ms | 29.4% bf16 MFU | 124067 tok/s step 3265/19560 | loss 3.617382 (-1.76z)| norm 0.2586 (-1.16z)| lr 5.73e-04 | 4177.27 ms | 32.3% bf16 MFU | 124139 tok/s step 3266/19560 | loss 3.723330 (+1.03z)| norm 0.2597 (-1.10z)| lr 5.73e-04 | 4170.06 ms | 32.4% bf16 MFU | 124218 tok/s step 3267/19560 | loss 3.678121 (-0.15z)| norm 0.2976 (+0.42z)| lr 5.73e-04 | 4571.57 ms | 29.5% bf16 MFU | 123742 tok/s step 3268/19560 | loss 3.682435 (-0.04z)| norm 0.3154 (+1.12z)| lr 5.73e-04 | 4371.40 ms | 30.9% bf16 MFU | 123551 tok/s step 3269/19560 | loss 3.670553 (-0.35z)| norm 0.2628 (-0.97z)| lr 5.73e-04 | 4358.67 ms | 31.0% bf16 MFU | 123388 tok/s step 3270/19560 | loss 3.679868 (-0.10z)| norm 0.2804 (-0.26z)| lr 5.73e-04 | 4159.53 ms | 32.5% bf16 MFU | 123521 tok/s step 3271/19560 | loss 3.713947 (+0.79z)| norm 0.2884 (+0.06z)| lr 5.73e-04 | 4163.87 ms | 32.4% bf16 MFU | 123641 tok/s step 3272/19560 | loss 3.699586 (+0.42z)| norm 0.2900 (+0.13z)| lr 5.73e-04 | 4158.26 ms | 32.5% bf16 MFU | 123763 tok/s step 3273/19560 | loss 3.669796 (-0.36z)| norm 0.3100 (+0.91z)| lr 5.73e-04 | 4181.83 ms | 32.3% bf16 MFU | 123843 tok/s step 3274/19560 | loss 3.711289 (+0.73z)| norm 0.2914 (+0.17z)| lr 5.73e-04 | 4174.57 ms | 32.3% bf16 MFU | 123931 tok/s step 3275/19560 | loss 3.791029 (+2.74z)| norm 0.2876 (+0.01z)| lr 5.73e-04 | 4162.46 ms | 32.4% bf16 MFU | 124032 tok/s step 3276/19560 | loss 3.677270 (-0.18z)| norm 0.2909 (+0.15z)| lr 
5.73e-04 | 4166.21 ms | 32.4% bf16 MFU | 124122 tok/s step 3277/19560 | loss 3.776041 (+2.30z)| norm 0.3247 (+1.47z)| lr 5.73e-04 | 4157.90 ms | 32.5% bf16 MFU | 124221 tok/s step 3278/19560 | loss 3.751086 (+1.66z)| norm 0.3426 (+2.13z)| lr 5.73e-04 | 4177.74 ms | 32.3% bf16 MFU | 124285 tok/s step 3279/19560 | loss 3.704165 (+0.47z)| norm 0.3265 (+1.48z)| lr 5.73e-04 | 4181.55 ms | 32.3% bf16 MFU | 124340 tok/s step 3280/19560 | loss 3.713179 (+0.69z)| norm 0.3078 (+0.73z)| lr 5.73e-04 | 4158.32 ms | 32.5% bf16 MFU | 124427 tok/s step 3281/19560 | loss 3.667668 (-0.48z)| norm 0.2849 (-0.15z)| lr 5.73e-04 | 4259.61 ms | 31.7% bf16 MFU | 124360 tok/s step 3282/19560 | loss 3.695237 (+0.22z)| norm 0.2893 (+0.03z)| lr 5.73e-04 | 4173.97 ms | 32.3% bf16 MFU | 124422 tok/s step 3283/19560 | loss 3.650254 (-0.92z)| norm 0.2620 (-1.04z)| lr 5.73e-04 | 4167.87 ms | 32.4% bf16 MFU | 124491 tok/s step 3284/19560 | loss 3.665403 (-0.53z)| norm 0.2765 (-0.45z)| lr 5.73e-04 | 4163.56 ms | 32.4% bf16 MFU | 124562 tok/s step 3285/19560 | loss 3.700986 (+0.40z)| norm 0.2606 (-1.08z)| lr 5.73e-04 | 4153.39 ms | 32.5% bf16 MFU | 124646 tok/s step 3286/19560 | loss 3.766886 (+2.05z)| norm 0.2819 (-0.22z)| lr 5.73e-04 | 4212.23 ms | 32.1% bf16 MFU | 124637 tok/s step 3287/19560 | loss 3.698854 (+0.34z)| norm 0.2756 (-0.46z)| lr 5.73e-04 | 4202.32 ms | 32.1% bf16 MFU | 124643 tok/s step 3288/19560 | loss 3.663600 (-0.56z)| norm 0.2597 (-1.09z)| lr 5.73e-04 | 4177.31 ms | 32.3% bf16 MFU | 124686 tok/s step 3289/19560 | loss 3.762782 (+1.94z)| norm 0.2962 (+0.36z)| lr 5.73e-04 | 4208.14 ms | 32.1% bf16 MFU | 124681 tok/s step 3290/19560 | loss 3.804600 (+2.87z)| norm 0.3269 (+1.56z)| lr 5.73e-04 | 4167.83 ms | 32.4% bf16 MFU | 124737 tok/s step 3291/19560 | loss 3.646049 (-0.99z)| norm 0.3494 (+2.38z)| lr 5.73e-04 | 4250.21 ms | 31.8% bf16 MFU | 124668 tok/s step 3292/19560 | loss 3.712406 (+0.63z)| norm 0.3166 (+1.14z)| lr 5.72e-04 | 4167.47 ms | 32.4% bf16 MFU | 124725 tok/s step 
3293/19560 | loss 3.679428 (-0.17z)| norm 0.2913 (+0.16z)| lr 5.72e-04 | 4171.05 ms | 32.4% bf16 MFU | 124773 tok/s step 3294/19560 | loss 3.622357 (-1.54z)| norm 0.2782 (-0.37z)| lr 5.72e-04 | 4165.38 ms | 32.4% bf16 MFU | 124828 tok/s step 3295/19560 | loss 3.668434 (-0.41z)| norm 0.3298 (+1.66z)| lr 5.72e-04 | 4188.28 ms | 32.2% bf16 MFU | 124846 tok/s step 3296/19560 | loss 3.656194 (-0.70z)| norm 0.2920 (+0.17z)| lr 5.72e-04 | 4182.04 ms | 32.3% bf16 MFU | 124872 tok/s step 3297/19560 | loss 3.646860 (-0.93z)| norm 0.2639 (-0.95z)| lr 5.72e-04 | 4203.88 ms | 32.1% bf16 MFU | 124864 tok/s step 3298/19560 | loss 3.665845 (-0.47z)| norm 0.3096 (+0.85z)| lr 5.72e-04 | 4174.83 ms | 32.3% bf16 MFU | 124900 tok/s step 3299/19560 | loss 3.671191 (-0.35z)| norm 0.2977 (+0.37z)| lr 5.72e-04 | 4164.19 ms | 32.4% bf16 MFU | 124950 tok/s step 3300/19560 | loss 3.684864 (-0.02z)| norm 0.2556 (-1.32z)| lr 5.72e-04 | 4168.38 ms | 32.4% bf16 MFU | 124991 tok/s step 3301/19560 | loss 3.671639 (-0.35z)| norm 0.2626 (-1.03z)| lr 5.72e-04 | 4169.62 ms | 32.4% bf16 MFU | 125029 tok/s step 3302/19560 | loss 3.697562 (+0.31z)| norm 0.2784 (-0.40z)| lr 5.72e-04 | 4240.70 ms | 31.8% bf16 MFU | 124959 tok/s step 3303/19560 | loss 3.701842 (+0.42z)| norm 0.2704 (-0.71z)| lr 5.72e-04 | 4190.91 ms | 32.2% bf16 MFU | 124966 tok/s step 3304/19560 | loss 3.667289 (-0.45z)| norm 0.2861 (-0.09z)| lr 5.72e-04 | 4167.15 ms | 32.4% bf16 MFU | 125009 tok/s step 3305/19560 | loss 3.674665 (-0.27z)| norm 0.2888 (+0.00z)| lr 5.72e-04 | 4168.08 ms | 32.4% bf16 MFU | 125047 tok/s step 3306/19560 | loss 3.632776 (-1.30z)| norm 0.3043 (+0.62z)| lr 5.72e-04 | 4315.40 ms | 31.3% bf16 MFU | 124870 tok/s step 3307/19560 | loss 3.685263 (+0.01z)| norm 0.3102 (+0.85z)| lr 5.72e-04 | 4180.93 ms | 32.3% bf16 MFU | 124896 tok/s step 3308/19560 | loss 3.646877 (-0.96z)| norm 0.2908 (+0.04z)| lr 5.72e-04 | 4424.27 ms | 30.5% bf16 MFU | 124577 tok/s step 3309/19560 | loss 3.643656 (-1.04z)| norm 0.2728 (-0.70z)| lr 
5.72e-04 | 4206.77 ms | 32.1% bf16 MFU | 124579 tok/s step 3310/19560 | loss 3.701694 (+0.44z)| norm 0.2807 (-0.37z)| lr 5.72e-04 | 4169.62 ms | 32.4% bf16 MFU | 124637 tok/s step 3311/19560 | loss 3.747365 (+1.60z)| norm 0.2637 (-1.06z)| lr 5.72e-04 | 4164.58 ms | 32.4% bf16 MFU | 124700 tok/s step 3312/19560 | loss 3.647145 (-0.94z)| norm 0.2581 (-1.27z)| lr 5.72e-04 | 4268.88 ms | 31.6% bf16 MFU | 124606 tok/s step 3313/19560 | loss 3.679567 (-0.11z)| norm 0.2845 (-0.16z)| lr 5.72e-04 | 4167.70 ms | 32.4% bf16 MFU | 124665 tok/s step 3314/19560 | loss 3.710552 (+0.69z)| norm 0.2898 (+0.09z)| lr 5.72e-04 | 4160.27 ms | 32.5% bf16 MFU | 124733 tok/s step 3315/19560 | loss 3.690027 (+0.15z)| norm 0.2878 (-0.00z)| lr 5.72e-04 | 4198.64 ms | 32.2% bf16 MFU | 124740 tok/s step 3316/19560 | loss 3.690481 (+0.19z)| norm 0.2623 (-1.10z)| lr 5.72e-04 | 4197.23 ms | 32.2% bf16 MFU | 124749 tok/s step 3317/19560 | loss 3.656685 (-0.69z)| norm 0.2488 (-1.66z)| lr 5.72e-04 | 4179.05 ms | 32.3% bf16 MFU | 124784 tok/s step 3318/19560 | loss 3.772030 (+2.31z)| norm 0.2614 (-1.11z)| lr 5.72e-04 | 9087.64 ms | 14.9% bf16 MFU | 121430 tok/s step 3319/19560 | loss 3.614630 (-1.77z)| norm 0.2678 (-0.83z)| lr 5.72e-04 | 4159.36 ms | 32.5% bf16 MFU | 121661 tok/s step 3320/19560 | loss 3.614135 (-1.78z)| norm 0.2663 (-0.88z)| lr 5.72e-04 | 4318.23 ms | 31.3% bf16 MFU | 121648 tok/s step 3321/19560 | loss 3.608582 (-1.89z)| norm 0.2607 (-1.11z)| lr 5.72e-04 | 4296.02 ms | 31.4% bf16 MFU | 121668 tok/s step 3322/19560 | loss 3.631127 (-1.30z)| norm 0.2738 (-0.54z)| lr 5.72e-04 | 4155.43 ms | 32.5% bf16 MFU | 121893 tok/s step 3323/19560 | loss 3.672761 (-0.25z)| norm 0.3240 (+1.59z)| lr 5.72e-04 | 4273.94 ms | 31.6% bf16 MFU | 121932 tok/s step 3324/19560 | loss 3.673917 (-0.22z)| norm 0.3054 (+0.79z)| lr 5.72e-04 | 4185.13 ms | 32.3% bf16 MFU | 122099 tok/s step 3325/19560 | loss 3.688529 (+0.16z)| norm 0.2651 (-0.91z)| lr 5.72e-04 | 4217.67 ms | 32.0% bf16 MFU | 122209 tok/s step 
3326/19560 | loss 3.651600 (-0.77z)| norm 0.2886 (+0.12z)| lr 5.72e-04 | 4158.14 ms | 32.5% bf16 MFU | 122403 tok/s step 3327/19560 | loss 3.719631 (+0.95z)| norm 0.2833 (-0.11z)| lr 5.72e-04 | 4155.21 ms | 32.5% bf16 MFU | 122592 tok/s step 3328/19560 | loss 3.657709 (-0.61z)| norm 0.2930 (+0.30z)| lr 5.72e-04 | 4182.77 ms | 32.3% bf16 MFU | 122730 tok/s step 3329/19560 | loss 3.690872 (+0.23z)| norm 0.3124 (+1.15z)| lr 5.72e-04 | 4150.06 ms | 32.5% bf16 MFU | 122910 tok/s step 3330/19560 | loss 3.683364 (+0.05z)| norm 0.3012 (+0.66z)| lr 5.72e-04 | 4160.42 ms | 32.5% bf16 MFU | 123065 tok/s step 3331/19560 | loss 3.704472 (+0.59z)| norm 0.2494 (-1.58z)| lr 5.72e-04 | 4227.74 ms | 31.9% bf16 MFU | 123112 tok/s step 3332/19560 | loss 3.698206 (+0.42z)| norm 0.2745 (-0.48z)| lr 5.72e-04 | 4197.55 ms | 32.2% bf16 MFU | 123202 tok/s step 3333/19560 | loss 3.678323 (-0.11z)| norm 0.2719 (-0.59z)| lr 5.72e-04 | 4361.98 ms | 31.0% bf16 MFU | 123052 tok/s step 3334/19560 | loss 3.620693 (-1.61z)| norm 0.3094 (+1.04z)| lr 5.72e-04 | 4199.73 ms | 32.1% bf16 MFU | 123141 tok/s step 3335/19560 | loss 3.689854 (+0.18z)| norm 0.3524 (+2.81z)| lr 5.72e-04 | 4161.54 ms | 32.4% bf16 MFU | 123283 tok/s step 3336/19560 | loss 3.688743 (+0.16z)| norm 0.3081 (+0.92z)| lr 5.72e-04 | 4409.94 ms | 30.6% bf16 MFU | 123063 tok/s step 3337/19560 | loss 3.717735 (+0.90z)| norm 0.2786 (-0.34z)| lr 5.72e-04 | 4155.64 ms | 32.5% bf16 MFU | 123218 tok/s step 3338/19560 | loss 3.726598 (+1.12z)| norm 0.2830 (-0.17z)| lr 5.72e-04 | 4304.68 ms | 31.4% bf16 MFU | 123147 tok/s step 3339/19560 | loss 3.655692 (-0.74z)| norm 0.2884 (+0.06z)| lr 5.71e-04 | 4158.73 ms | 32.5% bf16 MFU | 123293 tok/s step 3340/19560 | loss 3.706432 (+0.60z)| norm 0.3061 (+0.81z)| lr 5.71e-04 | 4175.10 ms | 32.3% bf16 MFU | 123407 tok/s step 3341/19560 | loss 3.700516 (+0.43z)| norm 0.3082 (+0.89z)| lr 5.71e-04 | 4252.72 ms | 31.7% bf16 MFU | 123401 tok/s step 3342/19560 | loss 3.643816 (-1.06z)| norm 0.3122 (+1.06z)| lr 
5.71e-04 | 4154.85 ms | 32.5% bf16 MFU | 123540 tok/s step 3343/19560 | loss 3.675431 (-0.24z)| norm 0.2799 (-0.32z)| lr 5.71e-04 | 4170.11 ms | 32.4% bf16 MFU | 123650 tok/s step 3344/19560 | loss 3.669698 (-0.40z)| norm 0.2618 (-1.08z)| lr 5.71e-04 | 4151.14 ms | 32.5% bf16 MFU | 123782 tok/s step 3345/19560 | loss 3.673758 (-0.29z)| norm 0.2791 (-0.33z)| lr 5.71e-04 | 4196.49 ms | 32.2% bf16 MFU | 123840 tok/s step 3346/19560 | loss 3.705762 (+0.56z)| norm 0.2843 (-0.11z)| lr 5.71e-04 | 4308.38 ms | 31.3% bf16 MFU | 123732 tok/s step 3347/19560 | loss 3.621149 (-1.70z)| norm 0.2481 (-1.65z)| lr 5.71e-04 | 4157.45 ms | 32.5% bf16 MFU | 123851 tok/s step 3348/19560 | loss 3.775791 (+2.36z)| norm 0.3027 (+0.68z)| lr 5.71e-04 | 4281.42 ms | 31.5% bf16 MFU | 123781 tok/s step 3349/19560 | loss 3.664976 (-0.53z)| norm 0.3345 (+2.00z)| lr 5.71e-04 | 4188.17 ms | 32.2% bf16 MFU | 123851 tok/s step 3350/19560 | loss 3.682186 (-0.08z)| norm 0.3030 (+0.65z)| lr 5.71e-04 | 4158.30 ms | 32.5% bf16 MFU | 123963 tok/s step 3351/19560 | loss 3.733406 (+1.24z)| norm 0.3018 (+0.59z)| lr 5.71e-04 | 4160.98 ms | 32.4% bf16 MFU | 124065 tok/s step 3352/19560 | loss 3.667538 (-0.46z)| norm 0.2803 (-0.32z)| lr 5.71e-04 | 4166.52 ms | 32.4% bf16 MFU | 124153 tok/s step 3353/19560 | loss 3.654823 (-0.78z)| norm 0.2760 (-0.50z)| lr 5.71e-04 | 4164.84 ms | 32.4% bf16 MFU | 124240 tok/s step 3354/19560 | loss 3.664921 (-0.51z)| norm 0.2940 (+0.27z)| lr 5.71e-04 | 4168.85 ms | 32.4% bf16 MFU | 124316 tok/s step 3355/19560 | loss 3.671149 (-0.34z)| norm 0.3058 (+0.80z)| lr 5.71e-04 | 4268.01 ms | 31.6% bf16 MFU | 124242 tok/s step 3356/19560 | loss 3.739722 (+1.42z)| norm 0.3242 (+1.61z)| lr 5.71e-04 | 4173.11 ms | 32.4% bf16 MFU | 124312 tok/s step 3357/19560 | loss 3.655109 (-0.76z)| norm 0.3188 (+1.35z)| lr 5.71e-04 | 4164.30 ms | 32.4% bf16 MFU | 124391 tok/s step 3358/19560 | loss 3.691750 (+0.19z)| norm 0.3248 (+1.59z)| lr 5.71e-04 | 4169.83 ms | 32.4% bf16 MFU | 124458 tok/s step 
3359/19560 | loss 3.662252 (-0.57z)| norm 0.3014 (+0.56z)| lr 5.71e-04 | 4165.24 ms | 32.4% bf16 MFU | 124529 tok/s step 3360/19560 | loss 3.781208 (+2.52z)| norm 0.3274 (+1.67z)| lr 5.71e-04 | 4237.27 ms | 31.9% bf16 MFU | 124489 tok/s step 3361/19560 | loss 3.711912 (+0.71z)| norm 0.3191 (+1.31z)| lr 5.71e-04 | 4151.85 ms | 32.5% bf16 MFU | 124579 tok/s step 3362/19560 | loss 3.708655 (+0.63z)| norm 0.2717 (-0.71z)| lr 5.71e-04 | 4148.72 ms | 32.5% bf16 MFU | 124668 tok/s step 3363/19560 | loss 3.696630 (+0.31z)| norm 0.2765 (-0.50z)| lr 5.71e-04 | 4162.30 ms | 32.4% bf16 MFU | 124733 tok/s step 3364/19560 | loss 3.656838 (-0.72z)| norm 0.2939 (+0.24z)| lr 5.71e-04 | 4150.95 ms | 32.5% bf16 MFU | 124812 tok/s step 3365/19560 | loss 3.702602 (+0.46z)| norm 0.2659 (-0.98z)| lr 5.71e-04 | 4150.64 ms | 32.5% bf16 MFU | 124887 tok/s step 3366/19560 | loss 3.681183 (-0.11z)| norm 0.2879 (-0.03z)| lr 5.71e-04 | 4148.07 ms | 32.5% bf16 MFU | 124962 tok/s step 3367/19560 | loss 3.672590 (-0.33z)| norm 0.2846 (-0.19z)| lr 5.71e-04 | 4163.76 ms | 32.4% bf16 MFU | 125010 tok/s step 3368/19560 | loss 3.686600 (+0.04z)| norm 0.2827 (-0.27z)| lr 5.71e-04 | 4164.61 ms | 32.4% bf16 MFU | 125054 tok/s step 3369/19560 | loss 3.711401 (+0.68z)| norm 0.2686 (-0.90z)| lr 5.71e-04 | 4325.94 ms | 31.2% bf16 MFU | 124861 tok/s step 3370/19560 | loss 3.771755 (+2.19z)| norm 0.2764 (-0.56z)| lr 5.71e-04 | 4160.47 ms | 32.5% bf16 MFU | 124919 tok/s step 3371/19560 | loss 3.633748 (-1.34z)| norm 0.2888 (-0.01z)| lr 5.71e-04 | 4153.49 ms | 32.5% bf16 MFU | 124984 tok/s step 3372/19560 | loss 3.704478 (+0.46z)| norm 0.2748 (-0.64z)| lr 5.71e-04 | 4155.79 ms | 32.5% bf16 MFU | 125043 tok/s step 3373/19560 | loss 3.675426 (-0.27z)| norm 0.2682 (-0.93z)| lr 5.71e-04 | 4191.55 ms | 32.2% bf16 MFU | 125045 tok/s step 3374/19560 | loss 3.773794 (+2.20z)| norm 0.3129 (+1.05z)| lr 5.71e-04 | 4182.05 ms | 32.3% bf16 MFU | 125061 tok/s step 3375/19560 | loss 3.698805 (+0.31z)| norm 0.3276 (+1.67z)| lr 
5.71e-04 | 4155.51 ms | 32.5% bf16 MFU | 125116 tok/s step 3376/19560 | loss 3.671669 (-0.37z)| norm 0.2847 (-0.22z)| lr 5.71e-04 | 4172.32 ms | 32.4% bf16 MFU | 125144 tok/s step 3377/19560 | loss 3.689778 (+0.07z)| norm 0.3064 (+0.73z)| lr 5.71e-04 | 4152.09 ms | 32.5% bf16 MFU | 125200 tok/s step 3378/19560 | loss 3.644316 (-1.06z)| norm 0.2647 (-1.08z)| lr 5.71e-04 | 4207.06 ms | 32.1% bf16 MFU | 125171 tok/s step 3379/19560 | loss 3.692875 (+0.17z)| norm 0.2631 (-1.17z)| lr 5.71e-04 | 4164.19 ms | 32.4% bf16 MFU | 125208 tok/s step 3380/19560 | loss 3.728083 (+1.05z)| norm 0.2537 (-1.56z)| lr 5.71e-04 | 4164.16 ms | 32.4% bf16 MFU | 125242 tok/s step 3381/19560 | loss 3.665158 (-0.53z)| norm 0.2749 (-0.65z)| lr 5.71e-04 | 4165.78 ms | 32.4% bf16 MFU | 125273 tok/s step 3382/19560 | loss 3.706645 (+0.51z)| norm 0.2993 (+0.42z)| lr 5.71e-04 | 4148.57 ms | 32.5% bf16 MFU | 125328 tok/s step 3383/19560 | loss 3.668625 (-0.46z)| norm 0.2972 (+0.32z)| lr 5.71e-04 | 4264.96 ms | 31.7% bf16 MFU | 125208 tok/s step 3384/19560 | loss 3.665008 (-0.55z)| norm 0.3073 (+0.76z)| lr 5.71e-04 | 4153.83 ms | 32.5% bf16 MFU | 125259 tok/s step 3385/19560 | loss 3.674596 (-0.30z)| norm 0.3439 (+2.29z)| lr 5.71e-04 | 4149.48 ms | 32.5% bf16 MFU | 125313 tok/s step 3386/19560 | loss 3.614260 (-1.80z)| norm 0.3372 (+1.97z)| lr 5.70e-04 | 4366.55 ms | 30.9% bf16 MFU | 125051 tok/s step 3387/19560 | loss 3.731953 (+1.16z)| norm 0.2926 (+0.09z)| lr 5.70e-04 | 4160.07 ms | 32.5% bf16 MFU | 125100 tok/s step 3388/19560 | loss 3.751470 (+1.63z)| norm 0.2953 (+0.22z)| lr 5.70e-04 | 4165.52 ms | 32.4% bf16 MFU | 125138 tok/s step 3389/19560 | loss 3.665642 (-0.54z)| norm 0.2714 (-0.81z)| lr 5.70e-04 | 4149.61 ms | 32.5% bf16 MFU | 125199 tok/s step 3390/19560 | loss 3.658293 (-0.72z)| norm 0.2748 (-0.66z)| lr 5.70e-04 | 4161.26 ms | 32.4% bf16 MFU | 125238 tok/s step 3391/19560 | loss 3.646601 (-1.01z)| norm 0.2615 (-1.22z)| lr 5.70e-04 | 4159.96 ms | 32.5% bf16 MFU | 125278 tok/s step 
3392/19560 | loss 3.675876 (-0.27z)| norm 0.2697 (-0.86z)| lr 5.70e-04 | 4159.41 ms | 32.5% bf16 MFU | 125317 tok/s step 3393/19560 | loss 3.647923 (-0.99z)| norm 0.2828 (-0.30z)| lr 5.70e-04 | 4161.19 ms | 32.4% bf16 MFU | 125350 tok/s step 3394/19560 | loss 3.688296 (+0.05z)| norm 0.2842 (-0.25z)| lr 5.70e-04 | 4155.75 ms | 32.5% bf16 MFU | 125391 tok/s step 3395/19560 | loss 3.711718 (+0.64z)| norm 0.2811 (-0.38z)| lr 5.70e-04 | 4163.81 ms | 32.4% bf16 MFU | 125417 tok/s step 3396/19560 | loss 3.662956 (-0.60z)| norm 0.2689 (-0.90z)| lr 5.70e-04 | 4172.35 ms | 32.4% bf16 MFU | 125429 tok/s step 3397/19560 | loss 3.731880 (+1.14z)| norm 0.3176 (+1.21z)| lr 5.70e-04 | 4168.59 ms | 32.4% bf16 MFU | 125446 tok/s step 3398/19560 | loss 3.661270 (-0.65z)| norm 0.3083 (+0.79z)| lr 5.70e-04 | 4160.76 ms | 32.5% bf16 MFU | 125474 tok/s step 3399/19560 | loss 3.629270 (-1.44z)| norm 0.3264 (+1.56z)| lr 5.70e-04 | 4148.09 ms | 32.5% bf16 MFU | 125520 tok/s step 3400/19560 | loss 3.703225 (+0.43z)| norm 0.3037 (+0.57z)| lr 5.70e-04 | 4148.25 ms | 32.5% bf16 MFU | 125564 tok/s step 3401/19560 | loss 3.644407 (-1.05z)| norm 0.2861 (-0.18z)| lr 5.70e-04 | 4155.25 ms | 32.5% bf16 MFU | 125594 tok/s step 3402/19560 | loss 3.640862 (-1.12z)| norm 0.2784 (-0.51z)| lr 5.70e-04 | 4167.25 ms | 32.4% bf16 MFU | 125605 tok/s step 3403/19560 | loss 3.665980 (-0.48z)| norm 0.2653 (-1.06z)| lr 5.70e-04 | 4165.16 ms | 32.4% bf16 MFU | 125619 tok/s step 3404/19560 | loss 3.664660 (-0.51z)| norm 0.2727 (-0.74z)| lr 5.70e-04 | 4157.24 ms | 32.5% bf16 MFU | 125643 tok/s step 3405/19560 | loss 3.638653 (-1.17z)| norm 0.2743 (-0.66z)| lr 5.70e-04 | 4170.76 ms | 32.4% bf16 MFU | 125646 tok/s step 3406/19560 | loss 3.701105 (+0.48z)| norm 0.2872 (-0.08z)| lr 5.70e-04 | 4171.23 ms | 32.4% bf16 MFU | 125649 tok/s step 3407/19560 | loss 3.719724 (+0.97z)| norm 0.2538 (-1.54z)| lr 5.70e-04 | 4148.70 ms | 32.5% bf16 MFU | 125685 tok/s step 3408/19560 | loss 3.629916 (-1.38z)| norm 0.2746 (-0.61z)| lr 
5.70e-04 | 4158.18 ms | 32.5% bf16 MFU | 125705 tok/s step 3409/19560 | loss 3.684319 (+0.05z)| norm 0.2529 (-1.54z)| lr 5.70e-04 | 4152.86 ms | 32.5% bf16 MFU | 125732 tok/s step 3410/19560 | loss 3.665807 (-0.43z)| norm 0.2482 (-1.71z)| lr 5.70e-04 | 4159.86 ms | 32.5% bf16 MFU | 125747 tok/s step 3411/19560 | loss 3.784873 (+2.60z)| norm 0.2558 (-1.38z)| lr 5.70e-04 | 4303.71 ms | 31.4% bf16 MFU | 125551 tok/s step 3412/19560 | loss 3.639986 (-1.10z)| norm 0.2689 (-0.81z)| lr 5.70e-04 | 4155.16 ms | 32.5% bf16 MFU | 125582 tok/s step 3413/19560 | loss 3.670945 (-0.31z)| norm 0.2796 (-0.35z)| lr 5.70e-04 | 4151.39 ms | 32.5% bf16 MFU | 125618 tok/s step 3414/19560 | loss 3.671733 (-0.27z)| norm 0.3119 (+1.03z)| lr 5.70e-04 | 4171.88 ms | 32.4% bf16 MFU | 125621 tok/s step 3415/19560 | loss 3.725255 (+1.11z)| norm 0.3004 (+0.53z)| lr 5.70e-04 | 4179.24 ms | 32.3% bf16 MFU | 125612 tok/s step 3416/19560 | loss 3.666239 (-0.42z)| norm 0.2986 (+0.44z)| lr 5.70e-04 | 4161.01 ms | 32.4% bf16 MFU | 125631 tok/s step 3417/19560 | loss 3.713109 (+0.81z)| norm 0.2717 (-0.72z)| lr 5.70e-04 | 4160.91 ms | 32.4% bf16 MFU | 125650 tok/s step 3418/19560 | loss 3.704654 (+0.64z)| norm 0.3078 (+0.86z)| lr 5.70e-04 | 4155.41 ms | 32.5% bf16 MFU | 125676 tok/s step 3419/19560 | loss 3.605109 (-2.05z)| norm 0.3014 (+0.62z)| lr 5.70e-04 | 4151.52 ms | 32.5% bf16 MFU | 125707 tok/s step 3420/19560 | loss 3.710766 (+0.80z)| norm 0.2641 (-1.05z)| lr 5.70e-04 | 4159.49 ms | 32.5% bf16 MFU | 125724 tok/s step 3421/19560 | loss 3.707739 (+0.72z)| norm 0.2620 (-1.13z)| lr 5.70e-04 | 4172.99 ms | 32.4% bf16 MFU | 125719 tok/s step 3422/19560 | loss 3.640303 (-1.11z)| norm 0.2792 (-0.36z)| lr 5.70e-04 | 4574.48 ms | 29.5% bf16 MFU | 125164 tok/s step 3423/19560 | loss 3.628395 (-1.41z)| norm 0.2636 (-1.04z)| lr 5.70e-04 | 4164.01 ms | 32.4% bf16 MFU | 125201 tok/s step 3424/19560 | loss 3.665140 (-0.43z)| norm 0.2685 (-0.81z)| lr 5.70e-04 | 4175.16 ms | 32.3% bf16 MFU | 125220 tok/s step 
3425/19560 | loss 3.706326 (+0.67z)| norm 0.2619 (-1.11z)| lr 5.70e-04 | 4168.90 ms | 32.4% bf16 MFU | 125247 tok/s step 3426/19560 | loss 3.702610 (+0.56z)| norm 0.2805 (-0.25z)| lr 5.70e-04 | 4165.36 ms | 32.4% bf16 MFU | 125278 tok/s step 3427/19560 | loss 3.603123 (-2.07z)| norm 0.2674 (-0.84z)| lr 5.70e-04 | 4157.03 ms | 32.5% bf16 MFU | 125320 tok/s step 3428/19560 | loss 3.741729 (+1.57z)| norm 0.2667 (-0.88z)| lr 5.70e-04 | 4349.60 ms | 31.0% bf16 MFU | 125081 tok/s step 3429/19560 | loss 3.649374 (-0.84z)| norm 0.2805 (-0.26z)| lr 5.70e-04 | 4170.71 ms | 32.4% bf16 MFU | 125112 tok/s step 3430/19560 | loss 3.662124 (-0.50z)| norm 0.2736 (-0.57z)| lr 5.70e-04 | 4180.75 ms | 32.3% bf16 MFU | 125127 tok/s step 3431/19560 | loss 3.688439 (+0.19z)| norm 0.2992 (+0.59z)| lr 5.70e-04 | 4155.98 ms | 32.5% bf16 MFU | 125178 tok/s step 3432/19560 | loss 3.642716 (-1.00z)| norm 0.2821 (-0.19z)| lr 5.69e-04 | 4198.35 ms | 32.2% bf16 MFU | 125163 tok/s step 3433/19560 | loss 3.659593 (-0.55z)| norm 0.2859 (-0.02z)| lr 5.69e-04 | 4164.38 ms | 32.4% bf16 MFU | 125200 tok/s step 3434/19560 | loss 3.637637 (-1.13z)| norm 0.3106 (+1.11z)| lr 5.69e-04 | 4155.46 ms | 32.5% bf16 MFU | 125248 tok/s step 3435/19560 | loss 3.699319 (+0.48z)| norm 0.3268 (+1.83z)| lr 5.69e-04 | 4165.59 ms | 32.4% bf16 MFU | 125279 tok/s step 3436/19560 | loss 3.664321 (-0.44z)| norm 0.2548 (-1.41z)| lr 5.69e-04 | 4155.58 ms | 32.5% bf16 MFU | 125323 tok/s step 3437/19560 | loss 3.703808 (+0.58z)| norm 0.2713 (-0.67z)| lr 5.69e-04 | 4179.37 ms | 32.3% bf16 MFU | 125330 tok/s step 3438/19560 | loss 3.663414 (-0.47z)| norm 0.2769 (-0.41z)| lr 5.69e-04 | 4164.10 ms | 32.4% bf16 MFU | 125358 tok/s step 3439/19560 | loss 3.664258 (-0.44z)| norm 0.3148 (+1.27z)| lr 5.69e-04 | 4149.37 ms | 32.5% bf16 MFU | 125408 tok/s step 3440/19560 | loss 3.646428 (-0.91z)| norm 0.2958 (+0.40z)| lr 5.69e-04 | 4159.24 ms | 32.5% bf16 MFU | 125440 tok/s step 3441/19560 | loss 3.637790 (-1.12z)| norm 0.2712 (-0.70z)| lr 
5.69e-04 | 4154.94 ms | 32.5% bf16 MFU | 125478 tok/s step 3442/19560 | loss 3.772304 (+2.36z)| norm 0.3096 (+1.02z)| lr 5.69e-04 | 4151.99 ms | 32.5% bf16 MFU | 125517 tok/s step 3443/19560 | loss 3.616051 (-1.64z)| norm 0.2967 (+0.44z)| lr 5.69e-04 | 4180.87 ms | 32.3% bf16 MFU | 125512 tok/s step 3444/19560 | loss 3.739656 (+1.50z)| norm 0.2875 (+0.01z)| lr 5.69e-04 | 4168.52 ms | 32.4% bf16 MFU | 125525 tok/s step 3445/19560 | loss 3.646826 (-0.85z)| norm 0.2768 (-0.48z)| lr 5.69e-04 | 4155.31 ms | 32.5% bf16 MFU | 125557 tok/s step 3446/19560 | loss 3.751730 (+1.82z)| norm 0.2684 (-0.86z)| lr 5.69e-04 | 4155.41 ms | 32.5% bf16 MFU | 125588 tok/s step 3447/19560 | loss 3.676338 (-0.12z)| norm 0.3481 (+2.66z)| lr 5.69e-04 | 4291.89 ms | 31.5% bf16 MFU | 125416 tok/s step 3448/19560 | loss 3.684416 (+0.08z)| norm 0.2926 (+0.20z)| lr 5.69e-04 | 4181.56 ms | 32.3% bf16 MFU | 125415 tok/s step 3449/19560 | loss 3.723340 (+1.08z)| norm 0.3133 (+1.10z)| lr 5.69e-04 | 4162.55 ms | 32.4% bf16 MFU | 125441 tok/s step 3450/19560 | loss 3.675274 (-0.20z)| norm 0.3224 (+1.48z)| lr 5.69e-04 | 4163.34 ms | 32.4% bf16 MFU | 125466 tok/s step 3451/19560 | loss 3.684096 (+0.04z)| norm 0.3407 (+2.26z)| lr 5.69e-04 | 4164.71 ms | 32.4% bf16 MFU | 125487 tok/s step 3452/19560 | loss 3.676480 (-0.17z)| norm 0.3982 (+4.38z)| lr 5.69e-04 | 4154.01 ms | 32.5% bf16 MFU | 125523 tok/s step 3453/19560 | loss 3.708295 (+0.67z)| norm 0.3916 (+3.84z)| lr 5.69e-04 | 4163.96 ms | 32.4% bf16 MFU | 125543 tok/s step 3454/19560 | loss 3.639575 (-1.15z)| norm 0.3205 (+1.12z)| lr 5.69e-04 | 4151.31 ms | 32.5% bf16 MFU | 125580 tok/s step 3455/19560 | loss 3.683450 (+0.02z)| norm 0.3213 (+1.13z)| lr 5.69e-04 | 4153.07 ms | 32.5% bf16 MFU | 125613 tok/s step 3456/19560 | loss 3.679411 (-0.09z)| norm 0.3077 (+0.61z)| lr 5.69e-04 | 4152.51 ms | 32.5% bf16 MFU | 125646 tok/s step 3457/19560 | loss 3.673316 (-0.25z)| norm 0.2792 (-0.46z)| lr 5.69e-04 | 4153.79 ms | 32.5% bf16 MFU | 125674 tok/s step 
3458/19560 | loss 3.677404 (-0.14z)| norm 0.2914 (+0.01z)| lr 5.69e-04 | 4171.24 ms | 32.4% bf16 MFU | 125675 tok/s step 3459/19560 | loss 3.661893 (-0.54z)| norm 0.2734 (-0.69z)| lr 5.69e-04 | 4162.00 ms | 32.4% bf16 MFU | 125690 tok/s step 3460/19560 | loss 3.757466 (+1.96z)| norm 0.2582 (-1.25z)| lr 5.69e-04 | 4152.71 ms | 32.5% bf16 MFU | 125718 tok/s step 3461/19560 | loss 3.695626 (+0.34z)| norm 0.2781 (-0.50z)| lr 5.69e-04 | 4156.88 ms | 32.5% bf16 MFU | 125738 tok/s step 3462/19560 | loss 3.691946 (+0.23z)| norm 0.2888 (-0.09z)| lr 5.69e-04 | 4205.22 ms | 32.1% bf16 MFU | 125685 tok/s step 3463/19560 | loss 3.700962 (+0.46z)| norm 0.2702 (-0.79z)| lr 5.69e-04 | 4200.19 ms | 32.1% bf16 MFU | 125642 tok/s step 3464/19560 | loss 3.627460 (-1.46z)| norm 0.2427 (-1.82z)| lr 5.69e-04 | 4157.94 ms | 32.5% bf16 MFU | 125665 tok/s step 3465/19560 | loss 3.650651 (-0.84z)| norm 0.3213 (+1.18z)| lr 5.69e-04 | 4151.93 ms | 32.5% bf16 MFU | 125695 tok/s step 3466/19560 | loss 3.656234 (-0.68z)| norm 0.2776 (-0.48z)| lr 5.69e-04 | 4173.93 ms | 32.3% bf16 MFU | 125691 tok/s step 3467/19560 | loss 3.664885 (-0.45z)| norm 0.2737 (-0.63z)| lr 5.69e-04 | 4279.41 ms | 31.6% bf16 MFU | 125532 tok/s step 3468/19560 | loss 3.648989 (-0.86z)| norm 0.2843 (-0.22z)| lr 5.69e-04 | 4239.71 ms | 31.8% bf16 MFU | 125439 tok/s step 3469/19560 | loss 3.684690 (+0.09z)| norm 0.2818 (-0.30z)| lr 5.69e-04 | 4172.30 ms | 32.4% bf16 MFU | 125450 tok/s step 3470/19560 | loss 3.656326 (-0.67z)| norm 0.2590 (-1.16z)| lr 5.69e-04 | 4243.88 ms | 31.8% bf16 MFU | 125354 tok/s step 3471/19560 | loss 3.666071 (-0.41z)| norm 0.2763 (-0.50z)| lr 5.69e-04 | 4163.48 ms | 32.4% bf16 MFU | 125383 tok/s step 3472/19560 | loss 3.672675 (-0.23z)| norm 0.2918 (+0.09z)| lr 5.69e-04 | 4272.52 ms | 31.6% bf16 MFU | 125249 tok/s step 3473/19560 | loss 3.646777 (-0.91z)| norm 0.2928 (+0.12z)| lr 5.69e-04 | 4159.99 ms | 32.5% bf16 MFU | 125288 tok/s step 3474/19560 | loss 3.653377 (-0.72z)| norm 0.2956 (+0.22z)| lr 
5.69e-04 | 4250.65 ms | 31.8% bf16 MFU | 125191 tok/s step 3475/19560 | loss 3.664855 (-0.43z)| norm 0.2629 (-1.04z)| lr 5.69e-04 | 4158.02 ms | 32.5% bf16 MFU | 125236 tok/s step 3476/19560 | loss 3.787298 (+2.80z)| norm 0.2882 (-0.06z)| lr 5.69e-04 | 4150.66 ms | 32.5% bf16 MFU | 125290 tok/s step 3477/19560 | loss 3.677261 (-0.11z)| norm 0.2680 (-0.83z)| lr 5.68e-04 | 4154.90 ms | 32.5% bf16 MFU | 125335 tok/s step 3478/19560 | loss 3.681232 (-0.00z)| norm 0.2727 (-0.63z)| lr 5.68e-04 | 4155.58 ms | 32.5% bf16 MFU | 125376 tok/s step 3479/19560 | loss 3.678582 (-0.06z)| norm 0.2808 (-0.31z)| lr 5.68e-04 | 4250.24 ms | 31.8% bf16 MFU | 125275 tok/s step 3480/19560 | loss 3.699310 (+0.48z)| norm 0.2663 (-0.87z)| lr 5.68e-04 | 4300.40 ms | 31.4% bf16 MFU | 125107 tok/s step 3481/19560 | loss 3.781119 (+2.57z)| norm 0.3076 (+0.72z)| lr 5.68e-04 | 4160.51 ms | 32.5% bf16 MFU | 125153 tok/s step 3482/19560 | loss 3.670031 (-0.32z)| norm 0.3144 (+0.98z)| lr 5.68e-04 | 4149.00 ms | 32.5% bf16 MFU | 125213 tok/s step 3483/19560 | loss 3.734646 (+1.34z)| norm 0.2813 (-0.30z)| lr 5.68e-04 | 4212.30 ms | 32.1% bf16 MFU | 125176 tok/s step 3484/19560 | loss 3.655849 (-0.68z)| norm 0.2968 (+0.31z)| lr 5.68e-04 | 4159.26 ms | 32.5% bf16 MFU | 125220 tok/s step 3485/19560 | loss 3.696352 (+0.36z)| norm 0.2771 (-0.44z)| lr 5.68e-04 | 4151.95 ms | 32.5% bf16 MFU | 125272 tok/s step 3486/19560 | loss 3.637177 (-1.16z)| norm 0.2904 (+0.09z)| lr 5.68e-04 | 4151.63 ms | 32.5% bf16 MFU | 125323 tok/s step 3487/19560 | loss 3.665716 (-0.42z)| norm 0.2969 (+0.35z)| lr 5.68e-04 | 4151.49 ms | 32.5% bf16 MFU | 125371 tok/s step 3488/19560 | loss 3.660674 (-0.54z)| norm 0.2673 (-0.81z)| lr 5.68e-04 | 4158.13 ms | 32.5% bf16 MFU | 125407 tok/s step 3489/19560 | loss 3.687524 (+0.18z)| norm 0.2857 (-0.07z)| lr 5.68e-04 | 4170.46 ms | 32.4% bf16 MFU | 125423 tok/s step 3490/19560 | loss 3.726777 (+1.21z)| norm 0.2672 (-0.81z)| lr 5.68e-04 | 4164.56 ms | 32.4% bf16 MFU | 125446 tok/s step 
3491/19560 | loss 3.695964 (+0.40z)| norm 0.2900 (+0.10z)| lr 5.68e-04 | 4159.35 ms | 32.5% bf16 MFU | 125476 tok/s step 3492/19560 | loss 3.716143 (+0.92z)| norm 0.3206 (+1.31z)| lr 5.68e-04 | 4152.04 ms | 32.5% bf16 MFU | 125516 tok/s step 3493/19560 | loss 3.675007 (-0.16z)| norm 0.3286 (+1.59z)| lr 5.68e-04 | 4167.15 ms | 32.4% bf16 MFU | 125531 tok/s step 3494/19560 | loss 3.672456 (-0.23z)| norm 0.2662 (-0.86z)| lr 5.68e-04 | 4167.47 ms | 32.4% bf16 MFU | 125545 tok/s step 3495/19560 | loss 3.762566 (+2.10z)| norm 0.2844 (-0.14z)| lr 5.68e-04 | 4156.20 ms | 32.5% bf16 MFU | 125575 tok/s step 3496/19560 | loss 3.705271 (+0.61z)| norm 0.2502 (-1.47z)| lr 5.68e-04 | 4165.21 ms | 32.4% bf16 MFU | 125590 tok/s step 3497/19560 | loss 3.661955 (-0.51z)| norm 0.3046 (+0.65z)| lr 5.68e-04 | 4172.47 ms | 32.4% bf16 MFU | 125593 tok/s step 3498/19560 | loss 3.667230 (-0.36z)| norm 0.2959 (+0.30z)| lr 5.68e-04 | 4155.63 ms | 32.5% bf16 MFU | 125621 tok/s step 3499/19560 | loss 3.684845 (+0.10z)| norm 0.2864 (-0.07z)| lr 5.68e-04 | 4153.49 ms | 32.5% bf16 MFU | 125652 tok/s step 3500/19560 | loss 3.591981 (-2.32z)| norm 0.3057 (+0.68z)| lr 5.68e-04 | 4165.24 ms | 32.4% bf16 MFU | 125663 tok/s val loss 3.643960 
HellaSwag: 2661/10042 = 0.264987 step 3501/19560 | loss 3.614242 (-1.70z)| norm 0.2725 (-0.62z)| lr 5.68e-04 | 4271.13 ms | 31.6% bf16 MFU | 125517 tok/s step 3502/19560 | loss 3.712701 (+0.88z)| norm 0.2541 (-1.32z)| lr 5.68e-04 | 4291.86 ms | 31.5% bf16 MFU | 125349 tok/s step 3503/19560 | loss 3.673051 (-0.16z)| norm 0.2452 (-1.64z)| lr 5.68e-04 | 4236.22 ms | 31.9% bf16 MFU | 125270 tok/s step 3504/19560 | loss 3.702739 (+0.62z)| norm 0.2767 (-0.41z)| lr 5.68e-04 | 4153.77 ms | 32.5% bf16 MFU | 125317 tok/s step 3505/19560 | loss 3.689509 (+0.27z)| norm 0.2852 (-0.07z)| lr 5.68e-04 | 4150.08 ms | 32.5% bf16 MFU | 125368 tok/s step 3506/19560 | loss 3.665866 (-0.36z)| norm 0.2816 (-0.22z)| lr 5.68e-04 | 4147.06 ms | 32.6% bf16 MFU | 125421 tok/s step 3507/19560 | loss 3.652162 (-0.72z)| norm 0.2728 (-0.57z)| lr 5.68e-04 | 4153.43 ms 
| 32.5% bf16 MFU | 125461 tok/s step 3508/19560 | loss 3.622967 (-1.47z)| norm 0.2505 (-1.44z)| lr 5.68e-04 | 4147.21 ms | 32.6% bf16 MFU | 125509 tok/s step 3509/19560 | loss 3.670949 (-0.20z)| norm 0.3167 (+1.13z)| lr 5.68e-04 | 4235.48 ms | 31.9% bf16 MFU | 125423 tok/s step 3510/19560 | loss 3.691700 (+0.35z)| norm 0.3631 (+2.83z)| lr 5.68e-04 | 4160.78 ms | 32.5% bf16 MFU | 125452 tok/s step 3511/19560 | loss 3.665908 (-0.33z)| norm 0.3132 (+0.94z)| lr 5.68e-04 | 4151.57 ms | 32.5% bf16 MFU | 125494 tok/s step 3512/19560 | loss 3.656214 (-0.58z)| norm 0.3268 (+1.44z)| lr 5.68e-04 | 4294.00 ms | 31.4% bf16 MFU | 125324 tok/s step 3513/19560 | loss 3.660603 (-0.46z)| norm 0.2732 (-0.56z)| lr 5.68e-04 | 4299.52 ms | 31.4% bf16 MFU | 125155 tok/s step 3514/19560 | loss 3.688705 (+0.27z)| norm 0.2821 (-0.20z)| lr 5.68e-04 | 4153.32 ms | 32.5% bf16 MFU | 125209 tok/s step 3515/19560 | loss 3.647304 (-0.83z)| norm 0.2732 (-0.54z)| lr 5.68e-04 | 4151.41 ms | 32.5% bf16 MFU | 125263 tok/s step 3516/19560 | loss 3.631997 (-1.23z)| norm 0.2740 (-0.50z)| lr 5.68e-04 | 4270.28 ms | 31.6% bf16 MFU | 125139 tok/s step 3517/19560 | loss 3.734420 (+1.53z)| norm 0.2556 (-1.20z)| lr 5.68e-04 | 4505.06 ms | 30.0% bf16 MFU | 124701 tok/s step 3518/19560 | loss 3.697570 (+0.53z)| norm 0.2708 (-0.62z)| lr 5.68e-04 | 4266.20 ms | 31.6% bf16 MFU | 124610 tok/s step 3519/19560 | loss 3.675539 (-0.07z)| norm 0.2724 (-0.56z)| lr 5.68e-04 | 4502.08 ms | 30.0% bf16 MFU | 124203 tok/s step 3520/19560 | loss 3.718822 (+1.08z)| norm 0.2665 (-0.79z)| lr 5.68e-04 | 4151.04 ms | 32.5% bf16 MFU | 124308 tok/s step 3521/19560 | loss 3.683999 (+0.14z)| norm 0.2531 (-1.28z)| lr 5.68e-04 | 4368.75 ms | 30.9% bf16 MFU | 124093 tok/s step 3522/19560 | loss 3.679313 (+0.01z)| norm 0.2987 (+0.45z)| lr 5.67e-04 | 4241.85 ms | 31.8% bf16 MFU | 124068 tok/s step 3523/19560 | loss 3.701943 (+0.63z)| norm 0.3326 (+1.70z)| lr 5.67e-04 | 4197.31 ms | 32.2% bf16 MFU | 124110 tok/s step 3524/19560 | loss 3.699648 
(+0.56z)| norm 0.3229 (+1.32z)| lr 5.67e-04 | 4153.63 ms | 32.5% bf16 MFU | 124216 tok/s step 3525/19560 | loss 3.692801 (+0.38z)| norm 0.3111 (+0.88z)| lr 5.67e-04 | 4206.14 ms | 32.1% bf16 MFU | 124237 tok/s step 3526/19560 | loss 3.759548 (+2.14z)| norm 0.2999 (+0.46z)| lr 5.67e-04 | 4148.67 ms | 32.5% bf16 MFU | 124344 tok/s step 3527/19560 | loss 3.787608 (+2.79z)| norm 0.3158 (+1.07z)| lr 5.67e-04 | 4195.11 ms | 32.2% bf16 MFU | 124376 tok/s step 3528/19560 | loss 3.720397 (+1.03z)| norm 0.3462 (+2.16z)| lr 5.67e-04 | 4152.40 ms | 32.5% bf16 MFU | 124470 tok/s step 3529/19560 | loss 3.735892 (+1.41z)| norm 0.3477 (+2.16z)| lr 5.67e-04 | 4217.49 ms | 32.0% bf16 MFU | 124462 tok/s step 3530/19560 | loss 3.641690 (-1.03z)| norm 0.3322 (+1.57z)| lr 5.67e-04 | 4262.32 ms | 31.7% bf16 MFU | 124389 tok/s step 3531/19560 | loss 3.687124 (+0.14z)| norm 0.3759 (+3.01z)| lr 5.67e-04 | 4159.00 ms | 32.5% bf16 MFU | 124473 tok/s step 3532/19560 | loss 3.684629 (+0.07z)| norm 0.3155 (+0.89z)| lr 5.67e-04 | 4226.27 ms | 31.9% bf16 MFU | 124452 tok/s step 3533/19560 | loss 3.741494 (+1.52z)| norm 0.2954 (+0.18z)| lr 5.67e-04 | 4147.58 ms | 32.6% bf16 MFU | 124550 tok/s step 3534/19560 | loss 3.646700 (-0.91z)| norm 0.3049 (+0.51z)| lr 5.67e-04 | 4179.46 ms | 32.3% bf16 MFU | 124595 tok/s step 3535/19560 | loss 3.646788 (-0.90z)| norm 0.2702 (-0.70z)| lr 5.67e-04 | 4150.76 ms | 32.5% bf16 MFU | 124680 tok/s step 3536/19560 | loss 3.726444 (+1.14z)| norm 0.2763 (-0.49z)| lr 5.67e-04 | 4216.33 ms | 32.0% bf16 MFU | 124664 tok/s step 3537/19560 | loss 3.623422 (-1.50z)| norm 0.3038 (+0.46z)| lr 5.67e-04 | 4154.97 ms | 32.5% bf16 MFU | 124740 tok/s step 3538/19560 | loss 3.625587 (-1.42z)| norm 0.2903 (-0.03z)| lr 5.67e-04 | 4230.64 ms | 31.9% bf16 MFU | 124699 tok/s step 3539/19560 | loss 3.639460 (-1.07z)| norm 0.2986 (+0.26z)| lr 5.67e-04 | 4199.43 ms | 32.2% bf16 MFU | 124706 tok/s step 3540/19560 | loss 3.696707 (+0.41z)| norm 0.2741 (-0.62z)| lr 5.67e-04 | 4181.35 ms | 
32.3% bf16 MFU | 124740 tok/s step 3541/19560 | loss 3.746343 (+1.67z)| norm 0.2791 (-0.44z)| lr 5.67e-04 | 4164.65 ms | 32.4% bf16 MFU | 124798 tok/s step 3542/19560 | loss 3.685484 (+0.10z)| norm 0.3029 (+0.41z)| lr 5.67e-04 | 4177.79 ms | 32.3% bf16 MFU | 124833 tok/s step 3543/19560 | loss 3.662248 (-0.49z)| norm 0.2631 (-1.00z)| lr 5.67e-04 | 4174.00 ms | 32.3% bf16 MFU | 124872 tok/s step 3544/19560 | loss 3.602957 (-1.98z)| norm 0.2557 (-1.24z)| lr 5.67e-04 | 4170.48 ms | 32.4% bf16 MFU | 124914 tok/s step 3545/19560 | loss 3.656046 (-0.62z)| norm 0.2610 (-1.05z)| lr 5.67e-04 | 4164.07 ms | 32.4% bf16 MFU | 124963 tok/s step 3546/19560 | loss 3.667313 (-0.32z)| norm 0.2710 (-0.68z)| lr 5.67e-04 | 4164.58 ms | 32.4% bf16 MFU | 125010 tok/s step 3547/19560 | loss 3.683541 (+0.08z)| norm 0.2806 (-0.34z)| lr 5.67e-04 | 4181.38 ms | 32.3% bf16 MFU | 125029 tok/s step 3548/19560 | loss 3.602207 (-1.98z)| norm 0.2663 (-0.85z)| lr 5.67e-04 | 4428.91 ms | 30.5% bf16 MFU | 124696 tok/s step 3549/19560 | loss 3.675129 (-0.11z)| norm 0.2698 (-0.73z)| lr 5.67e-04 | 4169.81 ms | 32.4% bf16 MFU | 124748 tok/s step 3550/19560 | loss 3.675292 (-0.11z)| norm 0.2592 (-1.09z)| lr 5.67e-04 | 4166.05 ms | 32.4% bf16 MFU | 124803 tok/s step 3551/19560 | loss 3.648233 (-0.82z)| norm 0.2556 (-1.21z)| lr 5.67e-04 | 4160.15 ms | 32.5% bf16 MFU | 124864 tok/s step 3552/19560 | loss 3.653451 (-0.68z)| norm 0.2506 (-1.38z)| lr 5.67e-04 | 4156.24 ms | 32.5% bf16 MFU | 124928 tok/s step 3553/19560 | loss 3.747766 (+1.73z)| norm 0.2787 (-0.40z)| lr 5.67e-04 | 4166.33 ms | 32.4% bf16 MFU | 124974 tok/s step 3554/19560 | loss 3.661490 (-0.47z)| norm 0.2804 (-0.34z)| lr 5.67e-04 | 4163.60 ms | 32.4% bf16 MFU | 125021 tok/s step 3555/19560 | loss 3.636872 (-1.12z)| norm 0.2879 (-0.08z)| lr 5.67e-04 | 4185.67 ms | 32.3% bf16 MFU | 125033 tok/s step 3556/19560 | loss 3.627672 (-1.34z)| norm 0.2800 (-0.37z)| lr 5.67e-04 | 4154.63 ms | 32.5% bf16 MFU | 125091 tok/s step 3557/19560 | loss 3.713203 
(+0.87z)| norm 0.2741 (-0.57z)| lr 5.67e-04 | 4153.48 ms | 32.5% bf16 MFU | 125148 tok/s step 3558/19560 | loss 3.673372 (-0.16z)| norm 0.2669 (-0.82z)| lr 5.67e-04 | 4167.38 ms | 32.4% bf16 MFU | 125181 tok/s step 3559/19560 | loss 3.675496 (-0.11z)| norm 0.2843 (-0.20z)| lr 5.67e-04 | 4162.52 ms | 32.4% bf16 MFU | 125220 tok/s step 3560/19560 | loss 3.642374 (-0.97z)| norm 0.2908 (+0.02z)| lr 5.67e-04 | 4180.06 ms | 32.3% bf16 MFU | 125230 tok/s step 3561/19560 | loss 3.668643 (-0.29z)| norm 0.2949 (+0.16z)| lr 5.67e-04 | 4166.00 ms | 32.4% bf16 MFU | 125261 tok/s step 3562/19560 | loss 3.647332 (-0.84z)| norm 0.3461 (+1.93z)| lr 5.67e-04 | 4169.01 ms | 32.4% bf16 MFU | 125286 tok/s step 3563/19560 | loss 3.658127 (-0.55z)| norm 0.3255 (+1.21z)| lr 5.67e-04 | 4169.03 ms | 32.4% bf16 MFU | 125309 tok/s step 3564/19560 | loss 3.661000 (-0.48z)| norm 0.2760 (-0.51z)| lr 5.67e-04 | 4165.85 ms | 32.4% bf16 MFU | 125337 tok/s step 3565/19560 | loss 3.651753 (-0.71z)| norm 0.2763 (-0.50z)| lr 5.67e-04 | 4179.57 ms | 32.3% bf16 MFU | 125342 tok/s step 3566/19560 | loss 3.695210 (+0.42z)| norm 0.2981 (+0.25z)| lr 5.66e-04 | 4177.52 ms | 32.3% bf16 MFU | 125350 tok/s step 3567/19560 | loss 3.724969 (+1.17z)| norm 0.3122 (+0.74z)| lr 5.66e-04 | 4166.00 ms | 32.4% bf16 MFU | 125375 tok/s step 3568/19560 | loss 3.615192 (-1.65z)| norm 0.2949 (+0.14z)| lr 5.66e-04 | 4173.52 ms | 32.4% bf16 MFU | 125387 tok/s step 3569/19560 | loss 3.633281 (-1.18z)| norm 0.2748 (-0.56z)| lr 5.66e-04 | 4161.34 ms | 32.4% bf16 MFU | 125417 tok/s step 3570/19560 | loss 3.723825 (+1.17z)| norm 0.2945 (+0.13z)| lr 5.66e-04 | 4155.67 ms | 32.5% bf16 MFU | 125455 tok/s step 3571/19560 | loss 3.609051 (-1.82z)| norm 0.2876 (-0.11z)| lr 5.66e-04 | 4172.93 ms | 32.4% bf16 MFU | 125464 tok/s step 3572/19560 | loss 3.618841 (-1.54z)| norm 0.2600 (-1.06z)| lr 5.66e-04 | 4167.11 ms | 32.4% bf16 MFU | 125481 tok/s step 3573/19560 | loss 3.721758 (+1.12z)| norm 0.2681 (-0.78z)| lr 5.66e-04 | 4154.08 ms | 
32.5% bf16 MFU | 125518 tok/s step 3574/19560 | loss 3.560991 (-2.96z)| norm 0.2715 (-0.66z)| lr 5.66e-04 | 4162.31 ms | 32.4% bf16 MFU | 125540 tok/s step 3575/19560 | loss 3.639556 (-0.95z)| norm 0.2697 (-0.71z)| lr 5.66e-04 | 4166.79 ms | 32.4% bf16 MFU | 125554 tok/s step 3576/19560 | loss 3.639255 (-0.94z)| norm 0.2644 (-0.89z)| lr 5.66e-04 | 4154.27 ms | 32.5% bf16 MFU | 125587 tok/s step 3577/19560 | loss 3.648085 (-0.71z)| norm 0.2572 (-1.13z)| lr 5.66e-04 | 4165.34 ms | 32.4% bf16 MFU | 125601 tok/s step 3578/19560 | loss 3.640901 (-0.88z)| norm 0.2575 (-1.10z)| lr 5.66e-04 | 4169.69 ms | 32.4% bf16 MFU | 125608 tok/s step 3579/19560 | loss 3.649760 (-0.65z)| norm 0.2631 (-0.89z)| lr 5.66e-04 | 4166.21 ms | 32.4% bf16 MFU | 125620 tok/s step 3580/19560 | loss 3.615805 (-1.48z)| norm 0.2596 (-1.04z)| lr 5.66e-04 | 4159.30 ms | 32.5% bf16 MFU | 125641 tok/s step 3581/19560 | loss 3.686447 (+0.29z)| norm 0.2856 (-0.02z)| lr 5.66e-04 | 4166.92 ms | 32.4% bf16 MFU | 125650 tok/s step 3582/19560 | loss 3.690346 (+0.38z)| norm 0.3150 (+1.17z)| lr 5.66e-04 | 4183.15 ms | 32.3% bf16 MFU | 125634 tok/s step 3583/19560 | loss 3.695786 (+0.52z)| norm 0.3262 (+1.62z)| lr 5.66e-04 | 4182.88 ms | 32.3% bf16 MFU | 125620 tok/s step 3584/19560 | loss 3.654851 (-0.51z)| norm 0.2891 (+0.13z)| lr 5.66e-04 | 4155.87 ms | 32.5% bf16 MFU | 125646 tok/s step 3585/19560 | loss 3.665931 (-0.23z)| norm 0.2921 (+0.24z)| lr 5.66e-04 | 4152.78 ms | 32.5% bf16 MFU | 125677 tok/s step 3586/19560 | loss 3.640862 (-0.85z)| norm 0.3178 (+1.27z)| lr 5.66e-04 | 4167.05 ms | 32.4% bf16 MFU | 125684 tok/s step 3587/19560 | loss 3.714412 (+0.98z)| norm 0.2952 (+0.35z)| lr 5.66e-04 | 4154.81 ms | 32.5% bf16 MFU | 125709 tok/s step 3588/19560 | loss 3.581181 (-2.30z)| norm 0.3081 (+0.86z)| lr 5.66e-04 | 4170.23 ms | 32.4% bf16 MFU | 125710 tok/s step 3589/19560 | loss 3.679147 (+0.14z)| norm 0.2857 (-0.05z)| lr 5.66e-04 | 4155.55 ms | 32.5% bf16 MFU | 125732 tok/s step 3590/19560 | loss 3.678010 
(+0.11z)| norm 0.2782 (-0.35z)| lr 5.66e-04 | 4152.51 ms | 32.5% bf16 MFU | 125759 tok/s step 3591/19560 | loss 3.629687 (-1.07z)| norm 0.3073 (+0.81z)| lr 5.66e-04 | 4224.10 ms | 32.0% bf16 MFU | 125677 tok/s step 3592/19560 | loss 3.680842 (+0.19z)| norm 0.2928 (+0.22z)| lr 5.66e-04 | 4153.61 ms | 32.5% bf16 MFU | 125704 tok/s step 3593/19560 | loss 3.680869 (+0.18z)| norm 0.2926 (+0.22z)| lr 5.66e-04 | 4152.71 ms | 32.5% bf16 MFU | 125731 tok/s step 3594/19560 | loss 3.655917 (-0.44z)| norm 0.2930 (+0.23z)| lr 5.66e-04 | 4164.20 ms | 32.4% bf16 MFU | 125740 tok/s step 3595/19560 | loss 3.687454 (+0.34z)| norm 0.2646 (-0.93z)| lr 5.66e-04 | 4165.68 ms | 32.4% bf16 MFU | 125746 tok/s step 3596/19560 | loss 3.688255 (+0.35z)| norm 0.2742 (-0.54z)| lr 5.66e-04 | 4161.81 ms | 32.4% bf16 MFU | 125757 tok/s step 3597/19560 | loss 3.636657 (-0.92z)| norm 0.2821 (-0.21z)| lr 5.66e-04 | 4154.32 ms | 32.5% bf16 MFU | 125780 tok/s step 3598/19560 | loss 3.636196 (-0.93z)| norm 0.2797 (-0.32z)| lr 5.66e-04 | 4158.70 ms | 32.5% bf16 MFU | 125794 tok/s step 3599/19560 | loss 3.680644 (+0.17z)| norm 0.2745 (-0.53z)| lr 5.66e-04 | 4183.84 ms | 32.3% bf16 MFU | 125770 tok/s step 3600/19560 | loss 3.569508 (-2.50z)| norm 0.6945 (+9.33z)| lr 5.66e-04 | 4167.96 ms | 32.4% bf16 MFU | 125771 tok/s step 3601/19560 | loss 3.686405 (+0.32z)| norm 0.9052 (+8.80z)| lr 5.66e-04 | 4151.18 ms | 32.5% bf16 MFU | 125798 tok/s step 3602/19560 | loss 3.747859 (+1.77z)| norm 0.5834 (+3.87z)| lr 5.66e-04 | 4171.66 ms | 32.4% bf16 MFU | 125792 tok/s step 3603/19560 | loss 3.636764 (-0.88z)| norm 0.4200 (+1.63z)| lr 5.66e-04 | 4167.77 ms | 32.4% bf16 MFU | 125792 tok/s step 3604/19560 | loss 3.694046 (+0.52z)| norm 0.3758 (+1.02z)| lr 5.66e-04 | 4160.82 ms | 32.4% bf16 MFU | 125803 tok/s step 3605/19560 | loss 3.665569 (-0.18z)| norm 0.3776 (+1.03z)| lr 5.66e-04 | 4164.59 ms | 32.4% bf16 MFU | 125807 tok/s step 3606/19560 | loss 3.609634 (-1.53z)| norm 0.3310 (+0.40z)| lr 5.66e-04 | 4172.40 ms | 
32.4% bf16 MFU | 125799 tok/s step 3607/19560 | loss 3.675914 (+0.09z)| norm 0.3239 (+0.30z)| lr 5.66e-04 | 4160.74 ms | 32.5% bf16 MFU | 125810 tok/s step 3608/19560 | loss 3.596987 (-1.80z)| norm 0.2945 (-0.09z)| lr 5.66e-04 | 4151.27 ms | 32.5% bf16 MFU | 125834 tok/s step 3609/19560 | loss 3.673320 (+0.07z)| norm 0.2966 (-0.06z)| lr 5.65e-04 | 4152.23 ms | 32.5% bf16 MFU | 125856 tok/s step 3610/19560 | loss 3.635059 (-0.87z)| norm 0.3097 (+0.11z)| lr 5.65e-04 | 4214.10 ms | 32.0% bf16 MFU | 125784 tok/s step 3611/19560 | loss 3.733574 (+1.56z)| norm 0.2690 (-0.43z)| lr 5.65e-04 | 4160.80 ms | 32.4% bf16 MFU | 125795 tok/s step 3612/19560 | loss 3.627820 (-1.04z)| norm 0.2808 (-0.27z)| lr 5.65e-04 | 4149.36 ms | 32.5% bf16 MFU | 125823 tok/s step 3613/19560 | loss 3.669365 (-0.01z)| norm 0.2776 (-0.31z)| lr 5.65e-04 | 4168.65 ms | 32.4% bf16 MFU | 125820 tok/s step 3614/19560 | loss 3.657822 (-0.30z)| norm 0.2784 (-0.30z)| lr 5.65e-04 | 4149.97 ms | 32.5% bf16 MFU | 125846 tok/s step 3615/19560 | loss 3.616197 (-1.31z)| norm 0.2868 (-0.19z)| lr 5.65e-04 | 4196.83 ms | 32.2% bf16 MFU | 125800 tok/s step 3616/19560 | loss 3.662980 (-0.17z)| norm 0.2863 (-0.20z)| lr 5.65e-04 | 4170.36 ms | 32.4% bf16 MFU | 125796 tok/s step 3617/19560 | loss 3.646212 (-0.57z)| norm 0.2713 (-0.39z)| lr 5.65e-04 | 4168.65 ms | 32.4% bf16 MFU | 125794 tok/s step 3618/19560 | loss 3.624834 (-1.08z)| norm 0.2744 (-0.35z)| lr 5.65e-04 | 4172.43 ms | 32.4% bf16 MFU | 125787 tok/s step 3619/19560 | loss 3.630117 (-0.93z)| norm 0.2413 (-0.79z)| lr 5.65e-04 | 4185.50 ms | 32.3% bf16 MFU | 125761 tok/s step 3620/19560 | loss 3.678819 (+0.27z)| norm 0.2687 (-0.42z)| lr 5.65e-04 | 4169.26 ms | 32.4% bf16 MFU | 125761 tok/s step 3621/19560 | loss 3.677227 (+0.23z)| norm 0.2761 (-0.31z)| lr 5.65e-04 | 4174.63 ms | 32.3% bf16 MFU | 125752 tok/s step 3622/19560 | loss 3.608755 (-1.44z)| norm 0.2578 (-0.56z)| lr 5.65e-04 | 4153.01 ms | 32.5% bf16 MFU | 125777 tok/s step 3623/19560 | loss 3.688412 
(+0.54z)| norm 0.2616 (-0.50z)| lr 5.65e-04 | 4155.92 ms | 32.5% bf16 MFU | 125796 tok/s step 3624/19560 | loss 3.668784 (+0.06z)| norm 0.2763 (-0.31z)| lr 5.65e-04 | 4150.81 ms | 32.5% bf16 MFU | 125821 tok/s step 3625/19560 | loss 3.680706 (+0.35z)| norm 0.2963 (-0.04z)| lr 5.65e-04 | 4153.43 ms | 32.5% bf16 MFU | 125842 tok/s step 3626/19560 | loss 3.664568 (-0.05z)| norm 0.2833 (-0.22z)| lr 5.65e-04 | 4159.54 ms | 32.5% bf16 MFU | 125852 tok/s step 3627/19560 | loss 3.620831 (-1.13z)| norm 0.2595 (-0.53z)| lr 5.65e-04 | 4159.26 ms | 32.5% bf16 MFU | 125862 tok/s step 3628/19560 | loss 3.666267 (-0.01z)| norm 0.2589 (-0.53z)| lr 5.65e-04 | 4185.77 ms | 32.3% bf16 MFU | 125832 tok/s step 3629/19560 | loss 3.657036 (-0.25z)| norm 0.2496 (-0.65z)| lr 5.65e-04 | 4154.67 ms | 32.5% bf16 MFU | 125850 tok/s step 3630/19560 | loss 3.593952 (-1.83z)| norm 0.2664 (-0.43z)| lr 5.65e-04 | 4149.30 ms | 32.5% bf16 MFU | 125875 tok/s step 3631/19560 | loss 3.615939 (-1.25z)| norm 0.2945 (-0.06z)| lr 5.65e-04 | 4224.30 ms | 32.0% bf16 MFU | 125787 tok/s step 3632/19560 | loss 3.651798 (-0.34z)| norm 0.2587 (-0.54z)| lr 5.65e-04 | 4155.23 ms | 32.5% bf16 MFU | 125806 tok/s step 3633/19560 | loss 3.678012 (+0.33z)| norm 0.2545 (-0.59z)| lr 5.65e-04 | 4155.00 ms | 32.5% bf16 MFU | 125825 tok/s step 3634/19560 | loss 3.680985 (+0.40z)| norm 0.2718 (-0.36z)| lr 5.65e-04 | 4161.32 ms | 32.4% bf16 MFU | 125833 tok/s step 3635/19560 | loss 3.608691 (-1.41z)| norm 0.2748 (-0.32z)| lr 5.65e-04 | 4162.02 ms | 32.4% bf16 MFU | 125840 tok/s step 3636/19560 | loss 3.642644 (-0.56z)| norm 0.2472 (-0.68z)| lr 5.65e-04 | 4150.48 ms | 32.5% bf16 MFU | 125864 tok/s step 3637/19560 | loss 3.690101 (+0.63z)| norm 0.2420 (-0.74z)| lr 5.65e-04 | 4155.65 ms | 32.5% bf16 MFU | 125879 tok/s step 3638/19560 | loss 3.645047 (-0.50z)| norm 0.2567 (-0.54z)| lr 5.65e-04 | 4153.08 ms | 32.5% bf16 MFU | 125897 tok/s step 3639/19560 | loss 3.621113 (-1.09z)| norm 0.2775 (-0.26z)| lr 5.65e-04 | 4149.65 ms | 
32.5% bf16 MFU | 125920 tok/s step 3640/19560 | loss 3.657720 (-0.17z)| norm 0.2734 (-0.31z)| lr 5.65e-04 | 4164.01 ms | 32.4% bf16 MFU | 125919 tok/s step 3641/19560 | loss 3.650788 (-0.34z)| norm 0.2915 (-0.07z)| lr 5.65e-04 | 4158.07 ms | 32.5% bf16 MFU | 125928 tok/s step 3642/19560 | loss 3.631297 (-0.82z)| norm 0.3152 (+0.24z)| lr 5.65e-04 | 4157.44 ms | 32.5% bf16 MFU | 125937 tok/s step 3643/19560 | loss 3.654508 (-0.24z)| norm 0.2961 (-0.02z)| lr 5.65e-04 | 4177.52 ms | 32.3% bf16 MFU | 125915 tok/s step 3644/19560 | loss 3.593392 (-1.74z)| norm 0.2829 (-0.19z)| lr 5.65e-04 | 4154.10 ms | 32.5% bf16 MFU | 125930 tok/s step 3645/19560 | loss 3.655304 (-0.20z)| norm 0.2730 (-0.33z)| lr 5.65e-04 | 4157.47 ms | 32.5% bf16 MFU | 125939 tok/s step 3646/19560 | loss 3.666835 (+0.10z)| norm 0.2685 (-0.38z)| lr 5.65e-04 | 4161.33 ms | 32.4% bf16 MFU | 125941 tok/s step 3647/19560 | loss 3.561502 (-2.46z)| norm 0.2654 (-0.43z)| lr 5.65e-04 | 4148.90 ms | 32.5% bf16 MFU | 125962 tok/s step 3648/19560 | loss 3.675754 (+0.35z)| norm 0.2934 (-0.06z)| lr 5.65e-04 | 4163.51 ms | 32.4% bf16 MFU | 125961 tok/s step 3649/19560 | loss 3.643640 (-0.44z)| norm 0.3542 (+0.74z)| lr 5.65e-04 | 4151.35 ms | 32.5% bf16 MFU | 125977 tok/s step 3650/19560 | loss 3.618609 (-1.04z)| norm 0.3116 (+0.17z)| lr 5.65e-04 | 4176.82 ms | 32.3% bf16 MFU | 125954 tok/s step 3651/19560 | loss 3.681347 (+0.51z)| norm 0.2768 (-0.28z)| lr 5.65e-04 | 4150.52 ms | 32.5% bf16 MFU | 125973 tok/s step 3652/19560 | loss 3.664224 (+0.09z)| norm 0.2908 (-0.09z)| lr 5.64e-04 | 4189.22 ms | 32.2% bf16 MFU | 125932 tok/s step 3653/19560 | loss 3.627269 (-0.81z)| norm 0.3082 (+0.14z)| lr 5.64e-04 | 4147.53 ms | 32.6% bf16 MFU | 125956 tok/s step 3654/19560 | loss 3.620440 (-0.97z)| norm 0.2895 (-0.11z)| lr 5.64e-04 | 4155.55 ms | 32.5% bf16 MFU | 125966 tok/s step 3655/19560 | loss 3.661350 (+0.09z)| norm 0.3129 (+0.20z)| lr 5.64e-04 | 4169.04 ms | 32.4% bf16 MFU | 125956 tok/s step 3656/19560 | loss 3.644447 
(-0.34z)| norm 0.2891 (-0.11z)| lr 5.64e-04 | 4160.34 ms | 32.5% bf16 MFU | 125959 tok/s step 3657/19560 | loss 3.630130 (-0.71z)| norm 0.2605 (-0.48z)| lr 5.64e-04 | 4147.04 ms | 32.6% bf16 MFU | 125982 tok/s step 3658/19560 | loss 3.626673 (-0.80z)| norm 0.2760 (-0.27z)| lr 5.64e-04 | 4222.95 ms | 32.0% bf16 MFU | 125891 tok/s step 3659/19560 | loss 3.555065 (-2.65z)| norm 0.2815 (-0.19z)| lr 5.64e-04 | 4149.24 ms | 32.5% bf16 MFU | 125914 tok/s step 3660/19560 | loss 3.658552 (+0.09z)| norm 0.2720 (-0.31z)| lr 5.64e-04 | 4156.92 ms | 32.5% bf16 MFU | 125924 tok/s step 3661/19560 | loss 3.634888 (-0.52z)| norm 0.2886 (-0.09z)| lr 5.64e-04 | 4145.71 ms | 32.6% bf16 MFU | 125952 tok/s step 3662/19560 | loss 3.688027 (+0.90z)| norm 0.2803 (-0.19z)| lr 5.64e-04 | 4162.44 ms | 32.4% bf16 MFU | 125952 tok/s step 3663/19560 | loss 3.694676 (+1.07z)| norm 0.2602 (-0.46z)| lr 5.64e-04 | 4148.04 ms | 32.5% bf16 MFU | 125974 tok/s step 3664/19560 | loss 3.635677 (-0.50z)| norm 0.3101 (+0.20z)| lr 5.64e-04 | 4346.60 ms | 31.1% bf16 MFU | 125706 tok/s step 3665/19560 | loss 3.682172 (+0.75z)| norm 0.3360 (+0.54z)| lr 5.64e-04 | 4158.16 ms | 32.5% bf16 MFU | 125725 tok/s step 3666/19560 | loss 3.736233 (+2.16z)| norm 0.3431 (+0.63z)| lr 5.64e-04 | 4207.44 ms | 32.1% bf16 MFU | 125669 tok/s step 3667/19560 | loss 3.573087 (-2.15z)| norm 0.3015 (+0.08z)| lr 5.64e-04 | 4168.37 ms | 32.4% bf16 MFU | 125675 tok/s step 3668/19560 | loss 3.701865 (+1.23z)| norm 0.3491 (+0.70z)| lr 5.64e-04 | 4153.17 ms | 32.5% bf16 MFU | 125703 tok/s step 3669/19560 | loss 3.596570 (-1.53z)| norm 0.3122 (+0.21z)| lr 5.64e-04 | 4165.28 ms | 32.4% bf16 MFU | 125711 tok/s step 3670/19560 | loss 3.623831 (-0.79z)| norm 0.3085 (+0.16z)| lr 5.64e-04 | 4152.32 ms | 32.5% bf16 MFU | 125739 tok/s step 3671/19560 | loss 3.631744 (-0.57z)| norm 0.2711 (-0.34z)| lr 5.64e-04 | 4161.80 ms | 32.4% bf16 MFU | 125751 tok/s step 3672/19560 | loss 3.588496 (-1.71z)| norm 0.2722 (-0.33z)| lr 5.64e-04 | 4147.68 ms | 
32.6% bf16 MFU | 125784 tok/s step 3673/19560 | loss 3.636077 (-0.45z)| norm 0.2857 (-0.15z)| lr 5.64e-04 | 4158.79 ms | 32.5% bf16 MFU | 125798 tok/s step 3674/19560 | loss 3.704822 (+1.36z)| norm 0.2817 (-0.21z)| lr 5.64e-04 | 4164.25 ms | 32.4% bf16 MFU | 125803 tok/s step 3675/19560 | loss 3.651106 (-0.05z)| norm 0.2745 (-0.30z)| lr 5.64e-04 | 4164.93 ms | 32.4% bf16 MFU | 125807 tok/s step 3676/19560 | loss 3.655315 (+0.05z)| norm 0.2893 (-0.10z)| lr 5.64e-04 | 4213.67 ms | 32.0% bf16 MFU | 125738 tok/s step 3677/19560 | loss 3.701106 (+1.25z)| norm 0.2549 (-0.56z)| lr 5.64e-04 | 4152.09 ms | 32.5% bf16 MFU | 125765 tok/s step 3678/19560 | loss 3.641883 (-0.30z)| norm 0.2680 (-0.39z)| lr 5.64e-04 | 4223.21 ms | 32.0% bf16 MFU | 125684 tok/s step 3679/19560 | loss 3.628818 (-0.64z)| norm 0.2700 (-0.36z)| lr 5.64e-04 | 4150.42 ms | 32.5% bf16 MFU | 125715 tok/s step 3680/19560 | loss 3.662991 (+0.26z)| norm 0.2665 (-0.41z)| lr 5.64e-04 | 4151.85 ms | 32.5% bf16 MFU | 125744 tok/s step 3681/19560 | loss 3.717568 (+1.73z)| norm 0.2713 (-0.35z)| lr 5.64e-04 | 4151.26 ms | 32.5% bf16 MFU | 125771 tok/s step 3682/19560 | loss 3.628501 (-0.65z)| norm 0.2577 (-0.53z)| lr 5.64e-04 | 4284.38 ms | 31.5% bf16 MFU | 125601 tok/s step 3683/19560 | loss 3.665184 (+0.33z)| norm 0.2681 (-0.38z)| lr 5.64e-04 | 4149.50 ms | 32.5% bf16 MFU | 125639 tok/s step 3684/19560 | loss 3.682916 (+0.79z)| norm 0.2494 (-0.63z)| lr 5.64e-04 | 4159.58 ms | 32.5% bf16 MFU | 125659 tok/s step 3685/19560 | loss 3.553463 (-2.59z)| norm 0.2662 (-0.41z)| lr 5.64e-04 | 4401.50 ms | 30.7% bf16 MFU | 125332 tok/s step 3686/19560 | loss 3.689324 (+0.97z)| norm 0.2815 (-0.20z)| lr 5.64e-04 | 4154.15 ms | 32.5% bf16 MFU | 125376 tok/s step 3687/19560 | loss 3.637774 (-0.37z)| norm 0.2448 (-0.69z)| lr 5.64e-04 | 4207.63 ms | 32.1% bf16 MFU | 125337 tok/s step 3688/19560 | loss 3.658060 (+0.16z)| norm 0.2528 (-0.58z)| lr 5.64e-04 | 4193.05 ms | 32.2% bf16 MFU | 125322 tok/s step 3689/19560 | loss 3.645370 
(-0.17z)| norm 0.2763 (-0.26z)| lr 5.64e-04 | 4431.38 ms | 30.5% bf16 MFU | 124972 tok/s step 3690/19560 | loss 3.619820 (-0.83z)| norm 0.2604 (-0.46z)| lr 5.64e-04 | 4332.63 ms | 31.2% bf16 MFU | 124773 tok/s step 3691/19560 | loss 3.651819 (+0.00z)| norm 0.2599 (-0.47z)| lr 5.64e-04 | 4152.78 ms | 32.5% bf16 MFU | 124847 tok/s step 3692/19560 | loss 3.626621 (-0.65z)| norm 0.2523 (-0.56z)| lr 5.64e-04 | 4335.14 ms | 31.1% bf16 MFU | 124652 tok/s step 3693/19560 | loss 3.604079 (-1.22z)| norm 0.2588 (-0.47z)| lr 5.64e-04 | 4276.88 ms | 31.6% bf16 MFU | 124549 tok/s step 3694/19560 | loss 3.713629 (+1.61z)| norm 0.2890 (-0.07z)| lr 5.63e-04 | 4145.95 ms | 32.6% bf16 MFU | 124644 tok/s step 3695/19560 | loss 3.647433 (-0.08z)| norm 0.3075 (+0.17z)| lr 5.63e-04 | 4228.74 ms | 31.9% bf16 MFU | 124611 tok/s step 3696/19560 | loss 3.648959 (-0.05z)| norm 0.3296 (+0.46z)| lr 5.63e-04 | 4276.81 ms | 31.6% bf16 MFU | 124510 tok/s step 3697/19560 | loss 3.656266 (+0.14z)| norm 0.3508 (+0.73z)| lr 5.63e-04 | 4157.67 ms | 32.5% bf16 MFU | 124589 tok/s step 3698/19560 | loss 3.733719 (+2.17z)| norm 0.3526 (+0.75z)| lr 5.63e-04 | 4150.19 ms | 32.5% bf16 MFU | 124676 tok/s step 3699/19560 | loss 3.627714 (-0.62z)| norm 0.2935 (-0.03z)| lr 5.63e-04 | 4250.02 ms | 31.8% bf16 MFU | 124611 tok/s step 3700/19560 | loss 3.633873 (-0.46z)| norm 0.3053 (+0.12z)| lr 5.63e-04 | 4157.37 ms | 32.5% bf16 MFU | 124686 tok/s step 3701/19560 | loss 3.662184 (+0.30z)| norm 0.2721 (-0.32z)| lr 5.63e-04 | 4148.07 ms | 32.5% bf16 MFU | 124771 tok/s step 3702/19560 | loss 3.648491 (-0.08z)| norm 0.2867 (-0.13z)| lr 5.63e-04 | 4154.13 ms | 32.5% bf16 MFU | 124843 tok/s step 3703/19560 | loss 3.704895 (+1.43z)| norm 0.2657 (-0.40z)| lr 5.63e-04 | 4148.29 ms | 32.5% bf16 MFU | 124920 tok/s step 3704/19560 | loss 3.628005 (-0.65z)| norm 0.2631 (-0.44z)| lr 5.63e-04 | 4218.22 ms | 32.0% bf16 MFU | 124889 tok/s step 3705/19560 | loss 3.625025 (-0.72z)| norm 0.2826 (-0.18z)| lr 5.63e-04 | 4148.66 ms | 
32.5% bf16 MFU | 124963 tok/s step 3706/19560 | loss 3.682397 (+0.82z)| norm 0.2868 (-0.13z)| lr 5.63e-04 | 4139.62 ms | 32.6% bf16 MFU | 125047 tok/s step 3707/19560 | loss 3.658774 (+0.18z)| norm 0.2823 (-0.19z)| lr 5.63e-04 | 4237.26 ms | 31.9% bf16 MFU | 124982 tok/s step 3708/19560 | loss 3.665688 (+0.35z)| norm 0.2639 (-0.44z)| lr 5.63e-04 | 4209.55 ms | 32.1% bf16 MFU | 124960 tok/s step 3709/19560 | loss 3.658238 (+0.16z)| norm 0.2967 (-0.00z)| lr 5.63e-04 | 4146.69 ms | 32.6% bf16 MFU | 125034 tok/s step 3710/19560 | loss 3.640110 (-0.32z)| norm 0.2853 (-0.15z)| lr 5.63e-04 | 4151.87 ms | 32.5% bf16 MFU | 125096 tok/s step 3711/19560 | loss 3.619324 (-0.87z)| norm 0.2841 (-0.16z)| lr 5.63e-04 | 4152.84 ms | 32.5% bf16 MFU | 125153 tok/s step 3712/19560 | loss 3.647215 (-0.11z)| norm 0.2676 (-0.38z)| lr 5.63e-04 | 4150.33 ms | 32.5% bf16 MFU | 125212 tok/s step 3713/19560 | loss 3.662261 (+0.30z)| norm 0.2856 (-0.14z)| lr 5.63e-04 | 5049.42 ms | 26.7% bf16 MFU | 124143 tok/s step 3714/19560 | loss 3.628347 (-0.62z)| norm 0.2655 (-0.40z)| lr 5.63e-04 | 4144.48 ms | 32.6% bf16 MFU | 124261 tok/s step 3715/19560 | loss 3.665825 (+0.41z)| norm 0.2939 (-0.02z)| lr 5.63e-04 | 4345.46 ms | 31.1% bf16 MFU | 124080 tok/s step 3716/19560 | loss 3.620835 (-0.84z)| norm 0.2706 (-0.33z)| lr 5.63e-04 | 4254.85 ms | 31.7% bf16 MFU | 124038 tok/s step 3717/19560 | loss 3.612048 (-1.07z)| norm 0.3001 (+0.06z)| lr 5.63e-04 | 4147.57 ms | 32.6% bf16 MFU | 124156 tok/s step 3718/19560 | loss 3.727779 (+2.11z)| norm 0.3119 (+0.21z)| lr 5.63e-04 | 4270.22 ms | 31.6% bf16 MFU | 124087 tok/s step 3719/19560 | loss 3.665341 (+0.39z)| norm 0.2945 (-0.02z)| lr 5.63e-04 | 4154.94 ms | 32.5% bf16 MFU | 124192 tok/s step 3720/19560 | loss 3.725162 (+1.99z)| norm 0.2901 (-0.07z)| lr 5.63e-04 | 4145.58 ms | 32.6% bf16 MFU | 124306 tok/s step 3721/19560 | loss 3.611637 (-1.07z)| norm 0.2809 (-0.20z)| lr 5.63e-04 | 4148.25 ms | 32.5% bf16 MFU | 124410 tok/s step 3722/19560 | loss 3.597189 
(-1.43z)| norm 0.2682 (-0.36z)| lr 5.63e-04 | 4285.05 ms | 31.5% bf16 MFU | 124307 tok/s step 3723/19560 | loss 3.544522 (-2.74z)| norm 0.3197 (+0.32z)| lr 5.63e-04 | 4150.91 ms | 32.5% bf16 MFU | 124407 tok/s step 3724/19560 | loss 3.555839 (-2.37z)| norm 0.2839 (-0.16z)| lr 5.63e-04 | 4148.71 ms | 32.5% bf16 MFU | 124505 tok/s step 3725/19560 | loss 3.623404 (-0.64z)| norm 0.3054 (+0.12z)| lr 5.63e-04 | 4150.49 ms | 32.5% bf16 MFU | 124596 tok/s step 3726/19560 | loss 3.654519 (+0.15z)| norm 0.2997 (+0.04z)| lr 5.63e-04 | 4263.32 ms | 31.7% bf16 MFU | 124515 tok/s step 3727/19560 | loss 3.670225 (+0.56z)| norm 0.2723 (-0.32z)| lr 5.63e-04 | 4164.72 ms | 32.4% bf16 MFU | 124584 tok/s step 3728/19560 | loss 3.681916 (+0.85z)| norm 0.2640 (-0.43z)| lr 5.63e-04 | 4161.64 ms | 32.4% bf16 MFU | 124654 tok/s step 3729/19560 | loss 3.640543 (-0.22z)| norm 0.2664 (-0.55z)| lr 5.63e-04 | 4166.71 ms | 32.4% bf16 MFU | 124712 tok/s step 3730/19560 | loss 3.634426 (-0.36z)| norm 0.2823 (-0.11z)| lr 5.63e-04 | 4208.94 ms | 32.1% bf16 MFU | 124705 tok/s step 3731/19560 | loss 3.578778 (-1.82z)| norm 0.3013 (+0.64z)| lr 5.63e-04 | 4180.89 ms | 32.3% bf16 MFU | 124740 tok/s step 3732/19560 | loss 3.603267 (-1.15z)| norm 0.2726 (-0.45z)| lr 5.63e-04 | 4181.64 ms | 32.3% bf16 MFU | 124772 tok/s step 3733/19560 | loss 3.679990 (+0.87z)| norm 0.2607 (-0.95z)| lr 5.63e-04 | 4170.47 ms | 32.4% bf16 MFU | 124819 tok/s step 3734/19560 | loss 3.699253 (+1.36z)| norm 0.2739 (-0.37z)| lr 5.63e-04 | 4186.89 ms | 32.2% bf16 MFU | 124839 tok/s step 3735/19560 | loss 3.697071 (+1.29z)| norm 0.2888 (+0.29z)| lr 5.62e-04 | 4201.04 ms | 32.1% bf16 MFU | 124837 tok/s step 3736/19560 | loss 3.638428 (-0.26z)| norm 0.2712 (-0.47z)| lr 5.62e-04 | 4169.23 ms | 32.4% bf16 MFU | 124883 tok/s step 3737/19560 | loss 3.672690 (+0.65z)| norm 0.2692 (-0.55z)| lr 5.62e-04 | 4168.49 ms | 32.4% bf16 MFU | 124927 tok/s step 3738/19560 | loss 3.655532 (+0.19z)| norm 0.3014 (+0.86z)| lr 5.62e-04 | 4208.34 ms | 
32.1% bf16 MFU | 124910 tok/s step 3739/19560 | loss 3.633957 (-0.37z)| norm 0.3093 (+1.19z)| lr 5.62e-04 | 4163.43 ms | 32.4% bf16 MFU | 124961 tok/s step 3740/19560 | loss 3.712277 (+1.71z)| norm 0.3099 (+1.20z)| lr 5.62e-04 | 4168.24 ms | 32.4% bf16 MFU | 125002 tok/s step 3741/19560 | loss 3.644978 (-0.08z)| norm 0.2967 (+0.62z)| lr 5.62e-04 | 4176.25 ms | 32.3% bf16 MFU | 125029 tok/s step 3742/19560 | loss 3.603498 (-1.17z)| norm 0.2844 (+0.08z)| lr 5.62e-04 | 4165.04 ms | 32.4% bf16 MFU | 125071 tok/s step 3743/19560 | loss 3.606663 (-1.08z)| norm 0.2674 (-0.65z)| lr 5.62e-04 | 4168.94 ms | 32.4% bf16 MFU | 125106 tok/s step 3744/19560 | loss 3.652865 (+0.14z)| norm 0.2472 (-1.50z)| lr 5.62e-04 | 4163.45 ms | 32.4% bf16 MFU | 125147 tok/s step 3745/19560 | loss 3.644138 (-0.09z)| norm 0.2740 (-0.34z)| lr 5.62e-04 | 4151.10 ms | 32.5% bf16 MFU | 125205 tok/s step 3746/19560 | loss 3.602282 (-1.19z)| norm 0.4189 (+5.19z)| lr 5.62e-04 | 4156.90 ms | 32.5% bf16 MFU | 125251 tok/s step 3747/19560 | loss 3.610018 (-0.98z)| norm 0.3136 (+1.15z)| lr 5.62e-04 | 4163.85 ms | 32.4% bf16 MFU | 125284 tok/s step 3748/19560 | loss 3.697199 (+1.31z)| norm 0.3100 (+0.99z)| lr 5.62e-04 | 4171.10 ms | 32.4% bf16 MFU | 125304 tok/s step 3749/19560 | loss 3.669862 (+0.59z)| norm 0.3076 (+0.89z)| lr 5.62e-04 | 4174.76 ms | 32.3% bf16 MFU | 125318 tok/s step 3750/19560 | loss 3.657852 (+0.27z)| norm 0.3135 (+1.10z)| lr 5.62e-04 | 4168.80 ms | 32.4% bf16 MFU | 125341 tok/s val loss 3.629004 evaluating HellaSwag: 0/1256 evaluating HellaSwag: 10/1256 evaluating HellaSwag: 20/1256 evaluating HellaSwag: 30/1256 evaluating HellaSwag: 40/1256 evaluating HellaSwag: 50/1256 evaluating HellaSwag: 60/1256 evaluating HellaSwag: 70/1256 evaluating HellaSwag: 80/1256 evaluating HellaSwag: 90/1256 evaluating HellaSwag: 100/1256 evaluating HellaSwag: 110/1256 evaluating HellaSwag: 120/1256 evaluating HellaSwag: 130/1256 evaluating HellaSwag: 140/1256 evaluating HellaSwag: 150/1256 evaluating 
HellaSwag: 160/1256 evaluating HellaSwag: 170/1256 evaluating HellaSwag: 180/1256 evaluating HellaSwag: 190/1256 evaluating HellaSwag: 200/1256 evaluating HellaSwag: 210/1256 evaluating HellaSwag: 220/1256 evaluating HellaSwag: 230/1256 evaluating HellaSwag: 240/1256 evaluating HellaSwag: 250/1256 evaluating HellaSwag: 260/1256 evaluating HellaSwag: 270/1256 evaluating HellaSwag: 280/1256 evaluating HellaSwag: 290/1256 evaluating HellaSwag: 300/1256 evaluating HellaSwag: 310/1256 evaluating HellaSwag: 320/1256 evaluating HellaSwag: 330/1256 evaluating HellaSwag: 340/1256 evaluating HellaSwag: 350/1256 evaluating HellaSwag: 360/1256 evaluating HellaSwag: 370/1256 evaluating HellaSwag: 380/1256 evaluating HellaSwag: 390/1256 evaluating HellaSwag: 400/1256 evaluating HellaSwag: 410/1256 evaluating HellaSwag: 420/1256 evaluating HellaSwag: 430/1256 evaluating HellaSwag: 440/1256 evaluating HellaSwag: 450/1256 evaluating HellaSwag: 460/1256 evaluating HellaSwag: 470/1256 evaluating HellaSwag: 480/1256 evaluating HellaSwag: 490/1256 evaluating HellaSwag: 500/1256 evaluating HellaSwag: 510/1256 evaluating HellaSwag: 520/1256 evaluating HellaSwag: 530/1256 evaluating HellaSwag: 540/1256 evaluating HellaSwag: 550/1256 evaluating HellaSwag: 560/1256 evaluating HellaSwag: 570/1256 evaluating HellaSwag: 580/1256 evaluating HellaSwag: 590/1256 evaluating HellaSwag: 600/1256 evaluating HellaSwag: 610/1256 evaluating HellaSwag: 620/1256 evaluating HellaSwag: 630/1256 evaluating HellaSwag: 640/1256 evaluating HellaSwag: 650/1256 evaluating HellaSwag: 660/1256 evaluating HellaSwag: 670/1256 evaluating HellaSwag: 680/1256 evaluating HellaSwag: 690/1256 evaluating HellaSwag: 700/1256 evaluating HellaSwag: 710/1256 evaluating HellaSwag: 720/1256 evaluating HellaSwag: 730/1256 evaluating HellaSwag: 740/1256 evaluating HellaSwag: 750/1256 evaluating HellaSwag: 760/1256 evaluating HellaSwag: 770/1256 evaluating HellaSwag: 780/1256 evaluating HellaSwag: 790/1256 evaluating HellaSwag: 
800/1256 evaluating HellaSwag: 810/1256 evaluating HellaSwag: 820/1256 evaluating HellaSwag: 830/1256 evaluating HellaSwag: 840/1256 evaluating HellaSwag: 850/1256 evaluating HellaSwag: 860/1256 evaluating HellaSwag: 870/1256 evaluating HellaSwag: 880/1256 evaluating HellaSwag: 890/1256 evaluating HellaSwag: 900/1256 evaluating HellaSwag: 910/1256 evaluating HellaSwag: 920/1256 evaluating HellaSwag: 930/1256 evaluating HellaSwag: 940/1256 evaluating HellaSwag: 950/1256 evaluating HellaSwag: 960/1256 evaluating HellaSwag: 970/1256 evaluating HellaSwag: 980/1256 evaluating HellaSwag: 990/1256 evaluating HellaSwag: 1000/1256 evaluating HellaSwag: 1010/1256 evaluating HellaSwag: 1020/1256 evaluating HellaSwag: 1030/1256 evaluating HellaSwag: 1040/1256 evaluating HellaSwag: 1050/1256 evaluating HellaSwag: 1060/1256 evaluating HellaSwag: 1070/1256 evaluating HellaSwag: 1080/1256 evaluating HellaSwag: 1090/1256 evaluating HellaSwag: 1100/1256 evaluating HellaSwag: 1110/1256 evaluating HellaSwag: 1120/1256 evaluating HellaSwag: 1130/1256 evaluating HellaSwag: 1140/1256 evaluating HellaSwag: 1150/1256 evaluating HellaSwag: 1160/1256 evaluating HellaSwag: 1170/1256 evaluating HellaSwag: 1180/1256 evaluating HellaSwag: 1190/1256 evaluating HellaSwag: 1200/1256 evaluating HellaSwag: 1210/1256 evaluating HellaSwag: 1220/1256 evaluating HellaSwag: 1230/1256 evaluating HellaSwag: 1240/1256 evaluating HellaSwag: 1250/1256 HellaSwag: 2656/10042 = 0.264489 step 3751/19560 | loss 3.652467 (+0.14z)| norm 0.2905 (+0.21z)| lr 5.62e-04 | 4161.49 ms | 32.4% bf16 MFU | 125373 tok/s step 3752/19560 | loss 3.673008 (+0.68z)| norm 0.3032 (+0.69z)| lr 5.62e-04 | 4179.74 ms | 32.3% bf16 MFU | 125376 tok/s step 3753/19560 | loss 3.686542 (+1.03z)| norm 0.3002 (+0.57z)| lr 5.62e-04 | 4163.05 ms | 32.4% bf16 MFU | 125404 tok/s step 3754/19560 | loss 3.596047 (-1.33z)| norm 0.4098 (+4.35z)| lr 5.62e-04 | 4160.68 ms | 32.5% bf16 MFU | 125434 tok/s step 3755/19560 | loss 3.648535 (+0.04z)| norm 
0.4070 (+3.96z)| lr 5.62e-04 | 4169.74 ms | 32.4% bf16 MFU | 125450 tok/s step 3756/19560 | loss 3.681361 (+0.90z)| norm 0.3847 (+3.08z)| lr 5.62e-04 | 4165.83 ms | 32.4% bf16 MFU | 125470 tok/s step 3757/19560 | loss 3.624730 (-0.58z)| norm 0.3449 (+1.77z)| lr 5.62e-04 | 4159.89 ms | 32.5% bf16 MFU | 125498 tok/s step 3758/19560 | loss 3.690421 (+1.12z)| norm 0.2937 (+0.14z)| lr 5.62e-04 | 4165.38 ms | 32.4% bf16 MFU | 125516 tok/s step 3759/19560 | loss 3.681149 (+0.87z)| norm 0.2998 (+0.33z)| lr 5.62e-04 | 4164.85 ms | 32.4% bf16 MFU | 125535 tok/s step 3760/19560 | loss 3.670807 (+0.59z)| norm 0.2730 (-0.52z)| lr 5.62e-04 | 4431.61 ms | 30.5% bf16 MFU | 125173 tok/s step 3761/19560 | loss 3.658988 (+0.28z)| norm 0.2816 (-0.26z)| lr 5.62e-04 | 4158.19 ms | 32.5% bf16 MFU | 125219 tok/s step 3762/19560 | loss 3.700083 (+1.35z)| norm 0.2804 (-0.30z)| lr 5.62e-04 | 4169.94 ms | 32.4% bf16 MFU | 125245 tok/s step 3763/19560 | loss 3.660473 (+0.31z)| norm 0.3005 (+0.34z)| lr 5.62e-04 | 4167.39 ms | 32.4% bf16 MFU | 125273 tok/s step 3764/19560 | loss 3.641951 (-0.18z)| norm 0.3044 (+0.45z)| lr 5.62e-04 | 4164.97 ms | 32.4% bf16 MFU | 125303 tok/s step 3765/19560 | loss 3.752270 (+2.64z)| norm 1.0746 (+10.30z)| lr 5.62e-04 | 4153.79 ms | 32.5% bf16 MFU | 125349 tok/s step 3766/19560 | loss 3.619694 (-0.75z)| norm 0.4852 (+2.42z)| lr 5.62e-04 | 4173.70 ms | 32.3% bf16 MFU | 125362 tok/s step 3767/19560 | loss 3.850058 (+4.65z)| norm 0.4511 (+1.93z)| lr 5.62e-04 | 4162.56 ms | 32.4% bf16 MFU | 125392 tok/s step 3768/19560 | loss 3.710801 (+1.38z)| norm 0.7619 (+5.19z)| lr 5.62e-04 | 4158.56 ms | 32.5% bf16 MFU | 125426 tok/s step 3769/19560 | loss 3.727401 (+1.73z)| norm 0.4604 (+1.74z)| lr 5.62e-04 | 4173.04 ms | 32.4% bf16 MFU | 125437 tok/s step 3770/19560 | loss 3.642168 (-0.22z)| norm 0.4215 (+1.29z)| lr 5.62e-04 | 4162.73 ms | 32.4% bf16 MFU | 125462 tok/s step 3771/19560 | loss 3.651630 (-0.01z)| norm 0.3217 (+0.17z)| lr 5.62e-04 | 4160.00 ms | 32.5% bf16 MFU | 
125491 tok/s step 3772/19560 | loss 3.694953 (+0.97z)| norm 0.3383 (+0.35z)| lr 5.62e-04 | 4164.30 ms | 32.4% bf16 MFU | 125511 tok/s step 3773/19560 | loss 3.594604 (-1.32z)| norm 0.3263 (+0.21z)| lr 5.62e-04 | 4164.85 ms | 32.4% bf16 MFU | 125530 tok/s step 3774/19560 | loss 3.630676 (-0.48z)| norm 0.3215 (+0.16z)| lr 5.62e-04 | 4159.56 ms | 32.5% bf16 MFU | 125555 tok/s step 3775/19560 | loss 3.646120 (-0.15z)| norm 0.2917 (-0.18z)| lr 5.62e-04 | 4162.78 ms | 32.4% bf16 MFU | 125575 tok/s step 3776/19560 | loss 3.684310 (+0.73z)| norm 0.2916 (-0.18z)| lr 5.61e-04 | 4158.25 ms | 32.5% bf16 MFU | 125600 tok/s step 3777/19560 | loss 3.680273 (+0.63z)| norm 0.2852 (-0.24z)| lr 5.61e-04 | 4165.23 ms | 32.4% bf16 MFU | 125614 tok/s step 3778/19560 | loss 3.669071 (+0.37z)| norm 0.2624 (-0.49z)| lr 5.61e-04 | 4167.45 ms | 32.4% bf16 MFU | 125624 tok/s step 3779/19560 | loss 3.751446 (+2.22z)| norm 0.2831 (-0.26z)| lr 5.61e-04 | 4181.21 ms | 32.3% bf16 MFU | 125612 tok/s step 3780/19560 | loss 3.699507 (+1.03z)| norm 0.3091 (+0.02z)| lr 5.61e-04 | 4166.68 ms | 32.4% bf16 MFU | 125623 tok/s step 3781/19560 | loss 3.604179 (-1.13z)| norm 0.2767 (-0.33z)| lr 5.61e-04 | 4163.44 ms | 32.4% bf16 MFU | 125638 tok/s step 3782/19560 | loss 3.718727 (+1.44z)| norm 0.2957 (-0.12z)| lr 5.61e-04 | 4161.96 ms | 32.4% bf16 MFU | 125655 tok/s step 3783/19560 | loss 3.642099 (-0.28z)| norm 0.3004 (-0.07z)| lr 5.61e-04 | 4162.33 ms | 32.4% bf16 MFU | 125670 tok/s step 3784/19560 | loss 3.650600 (-0.09z)| norm 0.2798 (-0.30z)| lr 5.61e-04 | 4159.66 ms | 32.5% bf16 MFU | 125689 tok/s step 3785/19560 | loss 3.614822 (-0.89z)| norm 0.2562 (-0.56z)| lr 5.61e-04 | 4181.63 ms | 32.3% bf16 MFU | 125673 tok/s step 3786/19560 | loss 3.622238 (-0.72z)| norm 0.2440 (-0.69z)| lr 5.61e-04 | 4157.76 ms | 32.5% bf16 MFU | 125694 tok/s step 3787/19560 | loss 3.653991 (-0.03z)| norm 0.2408 (-0.72z)| lr 5.61e-04 | 4162.29 ms | 32.4% bf16 MFU | 125708 tok/s step 3788/19560 | loss 3.653241 (-0.04z)| norm 
0.2366 (-0.77z)| lr 5.61e-04 | 4168.53 ms | 32.4% bf16 MFU | 125711 tok/s step 3789/19560 | loss 3.601710 (-1.21z)| norm 0.2398 (-0.72z)| lr 5.61e-04 | 4172.94 ms | 32.4% bf16 MFU | 125707 tok/s step 3790/19560 | loss 3.611248 (-0.98z)| norm 0.2465 (-0.65z)| lr 5.61e-04 | 4155.20 ms | 32.5% bf16 MFU | 125731 tok/s step 3791/19560 | loss 3.581980 (-1.61z)| norm 0.2633 (-0.46z)| lr 5.61e-04 | 4160.33 ms | 32.5% bf16 MFU | 125745 tok/s step 3792/19560 | loss 3.635914 (-0.40z)| norm 0.2559 (-0.54z)| lr 5.61e-04 | 4162.34 ms | 32.4% bf16 MFU | 125756 tok/s step 3793/19560 | loss 3.695830 (+0.95z)| norm 0.2388 (-0.72z)| lr 5.61e-04 | 4168.54 ms | 32.4% bf16 MFU | 125757 tok/s step 3794/19560 | loss 3.667672 (+0.33z)| norm 0.2661 (-0.41z)| lr 5.61e-04 | 4210.36 ms | 32.1% bf16 MFU | 125695 tok/s step 3795/19560 | loss 3.627503 (-0.60z)| norm 0.2589 (-0.48z)| lr 5.61e-04 | 4154.98 ms | 32.5% bf16 MFU | 125720 tok/s step 3796/19560 | loss 3.587259 (-1.50z)| norm 0.2767 (-0.28z)| lr 5.61e-04 | 4165.69 ms | 32.4% bf16 MFU | 125727 tok/s step 3797/19560 | loss 3.665879 (+0.30z)| norm 0.3014 (-0.01z)| lr 5.61e-04 | 4166.31 ms | 32.4% bf16 MFU | 125732 tok/s step 3798/19560 | loss 3.620480 (-0.75z)| norm 0.3006 (-0.02z)| lr 5.61e-04 | 4164.44 ms | 32.4% bf16 MFU | 125740 tok/s step 3799/19560 | loss 3.684693 (+0.72z)| norm 0.3368 (+0.37z)| lr 5.61e-04 | 4162.58 ms | 32.4% bf16 MFU | 125751 tok/s step 3800/19560 | loss 3.607325 (-1.08z)| norm 0.3417 (+0.42z)| lr 5.61e-04 | 4166.63 ms | 32.4% bf16 MFU | 125755 tok/s step 3801/19560 | loss 3.653876 (+0.00z)| norm 0.2629 (-0.44z)| lr 5.61e-04 | 4163.38 ms | 32.4% bf16 MFU | 125764 tok/s step 3802/19560 | loss 3.739514 (+1.97z)| norm 0.3023 (-0.01z)| lr 5.61e-04 | 4156.22 ms | 32.5% bf16 MFU | 125783 tok/s step 3803/19560 | loss 3.658232 (+0.09z)| norm 0.2810 (-0.24z)| lr 5.61e-04 | 4176.54 ms | 32.3% bf16 MFU | 125770 tok/s step 3804/19560 | loss 3.586429 (-1.53z)| norm 0.2577 (-0.50z)| lr 5.61e-04 | 4161.31 ms | 32.4% bf16 MFU | 
125781 tok/s step 3805/19560 | loss 3.696627 (+0.98z)| norm 0.2726 (-0.34z)| lr 5.61e-04 | 4169.45 ms | 32.4% bf16 MFU | 125779 tok/s step 3806/19560 | loss 3.681413 (+0.63z)| norm 0.2452 (-0.64z)| lr 5.61e-04 | 4162.33 ms | 32.4% bf16 MFU | 125788 tok/s step 3807/19560 | loss 3.653652 (-0.01z)| norm 0.2654 (-0.41z)| lr 5.61e-04 | 4156.74 ms | 32.5% bf16 MFU | 125806 tok/s step 3808/19560 | loss 3.652117 (-0.04z)| norm 0.2941 (-0.10z)| lr 5.61e-04 | 4158.75 ms | 32.5% bf16 MFU | 125819 tok/s step 3809/19560 | loss 3.699674 (+1.05z)| norm 0.2777 (-0.28z)| lr 5.61e-04 | 4166.68 ms | 32.4% bf16 MFU | 125819 tok/s step 3810/19560 | loss 3.651260 (-0.06z)| norm 0.2768 (-0.29z)| lr 5.61e-04 | 4157.38 ms | 32.5% bf16 MFU | 125834 tok/s step 3811/19560 | loss 3.625861 (-0.64z)| norm 0.2794 (-0.27z)| lr 5.61e-04 | 4161.75 ms | 32.4% bf16 MFU | 125841 tok/s step 3812/19560 | loss 3.616254 (-0.84z)| norm 0.2758 (-0.31z)| lr 5.61e-04 | 4162.02 ms | 32.4% bf16 MFU | 125847 tok/s step 3813/19560 | loss 3.613020 (-0.95z)| norm 0.3082 (+0.05z)| lr 5.61e-04 | 4159.04 ms | 32.5% bf16 MFU | 125858 tok/s step 3814/19560 | loss 3.632605 (-0.48z)| norm 0.3182 (+0.15z)| lr 5.61e-04 | 4158.67 ms | 32.5% bf16 MFU | 125869 tok/s step 3815/19560 | loss 3.655247 (+0.05z)| norm 0.2503 (-0.60z)| lr 5.61e-04 | 4166.48 ms | 32.4% bf16 MFU | 125867 tok/s step 3816/19560 | loss 3.637281 (-0.37z)| norm 0.2752 (-0.32z)| lr 5.61e-04 | 4159.34 ms | 32.5% bf16 MFU | 125876 tok/s step 3817/19560 | loss 3.635699 (-0.41z)| norm 0.2504 (-0.60z)| lr 5.60e-04 | 4162.07 ms | 32.4% bf16 MFU | 125881 tok/s step 3818/19560 | loss 3.698139 (+1.03z)| norm 0.2674 (-0.41z)| lr 5.60e-04 | 4155.26 ms | 32.5% bf16 MFU | 125895 tok/s step 3819/19560 | loss 3.660761 (+0.16z)| norm 0.3039 (-0.01z)| lr 5.60e-04 | 4153.79 ms | 32.5% bf16 MFU | 125912 tok/s step 3820/19560 | loss 3.647776 (-0.14z)| norm 0.3040 (-0.01z)| lr 5.60e-04 | 4159.51 ms | 32.5% bf16 MFU | 125918 tok/s step 3821/19560 | loss 3.623221 (-0.72z)| norm 
0.2765 (-0.32z)| lr 5.60e-04 | 4170.65 ms | 32.4% bf16 MFU | 125908 tok/s step 3822/19560 | loss 3.623399 (-0.71z)| norm 0.2721 (-0.37z)| lr 5.60e-04 | 4167.86 ms | 32.4% bf16 MFU | 125902 tok/s step 3823/19560 | loss 3.673226 (+0.46z)| norm 0.3240 (+0.21z)| lr 5.60e-04 | 4161.95 ms | 32.4% bf16 MFU | 125906 tok/s step 3824/19560 | loss 3.643117 (-0.24z)| norm 0.2780 (-0.30z)| lr 5.60e-04 | 4157.93 ms | 32.5% bf16 MFU | 125915 tok/s step 3825/19560 | loss 3.678485 (+0.58z)| norm 0.2922 (-0.14z)| lr 5.60e-04 | 4158.69 ms | 32.5% bf16 MFU | 125923 tok/s step 3826/19560 | loss 3.624089 (-0.68z)| norm 0.2867 (-0.19z)| lr 5.60e-04 | 4172.62 ms | 32.4% bf16 MFU | 125909 tok/s step 3827/19560 | loss 3.683560 (+0.72z)| norm 0.3153 (+0.12z)| lr 5.60e-04 | 4173.23 ms | 32.4% bf16 MFU | 125895 tok/s step 3828/19560 | loss 3.623162 (-0.71z)| norm 0.2809 (-0.25z)| lr 5.60e-04 | 4158.06 ms | 32.5% bf16 MFU | 125905 tok/s step 3829/19560 | loss 3.702384 (+1.15z)| norm 0.2686 (-0.39z)| lr 5.60e-04 | 4146.27 ms | 32.6% bf16 MFU | 125932 tok/s step 3830/19560 | loss 3.598881 (-1.27z)| norm 0.2951 (-0.10z)| lr 5.60e-04 | 4165.85 ms | 32.4% bf16 MFU | 125928 tok/s step 3831/19560 | loss 3.625899 (-0.63z)| norm 0.2874 (-0.18z)| lr 5.60e-04 | 4173.27 ms | 32.4% bf16 MFU | 125913 tok/s step 3832/19560 | loss 3.601689 (-1.19z)| norm 0.2850 (-0.21z)| lr 5.60e-04 | 4162.71 ms | 32.4% bf16 MFU | 125915 tok/s step 3833/19560 | loss 3.610312 (-0.98z)| norm 0.2930 (-0.13z)| lr 5.60e-04 | 4156.21 ms | 32.5% bf16 MFU | 125927 tok/s step 3834/19560 | loss 3.660861 (+0.21z)| norm 0.2744 (-0.33z)| lr 5.60e-04 | 4156.23 ms | 32.5% bf16 MFU | 125937 tok/s step 3835/19560 | loss 3.641356 (-0.25z)| norm 0.3105 (+0.07z)| lr 5.60e-04 | 4156.68 ms | 32.5% bf16 MFU | 125947 tok/s step 3836/19560 | loss 3.589236 (-1.44z)| norm 0.3224 (+0.19z)| lr 5.60e-04 | 4160.88 ms | 32.4% bf16 MFU | 125950 tok/s step 3837/19560 | loss 3.641814 (-0.22z)| norm 0.2500 (-0.60z)| lr 5.60e-04 | 4159.23 ms | 32.5% bf16 MFU | 
125955 tok/s step 3838/19560 | loss 3.666921 (+0.36z)| norm 0.2844 (-0.22z)| lr 5.60e-04 | 4161.32 ms | 32.4% bf16 MFU | 125957 tok/s step 3839/19560 | loss 3.628672 (-0.53z)| norm 0.2747 (-0.33z)| lr 5.60e-04 | 4162.35 ms | 32.4% bf16 MFU | 125957 tok/s step 3840/19560 | loss 3.620535 (-0.71z)| norm 0.2800 (-0.27z)| lr 5.60e-04 | 4178.25 ms | 32.3% bf16 MFU | 125933 tok/s step 3841/19560 | loss 3.675683 (+0.56z)| norm 0.2979 (-0.07z)| lr 5.60e-04 | 4163.14 ms | 32.4% bf16 MFU | 125933 tok/s step 3842/19560 | loss 3.654261 (+0.06z)| norm 0.2867 (-0.20z)| lr 5.60e-04 | 4167.72 ms | 32.4% bf16 MFU | 125927 tok/s step 3843/19560 | loss 3.611356 (-0.92z)| norm 0.2822 (-0.25z)| lr 5.60e-04 | 4159.99 ms | 32.5% bf16 MFU | 125932 tok/s step 3844/19560 | loss 3.653708 (+0.05z)| norm 0.3248 (+0.22z)| lr 5.60e-04 | 4158.21 ms | 32.5% bf16 MFU | 125939 tok/s step 3845/19560 | loss 3.596482 (-1.27z)| norm 0.3244 (+0.21z)| lr 5.60e-04 | 4165.58 ms | 32.4% bf16 MFU | 125936 tok/s step 3846/19560 | loss 3.619242 (-0.73z)| norm 0.2733 (-0.35z)| lr 5.60e-04 | 4151.97 ms | 32.5% bf16 MFU | 125953 tok/s step 3847/19560 | loss 3.620768 (-0.68z)| norm 0.3051 (-0.00z)| lr 5.60e-04 | 4147.78 ms | 32.6% bf16 MFU | 125975 tok/s step 3848/19560 | loss 3.689825 (+0.94z)| norm 0.2688 (-0.40z)| lr 5.60e-04 | 4163.46 ms | 32.4% bf16 MFU | 125973 tok/s step 3849/19560 | loss 3.636153 (-0.33z)| norm 0.4830 (+1.93z)| lr 5.60e-04 | 4160.46 ms | 32.5% bf16 MFU | 125975 tok/s step 3850/19560 | loss 3.636437 (-0.33z)| norm 0.2800 (-0.29z)| lr 5.60e-04 | 4150.55 ms | 32.5% bf16 MFU | 125992 tok/s step 3851/19560 | loss 3.631257 (-0.48z)| norm 0.3082 (+0.02z)| lr 5.60e-04 | 4168.39 ms | 32.4% bf16 MFU | 125981 tok/s step 3852/19560 | loss 3.641878 (-0.24z)| norm 0.2763 (-0.33z)| lr 5.60e-04 | 4162.08 ms | 32.4% bf16 MFU | 125981 tok/s step 3853/19560 | loss 3.610307 (-1.02z)| norm 0.2859 (-0.22z)| lr 5.60e-04 | 4167.52 ms | 32.4% bf16 MFU | 125972 tok/s step 3854/19560 | loss 3.679724 (+0.69z)| norm 
0.2976 (-0.09z)| lr 5.60e-04 | 4162.70 ms | 32.4% bf16 MFU | 125971 tok/s step 3855/19560 | loss 3.585815 (-1.60z)| norm 0.2682 (-0.42z)| lr 5.60e-04 | 4159.37 ms | 32.5% bf16 MFU | 125974 tok/s step 3856/19560 | loss 3.621410 (-0.72z)| norm 0.2961 (-0.11z)| lr 5.60e-04 | 4153.77 ms | 32.5% bf16 MFU | 125987 tok/s step 3857/19560 | loss 3.622428 (-0.69z)| norm 0.2955 (-0.12z)| lr 5.59e-04 | 4164.10 ms | 32.4% bf16 MFU | 125983 tok/s step 3858/19560 | loss 3.623370 (-0.66z)| norm 0.2938 (-0.14z)| lr 5.59e-04 | 4170.01 ms | 32.4% bf16 MFU | 125970 tok/s step 3859/19560 | loss 3.638560 (-0.30z)| norm 0.2695 (-0.41z)| lr 5.59e-04 | 4165.26 ms | 32.4% bf16 MFU | 125965 tok/s step 3860/19560 | loss 3.667062 (+0.39z)| norm 0.2736 (-0.36z)| lr 5.59e-04 | 4150.63 ms | 32.5% bf16 MFU | 125983 tok/s step 3861/19560 | loss 3.628906 (-0.55z)| norm 0.2748 (-0.35z)| lr 5.59e-04 | 4157.84 ms | 32.5% bf16 MFU | 125988 tok/s step 3862/19560 | loss 3.666329 (+0.39z)| norm 0.2915 (-0.17z)| lr 5.59e-04 | 4167.57 ms | 32.4% bf16 MFU | 125979 tok/s step 3863/19560 | loss 3.685496 (+0.88z)| norm 0.3241 (+0.19z)| lr 5.59e-04 | 4148.94 ms | 32.5% bf16 MFU | 125998 tok/s step 3864/19560 | loss 3.649243 (-0.04z)| norm 0.3192 (+0.13z)| lr 5.59e-04 | 4164.78 ms | 32.4% bf16 MFU | 125993 tok/s step 3865/19560 | loss 3.681569 (+0.77z)| norm 0.2914 (-0.18z)| lr 5.59e-04 | 4155.33 ms | 32.5% bf16 MFU | 126002 tok/s step 3866/19560 | loss 3.621304 (-0.73z)| norm 0.2997 (-0.09z)| lr 5.59e-04 | 4166.34 ms | 32.4% bf16 MFU | 125994 tok/s step 3867/19560 | loss 3.724771 (+1.82z)| norm 0.3133 (+0.06z)| lr 5.59e-04 | 4163.45 ms | 32.4% bf16 MFU | 125990 tok/s step 3868/19560 | loss 3.657843 (+0.18z)| norm 0.2949 (-0.14z)| lr 5.59e-04 | 4157.10 ms | 32.5% bf16 MFU | 125997 tok/s step 3869/19560 | loss 3.637239 (-0.34z)| norm 0.2824 (-0.27z)| lr 5.59e-04 | 4160.61 ms | 32.5% bf16 MFU | 125997 tok/s step 3870/19560 | loss 3.639197 (-0.29z)| norm 0.2893 (-0.20z)| lr 5.59e-04 | 4148.91 ms | 32.5% bf16 MFU | 
126016 tok/s step 3871/19560 | loss 3.700142 (+1.21z)| norm 0.2808 (-0.29z)| lr 5.59e-04 | 4158.84 ms | 32.5% bf16 MFU | 126018 tok/s step 3872/19560 | loss 3.664805 (+0.33z)| norm 0.2786 (-0.32z)| lr 5.59e-04 | 4150.80 ms | 32.5% bf16 MFU | 126033 tok/s step 3873/19560 | loss 3.530842 (-2.90z)| norm 0.2795 (-0.31z)| lr 5.59e-04 | 4159.02 ms | 32.5% bf16 MFU | 126034 tok/s step 3874/19560 | loss 3.599857 (-1.23z)| norm 0.2699 (-0.41z)| lr 5.59e-04 | 4160.79 ms | 32.4% bf16 MFU | 126033 tok/s step 3875/19560 | loss 3.664211 (+0.31z)| norm 0.2959 (-0.12z)| lr 5.59e-04 | 4161.48 ms | 32.4% bf16 MFU | 126031 tok/s step 3876/19560 | loss 3.685675 (+0.84z)| norm 0.2703 (-0.40z)| lr 5.59e-04 | 4156.04 ms | 32.5% bf16 MFU | 126037 tok/s step 3877/19560 | loss 3.644849 (-0.15z)| norm 0.2669 (-0.43z)| lr 5.59e-04 | 4159.33 ms | 32.5% bf16 MFU | 126037 tok/s step 3878/19560 | loss 3.611702 (-0.94z)| norm 0.2650 (-0.45z)| lr 5.59e-04 | 4151.73 ms | 32.5% bf16 MFU | 126050 tok/s step 3879/19560 | loss 3.586942 (-1.52z)| norm 0.2598 (-0.50z)| lr 5.59e-04 | 4158.21 ms | 32.5% bf16 MFU | 126051 tok/s step 3880/19560 | loss 3.636520 (-0.32z)| norm 0.2735 (-0.35z)| lr 5.59e-04 | 4160.36 ms | 32.5% bf16 MFU | 126050 tok/s step 3881/19560 | loss 3.631284 (-0.44z)| norm 0.2982 (-0.08z)| lr 5.59e-04 | 4162.73 ms | 32.4% bf16 MFU | 126045 tok/s step 3882/19560 | loss 3.620067 (-0.72z)| norm 0.2815 (-0.25z)| lr 5.59e-04 | 4159.82 ms | 32.5% bf16 MFU | 126044 tok/s step 3883/19560 | loss 3.596210 (-1.27z)| norm 0.2549 (-0.53z)| lr 5.59e-04 | 4263.37 ms | 31.7% bf16 MFU | 125891 tok/s step 3884/19560 | loss 3.608085 (-0.97z)| norm 0.2736 (-0.32z)| lr 5.59e-04 | 4166.77 ms | 32.4% bf16 MFU | 125888 tok/s step 3885/19560 | loss 3.695527 (+1.11z)| norm 0.2857 (-0.18z)| lr 5.59e-04 | 4155.86 ms | 32.5% bf16 MFU | 125901 tok/s step 3886/19560 | loss 3.627645 (-0.51z)| norm 0.3611 (+0.66z)| lr 5.59e-04 | 4166.62 ms | 32.4% bf16 MFU | 125897 tok/s step 3887/19560 | loss 3.602869 (-1.08z)| norm 
0.3238 (+0.24z)| lr 5.59e-04 | 4169.33 ms | 32.4% bf16 MFU | 125890 tok/s step 3888/19560 | loss 3.654591 (+0.16z)| norm 0.2879 (-0.16z)| lr 5.59e-04 | 4163.54 ms | 32.4% bf16 MFU | 125892 tok/s step 3889/19560 | loss 3.668707 (+0.49z)| norm 0.2884 (-0.16z)| lr 5.59e-04 | 4195.35 ms | 32.2% bf16 MFU | 125846 tok/s step 3890/19560 | loss 3.611456 (-0.86z)| norm 0.2696 (-0.36z)| lr 5.59e-04 | 4157.64 ms | 32.5% bf16 MFU | 125858 tok/s step 3891/19560 | loss 3.599193 (-1.14z)| norm 0.2769 (-0.28z)| lr 5.59e-04 | 4168.44 ms | 32.4% bf16 MFU | 125854 tok/s step 3892/19560 | loss 3.700343 (+1.26z)| norm 0.2817 (-0.23z)| lr 5.59e-04 | 4158.28 ms | 32.5% bf16 MFU | 125866 tok/s step 3893/19560 | loss 3.645562 (-0.02z)| norm 0.2482 (-0.81z)| lr 5.59e-04 | 4153.18 ms | 32.5% bf16 MFU | 125884 tok/s step 3894/19560 | loss 3.636694 (-0.24z)| norm 0.3184 (+0.43z)| lr 5.59e-04 | 4159.44 ms | 32.5% bf16 MFU | 125892 tok/s step 3895/19560 | loss 3.643648 (-0.04z)| norm 0.3574 (+1.17z)| lr 5.59e-04 | 4215.61 ms | 32.0% bf16 MFU | 125816 tok/s step 3896/19560 | loss 3.659886 (+0.42z)| norm 0.2579 (-0.89z)| lr 5.59e-04 | 4197.29 ms | 32.2% bf16 MFU | 125771 tok/s step 3897/19560 | loss 3.656502 (+0.35z)| norm 0.2727 (-0.48z)| lr 5.58e-04 | 4163.48 ms | 32.4% bf16 MFU | 125779 tok/s step 3898/19560 | loss 3.653316 (+0.26z)| norm 0.2839 (-0.10z)| lr 5.58e-04 | 4222.80 ms | 32.0% bf16 MFU | 125698 tok/s step 3899/19560 | loss 3.629787 (-0.40z)| norm 0.3083 (+0.72z)| lr 5.58e-04 | 4153.34 ms | 32.5% bf16 MFU | 125724 tok/s step 3900/19560 | loss 3.637106 (-0.18z)| norm 0.3205 (+1.14z)| lr 5.58e-04 | 4165.00 ms | 32.4% bf16 MFU | 125732 tok/s step 3901/19560 | loss 3.648297 (+0.12z)| norm 0.2902 (+0.13z)| lr 5.58e-04 | 4160.53 ms | 32.5% bf16 MFU | 125746 tok/s step 3902/19560 | loss 3.638758 (-0.15z)| norm 0.2848 (-0.05z)| lr 5.58e-04 | 4150.87 ms | 32.5% bf16 MFU | 125774 tok/s step 3903/19560 | loss 3.661908 (+0.51z)| norm 0.2770 (-0.31z)| lr 5.58e-04 | 4204.52 ms | 32.1% bf16 MFU | 
125720 tok/s step 3904/19560 | loss 3.645422 (+0.05z)| norm 0.3112 (+0.85z)| lr 5.58e-04 | 4405.79 ms | 30.6% bf16 MFU | 125384 tok/s step 3905/19560 | loss 3.687640 (+1.25z)| norm 0.3170 (+1.03z)| lr 5.58e-04 | 4156.46 ms | 32.5% bf16 MFU | 125422 tok/s step 3906/19560 | loss 3.605378 (-1.08z)| norm 0.3408 (+1.81z)| lr 5.58e-04 | 4161.29 ms | 32.4% bf16 MFU | 125451 tok/s step 3907/19560 | loss 3.588084 (-1.58z)| norm 0.3324 (+1.50z)| lr 5.58e-04 | 4164.55 ms | 32.4% bf16 MFU | 125473 tok/s step 3908/19560 | loss 3.670088 (+0.83z)| norm 0.2708 (-0.55z)| lr 5.58e-04 | 4170.00 ms | 32.4% bf16 MFU | 125485 tok/s step 3909/19560 | loss 3.660597 (+0.54z)| norm 0.2834 (-0.13z)| lr 5.58e-04 | 4383.70 ms | 30.8% bf16 MFU | 125191 tok/s step 3910/19560 | loss 3.651178 (+0.28z)| norm 0.2843 (-0.10z)| lr 5.58e-04 | 4155.10 ms | 32.5% bf16 MFU | 125241 tok/s step 3911/19560 | loss 3.592566 (-1.47z)| norm 0.3271 (+1.32z)| lr 5.58e-04 | 4202.92 ms | 32.1% bf16 MFU | 125216 tok/s step 3912/19560 | loss 3.644824 (+0.10z)| norm 0.2944 (+0.23z)| lr 5.58e-04 | 4437.48 ms | 30.4% bf16 MFU | 124862 tok/s step 3913/19560 | loss 3.617246 (-0.73z)| norm 0.3531 (+2.12z)| lr 5.58e-04 | 4207.21 ms | 32.1% bf16 MFU | 124850 tok/s step 3914/19560 | loss 3.640413 (-0.03z)| norm 0.3067 (+0.59z)| lr 5.58e-04 | 4165.94 ms | 32.4% bf16 MFU | 124900 tok/s step 3915/19560 | loss 3.592419 (-1.45z)| norm 0.3058 (+0.55z)| lr 5.58e-04 | 4152.89 ms | 32.5% bf16 MFU | 124968 tok/s step 3916/19560 | loss 3.759876 (+3.37z)| norm 0.2774 (-0.41z)| lr 5.58e-04 | 4209.97 ms | 32.1% bf16 MFU | 124946 tok/s step 3917/19560 | loss 3.653223 (+0.31z)| norm 0.2900 (+0.00z)| lr 5.58e-04 | 4169.45 ms | 32.4% bf16 MFU | 124986 tok/s step 3918/19560 | loss 3.616405 (-0.74z)| norm 0.2836 (-0.23z)| lr 5.58e-04 | 4180.19 ms | 32.3% bf16 MFU | 125008 tok/s step 3919/19560 | loss 3.584221 (-1.67z)| norm 0.2636 (-0.92z)| lr 5.58e-04 | 4174.40 ms | 32.3% bf16 MFU | 125037 tok/s step 3920/19560 | loss 3.615825 (-0.76z)| norm 
0.2684 (-0.76z)| lr 5.58e-04 | 4218.44 ms | 32.0% bf16 MFU | 124999 tok/s step 3921/19560 | loss 3.609882 (-0.91z)| norm 0.3075 (+0.58z)| lr 5.58e-04 | 4183.34 ms | 32.3% bf16 MFU | 125016 tok/s step 3922/19560 | loss 3.645816 (+0.13z)| norm 0.3249 (+1.17z)| lr 5.58e-04 | 4223.60 ms | 32.0% bf16 MFU | 124972 tok/s step 3923/19560 | loss 3.656072 (+0.42z)| norm 0.2809 (-0.37z)| lr 5.58e-04 | 4167.90 ms | 32.4% bf16 MFU | 125013 tok/s step 3924/19560 | loss 3.588419 (-1.54z)| norm 0.2650 (-0.92z)| lr 5.58e-04 | 4165.52 ms | 32.4% bf16 MFU | 125055 tok/s step 3925/19560 | loss 3.632607 (-0.25z)| norm 0.2835 (-0.27z)| lr 5.58e-04 | 4174.54 ms | 32.3% bf16 MFU | 125082 tok/s step 3926/19560 | loss 3.829544 (+4.88z)| norm 0.3288 (+1.29z)| lr 5.58e-04 | 4176.45 ms | 32.3% bf16 MFU | 125105 tok/s step 3927/19560 | loss 3.681225 (+1.00z)| norm 0.2863 (-0.17z)| lr 5.58e-04 | 4181.35 ms | 32.3% bf16 MFU | 125119 tok/s step 3928/19560 | loss 3.672413 (+0.76z)| norm 0.2926 (+0.07z)| lr 5.58e-04 | 4157.09 ms | 32.5% bf16 MFU | 125169 tok/s step 3929/19560 | loss 3.642712 (-0.02z)| norm 0.2758 (-0.53z)| lr 5.58e-04 | 4157.19 ms | 32.5% bf16 MFU | 125216 tok/s step 3930/19560 | loss 3.636908 (-0.15z)| norm 0.2836 (-0.25z)| lr 5.58e-04 | 4159.58 ms | 32.5% bf16 MFU | 125258 tok/s step 3931/19560 | loss 3.598944 (-1.15z)| norm 0.2719 (-0.66z)| lr 5.58e-04 | 4181.22 ms | 32.3% bf16 MFU | 125264 tok/s step 3932/19560 | loss 3.624109 (-0.49z)| norm 0.2821 (-0.31z)| lr 5.58e-04 | 4153.82 ms | 32.5% bf16 MFU | 125312 tok/s step 3933/19560 | loss 3.672281 (+0.82z)| norm 0.2591 (-1.12z)| lr 5.58e-04 | 4153.81 ms | 32.5% bf16 MFU | 125357 tok/s step 3934/19560 | loss 3.664391 (+0.61z)| norm 0.2614 (-1.05z)| lr 5.58e-04 | 4153.98 ms | 32.5% bf16 MFU | 125400 tok/s step 3935/19560 | loss 3.644138 (+0.06z)| norm 0.2702 (-0.74z)| lr 5.58e-04 | 4170.53 ms | 32.4% bf16 MFU | 125416 tok/s step 3936/19560 | loss 3.603796 (-1.02z)| norm 0.2569 (-1.20z)| lr 5.57e-04 | 4158.46 ms | 32.5% bf16 MFU | 
125449 tok/s step 3937/19560 | loss 3.621673 (-0.53z)| norm 0.2853 (-0.19z)| lr 5.57e-04 | 4160.05 ms | 32.5% bf16 MFU | 125478 tok/s step 3938/19560 | loss 3.647491 (+0.18z)| norm 0.3084 (+0.62z)| lr 5.57e-04 | 4165.59 ms | 32.4% bf16 MFU | 125497 tok/s step 3939/19560 | loss 3.580420 (-1.63z)| norm 0.3022 (+0.40z)| lr 5.57e-04 | 4148.60 ms | 32.5% bf16 MFU | 125541 tok/s step 3940/19560 | loss 3.662631 (+0.59z)| norm 0.3174 (+0.92z)| lr 5.57e-04 | 4164.52 ms | 32.4% bf16 MFU | 125559 tok/s step 3941/19560 | loss 3.624373 (-0.45z)| norm 0.2889 (-0.08z)| lr 5.57e-04 | 4159.52 ms | 32.5% bf16 MFU | 125583 tok/s step 3942/19560 | loss 3.588167 (-1.41z)| norm 0.2857 (-0.19z)| lr 5.57e-04 | 4158.37 ms | 32.5% bf16 MFU | 125608 tok/s step 3943/19560 | loss 3.588673 (-1.38z)| norm 0.2968 (+0.19z)| lr 5.57e-04 | 4166.54 ms | 32.4% bf16 MFU | 125619 tok/s step 3944/19560 | loss 3.625115 (-0.40z)| norm 0.2784 (-0.46z)| lr 5.57e-04 | 4151.13 ms | 32.5% bf16 MFU | 125653 tok/s step 3945/19560 | loss 3.565058 (-1.96z)| norm 0.2983 (+0.24z)| lr 5.57e-04 | 4156.82 ms | 32.5% bf16 MFU | 125677 tok/s step 3946/19560 | loss 3.567756 (-1.85z)| norm 0.2952 (+0.12z)| lr 5.57e-04 | 4152.84 ms | 32.5% bf16 MFU | 125705 tok/s step 3947/19560 | loss 3.602193 (-0.94z)| norm 0.2729 (-0.68z)| lr 5.57e-04 | 4152.56 ms | 32.5% bf16 MFU | 125733 tok/s step 3948/19560 | loss 3.626531 (-0.30z)| norm 0.2699 (-0.78z)| lr 5.57e-04 | 4252.49 ms | 31.8% bf16 MFU | 125611 tok/s step 3949/19560 | loss 3.617303 (-0.54z)| norm 0.2878 (-0.14z)| lr 5.57e-04 | 4166.45 ms | 32.4% bf16 MFU | 125622 tok/s step 3950/19560 | loss 3.642643 (+0.12z)| norm 0.2497 (-1.50z)| lr 5.57e-04 | 4171.23 ms | 32.4% bf16 MFU | 125626 tok/s step 3951/19560 | loss 3.610346 (-0.71z)| norm 0.2790 (-0.43z)| lr 5.57e-04 | 4155.61 ms | 32.5% bf16 MFU | 125652 tok/s step 3952/19560 | loss 3.556301 (-2.07z)| norm 0.2819 (-0.33z)| lr 5.57e-04 | 4374.86 ms | 30.9% bf16 MFU | 125362 tok/s step 3953/19560 | loss 3.621316 (-0.39z)| norm 
0.2608 (-1.08z)| lr 5.57e-04 | 4149.62 ms | 32.5% bf16 MFU | 125411 tok/s step 3954/19560 | loss 3.636430 (-0.00z)| norm 0.2536 (-1.32z)| lr 5.57e-04 | 4154.60 ms | 32.5% bf16 MFU | 125450 tok/s step 3955/19560 | loss 3.637852 (+0.05z)| norm 0.2914 (+0.04z)| lr 5.57e-04 | 4163.13 ms | 32.4% bf16 MFU | 125475 tok/s step 3956/19560 | loss 3.632607 (-0.09z)| norm 0.2471 (-1.53z)| lr 5.57e-04 | 4152.10 ms | 32.5% bf16 MFU | 125514 tok/s step 3957/19560 | loss 3.606233 (-0.77z)| norm 0.2602 (-1.06z)| lr 5.57e-04 | 4227.60 ms | 31.9% bf16 MFU | 125439 tok/s step 3958/19560 | loss 3.602084 (-0.88z)| norm 0.2743 (-0.55z)| lr 5.57e-04 | 4152.15 ms | 32.5% bf16 MFU | 125481 tok/s step 3959/19560 | loss 3.601832 (-0.88z)| norm 0.2594 (-1.07z)| lr 5.57e-04 | 4170.15 ms | 32.4% bf16 MFU | 125493 tok/s step 3960/19560 | loss 3.630934 (-0.12z)| norm 0.2635 (-0.92z)| lr 5.57e-04 | 4165.60 ms | 32.4% bf16 MFU | 125511 tok/s step 3961/19560 | loss 3.624086 (-0.30z)| norm 0.2561 (-1.16z)| lr 5.57e-04 | 4162.52 ms | 32.4% bf16 MFU | 125534 tok/s step 3962/19560 | loss 3.608221 (-0.71z)| norm 0.2597 (-1.03z)| lr 5.57e-04 | 4164.84 ms | 32.4% bf16 MFU | 125551 tok/s step 3963/19560 | loss 3.615732 (-0.51z)| norm 0.2934 (+0.16z)| lr 5.57e-04 | 4151.18 ms | 32.5% bf16 MFU | 125588 tok/s step 3964/19560 | loss 3.579040 (-1.46z)| norm 0.3018 (+0.46z)| lr 5.57e-04 | 4199.10 ms | 32.2% bf16 MFU | 125552 tok/s step 3965/19560 | loss 3.644728 (+0.26z)| norm 0.3109 (+0.77z)| lr 5.57e-04 | 4149.30 ms | 32.5% bf16 MFU | 125592 tok/s step 3966/19560 | loss 3.598341 (-0.94z)| norm 0.3161 (+0.94z)| lr 5.57e-04 | 4579.23 ms | 29.5% bf16 MFU | 125037 tok/s step 3967/19560 | loss 3.650167 (+0.41z)| norm 0.2887 (-0.03z)| lr 5.57e-04 | 4162.69 ms | 32.4% bf16 MFU | 125083 tok/s step 3968/19560 | loss 3.608438 (-0.68z)| norm 0.2706 (-0.67z)| lr 5.57e-04 | 4168.10 ms | 32.4% bf16 MFU | 125118 tok/s step 3969/19560 | loss 3.613948 (-0.53z)| norm 0.2856 (-0.14z)| lr 5.57e-04 | 4195.13 ms | 32.2% bf16 MFU | 
125111 tok/s step 3970/19560 | loss 3.625513 (-0.22z)| norm 0.2934 (+0.14z)| lr 5.57e-04 | 4147.67 ms | 32.6% bf16 MFU | 125176 tok/s step 3971/19560 | loss 3.650907 (+0.44z)| norm 0.2818 (-0.27z)| lr 5.57e-04 | 4176.51 ms | 32.3% bf16 MFU | 125193 tok/s step 3972/19560 | loss 3.643016 (+0.24z)| norm 0.3292 (+1.40z)| lr 5.57e-04 | 4157.76 ms | 32.5% bf16 MFU | 125239 tok/s step 3973/19560 | loss 3.624455 (-0.26z)| norm 0.3009 (+0.41z)| lr 5.57e-04 | 4155.01 ms | 32.5% bf16 MFU | 125286 tok/s step 3974/19560 | loss 3.598479 (-0.94z)| norm 0.2950 (+0.19z)| lr 5.57e-04 | 4150.40 ms | 32.5% bf16 MFU | 125338 tok/s step 3975/19560 | loss 3.663908 (+0.77z)| norm 0.2907 (+0.05z)| lr 5.56e-04 | 4155.94 ms | 32.5% bf16 MFU | 125378 tok/s step 3976/19560 | loss 3.584387 (-1.29z)| norm 0.3058 (+0.57z)| lr 5.56e-04 | 4154.66 ms | 32.5% bf16 MFU | 125419 tok/s step 3977/19560 | loss 3.636948 (+0.09z)| norm 0.3020 (+0.61z)| lr 5.56e-04 | 4165.50 ms | 32.4% bf16 MFU | 125441 tok/s step 3978/19560 | loss 3.605526 (-0.73z)| norm 0.2711 (-0.76z)| lr 5.56e-04 | 4154.44 ms | 32.5% bf16 MFU | 125479 tok/s step 3979/19560 | loss 3.583504 (-1.29z)| norm 0.2775 (-0.47z)| lr 5.56e-04 | 4147.86 ms | 32.6% bf16 MFU | 125525 tok/s step 3980/19560 | loss 3.687639 (+1.41z)| norm 0.2817 (-0.28z)| lr 5.56e-04 | 4148.76 ms | 32.5% bf16 MFU | 125568 tok/s step 3981/19560 | loss 3.540792 (-2.33z)| norm 0.3036 (+0.69z)| lr 5.56e-04 | 4194.68 ms | 32.2% bf16 MFU | 125539 tok/s step 3982/19560 | loss 3.608356 (-0.61z)| norm 0.2949 (+0.30z)| lr 5.56e-04 | 4156.42 ms | 32.5% bf16 MFU | 125569 tok/s step 3983/19560 | loss 3.644490 (+0.30z)| norm 0.2694 (-0.84z)| lr 5.56e-04 | 4149.34 ms | 32.5% bf16 MFU | 125608 tok/s step 3984/19560 | loss 3.587560 (-1.14z)| norm 0.2974 (+0.41z)| lr 5.56e-04 | 4160.07 ms | 32.5% bf16 MFU | 125629 tok/s step 3985/19560 | loss 3.621830 (-0.27z)| norm 0.3293 (+1.80z)| lr 5.56e-04 | 4161.79 ms | 32.4% bf16 MFU | 125646 tok/s step 3986/19560 | loss 3.618934 (-0.34z)| norm 
0.2716 (-0.73z)| lr 5.56e-04 | 4170.18 ms | 32.4% bf16 MFU | 125650 tok/s step 3987/19560 | loss 3.630616 (-0.04z)| norm 0.3105 (+0.97z)| lr 5.56e-04 | 4156.56 ms | 32.5% bf16 MFU | 125675 tok/s step 3988/19560 | loss 3.605807 (-0.66z)| norm 0.3071 (+0.80z)| lr 5.56e-04 | 4169.76 ms | 32.4% bf16 MFU | 125678 tok/s step 3989/19560 | loss 3.594501 (-0.94z)| norm 0.2784 (-0.46z)| lr 5.56e-04 | 4167.51 ms | 32.4% bf16 MFU | 125684 tok/s step 3990/19560 | loss 3.641588 (+0.26z)| norm 0.2665 (-0.97z)| lr 5.56e-04 | 4162.20 ms | 32.4% bf16 MFU | 125698 tok/s step 3991/19560 | loss 3.537656 (-2.33z)| norm 0.3282 (+1.73z)| lr 5.56e-04 | 4156.00 ms | 32.5% bf16 MFU | 125721 tok/s step 3992/19560 | loss 3.595294 (-0.86z)| norm 0.2877 (-0.03z)| lr 5.56e-04 | 4156.92 ms | 32.5% bf16 MFU | 125741 tok/s step 3993/19560 | loss 3.595678 (-0.84z)| norm 0.2905 (+0.09z)| lr 5.56e-04 | 4187.35 ms | 32.2% bf16 MFU | 125714 tok/s step 3994/19560 | loss 3.620713 (-0.21z)| norm 0.3201 (+1.38z)| lr 5.56e-04 | 4160.06 ms | 32.5% bf16 MFU | 125730 tok/s step 3995/19560 | loss 3.614732 (-0.35z)| norm 0.2876 (-0.03z)| lr 5.56e-04 | 4164.19 ms | 32.4% bf16 MFU | 125739 tok/s step 3996/19560 | loss 3.640670 (+0.32z)| norm 0.2799 (-0.37z)| lr 5.56e-04 | 4167.47 ms | 32.4% bf16 MFU | 125742 tok/s step 3997/19560 | loss 3.626340 (-0.04z)| norm 0.2659 (-0.97z)| lr 5.56e-04 | 4152.10 ms | 32.5% bf16 MFU | 125768 tok/s step 3998/19560 | loss 3.600110 (-0.71z)| norm 0.2775 (-0.46z)| lr 5.56e-04 | 4146.94 ms | 32.6% bf16 MFU | 125801 tok/s step 3999/19560 | loss 3.603870 (-0.60z)| norm 0.2625 (-1.10z)| lr 5.56e-04 | 4149.96 ms | 32.5% bf16 MFU | 125828 tok/s step 4000/19560 | loss 3.631638 (+0.13z)| norm 0.2838 (-0.18z)| lr 5.56e-04 | 4178.76 ms | 32.3% bf16 MFU | 125810 tok/s val loss 3.607753 evaluating HellaSwag: 0/1256 evaluating HellaSwag: 10/1256 evaluating HellaSwag: 20/1256 evaluating HellaSwag: 30/1256 evaluating HellaSwag: 40/1256 evaluating HellaSwag: 50/1256 evaluating HellaSwag: 60/1256 
evaluating HellaSwag: 70/1256 evaluating HellaSwag: 80/1256 evaluating HellaSwag: 90/1256 evaluating HellaSwag: 100/1256 evaluating HellaSwag: 110/1256 evaluating HellaSwag: 120/1256 evaluating HellaSwag: 130/1256 evaluating HellaSwag: 140/1256 evaluating HellaSwag: 150/1256 evaluating HellaSwag: 160/1256 evaluating HellaSwag: 170/1256 evaluating HellaSwag: 180/1256 evaluating HellaSwag: 190/1256 evaluating HellaSwag: 200/1256 evaluating HellaSwag: 210/1256 evaluating HellaSwag: 220/1256 evaluating HellaSwag: 230/1256 evaluating HellaSwag: 240/1256 evaluating HellaSwag: 250/1256 evaluating HellaSwag: 260/1256 evaluating HellaSwag: 270/1256 evaluating HellaSwag: 280/1256 evaluating HellaSwag: 290/1256 evaluating HellaSwag: 300/1256 evaluating HellaSwag: 310/1256 evaluating HellaSwag: 320/1256 evaluating HellaSwag: 330/1256 evaluating HellaSwag: 340/1256 evaluating HellaSwag: 350/1256 evaluating HellaSwag: 360/1256 evaluating HellaSwag: 370/1256 evaluating HellaSwag: 380/1256 evaluating HellaSwag: 390/1256 evaluating HellaSwag: 400/1256 evaluating HellaSwag: 410/1256 evaluating HellaSwag: 420/1256 evaluating HellaSwag: 430/1256 evaluating HellaSwag: 440/1256 evaluating HellaSwag: 450/1256 evaluating HellaSwag: 460/1256 evaluating HellaSwag: 470/1256 evaluating HellaSwag: 480/1256 evaluating HellaSwag: 490/1256 evaluating HellaSwag: 500/1256 evaluating HellaSwag: 510/1256 evaluating HellaSwag: 520/1256 evaluating HellaSwag: 530/1256 evaluating HellaSwag: 540/1256 evaluating HellaSwag: 550/1256 evaluating HellaSwag: 560/1256 evaluating HellaSwag: 570/1256 evaluating HellaSwag: 580/1256 evaluating HellaSwag: 590/1256 evaluating HellaSwag: 600/1256 evaluating HellaSwag: 610/1256 evaluating HellaSwag: 620/1256 evaluating HellaSwag: 630/1256 evaluating HellaSwag: 640/1256 evaluating HellaSwag: 650/1256 evaluating HellaSwag: 660/1256 evaluating HellaSwag: 670/1256 evaluating HellaSwag: 680/1256 evaluating HellaSwag: 690/1256 evaluating HellaSwag: 700/1256 evaluating 
HellaSwag: 710/1256 evaluating HellaSwag: 720/1256 evaluating HellaSwag: 730/1256 evaluating HellaSwag: 740/1256 evaluating HellaSwag: 750/1256 evaluating HellaSwag: 760/1256 evaluating HellaSwag: 770/1256 evaluating HellaSwag: 780/1256 evaluating HellaSwag: 790/1256 evaluating HellaSwag: 800/1256 evaluating HellaSwag: 810/1256 evaluating HellaSwag: 820/1256 evaluating HellaSwag: 830/1256 evaluating HellaSwag: 840/1256 evaluating HellaSwag: 850/1256 evaluating HellaSwag: 860/1256 evaluating HellaSwag: 870/1256 evaluating HellaSwag: 880/1256 evaluating HellaSwag: 890/1256 evaluating HellaSwag: 900/1256 evaluating HellaSwag: 910/1256 evaluating HellaSwag: 920/1256 evaluating HellaSwag: 930/1256 evaluating HellaSwag: 940/1256 evaluating HellaSwag: 950/1256 evaluating HellaSwag: 960/1256 evaluating HellaSwag: 970/1256 evaluating HellaSwag: 980/1256 evaluating HellaSwag: 990/1256 evaluating HellaSwag: 1000/1256 evaluating HellaSwag: 1010/1256 evaluating HellaSwag: 1020/1256 evaluating HellaSwag: 1030/1256 evaluating HellaSwag: 1040/1256 evaluating HellaSwag: 1050/1256 evaluating HellaSwag: 1060/1256 evaluating HellaSwag: 1070/1256 evaluating HellaSwag: 1080/1256 evaluating HellaSwag: 1090/1256 evaluating HellaSwag: 1100/1256 evaluating HellaSwag: 1110/1256 evaluating HellaSwag: 1120/1256 evaluating HellaSwag: 1130/1256 evaluating HellaSwag: 1140/1256 evaluating HellaSwag: 1150/1256 evaluating HellaSwag: 1160/1256 evaluating HellaSwag: 1170/1256 evaluating HellaSwag: 1180/1256 evaluating HellaSwag: 1190/1256 evaluating HellaSwag: 1200/1256 evaluating HellaSwag: 1210/1256 evaluating HellaSwag: 1220/1256 evaluating HellaSwag: 1230/1256 evaluating HellaSwag: 1240/1256 evaluating HellaSwag: 1250/1256 HellaSwag: 2712/10042 = 0.270066 step 4001/19560 | loss 3.602548 (-0.66z)| norm 0.2756 (-0.54z)| lr 5.56e-04 | 4320.82 ms | 31.2% bf16 MFU | 125586 tok/s step 4002/19560 | loss 3.583167 (-1.18z)| norm 0.2561 (-1.37z)| lr 5.56e-04 | 4159.38 ms | 32.5% bf16 MFU | 125610 tok/s step 
4003/19560 | loss 3.656504 (+0.79z)| norm 0.2817 (-0.26z)| lr 5.56e-04 | 4173.81 ms | 32.3% bf16 MFU | 125610 tok/s step 4004/19560 | loss 3.572963 (-1.43z)| norm 0.2684 (-0.83z)| lr 5.56e-04 | 4171.28 ms | 32.4% bf16 MFU | 125614 tok/s step 4005/19560 | loss 3.649651 (+0.63z)| norm 0.2627 (-1.08z)| lr 5.56e-04 | 4155.97 ms | 32.5% bf16 MFU | 125641 tok/s step 4006/19560 | loss 3.647173 (+0.55z)| norm 0.2815 (-0.27z)| lr 5.56e-04 | 4149.69 ms | 32.5% bf16 MFU | 125676 tok/s step 4007/19560 | loss 3.554086 (-1.91z)| norm 0.2536 (-1.48z)| lr 5.56e-04 | 4156.16 ms | 32.5% bf16 MFU | 125699 tok/s step 4008/19560 | loss 3.594409 (-0.83z)| norm 0.2673 (-0.88z)| lr 5.56e-04 | 4148.20 ms | 32.5% bf16 MFU | 125734 tok/s step 4009/19560 | loss 3.643695 (+0.47z)| norm 0.2907 (+0.13z)| lr 5.56e-04 | 4148.33 ms | 32.5% bf16 MFU | 125766 tok/s step 4010/19560 | loss 3.672389 (+1.21z)| norm 0.2981 (+0.45z)| lr 5.56e-04 | 4190.48 ms | 32.2% bf16 MFU | 125734 tok/s step 4011/19560 | loss 3.650308 (+0.62z)| norm 0.2840 (-0.17z)| lr 5.56e-04 | 4163.21 ms | 32.4% bf16 MFU | 125744 tok/s step 4012/19560 | loss 3.610869 (-0.42z)| norm 0.3095 (+0.92z)| lr 5.56e-04 | 4169.95 ms | 32.4% bf16 MFU | 125743 tok/s step 4013/19560 | loss 3.648336 (+0.58z)| norm 0.3205 (+1.38z)| lr 5.55e-04 | 4148.40 ms | 32.5% bf16 MFU | 125775 tok/s step 4014/19560 | loss 3.639968 (+0.35z)| norm 0.2976 (+0.43z)| lr 5.55e-04 | 4156.32 ms | 32.5% bf16 MFU | 125794 tok/s step 4015/19560 | loss 3.602160 (-0.65z)| norm 0.3121 (+1.09z)| lr 5.55e-04 | 4163.31 ms | 32.4% bf16 MFU | 125800 tok/s step 4016/19560 | loss 3.635408 (+0.24z)| norm 0.3042 (+0.72z)| lr 5.55e-04 | 4153.66 ms | 32.5% bf16 MFU | 125821 tok/s step 4017/19560 | loss 3.558210 (-1.78z)| norm 0.3109 (+1.02z)| lr 5.55e-04 | 4164.88 ms | 32.4% bf16 MFU | 125825 tok/s step 4018/19560 | loss 3.580715 (-1.17z)| norm 0.2973 (+0.40z)| lr 5.55e-04 | 4173.97 ms | 32.3% bf16 MFU | 125814 tok/s step 4019/19560 | loss 3.611191 (-0.38z)| norm 0.2924 (+0.17z)| lr 
5.55e-04 | 4153.92 ms | 32.5% bf16 MFU | 125834 tok/s step 4020/19560 | loss 3.626316 (+0.04z)| norm 0.2760 (-0.56z)| lr 5.55e-04 | 4159.21 ms | 32.5% bf16 MFU | 125845 tok/s step 4021/19560 | loss 3.582881 (-1.11z)| norm 0.3085 (+0.89z)| lr 5.55e-04 | 4159.45 ms | 32.5% bf16 MFU | 125855 tok/s step 4022/19560 | loss 3.677803 (+1.41z)| norm 0.2927 (+0.18z)| lr 5.55e-04 | 4224.07 ms | 32.0% bf16 MFU | 125768 tok/s step 4023/19560 | loss 3.599401 (-0.66z)| norm 0.3072 (+0.89z)| lr 5.55e-04 | 4154.45 ms | 32.5% bf16 MFU | 125790 tok/s step 4024/19560 | loss 3.602192 (-0.58z)| norm 0.2997 (+0.53z)| lr 5.55e-04 | 4163.12 ms | 32.4% bf16 MFU | 125797 tok/s step 4025/19560 | loss 3.688130 (+1.68z)| norm 0.2877 (-0.06z)| lr 5.55e-04 | 4147.27 ms | 32.6% bf16 MFU | 125828 tok/s step 4026/19560 | loss 3.613675 (-0.27z)| norm 0.2686 (-0.96z)| lr 5.55e-04 | 4161.57 ms | 32.4% bf16 MFU | 125836 tok/s step 4027/19560 | loss 3.524693 (-2.52z)| norm 0.2608 (-1.31z)| lr 5.55e-04 | 4155.78 ms | 32.5% bf16 MFU | 125852 tok/s step 4028/19560 | loss 3.580067 (-1.09z)| norm 0.2880 (-0.01z)| lr 5.55e-04 | 4162.99 ms | 32.4% bf16 MFU | 125856 tok/s step 4029/19560 | loss 3.549143 (-1.84z)| norm 0.3222 (+1.61z)| lr 5.55e-04 | 4383.87 ms | 30.8% bf16 MFU | 125543 tok/s step 4030/19560 | loss 3.605522 (-0.40z)| norm 0.3393 (+2.35z)| lr 5.55e-04 | 4161.38 ms | 32.4% bf16 MFU | 125566 tok/s step 4031/19560 | loss 3.618027 (-0.08z)| norm 0.3008 (+0.55z)| lr 5.55e-04 | 4161.34 ms | 32.4% bf16 MFU | 125587 tok/s step 4032/19560 | loss 3.617634 (-0.09z)| norm 0.2549 (-1.55z)| lr 5.55e-04 | 4172.26 ms | 32.4% bf16 MFU | 125591 tok/s step 4033/19560 | loss 3.579949 (-1.03z)| norm 0.2779 (-0.48z)| lr 5.55e-04 | 4168.92 ms | 32.4% bf16 MFU | 125599 tok/s step 4034/19560 | loss 3.663247 (+1.09z)| norm 0.2801 (-0.36z)| lr 5.55e-04 | 4159.73 ms | 32.5% bf16 MFU | 125621 tok/s step 4035/19560 | loss 3.630751 (+0.25z)| norm 0.2803 (-0.34z)| lr 5.55e-04 | 4177.08 ms | 32.3% bf16 MFU | 125616 tok/s step 
4036/19560 | loss 3.578112 (-1.08z)| norm 0.2625 (-1.20z)| lr 5.55e-04 | 4146.53 ms | 32.6% bf16 MFU | 125657 tok/s step 4037/19560 | loss 3.653177 (+0.85z)| norm 0.2576 (-1.41z)| lr 5.55e-04 | 4182.72 ms | 32.3% bf16 MFU | 125641 tok/s step 4038/19560 | loss 3.545660 (-1.87z)| norm 0.2855 (-0.08z)| lr 5.55e-04 | 4164.30 ms | 32.4% bf16 MFU | 125654 tok/s step 4039/19560 | loss 3.625620 (+0.15z)| norm 0.2463 (-1.92z)| lr 5.55e-04 | 4149.96 ms | 32.5% bf16 MFU | 125688 tok/s step 4040/19560 | loss 3.615494 (-0.10z)| norm 0.2355 (-2.37z)| lr 5.55e-04 | 4165.97 ms | 32.4% bf16 MFU | 125697 tok/s step 4041/19560 | loss 3.682046 (+1.57z)| norm 0.2797 (-0.28z)| lr 5.55e-04 | 4148.48 ms | 32.5% bf16 MFU | 125731 tok/s step 4042/19560 | loss 3.551094 (-1.70z)| norm 0.2620 (-1.12z)| lr 5.55e-04 | 4162.38 ms | 32.4% bf16 MFU | 125742 tok/s step 4043/19560 | loss 3.636157 (+0.42z)| norm 0.2749 (-0.48z)| lr 5.55e-04 | 4210.80 ms | 32.1% bf16 MFU | 125681 tok/s step 4044/19560 | loss 3.606822 (-0.30z)| norm 0.2934 (+0.41z)| lr 5.55e-04 | 4145.28 ms | 32.6% bf16 MFU | 125720 tok/s step 4045/19560 | loss 3.677210 (+1.53z)| norm 0.2843 (-0.03z)| lr 5.55e-04 | 4166.25 ms | 32.4% bf16 MFU | 125727 tok/s step 4046/19560 | loss 3.615973 (-0.07z)| norm 0.2979 (+0.63z)| lr 5.55e-04 | 4149.44 ms | 32.5% bf16 MFU | 125758 tok/s step 4047/19560 | loss 3.672245 (+1.38z)| norm 0.3058 (+1.00z)| lr 5.55e-04 | 4301.15 ms | 31.4% bf16 MFU | 125565 tok/s step 4048/19560 | loss 3.634206 (+0.39z)| norm 0.2905 (+0.24z)| lr 5.55e-04 | 4161.23 ms | 32.4% bf16 MFU | 125586 tok/s step 4049/19560 | loss 3.583624 (-0.92z)| norm 0.2832 (-0.11z)| lr 5.55e-04 | 4165.78 ms | 32.4% bf16 MFU | 125600 tok/s step 4050/19560 | loss 3.538254 (-2.04z)| norm 0.2864 (+0.07z)| lr 5.55e-04 | 4146.94 ms | 32.6% bf16 MFU | 125641 tok/s step 4051/19560 | loss 3.597961 (-0.51z)| norm 0.2815 (-0.18z)| lr 5.54e-04 | 4163.95 ms | 32.4% bf16 MFU | 125654 tok/s step 4052/19560 | loss 3.577856 (-1.02z)| norm 0.2835 (-0.08z)| lr 
5.54e-04 | 4147.03 ms | 32.6% bf16 MFU | 125693 tok/s step 4053/19560 | loss 3.620841 (+0.08z)| norm 0.2640 (-1.05z)| lr 5.54e-04 | 4145.80 ms | 32.6% bf16 MFU | 125731 tok/s step 4054/19560 | loss 3.638382 (+0.65z)| norm 0.3191 (+1.72z)| lr 5.54e-04 | 4148.18 ms | 32.5% bf16 MFU | 125764 tok/s step 4055/19560 | loss 3.616591 (+0.03z)| norm 0.3519 (+3.21z)| lr 5.54e-04 | 4164.53 ms | 32.4% bf16 MFU | 125771 tok/s step 4056/19560 | loss 3.603073 (-0.36z)| norm 0.3029 (+0.84z)| lr 5.54e-04 | 4161.50 ms | 32.4% bf16 MFU | 125782 tok/s step 4057/19560 | loss 3.605487 (-0.28z)| norm 0.2648 (-0.99z)| lr 5.54e-04 | 4150.07 ms | 32.5% bf16 MFU | 125809 tok/s step 4058/19560 | loss 3.679475 (+1.90z)| norm 0.3022 (+0.79z)| lr 5.54e-04 | 4161.16 ms | 32.4% bf16 MFU | 125818 tok/s step 4059/19560 | loss 3.575611 (-1.15z)| norm 0.2904 (+0.22z)| lr 5.54e-04 | 4169.61 ms | 32.4% bf16 MFU | 125815 tok/s step 4060/19560 | loss 3.654020 (+1.13z)| norm 0.2967 (+0.52z)| lr 5.54e-04 | 4159.80 ms | 32.5% bf16 MFU | 125826 tok/s step 4061/19560 | loss 3.637544 (+0.67z)| norm 0.2848 (-0.06z)| lr 5.54e-04 | 4169.60 ms | 32.4% bf16 MFU | 125821 tok/s step 4062/19560 | loss 3.722015 (+3.05z)| norm 0.2813 (-0.24z)| lr 5.54e-04 | 4149.15 ms | 32.5% bf16 MFU | 125848 tok/s step 4063/19560 | loss 3.586694 (-0.81z)| norm 0.2910 (+0.23z)| lr 5.54e-04 | 4146.76 ms | 32.6% bf16 MFU | 125878 tok/s step 4064/19560 | loss 3.600237 (-0.42z)| norm 0.2898 (+0.16z)| lr 5.54e-04 | 4185.04 ms | 32.3% bf16 MFU | 125848 tok/s step 4065/19560 | loss 3.646438 (+0.89z)| norm 0.2596 (-1.31z)| lr 5.54e-04 | 4173.00 ms | 32.4% bf16 MFU | 125837 tok/s step 4066/19560 | loss 3.639096 (+0.69z)| norm 0.2757 (-0.51z)| lr 5.54e-04 | 4147.66 ms | 32.6% bf16 MFU | 125865 tok/s step 4067/19560 | loss 3.621559 (+0.18z)| norm 0.2620 (-1.16z)| lr 5.54e-04 | 4166.63 ms | 32.4% bf16 MFU | 125864 tok/s step 4068/19560 | loss 3.565713 (-1.40z)| norm 0.2822 (-0.17z)| lr 5.54e-04 | 4177.44 ms | 32.3% bf16 MFU | 125846 tok/s step 
4069/19560 | loss 3.599541 (-0.43z)| norm 0.3093 (+1.15z)| lr 5.54e-04 | 4154.17 ms | 32.5% bf16 MFU | 125864 tok/s step 4070/19560 | loss 3.609527 (-0.14z)| norm 0.2709 (-0.72z)| lr 5.54e-04 | 4146.37 ms | 32.6% bf16 MFU | 125893 tok/s step 4071/19560 | loss 3.580620 (-0.97z)| norm 0.2840 (-0.07z)| lr 5.54e-04 | 4157.23 ms | 32.5% bf16 MFU | 125904 tok/s step 4072/19560 | loss 3.672733 (+1.64z)| norm 0.2608 (-1.20z)| lr 5.54e-04 | 4165.85 ms | 32.4% bf16 MFU | 125902 tok/s step 4073/19560 | loss 3.652654 (+1.06z)| norm 0.2598 (-1.23z)| lr 5.54e-04 | 4153.33 ms | 32.5% bf16 MFU | 125918 tok/s step 4074/19560 | loss 3.605432 (-0.30z)| norm 0.2613 (-1.13z)| lr 5.54e-04 | 4153.26 ms | 32.5% bf16 MFU | 125934 tok/s step 4075/19560 | loss 3.580612 (-1.00z)| norm 0.2879 (+0.14z)| lr 5.54e-04 | 4168.71 ms | 32.4% bf16 MFU | 125926 tok/s step 4076/19560 | loss 3.617901 (+0.06z)| norm 0.3058 (+0.99z)| lr 5.54e-04 | 4187.81 ms | 32.2% bf16 MFU | 125889 tok/s step 4077/19560 | loss 3.638122 (+0.64z)| norm 0.3082 (+1.09z)| lr 5.54e-04 | 4167.20 ms | 32.4% bf16 MFU | 125885 tok/s step 4078/19560 | loss 3.626635 (+0.31z)| norm 0.2823 (-0.16z)| lr 5.54e-04 | 4168.05 ms | 32.4% bf16 MFU | 125880 tok/s step 4079/19560 | loss 3.669075 (+1.50z)| norm 0.3813 (+4.26z)| lr 5.54e-04 | 4148.08 ms | 32.5% bf16 MFU | 125906 tok/s step 4080/19560 | loss 3.597869 (-0.53z)| norm 0.3023 (+0.70z)| lr 5.54e-04 | 4160.94 ms | 32.4% bf16 MFU | 125911 tok/s step 4081/19560 | loss 3.597684 (-0.53z)| norm 0.3110 (+1.07z)| lr 5.54e-04 | 4152.66 ms | 32.5% bf16 MFU | 125928 tok/s step 4082/19560 | loss 3.586309 (-0.84z)| norm 0.3013 (+0.63z)| lr 5.54e-04 | 4161.27 ms | 32.4% bf16 MFU | 125931 tok/s step 4083/19560 | loss 3.630629 (+0.42z)| norm 0.2817 (-0.25z)| lr 5.54e-04 | 4158.39 ms | 32.5% bf16 MFU | 125939 tok/s step 4084/19560 | loss 3.646657 (+0.88z)| norm 0.3035 (+0.72z)| lr 5.54e-04 | 4147.83 ms | 32.6% bf16 MFU | 125962 tok/s step 4085/19560 | loss 3.593007 (-0.65z)| norm 0.2687 (-0.87z)| lr 
5.54e-04 | 4154.73 ms | 32.5% bf16 MFU | 125973 tok/s step 4086/19560 | loss 3.601521 (-0.41z)| norm 0.2948 (+0.31z)| lr 5.54e-04 | 4145.30 ms | 32.6% bf16 MFU | 125998 tok/s step 4087/19560 | loss 3.699435 (+2.31z)| norm 0.3007 (+0.57z)| lr 5.54e-04 | 4148.21 ms | 32.5% bf16 MFU | 126018 tok/s step 4088/19560 | loss 3.568817 (-1.31z)| norm 0.3006 (+0.56z)| lr 5.54e-04 | 4154.47 ms | 32.5% bf16 MFU | 126027 tok/s step 4089/19560 | loss 3.596846 (-0.53z)| norm 0.2815 (-0.34z)| lr 5.53e-04 | 4150.26 ms | 32.5% bf16 MFU | 126042 tok/s step 4090/19560 | loss 3.622022 (+0.17z)| norm 0.3095 (+0.95z)| lr 5.53e-04 | 4156.86 ms | 32.5% bf16 MFU | 126046 tok/s step 4091/19560 | loss 3.597480 (-0.51z)| norm 0.2984 (+0.43z)| lr 5.53e-04 | 4262.36 ms | 31.7% bf16 MFU | 125894 tok/s step 4092/19560 | loss 3.620333 (+0.12z)| norm 0.2867 (-0.11z)| lr 5.53e-04 | 4194.49 ms | 32.2% bf16 MFU | 125849 tok/s step 4093/19560 | loss 3.634161 (+0.50z)| norm 0.3073 (+0.86z)| lr 5.53e-04 | 4159.25 ms | 32.5% bf16 MFU | 125859 tok/s step 4094/19560 | loss 3.579073 (-1.02z)| norm 0.2724 (-0.77z)| lr 5.53e-04 | 4147.29 ms | 32.6% bf16 MFU | 125887 tok/s step 4095/19560 | loss 3.650409 (+0.96z)| norm 0.2664 (-1.04z)| lr 5.53e-04 | 4158.56 ms | 32.5% bf16 MFU | 125896 tok/s step 4096/19560 | loss 3.645225 (+0.80z)| norm 0.3003 (+0.54z)| lr 5.53e-04 | 4252.52 ms | 31.7% bf16 MFU | 125766 tok/s step 4097/19560 | loss 3.606243 (-0.27z)| norm 0.2837 (-0.24z)| lr 5.53e-04 | 4529.86 ms | 29.8% bf16 MFU | 125265 tok/s step 4098/19560 | loss 3.599351 (-0.46z)| norm 0.2462 (-1.96z)| lr 5.53e-04 | 4283.38 ms | 31.5% bf16 MFU | 125122 tok/s step 4099/19560 | loss 3.607270 (-0.23z)| norm 0.2698 (-0.86z)| lr 5.53e-04 | 4833.14 ms | 27.9% bf16 MFU | 124289 tok/s step 4100/19560 | loss 3.550564 (-1.77z)| norm 0.2586 (-1.36z)| lr 5.53e-04 | 4144.68 ms | 32.6% bf16 MFU | 124400 tok/s step 4101/19560 | loss 3.573841 (-1.11z)| norm 0.2943 (+0.31z)| lr 5.53e-04 | 4370.91 ms | 30.9% bf16 MFU | 124177 tok/s step 
4102/19560 | loss 3.584946 (-0.80z)| norm 0.3096 (+1.01z)| lr 5.53e-04 | 4141.19 ms | 32.6% bf16 MFU | 124299 tok/s step 4103/19560 | loss 3.600117 (-0.38z)| norm 0.3168 (+1.33z)| lr 5.53e-04 | 4161.91 ms | 32.4% bf16 MFU | 124382 tok/s step 4104/19560 | loss 3.627107 (+0.35z)| norm 0.3056 (+0.81z)| lr 5.53e-04 | 4141.52 ms | 32.6% bf16 MFU | 124493 tok/s step 4105/19560 | loss 3.668778 (+1.48z)| norm 0.2936 (+0.26z)| lr 5.53e-04 | 4147.18 ms | 32.6% bf16 MFU | 124589 tok/s step 4106/19560 | loss 3.631176 (+0.45z)| norm 0.2625 (-1.17z)| lr 5.53e-04 | 4149.06 ms | 32.5% bf16 MFU | 124678 tok/s step 4107/19560 | loss 3.582831 (-0.87z)| norm 0.2897 (+0.08z)| lr 5.53e-04 | 4149.98 ms | 32.5% bf16 MFU | 124761 tok/s step 4108/19560 | loss 3.606914 (-0.20z)| norm 0.2661 (-1.00z)| lr 5.53e-04 | 4181.76 ms | 32.3% bf16 MFU | 124791 tok/s step 4109/19560 | loss 3.614783 (+0.00z)| norm 0.2940 (+0.29z)| lr 5.53e-04 | 4171.90 ms | 32.4% bf16 MFU | 124835 tok/s step 4110/19560 | loss 3.561266 (-1.48z)| norm 0.2957 (+0.36z)| lr 5.53e-04 | 4164.29 ms | 32.4% bf16 MFU | 124889 tok/s step 4111/19560 | loss 3.606875 (-0.20z)| norm 0.2794 (-0.39z)| lr 5.53e-04 | 4190.10 ms | 32.2% bf16 MFU | 124900 tok/s step 4112/19560 | loss 3.661314 (+1.31z)| norm 0.3233 (+1.61z)| lr 5.53e-04 | 4163.74 ms | 32.4% bf16 MFU | 124951 tok/s step 4113/19560 | loss 3.584593 (-0.83z)| norm 0.3424 (+2.45z)| lr 5.53e-04 | 4165.86 ms | 32.4% bf16 MFU | 124996 tok/s step 4114/19560 | loss 3.549347 (-1.77z)| norm 0.3711 (+3.54z)| lr 5.53e-04 | 4174.62 ms | 32.3% bf16 MFU | 125026 tok/s step 4115/19560 | loss 3.713014 (+2.64z)| norm 0.3503 (+2.57z)| lr 5.53e-04 | 4313.44 ms | 31.3% bf16 MFU | 124852 tok/s step 4116/19560 | loss 3.578462 (-0.95z)| norm 0.2895 (+0.02z)| lr 5.53e-04 | 4167.63 ms | 32.4% bf16 MFU | 124900 tok/s step 4117/19560 | loss 3.605714 (-0.23z)| norm 0.2862 (-0.12z)| lr 5.53e-04 | 4208.54 ms | 32.1% bf16 MFU | 124883 tok/s step 4118/19560 | loss 3.704044 (+2.34z)| norm 0.3637 (+3.01z)| lr 
5.53e-04 | 4176.28 ms | 32.3% bf16 MFU | 124916 tok/s step 4119/19560 | loss 3.561454 (-1.41z)| norm 0.3337 (+1.78z)| lr 5.53e-04 | 4160.95 ms | 32.4% bf16 MFU | 124971 tok/s step 4120/19560 | loss 3.588955 (-0.68z)| norm 0.2900 (+0.00z)| lr 5.53e-04 | 4170.81 ms | 32.4% bf16 MFU | 125007 tok/s step 4121/19560 | loss 3.586550 (-0.74z)| norm 0.2957 (+0.23z)| lr 5.53e-04 | 4162.09 ms | 32.4% bf16 MFU | 125055 tok/s step 4122/19560 | loss 3.566974 (-1.24z)| norm 0.3107 (+0.84z)| lr 5.53e-04 | 4163.72 ms | 32.4% bf16 MFU | 125098 tok/s step 4123/19560 | loss 3.563658 (-1.31z)| norm 0.2734 (-0.67z)| lr 5.53e-04 | 4264.04 ms | 31.7% bf16 MFU | 124991 tok/s step 4124/19560 | loss 3.606984 (-0.17z)| norm 0.2954 (+0.22z)| lr 5.53e-04 | 4185.26 ms | 32.3% bf16 MFU | 125005 tok/s step 4125/19560 | loss 3.557914 (-1.43z)| norm 0.2823 (-0.32z)| lr 5.53e-04 | 4176.90 ms | 32.3% bf16 MFU | 125031 tok/s step 4126/19560 | loss 3.613929 (+0.02z)| norm 0.3068 (+0.67z)| lr 5.52e-04 | 4168.84 ms | 32.4% bf16 MFU | 125068 tok/s step 4127/19560 | loss 3.528528 (-2.14z)| norm 0.3493 (+2.34z)| lr 5.52e-04 | 4160.19 ms | 32.5% bf16 MFU | 125115 tok/s step 4128/19560 | loss 3.582078 (-0.77z)| norm 0.2745 (-0.66z)| lr 5.52e-04 | 4164.23 ms | 32.4% bf16 MFU | 125155 tok/s step 4129/19560 | loss 3.576412 (-0.90z)| norm 0.2796 (-0.46z)| lr 5.52e-04 | 4167.97 ms | 32.4% bf16 MFU | 125187 tok/s step 4130/19560 | loss 3.589285 (-0.58z)| norm 0.2744 (-0.67z)| lr 5.52e-04 | 4164.02 ms | 32.4% bf16 MFU | 125223 tok/s step 4131/19560 | loss 3.650862 (+0.98z)| norm 0.2581 (-1.32z)| lr 5.52e-04 | 4166.33 ms | 32.4% bf16 MFU | 125254 tok/s step 4132/19560 | loss 3.626269 (+0.35z)| norm 0.3056 (+0.58z)| lr 5.52e-04 | 4159.90 ms | 32.5% bf16 MFU | 125293 tok/s step 4133/19560 | loss 3.635574 (+0.59z)| norm 0.2709 (-0.82z)| lr 5.52e-04 | 4159.07 ms | 32.5% bf16 MFU | 125331 tok/s step 4134/19560 | loss 3.592249 (-0.50z)| norm 0.2635 (-1.11z)| lr 5.52e-04 | 4160.48 ms | 32.5% bf16 MFU | 125365 tok/s step 
4135/19560 | loss 3.538042 (-1.87z)| norm 0.2986 (+0.29z)| lr 5.52e-04 | 4164.07 ms | 32.4% bf16 MFU | 125392 tok/s step 4136/19560 | loss 3.638401 (+0.66z)| norm 0.2651 (-1.07z)| lr 5.52e-04 | 4158.72 ms | 32.5% bf16 MFU | 125426 tok/s step 4137/19560 | loss 3.624181 (+0.31z)| norm 0.2793 (-0.49z)| lr 5.52e-04 | 4155.72 ms | 32.5% bf16 MFU | 125463 tok/s step 4138/19560 | loss 3.642193 (+0.78z)| norm 0.2713 (-0.80z)| lr 5.52e-04 | 4177.97 ms | 32.3% bf16 MFU | 125464 tok/s step 4139/19560 | loss 3.533754 (-1.95z)| norm 0.2671 (-0.96z)| lr 5.52e-04 | 4169.99 ms | 32.4% bf16 MFU | 125477 tok/s step 4140/19560 | loss 3.582631 (-0.71z)| norm 0.2806 (-0.41z)| lr 5.52e-04 | 4169.25 ms | 32.4% bf16 MFU | 125491 tok/s step 4141/19560 | loss 3.585301 (-0.63z)| norm 0.2668 (-0.95z)| lr 5.52e-04 | 4172.58 ms | 32.4% bf16 MFU | 125499 tok/s step 4142/19560 | loss 3.612612 (+0.07z)| norm 0.2732 (-0.69z)| lr 5.52e-04 | 4166.88 ms | 32.4% bf16 MFU | 125515 tok/s step 4143/19560 | loss 3.594706 (-0.39z)| norm 0.2676 (-0.90z)| lr 5.52e-04 | 4165.65 ms | 32.4% bf16 MFU | 125532 tok/s step 4144/19560 | loss 3.588114 (-0.54z)| norm 0.3059 (+0.64z)| lr 5.52e-04 | 4173.12 ms | 32.4% bf16 MFU | 125538 tok/s step 4145/19560 | loss 3.600909 (-0.23z)| norm 0.3421 (+2.06z)| lr 5.52e-04 | 4165.89 ms | 32.4% bf16 MFU | 125553 tok/s step 4146/19560 | loss 3.605813 (-0.11z)| norm 0.3252 (+1.37z)| lr 5.52e-04 | 4157.08 ms | 32.5% bf16 MFU | 125582 tok/s step 4147/19560 | loss 3.602154 (-0.20z)| norm 0.2694 (-0.82z)| lr 5.52e-04 | 4163.75 ms | 32.4% bf16 MFU | 125598 tok/s step 4148/19560 | loss 3.572453 (-0.95z)| norm 0.2508 (-1.53z)| lr 5.52e-04 | 4161.72 ms | 32.4% bf16 MFU | 125617 tok/s step 4149/19560 | loss 3.641044 (+0.79z)| norm 0.2789 (-0.42z)| lr 5.52e-04 | 4163.32 ms | 32.4% bf16 MFU | 125633 tok/s step 4150/19560 | loss 3.572106 (-0.95z)| norm 0.2812 (-0.33z)| lr 5.52e-04 | 4164.62 ms | 32.4% bf16 MFU | 125646 tok/s step 4151/19560 | loss 3.646291 (+0.94z)| norm 0.3019 (+0.48z)| lr 
5.52e-04 | 4153.15 ms | 32.5% bf16 MFU | 125676 tok/s step 4152/19560 | loss 3.590990 (-0.47z)| norm 0.2708 (-0.72z)| lr 5.52e-04 | 4173.46 ms | 32.4% bf16 MFU | 125673 tok/s step 4153/19560 | loss 3.585911 (-0.59z)| norm 0.2752 (-0.55z)| lr 5.52e-04 | 4166.28 ms | 32.4% bf16 MFU | 125681 tok/s step 4154/19560 | loss 3.575083 (-0.86z)| norm 0.2665 (-0.88z)| lr 5.52e-04 | 4153.50 ms | 32.5% bf16 MFU | 125709 tok/s step 4155/19560 | loss 3.602389 (-0.17z)| norm 0.3294 (+1.54z)| lr 5.52e-04 | 4172.24 ms | 32.4% bf16 MFU | 125706 tok/s step 4156/19560 | loss 3.648551 (+1.03z)| norm 0.3171 (+1.05z)| lr 5.52e-04 | 4160.93 ms | 32.4% bf16 MFU | 125721 tok/s step 4157/19560 | loss 3.612847 (+0.08z)| norm 0.3062 (+0.63z)| lr 5.52e-04 | 4162.40 ms | 32.4% bf16 MFU | 125733 tok/s step 4158/19560 | loss 3.642181 (+0.85z)| norm 0.4168 (+4.55z)| lr 5.52e-04 | 4161.90 ms | 32.4% bf16 MFU | 125745 tok/s step 4159/19560 | loss 3.586840 (-0.62z)| norm 0.3325 (+1.49z)| lr 5.52e-04 | 4167.42 ms | 32.4% bf16 MFU | 125748 tok/s step 4160/19560 | loss 3.644305 (+0.90z)| norm 0.3329 (+1.48z)| lr 5.52e-04 | 4166.31 ms | 32.4% bf16 MFU | 125753 tok/s step 4161/19560 | loss 3.574674 (-0.94z)| norm 0.3320 (+1.42z)| lr 5.52e-04 | 4174.53 ms | 32.3% bf16 MFU | 125745 tok/s step 4162/19560 | loss 3.606876 (-0.08z)| norm 0.3253 (+1.17z)| lr 5.52e-04 | 4172.68 ms | 32.4% bf16 MFU | 125740 tok/s step 4163/19560 | loss 3.563572 (-1.21z)| norm 0.3086 (+0.57z)| lr 5.51e-04 | 4163.14 ms | 32.4% bf16 MFU | 125750 tok/s step 4164/19560 | loss 3.575479 (-0.89z)| norm 0.3249 (+1.13z)| lr 5.51e-04 | 4156.12 ms | 32.5% bf16 MFU | 125770 tok/s step 4165/19560 | loss 3.596524 (-0.33z)| norm 0.3739 (+2.75z)| lr 5.51e-04 | 4165.21 ms | 32.4% bf16 MFU | 125775 tok/s step 4166/19560 | loss 3.593954 (-0.41z)| norm 0.3561 (+2.08z)| lr 5.51e-04 | 4160.45 ms | 32.5% bf16 MFU | 125787 tok/s step 4167/19560 | loss 3.660746 (+1.37z)| norm 0.3184 (+0.80z)| lr 5.51e-04 | 4170.34 ms | 32.4% bf16 MFU | 125783 tok/s step 
4168/19560 | loss 3.564665 (-1.18z)| norm 0.3164 (+0.72z)| lr 5.51e-04 | 4180.75 ms | 32.3% bf16 MFU | 125764 tok/s step 4169/19560 | loss 3.581621 (-0.72z)| norm 0.2957 (+0.00z)| lr 5.51e-04 | 4170.28 ms | 32.4% bf16 MFU | 125762 tok/s step 4170/19560 | loss 3.547165 (-1.64z)| norm 0.3170 (+0.72z)| lr 5.51e-04 | 4155.51 ms | 32.5% bf16 MFU | 125782 tok/s step 4171/19560 | loss 3.604218 (-0.10z)| norm 0.3029 (+0.23z)| lr 5.51e-04 | 4167.13 ms | 32.4% bf16 MFU | 125784 tok/s step 4172/19560 | loss 3.576267 (-0.85z)| norm 0.3046 (+0.29z)| lr 5.51e-04 | 4169.14 ms | 32.4% bf16 MFU | 125783 tok/s step 4173/19560 | loss 3.578728 (-0.77z)| norm 0.2989 (+0.08z)| lr 5.51e-04 | 4269.07 ms | 31.6% bf16 MFU | 125634 tok/s step 4174/19560 | loss 3.541747 (-1.74z)| norm 0.2681 (-0.97z)| lr 5.51e-04 | 4172.73 ms | 32.4% bf16 MFU | 125635 tok/s step 4175/19560 | loss 3.633646 (+0.75z)| norm 0.3289 (+1.12z)| lr 5.51e-04 | 4156.50 ms | 32.5% bf16 MFU | 125660 tok/s step 4176/19560 | loss 3.572172 (-0.91z)| norm 0.2818 (-0.50z)| lr 5.51e-04 | 4166.35 ms | 32.4% bf16 MFU | 125669 tok/s step 4177/19560 | loss 3.684014 (+2.07z)| norm 0.2625 (-1.15z)| lr 5.51e-04 | 4167.08 ms | 32.4% bf16 MFU | 125676 tok/s step 4178/19560 | loss 3.657515 (+1.35z)| norm 0.2984 (+0.07z)| lr 5.51e-04 | 4171.19 ms | 32.4% bf16 MFU | 125677 tok/s step 4179/19560 | loss 3.635133 (+0.74z)| norm 0.3033 (+0.24z)| lr 5.51e-04 | 4161.33 ms | 32.4% bf16 MFU | 125693 tok/s step 4180/19560 | loss 3.569362 (-1.02z)| norm 0.3135 (+0.58z)| lr 5.51e-04 | 4160.66 ms | 32.5% bf16 MFU | 125708 tok/s step 4181/19560 | loss 3.594708 (-0.34z)| norm 0.2678 (-0.99z)| lr 5.51e-04 | 4160.64 ms | 32.5% bf16 MFU | 125724 tok/s step 4182/19560 | loss 3.566090 (-1.09z)| norm 0.2637 (-1.12z)| lr 5.51e-04 | 4161.03 ms | 32.4% bf16 MFU | 125737 tok/s step 4183/19560 | loss 3.590423 (-0.43z)| norm 0.2551 (-1.39z)| lr 5.51e-04 | 4172.72 ms | 32.4% bf16 MFU | 125733 tok/s step 4184/19560 | loss 3.621180 (+0.39z)| norm 0.2507 (-1.52z)| lr 
5.51e-04 | 4161.46 ms | 32.4% bf16 MFU | 125746 tok/s step 4185/19560 | loss 3.625236 (+0.49z)| norm 0.2616 (-1.14z)| lr 5.51e-04 | 4162.92 ms | 32.4% bf16 MFU | 125755 tok/s step 4186/19560 | loss 3.590516 (-0.42z)| norm 0.2632 (-1.07z)| lr 5.51e-04 | 4162.35 ms | 32.4% bf16 MFU | 125766 tok/s step 4187/19560 | loss 3.605846 (-0.02z)| norm 0.2487 (-1.54z)| lr 5.51e-04 | 4165.16 ms | 32.4% bf16 MFU | 125771 tok/s step 4188/19560 | loss 3.514488 (-2.42z)| norm 0.2664 (-0.93z)| lr 5.51e-04 | 4160.52 ms | 32.5% bf16 MFU | 125783 tok/s step 4189/19560 | loss 3.585585 (-0.52z)| norm 0.2481 (-1.52z)| lr 5.51e-04 | 4163.01 ms | 32.4% bf16 MFU | 125791 tok/s step 4190/19560 | loss 3.586709 (-0.48z)| norm 0.2445 (-1.62z)| lr 5.51e-04 | 4151.11 ms | 32.5% bf16 MFU | 125817 tok/s step 4191/19560 | loss 3.623953 (+0.55z)| norm 0.2565 (-1.21z)| lr 5.51e-04 | 4488.20 ms | 30.1% bf16 MFU | 125366 tok/s step 4192/19560 | loss 3.599723 (-0.12z)| norm 0.2540 (-1.27z)| lr 5.51e-04 | 4165.60 ms | 32.4% bf16 MFU | 125391 tok/s step 4193/19560 | loss 3.542262 (-1.69z)| norm 0.2591 (-1.11z)| lr 5.51e-04 | 4172.30 ms | 32.4% bf16 MFU | 125405 tok/s step 4194/19560 | loss 3.582739 (-0.56z)| norm 0.2528 (-1.30z)| lr 5.51e-04 | 4156.40 ms | 32.5% bf16 MFU | 125441 tok/s step 4195/19560 | loss 3.577147 (-0.70z)| norm 0.2663 (-0.86z)| lr 5.51e-04 | 4187.82 ms | 32.2% bf16 MFU | 125429 tok/s step 4196/19560 | loss 3.586846 (-0.44z)| norm 0.2732 (-0.64z)| lr 5.51e-04 | 4158.18 ms | 32.5% bf16 MFU | 125462 tok/s step 4197/19560 | loss 3.645907 (+1.18z)| norm 0.2708 (-0.70z)| lr 5.51e-04 | 4166.37 ms | 32.4% bf16 MFU | 125481 tok/s step 4198/19560 | loss 3.573992 (-0.79z)| norm 0.3216 (+0.92z)| lr 5.51e-04 | 4163.78 ms | 32.4% bf16 MFU | 125502 tok/s step 4199/19560 | loss 3.522173 (-2.17z)| norm 0.2767 (-0.52z)| lr 5.50e-04 | 4162.95 ms | 32.4% bf16 MFU | 125524 tok/s step 4200/19560 | loss 3.587587 (-0.39z)| norm 0.2994 (+0.20z)| lr 5.50e-04 | 4172.85 ms | 32.4% bf16 MFU | 125530 tok/s step 
4201/19560 | loss 3.580372 (-0.57z)| norm 0.3187 (+0.81z)| lr 5.50e-04 | 4161.39 ms | 32.4% bf16 MFU | 125553 tok/s step 4202/19560 | loss 3.629359 (+0.77z)| norm 0.2842 (-0.32z)| lr 5.50e-04 | 4157.12 ms | 32.5% bf16 MFU | 125581 tok/s step 4203/19560 | loss 3.611988 (+0.29z)| norm 0.2867 (-0.23z)| lr 5.50e-04 | 4168.22 ms | 32.4% bf16 MFU | 125592 tok/s step 4204/19560 | loss 3.586115 (-0.42z)| norm 0.2685 (-0.82z)| lr 5.50e-04 | 4170.05 ms | 32.4% bf16 MFU | 125598 tok/s step 4205/19560 | loss 3.533989 (-1.82z)| norm 0.2650 (-0.92z)| lr 5.50e-04 | 4153.66 ms | 32.5% bf16 MFU | 125630 tok/s step 4206/19560 | loss 3.546675 (-1.44z)| norm 0.2896 (-0.12z)| lr 5.50e-04 | 4164.08 ms | 32.4% bf16 MFU | 125643 tok/s step 4207/19560 | loss 3.572750 (-0.72z)| norm 0.2809 (-0.39z)| lr 5.50e-04 | 4158.70 ms | 32.5% bf16 MFU | 125665 tok/s step 4208/19560 | loss 3.572364 (-0.73z)| norm 0.2996 (+0.24z)| lr 5.50e-04 | 4158.57 ms | 32.5% bf16 MFU | 125685 tok/s step 4209/19560 | loss 3.566193 (-0.89z)| norm 0.2992 (+0.23z)| lr 5.50e-04 | 4168.49 ms | 32.4% bf16 MFU | 125690 tok/s step 4210/19560 | loss 3.633722 (+0.94z)| norm 0.2756 (-0.56z)| lr 5.50e-04 | 4154.56 ms | 32.5% bf16 MFU | 125715 tok/s step 4211/19560 | loss 3.587048 (-0.32z)| norm 0.2987 (+0.21z)| lr 5.50e-04 | 4173.32 ms | 32.4% bf16 MFU | 125711 tok/s step 4212/19560 | loss 3.601198 (+0.08z)| norm 0.2818 (-0.35z)| lr 5.50e-04 | 4154.19 ms | 32.5% bf16 MFU | 125735 tok/s step 4213/19560 | loss 3.569438 (-0.79z)| norm 0.2647 (-0.92z)| lr 5.50e-04 | 4151.08 ms | 32.5% bf16 MFU | 125764 tok/s step 4214/19560 | loss 3.592124 (-0.16z)| norm 0.2610 (-1.03z)| lr 5.50e-04 | 4167.70 ms | 32.4% bf16 MFU | 125765 tok/s step 4215/19560 | loss 3.697948 (+2.74z)| norm 0.2808 (-0.36z)| lr 5.50e-04 | 4154.46 ms | 32.5% bf16 MFU | 125787 tok/s step 4216/19560 | loss 3.630790 (+0.88z)| norm 0.2891 (-0.09z)| lr 5.50e-04 | 4170.21 ms | 32.4% bf16 MFU | 125784 tok/s step 4217/19560 | loss 3.550755 (-1.29z)| norm 0.3058 (+0.47z)| lr 
5.50e-04 | 4158.96 ms | 32.5% bf16 MFU | 125798 tok/s step 4218/19560 | loss 3.587286 (-0.29z)| norm 0.2846 (-0.23z)| lr 5.50e-04 | 4166.87 ms | 32.4% bf16 MFU | 125799 tok/s step 4219/19560 | loss 3.614146 (+0.44z)| norm 0.2739 (-0.58z)| lr 5.50e-04 | 4161.42 ms | 32.4% bf16 MFU | 125808 tok/s step 4220/19560 | loss 3.578827 (-0.52z)| norm 0.2548 (-1.21z)| lr 5.50e-04 | 4157.24 ms | 32.5% bf16 MFU | 125824 tok/s step 4221/19560 | loss 3.650411 (+1.43z)| norm 0.4718 (+5.27z)| lr 5.50e-04 | 4161.85 ms | 32.4% bf16 MFU | 125831 tok/s step 4222/19560 | loss 3.624454 (+0.71z)| norm 0.2851 (-0.22z)| lr 5.50e-04 | 4161.16 ms | 32.4% bf16 MFU | 125840 tok/s step 4223/19560 | loss 3.622176 (+0.66z)| norm 0.3007 (+0.23z)| lr 5.50e-04 | 4163.70 ms | 32.4% bf16 MFU | 125844 tok/s step 4224/19560 | loss 3.570334 (-0.74z)| norm 0.2677 (-0.73z)| lr 5.50e-04 | 4159.37 ms | 32.5% bf16 MFU | 125854 tok/s step 4225/19560 | loss 3.600728 (+0.09z)| norm 0.2972 (+0.13z)| lr 5.50e-04 | 4164.88 ms | 32.4% bf16 MFU | 125855 tok/s step 4226/19560 | loss 3.579705 (-0.48z)| norm 0.2857 (-0.22z)| lr 5.50e-04 | 4169.62 ms | 32.4% bf16 MFU | 125850 tok/s step 4227/19560 | loss 3.619417 (+0.61z)| norm 0.2837 (-0.28z)| lr 5.50e-04 | 4177.73 ms | 32.3% bf16 MFU | 125832 tok/s step 4228/19560 | loss 3.525644 (-1.94z)| norm 0.2676 (-0.76z)| lr 5.50e-04 | 4162.99 ms | 32.4% bf16 MFU | 125837 tok/s step 4229/19560 | loss 3.588917 (-0.23z)| norm 0.3119 (+0.55z)| lr 5.50e-04 | 4180.20 ms | 32.3% bf16 MFU | 125816 tok/s step 4230/19560 | loss 3.631370 (+0.92z)| norm 0.3367 (+1.28z)| lr 5.50e-04 | 4166.03 ms | 32.4% bf16 MFU | 125818 tok/s step 4231/19560 | loss 3.737834 (+3.58z)| norm 0.3458 (+1.53z)| lr 5.50e-04 | 4156.51 ms | 32.5% bf16 MFU | 125834 tok/s step 4232/19560 | loss 3.543909 (-1.38z)| norm 0.3616 (+1.95z)| lr 5.50e-04 | 4166.20 ms | 32.4% bf16 MFU | 125834 tok/s step 4233/19560 | loss 3.604041 (+0.17z)| norm 0.2881 (-0.17z)| lr 5.50e-04 | 4198.84 ms | 32.2% bf16 MFU | 125786 tok/s step 
4234/19560 | loss 3.580865 (-0.42z)| norm 0.3020 (+0.22z)| lr 5.50e-04 | 4175.59 ms | 32.3% bf16 MFU | 125775 tok/s step 4235/19560 | loss 3.578247 (-0.49z)| norm 0.2985 (+0.12z)| lr 5.50e-04 | 4165.55 ms | 32.4% bf16 MFU | 125779 tok/s step 4236/19560 | loss 3.584472 (-0.32z)| norm 0.3000 (+0.15z)| lr 5.49e-04 | 4161.32 ms | 32.4% bf16 MFU | 125790 tok/s step 4237/19560 | loss 3.649473 (+1.35z)| norm 0.2900 (-0.14z)| lr 5.49e-04 | 4159.90 ms | 32.5% bf16 MFU | 125802 tok/s step 4238/19560 | loss 3.705115 (+2.68z)| norm 0.3080 (+0.38z)| lr 5.49e-04 | 4167.99 ms | 32.4% bf16 MFU | 125801 tok/s step 4239/19560 | loss 3.604882 (+0.17z)| norm 0.2774 (-0.50z)| lr 5.49e-04 | 4156.22 ms | 32.5% bf16 MFU | 125818 tok/s step 4240/19560 | loss 3.553380 (-1.11z)| norm 0.2731 (-0.62z)| lr 5.49e-04 | 4167.54 ms | 32.4% bf16 MFU | 125818 tok/s step 4241/19560 | loss 3.581954 (-0.39z)| norm 0.2811 (-0.37z)| lr 5.49e-04 | 4161.28 ms | 32.4% bf16 MFU | 125826 tok/s step 4242/19560 | loss 3.569291 (-0.72z)| norm 0.2930 (-0.01z)| lr 5.49e-04 | 4161.56 ms | 32.4% bf16 MFU | 125834 tok/s step 4243/19560 | loss 3.606115 (+0.25z)| norm 0.2777 (-0.46z)| lr 5.49e-04 | 4174.98 ms | 32.3% bf16 MFU | 125821 tok/s step 4244/19560 | loss 3.689085 (+2.35z)| norm 0.2692 (-0.70z)| lr 5.49e-04 | 4189.04 ms | 32.2% bf16 MFU | 125788 tok/s step 4245/19560 | loss 3.633601 (+0.92z)| norm 0.2573 (-1.05z)| lr 5.49e-04 | 4173.86 ms | 32.3% bf16 MFU | 125779 tok/s step 4246/19560 | loss 3.607995 (+0.29z)| norm 0.2976 (+0.17z)| lr 5.49e-04 | 4157.62 ms | 32.5% bf16 MFU | 125796 tok/s step 4247/19560 | loss 3.586653 (-0.28z)| norm 0.2904 (-0.04z)| lr 5.49e-04 | 4169.95 ms | 32.4% bf16 MFU | 125792 tok/s step 4248/19560 | loss 3.593254 (-0.11z)| norm 0.2855 (-0.18z)| lr 5.49e-04 | 4158.14 ms | 32.5% bf16 MFU | 125807 tok/s step 4249/19560 | loss 3.622241 (+0.65z)| norm 0.2828 (-0.26z)| lr 5.49e-04 | 4161.78 ms | 32.4% bf16 MFU | 125816 tok/s step 4250/19560 | loss 3.532945 (-1.69z)| norm 0.2889 (-0.07z)| lr 
5.49e-04 | 4171.86 ms | 32.4% bf16 MFU | 125808 tok/s val loss 3.595261
HellaSwag: 2712/10042 = 0.270066 step 4251/19560 | loss 3.489716 (-2.73z)| norm 0.2665 (-0.76z)| lr 5.49e-04 | 4440.14 ms | 30.4% bf16 MFU | 125422 tok/s step 4252/19560 | loss 3.617530 (+0.53z)| norm 0.3038 (+0.39z)| lr 5.49e-04 | 4187.77 ms | 32.2% bf16 MFU | 125411 tok/s step 4253/19560 | loss 3.558114 (-0.99z)| norm 0.2919 (+0.02z)| lr 5.49e-04 | 4174.34 ms | 32.3% bf16 MFU | 125420 tok/s step 4254/19560 | loss 3.565321 (-0.79z)| norm 0.3512 (+1.81z)| lr 5.49e-04 | 4255.51 ms | 31.7% bf16 MFU | 125309 tok/s step 4255/19560 | loss 3.618626 (+0.55z)| norm 0.3380 (+1.42z)| lr 5.49e-04 | 4194.91 ms | 32.2% bf16 MFU | 125293 tok/s step 4256/19560 | loss 3.565310 (-0.81z)| norm 0.2913 (-0.01z)| lr 5.49e-04 | 4160.39 ms | 32.5% bf16 MFU | 125329 tok/s step 4257/19560 | loss 3.619004 (+0.56z)| norm 0.2721 (-0.60z)| lr 5.49e-04 | 4205.75 ms | 32.1% bf16 MFU | 125296 tok/s step 4258/19560 | loss 3.607943 (+0.27z)| norm 0.2655 (-0.79z)| lr 5.49e-04 | 4156.69 ms | 32.5% bf16 MFU | 125337 tok/s step 4259/19560 | loss 3.678546 (+2.05z)| norm 0.2702 (-0.66z)| lr 5.49e-04 | 4166.43 ms | 32.4% bf16 MFU | 125362 tok/s step 4260/19560 | loss 3.618572 (+0.53z)| norm 0.2673 (-0.73z)| lr 5.49e-04 | 4159.45 ms | 32.5% bf16 MFU | 125397 tok/s step 4261/19560 | loss 3.659590 (+1.56z)| norm 0.2948 (+0.10z)| lr 5.49e-04 | 4156.90 ms | 32.5% bf16 MFU | 125433 tok/s step 4262/19560 | loss 3.581580 (-0.41z)| norm 0.2764 (-0.47z)| lr 5.49e-04 | 7736.97 ms | 17.5% bf16 MFU | 122550 tok/s step 4263/19560 | loss 3.625786 (+0.70z)| norm 0.2874 (-0.13z)| lr 5.49e-04 | 4159.16 ms | 32.5% bf16 MFU | 122725 tok/s step 4264/19560 | loss 3.594522 (-0.09z)| norm 0.2817 (-0.31z)| lr 5.49e-04 | 4146.61 ms | 32.6% bf16 MFU | 122910 tok/s step 4265/19560 | loss 3.595572 (-0.06z)| norm 0.2810 (-0.33z)| lr 5.49e-04 | 4169.54 ms | 32.4% bf16 MFU | 123052 tok/s step 4266/19560 | loss 3.629375 (+0.81z)| norm 0.3158 (+0.73z)| lr 5.49e-04 | 4156.90 ms | 32.5% bf16 MFU | 123206 tok/s step 4267/19560 | loss 
3.604235 (+0.15z)| norm 0.3057 (+0.41z)| lr 5.49e-04 | 4275.99 ms | 31.6% bf16 MFU | 123176 tok/s step 4268/19560 | loss 3.582823 (-0.40z)| norm 0.2599 (-0.99z)| lr 5.49e-04 | 4706.96 ms | 28.7% bf16 MFU | 122587 tok/s step 4269/19560 | loss 3.611034 (+0.32z)| norm 0.2771 (-0.47z)| lr 5.49e-04 | 4146.83 ms | 32.6% bf16 MFU | 122779 tok/s step 4270/19560 | loss 3.581696 (-0.43z)| norm 0.2681 (-0.74z)| lr 5.49e-04 | 4148.56 ms | 32.5% bf16 MFU | 122959 tok/s step 4271/19560 | loss 3.658467 (+1.53z)| norm 0.2744 (-0.55z)| lr 5.48e-04 | 4148.82 ms | 32.5% bf16 MFU | 123129 tok/s step 4272/19560 | loss 3.563557 (-0.90z)| norm 0.2476 (-1.35z)| lr 5.48e-04 | 4151.16 ms | 32.5% bf16 MFU | 123288 tok/s step 4273/19560 | loss 3.592405 (-0.16z)| norm 0.2627 (-0.88z)| lr 5.48e-04 | 19112.24 ms | 7.1% bf16 MFU | 118495 tok/s step 4274/19560 | loss 3.613148 (+0.37z)| norm 0.2478 (-1.31z)| lr 5.48e-04 | 4114.43 ms | 32.8% bf16 MFU | 118942 tok/s step 4275/19560 | loss 3.595845 (-0.07z)| norm 0.2630 (-0.84z)| lr 5.48e-04 | 4123.56 ms | 32.7% bf16 MFU | 119352 tok/s step 4276/19560 | loss 3.607176 (+0.21z)| norm 0.2796 (-0.34z)| lr 5.48e-04 | 4119.34 ms | 32.8% bf16 MFU | 119748 tok/s step 4277/19560 | loss 3.619677 (+0.54z)| norm 0.3594 (+2.05z)| lr 5.48e-04 | 4121.09 ms | 32.8% bf16 MFU | 120122 tok/s step 4278/19560 | loss 3.597505 (-0.03z)| norm 0.2816 (-0.30z)| lr 5.48e-04 | 4116.64 ms | 32.8% bf16 MFU | 120483 tok/s step 4279/19560 | loss 3.630121 (+0.81z)| norm 0.2780 (-0.40z)| lr 5.48e-04 | 4142.55 ms | 32.6% bf16 MFU | 120787 tok/s step 4280/19560 | loss 3.594413 (-0.11z)| norm 0.2952 (+0.11z)| lr 5.48e-04 | 4129.32 ms | 32.7% bf16 MFU | 121096 tok/s step 4281/19560 | loss 3.637430 (+0.99z)| norm 0.2772 (-0.43z)| lr 5.48e-04 | 4134.16 ms | 32.7% bf16 MFU | 121382 tok/s step 4282/19560 | loss 3.564270 (-0.90z)| norm 0.2790 (-0.38z)| lr 5.48e-04 | 4152.89 ms | 32.5% bf16 MFU | 121626 tok/s step 4283/19560 | loss 3.605907 (+0.18z)| norm 0.2972 (+0.18z)| lr 5.48e-04 | 4138.00 
ms | 32.6% bf16 MFU | 121879 tok/s step 4284/19560 | loss 3.649239 (+1.29z)| norm 0.3298 (+1.17z)| lr 5.48e-04 | 4148.83 ms | 32.5% bf16 MFU | 122104 tok/s step 4285/19560 | loss 3.570749 (-0.72z)| norm 0.3094 (+0.55z)| lr 5.48e-04 | 4145.98 ms | 32.6% bf16 MFU | 122322 tok/s step 4286/19560 | loss 3.588822 (-0.25z)| norm 0.2908 (+0.01z)| lr 5.48e-04 | 4137.23 ms | 32.6% bf16 MFU | 122542 tok/s step 4287/19560 | loss 3.466591 (-3.24z)| norm 0.3148 (+0.79z)| lr 5.48e-04 | 4141.11 ms | 32.6% bf16 MFU | 122745 tok/s step 4288/19560 | loss 3.613203 (+0.40z)| norm 0.2914 (+0.05z)| lr 5.48e-04 | 4144.60 ms | 32.6% bf16 MFU | 122933 tok/s step 4289/19560 | loss 3.608935 (+0.29z)| norm 0.2863 (-0.11z)| lr 5.48e-04 | 4138.10 ms | 32.6% bf16 MFU | 123121 tok/s step 4290/19560 | loss 3.622987 (+0.63z)| norm 0.2954 (+0.20z)| lr 5.48e-04 | 4152.85 ms | 32.5% bf16 MFU | 123277 tok/s step 4291/19560 | loss 3.588498 (-0.23z)| norm 0.2694 (-0.65z)| lr 5.48e-04 | 4153.02 ms | 32.5% bf16 MFU | 123425 tok/s step 4292/19560 | loss 3.628613 (+0.76z)| norm 0.2981 (+0.31z)| lr 5.48e-04 | 4136.97 ms | 32.6% bf16 MFU | 123591 tok/s step 4293/19560 | loss 3.613686 (+0.38z)| norm 0.2662 (-0.75z)| lr 5.48e-04 | 4148.79 ms | 32.5% bf16 MFU | 123730 tok/s step 4294/19560 | loss 3.561999 (-0.89z)| norm 0.2587 (-1.00z)| lr 5.48e-04 | 4154.99 ms | 32.5% bf16 MFU | 123852 tok/s step 4295/19560 | loss 3.570439 (-0.67z)| norm 0.2562 (-1.07z)| lr 5.48e-04 | 4154.93 ms | 32.5% bf16 MFU | 123969 tok/s step 4296/19560 | loss 3.601885 (+0.11z)| norm 0.2848 (-0.06z)| lr 5.48e-04 | 4141.90 ms | 32.6% bf16 MFU | 124100 tok/s step 4297/19560 | loss 3.582929 (-0.37z)| norm 0.2904 (+0.14z)| lr 5.48e-04 | 4150.80 ms | 32.5% bf16 MFU | 124210 tok/s step 4298/19560 | loss 3.635350 (+0.93z)| norm 0.3280 (+1.45z)| lr 5.48e-04 | 4160.07 ms | 32.5% bf16 MFU | 124301 tok/s step 4299/19560 | loss 3.599507 (+0.03z)| norm 0.3056 (+0.66z)| lr 5.48e-04 | 4161.37 ms | 32.4% bf16 MFU | 124386 tok/s step 4300/19560 | loss 
3.582383 (-0.40z)| norm 0.2757 (-0.37z)| lr 5.48e-04 | 4147.36 ms | 32.6% bf16 MFU | 124487 tok/s step 4301/19560 | loss 3.614394 (+0.40z)| norm 0.2848 (-0.05z)| lr 5.48e-04 | 4179.73 ms | 32.3% bf16 MFU | 124534 tok/s step 4302/19560 | loss 3.644701 (+1.14z)| norm 0.2544 (-1.11z)| lr 5.48e-04 | 4177.81 ms | 32.3% bf16 MFU | 124582 tok/s step 4303/19560 | loss 3.653563 (+1.36z)| norm 0.2667 (-0.67z)| lr 5.48e-04 | 4183.25 ms | 32.3% bf16 MFU | 124620 tok/s step 4304/19560 | loss 3.543618 (-1.39z)| norm 0.2760 (-0.34z)| lr 5.48e-04 | 4159.14 ms | 32.5% bf16 MFU | 124692 tok/s step 4305/19560 | loss 3.653692 (+1.38z)| norm 0.2616 (-0.84z)| lr 5.48e-04 | 4174.32 ms | 32.3% bf16 MFU | 124737 tok/s step 4306/19560 | loss 3.579798 (-0.48z)| norm 0.2600 (-0.89z)| lr 5.48e-04 | 4162.37 ms | 32.4% bf16 MFU | 124798 tok/s step 4307/19560 | loss 3.593655 (-0.12z)| norm 0.2686 (-0.58z)| lr 5.47e-04 | 4155.61 ms | 32.5% bf16 MFU | 124866 tok/s step 4308/19560 | loss 3.539531 (-1.48z)| norm 0.2784 (-0.22z)| lr 5.47e-04 | 4148.13 ms | 32.5% bf16 MFU | 124943 tok/s step 4309/19560 | loss 3.627812 (+0.75z)| norm 0.2830 (-0.07z)| lr 5.47e-04 | 4151.36 ms | 32.5% bf16 MFU | 125010 tok/s step 4310/19560 | loss 3.556995 (-1.04z)| norm 0.2595 (-0.89z)| lr 5.47e-04 | 4145.41 ms | 32.6% bf16 MFU | 125083 tok/s step 4311/19560 | loss 3.553082 (-1.12z)| norm 0.2592 (-0.90z)| lr 5.47e-04 | 4184.90 ms | 32.3% bf16 MFU | 125093 tok/s step 4312/19560 | loss 3.593269 (-0.11z)| norm 0.2425 (-1.48z)| lr 5.47e-04 | 4140.34 ms | 32.6% bf16 MFU | 125170 tok/s step 4313/19560 | loss 3.535137 (-1.54z)| norm 0.2584 (-0.93z)| lr 5.47e-04 | 4164.69 ms | 32.4% bf16 MFU | 125206 tok/s step 4314/19560 | loss 3.578685 (-0.45z)| norm 0.2479 (-1.29z)| lr 5.47e-04 | 4168.94 ms | 32.4% bf16 MFU | 125234 tok/s step 4315/19560 | loss 3.586483 (-0.25z)| norm 0.2887 (+0.13z)| lr 5.47e-04 | 4144.24 ms | 32.6% bf16 MFU | 125297 tok/s step 4316/19560 | loss 3.609339 (+0.30z)| norm 0.2794 (-0.20z)| lr 5.47e-04 | 4147.98 
ms | 32.6% bf16 MFU | 125352 tok/s step 4317/19560 | loss 3.598563 (+0.03z)| norm 0.2602 (-0.88z)| lr 5.47e-04 | 4159.35 ms | 32.5% bf16 MFU | 125387 tok/s step 4318/19560 | loss 3.640385 (+1.07z)| norm 0.2738 (-0.41z)| lr 5.47e-04 | 4151.43 ms | 32.5% bf16 MFU | 125432 tok/s step 4319/19560 | loss 3.656756 (+1.47z)| norm 0.2905 (+0.17z)| lr 5.47e-04 | 4142.99 ms | 32.6% bf16 MFU | 125488 tok/s step 4320/19560 | loss 3.629889 (+0.79z)| norm 0.3031 (+0.61z)| lr 5.47e-04 | 4152.07 ms | 32.5% bf16 MFU | 125527 tok/s step 4321/19560 | loss 3.582579 (-0.41z)| norm 0.2975 (+0.40z)| lr 5.47e-04 | 4155.78 ms | 32.5% bf16 MFU | 125559 tok/s step 4322/19560 | loss 3.625775 (+0.67z)| norm 0.2545 (-1.15z)| lr 5.47e-04 | 4144.08 ms | 32.6% bf16 MFU | 125607 tok/s step 4323/19560 | loss 3.599803 (+0.01z)| norm 0.2639 (-0.81z)| lr 5.47e-04 | 4159.39 ms | 32.5% bf16 MFU | 125629 tok/s step 4324/19560 | loss 3.611990 (+0.32z)| norm 0.2725 (-0.50z)| lr 5.47e-04 | 4149.44 ms | 32.5% bf16 MFU | 125665 tok/s step 4325/19560 | loss 3.580474 (-0.47z)| norm 0.3027 (+0.58z)| lr 5.47e-04 | 4174.37 ms | 32.3% bf16 MFU | 125662 tok/s step 4326/19560 | loss 3.587517 (-0.29z)| norm 0.3116 (+0.90z)| lr 5.47e-04 | 4146.26 ms | 32.6% bf16 MFU | 125701 tok/s step 4327/19560 | loss 3.614301 (+0.37z)| norm 0.3474 (+2.14z)| lr 5.47e-04 | 4155.41 ms | 32.5% bf16 MFU | 125724 tok/s step 4328/19560 | loss 3.631559 (+0.81z)| norm 0.3133 (+0.92z)| lr 5.47e-04 | 4148.63 ms | 32.5% bf16 MFU | 125757 tok/s step 4329/19560 | loss 3.566502 (-0.86z)| norm 0.2388 (-1.68z)| lr 5.47e-04 | 4167.48 ms | 32.4% bf16 MFU | 125759 tok/s step 4330/19560 | loss 3.620630 (+0.53z)| norm 0.2622 (-0.85z)| lr 5.47e-04 | 4166.55 ms | 32.4% bf16 MFU | 125763 tok/s step 4331/19560 | loss 3.568540 (-0.80z)| norm 0.2811 (-0.18z)| lr 5.47e-04 | 4141.14 ms | 32.6% bf16 MFU | 125805 tok/s step 4332/19560 | loss 3.584123 (-0.40z)| norm 0.2705 (-0.56z)| lr 5.47e-04 | 4139.95 ms | 32.6% bf16 MFU | 125847 tok/s step 4333/19560 | loss 
3.683686 (+2.11z)| norm 0.2830 (-0.12z)| lr 5.47e-04 | 4155.65 ms | 32.5% bf16 MFU | 125863 tok/s step 4334/19560 | loss 3.659895 (+1.48z)| norm 0.2954 (+0.31z)| lr 5.47e-04 | 4146.96 ms | 32.6% bf16 MFU | 125891 tok/s step 4335/19560 | loss 3.608024 (+0.16z)| norm 0.2991 (+0.44z)| lr 5.47e-04 | 4213.52 ms | 32.0% bf16 MFU | 125818 tok/s step 4336/19560 | loss 3.576467 (-0.65z)| norm 0.2977 (+0.39z)| lr 5.47e-04 | 4214.41 ms | 32.0% bf16 MFU | 125747 tok/s step 4337/19560 | loss 3.583027 (-0.49z)| norm 0.3251 (+1.33z)| lr 5.47e-04 | 4150.85 ms | 32.5% bf16 MFU | 125775 tok/s step 4338/19560 | loss 3.594916 (-0.18z)| norm 0.3236 (+1.26z)| lr 5.47e-04 | 4321.90 ms | 31.2% bf16 MFU | 125552 tok/s step 4339/19560 | loss 3.568993 (-0.83z)| norm 0.2841 (-0.11z)| lr 5.47e-04 | 4155.48 ms | 32.5% bf16 MFU | 125583 tok/s step 4340/19560 | loss 3.615986 (+0.36z)| norm 0.2698 (-0.60z)| lr 5.47e-04 | 4151.40 ms | 32.5% bf16 MFU | 125618 tok/s step 4341/19560 | loss 3.683646 (+2.04z)| norm 0.2816 (-0.19z)| lr 5.47e-04 | 4144.81 ms | 32.6% bf16 MFU | 125662 tok/s step 4342/19560 | loss 3.586080 (-0.42z)| norm 0.2710 (-0.57z)| lr 5.46e-04 | 4149.38 ms | 32.5% bf16 MFU | 125696 tok/s step 4343/19560 | loss 3.639910 (+0.97z)| norm 0.3113 (+0.82z)| lr 5.46e-04 | 4204.51 ms | 32.1% bf16 MFU | 125646 tok/s step 4344/19560 | loss 3.629665 (+0.71z)| norm 0.3231 (+1.22z)| lr 5.46e-04 | 4254.33 ms | 31.7% bf16 MFU | 125526 tok/s step 4345/19560 | loss 3.614845 (+0.32z)| norm 0.2765 (-0.38z)| lr 5.46e-04 | 4164.39 ms | 32.4% bf16 MFU | 125545 tok/s step 4346/19560 | loss 3.546647 (-1.43z)| norm 0.3023 (+0.51z)| lr 5.46e-04 | 4186.68 ms | 32.2% bf16 MFU | 125529 tok/s step 4347/19560 | loss 3.591524 (-0.27z)| norm 0.2929 (+0.17z)| lr 5.46e-04 | 4143.60 ms | 32.6% bf16 MFU | 125579 tok/s step 4348/19560 | loss 3.600404 (-0.05z)| norm 0.2694 (-0.64z)| lr 5.46e-04 | 4333.95 ms | 31.2% bf16 MFU | 125348 tok/s step 4349/19560 | loss 3.567498 (-0.88z)| norm 0.3040 (+0.73z)| lr 5.46e-04 | 4155.83 
ms | 32.5% bf16 MFU | 125389 tok/s step 4350/19560 | loss 3.692415 (+2.28z)| norm 0.3770 (+3.56z)| lr 5.46e-04 | 4310.98 ms | 31.3% bf16 MFU | 125200 tok/s step 4351/19560 | loss 3.717163 (+2.80z)| norm 0.3312 (+1.72z)| lr 5.46e-04 | 4159.31 ms | 32.5% bf16 MFU | 125243 tok/s step 4352/19560 | loss 3.567113 (-0.88z)| norm 0.2881 (+0.01z)| lr 5.46e-04 | 4145.72 ms | 32.6% bf16 MFU | 125304 tok/s step 4353/19560 | loss 3.502212 (-2.39z)| norm 0.3341 (+1.79z)| lr 5.46e-04 | 4162.78 ms | 32.4% bf16 MFU | 125336 tok/s step 4354/19560 | loss 3.674981 (+1.71z)| norm 0.3674 (+2.96z)| lr 5.46e-04 | 4144.53 ms | 32.6% bf16 MFU | 125394 tok/s step 4355/19560 | loss 3.615554 (+0.30z)| norm 0.3398 (+1.88z)| lr 5.46e-04 | 4146.60 ms | 32.6% bf16 MFU | 125447 tok/s step 4356/19560 | loss 3.539547 (-1.51z)| norm 0.2750 (-0.52z)| lr 5.46e-04 | 4148.68 ms | 32.5% bf16 MFU | 125493 tok/s step 4357/19560 | loss 3.705973 (+2.37z)| norm 0.2826 (-0.23z)| lr 5.46e-04 | 4166.30 ms | 32.4% bf16 MFU | 125510 tok/s step 4358/19560 | loss 3.574616 (-0.67z)| norm 0.3128 (+0.91z)| lr 5.46e-04 | 4148.29 ms | 32.5% bf16 MFU | 125554 tok/s step 4359/19560 | loss 3.633194 (+0.74z)| norm 0.3334 (+1.69z)| lr 5.46e-04 | 4163.84 ms | 32.4% bf16 MFU | 125572 tok/s step 4360/19560 | loss 3.632697 (+0.71z)| norm 0.3114 (+0.90z)| lr 5.46e-04 | 4156.54 ms | 32.5% bf16 MFU | 125600 tok/s step 4361/19560 | loss 3.655173 (+1.24z)| norm 0.2929 (+0.18z)| lr 5.46e-04 | 4178.17 ms | 32.3% bf16 MFU | 125594 tok/s step 4362/19560 | loss 3.603796 (-0.00z)| norm 0.3184 (+1.16z)| lr 5.46e-04 | 4169.76 ms | 32.4% bf16 MFU | 125601 tok/s step 4363/19560 | loss 3.582121 (-0.53z)| norm 0.3055 (+0.66z)| lr 5.46e-04 | 4168.43 ms | 32.4% bf16 MFU | 125610 tok/s step 4364/19560 | loss 3.582247 (-0.52z)| norm 0.2891 (+0.03z)| lr 5.46e-04 | 4157.46 ms | 32.5% bf16 MFU | 125635 tok/s step 4365/19560 | loss 3.600684 (-0.07z)| norm 0.3095 (+0.81z)| lr 5.46e-04 | 4144.52 ms | 32.6% bf16 MFU | 125678 tok/s step 4366/19560 | loss 
3.564793 (-0.93z)| norm 0.3048 (+0.63z)| lr 5.46e-04 | 4159.68 ms | 32.5% bf16 MFU | 125696 tok/s step 4367/19560 | loss 3.623854 (+0.53z)| norm 0.3158 (+1.04z)| lr 5.46e-04 | 4158.18 ms | 32.5% bf16 MFU | 125716 tok/s step 4368/19560 | loss 3.582558 (-0.50z)| norm 0.2659 (-0.88z)| lr 5.46e-04 | 4167.25 ms | 32.4% bf16 MFU | 125721 tok/s step 4369/19560 | loss 3.567249 (-0.88z)| norm 0.2567 (-1.22z)| lr 5.46e-04 | 4158.36 ms | 32.5% bf16 MFU | 125739 tok/s step 4370/19560 | loss 3.634970 (+0.79z)| norm 0.2858 (-0.10z)| lr 5.46e-04 | 4146.80 ms | 32.6% bf16 MFU | 125773 tok/s step 4371/19560 | loss 3.553457 (-1.22z)| norm 0.2976 (+0.34z)| lr 5.46e-04 | 4146.63 ms | 32.6% bf16 MFU | 125807 tok/s step 4372/19560 | loss 3.621539 (+0.48z)| norm 0.2687 (-0.76z)| lr 5.46e-04 | 4157.31 ms | 32.5% bf16 MFU | 125822 tok/s step 4373/19560 | loss 3.663671 (+1.53z)| norm 0.3034 (+0.55z)| lr 5.46e-04 | 4164.28 ms | 32.4% bf16 MFU | 125826 tok/s step 4374/19560 | loss 3.578470 (-0.59z)| norm 0.2936 (+0.18z)| lr 5.46e-04 | 4174.51 ms | 32.3% bf16 MFU | 125814 tok/s step 4375/19560 | loss 3.570889 (-0.78z)| norm 0.2906 (+0.06z)| lr 5.46e-04 | 4142.89 ms | 32.6% bf16 MFU | 125851 tok/s step 4376/19560 | loss 3.595181 (-0.17z)| norm 0.2817 (-0.28z)| lr 5.46e-04 | 4157.88 ms | 32.5% bf16 MFU | 125863 tok/s step 4377/19560 | loss 3.498957 (-2.48z)| norm 0.2661 (-0.87z)| lr 5.45e-04 | 4162.88 ms | 32.4% bf16 MFU | 125867 tok/s step 4378/19560 | loss 3.646445 (+1.09z)| norm 0.3006 (+0.45z)| lr 5.45e-04 | 4164.91 ms | 32.4% bf16 MFU | 125868 tok/s step 4379/19560 | loss 3.578968 (-0.60z)| norm 0.2750 (-0.53z)| lr 5.45e-04 | 4177.38 ms | 32.3% bf16 MFU | 125850 tok/s step 4380/19560 | loss 3.612010 (+0.23z)| norm 0.2715 (-0.66z)| lr 5.45e-04 | 4164.48 ms | 32.4% bf16 MFU | 125852 tok/s step 4381/19560 | loss 3.691463 (+2.18z)| norm 0.2559 (-1.24z)| lr 5.45e-04 | 4164.34 ms | 32.4% bf16 MFU | 125855 tok/s step 4382/19560 | loss 3.557149 (-1.15z)| norm 0.2571 (-1.19z)| lr 5.45e-04 | 4167.83 
ms | 32.4% bf16 MFU | 125851 tok/s step 4383/19560 | loss 3.629652 (+0.64z)| norm 0.2697 (-0.69z)| lr 5.45e-04 | 4142.30 ms | 32.6% bf16 MFU | 125887 tok/s step 4384/19560 | loss 3.539685 (-1.57z)| norm 0.2574 (-1.15z)| lr 5.45e-04 | 4252.58 ms | 31.7% bf16 MFU | 125757 tok/s step 4385/19560 | loss 3.602288 (-0.03z)| norm 0.2839 (-0.12z)| lr 5.45e-04 | 4198.74 ms | 32.2% bf16 MFU | 125713 tok/s step 4386/19560 | loss 3.633358 (+0.73z)| norm 0.2762 (-0.43z)| lr 5.45e-04 | 4144.09 ms | 32.6% bf16 MFU | 125753 tok/s step 4387/19560 | loss 3.592997 (-0.25z)| norm 0.2926 (+0.21z)| lr 5.45e-04 | 4146.84 ms | 32.6% bf16 MFU | 125787 tok/s step 4388/19560 | loss 3.607720 (+0.12z)| norm 0.2913 (+0.15z)| lr 5.45e-04 | 4247.17 ms | 31.8% bf16 MFU | 125670 tok/s step 4389/19560 | loss 3.606328 (+0.10z)| norm 0.3073 (+0.77z)| lr 5.45e-04 | 4146.89 ms | 32.6% bf16 MFU | 125708 tok/s step 4390/19560 | loss 3.713855 (+2.69z)| norm 0.3068 (+0.75z)| lr 5.45e-04 | 4143.94 ms | 32.6% bf16 MFU | 125748 tok/s step 4391/19560 | loss 3.611217 (+0.19z)| norm 0.2559 (-1.23z)| lr 5.45e-04 | 4145.93 ms | 32.6% bf16 MFU | 125784 tok/s step 4392/19560 | loss 3.667891 (+1.55z)| norm 0.2694 (-0.70z)| lr 5.45e-04 | 4147.99 ms | 32.6% bf16 MFU | 125814 tok/s step 4393/19560 | loss 3.529657 (-1.77z)| norm 0.2574 (-1.16z)| lr 5.45e-04 | 4146.85 ms | 32.6% bf16 MFU | 125845 tok/s step 4394/19560 | loss 3.719774 (+2.69z)| norm 0.2809 (-0.24z)| lr 5.45e-04 | 4172.17 ms | 32.4% bf16 MFU | 125836 tok/s step 4395/19560 | loss 3.580057 (-0.56z)| norm 0.2562 (-1.18z)| lr 5.45e-04 | 4156.87 ms | 32.5% bf16 MFU | 125851 tok/s step 4396/19560 | loss 3.594062 (-0.23z)| norm 0.2475 (-1.50z)| lr 5.45e-04 | 4161.67 ms | 32.4% bf16 MFU | 125857 tok/s step 4397/19560 | loss 3.591788 (-0.28z)| norm 0.2439 (-1.62z)| lr 5.45e-04 | 4159.30 ms | 32.5% bf16 MFU | 125867 tok/s step 4398/19560 | loss 3.622024 (+0.41z)| norm 0.2594 (-1.02z)| lr 5.45e-04 | 4148.17 ms | 32.5% bf16 MFU | 125893 tok/s step 4399/19560 | loss 
3.544433 (-1.37z)| norm 0.2713 (-0.57z)| lr 5.45e-04 | 4153.94 ms | 32.5% bf16 MFU | 125909 tok/s step 4400/19560 | loss 3.567458 (-0.84z)| norm 0.2791 (-0.28z)| lr 5.45e-04 | 4147.72 ms | 32.6% bf16 MFU | 125934 tok/s step 4401/19560 | loss 3.551802 (-1.19z)| norm 0.2583 (-1.08z)| lr 5.45e-04 | 4163.62 ms | 32.4% bf16 MFU | 125933 tok/s step 4402/19560 | loss 3.583497 (-0.45z)| norm 0.2428 (-1.67z)| lr 5.45e-04 | 4152.08 ms | 32.5% bf16 MFU | 125950 tok/s step 4403/19560 | loss 3.589882 (-0.30z)| norm 0.2601 (-1.00z)| lr 5.45e-04 | 4162.72 ms | 32.4% bf16 MFU | 125950 tok/s step 4404/19560 | loss 3.577022 (-0.59z)| norm 0.2967 (+0.39z)| lr 5.45e-04 | 4145.78 ms | 32.6% bf16 MFU | 125976 tok/s step 4405/19560 | loss 3.573022 (-0.67z)| norm 0.2694 (-0.64z)| lr 5.45e-04 | 4144.48 ms | 32.6% bf16 MFU | 126002 tok/s step 4406/19560 | loss 3.573786 (-0.65z)| norm 0.2416 (-1.71z)| lr 5.45e-04 | 4167.99 ms | 32.4% bf16 MFU | 125991 tok/s step 4407/19560 | loss 3.581806 (-0.46z)| norm 0.2555 (-1.15z)| lr 5.45e-04 | 4164.07 ms | 32.4% bf16 MFU | 125987 tok/s step 4408/19560 | loss 3.628968 (+0.62z)| norm 0.2681 (-0.65z)| lr 5.45e-04 | 4164.40 ms | 32.4% bf16 MFU | 125983 tok/s step 4409/19560 | loss 3.538236 (-1.44z)| norm 0.3136 (+1.09z)| lr 5.45e-04 | 4164.93 ms | 32.4% bf16 MFU | 125978 tok/s step 4410/19560 | loss 3.629579 (+0.64z)| norm 0.2732 (-0.47z)| lr 5.45e-04 | 4176.18 ms | 32.3% bf16 MFU | 125956 tok/s step 4411/19560 | loss 3.567415 (-0.77z)| norm 0.2598 (-0.97z)| lr 5.45e-04 | 4158.41 ms | 32.5% bf16 MFU | 125962 tok/s step 4412/19560 | loss 3.597275 (-0.08z)| norm 0.2765 (-0.31z)| lr 5.44e-04 | 4179.68 ms | 32.3% bf16 MFU | 125936 tok/s step 4413/19560 | loss 3.681800 (+1.82z)| norm 0.2684 (-0.62z)| lr 5.44e-04 | 4145.48 ms | 32.6% bf16 MFU | 125963 tok/s step 4414/19560 | loss 3.605953 (+0.09z)| norm 0.3013 (+0.66z)| lr 5.44e-04 | 4145.34 ms | 32.6% bf16 MFU | 125988 tok/s step 4415/19560 | loss 3.585588 (-0.41z)| norm 0.2983 (+0.55z)| lr 5.44e-04 | 4149.77 
ms | 32.5% bf16 MFU | 126006 tok/s step 4416/19560 | loss 3.547744 (-1.28z)| norm 0.3152 (+1.20z)| lr 5.44e-04 | 4159.20 ms | 32.5% bf16 MFU | 126008 tok/s step 4417/19560 | loss 3.606196 (+0.09z)| norm 0.2919 (+0.29z)| lr 5.44e-04 | 4156.98 ms | 32.5% bf16 MFU | 126014 tok/s step 4418/19560 | loss 3.579226 (-0.53z)| norm 0.2925 (+0.31z)| lr 5.44e-04 | 4185.10 ms | 32.3% bf16 MFU | 125977 tok/s step 4419/19560 | loss 3.588367 (-0.32z)| norm 0.2846 (+0.00z)| lr 5.44e-04 | 4156.78 ms | 32.5% bf16 MFU | 125985 tok/s step 4420/19560 | loss 3.599780 (-0.05z)| norm 0.2811 (-0.13z)| lr 5.44e-04 | 4295.46 ms | 31.4% bf16 MFU | 125788 tok/s step 4421/19560 | loss 3.577557 (-0.56z)| norm 0.2576 (-1.04z)| lr 5.44e-04 | 4145.85 ms | 32.6% bf16 MFU | 125822 tok/s step 4422/19560 | loss 3.596585 (-0.12z)| norm 0.2797 (-0.19z)| lr 5.44e-04 | 4144.65 ms | 32.6% bf16 MFU | 125856 tok/s step 4423/19560 | loss 3.620050 (+0.42z)| norm 0.2510 (-1.30z)| lr 5.44e-04 | 4162.08 ms | 32.4% bf16 MFU | 125861 tok/s step 4424/19560 | loss 3.613477 (+0.27z)| norm 0.3155 (+1.19z)| lr 5.44e-04 | 4154.21 ms | 32.5% bf16 MFU | 125879 tok/s step 4425/19560 | loss 3.616579 (+0.33z)| norm 0.3170 (+1.24z)| lr 5.44e-04 | 4143.87 ms | 32.6% bf16 MFU | 125911 tok/s step 4426/19560 | loss 3.596149 (-0.14z)| norm 0.3024 (+0.69z)| lr 5.44e-04 | 4151.81 ms | 32.5% bf16 MFU | 125929 tok/s step 4427/19560 | loss 3.593596 (-0.20z)| norm 0.2935 (+0.35z)| lr 5.44e-04 | 4157.53 ms | 32.5% bf16 MFU | 125938 tok/s step 4428/19560 | loss 3.528395 (-1.71z)| norm 0.2856 (+0.03z)| lr 5.44e-04 | 4143.98 ms | 32.6% bf16 MFU | 125967 tok/s step 4429/19560 | loss 3.581263 (-0.47z)| norm 0.2570 (-1.06z)| lr 5.44e-04 | 4141.58 ms | 32.6% bf16 MFU | 125998 tok/s step 4430/19560 | loss 3.636372 (+0.82z)| norm 0.2654 (-0.74z)| lr 5.44e-04 | 4155.96 ms | 32.5% bf16 MFU | 126006 tok/s step 4431/19560 | loss 3.557661 (-1.01z)| norm 0.2720 (-0.49z)| lr 5.44e-04 | 4171.90 ms | 32.4% bf16 MFU | 125989 tok/s step 4432/19560 | loss 
3.650449 (+1.15z)| norm 0.2846 (-0.00z)| lr 5.44e-04 | 4154.24 ms | 32.5% bf16 MFU | 126000 tok/s step 4433/19560 | loss 3.667064 (+1.53z)| norm 0.3067 (+0.84z)| lr 5.44e-04 | 4143.23 ms | 32.6% bf16 MFU | 126027 tok/s step 4434/19560 | loss 3.589484 (-0.28z)| norm 0.3304 (+1.73z)| lr 5.44e-04 | 4159.33 ms | 32.5% bf16 MFU | 126028 tok/s step 4435/19560 | loss 3.603473 (+0.04z)| norm 0.3014 (+0.60z)| lr 5.44e-04 | 4165.40 ms | 32.4% bf16 MFU | 126020 tok/s step 4436/19560 | loss 3.551252 (-1.19z)| norm 0.2740 (-0.45z)| lr 5.44e-04 | 4156.04 ms | 32.5% bf16 MFU | 126027 tok/s step 4437/19560 | loss 3.550357 (-1.19z)| norm 0.2960 (+0.39z)| lr 5.44e-04 | 4144.00 ms | 32.6% bf16 MFU | 126051 tok/s step 4438/19560 | loss 3.571376 (-0.70z)| norm 0.3514 (+2.45z)| lr 5.44e-04 | 4145.32 ms | 32.6% bf16 MFU | 126073 tok/s step 4439/19560 | loss 3.621943 (+0.47z)| norm 0.3020 (+0.57z)| lr 5.44e-04 | 4144.15 ms | 32.6% bf16 MFU | 126095 tok/s step 4440/19560 | loss 3.657669 (+1.29z)| norm 0.2864 (-0.03z)| lr 5.44e-04 | 4390.41 ms | 30.8% bf16 MFU | 125761 tok/s step 4441/19560 | loss 3.589322 (-0.32z)| norm 0.2854 (-0.08z)| lr 5.44e-04 | 4159.37 ms | 32.5% bf16 MFU | 125775 tok/s step 4442/19560 | loss 3.647338 (+1.04z)| norm 0.2958 (+0.31z)| lr 5.44e-04 | 4152.95 ms | 32.5% bf16 MFU | 125799 tok/s step 4443/19560 | loss 3.622738 (+0.45z)| norm 0.2898 (+0.07z)| lr 5.44e-04 | 4157.74 ms | 32.5% bf16 MFU | 125814 tok/s step 4444/19560 | loss 3.567864 (-0.83z)| norm 0.2861 (-0.07z)| lr 5.44e-04 | 4228.69 ms | 31.9% bf16 MFU | 125722 tok/s step 4445/19560 | loss 3.682933 (+1.83z)| norm 0.2809 (-0.28z)| lr 5.44e-04 | 4148.55 ms | 32.5% bf16 MFU | 125755 tok/s step 4446/19560 | loss 3.574311 (-0.67z)| norm 0.2816 (-0.26z)| lr 5.43e-04 | 4157.16 ms | 32.5% bf16 MFU | 125773 tok/s step 4447/19560 | loss 3.566184 (-0.85z)| norm 0.2576 (-1.18z)| lr 5.43e-04 | 4149.08 ms | 32.5% bf16 MFU | 125803 tok/s step 4448/19560 | loss 3.591785 (-0.25z)| norm 0.2653 (-0.86z)| lr 5.43e-04 | 4155.09 
ms | 32.5% bf16 MFU | 125821 tok/s step 4449/19560 | loss 3.579467 (-0.53z)| norm 0.2886 (+0.04z)| lr 5.43e-04 | 4270.51 ms | 31.6% bf16 MFU | 125669 tok/s step 4450/19560 | loss 3.535378 (-1.53z)| norm 0.2870 (-0.03z)| lr 5.43e-04 | 4144.18 ms | 32.6% bf16 MFU | 125711 tok/s step 4451/19560 | loss 3.627877 (+0.60z)| norm 0.3267 (+1.49z)| lr 5.43e-04 | 4155.16 ms | 32.5% bf16 MFU | 125734 tok/s step 4452/19560 | loss 3.602049 (+0.00z)| norm 0.2859 (-0.09z)| lr 5.43e-04 | 4160.31 ms | 32.5% bf16 MFU | 125749 tok/s step 4453/19560 | loss 3.547876 (-1.23z)| norm 0.2873 (-0.04z)| lr 5.43e-04 | 4146.18 ms | 32.6% bf16 MFU | 125784 tok/s step 4454/19560 | loss 3.547561 (-1.23z)| norm 0.2966 (+0.33z)| lr 5.43e-04 | 4151.18 ms | 32.5% bf16 MFU | 125809 tok/s step 4455/19560 | loss 3.630123 (+0.65z)| norm 0.3116 (+0.94z)| lr 5.43e-04 | 4152.49 ms | 32.5% bf16 MFU | 125832 tok/s step 4456/19560 | loss 3.598747 (-0.05z)| norm 0.2800 (-0.30z)| lr 5.43e-04 | 4161.05 ms | 32.4% bf16 MFU | 125840 tok/s step 4457/19560 | loss 3.638772 (+0.85z)| norm 0.2709 (-0.69z)| lr 5.43e-04 | 4144.03 ms | 32.6% bf16 MFU | 125874 tok/s step 4458/19560 | loss 3.607513 (+0.13z)| norm 0.2835 (-0.19z)| lr 5.43e-04 | 4146.21 ms | 32.6% bf16 MFU | 125903 tok/s step 4459/19560 | loss 3.613954 (+0.27z)| norm 0.2733 (-0.59z)| lr 5.43e-04 | 4146.38 ms | 32.6% bf16 MFU | 125930 tok/s step 4460/19560 | loss 3.605526 (+0.08z)| norm 0.4062 (+4.39z)| lr 5.43e-04 | 4143.84 ms | 32.6% bf16 MFU | 125960 tok/s step 4461/19560 | loss 3.614457 (+0.30z)| norm 0.3083 (+0.72z)| lr 5.43e-04 | 4154.08 ms | 32.5% bf16 MFU | 125972 tok/s step 4462/19560 | loss 3.610352 (+0.21z)| norm 0.3451 (+2.04z)| lr 5.43e-04 | 4151.70 ms | 32.5% bf16 MFU | 125988 tok/s step 4463/19560 | loss 3.591286 (-0.23z)| norm 0.2891 (-0.01z)| lr 5.43e-04 | 4157.98 ms | 32.5% bf16 MFU | 125993 tok/s step 4464/19560 | loss 3.608589 (+0.17z)| norm 0.2970 (+0.28z)| lr 5.43e-04 | 4144.82 ms | 32.6% bf16 MFU | 126018 tok/s step 4465/19560 | loss 
3.642235 (+0.94z)| norm 0.2799 (-0.34z)| lr 5.43e-04 | 4169.06 ms | 32.4% bf16 MFU | 126005 tok/s step 4466/19560 | loss 3.560759 (-0.95z)| norm 0.2780 (-0.40z)| lr 5.43e-04 | 4204.85 ms | 32.1% bf16 MFU | 125939 tok/s step 4467/19560 | loss 3.635244 (+0.77z)| norm 0.3131 (+0.90z)| lr 5.43e-04 | 4155.62 ms | 32.5% bf16 MFU | 125950 tok/s step 4468/19560 | loss 3.615078 (+0.30z)| norm 0.2649 (-0.89z)| lr 5.43e-04 | 4155.21 ms | 32.5% bf16 MFU | 125961 tok/s step 4469/19560 | loss 3.643258 (+0.98z)| norm 0.3515 (+2.25z)| lr 5.43e-04 | 4146.62 ms | 32.6% bf16 MFU | 125985 tok/s step 4470/19560 | loss 3.574981 (-0.63z)| norm 0.3472 (+2.05z)| lr 5.43e-04 | 4158.93 ms | 32.5% bf16 MFU | 125989 tok/s step 4471/19560 | loss 3.612653 (+0.26z)| norm 0.2883 (-0.06z)| lr 5.43e-04 | 4157.09 ms | 32.5% bf16 MFU | 125996 tok/s step 4472/19560 | loss 3.610614 (+0.22z)| norm 0.2867 (-0.11z)| lr 5.43e-04 | 4150.59 ms | 32.5% bf16 MFU | 126012 tok/s step 4473/19560 | loss 3.649615 (+1.13z)| norm 0.2958 (+0.22z)| lr 5.43e-04 | 4162.15 ms | 32.4% bf16 MFU | 126009 tok/s step 4474/19560 | loss 3.649103 (+1.10z)| norm 0.4150 (+4.18z)| lr 5.43e-04 | 4166.49 ms | 32.4% bf16 MFU | 126001 tok/s step 4475/19560 | loss 3.627838 (+0.59z)| norm 0.3194 (+0.96z)| lr 5.43e-04 | 4144.38 ms | 32.6% bf16 MFU | 126026 tok/s step 4476/19560 | loss 3.510255 (-2.11z)| norm 0.2993 (+0.27z)| lr 5.43e-04 | 4142.80 ms | 32.6% bf16 MFU | 126052 tok/s step 4477/19560 | loss 3.546923 (-1.26z)| norm 0.3020 (+0.37z)| lr 5.43e-04 | 4186.43 ms | 32.3% bf16 MFU | 126011 tok/s step 4478/19560 | loss 3.627517 (+0.61z)| norm 0.3048 (+0.49z)| lr 5.43e-04 | 4156.27 ms | 32.5% bf16 MFU | 126018 tok/s step 4479/19560 | loss 3.594629 (-0.14z)| norm 0.3221 (+1.10z)| lr 5.43e-04 | 4159.59 ms | 32.5% bf16 MFU | 126019 tok/s step 4480/19560 | loss 3.600719 (+0.00z)| norm 0.2695 (-0.72z)| lr 5.42e-04 | 4143.82 ms | 32.6% bf16 MFU | 126044 tok/s step 4481/19560 | loss 3.535125 (-1.60z)| norm 0.3041 (+0.49z)| lr 5.42e-04 | 4147.69 
ms | 32.6% bf16 MFU | 126062 tok/s step 4482/19560 | loss 3.579560 (-0.51z)| norm 0.2898 (+0.01z)| lr 5.42e-04 | 4332.56 ms | 31.2% bf16 MFU | 125810 tok/s step 4483/19560 | loss 3.518115 (-1.98z)| norm 0.2830 (-0.22z)| lr 5.42e-04 | 4210.32 ms | 32.1% bf16 MFU | 125746 tok/s step 4484/19560 | loss 3.571782 (-0.68z)| norm 0.2797 (-0.34z)| lr 5.42e-04 | 4149.43 ms | 32.5% bf16 MFU | 125776 tok/s step 4485/19560 | loss 3.578251 (-0.51z)| norm 0.2419 (-1.69z)| lr 5.42e-04 | 4151.84 ms | 32.5% bf16 MFU | 125801 tok/s step 4486/19560 | loss 3.654937 (+1.39z)| norm 0.2628 (-0.92z)| lr 5.42e-04 | 4142.94 ms | 32.6% bf16 MFU | 125838 tok/s step 4487/19560 | loss 3.564320 (-0.86z)| norm 0.2360 (-1.86z)| lr 5.42e-04 | 4149.95 ms | 32.5% bf16 MFU | 125863 tok/s step 4488/19560 | loss 3.644589 (+1.14z)| norm 0.2638 (-0.84z)| lr 5.42e-04 | 4207.17 ms | 32.1% bf16 MFU | 125801 tok/s step 4489/19560 | loss 3.578804 (-0.49z)| norm 0.2825 (-0.17z)| lr 5.42e-04 | 4145.12 ms | 32.6% bf16 MFU | 125835 tok/s step 4490/19560 | loss 3.631610 (+0.83z)| norm 0.2682 (-0.67z)| lr 5.42e-04 | 4144.34 ms | 32.6% bf16 MFU | 125869 tok/s step 4491/19560 | loss 3.537742 (-1.50z)| norm 0.2516 (-1.25z)| lr 5.42e-04 | 4152.47 ms | 32.5% bf16 MFU | 125888 tok/s step 4492/19560 | loss 3.572939 (-0.62z)| norm 0.2523 (-1.21z)| lr 5.42e-04 | 4182.07 ms | 32.3% bf16 MFU | 125862 tok/s step 4493/19560 | loss 3.627264 (+0.72z)| norm 0.2454 (-1.43z)| lr 5.42e-04 | 4167.58 ms | 32.4% bf16 MFU | 125859 tok/s step 4494/19560 | loss 3.552867 (-1.12z)| norm 0.2572 (-0.99z)| lr 5.42e-04 | 4198.26 ms | 32.2% bf16 MFU | 125810 tok/s step 4495/19560 | loss 3.596880 (-0.02z)| norm 0.2708 (-0.50z)| lr 5.42e-04 | 4240.82 ms | 31.8% bf16 MFU | 125701 tok/s step 4496/19560 | loss 3.566734 (-0.77z)| norm 0.2634 (-0.76z)| lr 5.42e-04 | 4163.30 ms | 32.4% bf16 MFU | 125713 tok/s step 4497/19560 | loss 3.552083 (-1.12z)| norm 0.2585 (-0.94z)| lr 5.42e-04 | 4219.35 ms | 32.0% bf16 MFU | 125640 tok/s step 4498/19560 | loss 
3.623929 (+0.65z)| norm 0.2753 (-0.34z)| lr 5.42e-04 | 4169.57 ms | 32.4% bf16 MFU | 125645 tok/s step 4499/19560 | loss 3.611892 (+0.35z)| norm 0.2861 (+0.05z)| lr 5.42e-04 | 4156.37 ms | 32.5% bf16 MFU | 125670 tok/s step 4500/19560 | loss 3.637724 (+0.98z)| norm 0.3068 (+0.78z)| lr 5.42e-04 | 4158.50 ms | 32.5% bf16 MFU | 125690 tok/s val loss 3.581325 evaluating HellaSwag: 0/1256
HellaSwag: 2737/10042 = 0.272555 step 4501/19560 | loss 3.594858 (-0.07z)| norm 0.2917 (+0.24z)| lr 5.42e-04 | 4226.50 ms | 31.9% bf16 MFU | 125608 tok/s step 4502/19560 | loss 3.673793 (+1.86z)| norm 0.3020 (+0.61z)| lr 5.42e-04 | 4159.82 ms | 32.5% bf16 MFU | 125629 tok/s step 4503/19560 | loss 3.579127 (-0.47z)| norm 0.2585 (-0.93z)| lr 5.42e-04 | 4187.07 ms | 32.2% bf16 MFU | 125609 tok/s step 4504/19560 | loss 3.520666 (-1.88z)| norm 0.2693 (-0.54z)| lr 5.42e-04 | 4152.21 ms | 32.5% bf16 MFU | 125642 tok/s step 4505/19560 | loss 3.555472 (-1.06z)| norm 0.2581 (-0.93z)| lr 5.42e-04 | 4152.10 ms | 32.5% bf16 MFU | 125673 tok/s step 4506/19560 | loss 3.728439 (+3.11z)| norm 0.2800 (-0.15z)| lr 5.42e-04 | 4159.72 ms | 32.5% bf16 MFU | 125691 tok/s step 4507/19560 | loss 3.622926 (+0.57z)| norm 0.2848 (+0.01z)| lr 5.42e-04 | 4177.78 ms | 32.3% bf16 MFU | 125682 tok/s step 4508/19560 | loss 3.591024 (-0.19z)| norm 0.2902 (+0.20z)| lr 5.42e-04 | 4164.74 ms | 32.4% bf16 MFU | 125692 tok/s step 4509/19560 | loss 3.614703 (+0.40z)| norm 0.3100 (+0.89z)| lr 5.42e-04 | 4177.42 ms | 32.3% bf16 MFU | 125683 tok/s step 4510/19560 | loss 3.560580 (-0.93z)| norm 0.3312 (+1.61z)| lr 5.42e-04 | 4160.72 ms | 32.5% bf16 MFU | 125699 tok/s step 4511/19560 | loss 3.541726 (-1.36z)| norm 0.2925 (+0.24z)| lr 5.42e-04 | 4168.27 ms | 32.4% bf16 MFU | 125703 tok/s step 4512/19560 | loss 3.584822 (-0.33z)| norm 0.2709 (-0.53z)| lr 5.42e-04 | 4153.70 ms | 32.5% bf16 MFU | 125729 tok/s step 4513/19560 | loss 3.565998 (-0.78z)| norm 0.2906 (+0.16z)| lr 5.42e-04 | 4168.22 ms | 32.4% bf16 MFU | 125732 tok/s step 4514/19560 | loss 3.578491 (-0.46z)| norm 0.2734 (-0.44z)| lr 5.41e-04 | 4158.69 ms |
32.5% bf16 MFU | 125748 tok/s step 4515/19560 | loss 3.569094 (-0.69z)| norm 0.2790 (-0.24z)| lr 5.41e-04 | 4167.25 ms | 32.4% bf16 MFU | 125752 tok/s step 4516/19560 | loss 3.628739 (+0.76z)| norm 0.2856 (-0.01z)| lr 5.41e-04 | 4179.22 ms | 32.3% bf16 MFU | 125737 tok/s step 4517/19560 | loss 3.589668 (-0.19z)| norm 0.2953 (+0.34z)| lr 5.41e-04 | 4167.57 ms | 32.4% bf16 MFU | 125740 tok/s step 4518/19560 | loss 3.586380 (-0.25z)| norm 0.3033 (+0.63z)| lr 5.41e-04 | 4174.70 ms | 32.3% bf16 MFU | 125732 tok/s step 4519/19560 | loss 3.574802 (-0.53z)| norm 0.2971 (+0.40z)| lr 5.41e-04 | 4159.02 ms | 32.5% bf16 MFU | 125749 tok/s step 4520/19560 | loss 3.549787 (-1.15z)| norm 0.2922 (+0.22z)| lr 5.41e-04 | 4162.71 ms | 32.4% bf16 MFU | 125759 tok/s step 4521/19560 | loss 3.531360 (-1.62z)| norm 0.2558 (-1.08z)| lr 5.41e-04 | 4173.35 ms | 32.4% bf16 MFU | 125752 tok/s step 4522/19560 | loss 3.544719 (-1.29z)| norm 0.2725 (-0.48z)| lr 5.41e-04 | 4169.69 ms | 32.4% bf16 MFU | 125751 tok/s step 4523/19560 | loss 3.567130 (-0.70z)| norm 0.2842 (-0.08z)| lr 5.41e-04 | 4165.44 ms | 32.4% bf16 MFU | 125757 tok/s step 4524/19560 | loss 3.592451 (-0.03z)| norm 0.2667 (-0.71z)| lr 5.41e-04 | 4169.58 ms | 32.4% bf16 MFU | 125756 tok/s step 4525/19560 | loss 3.575456 (-0.47z)| norm 0.2690 (-0.64z)| lr 5.41e-04 | 4152.79 ms | 32.5% bf16 MFU | 125781 tok/s step 4526/19560 | loss 3.640346 (+1.22z)| norm 0.2832 (-0.13z)| lr 5.41e-04 | 4175.88 ms | 32.3% bf16 MFU | 125769 tok/s step 4527/19560 | loss 3.521960 (-1.86z)| norm 0.2705 (-0.59z)| lr 5.41e-04 | 4155.38 ms | 32.5% bf16 MFU | 125790 tok/s step 4528/19560 | loss 3.564650 (-0.75z)| norm 0.2705 (-0.59z)| lr 5.41e-04 | 4161.46 ms | 32.4% bf16 MFU | 125799 tok/s step 4529/19560 | loss 3.572297 (-0.55z)| norm 0.2532 (-1.22z)| lr 5.41e-04 | 4175.63 ms | 32.3% bf16 MFU | 125787 tok/s step 4530/19560 | loss 3.709730 (+2.90z)| norm 0.2668 (-0.74z)| lr 5.41e-04 | 4163.51 ms | 32.4% bf16 MFU | 125794 tok/s step 4531/19560 | loss 3.533920 
(-1.50z)| norm 0.2955 (+0.31z)| lr 5.41e-04 | 4158.41 ms | 32.5% bf16 MFU | 125808 tok/s step 4532/19560 | loss 3.554981 (-0.97z)| norm 0.3040 (+0.62z)| lr 5.41e-04 | 4177.64 ms | 32.3% bf16 MFU | 125793 tok/s step 4533/19560 | loss 3.593241 (-0.02z)| norm 0.2849 (-0.09z)| lr 5.41e-04 | 4156.44 ms | 32.5% bf16 MFU | 125810 tok/s step 4534/19560 | loss 3.511212 (-2.02z)| norm 0.2705 (-0.63z)| lr 5.41e-04 | 4172.37 ms | 32.4% bf16 MFU | 125803 tok/s step 4535/19560 | loss 3.763908 (+3.89z)| norm 0.2916 (+0.14z)| lr 5.41e-04 | 4161.10 ms | 32.4% bf16 MFU | 125812 tok/s step 4536/19560 | loss 3.630116 (+0.81z)| norm 0.3274 (+1.45z)| lr 5.41e-04 | 4157.36 ms | 32.5% bf16 MFU | 125827 tok/s step 4537/19560 | loss 3.572610 (-0.53z)| norm 0.3091 (+0.77z)| lr 5.41e-04 | 4162.94 ms | 32.4% bf16 MFU | 125833 tok/s step 4538/19560 | loss 3.606392 (+0.26z)| norm 0.2947 (+0.23z)| lr 5.41e-04 | 4184.34 ms | 32.3% bf16 MFU | 125806 tok/s step 4539/19560 | loss 3.557944 (-0.86z)| norm 0.2809 (-0.29z)| lr 5.41e-04 | 4171.54 ms | 32.4% bf16 MFU | 125800 tok/s step 4540/19560 | loss 3.587846 (-0.17z)| norm 0.2765 (-0.45z)| lr 5.41e-04 | 4169.20 ms | 32.4% bf16 MFU | 125798 tok/s step 4541/19560 | loss 3.566448 (-0.65z)| norm 0.2661 (-0.84z)| lr 5.41e-04 | 4164.02 ms | 32.4% bf16 MFU | 125803 tok/s step 4542/19560 | loss 3.643960 (+1.16z)| norm 0.2586 (-1.10z)| lr 5.41e-04 | 4147.36 ms | 32.6% bf16 MFU | 125834 tok/s step 4543/19560 | loss 3.525778 (-1.58z)| norm 0.2830 (-0.19z)| lr 5.41e-04 | 4182.87 ms | 32.3% bf16 MFU | 125809 tok/s step 4544/19560 | loss 3.598334 (+0.09z)| norm 0.2487 (-1.44z)| lr 5.41e-04 | 4189.37 ms | 32.2% bf16 MFU | 125776 tok/s step 4545/19560 | loss 3.562356 (-0.73z)| norm 0.2518 (-1.31z)| lr 5.41e-04 | 4156.86 ms | 32.5% bf16 MFU | 125794 tok/s step 4546/19560 | loss 3.570322 (-0.55z)| norm 0.2606 (-0.97z)| lr 5.41e-04 | 4162.66 ms | 32.4% bf16 MFU | 125801 tok/s step 4547/19560 | loss 3.572204 (-0.50z)| norm 0.2835 (-0.13z)| lr 5.41e-04 | 4156.54 ms | 
32.5% bf16 MFU | 125818 tok/s step 4548/19560 | loss 3.636325 (+0.98z)| norm 0.2631 (-0.87z)| lr 5.40e-04 | 4155.33 ms | 32.5% bf16 MFU | 125836 tok/s step 4549/19560 | loss 3.571655 (-0.52z)| norm 0.2793 (-0.29z)| lr 5.40e-04 | 4597.04 ms | 29.4% bf16 MFU | 125247 tok/s step 4550/19560 | loss 3.595105 (+0.03z)| norm 0.2587 (-1.03z)| lr 5.40e-04 | 4170.47 ms | 32.4% bf16 MFU | 125270 tok/s step 4551/19560 | loss 3.591741 (-0.05z)| norm 0.2802 (-0.26z)| lr 5.40e-04 | 4163.05 ms | 32.4% bf16 MFU | 125303 tok/s step 4552/19560 | loss 3.534978 (-1.34z)| norm 0.2651 (-0.80z)| lr 5.40e-04 | 4156.72 ms | 32.5% bf16 MFU | 125345 tok/s step 4553/19560 | loss 3.592347 (-0.01z)| norm 0.2685 (-0.66z)| lr 5.40e-04 | 4158.99 ms | 32.5% bf16 MFU | 125381 tok/s step 4554/19560 | loss 3.606795 (+0.32z)| norm 0.2705 (-0.58z)| lr 5.40e-04 | 4175.88 ms | 32.3% bf16 MFU | 125389 tok/s step 4555/19560 | loss 3.584433 (-0.20z)| norm 0.2796 (-0.24z)| lr 5.40e-04 | 4168.88 ms | 32.4% bf16 MFU | 125408 tok/s step 4556/19560 | loss 3.577439 (-0.37z)| norm 0.2697 (-0.60z)| lr 5.40e-04 | 4167.22 ms | 32.4% bf16 MFU | 125428 tok/s step 4557/19560 | loss 3.539651 (-1.23z)| norm 0.2727 (-0.50z)| lr 5.40e-04 | 4189.92 ms | 32.2% bf16 MFU | 125413 tok/s step 4558/19560 | loss 3.585302 (-0.17z)| norm 0.2676 (-0.68z)| lr 5.40e-04 | 4160.56 ms | 32.5% bf16 MFU | 125443 tok/s step 4559/19560 | loss 3.595588 (+0.06z)| norm 0.2996 (+0.49z)| lr 5.40e-04 | 4164.56 ms | 32.4% bf16 MFU | 125466 tok/s step 4560/19560 | loss 3.530087 (-1.44z)| norm 0.3039 (+0.65z)| lr 5.40e-04 | 4164.17 ms | 32.4% bf16 MFU | 125488 tok/s step 4561/19560 | loss 3.569360 (-0.51z)| norm 0.2972 (+0.40z)| lr 5.40e-04 | 4164.16 ms | 32.4% bf16 MFU | 125508 tok/s step 4562/19560 | loss 3.535046 (-1.30z)| norm 0.3338 (+1.76z)| lr 5.40e-04 | 4165.67 ms | 32.4% bf16 MFU | 125526 tok/s step 4563/19560 | loss 3.560562 (-0.70z)| norm 0.2957 (+0.35z)| lr 5.40e-04 | 4160.82 ms | 32.4% bf16 MFU | 125550 tok/s step 4564/19560 | loss 3.690586 
(+2.26z)| norm 0.2709 (-0.57z)| lr 5.40e-04 | 4192.73 ms | 32.2% bf16 MFU | 125525 tok/s step 4565/19560 | loss 3.568612 (-0.53z)| norm 0.2823 (-0.14z)| lr 5.40e-04 | 4186.55 ms | 32.3% bf16 MFU | 125510 tok/s step 4566/19560 | loss 3.586887 (-0.11z)| norm 0.2685 (-0.65z)| lr 5.40e-04 | 4169.87 ms | 32.4% bf16 MFU | 125521 tok/s step 4567/19560 | loss 3.585286 (-0.14z)| norm 0.2783 (-0.27z)| lr 5.40e-04 | 4164.49 ms | 32.4% bf16 MFU | 125540 tok/s step 4568/19560 | loss 3.686754 (+2.16z)| norm 0.2816 (-0.14z)| lr 5.40e-04 | 4157.48 ms | 32.5% bf16 MFU | 125568 tok/s step 4569/19560 | loss 3.550104 (-0.94z)| norm 0.2804 (-0.18z)| lr 5.40e-04 | 4164.65 ms | 32.4% bf16 MFU | 125584 tok/s step 4570/19560 | loss 3.607934 (+0.38z)| norm 0.2810 (-0.16z)| lr 5.40e-04 | 4152.70 ms | 32.5% bf16 MFU | 125618 tok/s step 4571/19560 | loss 3.579885 (-0.25z)| norm 0.3031 (+0.68z)| lr 5.40e-04 | 4179.68 ms | 32.3% bf16 MFU | 125609 tok/s step 4572/19560 | loss 3.610984 (+0.45z)| norm 0.2750 (-0.38z)| lr 5.40e-04 | 4178.59 ms | 32.3% bf16 MFU | 125602 tok/s step 4573/19560 | loss 3.570131 (-0.47z)| norm 0.3232 (+1.41z)| lr 5.40e-04 | 4169.13 ms | 32.4% bf16 MFU | 125609 tok/s step 4574/19560 | loss 3.618652 (+0.65z)| norm 0.2767 (-0.33z)| lr 5.40e-04 | 4162.54 ms | 32.4% bf16 MFU | 125627 tok/s step 4575/19560 | loss 3.565587 (-0.58z)| norm 0.2952 (+0.36z)| lr 5.40e-04 | 4182.31 ms | 32.3% bf16 MFU | 125613 tok/s step 4576/19560 | loss 3.606643 (+0.37z)| norm 0.2913 (+0.20z)| lr 5.40e-04 | 4155.24 ms | 32.5% bf16 MFU | 125641 tok/s step 4577/19560 | loss 3.643814 (+1.21z)| norm 0.2846 (-0.05z)| lr 5.40e-04 | 4177.07 ms | 32.3% bf16 MFU | 125635 tok/s step 4578/19560 | loss 3.540786 (-1.17z)| norm 0.2590 (-1.00z)| lr 5.40e-04 | 4169.42 ms | 32.4% bf16 MFU | 125641 tok/s step 4579/19560 | loss 3.564417 (-0.61z)| norm 0.2774 (-0.30z)| lr 5.40e-04 | 4190.32 ms | 32.2% bf16 MFU | 125615 tok/s step 4580/19560 | loss 3.692630 (+2.29z)| norm 0.2606 (-0.93z)| lr 5.40e-04 | 4147.98 ms | 
32.6% bf16 MFU | 125654 tok/s step 4581/19560 | loss 3.557954 (-0.76z)| norm 0.2937 (+0.32z)| lr 5.39e-04 | 4172.92 ms | 32.4% bf16 MFU | 125653 tok/s step 4582/19560 | loss 3.631479 (+0.89z)| norm 0.2862 (+0.04z)| lr 5.39e-04 | 4167.54 ms | 32.4% bf16 MFU | 125660 tok/s step 4583/19560 | loss 3.595087 (+0.07z)| norm 0.2431 (-1.56z)| lr 5.39e-04 | 4156.67 ms | 32.5% bf16 MFU | 125684 tok/s step 4584/19560 | loss 3.541905 (-1.12z)| norm 0.2367 (-1.77z)| lr 5.39e-04 | 4179.18 ms | 32.3% bf16 MFU | 125672 tok/s step 4585/19560 | loss 3.554572 (-0.82z)| norm 0.2578 (-0.98z)| lr 5.39e-04 | 4196.47 ms | 32.2% bf16 MFU | 125636 tok/s step 4586/19560 | loss 3.537417 (-1.19z)| norm 0.2690 (-0.56z)| lr 5.39e-04 | 4169.18 ms | 32.4% bf16 MFU | 125641 tok/s step 4587/19560 | loss 3.594083 (+0.09z)| norm 0.2492 (-1.27z)| lr 5.39e-04 | 4156.70 ms | 32.5% bf16 MFU | 125666 tok/s step 4588/19560 | loss 3.595556 (+0.12z)| norm 0.2695 (-0.53z)| lr 5.39e-04 | 4178.92 ms | 32.3% bf16 MFU | 125656 tok/s step 4589/19560 | loss 3.584564 (-0.12z)| norm 0.3092 (+1.06z)| lr 5.39e-04 | 4153.40 ms | 32.5% bf16 MFU | 125684 tok/s step 4590/19560 | loss 3.534116 (-1.24z)| norm 0.3052 (+0.93z)| lr 5.39e-04 | 4171.54 ms | 32.4% bf16 MFU | 125684 tok/s step 4591/19560 | loss 3.565886 (-0.52z)| norm 0.2646 (-0.72z)| lr 5.39e-04 | 4164.12 ms | 32.4% bf16 MFU | 125695 tok/s step 4592/19560 | loss 3.610234 (+0.48z)| norm 0.2827 (+0.03z)| lr 5.39e-04 | 4170.04 ms | 32.4% bf16 MFU | 125697 tok/s step 4593/19560 | loss 3.631760 (+0.96z)| norm 0.2929 (+0.43z)| lr 5.39e-04 | 4181.55 ms | 32.3% bf16 MFU | 125681 tok/s step 4594/19560 | loss 3.623428 (+0.76z)| norm 0.2439 (-1.54z)| lr 5.39e-04 | 4159.36 ms | 32.5% bf16 MFU | 125700 tok/s step 4595/19560 | loss 3.605253 (+0.36z)| norm 0.2703 (-0.46z)| lr 5.39e-04 | 4158.70 ms | 32.5% bf16 MFU | 125718 tok/s step 4596/19560 | loss 3.536434 (-1.17z)| norm 0.2760 (-0.23z)| lr 5.39e-04 | 4163.69 ms | 32.4% bf16 MFU | 125728 tok/s step 4597/19560 | loss 3.552431 
(-0.80z)| norm 0.2657 (-0.65z)| lr 5.39e-04 | 4162.21 ms | 32.4% bf16 MFU | 125740 tok/s step 4598/19560 | loss 3.568402 (-0.44z)| norm 0.2504 (-1.29z)| lr 5.39e-04 | 4170.50 ms | 32.4% bf16 MFU | 125739 tok/s step 4599/19560 | loss 3.551119 (-0.82z)| norm 0.2495 (-1.31z)| lr 5.39e-04 | 4188.77 ms | 32.2% bf16 MFU | 125710 tok/s step 4600/19560 | loss 3.547057 (-0.90z)| norm 0.2746 (-0.23z)| lr 5.39e-04 | 4159.05 ms | 32.5% bf16 MFU | 125727 tok/s step 4601/19560 | loss 3.701565 (+2.53z)| norm 0.3082 (+1.20z)| lr 5.39e-04 | 4159.21 ms | 32.5% bf16 MFU | 125744 tok/s step 4602/19560 | loss 3.568894 (-0.40z)| norm 0.2803 (+0.07z)| lr 5.39e-04 | 4198.10 ms | 32.2% bf16 MFU | 125701 tok/s step 4603/19560 | loss 3.675162 (+1.94z)| norm 0.3303 (+2.53z)| lr 5.39e-04 | 4174.80 ms | 32.3% bf16 MFU | 125695 tok/s step 4604/19560 | loss 3.606345 (+0.41z)| norm 0.2944 (+0.76z)| lr 5.39e-04 | 4165.22 ms | 32.4% bf16 MFU | 125704 tok/s step 4605/19560 | loss 3.516013 (-1.58z)| norm 0.2640 (-0.73z)| lr 5.39e-04 | 4157.96 ms | 32.5% bf16 MFU | 125723 tok/s step 4606/19560 | loss 3.575498 (-0.26z)| norm 0.2963 (+0.88z)| lr 5.39e-04 | 4168.44 ms | 32.4% bf16 MFU | 125726 tok/s step 4607/19560 | loss 3.571126 (-0.35z)| norm 0.3035 (+1.26z)| lr 5.39e-04 | 4184.11 ms | 32.3% bf16 MFU | 125705 tok/s step 4608/19560 | loss 3.528835 (-1.27z)| norm 0.3098 (+1.55z)| lr 5.39e-04 | 4172.18 ms | 32.4% bf16 MFU | 125703 tok/s step 4609/19560 | loss 3.611438 (+0.54z)| norm 0.2767 (-0.10z)| lr 5.39e-04 | 4157.22 ms | 32.5% bf16 MFU | 125723 tok/s step 4610/19560 | loss 3.570458 (-0.36z)| norm 0.2714 (-0.35z)| lr 5.39e-04 | 4151.18 ms | 32.5% bf16 MFU | 125752 tok/s step 4611/19560 | loss 3.556541 (-0.68z)| norm 0.3125 (+1.68z)| lr 5.39e-04 | 4155.18 ms | 32.5% bf16 MFU | 125773 tok/s step 4612/19560 | loss 3.539519 (-1.05z)| norm 0.3037 (+1.23z)| lr 5.39e-04 | 4155.54 ms | 32.5% bf16 MFU | 125793 tok/s step 4613/19560 | loss 3.585076 (-0.04z)| norm 0.2812 (+0.10z)| lr 5.39e-04 | 4149.27 ms | 
32.5% bf16 MFU | 125821 tok/s step 4614/19560 | loss 3.626159 (+0.88z)| norm 0.2962 (+0.84z)| lr 5.38e-04 | 4149.77 ms | 32.5% bf16 MFU | 125847 tok/s step 4615/19560 | loss 3.596098 (+0.20z)| norm 0.2719 (-0.40z)| lr 5.38e-04 | 4170.55 ms | 32.4% bf16 MFU | 125840 tok/s step 4616/19560 | loss 3.571478 (-0.34z)| norm 0.2911 (+0.57z)| lr 5.38e-04 | 4157.53 ms | 32.5% bf16 MFU | 125854 tok/s step 4617/19560 | loss 3.507486 (-1.74z)| norm 0.3042 (+1.22z)| lr 5.38e-04 | 4167.22 ms | 32.4% bf16 MFU | 125852 tok/s step 4618/19560 | loss 3.622710 (+0.82z)| norm 0.2869 (+0.34z)| lr 5.38e-04 | 4194.76 ms | 32.2% bf16 MFU | 125808 tok/s step 4619/19560 | loss 3.611588 (+0.56z)| norm 0.2787 (-0.09z)| lr 5.38e-04 | 4162.40 ms | 32.4% bf16 MFU | 125816 tok/s step 4620/19560 | loss 3.575421 (-0.25z)| norm 0.2863 (+0.29z)| lr 5.38e-04 | 4183.32 ms | 32.3% bf16 MFU | 125791 tok/s step 4621/19560 | loss 3.559285 (-0.59z)| norm 0.2866 (+0.29z)| lr 5.38e-04 | 4147.05 ms | 32.6% bf16 MFU | 125823 tok/s step 4622/19560 | loss 3.570503 (-0.35z)| norm 0.2656 (-0.81z)| lr 5.38e-04 | 4168.76 ms | 32.4% bf16 MFU | 125820 tok/s step 4623/19560 | loss 3.602949 (+0.38z)| norm 0.2580 (-1.20z)| lr 5.38e-04 | 4187.80 ms | 32.2% bf16 MFU | 125789 tok/s step 4624/19560 | loss 3.539256 (-1.04z)| norm 0.2544 (-1.38z)| lr 5.38e-04 | 4186.25 ms | 32.3% bf16 MFU | 125762 tok/s step 4625/19560 | loss 3.603197 (+0.38z)| norm 0.2805 (-0.03z)| lr 5.38e-04 | 4164.28 ms | 32.4% bf16 MFU | 125768 tok/s step 4626/19560 | loss 3.506838 (-1.74z)| norm 0.2471 (-1.74z)| lr 5.38e-04 | 4154.25 ms | 32.5% bf16 MFU | 125790 tok/s step 4627/19560 | loss 3.615314 (+0.66z)| norm 0.2707 (-0.52z)| lr 5.38e-04 | 4155.48 ms | 32.5% bf16 MFU | 125809 tok/s step 4628/19560 | loss 3.600764 (+0.35z)| norm 0.2756 (-0.25z)| lr 5.38e-04 | 4147.92 ms | 32.6% bf16 MFU | 125839 tok/s step 4629/19560 | loss 3.579559 (-0.12z)| norm 0.3015 (+1.09z)| lr 5.38e-04 | 4158.13 ms | 32.5% bf16 MFU | 125851 tok/s step 4630/19560 | loss 3.535204 
(-1.09z)| norm 0.2764 (-0.21z)| lr 5.38e-04 | 4146.93 ms | 32.6% bf16 MFU | 125880 tok/s step 4631/19560 | loss 3.643160 (+1.31z)| norm 0.2780 (-0.13z)| lr 5.38e-04 | 4173.40 ms | 32.4% bf16 MFU | 125867 tok/s step 4632/19560 | loss 3.541934 (-0.96z)| norm 0.2859 (+0.27z)| lr 5.38e-04 | 4275.22 ms | 31.6% bf16 MFU | 125706 tok/s step 4633/19560 | loss 3.585497 (+0.02z)| norm 0.2843 (+0.18z)| lr 5.38e-04 | 4158.99 ms | 32.5% bf16 MFU | 125723 tok/s step 4634/19560 | loss 3.556587 (-0.63z)| norm 0.2738 (-0.37z)| lr 5.38e-04 | 4159.82 ms | 32.5% bf16 MFU | 125739 tok/s step 4635/19560 | loss 3.591692 (+0.20z)| norm 0.3035 (+1.18z)| lr 5.38e-04 | 4148.64 ms | 32.5% bf16 MFU | 125771 tok/s step 4636/19560 | loss 3.609295 (+0.61z)| norm 0.2727 (-0.42z)| lr 5.38e-04 | 4153.37 ms | 32.5% bf16 MFU | 125794 tok/s step 4637/19560 | loss 3.585778 (+0.06z)| norm 0.2942 (+0.71z)| lr 5.38e-04 | 4213.52 ms | 32.0% bf16 MFU | 125726 tok/s step 4638/19560 | loss 3.564588 (-0.44z)| norm 0.3074 (+1.44z)| lr 5.38e-04 | 4150.34 ms | 32.5% bf16 MFU | 125756 tok/s step 4639/19560 | loss 3.604736 (+0.50z)| norm 0.2795 (-0.05z)| lr 5.38e-04 | 4156.72 ms | 32.5% bf16 MFU | 125774 tok/s step 4640/19560 | loss 3.581173 (-0.06z)| norm 0.2862 (+0.30z)| lr 5.38e-04 | 4212.97 ms | 32.0% bf16 MFU | 125708 tok/s step 4641/19560 | loss 3.562860 (-0.49z)| norm 0.2704 (-0.54z)| lr 5.38e-04 | 4255.45 ms | 31.7% bf16 MFU | 125583 tok/s step 4642/19560 | loss 3.538033 (-1.06z)| norm 0.2784 (-0.11z)| lr 5.38e-04 | 4153.26 ms | 32.5% bf16 MFU | 125615 tok/s step 4643/19560 | loss 3.518757 (-1.49z)| norm 0.3037 (+1.24z)| lr 5.38e-04 | 4161.64 ms | 32.4% bf16 MFU | 125634 tok/s step 4644/19560 | loss 3.676884 (+2.15z)| norm 0.3061 (+1.35z)| lr 5.38e-04 | 4172.10 ms | 32.4% bf16 MFU | 125635 tok/s step 4645/19560 | loss 3.599162 (+0.36z)| norm 0.2494 (-1.64z)| lr 5.38e-04 | 4954.53 ms | 27.3% bf16 MFU | 124644 tok/s step 4646/19560 | loss 3.580730 (-0.06z)| norm 0.2823 (+0.11z)| lr 5.38e-04 | 4325.32 ms | 
31.2% bf16 MFU | 124473 tok/s step 4647/19560 | loss 3.558421 (-0.57z)| norm 0.2654 (-0.78z)| lr 5.37e-04 | 4203.45 ms | 32.1% bf16 MFU | 124486 tok/s step 4648/19560 | loss 3.590157 (+0.15z)| norm 0.2763 (-0.19z)| lr 5.37e-04 | 4201.48 ms | 32.1% bf16 MFU | 124501 tok/s step 4649/19560 | loss 3.584802 (+0.02z)| norm 0.3006 (+1.09z)| lr 5.37e-04 | 4161.38 ms | 32.4% bf16 MFU | 124575 tok/s step 4650/19560 | loss 3.564314 (-0.46z)| norm 0.2734 (-0.36z)| lr 5.37e-04 | 4148.43 ms | 32.5% bf16 MFU | 124665 tok/s step 4651/19560 | loss 3.532313 (-1.18z)| norm 0.3010 (+1.10z)| lr 5.37e-04 | 4165.47 ms | 32.4% bf16 MFU | 124725 tok/s step 4652/19560 | loss 3.622992 (+0.89z)| norm 0.2726 (-0.42z)| lr 5.37e-04 | 4171.73 ms | 32.4% bf16 MFU | 124773 tok/s step 4653/19560 | loss 3.548924 (-0.80z)| norm 0.2750 (-0.29z)| lr 5.37e-04 | 4151.21 ms | 32.5% bf16 MFU | 124849 tok/s step 4654/19560 | loss 3.544806 (-0.88z)| norm 0.2738 (-0.35z)| lr 5.37e-04 | 4159.11 ms | 32.5% bf16 MFU | 124910 tok/s step 4655/19560 | loss 3.534829 (-1.11z)| norm 0.2672 (-0.70z)| lr 5.37e-04 | 4151.02 ms | 32.5% bf16 MFU | 124979 tok/s step 4656/19560 | loss 3.608509 (+0.58z)| norm 0.2947 (+0.75z)| lr 5.37e-04 | 4246.40 ms | 31.8% bf16 MFU | 124904 tok/s step 4657/19560 | loss 3.547783 (-0.82z)| norm 0.2705 (-0.55z)| lr 5.37e-04 | 4159.84 ms | 32.5% bf16 MFU | 124960 tok/s step 4658/19560 | loss 3.594662 (+0.29z)| norm 0.3105 (+1.56z)| lr 5.37e-04 | 4171.99 ms | 32.4% bf16 MFU | 124996 tok/s step 4659/19560 | loss 3.588868 (+0.14z)| norm 0.2928 (+0.63z)| lr 5.37e-04 | 4178.37 ms | 32.3% bf16 MFU | 125020 tok/s step 4660/19560 | loss 3.549423 (-0.80z)| norm 0.3046 (+1.25z)| lr 5.37e-04 | 4152.67 ms | 32.5% bf16 MFU | 125081 tok/s step 4661/19560 | loss 3.570518 (-0.29z)| norm 0.2933 (+0.65z)| lr 5.37e-04 | 4156.54 ms | 32.5% bf16 MFU | 125134 tok/s step 4662/19560 | loss 3.585441 (+0.05z)| norm 0.2910 (+0.52z)| lr 5.37e-04 | 4173.08 ms | 32.4% bf16 MFU | 125159 tok/s step 4663/19560 | loss 3.544612 
(-0.96z)| norm 0.3288 (+2.45z)| lr 5.37e-04 | 4173.49 ms | 32.4% bf16 MFU | 125182 tok/s step 4664/19560 | loss 3.555199 (-0.67z)| norm 0.3195 (+1.98z)| lr 5.37e-04 | 4500.46 ms | 30.0% bf16 MFU | 124748 tok/s step 4665/19560 | loss 3.603190 (+0.58z)| norm 0.3070 (+1.34z)| lr 5.37e-04 | 4186.46 ms | 32.3% bf16 MFU | 124772 tok/s step 4666/19560 | loss 3.506984 (-1.90z)| norm 0.2647 (-0.86z)| lr 5.37e-04 | 4233.64 ms | 31.9% bf16 MFU | 124726 tok/s step 4667/19560 | loss 3.563263 (-0.44z)| norm 0.2832 (+0.10z)| lr 5.37e-04 | 4166.29 ms | 32.4% bf16 MFU | 124781 tok/s step 4668/19560 | loss 3.598598 (+0.47z)| norm 0.2804 (-0.04z)| lr 5.37e-04 | 4164.18 ms | 32.4% bf16 MFU | 124838 tok/s step 4669/19560 | loss 3.548194 (-0.83z)| norm 0.2795 (-0.10z)| lr 5.37e-04 | 4156.71 ms | 32.5% bf16 MFU | 124902 tok/s step 4670/19560 | loss 3.544466 (-0.91z)| norm 0.2654 (-0.84z)| lr 5.37e-04 | 4150.40 ms | 32.5% bf16 MFU | 124973 tok/s step 4671/19560 | loss 3.553801 (-0.68z)| norm 0.2665 (-0.78z)| lr 5.37e-04 | 4148.62 ms | 32.5% bf16 MFU | 125043 tok/s step 4672/19560 | loss 3.546022 (-0.87z)| norm 0.2788 (-0.15z)| lr 5.37e-04 | 4224.28 ms | 32.0% bf16 MFU | 124997 tok/s step 4673/19560 | loss 3.545218 (-0.89z)| norm 0.2599 (-1.15z)| lr 5.37e-04 | 4160.72 ms | 32.5% bf16 MFU | 125047 tok/s step 4674/19560 | loss 3.585877 (+0.17z)| norm 0.2521 (-1.56z)| lr 5.37e-04 | 4169.06 ms | 32.4% bf16 MFU | 125083 tok/s step 4675/19560 | loss 3.479518 (-2.52z)| norm 0.2925 (+0.58z)| lr 5.37e-04 | 4156.32 ms | 32.5% bf16 MFU | 125136 tok/s step 4676/19560 | loss 3.621225 (+1.09z)| norm 0.2604 (-1.12z)| lr 5.37e-04 | 4302.86 ms | 31.4% bf16 MFU | 124971 tok/s step 4677/19560 | loss 3.592098 (+0.34z)| norm 0.2824 (+0.04z)| lr 5.37e-04 | 4163.55 ms | 32.4% bf16 MFU | 125019 tok/s step 4678/19560 | loss 3.614383 (+0.90z)| norm 0.2747 (-0.37z)| lr 5.37e-04 | 4168.61 ms | 32.4% bf16 MFU | 125057 tok/s step 4679/19560 | loss 3.594352 (+0.39z)| norm 0.2581 (-1.24z)| lr 5.37e-04 | 4151.86 ms | 
32.5% bf16 MFU | 125118 tok/s step 4680/19560 | loss 3.601772 (+0.57z)| norm 0.2640 (-0.93z)| lr 5.36e-04 | 4233.49 ms | 31.9% bf16 MFU | 125054 tok/s step 4681/19560 | loss 3.561490 (-0.45z)| norm 0.2917 (+0.52z)| lr 5.36e-04 | 4166.14 ms | 32.4% bf16 MFU | 125093 tok/s step 4682/19560 | loss 3.608004 (+0.73z)| norm 0.2692 (-0.66z)| lr 5.36e-04 | 4149.32 ms | 32.5% bf16 MFU | 125157 tok/s step 4683/19560 | loss 3.588370 (+0.23z)| norm 0.2687 (-0.68z)| lr 5.36e-04 | 4171.72 ms | 32.4% bf16 MFU | 125183 tok/s step 4684/19560 | loss 3.557861 (-0.54z)| norm 0.2922 (+0.55z)| lr 5.36e-04 | 4159.32 ms | 32.5% bf16 MFU | 125226 tok/s step 4685/19560 | loss 3.591213 (+0.30z)| norm 0.2919 (+0.52z)| lr 5.36e-04 | 4266.49 ms | 31.6% bf16 MFU | 125109 tok/s step 4686/19560 | loss 3.540945 (-0.97z)| norm 0.3114 (+1.52z)| lr 5.36e-04 | 4160.53 ms | 32.5% bf16 MFU | 125154 tok/s step 4687/19560 | loss 3.642904 (+1.60z)| norm 0.3168 (+1.78z)| lr 5.36e-04 | 4161.55 ms | 32.4% bf16 MFU | 125196 tok/s step 4688/19560 | loss 3.545424 (-0.86z)| norm 0.3024 (+1.04z)| lr 5.36e-04 | 4165.13 ms | 32.4% bf16 MFU | 125230 tok/s step 4689/19560 | loss 3.603114 (+0.59z)| norm 0.2919 (+0.50z)| lr 5.36e-04 | 4164.53 ms | 32.4% bf16 MFU | 125263 tok/s step 4690/19560 | loss 3.574350 (-0.15z)| norm 0.3128 (+1.62z)| lr 5.36e-04 | 4168.64 ms | 32.4% bf16 MFU | 125288 tok/s step 4691/19560 | loss 3.643290 (+1.57z)| norm 0.3140 (+1.66z)| lr 5.36e-04 | 4149.62 ms | 32.5% bf16 MFU | 125341 tok/s step 4692/19560 | loss 3.550528 (-0.76z)| norm 0.2848 (+0.12z)| lr 5.36e-04 | 4178.41 ms | 32.3% bf16 MFU | 125348 tok/s step 4693/19560 | loss 3.579595 (-0.01z)| norm 0.3043 (+1.13z)| lr 5.36e-04 | 4169.95 ms | 32.4% bf16 MFU | 125367 tok/s step 4694/19560 | loss 3.601653 (+0.56z)| norm 0.2679 (-0.77z)| lr 5.36e-04 | 4154.85 ms | 32.5% bf16 MFU | 125408 tok/s step 4695/19560 | loss 3.549654 (-0.77z)| norm 0.2869 (+0.22z)| lr 5.36e-04 | 4165.60 ms | 32.4% bf16 MFU | 125431 tok/s step 4696/19560 | loss 3.606756 
(+0.74z)| norm 0.3018 (+0.99z)| lr 5.36e-04 | 4164.39 ms | 32.4% bf16 MFU | 125454 tok/s step 4697/19560 | loss 3.555991 (-0.61z)| norm 0.2559 (-1.38z)| lr 5.36e-04 | 4168.31 ms | 32.4% bf16 MFU | 125470 tok/s step 4698/19560 | loss 3.610934 (+0.85z)| norm 0.2842 (+0.08z)| lr 5.36e-04 | 4164.72 ms | 32.4% bf16 MFU | 125491 tok/s step 4699/19560 | loss 3.591046 (+0.32z)| norm 0.2844 (+0.10z)| lr 5.36e-04 | 4158.19 ms | 32.5% bf16 MFU | 125521 tok/s step 4700/19560 | loss 3.552313 (-0.70z)| norm 0.2562 (-1.35z)| lr 5.36e-04 | 4164.07 ms | 32.4% bf16 MFU | 125540 tok/s step 4701/19560 | loss 3.644619 (+1.72z)| norm 0.2501 (-1.65z)| lr 5.36e-04 | 4167.09 ms | 32.4% bf16 MFU | 125554 tok/s step 4702/19560 | loss 3.622075 (+1.12z)| norm 0.2668 (-0.78z)| lr 5.36e-04 | 4169.40 ms | 32.4% bf16 MFU | 125564 tok/s step 4703/19560 | loss 3.599385 (+0.52z)| norm 0.2648 (-0.87z)| lr 5.36e-04 | 4160.67 ms | 32.5% bf16 MFU | 125586 tok/s step 4704/19560 | loss 3.663071 (+2.14z)| norm 0.2433 (-1.93z)| lr 5.36e-04 | 4166.53 ms | 32.4% bf16 MFU | 125598 tok/s step 4705/19560 | loss 3.587376 (+0.20z)| norm 0.2618 (-0.97z)| lr 5.36e-04 | 4162.54 ms | 32.4% bf16 MFU | 125616 tok/s step 4706/19560 | loss 3.594988 (+0.39z)| norm 0.2651 (-0.81z)| lr 5.36e-04 | 4158.85 ms | 32.5% bf16 MFU | 125639 tok/s step 4707/19560 | loss 3.565670 (-0.38z)| norm 0.2738 (-0.37z)| lr 5.36e-04 | 4159.21 ms | 32.5% bf16 MFU | 125659 tok/s step 4708/19560 | loss 3.531368 (-1.28z)| norm 0.2906 (+0.48z)| lr 5.36e-04 | 4161.65 ms | 32.4% bf16 MFU | 125675 tok/s step 4709/19560 | loss 3.633064 (+1.44z)| norm 0.2905 (+0.47z)| lr 5.36e-04 | 4161.61 ms | 32.4% bf16 MFU | 125691 tok/s step 4710/19560 | loss 3.634930 (+1.49z)| norm 0.2961 (+0.76z)| lr 5.36e-04 | 4157.55 ms | 32.5% bf16 MFU | 125711 tok/s step 4711/19560 | loss 3.643098 (+1.68z)| norm 0.2864 (+0.25z)| lr 5.36e-04 | 4148.79 ms | 32.5% bf16 MFU | 125744 tok/s step 4712/19560 | loss 3.599965 (+0.52z)| norm 0.3000 (+0.95z)| lr 5.35e-04 | 4173.87 ms | 
32.3% bf16 MFU | 125738 tok/s step 4713/19560 | loss 3.577168 (-0.08z)| norm 0.2753 (-0.37z)| lr 5.35e-04 | 4158.70 ms | 32.5% bf16 MFU | 125754 tok/s step 4714/19560 | loss 3.598942 (+0.48z)| norm 0.2908 (+0.45z)| lr 5.35e-04 | 4174.08 ms | 32.3% bf16 MFU | 125747 tok/s step 4715/19560 | loss 3.628316 (+1.25z)| norm 0.2746 (-0.43z)| lr 5.35e-04 | 4155.63 ms | 32.5% bf16 MFU | 125768 tok/s step 4716/19560 | loss 3.581950 (+0.03z)| norm 0.2716 (-0.59z)| lr 5.35e-04 | 4156.00 ms | 32.5% bf16 MFU | 125787 tok/s step 4717/19560 | loss 3.674284 (+2.40z)| norm 0.2979 (+0.83z)| lr 5.35e-04 | 4157.26 ms | 32.5% bf16 MFU | 125803 tok/s step 4718/19560 | loss 3.521116 (-1.56z)| norm 0.3076 (+1.35z)| lr 5.35e-04 | 4162.69 ms | 32.4% bf16 MFU | 125811 tok/s step 4719/19560 | loss 3.591804 (+0.26z)| norm 0.3035 (+1.12z)| lr 5.35e-04 | 4161.77 ms | 32.4% bf16 MFU | 125819 tok/s step 4720/19560 | loss 3.666192 (+2.14z)| norm 0.3064 (+1.25z)| lr 5.35e-04 | 4158.10 ms | 32.5% bf16 MFU | 125832 tok/s step 4721/19560 | loss 3.596016 (+0.36z)| norm 0.2947 (+0.63z)| lr 5.35e-04 | 4157.84 ms | 32.5% bf16 MFU | 125846 tok/s step 4722/19560 | loss 3.512720 (-1.74z)| norm 0.2686 (-0.80z)| lr 5.35e-04 | 4156.49 ms | 32.5% bf16 MFU | 125860 tok/s step 4723/19560 | loss 3.525605 (-1.39z)| norm 0.2841 (+0.04z)| lr 5.35e-04 | 4165.67 ms | 32.4% bf16 MFU | 125860 tok/s step 4724/19560 | loss 3.581535 (+0.02z)| norm 0.3152 (+1.70z)| lr 5.35e-04 | 4152.70 ms | 32.5% bf16 MFU | 125880 tok/s step 4725/19560 | loss 3.585147 (+0.10z)| norm 0.2567 (-1.45z)| lr 5.35e-04 | 4169.23 ms | 32.4% bf16 MFU | 125873 tok/s step 4726/19560 | loss 3.567707 (-0.34z)| norm 0.2695 (-0.77z)| lr 5.35e-04 | 4170.24 ms | 32.4% bf16 MFU | 125866 tok/s step 4727/19560 | loss 3.548664 (-0.82z)| norm 0.2772 (-0.37z)| lr 5.35e-04 | 4158.96 ms | 32.5% bf16 MFU | 125876 tok/s step 4728/19560 | loss 3.573375 (-0.20z)| norm 0.2926 (+0.47z)| lr 5.35e-04 | 4153.30 ms | 32.5% bf16 MFU | 125894 tok/s step 4729/19560 | loss 3.617188 
(+0.97z)| norm 0.3632 (+4.07z)| lr 5.35e-04 | 4154.37 ms | 32.5% bf16 MFU | 125909 tok/s step 4730/19560 | loss 3.590746 (+0.26z)| norm 0.3126 (+1.43z)| lr 5.35e-04 | 4179.86 ms | 32.3% bf16 MFU | 125885 tok/s step 4731/19560 | loss 3.533587 (-1.24z)| norm 0.2792 (-0.27z)| lr 5.35e-04 | 4171.06 ms | 32.4% bf16 MFU | 125876 tok/s step 4732/19560 | loss 3.597091 (+0.47z)| norm 0.2955 (+0.58z)| lr 5.35e-04 | 4156.27 ms | 32.5% bf16 MFU | 125889 tok/s step 4733/19560 | loss 3.553279 (-0.72z)| norm 0.3126 (+1.46z)| lr 5.35e-04 | 4155.40 ms | 32.5% bf16 MFU | 125903 tok/s step 4734/19560 | loss 3.619034 (+1.05z)| norm 0.2791 (-0.29z)| lr 5.35e-04 | 4157.40 ms | 32.5% bf16 MFU | 125913 tok/s step 4735/19560 | loss 3.611682 (+0.84z)| norm 0.2791 (-0.28z)| lr 5.35e-04 | 4159.88 ms | 32.5% bf16 MFU | 125920 tok/s step 4736/19560 | loss 3.542978 (-1.02z)| norm 0.2623 (-1.14z)| lr 5.35e-04 | 4156.78 ms | 32.5% bf16 MFU | 125930 tok/s step 4737/19560 | loss 3.580181 (-0.01z)| norm 0.2671 (-0.89z)| lr 5.35e-04 | 4162.79 ms | 32.4% bf16 MFU | 125931 tok/s step 4738/19560 | loss 3.557127 (-0.63z)| norm 0.2840 (-0.00z)| lr 5.35e-04 | 4164.25 ms | 32.4% bf16 MFU | 125929 tok/s step 4739/19560 | loss 3.538319 (-1.13z)| norm 0.2632 (-1.08z)| lr 5.35e-04 | 4170.46 ms | 32.4% bf16 MFU | 125919 tok/s step 4740/19560 | loss 3.599005 (+0.50z)| norm 0.2803 (-0.17z)| lr 5.35e-04 | 4163.13 ms | 32.4% bf16 MFU | 125919 tok/s step 4741/19560 | loss 3.668012 (+2.30z)| norm 0.2811 (-0.13z)| lr 5.35e-04 | 4173.14 ms | 32.4% bf16 MFU | 125905 tok/s step 4742/19560 | loss 3.574328 (-0.17z)| norm 0.2561 (-1.43z)| lr 5.35e-04 | 4164.66 ms | 32.4% bf16 MFU | 125904 tok/s step 4743/19560 | loss 3.670347 (+2.33z)| norm 0.2584 (-1.29z)| lr 5.35e-04 | 4150.71 ms | 32.5% bf16 MFU | 125925 tok/s step 4744/19560 | loss 3.615913 (+0.89z)| norm 0.2959 (+0.67z)| lr 5.35e-04 | 4178.67 ms | 32.3% bf16 MFU | 125902 tok/s step 4745/19560 | loss 3.637010 (+1.42z)| norm 0.2652 (-0.92z)| lr 5.34e-04 | 4164.88 ms | 
32.4% bf16 MFU | 125901 tok/s step 4746/19560 | loss 3.577779 (-0.12z)| norm 0.2763 (-0.34z)| lr 5.34e-04 | 4155.35 ms | 32.5% bf16 MFU | 125915 tok/s step 4747/19560 | loss 3.705748 (+3.11z)| norm 0.2738 (-0.47z)| lr 5.34e-04 | 4162.11 ms | 32.4% bf16 MFU | 125917 tok/s step 4748/19560 | loss 3.587166 (+0.10z)| norm 0.2922 (+0.50z)| lr 5.34e-04 | 4165.87 ms | 32.4% bf16 MFU | 125914 tok/s step 4749/19560 | loss 3.577400 (-0.15z)| norm 0.2856 (+0.15z)| lr 5.34e-04 | 4153.74 ms | 32.5% bf16 MFU | 125929 tok/s step 4750/19560 | loss 3.519975 (-1.59z)| norm 0.2503 (-1.68z)| lr 5.34e-04 | 4176.25 ms | 32.3% bf16 MFU | 125910 tok/s val loss 3.563008 evaluating HellaSwag: 0/1256 evaluating HellaSwag: 10/1256 evaluating HellaSwag: 20/1256 evaluating HellaSwag: 30/1256 evaluating HellaSwag: 40/1256 evaluating HellaSwag: 50/1256 evaluating HellaSwag: 60/1256 evaluating HellaSwag: 70/1256 evaluating HellaSwag: 80/1256 evaluating HellaSwag: 90/1256 evaluating HellaSwag: 100/1256 evaluating HellaSwag: 110/1256 evaluating HellaSwag: 120/1256 evaluating HellaSwag: 130/1256 evaluating HellaSwag: 140/1256 evaluating HellaSwag: 150/1256 evaluating HellaSwag: 160/1256 evaluating HellaSwag: 170/1256 evaluating HellaSwag: 180/1256 evaluating HellaSwag: 190/1256 evaluating HellaSwag: 200/1256 evaluating HellaSwag: 210/1256 evaluating HellaSwag: 220/1256 evaluating HellaSwag: 230/1256 evaluating HellaSwag: 240/1256 evaluating HellaSwag: 250/1256 evaluating HellaSwag: 260/1256 evaluating HellaSwag: 270/1256 evaluating HellaSwag: 280/1256 evaluating HellaSwag: 290/1256 evaluating HellaSwag: 300/1256 evaluating HellaSwag: 310/1256 evaluating HellaSwag: 320/1256 evaluating HellaSwag: 330/1256 evaluating HellaSwag: 340/1256 evaluating HellaSwag: 350/1256 evaluating HellaSwag: 360/1256 evaluating HellaSwag: 370/1256 evaluating HellaSwag: 380/1256 evaluating HellaSwag: 390/1256 evaluating HellaSwag: 400/1256 evaluating HellaSwag: 410/1256 evaluating HellaSwag: 420/1256 evaluating HellaSwag: 
430/1256 evaluating HellaSwag: 440/1256 evaluating HellaSwag: 450/1256 evaluating HellaSwag: 460/1256 evaluating HellaSwag: 470/1256 evaluating HellaSwag: 480/1256 evaluating HellaSwag: 490/1256 evaluating HellaSwag: 500/1256 evaluating HellaSwag: 510/1256 evaluating HellaSwag: 520/1256 evaluating HellaSwag: 530/1256 evaluating HellaSwag: 540/1256 evaluating HellaSwag: 550/1256 evaluating HellaSwag: 560/1256 evaluating HellaSwag: 570/1256 evaluating HellaSwag: 580/1256 evaluating HellaSwag: 590/1256 evaluating HellaSwag: 600/1256 evaluating HellaSwag: 610/1256 evaluating HellaSwag: 620/1256 evaluating HellaSwag: 630/1256 evaluating HellaSwag: 640/1256 evaluating HellaSwag: 650/1256 evaluating HellaSwag: 660/1256 evaluating HellaSwag: 670/1256 evaluating HellaSwag: 680/1256 evaluating HellaSwag: 690/1256 evaluating HellaSwag: 700/1256 evaluating HellaSwag: 710/1256 evaluating HellaSwag: 720/1256 evaluating HellaSwag: 730/1256 evaluating HellaSwag: 740/1256 evaluating HellaSwag: 750/1256 evaluating HellaSwag: 760/1256 evaluating HellaSwag: 770/1256 evaluating HellaSwag: 780/1256 evaluating HellaSwag: 790/1256 evaluating HellaSwag: 800/1256 evaluating HellaSwag: 810/1256 evaluating HellaSwag: 820/1256 evaluating HellaSwag: 830/1256 evaluating HellaSwag: 840/1256 evaluating HellaSwag: 850/1256 evaluating HellaSwag: 860/1256 evaluating HellaSwag: 870/1256 evaluating HellaSwag: 880/1256 evaluating HellaSwag: 890/1256 evaluating HellaSwag: 900/1256 evaluating HellaSwag: 910/1256 evaluating HellaSwag: 920/1256 evaluating HellaSwag: 930/1256 evaluating HellaSwag: 940/1256 evaluating HellaSwag: 950/1256 evaluating HellaSwag: 960/1256 evaluating HellaSwag: 970/1256 evaluating HellaSwag: 980/1256 evaluating HellaSwag: 990/1256 evaluating HellaSwag: 1000/1256 evaluating HellaSwag: 1010/1256 evaluating HellaSwag: 1020/1256 evaluating HellaSwag: 1030/1256 evaluating HellaSwag: 1040/1256 evaluating HellaSwag: 1050/1256 evaluating HellaSwag: 1060/1256 evaluating HellaSwag: 
1070/1256 evaluating HellaSwag: 1080/1256 evaluating HellaSwag: 1090/1256 evaluating HellaSwag: 1100/1256 evaluating HellaSwag: 1110/1256 evaluating HellaSwag: 1120/1256 evaluating HellaSwag: 1130/1256 evaluating HellaSwag: 1140/1256 evaluating HellaSwag: 1150/1256 evaluating HellaSwag: 1160/1256 evaluating HellaSwag: 1170/1256 evaluating HellaSwag: 1180/1256 evaluating HellaSwag: 1190/1256 evaluating HellaSwag: 1200/1256 evaluating HellaSwag: 1210/1256 evaluating HellaSwag: 1220/1256 evaluating HellaSwag: 1230/1256 evaluating HellaSwag: 1240/1256 evaluating HellaSwag: 1250/1256 HellaSwag: 2748/10042 = 0.273651 step 4751/19560 | loss 3.564603 (-0.46z)| norm 0.3123 (+1.51z)| lr 5.34e-04 | 4153.82 ms | 32.5% bf16 MFU | 125925 tok/s step 4752/19560 | loss 3.556399 (-0.67z)| norm 0.2989 (+0.80z)| lr 5.34e-04 | 4156.86 ms | 32.5% bf16 MFU | 125935 tok/s step 4753/19560 | loss 3.551723 (-0.78z)| norm 0.2894 (+0.31z)| lr 5.34e-04 | 4178.61 ms | 32.3% bf16 MFU | 125912 tok/s step 4754/19560 | loss 3.571234 (-0.30z)| norm 0.2675 (-0.85z)| lr 5.34e-04 | 4452.45 ms | 30.3% bf16 MFU | 125504 tok/s step 4755/19560 | loss 3.608297 (+0.65z)| norm 0.2864 (+0.14z)| lr 5.34e-04 | 4162.53 ms | 32.4% bf16 MFU | 125527 tok/s step 4756/19560 | loss 3.576643 (-0.16z)| norm 0.2744 (-0.49z)| lr 5.34e-04 | 4164.73 ms | 32.4% bf16 MFU | 125545 tok/s step 4757/19560 | loss 3.577570 (-0.13z)| norm 0.2701 (-0.71z)| lr 5.34e-04 | 4160.95 ms | 32.4% bf16 MFU | 125567 tok/s step 4758/19560 | loss 3.711825 (+3.16z)| norm 0.7182 (+10.10z)| lr 5.34e-04 | 4167.02 ms | 32.4% bf16 MFU | 125580 tok/s step 4759/19560 | loss 3.589076 (+0.13z)| norm 0.3386 (+1.19z)| lr 5.34e-04 | 4171.34 ms | 32.4% bf16 MFU | 125585 tok/s step 4760/19560 | loss 3.562040 (-0.55z)| norm 0.2905 (+0.07z)| lr 5.34e-04 | 4176.28 ms | 32.3% bf16 MFU | 125583 tok/s step 4761/19560 | loss 3.591725 (+0.20z)| norm 0.3525 (+1.49z)| lr 5.34e-04 | 4157.39 ms | 32.5% bf16 MFU | 125609 tok/s step 4762/19560 | loss 3.600321 (+0.40z)| norm 
0.3140 (+0.59z)| lr 5.34e-04 | 4151.50 ms | 32.5% bf16 MFU | 125643 tok/s step 4763/19560 | loss 3.586209 (+0.05z)| norm 0.2987 (+0.24z)| lr 5.34e-04 | 4222.53 ms | 32.0% bf16 MFU | 125569 tok/s step 4764/19560 | loss 3.601656 (+0.44z)| norm 0.2732 (-0.35z)| lr 5.34e-04 | 4166.13 ms | 32.4% bf16 MFU | 125583 tok/s step 4765/19560 | loss 3.580683 (-0.09z)| norm 0.2571 (-0.71z)| lr 5.34e-04 | 4169.53 ms | 32.4% bf16 MFU | 125591 tok/s step 4766/19560 | loss 3.639405 (+1.37z)| norm 0.2801 (-0.18z)| lr 5.34e-04 | 4185.04 ms | 32.3% bf16 MFU | 125576 tok/s step 4767/19560 | loss 3.592577 (+0.20z)| norm 0.2900 (+0.05z)| lr 5.34e-04 | 4164.21 ms | 32.4% bf16 MFU | 125592 tok/s step 4768/19560 | loss 3.638122 (+1.32z)| norm 0.2645 (-0.53z)| lr 5.34e-04 | 4160.38 ms | 32.5% bf16 MFU | 125613 tok/s step 4769/19560 | loss 3.599180 (+0.34z)| norm 0.3092 (+0.49z)| lr 5.34e-04 | 4166.47 ms | 32.4% bf16 MFU | 125624 tok/s step 4770/19560 | loss 3.567196 (-0.46z)| norm 0.3155 (+0.63z)| lr 5.34e-04 | 4257.67 ms | 31.7% bf16 MFU | 125500 tok/s step 4771/19560 | loss 3.624584 (+0.96z)| norm 0.2901 (+0.04z)| lr 5.34e-04 | 4165.94 ms | 32.4% bf16 MFU | 125518 tok/s step 4772/19560 | loss 3.604413 (+0.48z)| norm 0.2839 (-0.09z)| lr 5.34e-04 | 4183.11 ms | 32.3% bf16 MFU | 125509 tok/s step 4773/19560 | loss 3.651689 (+1.66z)| norm 0.3127 (+0.56z)| lr 5.34e-04 | 4166.56 ms | 32.4% bf16 MFU | 125525 tok/s step 4774/19560 | loss 3.546572 (-0.99z)| norm 0.2625 (-0.60z)| lr 5.34e-04 | 4160.06 ms | 32.5% bf16 MFU | 125550 tok/s step 4775/19560 | loss 3.644098 (+1.44z)| norm 0.2666 (-0.50z)| lr 5.34e-04 | 4150.31 ms | 32.5% bf16 MFU | 125589 tok/s step 4776/19560 | loss 3.572322 (-0.35z)| norm 0.2547 (-0.77z)| lr 5.33e-04 | 4156.92 ms | 32.5% bf16 MFU | 125615 tok/s step 4777/19560 | loss 3.558945 (-0.68z)| norm 0.2816 (-0.15z)| lr 5.33e-04 | 4154.53 ms | 32.5% bf16 MFU | 125644 tok/s step 4778/19560 | loss 3.597353 (+0.27z)| norm 0.2674 (-0.47z)| lr 5.33e-04 | 4147.23 ms | 32.6% bf16 MFU | 
125683 tok/s step 4779/19560 | loss 3.583586 (-0.08z)| norm 0.2693 (-0.42z)| lr 5.33e-04 | 4151.05 ms | 32.5% bf16 MFU | 125714 tok/s step 4780/19560 | loss 3.587872 (+0.03z)| norm 0.2992 (+0.26z)| lr 5.33e-04 | 4170.86 ms | 32.4% bf16 MFU | 125714 tok/s step 4781/19560 | loss 3.570419 (-0.42z)| norm 0.2860 (-0.05z)| lr 5.33e-04 | 4166.20 ms | 32.4% bf16 MFU | 125720 tok/s step 4782/19560 | loss 3.583341 (-0.10z)| norm 0.2648 (-0.53z)| lr 5.33e-04 | 4154.19 ms | 32.5% bf16 MFU | 125744 tok/s step 4783/19560 | loss 3.575289 (-0.31z)| norm 0.2572 (-0.70z)| lr 5.33e-04 | 4153.82 ms | 32.5% bf16 MFU | 125768 tok/s step 4784/19560 | loss 3.578124 (-0.23z)| norm 0.2540 (-0.77z)| lr 5.33e-04 | 4159.14 ms | 32.5% bf16 MFU | 125783 tok/s step 4785/19560 | loss 3.540465 (-1.20z)| norm 0.2548 (-0.75z)| lr 5.33e-04 | 4151.40 ms | 32.5% bf16 MFU | 125808 tok/s step 4786/19560 | loss 3.582613 (-0.11z)| norm 0.2714 (-0.36z)| lr 5.33e-04 | 4158.46 ms | 32.5% bf16 MFU | 125821 tok/s step 4787/19560 | loss 3.587022 (+0.00z)| norm 0.2814 (-0.13z)| lr 5.33e-04 | 4160.42 ms | 32.5% bf16 MFU | 125831 tok/s step 4788/19560 | loss 3.555008 (-0.82z)| norm 0.2916 (+0.11z)| lr 5.33e-04 | 4180.22 ms | 32.3% bf16 MFU | 125811 tok/s step 4789/19560 | loss 3.569998 (-0.44z)| norm 0.3049 (+0.41z)| lr 5.33e-04 | 4186.83 ms | 32.2% bf16 MFU | 125781 tok/s step 4790/19560 | loss 3.588301 (+0.03z)| norm 0.2749 (-0.28z)| lr 5.33e-04 | 4154.16 ms | 32.5% bf16 MFU | 125803 tok/s step 4791/19560 | loss 3.521184 (-1.68z)| norm 0.2811 (-0.13z)| lr 5.33e-04 | 4161.05 ms | 32.4% bf16 MFU | 125813 tok/s step 4792/19560 | loss 3.616923 (+0.75z)| norm 0.2773 (-0.20z)| lr 5.33e-04 | 4170.92 ms | 32.4% bf16 MFU | 125807 tok/s step 4793/19560 | loss 3.575953 (-0.29z)| norm 0.2883 (+0.05z)| lr 5.33e-04 | 4169.54 ms | 32.4% bf16 MFU | 125804 tok/s step 4794/19560 | loss 3.547757 (-1.03z)| norm 0.2630 (-0.53z)| lr 5.33e-04 | 4158.87 ms | 32.5% bf16 MFU | 125817 tok/s step 4795/19560 | loss 3.553977 (-0.86z)| norm 
0.2768 (-0.21z)| lr 5.33e-04 | 4153.95 ms | 32.5% bf16 MFU | 125837 tok/s step 4796/19560 | loss 3.595345 (+0.21z)| norm 0.2598 (-0.60z)| lr 5.33e-04 | 4168.27 ms | 32.4% bf16 MFU | 125834 tok/s step 4797/19560 | loss 3.553797 (-0.87z)| norm 0.2866 (+0.02z)| lr 5.33e-04 | 4154.89 ms | 32.5% bf16 MFU | 125851 tok/s step 4798/19560 | loss 3.604955 (+0.44z)| norm 0.2849 (-0.03z)| lr 5.33e-04 | 4155.26 ms | 32.5% bf16 MFU | 125868 tok/s step 4799/19560 | loss 3.568146 (-0.52z)| norm 0.2942 (+0.18z)| lr 5.33e-04 | 4157.07 ms | 32.5% bf16 MFU | 125880 tok/s step 4800/19560 | loss 3.599462 (+0.29z)| norm 0.2852 (-0.02z)| lr 5.33e-04 | 4155.07 ms | 32.5% bf16 MFU | 125895 tok/s step 4801/19560 | loss 3.542870 (-1.19z)| norm 0.2745 (-0.28z)| lr 5.33e-04 | 4161.68 ms | 32.4% bf16 MFU | 125899 tok/s step 4802/19560 | loss 3.592033 (+0.09z)| norm 0.2569 (-0.68z)| lr 5.33e-04 | 4149.10 ms | 32.5% bf16 MFU | 125923 tok/s step 4803/19560 | loss 3.562626 (-0.71z)| norm 0.2819 (-0.10z)| lr 5.33e-04 | 4152.62 ms | 32.5% bf16 MFU | 125939 tok/s step 4804/19560 | loss 3.598008 (+0.25z)| norm 0.3136 (+0.62z)| lr 5.33e-04 | 4158.04 ms | 32.5% bf16 MFU | 125947 tok/s step 4805/19560 | loss 3.599337 (+0.28z)| norm 0.2992 (+0.28z)| lr 5.33e-04 | 4160.29 ms | 32.5% bf16 MFU | 125950 tok/s step 4806/19560 | loss 3.529876 (-1.57z)| norm 0.2764 (-0.24z)| lr 5.33e-04 | 4154.65 ms | 32.5% bf16 MFU | 125963 tok/s step 4807/19560 | loss 3.608279 (+0.53z)| norm 0.2991 (+0.27z)| lr 5.33e-04 | 4157.61 ms | 32.5% bf16 MFU | 125970 tok/s step 4808/19560 | loss 3.587879 (-0.01z)| norm 0.3054 (+0.41z)| lr 5.32e-04 | 4153.62 ms | 32.5% bf16 MFU | 125982 tok/s step 4809/19560 | loss 3.639290 (+1.34z)| norm 0.2787 (-0.20z)| lr 5.32e-04 | 4150.78 ms | 32.5% bf16 MFU | 125999 tok/s step 4810/19560 | loss 3.539803 (-1.29z)| norm 0.3033 (+0.36z)| lr 5.32e-04 | 6752.33 ms | 20.0% bf16 MFU | 123581 tok/s step 4811/19560 | loss 3.546948 (-1.09z)| norm 0.2900 (+0.05z)| lr 5.32e-04 | 4156.95 ms | 32.5% bf16 MFU | 
123708 tok/s step 4812/19560 | loss 3.571173 (-0.45z)| norm 0.2887 (+0.02z)| lr 5.32e-04 | 4143.54 ms | 32.6% bf16 MFU | 123849 tok/s step 4813/19560 | loss 3.555143 (-0.86z)| norm 0.2853 (-0.06z)| lr 5.32e-04 | 4153.11 ms | 32.5% bf16 MFU | 123969 tok/s step 4814/19560 | loss 3.559834 (-0.75z)| norm 0.2650 (-0.52z)| lr 5.32e-04 | 4164.20 ms | 32.4% bf16 MFU | 124066 tok/s step 4815/19560 | loss 3.622222 (+0.91z)| norm 0.2713 (-0.37z)| lr 5.32e-04 | 4173.40 ms | 32.4% bf16 MFU | 124144 tok/s step 4816/19560 | loss 3.525758 (-1.64z)| norm 0.2831 (-0.09z)| lr 5.32e-04 | 4161.15 ms | 32.4% bf16 MFU | 124236 tok/s step 4817/19560 | loss 3.543159 (-1.16z)| norm 0.2843 (-0.06z)| lr 5.32e-04 | 4160.37 ms | 32.5% bf16 MFU | 124325 tok/s step 4818/19560 | loss 3.610231 (+0.60z)| norm 0.2873 (+0.01z)| lr 5.32e-04 | 4149.42 ms | 32.5% bf16 MFU | 124427 tok/s step 4819/19560 | loss 3.564248 (-0.60z)| norm 0.3176 (+0.72z)| lr 5.32e-04 | 4175.72 ms | 32.3% bf16 MFU | 124483 tok/s step 4820/19560 | loss 3.645129 (+1.51z)| norm 0.2634 (-0.54z)| lr 5.32e-04 | 4176.56 ms | 32.3% bf16 MFU | 124536 tok/s step 4821/19560 | loss 3.570969 (-0.44z)| norm 0.2939 (+0.17z)| lr 5.32e-04 | 4149.97 ms | 32.5% bf16 MFU | 124626 tok/s step 4822/19560 | loss 3.613720 (+0.68z)| norm 0.2734 (-0.30z)| lr 5.32e-04 | 4162.51 ms | 32.4% bf16 MFU | 124692 tok/s step 4823/19560 | loss 3.541704 (-1.20z)| norm 0.2744 (-0.28z)| lr 5.32e-04 | 4161.16 ms | 32.4% bf16 MFU | 124757 tok/s step 4824/19560 | loss 3.560527 (-0.70z)| norm 0.2693 (-0.39z)| lr 5.32e-04 | 4147.95 ms | 32.6% bf16 MFU | 124839 tok/s step 4825/19560 | loss 3.513796 (-1.89z)| norm 0.2719 (-0.33z)| lr 5.32e-04 | 4146.30 ms | 32.6% bf16 MFU | 124920 tok/s step 4826/19560 | loss 3.506995 (-2.02z)| norm 0.2607 (-0.59z)| lr 5.32e-04 | 4169.61 ms | 32.4% bf16 MFU | 124961 tok/s step 4827/19560 | loss 3.591718 (+0.14z)| norm 0.2551 (-0.71z)| lr 5.32e-04 | 4155.87 ms | 32.5% bf16 MFU | 125020 tok/s step 4828/19560 | loss 3.573745 (-0.32z)| norm 
0.2870 (+0.02z)| lr 5.32e-04 | 4166.39 ms | 32.4% bf16 MFU | 125061 tok/s step 4829/19560 | loss 3.571661 (-0.36z)| norm 0.2780 (-0.20z)| lr 5.32e-04 | 4154.99 ms | 32.5% bf16 MFU | 125117 tok/s step 4830/19560 | loss 3.548222 (-0.95z)| norm 0.2647 (-0.50z)| lr 5.32e-04 | 4338.06 ms | 31.1% bf16 MFU | 124904 tok/s step 4831/19560 | loss 3.579781 (-0.13z)| norm 0.2753 (-0.26z)| lr 5.32e-04 | 4159.50 ms | 32.5% bf16 MFU | 124961 tok/s step 4832/19560 | loss 3.567502 (-0.44z)| norm 0.2739 (-0.30z)| lr 5.32e-04 | 4161.33 ms | 32.4% bf16 MFU | 125013 tok/s step 4833/19560 | loss 3.566586 (-0.46z)| norm 0.2777 (-0.21z)| lr 5.32e-04 | 4160.24 ms | 32.5% bf16 MFU | 125063 tok/s step 4834/19560 | loss 3.567756 (-0.42z)| norm 0.2754 (-0.27z)| lr 5.32e-04 | 4154.16 ms | 32.5% bf16 MFU | 125121 tok/s step 4835/19560 | loss 3.628444 (+1.15z)| norm 0.2839 (-0.07z)| lr 5.32e-04 | 4156.75 ms | 32.5% bf16 MFU | 125171 tok/s step 4836/19560 | loss 3.563658 (-0.55z)| norm 0.2493 (-0.87z)| lr 5.32e-04 | 4148.52 ms | 32.5% bf16 MFU | 125232 tok/s step 4837/19560 | loss 3.584628 (+0.01z)| norm 0.2639 (-0.52z)| lr 5.32e-04 | 4147.62 ms | 32.6% bf16 MFU | 125290 tok/s step 4838/19560 | loss 3.575028 (-0.23z)| norm 0.3009 (+0.34z)| lr 5.32e-04 | 4158.31 ms | 32.5% bf16 MFU | 125330 tok/s step 4839/19560 | loss 3.546094 (-0.99z)| norm 0.2624 (-0.56z)| lr 5.32e-04 | 4145.89 ms | 32.6% bf16 MFU | 125386 tok/s step 4840/19560 | loss 3.551009 (-0.84z)| norm 0.2957 (+0.22z)| lr 5.31e-04 | 4154.39 ms | 32.5% bf16 MFU | 125427 tok/s step 4841/19560 | loss 3.462428 (-3.07z)| norm 0.2938 (+0.17z)| lr 5.31e-04 | 4173.84 ms | 32.3% bf16 MFU | 125436 tok/s step 4842/19560 | loss 3.628367 (+1.19z)| norm 0.2929 (+0.15z)| lr 5.31e-04 | 4150.85 ms | 32.5% bf16 MFU | 125480 tok/s step 4843/19560 | loss 3.535214 (-1.18z)| norm 0.2770 (-0.22z)| lr 5.31e-04 | 4149.02 ms | 32.5% bf16 MFU | 125524 tok/s step 4844/19560 | loss 3.612962 (+0.80z)| norm 0.2948 (+0.19z)| lr 5.31e-04 | 4228.69 ms | 31.9% bf16 MFU | 
125447 tok/s step 4845/19560 | loss 3.585312 (+0.12z)| norm 0.2955 (+0.21z)| lr 5.31e-04 | 4158.88 ms | 32.5% bf16 MFU | 125478 tok/s step 4846/19560 | loss 3.600431 (+0.50z)| norm 0.2862 (-0.01z)| lr 5.31e-04 | 4165.46 ms | 32.4% bf16 MFU | 125497 tok/s step 4847/19560 | loss 3.564931 (-0.43z)| norm 0.2955 (+0.21z)| lr 5.31e-04 | 4161.53 ms | 32.4% bf16 MFU | 125522 tok/s step 4848/19560 | loss 3.524388 (-1.48z)| norm 0.2689 (-0.40z)| lr 5.31e-04 | 4160.15 ms | 32.5% bf16 MFU | 125547 tok/s step 4849/19560 | loss 3.541425 (-1.01z)| norm 0.2582 (-0.64z)| lr 5.31e-04 | 4197.03 ms | 32.2% bf16 MFU | 125516 tok/s step 4850/19560 | loss 3.537575 (-1.13z)| norm 0.2981 (+0.28z)| lr 5.31e-04 | 4150.00 ms | 32.5% bf16 MFU | 125557 tok/s step 4851/19560 | loss 3.557333 (-0.61z)| norm 0.2645 (-0.50z)| lr 5.31e-04 | 4177.35 ms | 32.3% bf16 MFU | 125554 tok/s step 4852/19560 | loss 3.573618 (-0.17z)| norm 0.2900 (+0.10z)| lr 5.31e-04 | 4154.57 ms | 32.5% bf16 MFU | 125586 tok/s step 4853/19560 | loss 3.555742 (-0.65z)| norm 0.2798 (-0.14z)| lr 5.31e-04 | 4161.99 ms | 32.4% bf16 MFU | 125605 tok/s step 4854/19560 | loss 3.655960 (+1.99z)| norm 0.3005 (+0.34z)| lr 5.31e-04 | 4207.88 ms | 32.1% bf16 MFU | 125555 tok/s step 4855/19560 | loss 3.494347 (-2.23z)| norm 0.3154 (+0.68z)| lr 5.31e-04 | 4158.46 ms | 32.5% bf16 MFU | 125581 tok/s step 4856/19560 | loss 3.583545 (+0.09z)| norm 0.3091 (+0.53z)| lr 5.31e-04 | 4152.86 ms | 32.5% bf16 MFU | 125614 tok/s step 4857/19560 | loss 3.561761 (-0.47z)| norm 0.2846 (-0.03z)| lr 5.31e-04 | 4152.40 ms | 32.5% bf16 MFU | 125647 tok/s step 4858/19560 | loss 3.582538 (+0.07z)| norm 0.2621 (-0.55z)| lr 5.31e-04 | 4153.85 ms | 32.5% bf16 MFU | 125675 tok/s step 4859/19560 | loss 3.603029 (+0.60z)| norm 0.2990 (+0.31z)| lr 5.31e-04 | 4268.48 ms | 31.6% bf16 MFU | 125533 tok/s step 4860/19560 | loss 3.577075 (-0.08z)| norm 0.2802 (-0.13z)| lr 5.31e-04 | 4154.72 ms | 32.5% bf16 MFU | 125566 tok/s step 4861/19560 | loss 3.550771 (-0.77z)| norm 
0.2895 (+0.10z)| lr 5.31e-04 | 4148.33 ms | 32.5% bf16 MFU | 125607 tok/s step 4862/19560 | loss 3.581130 (+0.04z)| norm 0.2556 (-0.70z)| lr 5.31e-04 | 4178.70 ms | 32.3% bf16 MFU | 125600 tok/s step 4863/19560 | loss 3.555424 (-0.63z)| norm 0.2824 (-0.07z)| lr 5.31e-04 | 4201.75 ms | 32.1% bf16 MFU | 125559 tok/s step 4864/19560 | loss 3.597700 (+0.47z)| norm 0.2812 (-0.10z)| lr 5.31e-04 | 4151.71 ms | 32.5% bf16 MFU | 125595 tok/s step 4865/19560 | loss 3.572328 (-0.19z)| norm 0.2596 (-0.61z)| lr 5.31e-04 | 4164.03 ms | 32.4% bf16 MFU | 125611 tok/s step 4866/19560 | loss 3.593971 (+0.37z)| norm 0.2771 (-0.19z)| lr 5.31e-04 | 4156.83 ms | 32.5% bf16 MFU | 125636 tok/s step 4867/19560 | loss 3.591346 (+0.29z)| norm 0.2686 (-0.39z)| lr 5.31e-04 | 4167.34 ms | 32.4% bf16 MFU | 125645 tok/s step 4868/19560 | loss 3.533832 (-1.22z)| norm 0.2663 (-0.44z)| lr 5.31e-04 | 4161.81 ms | 32.4% bf16 MFU | 125662 tok/s step 4869/19560 | loss 3.564094 (-0.40z)| norm 0.2329 (-1.22z)| lr 5.31e-04 | 4156.62 ms | 32.5% bf16 MFU | 125685 tok/s step 4870/19560 | loss 3.685883 (+2.77z)| norm 0.2833 (-0.04z)| lr 5.31e-04 | 4151.18 ms | 32.5% bf16 MFU | 125716 tok/s step 4871/19560 | loss 3.602282 (+0.61z)| norm 0.2838 (-0.03z)| lr 5.30e-04 | 4169.72 ms | 32.4% bf16 MFU | 125717 tok/s step 4872/19560 | loss 3.531633 (-1.25z)| norm 0.2768 (-0.19z)| lr 5.30e-04 | 4158.53 ms | 32.5% bf16 MFU | 125735 tok/s step 4873/19560 | loss 3.584944 (+0.18z)| norm 0.2789 (-0.15z)| lr 5.30e-04 | 4328.90 ms | 31.2% bf16 MFU | 125504 tok/s step 4874/19560 | loss 3.552842 (-0.68z)| norm 0.2617 (-0.55z)| lr 5.30e-04 | 4236.84 ms | 31.9% bf16 MFU | 125416 tok/s step 4875/19560 | loss 3.638585 (+1.70z)| norm 0.2855 (+0.01z)| lr 5.30e-04 | 4519.75 ms | 29.9% bf16 MFU | 124945 tok/s step 4876/19560 | loss 3.550113 (-0.76z)| norm 0.2857 (+0.01z)| lr 5.30e-04 | 4353.53 ms | 31.0% bf16 MFU | 124719 tok/s step 4877/19560 | loss 3.601806 (+0.67z)| norm 0.2684 (-0.39z)| lr 5.30e-04 | 4209.56 ms | 32.1% bf16 MFU | 
124710 tok/s step 4878/19560 | loss 3.605831 (+0.77z)| norm 0.2893 (+0.09z)| lr 5.30e-04 | 4171.15 ms | 32.4% bf16 MFU | 124760 tok/s step 4879/19560 | loss 3.549432 (-0.80z)| norm 0.2838 (-0.03z)| lr 5.30e-04 | 4171.65 ms | 32.4% bf16 MFU | 124806 tok/s step 4880/19560 | loss 3.609893 (+0.88z)| norm 0.2929 (+0.19z)| lr 5.30e-04 | 4297.47 ms | 31.4% bf16 MFU | 124665 tok/s step 4881/19560 | loss 3.565712 (-0.36z)| norm 0.2755 (-0.22z)| lr 5.30e-04 | 4155.39 ms | 32.5% bf16 MFU | 124741 tok/s step 4882/19560 | loss 3.545317 (-0.92z)| norm 0.2863 (+0.03z)| lr 5.30e-04 | 4169.07 ms | 32.4% bf16 MFU | 124791 tok/s step 4883/19560 | loss 3.627390 (+1.35z)| norm 0.2935 (+0.20z)| lr 5.30e-04 | 4160.71 ms | 32.5% bf16 MFU | 124852 tok/s step 4884/19560 | loss 3.565761 (-0.35z)| norm 0.2822 (-0.07z)| lr 5.30e-04 | 4180.09 ms | 32.3% bf16 MFU | 124881 tok/s step 4885/19560 | loss 3.583212 (+0.13z)| norm 0.2650 (-0.47z)| lr 5.30e-04 | 4166.01 ms | 32.4% bf16 MFU | 124929 tok/s step 4886/19560 | loss 3.542828 (-1.00z)| norm 0.2789 (-0.16z)| lr 5.30e-04 | 4186.62 ms | 32.2% bf16 MFU | 124944 tok/s step 4887/19560 | loss 3.576382 (-0.02z)| norm 0.2501 (-1.79z)| lr 5.30e-04 | 4170.00 ms | 32.4% bf16 MFU | 124983 tok/s step 4888/19560 | loss 3.628137 (+1.46z)| norm 0.2545 (-1.51z)| lr 5.30e-04 | 4164.48 ms | 32.4% bf16 MFU | 125029 tok/s step 4889/19560 | loss 3.579300 (+0.05z)| norm 0.2646 (-0.95z)| lr 5.30e-04 | 4214.17 ms | 32.0% bf16 MFU | 124998 tok/s step 4890/19560 | loss 3.578059 (+0.02z)| norm 0.2660 (-0.85z)| lr 5.30e-04 | 4155.11 ms | 32.5% bf16 MFU | 125057 tok/s step 4891/19560 | loss 3.553726 (-0.68z)| norm 0.2635 (-0.99z)| lr 5.30e-04 | 4231.53 ms | 31.9% bf16 MFU | 124999 tok/s step 4892/19560 | loss 3.625276 (+1.39z)| norm 0.2832 (+0.23z)| lr 5.30e-04 | 4174.73 ms | 32.3% bf16 MFU | 125029 tok/s step 4893/19560 | loss 3.594804 (+0.50z)| norm 0.2781 (-0.09z)| lr 5.30e-04 | 4257.72 ms | 31.7% bf16 MFU | 124934 tok/s step 4894/19560 | loss 3.590523 (+0.40z)| norm 
0.3522 (+4.22z)| lr 5.30e-04 | 4281.50 ms | 31.5% bf16 MFU | 124810 tok/s step 4895/19560 | loss 3.592139 (+0.44z)| norm 0.3727 (+4.86z)| lr 5.30e-04 | 4287.33 ms | 31.5% bf16 MFU | 124684 tok/s step 4896/19560 | loss 3.561477 (-0.44z)| norm 0.2715 (-0.50z)| lr 5.30e-04 | 4182.11 ms | 32.3% bf16 MFU | 124718 tok/s step 4897/19560 | loss 3.522663 (-1.56z)| norm 0.2979 (+0.91z)| lr 5.30e-04 | 4167.64 ms | 32.4% bf16 MFU | 124772 tok/s step 4898/19560 | loss 3.634115 (+1.67z)| norm 0.3293 (+2.55z)| lr 5.30e-04 | 4153.75 ms | 32.5% bf16 MFU | 124845 tok/s step 4899/19560 | loss 3.598630 (+0.66z)| norm 0.3089 (+1.46z)| lr 5.30e-04 | 4167.07 ms | 32.4% bf16 MFU | 124893 tok/s step 4900/19560 | loss 3.604854 (+0.84z)| norm 0.3169 (+1.83z)| lr 5.30e-04 | 4155.85 ms | 32.5% bf16 MFU | 124956 tok/s step 4901/19560 | loss 3.641179 (+1.91z)| norm 0.2824 (+0.07z)| lr 5.30e-04 | 4160.93 ms | 32.4% bf16 MFU | 125009 tok/s step 4902/19560 | loss 3.541060 (-1.02z)| norm 0.2778 (-0.18z)| lr 5.29e-04 | 4171.38 ms | 32.4% bf16 MFU | 125043 tok/s step 4903/19560 | loss 3.582860 (+0.22z)| norm 0.2888 (+0.39z)| lr 5.29e-04 | 4224.32 ms | 32.0% bf16 MFU | 124996 tok/s step 4904/19560 | loss 3.596567 (+0.62z)| norm 0.2581 (-1.22z)| lr 5.29e-04 | 4180.88 ms | 32.3% bf16 MFU | 125016 tok/s step 4905/19560 | loss 3.574380 (-0.04z)| norm 0.2738 (-0.40z)| lr 5.29e-04 | 4229.76 ms | 31.9% bf16 MFU | 124963 tok/s step 4906/19560 | loss 3.593129 (+0.52z)| norm 0.2681 (-0.69z)| lr 5.29e-04 | 4312.94 ms | 31.3% bf16 MFU | 124793 tok/s step 4907/19560 | loss 3.557737 (-0.53z)| norm 0.2622 (-1.00z)| lr 5.29e-04 | 4166.62 ms | 32.4% bf16 MFU | 124845 tok/s step 4908/19560 | loss 3.567991 (-0.22z)| norm 0.2725 (-0.45z)| lr 5.29e-04 | 4172.19 ms | 32.4% bf16 MFU | 124886 tok/s step 4909/19560 | loss 3.578363 (+0.09z)| norm 0.2953 (+0.74z)| lr 5.29e-04 | 4173.69 ms | 32.3% bf16 MFU | 124922 tok/s step 4910/19560 | loss 3.728550 (+4.19z)| norm 0.2620 (-1.00z)| lr 5.29e-04 | 4163.51 ms | 32.4% bf16 MFU | 
124972 tok/s step 4911/19560 | loss 3.591584 (+0.41z)| norm 0.2823 (+0.05z)| lr 5.29e-04 | 4157.99 ms | 32.5% bf16 MFU | 125028 tok/s step 4912/19560 | loss 3.541588 (-0.96z)| norm 0.2885 (+0.37z)| lr 5.29e-04 | 4169.31 ms | 32.4% bf16 MFU | 125064 tok/s step 4913/19560 | loss 3.630466 (+1.46z)| norm 0.3005 (+0.99z)| lr 5.29e-04 | 4176.23 ms | 32.3% bf16 MFU | 125088 tok/s step 4914/19560 | loss 3.560636 (-0.45z)| norm 0.2595 (-1.18z)| lr 5.29e-04 | 4178.79 ms | 32.3% bf16 MFU | 125107 tok/s step 4915/19560 | loss 3.591409 (+0.39z)| norm 0.2628 (-1.00z)| lr 5.29e-04 | 4174.25 ms | 32.3% bf16 MFU | 125132 tok/s step 4916/19560 | loss 3.612005 (+0.94z)| norm 0.2755 (-0.32z)| lr 5.29e-04 | 4159.18 ms | 32.5% bf16 MFU | 125178 tok/s step 4917/19560 | loss 3.599155 (+0.58z)| norm 0.2684 (-0.68z)| lr 5.29e-04 | 4168.77 ms | 32.4% bf16 MFU | 125207 tok/s step 4918/19560 | loss 3.681377 (+2.72z)| norm 0.2669 (-0.76z)| lr 5.29e-04 | 4178.42 ms | 32.3% bf16 MFU | 125221 tok/s step 4919/19560 | loss 3.574413 (-0.12z)| norm 0.2709 (-0.54z)| lr 5.29e-04 | 4171.72 ms | 32.4% bf16 MFU | 125244 tok/s step 4920/19560 | loss 3.617112 (+1.02z)| norm 0.2927 (+0.60z)| lr 5.29e-04 | 4174.56 ms | 32.3% bf16 MFU | 125261 tok/s step 4921/19560 | loss 3.582983 (+0.11z)| norm 0.2658 (-0.80z)| lr 5.29e-04 | 4169.71 ms | 32.4% bf16 MFU | 125285 tok/s step 4922/19560 | loss 3.561554 (-0.47z)| norm 0.2850 (+0.20z)| lr 5.29e-04 | 4178.30 ms | 32.3% bf16 MFU | 125294 tok/s step 4923/19560 | loss 3.579011 (-0.00z)| norm 0.2603 (-1.09z)| lr 5.29e-04 | 4417.72 ms | 30.6% bf16 MFU | 124964 tok/s step 4924/19560 | loss 3.544474 (-0.92z)| norm 0.2583 (-1.20z)| lr 5.29e-04 | 4681.05 ms | 28.8% bf16 MFU | 124316 tok/s step 4925/19560 | loss 3.579300 (+0.01z)| norm 0.2691 (-0.62z)| lr 5.29e-04 | 4158.76 ms | 32.5% bf16 MFU | 124403 tok/s step 4926/19560 | loss 3.543607 (-0.93z)| norm 0.3206 (+2.03z)| lr 5.29e-04 | 4174.93 ms | 32.3% bf16 MFU | 124462 tok/s step 4927/19560 | loss 3.563757 (-0.39z)| norm 
0.2445 (-1.86z)| lr 5.29e-04 | 4161.40 ms | 32.4% bf16 MFU | 124538 tok/s step 4928/19560 | loss 3.578572 (+0.01z)| norm 0.2662 (-0.74z)| lr 5.29e-04 | 4227.93 ms | 31.9% bf16 MFU | 124512 tok/s step 4929/19560 | loss 3.606866 (+0.75z)| norm 0.2826 (+0.09z)| lr 5.29e-04 | 4164.94 ms | 32.4% bf16 MFU | 124580 tok/s step 4930/19560 | loss 3.545857 (-0.87z)| norm 0.2884 (+0.38z)| lr 5.29e-04 | 4158.20 ms | 32.5% bf16 MFU | 124655 tok/s step 4931/19560 | loss 3.642716 (+1.68z)| norm 0.3025 (+1.09z)| lr 5.29e-04 | 4152.66 ms | 32.5% bf16 MFU | 124735 tok/s step 4932/19560 | loss 3.683064 (+2.65z)| norm 0.2935 (+0.64z)| lr 5.29e-04 | 4181.80 ms | 32.3% bf16 MFU | 124767 tok/s step 4933/19560 | loss 3.590979 (+0.29z)| norm 0.2803 (-0.03z)| lr 5.28e-04 | 4169.09 ms | 32.4% bf16 MFU | 124817 tok/s step 4934/19560 | loss 3.574152 (-0.15z)| norm 0.2855 (+0.23z)| lr 5.28e-04 | 4166.68 ms | 32.4% bf16 MFU | 124867 tok/s step 4935/19560 | loss 3.543952 (-0.92z)| norm 0.2932 (+0.63z)| lr 5.28e-04 | 4157.60 ms | 32.5% bf16 MFU | 124929 tok/s step 4936/19560 | loss 3.547633 (-0.81z)| norm 0.2844 (+0.19z)| lr 5.28e-04 | 4182.95 ms | 32.3% bf16 MFU | 124950 tok/s step 4937/19560 | loss 3.712088 (+3.29z)| norm 0.2794 (-0.07z)| lr 5.28e-04 | 4177.11 ms | 32.3% bf16 MFU | 124978 tok/s step 4938/19560 | loss 3.576195 (-0.10z)| norm 0.3170 (+1.87z)| lr 5.28e-04 | 4159.97 ms | 32.5% bf16 MFU | 125031 tok/s step 4939/19560 | loss 3.617499 (+0.92z)| norm 0.3064 (+1.30z)| lr 5.28e-04 | 4154.88 ms | 32.5% bf16 MFU | 125088 tok/s step 4940/19560 | loss 3.635252 (+1.34z)| norm 0.2713 (-0.49z)| lr 5.28e-04 | 4175.94 ms | 32.3% bf16 MFU | 125111 tok/s step 4941/19560 | loss 3.594553 (+0.33z)| norm 0.2779 (-0.15z)| lr 5.28e-04 | 4178.73 ms | 32.3% bf16 MFU | 125129 tok/s step 4942/19560 | loss 3.622524 (+1.01z)| norm 0.2464 (-1.74z)| lr 5.28e-04 | 4157.66 ms | 32.5% bf16 MFU | 125178 tok/s step 4943/19560 | loss 3.608219 (+0.66z)| norm 0.3019 (+1.07z)| lr 5.28e-04 | 4179.07 ms | 32.3% bf16 MFU | 
125192 tok/s step 4944/19560 | loss 3.537359 (-1.11z)| norm 0.2732 (-0.38z)| lr 5.28e-04 | 4150.04 ms | 32.5% bf16 MFU | 125249 tok/s step 4945/19560 | loss 3.597638 (+0.38z)| norm 0.2916 (+0.54z)| lr 5.28e-04 | 4158.72 ms | 32.5% bf16 MFU | 125290 tok/s step 4946/19560 | loss 3.587325 (+0.13z)| norm 0.2841 (+0.17z)| lr 5.28e-04 | 4171.38 ms | 32.4% bf16 MFU | 125310 tok/s step 4947/19560 | loss 3.554006 (-0.70z)| norm 0.2711 (-0.48z)| lr 5.28e-04 | 4148.64 ms | 32.5% bf16 MFU | 125363 tok/s step 4948/19560 | loss 3.558496 (-0.58z)| norm 0.2712 (-0.48z)| lr 5.28e-04 | 4176.29 ms | 32.3% bf16 MFU | 125372 tok/s step 4949/19560 | loss 3.518225 (-1.57z)| norm 0.2680 (-0.63z)| lr 5.28e-04 | 4187.69 ms | 32.2% bf16 MFU | 125363 tok/s step 4950/19560 | loss 3.577262 (-0.09z)| norm 0.2618 (-0.94z)| lr 5.28e-04 | 4174.69 ms | 32.3% bf16 MFU | 125374 tok/s step 4951/19560 | loss 3.559835 (-0.53z)| norm 0.2773 (-0.15z)| lr 5.28e-04 | 4169.80 ms | 32.4% bf16 MFU | 125392 tok/s step 4952/19560 | loss 3.555767 (-0.63z)| norm 0.2556 (-1.25z)| lr 5.28e-04 | 4167.86 ms | 32.4% bf16 MFU | 125412 tok/s step 4953/19560 | loss 3.632552 (+1.28z)| norm 0.2495 (-1.54z)| lr 5.28e-04 | 4191.17 ms | 32.2% bf16 MFU | 125396 tok/s step 4954/19560 | loss 3.621428 (+0.99z)| norm 0.2919 (+0.59z)| lr 5.28e-04 | 4208.17 ms | 32.1% bf16 MFU | 125356 tok/s step 4955/19560 | loss 3.552740 (-0.75z)| norm 0.2843 (+0.19z)| lr 5.28e-04 | 4168.70 ms | 32.4% bf16 MFU | 125377 tok/s step 4956/19560 | loss 3.539460 (-1.08z)| norm 0.3156 (+1.76z)| lr 5.28e-04 | 4168.44 ms | 32.4% bf16 MFU | 125396 tok/s step 4957/19560 | loss 3.552769 (-0.74z)| norm 0.3065 (+1.28z)| lr 5.28e-04 | 4171.18 ms | 32.4% bf16 MFU | 125411 tok/s step 4958/19560 | loss 3.542474 (-0.99z)| norm 0.3108 (+1.47z)| lr 5.28e-04 | 4161.85 ms | 32.4% bf16 MFU | 125439 tok/s step 4959/19560 | loss 3.643695 (+1.53z)| norm 0.3132 (+1.56z)| lr 5.28e-04 | 4164.41 ms | 32.4% bf16 MFU | 125462 tok/s step 4960/19560 | loss 3.608435 (+0.64z)| norm 
0.3015 (+0.97z)| lr 5.28e-04 | 4172.92 ms | 32.4% bf16 MFU | 125471 tok/s step 4961/19560 | loss 3.569127 (-0.34z)| norm 0.2890 (+0.35z)| lr 5.28e-04 | 4166.01 ms | 32.4% bf16 MFU | 125490 tok/s step 4962/19560 | loss 3.641448 (+1.44z)| norm 0.2954 (+0.65z)| lr 5.28e-04 | 4181.72 ms | 32.3% bf16 MFU | 125484 tok/s step 4963/19560 | loss 3.572082 (-0.27z)| norm 0.2648 (-0.84z)| lr 5.28e-04 | 4183.51 ms | 32.3% bf16 MFU | 125476 tok/s step 4964/19560 | loss 3.624055 (+1.01z)| norm 0.3080 (+1.26z)| lr 5.27e-04 | 4163.12 ms | 32.4% bf16 MFU | 125499 tok/s step 4965/19560 | loss 3.549104 (-0.84z)| norm 0.2889 (+0.32z)| lr 5.27e-04 | 4188.40 ms | 32.2% bf16 MFU | 125483 tok/s step 4966/19560 | loss 3.585730 (+0.07z)| norm 0.2751 (-0.36z)| lr 5.27e-04 | 4162.25 ms | 32.4% bf16 MFU | 125507 tok/s step 4967/19560 | loss 3.625808 (+1.04z)| norm 0.2840 (+0.08z)| lr 5.27e-04 | 4158.27 ms | 32.5% bf16 MFU | 125536 tok/s step 4968/19560 | loss 3.613374 (+0.72z)| norm 0.2891 (+0.33z)| lr 5.27e-04 | 4307.40 ms | 31.3% bf16 MFU | 125345 tok/s step 4969/19560 | loss 3.756712 (+4.07z)| norm 0.2730 (-0.46z)| lr 5.27e-04 | 4178.04 ms | 32.3% bf16 MFU | 125352 tok/s step 4970/19560 | loss 3.574407 (-0.28z)| norm 0.2929 (+0.53z)| lr 5.27e-04 | 4202.87 ms | 32.1% bf16 MFU | 125322 tok/s step 4971/19560 | loss 3.574591 (-0.29z)| norm 0.3027 (+1.00z)| lr 5.27e-04 | 4172.59 ms | 32.4% bf16 MFU | 125338 tok/s step 4972/19560 | loss 3.634406 (+1.15z)| norm 0.3105 (+1.37z)| lr 5.27e-04 | 4191.26 ms | 32.2% bf16 MFU | 125326 tok/s step 4973/19560 | loss 3.518388 (-1.61z)| norm 0.3186 (+1.74z)| lr 5.27e-04 | 4157.91 ms | 32.5% bf16 MFU | 125364 tok/s step 4974/19560 | loss 3.546084 (-0.94z)| norm 0.2859 (+0.15z)| lr 5.27e-04 | 4163.35 ms | 32.4% bf16 MFU | 125392 tok/s step 4975/19560 | loss 3.571738 (-0.33z)| norm 0.3094 (+1.28z)| lr 5.27e-04 | 4182.52 ms | 32.3% bf16 MFU | 125390 tok/s step 4976/19560 | loss 3.549750 (-0.86z)| norm 0.3033 (+0.98z)| lr 5.27e-04 | 4321.31 ms | 31.2% bf16 MFU | 
125187 tok/s step 4977/19560 | loss 3.586625 (+0.01z)| norm 0.2568 (-1.27z)| lr 5.27e-04 | 4172.73 ms | 32.4% bf16 MFU | 125210 tok/s step 4978/19560 | loss 3.571421 (-0.36z)| norm 0.2630 (-0.96z)| lr 5.27e-04 | 4170.69 ms | 32.4% bf16 MFU | 125235 tok/s step 4979/19560 | loss 3.552176 (-0.83z)| norm 0.2442 (-1.84z)| lr 5.27e-04 | 4190.79 ms | 32.2% bf16 MFU | 125229 tok/s step 4980/19560 | loss 3.487722 (-2.31z)| norm 0.2636 (-0.90z)| lr 5.27e-04 | 4300.53 ms | 31.4% bf16 MFU | 125063 tok/s step 4981/19560 | loss 3.527512 (-1.36z)| norm 0.2575 (-1.18z)| lr 5.27e-04 | 4174.96 ms | 32.3% bf16 MFU | 125089 tok/s step 4982/19560 | loss 3.534066 (-1.19z)| norm 0.2762 (-0.28z)| lr 5.27e-04 | 4183.74 ms | 32.3% bf16 MFU | 125100 tok/s step 4983/19560 | loss 3.540749 (-1.06z)| norm 0.2585 (-1.10z)| lr 5.27e-04 | 4182.60 ms | 32.3% bf16 MFU | 125112 tok/s step 4984/19560 | loss 3.604849 (+0.47z)| norm 0.3082 (+1.27z)| lr 5.27e-04 | 4154.46 ms | 32.5% bf16 MFU | 125167 tok/s step 4985/19560 | loss 3.639818 (+1.29z)| norm 0.3083 (+1.25z)| lr 5.27e-04 | 4164.92 ms | 32.4% bf16 MFU | 125202 tok/s step 4986/19560 | loss 3.512542 (-1.71z)| norm 0.3009 (+0.89z)| lr 5.27e-04 | 4162.33 ms | 32.4% bf16 MFU | 125240 tok/s step 4987/19560 | loss 3.570291 (-0.34z)| norm 0.3153 (+1.55z)| lr 5.27e-04 | 4311.59 ms | 31.3% bf16 MFU | 125058 tok/s step 4988/19560 | loss 3.564370 (-0.48z)| norm 0.2824 (+0.00z)| lr 5.27e-04 | 4173.09 ms | 32.4% bf16 MFU | 125087 tok/s step 4989/19560 | loss 3.588146 (+0.07z)| norm 0.2719 (-0.48z)| lr 5.27e-04 | 4183.51 ms | 32.3% bf16 MFU | 125099 tok/s step 4990/19560 | loss 3.568997 (-0.38z)| norm 0.3211 (+1.80z)| lr 5.27e-04 | 4154.94 ms | 32.5% bf16 MFU | 125153 tok/s step 4991/19560 | loss 3.559133 (-0.61z)| norm 0.3120 (+1.35z)| lr 5.27e-04 | 4178.01 ms | 32.3% bf16 MFU | 125170 tok/s step 4992/19560 | loss 3.610352 (+0.59z)| norm 0.2563 (-1.22z)| lr 5.27e-04 | 4164.59 ms | 32.4% bf16 MFU | 125206 tok/s step 4993/19560 | loss 3.582770 (-0.06z)| norm 
0.2881 (+0.24z)| lr 5.27e-04 | 4177.11 ms | 32.3% bf16 MFU | 125221 tok/s step 4994/19560 | loss 3.586687 (+0.04z)| norm 0.3034 (+0.94z)| lr 5.27e-04 | 4165.94 ms | 32.4% bf16 MFU | 125253 tok/s step 4995/19560 | loss 3.549006 (-0.84z)| norm 0.2994 (+0.74z)| lr 5.26e-04 | 4169.71 ms | 32.4% bf16 MFU | 125277 tok/s step 4996/19560 | loss 3.561727 (-0.55z)| norm 0.2976 (+0.65z)| lr 5.26e-04 | 4181.24 ms | 32.3% bf16 MFU | 125283 tok/s step 4997/19560 | loss 3.595713 (+0.25z)| norm 0.2739 (-0.47z)| lr 5.26e-04 | 4156.79 ms | 32.5% bf16 MFU | 125325 tok/s step 4998/19560 | loss 3.566698 (-0.43z)| norm 0.2944 (+0.49z)| lr 5.26e-04 | 4186.15 ms | 32.3% bf16 MFU | 125321 tok/s step 4999/19560 | loss 3.688572 (+2.44z)| norm 0.2857 (+0.08z)| lr 5.26e-04 | 4176.78 ms | 32.3% bf16 MFU | 125331 tok/s step 5000/19560 | loss 3.535026 (-1.18z)| norm 0.2956 (+0.54z)| lr 5.26e-04 | 4170.08 ms | 32.4% bf16 MFU | 125351 tok/s val loss 3.548242 evaluating HellaSwag: 0/1256 evaluating HellaSwag: 10/1256 evaluating HellaSwag: 20/1256 evaluating HellaSwag: 30/1256 evaluating HellaSwag: 40/1256 evaluating HellaSwag: 50/1256 evaluating HellaSwag: 60/1256 evaluating HellaSwag: 70/1256 evaluating HellaSwag: 80/1256 evaluating HellaSwag: 90/1256 evaluating HellaSwag: 100/1256 evaluating HellaSwag: 110/1256 evaluating HellaSwag: 120/1256 evaluating HellaSwag: 130/1256 evaluating HellaSwag: 140/1256 evaluating HellaSwag: 150/1256 evaluating HellaSwag: 160/1256 evaluating HellaSwag: 170/1256 evaluating HellaSwag: 180/1256 evaluating HellaSwag: 190/1256 evaluating HellaSwag: 200/1256 evaluating HellaSwag: 210/1256 evaluating HellaSwag: 220/1256 evaluating HellaSwag: 230/1256 evaluating HellaSwag: 240/1256 evaluating HellaSwag: 250/1256 evaluating HellaSwag: 260/1256 evaluating HellaSwag: 270/1256 evaluating HellaSwag: 280/1256 evaluating HellaSwag: 290/1256 evaluating HellaSwag: 300/1256 evaluating HellaSwag: 310/1256 evaluating HellaSwag: 320/1256 evaluating HellaSwag: 330/1256 evaluating 
HellaSwag: 340/1256 evaluating HellaSwag: 350/1256 evaluating HellaSwag: 360/1256 evaluating HellaSwag: 370/1256 evaluating HellaSwag: 380/1256 evaluating HellaSwag: 390/1256 evaluating HellaSwag: 400/1256 evaluating HellaSwag: 410/1256 evaluating HellaSwag: 420/1256 evaluating HellaSwag: 430/1256 evaluating HellaSwag: 440/1256 evaluating HellaSwag: 450/1256 evaluating HellaSwag: 460/1256 evaluating HellaSwag: 470/1256 evaluating HellaSwag: 480/1256 evaluating HellaSwag: 490/1256 evaluating HellaSwag: 500/1256 evaluating HellaSwag: 510/1256 evaluating HellaSwag: 520/1256 evaluating HellaSwag: 530/1256 evaluating HellaSwag: 540/1256 evaluating HellaSwag: 550/1256 evaluating HellaSwag: 560/1256 evaluating HellaSwag: 570/1256 evaluating HellaSwag: 580/1256 evaluating HellaSwag: 590/1256 evaluating HellaSwag: 600/1256 evaluating HellaSwag: 610/1256 evaluating HellaSwag: 620/1256 evaluating HellaSwag: 630/1256 evaluating HellaSwag: 640/1256 evaluating HellaSwag: 650/1256 evaluating HellaSwag: 660/1256 evaluating HellaSwag: 670/1256 evaluating HellaSwag: 680/1256 evaluating HellaSwag: 690/1256 evaluating HellaSwag: 700/1256 evaluating HellaSwag: 710/1256 evaluating HellaSwag: 720/1256 evaluating HellaSwag: 730/1256 evaluating HellaSwag: 740/1256 evaluating HellaSwag: 750/1256 evaluating HellaSwag: 760/1256 evaluating HellaSwag: 770/1256 evaluating HellaSwag: 780/1256 evaluating HellaSwag: 790/1256 evaluating HellaSwag: 800/1256 evaluating HellaSwag: 810/1256 evaluating HellaSwag: 820/1256 evaluating HellaSwag: 830/1256 evaluating HellaSwag: 840/1256 evaluating HellaSwag: 850/1256 evaluating HellaSwag: 860/1256 evaluating HellaSwag: 870/1256 evaluating HellaSwag: 880/1256 evaluating HellaSwag: 890/1256 evaluating HellaSwag: 900/1256 evaluating HellaSwag: 910/1256 evaluating HellaSwag: 920/1256 evaluating HellaSwag: 930/1256 evaluating HellaSwag: 940/1256 evaluating HellaSwag: 950/1256 evaluating HellaSwag: 960/1256 evaluating HellaSwag: 970/1256 evaluating HellaSwag: 
980/1256 evaluating HellaSwag: 990/1256 evaluating HellaSwag: 1000/1256 evaluating HellaSwag: 1010/1256 evaluating HellaSwag: 1020/1256 evaluating HellaSwag: 1030/1256 evaluating HellaSwag: 1040/1256 evaluating HellaSwag: 1050/1256 evaluating HellaSwag: 1060/1256 evaluating HellaSwag: 1070/1256 evaluating HellaSwag: 1080/1256 evaluating HellaSwag: 1090/1256 evaluating HellaSwag: 1100/1256 evaluating HellaSwag: 1110/1256 evaluating HellaSwag: 1120/1256 evaluating HellaSwag: 1130/1256 evaluating HellaSwag: 1140/1256 evaluating HellaSwag: 1150/1256 evaluating HellaSwag: 1160/1256 evaluating HellaSwag: 1170/1256 evaluating HellaSwag: 1180/1256 evaluating HellaSwag: 1190/1256 evaluating HellaSwag: 1200/1256 evaluating HellaSwag: 1210/1256 evaluating HellaSwag: 1220/1256 evaluating HellaSwag: 1230/1256 evaluating HellaSwag: 1240/1256 evaluating HellaSwag: 1250/1256 HellaSwag: 2732/10042 = 0.272057 Writing checkpoint at step 5000 Writing model to log124M/model_00005000.bin Writing state to log124M/state_00005000_00000.bin step 5001/19560 | loss 3.588061 (+0.07z)| norm 0.3002 (+0.74z)| lr 5.26e-04 | 4238.16 ms | 31.9% bf16 MFU | 125269 tok/s step 5002/19560 | loss 3.583973 (-0.03z)| norm 0.2720 (-0.59z)| lr 5.26e-04 | 4184.20 ms | 32.3% bf16 MFU | 125270 tok/s step 5003/19560 | loss 3.619030 (+0.80z)| norm 0.2714 (-0.61z)| lr 5.26e-04 | 4299.42 ms | 31.4% bf16 MFU | 125104 tok/s step 5004/19560 | loss 3.553403 (-0.75z)| norm 0.2938 (+0.44z)| lr 5.26e-04 | 4371.77 ms | 30.9% bf16 MFU | 124845 tok/s step 5005/19560 | loss 3.595596 (+0.25z)| norm 0.2749 (-0.45z)| lr 5.26e-04 | 4234.89 ms | 31.9% bf16 MFU | 124793 tok/s step 5006/19560 | loss 3.535383 (-1.16z)| norm 0.2747 (-0.45z)| lr 5.26e-04 | 4198.89 ms | 32.2% bf16 MFU | 124796 tok/s step 5007/19560 | loss 3.569572 (-0.36z)| norm 0.2940 (+0.45z)| lr 5.26e-04 | 4174.74 ms | 32.3% bf16 MFU | 124836 tok/s step 5008/19560 | loss 3.611423 (+0.63z)| norm 0.3025 (+0.85z)| lr 5.26e-04 | 4304.28 ms | 31.4% bf16 MFU | 124684 tok/s 
step 5009/19560 | loss 3.575241 (-0.23z)| norm 0.2992 (+0.68z)| lr 5.26e-04 | 4167.97 ms | 32.4% bf16 MFU | 124740 tok/s step 5010/19560 | loss 3.509429 (-1.76z)| norm 0.3229 (+1.76z)| lr 5.26e-04 | 4234.82 ms | 31.9% bf16 MFU | 124693 tok/s step 5011/19560 | loss 3.590772 (+0.15z)| norm 0.2736 (-0.52z)| lr 5.26e-04 | 4186.79 ms | 32.2% bf16 MFU | 124720 tok/s step 5012/19560 | loss 3.579989 (-0.10z)| norm 0.2802 (-0.21z)| lr 5.26e-04 | 4161.77 ms | 32.4% bf16 MFU | 124782 tok/s step 5013/19560 | loss 3.574133 (-0.24z)| norm 0.2715 (-0.62z)| lr 5.26e-04 | 4198.06 ms | 32.2% bf16 MFU | 124788 tok/s step 5014/19560 | loss 3.628134 (+1.02z)| norm 0.2560 (-1.32z)| lr 5.26e-04 | 4189.38 ms | 32.2% bf16 MFU | 124806 tok/s step 5015/19560 | loss 3.557350 (-0.65z)| norm 0.2469 (-1.74z)| lr 5.26e-04 | 4161.05 ms | 32.4% bf16 MFU | 124865 tok/s step 5016/19560 | loss 3.562630 (-0.51z)| norm 0.2914 (+0.30z)| lr 5.26e-04 | 4165.77 ms | 32.4% bf16 MFU | 124915 tok/s step 5017/19560 | loss 3.572370 (-0.28z)| norm 0.2985 (+0.62z)| lr 5.26e-04 | 4156.20 ms | 32.5% bf16 MFU | 124976 tok/s step 5018/19560 | loss 3.601891 (+0.41z)| norm 0.2843 (-0.05z)| lr 5.26e-04 | 4180.38 ms | 32.3% bf16 MFU | 124998 tok/s step 5019/19560 | loss 3.591249 (+0.15z)| norm 0.2879 (+0.11z)| lr 5.26e-04 | 4192.79 ms | 32.2% bf16 MFU | 125001 tok/s step 5020/19560 | loss 3.576477 (-0.19z)| norm 0.2790 (-0.30z)| lr 5.26e-04 | 4163.08 ms | 32.4% bf16 MFU | 125048 tok/s step 5021/19560 | loss 3.484173 (-2.31z)| norm 0.2550 (-1.41z)| lr 5.26e-04 | 4168.77 ms | 32.4% bf16 MFU | 125083 tok/s step 5022/19560 | loss 3.552340 (-0.72z)| norm 0.2740 (-0.52z)| lr 5.26e-04 | 4165.86 ms | 32.4% bf16 MFU | 125122 tok/s step 5023/19560 | loss 3.568187 (-0.34z)| norm 0.2496 (-1.76z)| lr 5.26e-04 | 4206.82 ms | 32.1% bf16 MFU | 125097 tok/s step 5024/19560 | loss 3.544942 (-0.88z)| norm 0.2897 (+0.30z)| lr 5.26e-04 | 4162.70 ms | 32.4% bf16 MFU | 125140 tok/s step 5025/19560 | loss 3.618940 (+0.82z)| norm 0.2719 (-0.61z)| 
lr 5.25e-04 | 4178.19 ms | 32.3% bf16 MFU | 125157 tok/s step 5026/19560 | loss 3.563589 (-0.46z)| norm 0.2609 (-1.17z)| lr 5.25e-04 | 4227.17 ms | 31.9% bf16 MFU | 125101 tok/s step 5027/19560 | loss 3.553058 (-0.69z)| norm 0.2516 (-1.62z)| lr 5.25e-04 | 4160.48 ms | 32.5% bf16 MFU | 125146 tok/s step 5028/19560 | loss 3.557401 (-0.58z)| norm 0.2783 (-0.22z)| lr 5.25e-04 | 4160.58 ms | 32.5% bf16 MFU | 125190 tok/s step 5029/19560 | loss 3.605160 (+0.54z)| norm 0.2682 (-0.75z)| lr 5.25e-04 | 4159.72 ms | 32.5% bf16 MFU | 125232 tok/s step 5030/19560 | loss 3.635333 (+1.23z)| norm 0.2675 (-0.78z)| lr 5.25e-04 | 4176.53 ms | 32.3% bf16 MFU | 125247 tok/s step 5031/19560 | loss 3.567928 (-0.35z)| norm 0.2793 (-0.15z)| lr 5.25e-04 | 4283.00 ms | 31.5% bf16 MFU | 125105 tok/s step 5032/19560 | loss 3.532787 (-1.15z)| norm 0.2958 (+0.71z)| lr 5.25e-04 | 4190.08 ms | 32.2% bf16 MFU | 125106 tok/s step 5033/19560 | loss 3.520251 (-1.42z)| norm 0.2677 (-0.78z)| lr 5.25e-04 | 4178.24 ms | 32.3% bf16 MFU | 125125 tok/s step 5034/19560 | loss 3.568486 (-0.30z)| norm 0.2858 (+0.18z)| lr 5.25e-04 | 4177.90 ms | 32.3% bf16 MFU | 125143 tok/s step 5035/19560 | loss 3.571399 (-0.24z)| norm 0.2550 (-1.45z)| lr 5.25e-04 | 4181.76 ms | 32.3% bf16 MFU | 125155 tok/s step 5036/19560 | loss 3.615297 (+0.77z)| norm 0.2482 (-1.78z)| lr 5.25e-04 | 4163.61 ms | 32.4% bf16 MFU | 125193 tok/s step 5037/19560 | loss 3.611826 (+0.68z)| norm 0.2847 (+0.13z)| lr 5.25e-04 | 4178.17 ms | 32.3% bf16 MFU | 125208 tok/s step 5038/19560 | loss 3.591847 (+0.25z)| norm 0.2767 (-0.29z)| lr 5.25e-04 | 4177.62 ms | 32.3% bf16 MFU | 125222 tok/s step 5039/19560 | loss 3.599863 (+0.45z)| norm 0.3090 (+1.38z)| lr 5.25e-04 | 4204.96 ms | 32.1% bf16 MFU | 125195 tok/s step 5040/19560 | loss 3.528427 (-1.27z)| norm 0.3007 (+0.94z)| lr 5.25e-04 | 4181.85 ms | 32.3% bf16 MFU | 125204 tok/s step 5041/19560 | loss 3.568214 (-0.30z)| norm 0.2975 (+0.78z)| lr 5.25e-04 | 4164.73 ms | 32.4% bf16 MFU | 125238 tok/s step 
5042/19560 | loss 3.601868 (+0.50z)| norm 0.2730 (-0.51z)| lr 5.25e-04 | 4194.40 ms | 32.2% bf16 MFU | 125226 tok/s step 5043/19560 | loss 3.581100 (+0.00z)| norm 0.2740 (-0.46z)| lr 5.25e-04 | 4173.77 ms | 32.3% bf16 MFU | 125246 tok/s step 5044/19560 | loss 3.521135 (-1.42z)| norm 0.2780 (-0.25z)| lr 5.25e-04 | 4180.92 ms | 32.3% bf16 MFU | 125253 tok/s step 5045/19560 | loss 3.602952 (+0.54z)| norm 0.2809 (-0.11z)| lr 5.25e-04 | 4171.12 ms | 32.4% bf16 MFU | 125276 tok/s step 5046/19560 | loss 3.581417 (+0.05z)| norm 0.2850 (+0.10z)| lr 5.25e-04 | 4173.59 ms | 32.4% bf16 MFU | 125293 tok/s step 5047/19560 | loss 3.557418 (-0.54z)| norm 0.2859 (+0.15z)| lr 5.25e-04 | 4177.06 ms | 32.3% bf16 MFU | 125304 tok/s step 5048/19560 | loss 3.558115 (-0.51z)| norm 0.2699 (-0.69z)| lr 5.25e-04 | 4164.98 ms | 32.4% bf16 MFU | 125333 tok/s step 5049/19560 | loss 3.546314 (-0.80z)| norm 0.2750 (-0.42z)| lr 5.25e-04 | 4187.99 ms | 32.2% bf16 MFU | 125326 tok/s step 5050/19560 | loss 3.532376 (-1.13z)| norm 0.2712 (-0.62z)| lr 5.25e-04 | 4169.94 ms | 32.4% bf16 MFU | 125346 tok/s step 5051/19560 | loss 3.629347 (+1.23z)| norm 0.2852 (+0.11z)| lr 5.25e-04 | 4160.98 ms | 32.4% bf16 MFU | 125379 tok/s step 5052/19560 | loss 3.560310 (-0.45z)| norm 0.2720 (-0.60z)| lr 5.25e-04 | 4158.82 ms | 32.5% bf16 MFU | 125413 tok/s step 5053/19560 | loss 3.544661 (-0.83z)| norm 0.2627 (-1.09z)| lr 5.25e-04 | 4169.45 ms | 32.4% bf16 MFU | 125430 tok/s step 5054/19560 | loss 3.493914 (-2.02z)| norm 0.2567 (-1.40z)| lr 5.25e-04 | 4178.06 ms | 32.3% bf16 MFU | 125432 tok/s step 5055/19560 | loss 3.598732 (+0.48z)| norm 0.2478 (-1.88z)| lr 5.24e-04 | 4159.67 ms | 32.5% bf16 MFU | 125463 tok/s step 5056/19560 | loss 3.539947 (-0.92z)| norm 0.2631 (-1.05z)| lr 5.24e-04 | 4159.42 ms | 32.5% bf16 MFU | 125492 tok/s step 5057/19560 | loss 3.563635 (-0.34z)| norm 0.2516 (-1.64z)| lr 5.24e-04 | 4178.09 ms | 32.3% bf16 MFU | 125492 tok/s step 5058/19560 | loss 3.606231 (+0.67z)| norm 0.2700 (-0.65z)| lr 
5.24e-04 | 4170.24 ms | 32.4% bf16 MFU | 125503 tok/s step 5059/19560 | loss 3.537869 (-0.96z)| norm 0.2846 (+0.13z)| lr 5.24e-04 | 4241.91 ms | 31.8% bf16 MFU | 125408 tok/s step 5060/19560 | loss 3.602481 (+0.63z)| norm 0.2665 (-0.82z)| lr 5.24e-04 | 4282.53 ms | 31.5% bf16 MFU | 125259 tok/s step 5061/19560 | loss 3.566199 (-0.26z)| norm 0.2920 (+0.53z)| lr 5.24e-04 | 4156.35 ms | 32.5% bf16 MFU | 125303 tok/s step 5062/19560 | loss 3.600887 (+0.59z)| norm 0.3124 (+1.59z)| lr 5.24e-04 | 4162.19 ms | 32.4% bf16 MFU | 125336 tok/s step 5063/19560 | loss 3.524280 (-1.29z)| norm 0.2589 (-1.21z)| lr 5.24e-04 | 4182.65 ms | 32.3% bf16 MFU | 125337 tok/s step 5064/19560 | loss 3.611221 (+0.83z)| norm 0.2992 (+0.90z)| lr 5.24e-04 | 4199.76 ms | 32.1% bf16 MFU | 125312 tok/s step 5065/19560 | loss 3.588387 (+0.31z)| norm 0.2893 (+0.37z)| lr 5.24e-04 | 4189.32 ms | 32.2% bf16 MFU | 125303 tok/s step 5066/19560 | loss 3.526117 (-1.27z)| norm 0.2824 (+0.02z)| lr 5.24e-04 | 4180.36 ms | 32.3% bf16 MFU | 125309 tok/s step 5067/19560 | loss 3.597664 (+0.56z)| norm 0.2951 (+0.71z)| lr 5.24e-04 | 4176.72 ms | 32.3% bf16 MFU | 125320 tok/s step 5068/19560 | loss 3.537898 (-0.95z)| norm 0.2832 (+0.07z)| lr 5.24e-04 | 4159.50 ms | 32.5% bf16 MFU | 125356 tok/s step 5069/19560 | loss 3.533576 (-1.05z)| norm 0.2872 (+0.28z)| lr 5.24e-04 | 4168.89 ms | 32.4% bf16 MFU | 125377 tok/s step 5070/19560 | loss 3.520607 (-1.36z)| norm 0.2532 (-1.54z)| lr 5.24e-04 | 4173.43 ms | 32.4% bf16 MFU | 125389 tok/s step 5071/19560 | loss 3.491949 (-2.04z)| norm 0.2838 (+0.10z)| lr 5.24e-04 | 4163.70 ms | 32.4% bf16 MFU | 125416 tok/s step 5072/19560 | loss 3.517534 (-1.39z)| norm 0.2856 (+0.19z)| lr 5.24e-04 | 4163.50 ms | 32.4% bf16 MFU | 125441 tok/s step 5073/19560 | loss 3.566592 (-0.15z)| norm 0.2716 (-0.55z)| lr 5.24e-04 | 4186.05 ms | 32.3% bf16 MFU | 125431 tok/s step 5074/19560 | loss 3.487109 (-2.09z)| norm 0.2755 (-0.34z)| lr 5.24e-04 | 4180.65 ms | 32.3% bf16 MFU | 125430 tok/s step 
5075/19560 | loss 3.531646 (-0.98z)| norm 0.2805 (-0.07z)| lr 5.24e-04 | 4165.80 ms | 32.4% bf16 MFU | 125451 tok/s step 5076/19560 | loss 3.469282 (-2.45z)| norm 0.2733 (-0.46z)| lr 5.24e-04 | 4169.43 ms | 32.4% bf16 MFU | 125466 tok/s step 5077/19560 | loss 3.576920 (+0.14z)| norm 0.2758 (-0.33z)| lr 5.24e-04 | 4164.19 ms | 32.4% bf16 MFU | 125488 tok/s step 5078/19560 | loss 3.525529 (-1.09z)| norm 0.3011 (+1.02z)| lr 5.24e-04 | 4160.99 ms | 32.4% bf16 MFU | 125514 tok/s step 5079/19560 | loss 3.514842 (-1.33z)| norm 0.3138 (+1.67z)| lr 5.24e-04 | 4584.05 ms | 29.5% bf16 MFU | 124957 tok/s step 5080/19560 | loss 3.550585 (-0.48z)| norm 0.3137 (+1.64z)| lr 5.24e-04 | 4163.24 ms | 32.4% bf16 MFU | 125005 tok/s step 5081/19560 | loss 3.528510 (-0.99z)| norm 0.2828 (-0.02z)| lr 5.24e-04 | 4187.75 ms | 32.2% bf16 MFU | 125015 tok/s step 5082/19560 | loss 3.607751 (+0.92z)| norm 0.2834 (+0.01z)| lr 5.24e-04 | 4166.41 ms | 32.4% bf16 MFU | 125056 tok/s step 5083/19560 | loss 3.593349 (+0.57z)| norm 0.2797 (-0.18z)| lr 5.24e-04 | 4171.62 ms | 32.4% bf16 MFU | 125087 tok/s step 5084/19560 | loss 3.525754 (-1.06z)| norm 0.2852 (+0.12z)| lr 5.24e-04 | 4173.82 ms | 32.3% bf16 MFU | 125113 tok/s step 5085/19560 | loss 3.665684 (+2.24z)| norm 0.9557 (+10.77z)| lr 5.23e-04 | 4173.85 ms | 32.3% bf16 MFU | 125138 tok/s step 5086/19560 | loss 3.575217 (+0.10z)| norm 0.3392 (+0.82z)| lr 5.23e-04 | 4172.78 ms | 32.4% bf16 MFU | 125164 tok/s step 5087/19560 | loss 3.507494 (-1.48z)| norm 0.3017 (+0.22z)| lr 5.23e-04 | 4179.01 ms | 32.3% bf16 MFU | 125178 tok/s step 5088/19560 | loss 3.536355 (-0.78z)| norm 0.3259 (+0.60z)| lr 5.23e-04 | 4160.68 ms | 32.5% bf16 MFU | 125220 tok/s step 5089/19560 | loss 3.541048 (-0.66z)| norm 0.2806 (-0.12z)| lr 5.23e-04 | 4168.57 ms | 32.4% bf16 MFU | 125248 tok/s step 5090/19560 | loss 3.588070 (+0.47z)| norm 0.2750 (-0.21z)| lr 5.23e-04 | 4161.69 ms | 32.4% bf16 MFU | 125284 tok/s step 5091/19560 | loss 3.562709 (-0.14z)| norm 0.3213 (+0.53z)| lr 
5.23e-04 | 4185.18 ms | 32.3% bf16 MFU | 125284 tok/s step 5092/19560 | loss 3.537226 (-0.74z)| norm 0.2954 (+0.11z)| lr 5.23e-04 | 4192.04 ms | 32.2% bf16 MFU | 125273 tok/s step 5093/19560 | loss 3.535589 (-0.77z)| norm 0.2835 (-0.08z)| lr 5.23e-04 | 4173.49 ms | 32.4% bf16 MFU | 125290 tok/s step 5094/19560 | loss 3.503145 (-1.52z)| norm 0.2729 (-0.25z)| lr 5.23e-04 | 4163.91 ms | 32.4% bf16 MFU | 125321 tok/s step 5095/19560 | loss 3.530766 (-0.85z)| norm 0.2689 (-0.31z)| lr 5.23e-04 | 4164.22 ms | 32.4% bf16 MFU | 125351 tok/s step 5096/19560 | loss 3.588965 (+0.55z)| norm 0.2471 (-0.66z)| lr 5.23e-04 | 4165.10 ms | 32.4% bf16 MFU | 125377 tok/s step 5097/19560 | loss 3.523604 (-1.07z)| norm 0.2671 (-0.33z)| lr 5.23e-04 | 4172.25 ms | 32.4% bf16 MFU | 125391 tok/s step 5098/19560 | loss 3.609594 (+1.17z)| norm 0.3271 (+0.62z)| lr 5.23e-04 | 4157.67 ms | 32.5% bf16 MFU | 125427 tok/s step 5099/19560 | loss 3.471162 (-2.36z)| norm 0.2838 (-0.07z)| lr 5.23e-04 | 4624.96 ms | 29.2% bf16 MFU | 124823 tok/s step 5100/19560 | loss 3.601937 (+0.99z)| norm 0.2861 (-0.03z)| lr 5.23e-04 | 4188.97 ms | 32.2% bf16 MFU | 124840 tok/s step 5101/19560 | loss 3.520140 (-1.12z)| norm 0.2649 (-0.36z)| lr 5.23e-04 | 4162.47 ms | 32.4% bf16 MFU | 124896 tok/s step 5102/19560 | loss 3.574726 (+0.28z)| norm 0.2918 (+0.07z)| lr 5.23e-04 | 4169.72 ms | 32.4% bf16 MFU | 124938 tok/s step 5103/19560 | loss 3.509565 (-1.37z)| norm 0.2639 (-0.37z)| lr 5.23e-04 | 4155.04 ms | 32.5% bf16 MFU | 125000 tok/s step 5104/19560 | loss 3.538807 (-0.62z)| norm 0.2946 (+0.12z)| lr 5.23e-04 | 4178.70 ms | 32.3% bf16 MFU | 125023 tok/s step 5105/19560 | loss 3.527648 (-0.90z)| norm 0.3111 (+0.38z)| lr 5.23e-04 | 4179.45 ms | 32.3% bf16 MFU | 125044 tok/s step 5106/19560 | loss 3.488863 (-1.84z)| norm 0.2732 (-0.23z)| lr 5.23e-04 | 4175.70 ms | 32.3% bf16 MFU | 125070 tok/s step 5107/19560 | loss 3.542929 (-0.48z)| norm 0.3151 (+0.43z)| lr 5.23e-04 | 4162.76 ms | 32.4% bf16 MFU | 125114 tok/s step 
5108/19560 | loss 3.502254 (-1.51z)| norm 0.2840 (-0.07z)| lr 5.23e-04 | 4166.68 ms | 32.4% bf16 MFU | 125150 tok/s step 5109/19560 | loss 3.534431 (-0.70z)| norm 0.2825 (-0.09z)| lr 5.23e-04 | 4171.12 ms | 32.4% bf16 MFU | 125177 tok/s step 5110/19560 | loss 3.560841 (-0.04z)| norm 0.2767 (-0.19z)| lr 5.23e-04 | 4177.75 ms | 32.3% bf16 MFU | 125193 tok/s step 5111/19560 | loss 3.514973 (-1.19z)| norm 0.2969 (+0.13z)| lr 5.23e-04 | 4158.00 ms | 32.5% bf16 MFU | 125238 tok/s step 5112/19560 | loss 3.570926 (+0.23z)| norm 0.2737 (-0.24z)| lr 5.23e-04 | 4170.96 ms | 32.4% bf16 MFU | 125261 tok/s step 5113/19560 | loss 3.559951 (-0.04z)| norm 0.2776 (-0.17z)| lr 5.23e-04 | 4167.63 ms | 32.4% bf16 MFU | 125288 tok/s step 5114/19560 | loss 3.580312 (+0.48z)| norm 0.2613 (-0.43z)| lr 5.23e-04 | 4160.09 ms | 32.5% bf16 MFU | 125325 tok/s step 5115/19560 | loss 3.530078 (-0.81z)| norm 0.3808 (+1.48z)| lr 5.22e-04 | 4166.14 ms | 32.4% bf16 MFU | 125351 tok/s step 5116/19560 | loss 3.600132 (+0.98z)| norm 0.3407 (+0.83z)| lr 5.22e-04 | 4171.93 ms | 32.4% bf16 MFU | 125367 tok/s step 5117/19560 | loss 3.538854 (-0.58z)| norm 0.2834 (-0.09z)| lr 5.22e-04 | 4175.79 ms | 32.3% bf16 MFU | 125376 tok/s step 5118/19560 | loss 3.568219 (+0.17z)| norm 0.2634 (-0.40z)| lr 5.22e-04 | 4167.51 ms | 32.4% bf16 MFU | 125398 tok/s step 5119/19560 | loss 3.517941 (-1.11z)| norm 0.2625 (-0.41z)| lr 5.22e-04 | 4180.94 ms | 32.3% bf16 MFU | 125398 tok/s step 5120/19560 | loss 3.550000 (-0.27z)| norm 0.2628 (-0.40z)| lr 5.22e-04 | 4165.66 ms | 32.4% bf16 MFU | 125421 tok/s step 5121/19560 | loss 3.596494 (+0.92z)| norm 0.2880 (-0.00z)| lr 5.22e-04 | 4179.85 ms | 32.3% bf16 MFU | 125421 tok/s step 5122/19560 | loss 3.559764 (-0.02z)| norm 0.2882 (+0.00z)| lr 5.22e-04 | 4179.12 ms | 32.3% bf16 MFU | 125423 tok/s step 5123/19560 | loss 3.563391 (+0.07z)| norm 0.2628 (-0.40z)| lr 5.22e-04 | 4163.32 ms | 32.4% bf16 MFU | 125448 tok/s step 5124/19560 | loss 3.509941 (-1.29z)| norm 0.2937 (+0.09z)| lr 
5.22e-04 | 4174.03 ms | 32.3% bf16 MFU | 125456 tok/s step 5125/19560 | loss 3.545996 (-0.36z)| norm 0.2860 (-0.03z)| lr 5.22e-04 | 4168.61 ms | 32.4% bf16 MFU | 125472 tok/s step 5126/19560 | loss 3.526032 (-0.86z)| norm 0.2778 (-0.16z)| lr 5.22e-04 | 4163.61 ms | 32.4% bf16 MFU | 125494 tok/s step 5127/19560 | loss 3.603516 (+1.19z)| norm 0.2870 (-0.01z)| lr 5.22e-04 | 4164.57 ms | 32.4% bf16 MFU | 125514 tok/s step 5128/19560 | loss 3.570529 (+0.30z)| norm 0.3304 (+0.68z)| lr 5.22e-04 | 4176.09 ms | 32.3% bf16 MFU | 125516 tok/s step 5129/19560 | loss 3.563568 (+0.12z)| norm 0.2923 (+0.07z)| lr 5.22e-04 | 4177.24 ms | 32.3% bf16 MFU | 125516 tok/s step 5130/19560 | loss 3.532495 (-0.70z)| norm 0.2951 (+0.11z)| lr 5.22e-04 | 4167.59 ms | 32.4% bf16 MFU | 125530 tok/s step 5131/19560 | loss 3.515880 (-1.12z)| norm 0.2668 (-0.34z)| lr 5.22e-04 | 4168.13 ms | 32.4% bf16 MFU | 125543 tok/s step 5132/19560 | loss 3.567008 (+0.24z)| norm 0.2918 (+0.06z)| lr 5.22e-04 | 4168.99 ms | 32.4% bf16 MFU | 125553 tok/s step 5133/19560 | loss 3.513367 (-1.18z)| norm 0.2991 (+0.17z)| lr 5.22e-04 | 4174.74 ms | 32.3% bf16 MFU | 125555 tok/s step 5134/19560 | loss 3.592681 (+0.94z)| norm 0.3225 (+0.54z)| lr 5.22e-04 | 4164.69 ms | 32.4% bf16 MFU | 125572 tok/s step 5135/19560 | loss 3.565120 (+0.20z)| norm 0.3269 (+0.61z)| lr 5.22e-04 | 4162.95 ms | 32.4% bf16 MFU | 125590 tok/s step 5136/19560 | loss 3.554988 (-0.06z)| norm 0.3288 (+0.63z)| lr 5.22e-04 | 4174.80 ms | 32.3% bf16 MFU | 125590 tok/s step 5137/19560 | loss 3.610765 (+1.43z)| norm 0.2893 (+0.01z)| lr 5.22e-04 | 4158.72 ms | 32.5% bf16 MFU | 125614 tok/s step 5138/19560 | loss 3.557453 (-0.01z)| norm 0.3342 (+0.72z)| lr 5.22e-04 | 4171.47 ms | 32.4% bf16 MFU | 125617 tok/s step 5139/19560 | loss 3.539826 (-0.48z)| norm 0.3079 (+0.29z)| lr 5.22e-04 | 4198.03 ms | 32.2% bf16 MFU | 125581 tok/s step 5140/19560 | loss 3.486546 (-1.87z)| norm 0.2774 (-0.19z)| lr 5.22e-04 | 4187.34 ms | 32.2% bf16 MFU | 125562 tok/s step 
5141/19560 | loss 3.556589 (-0.00z)| norm 0.3189 (+0.46z)| lr 5.22e-04 | 4172.96 ms | 32.4% bf16 MFU | 125566 tok/s step 5142/19560 | loss 3.558497 (+0.07z)| norm 0.2825 (-0.12z)| lr 5.22e-04 | 4170.03 ms | 32.4% bf16 MFU | 125574 tok/s step 5143/19560 | loss 3.564637 (+0.23z)| norm 0.2875 (-0.04z)| lr 5.22e-04 | 4170.74 ms | 32.4% bf16 MFU | 125581 tok/s step 5144/19560 | loss 3.551883 (-0.11z)| norm 0.2506 (-0.62z)| lr 5.22e-04 | 4176.23 ms | 32.3% bf16 MFU | 125579 tok/s step 5145/19560 | loss 3.514355 (-1.11z)| norm 0.2804 (-0.15z)| lr 5.21e-04 | 4166.66 ms | 32.4% bf16 MFU | 125591 tok/s step 5146/19560 | loss 3.537068 (-0.49z)| norm 0.2794 (-0.16z)| lr 5.21e-04 | 4184.74 ms | 32.3% bf16 MFU | 125576 tok/s step 5147/19560 | loss 3.596776 (+1.13z)| norm 0.2650 (-0.39z)| lr 5.21e-04 | 4174.53 ms | 32.3% bf16 MFU | 125577 tok/s step 5148/19560 | loss 3.487075 (-1.80z)| norm 0.2870 (-0.04z)| lr 5.21e-04 | 4167.03 ms | 32.4% bf16 MFU | 125589 tok/s step 5149/19560 | loss 3.578075 (+0.62z)| norm 0.2893 (-0.01z)| lr 5.21e-04 | 4165.17 ms | 32.4% bf16 MFU | 125603 tok/s step 5150/19560 | loss 3.528687 (-0.71z)| norm 0.2752 (-0.23z)| lr 5.21e-04 | 4178.08 ms | 32.3% bf16 MFU | 125597 tok/s step 5151/19560 | loss 3.481208 (-1.95z)| norm 0.2838 (-0.10z)| lr 5.21e-04 | 4174.85 ms | 32.3% bf16 MFU | 125597 tok/s step 5152/19560 | loss 3.610945 (+1.48z)| norm 0.2871 (-0.05z)| lr 5.21e-04 | 4161.51 ms | 32.4% bf16 MFU | 125616 tok/s step 5153/19560 | loss 3.542170 (-0.32z)| norm 0.2716 (-0.29z)| lr 5.21e-04 | 4168.07 ms | 32.4% bf16 MFU | 125625 tok/s step 5154/19560 | loss 3.498210 (-1.47z)| norm 0.2943 (+0.06z)| lr 5.21e-04 | 4178.03 ms | 32.3% bf16 MFU | 125618 tok/s step 5155/19560 | loss 3.506963 (-1.22z)| norm 0.2590 (-0.50z)| lr 5.21e-04 | 4181.21 ms | 32.3% bf16 MFU | 125606 tok/s step 5156/19560 | loss 3.558421 (+0.13z)| norm 0.2910 (+0.01z)| lr 5.21e-04 | 4202.84 ms | 32.1% bf16 MFU | 125563 tok/s step 5157/19560 | loss 3.552474 (-0.01z)| norm 0.2784 (-0.19z)| lr 
5.21e-04 | 4160.72 ms | 32.5% bf16 MFU | 125586 tok/s step 5158/19560 | loss 3.604218 (+1.38z)| norm 0.2868 (-0.06z)| lr 5.21e-04 | 4161.09 ms | 32.4% bf16 MFU | 125606 tok/s step 5159/19560 | loss 3.623733 (+1.87z)| norm 0.2903 (-0.01z)| lr 5.21e-04 | 4166.30 ms | 32.4% bf16 MFU | 125618 tok/s step 5160/19560 | loss 3.489177 (-1.67z)| norm 0.3204 (+0.47z)| lr 5.21e-04 | 4170.49 ms | 32.4% bf16 MFU | 125623 tok/s step 5161/19560 | loss 3.594089 (+1.07z)| norm 0.3145 (+0.37z)| lr 5.21e-04 | 4170.76 ms | 32.4% bf16 MFU | 125627 tok/s step 5162/19560 | loss 3.478565 (-1.91z)| norm 0.3110 (+0.31z)| lr 5.21e-04 | 4171.48 ms | 32.4% bf16 MFU | 125630 tok/s step 5163/19560 | loss 3.558100 (+0.14z)| norm 0.2764 (-0.25z)| lr 5.21e-04 | 4151.09 ms | 32.5% bf16 MFU | 125663 tok/s step 5164/19560 | loss 3.483079 (-1.77z)| norm 0.3430 (+0.80z)| lr 5.21e-04 | 4168.90 ms | 32.4% bf16 MFU | 125668 tok/s step 5165/19560 | loss 3.548264 (-0.07z)| norm 0.3010 (+0.13z)| lr 5.21e-04 | 4163.42 ms | 32.4% bf16 MFU | 125681 tok/s step 5166/19560 | loss 3.621231 (+1.81z)| norm 0.2559 (-0.58z)| lr 5.21e-04 | 4172.04 ms | 32.4% bf16 MFU | 125680 tok/s step 5167/19560 | loss 3.527782 (-0.59z)| norm 0.2934 (+0.02z)| lr 5.21e-04 | 4163.76 ms | 32.4% bf16 MFU | 125692 tok/s step 5168/19560 | loss 3.508516 (-1.09z)| norm 0.2650 (-0.43z)| lr 5.21e-04 | 4164.31 ms | 32.4% bf16 MFU | 125703 tok/s step 5169/19560 | loss 3.561398 (+0.28z)| norm 0.2677 (-0.38z)| lr 5.21e-04 | 4165.97 ms | 32.4% bf16 MFU | 125710 tok/s step 5170/19560 | loss 3.542511 (-0.20z)| norm 0.2481 (-0.69z)| lr 5.21e-04 | 4159.83 ms | 32.5% bf16 MFU | 125726 tok/s step 5171/19560 | loss 3.606747 (+1.46z)| norm 0.2757 (-0.25z)| lr 5.21e-04 | 4162.35 ms | 32.4% bf16 MFU | 125738 tok/s step 5172/19560 | loss 3.541410 (-0.23z)| norm 0.2848 (-0.11z)| lr 5.21e-04 | 4165.80 ms | 32.4% bf16 MFU | 125744 tok/s step 5173/19560 | loss 3.563044 (+0.34z)| norm 0.3048 (+0.21z)| lr 5.21e-04 | 4168.98 ms | 32.4% bf16 MFU | 125745 tok/s step 
5174/19560 | loss 3.512853 (-0.96z)| norm 0.2733 (-0.29z)| lr 5.21e-04 | 4174.89 ms | 32.3% bf16 MFU | 125736 tok/s step 5175/19560 | loss 3.496886 (-1.35z)| norm 0.2916 (-0.00z)| lr 5.20e-04 | 4164.96 ms | 32.4% bf16 MFU | 125744 tok/s step 5176/19560 | loss 3.508832 (-1.03z)| norm 0.2985 (+0.10z)| lr 5.20e-04 | 4168.98 ms | 32.4% bf16 MFU | 125744 tok/s step 5177/19560 | loss 3.528332 (-0.52z)| norm 0.2973 (+0.08z)| lr 5.20e-04 | 4161.42 ms | 32.4% bf16 MFU | 125757 tok/s step 5178/19560 | loss 3.524500 (-0.62z)| norm 0.2714 (-0.33z)| lr 5.20e-04 | 4241.59 ms | 31.8% bf16 MFU | 125649 tok/s step 5179/19560 | loss 3.640253 (+2.35z)| norm 0.2978 (+0.09z)| lr 5.20e-04 | 4160.21 ms | 32.5% bf16 MFU | 125668 tok/s step 5180/19560 | loss 3.528167 (-0.52z)| norm 0.2951 (+0.04z)| lr 5.20e-04 | 4159.33 ms | 32.5% bf16 MFU | 125687 tok/s step 5181/19560 | loss 3.514719 (-0.85z)| norm 0.2856 (-0.11z)| lr 5.20e-04 | 4165.25 ms | 32.4% bf16 MFU | 125696 tok/s step 5182/19560 | loss 3.597695 (+1.25z)| norm 0.3194 (+0.42z)| lr 5.20e-04 | 4170.25 ms | 32.4% bf16 MFU | 125698 tok/s step 5183/19560 | loss 3.533790 (-0.38z)| norm 0.2949 (+0.02z)| lr 5.20e-04 | 4160.65 ms | 32.5% bf16 MFU | 125713 tok/s step 5184/19560 | loss 3.571203 (+0.58z)| norm 0.2905 (-0.05z)| lr 5.20e-04 | 4160.68 ms | 32.5% bf16 MFU | 125728 tok/s step 5185/19560 | loss 3.560364 (+0.30z)| norm 0.3105 (+0.26z)| lr 5.20e-04 | 4191.59 ms | 32.2% bf16 MFU | 125696 tok/s step 5186/19560 | loss 3.560236 (+0.31z)| norm 0.3226 (+0.45z)| lr 5.20e-04 | 4160.71 ms | 32.5% bf16 MFU | 125711 tok/s step 5187/19560 | loss 3.711333 (+3.93z)| norm 0.3335 (+0.61z)| lr 5.20e-04 | 4467.97 ms | 30.2% bf16 MFU | 125293 tok/s step 5188/19560 | loss 3.534624 (-0.35z)| norm 0.3296 (+0.54z)| lr 5.20e-04 | 4158.03 ms | 32.5% bf16 MFU | 125333 tok/s step 5189/19560 | loss 3.548311 (-0.01z)| norm 0.2995 (+0.06z)| lr 5.20e-04 | 4167.92 ms | 32.4% bf16 MFU | 125356 tok/s step 5190/19560 | loss 3.529973 (-0.45z)| norm 0.2500 (-0.72z)| lr 
5.20e-04 | 4158.03 ms | 32.5% bf16 MFU | 125393 tok/s step 5191/19560 | loss 3.473197 (-1.82z)| norm 0.3056 (+0.16z)| lr 5.20e-04 | 4158.17 ms | 32.5% bf16 MFU | 125427 tok/s step 5192/19560 | loss 3.516064 (-0.76z)| norm 0.2625 (-0.52z)| lr 5.20e-04 | 4165.34 ms | 32.4% bf16 MFU | 125449 tok/s step 5193/19560 | loss 3.558124 (+0.27z)| norm 0.2714 (-0.38z)| lr 5.20e-04 | 4167.82 ms | 32.4% bf16 MFU | 125467 tok/s step 5194/19560 | loss 3.560488 (+0.33z)| norm 0.2787 (-0.26z)| lr 5.20e-04 | 4156.13 ms | 32.5% bf16 MFU | 125501 tok/s step 5195/19560 | loss 3.551781 (+0.12z)| norm 0.2526 (-0.67z)| lr 5.20e-04 | 4159.86 ms | 32.5% bf16 MFU | 125527 tok/s step 5196/19560 | loss 3.556627 (+0.24z)| norm 0.2884 (-0.10z)| lr 5.20e-04 | 4163.15 ms | 32.4% bf16 MFU | 125548 tok/s step 5197/19560 | loss 3.582022 (+0.85z)| norm 0.2928 (-0.03z)| lr 5.20e-04 | 4185.01 ms | 32.3% bf16 MFU | 125534 tok/s step 5198/19560 | loss 3.494595 (-1.29z)| norm 0.2959 (+0.01z)| lr 5.20e-04 | 4169.41 ms | 32.4% bf16 MFU | 125545 tok/s step 5199/19560 | loss 3.496473 (-1.25z)| norm 0.2677 (-0.43z)| lr 5.20e-04 | 4160.96 ms | 32.4% bf16 MFU | 125568 tok/s step 5200/19560 | loss 3.566618 (+0.47z)| norm 0.2811 (-0.22z)| lr 5.20e-04 | 4194.56 ms | 32.2% bf16 MFU | 125539 tok/s step 5201/19560 | loss 3.670758 (+2.91z)| norm 0.2973 (+0.03z)| lr 5.20e-04 | 4178.07 ms | 32.3% bf16 MFU | 125536 tok/s step 5202/19560 | loss 3.505955 (-1.02z)| norm 0.3064 (+0.17z)| lr 5.20e-04 | 4164.40 ms | 32.4% bf16 MFU | 125554 tok/s step 5203/19560 | loss 3.510569 (-0.90z)| norm 0.2967 (+0.02z)| lr 5.20e-04 | 4176.73 ms | 32.3% bf16 MFU | 125553 tok/s step 5204/19560 | loss 3.554723 (+0.14z)| norm 0.2610 (-0.55z)| lr 5.19e-04 | 4152.69 ms | 32.5% bf16 MFU | 125588 tok/s step 5205/19560 | loss 3.557345 (+0.20z)| norm 0.2567 (-0.61z)| lr 5.19e-04 | 4172.60 ms | 32.4% bf16 MFU | 125591 tok/s step 5206/19560 | loss 3.576707 (+0.66z)| norm 0.2871 (-0.13z)| lr 5.19e-04 | 4160.59 ms | 32.5% bf16 MFU | 125612 tok/s step 
5207/19560 | loss 3.575824 (+0.63z)| norm 0.2639 (-0.49z)| lr 5.19e-04 | 4163.16 ms | 32.4% bf16 MFU | 125628 tok/s step 5208/19560 | loss 3.495685 (-1.29z)| norm 0.2646 (-0.47z)| lr 5.19e-04 | 4165.80 ms | 32.4% bf16 MFU | 125640 tok/s step 5209/19560 | loss 3.582254 (+0.78z)| norm 0.2929 (-0.02z)| lr 5.19e-04 | 4174.00 ms | 32.3% bf16 MFU | 125638 tok/s step 5210/19560 | loss 3.532824 (-0.40z)| norm 0.3061 (+0.18z)| lr 5.19e-04 | 4160.25 ms | 32.5% bf16 MFU | 125657 tok/s step 5211/19560 | loss 3.492414 (-1.35z)| norm 0.3247 (+0.47z)| lr 5.19e-04 | 4195.35 ms | 32.2% bf16 MFU | 125623 tok/s step 5212/19560 | loss 3.687172 (+3.19z)| norm 0.3165 (+0.34z)| lr 5.19e-04 | 4185.34 ms | 32.3% bf16 MFU | 125605 tok/s step 5213/19560 | loss 3.539380 (-0.22z)| norm 0.3054 (+0.66z)| lr 5.19e-04 | 4169.76 ms | 32.4% bf16 MFU | 125612 tok/s step 5214/19560 | loss 3.563035 (+0.34z)| norm 0.3439 (+2.30z)| lr 5.19e-04 | 4159.66 ms | 32.5% bf16 MFU | 125633 tok/s step 5215/19560 | loss 3.552726 (+0.09z)| norm 0.3083 (+0.77z)| lr 5.19e-04 | 4159.91 ms | 32.5% bf16 MFU | 125653 tok/s step 5216/19560 | loss 3.526490 (-0.54z)| norm 0.2805 (-0.40z)| lr 5.19e-04 | 4168.40 ms | 32.4% bf16 MFU | 125659 tok/s step 5217/19560 | loss 3.597558 (+1.15z)| norm 0.2912 (+0.05z)| lr 5.19e-04 | 4166.52 ms | 32.4% bf16 MFU | 125668 tok/s step 5218/19560 | loss 3.505837 (-1.02z)| norm 0.3552 (+2.71z)| lr 5.19e-04 | 4167.52 ms | 32.4% bf16 MFU | 125675 tok/s step 5219/19560 | loss 3.608470 (+1.41z)| norm 0.2850 (-0.22z)| lr 5.19e-04 | 4166.36 ms | 32.4% bf16 MFU | 125683 tok/s step 5220/19560 | loss 3.553550 (+0.10z)| norm 0.2999 (+0.40z)| lr 5.19e-04 | 4166.00 ms | 32.4% bf16 MFU | 125691 tok/s step 5221/19560 | loss 3.478967 (-1.63z)| norm 0.3052 (+0.62z)| lr 5.19e-04 | 4194.36 ms | 32.2% bf16 MFU | 125657 tok/s step 5222/19560 | loss 3.566118 (+0.40z)| norm 0.2877 (-0.13z)| lr 5.19e-04 | 4162.23 ms | 32.4% bf16 MFU | 125672 tok/s step 5223/19560 | loss 3.555587 (+0.15z)| norm 0.2684 (-0.94z)| lr 
5.19e-04 | 4176.41 ms | 32.3% bf16 MFU | 125665 tok/s step 5224/19560 | loss 3.542808 (-0.15z)| norm 0.2884 (-0.11z)| lr 5.19e-04 | 4157.49 ms | 32.5% bf16 MFU | 125687 tok/s step 5225/19560 | loss 3.524294 (-0.59z)| norm 0.2985 (+0.32z)| lr 5.19e-04 | 4167.61 ms | 32.4% bf16 MFU | 125693 tok/s step 5226/19560 | loss 3.540534 (-0.19z)| norm 0.2736 (-0.74z)| lr 5.19e-04 | 4159.88 ms | 32.5% bf16 MFU | 125710 tok/s step 5227/19560 | loss 3.491407 (-1.37z)| norm 0.2541 (-1.56z)| lr 5.19e-04 | 4157.63 ms | 32.5% bf16 MFU | 125730 tok/s step 5228/19560 | loss 3.539807 (-0.20z)| norm 0.3300 (+1.66z)| lr 5.19e-04 | 4150.58 ms | 32.5% bf16 MFU | 125759 tok/s step 5229/19560 | loss 3.542357 (-0.15z)| norm 0.2582 (-1.38z)| lr 5.19e-04 | 4159.16 ms | 32.5% bf16 MFU | 125774 tok/s step 5230/19560 | loss 3.603656 (+1.32z)| norm 0.2887 (-0.09z)| lr 5.19e-04 | 4170.41 ms | 32.4% bf16 MFU | 125771 tok/s step 5231/19560 | loss 3.564991 (+0.38z)| norm 0.2789 (-0.51z)| lr 5.19e-04 | 4176.00 ms | 32.3% bf16 MFU | 125760 tok/s step 5232/19560 | loss 3.489274 (-1.42z)| norm 0.3204 (+1.24z)| lr 5.19e-04 | 4164.57 ms | 32.4% bf16 MFU | 125766 tok/s step 5233/19560 | loss 3.514698 (-0.81z)| norm 0.3197 (+1.20z)| lr 5.18e-04 | 4155.30 ms | 32.5% bf16 MFU | 125787 tok/s step 5234/19560 | loss 3.570089 (+0.50z)| norm 0.2970 (+0.24z)| lr 5.18e-04 | 4157.52 ms | 32.5% bf16 MFU | 125803 tok/s step 5235/19560 | loss 3.511415 (-0.90z)| norm 0.2706 (-0.86z)| lr 5.18e-04 | 4159.62 ms | 32.5% bf16 MFU | 125815 tok/s step 5236/19560 | loss 3.562369 (+0.31z)| norm 0.2868 (-0.18z)| lr 5.18e-04 | 4160.28 ms | 32.5% bf16 MFU | 125825 tok/s step 5237/19560 | loss 3.463516 (-2.02z)| norm 0.2580 (-1.38z)| lr 5.18e-04 | 4155.16 ms | 32.5% bf16 MFU | 125843 tok/s step 5238/19560 | loss 3.538607 (-0.24z)| norm 0.2728 (-0.76z)| lr 5.18e-04 | 4159.01 ms | 32.5% bf16 MFU | 125854 tok/s step 5239/19560 | loss 3.514018 (-0.82z)| norm 0.2720 (-0.78z)| lr 5.18e-04 | 4161.20 ms | 32.4% bf16 MFU | 125861 tok/s step 
5240/19560 | loss 3.515412 (-0.78z)| norm 0.2988 (+0.34z)| lr 5.18e-04 | 4162.12 ms | 32.4% bf16 MFU | 125866 tok/s step 5241/19560 | loss 3.559716 (+0.27z)| norm 0.2764 (-0.60z)| lr 5.18e-04 | 4178.41 ms | 32.3% bf16 MFU | 125846 tok/s step 5242/19560 | loss 3.503073 (-1.05z)| norm 0.2793 (-0.49z)| lr 5.18e-04 | 4154.37 ms | 32.5% bf16 MFU | 125864 tok/s step 5243/19560 | loss 3.529408 (-0.43z)| norm 0.2928 (+0.11z)| lr 5.18e-04 | 4163.06 ms | 32.4% bf16 MFU | 125868 tok/s step 5244/19560 | loss 3.525401 (-0.51z)| norm 0.2628 (-1.22z)| lr 5.18e-04 | 4163.14 ms | 32.4% bf16 MFU | 125871 tok/s step 5245/19560 | loss 3.519193 (-0.66z)| norm 0.3237 (+1.52z)| lr 5.18e-04 | 4159.17 ms | 32.5% bf16 MFU | 125880 tok/s step 5246/19560 | loss 3.537335 (-0.22z)| norm 0.2993 (+0.41z)| lr 5.18e-04 | 4157.40 ms | 32.5% bf16 MFU | 125892 tok/s step 5247/19560 | loss 3.506286 (-0.95z)| norm 0.2814 (-0.41z)| lr 5.18e-04 | 4160.36 ms | 32.5% bf16 MFU | 125898 tok/s step 5248/19560 | loss 3.506437 (-0.94z)| norm 0.2940 (+0.15z)| lr 5.18e-04 | 4167.01 ms | 32.4% bf16 MFU | 125894 tok/s step 5249/19560 | loss 3.527666 (-0.43z)| norm 0.2997 (+0.41z)| lr 5.18e-04 | 4178.90 ms | 32.3% bf16 MFU | 125873 tok/s step 5250/19560 | loss 3.549173 (+0.08z)| norm 0.2749 (-0.72z)| lr 5.18e-04 | 4169.19 ms | 32.4% bf16 MFU | 125867 tok/s val loss 3.540486 evaluating HellaSwag: 0/1256 evaluating HellaSwag: 10/1256 evaluating HellaSwag: 20/1256 evaluating HellaSwag: 30/1256 evaluating HellaSwag: 40/1256 evaluating HellaSwag: 50/1256 evaluating HellaSwag: 60/1256 evaluating HellaSwag: 70/1256 evaluating HellaSwag: 80/1256 evaluating HellaSwag: 90/1256 evaluating HellaSwag: 100/1256 evaluating HellaSwag: 110/1256 evaluating HellaSwag: 120/1256 evaluating HellaSwag: 130/1256 evaluating HellaSwag: 140/1256 evaluating HellaSwag: 150/1256 evaluating HellaSwag: 160/1256 evaluating HellaSwag: 170/1256 evaluating HellaSwag: 180/1256 evaluating HellaSwag: 190/1256 evaluating HellaSwag: 200/1256 evaluating 
HellaSwag: 210/1256 evaluating HellaSwag: 220/1256 evaluating HellaSwag: 230/1256 evaluating HellaSwag: 240/1256 evaluating HellaSwag: 250/1256 evaluating HellaSwag: 260/1256 evaluating HellaSwag: 270/1256 evaluating HellaSwag: 280/1256 evaluating HellaSwag: 290/1256 evaluating HellaSwag: 300/1256 evaluating HellaSwag: 310/1256 evaluating HellaSwag: 320/1256 evaluating HellaSwag: 330/1256 evaluating HellaSwag: 340/1256 evaluating HellaSwag: 350/1256 evaluating HellaSwag: 360/1256 evaluating HellaSwag: 370/1256 evaluating HellaSwag: 380/1256 evaluating HellaSwag: 390/1256 evaluating HellaSwag: 400/1256 evaluating HellaSwag: 410/1256 evaluating HellaSwag: 420/1256 evaluating HellaSwag: 430/1256 evaluating HellaSwag: 440/1256 evaluating HellaSwag: 450/1256 evaluating HellaSwag: 460/1256 evaluating HellaSwag: 470/1256 evaluating HellaSwag: 480/1256 evaluating HellaSwag: 490/1256 evaluating HellaSwag: 500/1256 evaluating HellaSwag: 510/1256 evaluating HellaSwag: 520/1256 evaluating HellaSwag: 530/1256 evaluating HellaSwag: 540/1256 evaluating HellaSwag: 550/1256 evaluating HellaSwag: 560/1256 evaluating HellaSwag: 570/1256 evaluating HellaSwag: 580/1256 evaluating HellaSwag: 590/1256 evaluating HellaSwag: 600/1256 evaluating HellaSwag: 610/1256 evaluating HellaSwag: 620/1256 evaluating HellaSwag: 630/1256 evaluating HellaSwag: 640/1256 evaluating HellaSwag: 650/1256 evaluating HellaSwag: 660/1256 evaluating HellaSwag: 670/1256 evaluating HellaSwag: 680/1256 evaluating HellaSwag: 690/1256 evaluating HellaSwag: 700/1256 evaluating HellaSwag: 710/1256 evaluating HellaSwag: 720/1256 evaluating HellaSwag: 730/1256 evaluating HellaSwag: 740/1256 evaluating HellaSwag: 750/1256 evaluating HellaSwag: 760/1256 evaluating HellaSwag: 770/1256 evaluating HellaSwag: 780/1256 evaluating HellaSwag: 790/1256 evaluating HellaSwag: 800/1256 evaluating HellaSwag: 810/1256 evaluating HellaSwag: 820/1256 evaluating HellaSwag: 830/1256 evaluating HellaSwag: 840/1256 evaluating HellaSwag: 
850/1256 evaluating HellaSwag: 860/1256 evaluating HellaSwag: 870/1256 evaluating HellaSwag: 880/1256 evaluating HellaSwag: 890/1256 evaluating HellaSwag: 900/1256 evaluating HellaSwag: 910/1256 evaluating HellaSwag: 920/1256 evaluating HellaSwag: 930/1256 evaluating HellaSwag: 940/1256 evaluating HellaSwag: 950/1256 evaluating HellaSwag: 960/1256 evaluating HellaSwag: 970/1256 evaluating HellaSwag: 980/1256 evaluating HellaSwag: 990/1256 evaluating HellaSwag: 1000/1256 evaluating HellaSwag: 1010/1256 evaluating HellaSwag: 1020/1256 evaluating HellaSwag: 1030/1256 evaluating HellaSwag: 1040/1256 evaluating HellaSwag: 1050/1256 evaluating HellaSwag: 1060/1256 evaluating HellaSwag: 1070/1256 evaluating HellaSwag: 1080/1256 evaluating HellaSwag: 1090/1256 evaluating HellaSwag: 1100/1256 evaluating HellaSwag: 1110/1256 evaluating HellaSwag: 1120/1256 evaluating HellaSwag: 1130/1256 evaluating HellaSwag: 1140/1256 evaluating HellaSwag: 1150/1256 evaluating HellaSwag: 1160/1256 evaluating HellaSwag: 1170/1256 evaluating HellaSwag: 1180/1256 evaluating HellaSwag: 1190/1256 evaluating HellaSwag: 1200/1256 evaluating HellaSwag: 1210/1256 evaluating HellaSwag: 1220/1256 evaluating HellaSwag: 1230/1256 evaluating HellaSwag: 1240/1256 evaluating HellaSwag: 1250/1256 HellaSwag: 2796/10042 = 0.278431 step 5251/19560 | loss 3.458388 (-2.02z)| norm 0.2744 (-0.75z)| lr 5.18e-04 | 4278.89 ms | 31.6% bf16 MFU | 125700 tok/s step 5252/19560 | loss 3.568746 (+0.55z)| norm 0.2662 (-1.11z)| lr 5.18e-04 | 4161.27 ms | 32.4% bf16 MFU | 125714 tok/s step 5253/19560 | loss 3.534644 (-0.25z)| norm 0.3043 (+0.62z)| lr 5.18e-04 | 4176.85 ms | 32.3% bf16 MFU | 125705 tok/s step 5254/19560 | loss 3.551779 (+0.15z)| norm 0.2787 (-0.54z)| lr 5.18e-04 | 4164.21 ms | 32.4% bf16 MFU | 125715 tok/s step 5255/19560 | loss 3.527874 (-0.40z)| norm 0.2800 (-0.48z)| lr 5.18e-04 | 4237.38 ms | 31.9% bf16 MFU | 125615 tok/s step 5256/19560 | loss 3.615925 (+1.65z)| norm 0.2937 (+0.15z)| lr 5.18e-04 | 4166.15 
ms | 32.4% bf16 MFU | 125627 tok/s step 5257/19560 | loss 3.573869 (+0.67z)| norm 0.3230 (+1.48z)| lr 5.18e-04 | 4159.74 ms | 32.5% bf16 MFU | 125648 tok/s step 5258/19560 | loss 3.539828 (-0.13z)| norm 0.3265 (+1.62z)| lr 5.18e-04 | 4166.50 ms | 32.4% bf16 MFU | 125657 tok/s step 5259/19560 | loss 3.515093 (-0.70z)| norm 0.2866 (-0.20z)| lr 5.18e-04 | 4164.70 ms | 32.4% bf16 MFU | 125668 tok/s step 5260/19560 | loss 3.574266 (+0.67z)| norm 0.2815 (-0.43z)| lr 5.18e-04 | 4159.43 ms | 32.5% bf16 MFU | 125687 tok/s step 5261/19560 | loss 3.574668 (+0.67z)| norm 0.2871 (-0.17z)| lr 5.18e-04 | 4160.98 ms | 32.4% bf16 MFU | 125703 tok/s step 5262/19560 | loss 3.517297 (-0.65z)| norm 0.2538 (-1.65z)| lr 5.18e-04 | 4150.47 ms | 32.5% bf16 MFU | 125734 tok/s step 5263/19560 | loss 3.541355 (-0.09z)| norm 0.2643 (-1.16z)| lr 5.17e-04 | 4162.03 ms | 32.4% bf16 MFU | 125746 tok/s step 5264/19560 | loss 3.600226 (+1.28z)| norm 0.2682 (-0.97z)| lr 5.17e-04 | 4157.38 ms | 32.5% bf16 MFU | 125764 tok/s step 5265/19560 | loss 3.464910 (-1.84z)| norm 0.2594 (-1.36z)| lr 5.17e-04 | 4162.38 ms | 32.4% bf16 MFU | 125774 tok/s step 5266/19560 | loss 3.521199 (-0.53z)| norm 0.2771 (-0.53z)| lr 5.17e-04 | 4159.26 ms | 32.5% bf16 MFU | 125788 tok/s step 5267/19560 | loss 3.590837 (+1.07z)| norm 0.2916 (+0.15z)| lr 5.17e-04 | 4157.08 ms | 32.5% bf16 MFU | 125804 tok/s step 5268/19560 | loss 3.535321 (-0.22z)| norm 0.2605 (-1.29z)| lr 5.17e-04 | 4168.04 ms | 32.4% bf16 MFU | 125803 tok/s step 5269/19560 | loss 3.571898 (+0.63z)| norm 0.2576 (-1.40z)| lr 5.17e-04 | 4145.84 ms | 32.6% bf16 MFU | 125836 tok/s step 5270/19560 | loss 3.530588 (-0.32z)| norm 0.2759 (-0.55z)| lr 5.17e-04 | 4172.27 ms | 32.4% bf16 MFU | 125827 tok/s step 5271/19560 | loss 3.583226 (+0.89z)| norm 0.2917 (+0.18z)| lr 5.17e-04 | 4163.30 ms | 32.4% bf16 MFU | 125833 tok/s step 5272/19560 | loss 3.544364 (-0.01z)| norm 0.2960 (+0.36z)| lr 5.17e-04 | 4151.76 ms | 32.5% bf16 MFU | 125855 tok/s step 5273/19560 | loss 
3.557108 (+0.28z)| norm 0.2489 (-1.80z)| lr 5.17e-04 | 4157.14 ms | 32.5% bf16 MFU | 125868 tok/s step 5274/19560 | loss 3.525633 (-0.45z)| norm 0.2537 (-1.56z)| lr 5.17e-04 | 4158.55 ms | 32.5% bf16 MFU | 125878 tok/s step 5275/19560 | loss 3.530764 (-0.32z)| norm 0.2703 (-0.80z)| lr 5.17e-04 | 4157.90 ms | 32.5% bf16 MFU | 125889 tok/s step 5276/19560 | loss 3.523389 (-0.50z)| norm 0.2625 (-1.14z)| lr 5.17e-04 | 4156.17 ms | 32.5% bf16 MFU | 125902 tok/s step 5277/19560 | loss 3.546159 (+0.04z)| norm 0.3307 (+1.92z)| lr 5.17e-04 | 4161.97 ms | 32.4% bf16 MFU | 125906 tok/s step 5278/19560 | loss 3.588493 (+1.02z)| norm 0.3685 (+3.42z)| lr 5.17e-04 | 4163.61 ms | 32.4% bf16 MFU | 125906 tok/s step 5279/19560 | loss 3.533408 (-0.28z)| norm 0.3380 (+2.06z)| lr 5.17e-04 | 4152.10 ms | 32.5% bf16 MFU | 125925 tok/s step 5280/19560 | loss 3.487680 (-1.34z)| norm 0.3036 (+0.60z)| lr 5.17e-04 | 4160.17 ms | 32.5% bf16 MFU | 125930 tok/s step 5281/19560 | loss 3.542515 (-0.04z)| norm 0.3169 (+1.15z)| lr 5.17e-04 | 4152.46 ms | 32.5% bf16 MFU | 125946 tok/s step 5282/19560 | loss 3.511569 (-0.78z)| norm 0.2957 (+0.25z)| lr 5.17e-04 | 4155.70 ms | 32.5% bf16 MFU | 125957 tok/s step 5283/19560 | loss 3.564764 (+0.47z)| norm 0.2811 (-0.36z)| lr 5.17e-04 | 4150.77 ms | 32.5% bf16 MFU | 125975 tok/s step 5284/19560 | loss 3.743004 (+4.31z)| norm 0.2990 (+0.39z)| lr 5.17e-04 | 4152.23 ms | 32.5% bf16 MFU | 125989 tok/s step 5285/19560 | loss 3.588558 (+0.92z)| norm 0.3328 (+1.77z)| lr 5.17e-04 | 4166.33 ms | 32.4% bf16 MFU | 125982 tok/s step 5286/19560 | loss 3.489131 (-1.24z)| norm 0.3109 (+0.85z)| lr 5.17e-04 | 4157.28 ms | 32.5% bf16 MFU | 125988 tok/s step 5287/19560 | loss 3.583896 (+0.85z)| norm 0.3236 (+1.36z)| lr 5.17e-04 | 4154.06 ms | 32.5% bf16 MFU | 125999 tok/s step 5288/19560 | loss 3.534212 (-0.26z)| norm 0.2735 (-0.70z)| lr 5.17e-04 | 4166.57 ms | 32.4% bf16 MFU | 125991 tok/s step 5289/19560 | loss 3.600000 (+1.20z)| norm 0.2920 (+0.08z)| lr 5.17e-04 | 4165.89 
ms | 32.4% bf16 MFU | 125984 tok/s step 5290/19560 | loss 3.514430 (-0.71z)| norm 0.2922 (+0.09z)| lr 5.17e-04 | 4158.98 ms | 32.5% bf16 MFU | 125988 tok/s step 5291/19560 | loss 3.584173 (+0.84z)| norm 0.2609 (-1.20z)| lr 5.17e-04 | 4160.28 ms | 32.5% bf16 MFU | 125990 tok/s step 5292/19560 | loss 3.648804 (+2.22z)| norm 0.2900 (+0.02z)| lr 5.16e-04 | 4170.01 ms | 32.4% bf16 MFU | 125977 tok/s step 5293/19560 | loss 3.532867 (-0.32z)| norm 0.2734 (-0.67z)| lr 5.16e-04 | 4157.03 ms | 32.5% bf16 MFU | 125984 tok/s step 5294/19560 | loss 3.508846 (-0.84z)| norm 0.2881 (-0.06z)| lr 5.16e-04 | 4163.74 ms | 32.4% bf16 MFU | 125981 tok/s step 5295/19560 | loss 3.584616 (+0.83z)| norm 0.2881 (-0.06z)| lr 5.16e-04 | 4161.20 ms | 32.4% bf16 MFU | 125981 tok/s step 5296/19560 | loss 3.545813 (-0.03z)| norm 0.2846 (-0.22z)| lr 5.16e-04 | 4156.02 ms | 32.5% bf16 MFU | 125990 tok/s step 5297/19560 | loss 3.521515 (-0.57z)| norm 0.2602 (-1.26z)| lr 5.16e-04 | 4150.98 ms | 32.5% bf16 MFU | 126005 tok/s step 5298/19560 | loss 3.497395 (-1.09z)| norm 0.2546 (-1.50z)| lr 5.16e-04 | 4168.49 ms | 32.4% bf16 MFU | 125994 tok/s step 5299/19560 | loss 3.505274 (-0.90z)| norm 0.2578 (-1.35z)| lr 5.16e-04 | 4155.51 ms | 32.5% bf16 MFU | 126003 tok/s step 5300/19560 | loss 3.541771 (-0.09z)| norm 0.2448 (-1.86z)| lr 5.16e-04 | 4158.51 ms | 32.5% bf16 MFU | 126006 tok/s step 5301/19560 | loss 3.518695 (-0.59z)| norm 0.2649 (-1.00z)| lr 5.16e-04 | 4146.57 ms | 32.6% bf16 MFU | 126028 tok/s step 5302/19560 | loss 3.558514 (+0.28z)| norm 0.2792 (-0.41z)| lr 5.16e-04 | 4154.21 ms | 32.5% bf16 MFU | 126037 tok/s step 5303/19560 | loss 3.546873 (+0.01z)| norm 0.2485 (-1.67z)| lr 5.16e-04 | 4147.26 ms | 32.6% bf16 MFU | 126056 tok/s step 5304/19560 | loss 3.512654 (-0.75z)| norm 0.2596 (-1.18z)| lr 5.16e-04 | 4156.22 ms | 32.5% bf16 MFU | 126060 tok/s step 5305/19560 | loss 3.537531 (-0.20z)| norm 0.2914 (+0.13z)| lr 5.16e-04 | 4163.88 ms | 32.4% bf16 MFU | 126053 tok/s step 5306/19560 | loss 
3.536018 (-0.23z)| norm 0.2956 (+0.30z)| lr 5.16e-04 | 4163.25 ms | 32.4% bf16 MFU | 126047 tok/s step 5307/19560 | loss 3.549850 (+0.09z)| norm 0.2990 (+0.44z)| lr 5.16e-04 | 4159.48 ms | 32.5% bf16 MFU | 126047 tok/s step 5308/19560 | loss 3.501740 (-0.99z)| norm 0.2623 (-1.06z)| lr 5.16e-04 | 4153.54 ms | 32.5% bf16 MFU | 126056 tok/s step 5309/19560 | loss 3.550590 (+0.11z)| norm 0.2759 (-0.50z)| lr 5.16e-04 | 4169.52 ms | 32.4% bf16 MFU | 126040 tok/s step 5310/19560 | loss 3.525092 (-0.46z)| norm 0.2834 (-0.18z)| lr 5.16e-04 | 4158.60 ms | 32.5% bf16 MFU | 126042 tok/s step 5311/19560 | loss 3.602327 (+1.28z)| norm 0.3058 (+0.74z)| lr 5.16e-04 | 4161.08 ms | 32.4% bf16 MFU | 126040 tok/s step 5312/19560 | loss 3.521062 (-0.55z)| norm 0.2844 (-0.14z)| lr 5.16e-04 | 4157.04 ms | 32.5% bf16 MFU | 126044 tok/s step 5313/19560 | loss 3.556121 (+0.24z)| norm 0.2549 (-1.34z)| lr 5.16e-04 | 4158.21 ms | 32.5% bf16 MFU | 126046 tok/s step 5314/19560 | loss 3.510482 (-0.78z)| norm 0.2745 (-0.52z)| lr 5.16e-04 | 4153.06 ms | 32.5% bf16 MFU | 126056 tok/s step 5315/19560 | loss 3.506023 (-0.89z)| norm 0.2833 (-0.14z)| lr 5.16e-04 | 4166.13 ms | 32.4% bf16 MFU | 126045 tok/s step 5316/19560 | loss 3.574530 (+0.73z)| norm 0.2567 (-1.25z)| lr 5.16e-04 | 4150.03 ms | 32.5% bf16 MFU | 126059 tok/s step 5317/19560 | loss 3.478049 (-1.54z)| norm 0.2611 (-1.04z)| lr 5.16e-04 | 4162.82 ms | 32.4% bf16 MFU | 126054 tok/s step 5318/19560 | loss 3.537206 (-0.14z)| norm 0.2906 (+0.19z)| lr 5.16e-04 | 4160.11 ms | 32.5% bf16 MFU | 126052 tok/s step 5319/19560 | loss 3.537581 (-0.15z)| norm 0.2987 (+0.54z)| lr 5.16e-04 | 4155.48 ms | 32.5% bf16 MFU | 126058 tok/s step 5320/19560 | loss 3.531259 (-0.30z)| norm 0.2885 (+0.10z)| lr 5.15e-04 | 4166.17 ms | 32.4% bf16 MFU | 126047 tok/s step 5321/19560 | loss 3.600284 (+1.33z)| norm 0.2794 (-0.30z)| lr 5.15e-04 | 4162.31 ms | 32.4% bf16 MFU | 126043 tok/s step 5322/19560 | loss 3.549785 (+0.13z)| norm 0.3015 (+0.65z)| lr 5.15e-04 | 4155.48 
ms | 32.5% bf16 MFU | 126049 tok/s step 5323/19560 | loss 3.510057 (-0.80z)| norm 0.3009 (+0.61z)| lr 5.15e-04 | 4155.11 ms | 32.5% bf16 MFU | 126056 tok/s step 5324/19560 | loss 3.573162 (+0.69z)| norm 0.2779 (-0.38z)| lr 5.15e-04 | 4164.65 ms | 32.4% bf16 MFU | 126048 tok/s step 5325/19560 | loss 3.519703 (-0.56z)| norm 0.3007 (+0.60z)| lr 5.15e-04 | 4163.04 ms | 32.4% bf16 MFU | 126042 tok/s step 5326/19560 | loss 3.548132 (+0.10z)| norm 0.2672 (-0.84z)| lr 5.15e-04 | 4161.12 ms | 32.4% bf16 MFU | 126040 tok/s step 5327/19560 | loss 3.519265 (-0.59z)| norm 0.2678 (-0.81z)| lr 5.15e-04 | 4148.22 ms | 32.5% bf16 MFU | 126057 tok/s step 5328/19560 | loss 3.566658 (+0.54z)| norm 0.2650 (-0.92z)| lr 5.15e-04 | 4155.40 ms | 32.5% bf16 MFU | 126063 tok/s step 5329/19560 | loss 3.607076 (+1.56z)| norm 0.2716 (-0.63z)| lr 5.15e-04 | 4156.58 ms | 32.5% bf16 MFU | 126067 tok/s step 5330/19560 | loss 3.556435 (+0.31z)| norm 0.2705 (-0.67z)| lr 5.15e-04 | 4170.31 ms | 32.4% bf16 MFU | 126049 tok/s step 5331/19560 | loss 3.622425 (+1.89z)| norm 0.2955 (+0.41z)| lr 5.15e-04 | 4148.22 ms | 32.5% bf16 MFU | 126066 tok/s step 5332/19560 | loss 3.519696 (-0.60z)| norm 0.2978 (+0.50z)| lr 5.15e-04 | 4167.25 ms | 32.4% bf16 MFU | 126053 tok/s step 5333/19560 | loss 3.631029 (+2.06z)| norm 0.2880 (+0.06z)| lr 5.15e-04 | 4172.44 ms | 32.4% bf16 MFU | 126034 tok/s step 5334/19560 | loss 3.685076 (+3.20z)| norm 0.2806 (-0.25z)| lr 5.15e-04 | 4162.18 ms | 32.4% bf16 MFU | 126030 tok/s step 5335/19560 | loss 3.575721 (+0.69z)| norm 0.2996 (+0.56z)| lr 5.15e-04 | 4159.80 ms | 32.5% bf16 MFU | 126030 tok/s step 5336/19560 | loss 3.550175 (+0.09z)| norm 0.2808 (-0.27z)| lr 5.15e-04 | 4149.70 ms | 32.5% bf16 MFU | 126046 tok/s step 5337/19560 | loss 3.576293 (+0.69z)| norm 0.2605 (-1.13z)| lr 5.15e-04 | 4153.62 ms | 32.5% bf16 MFU | 126055 tok/s step 5338/19560 | loss 3.556126 (+0.22z)| norm 0.2489 (-1.61z)| lr 5.15e-04 | 4147.31 ms | 32.6% bf16 MFU | 126073 tok/s step 5339/19560 | loss 
3.574445 (+0.64z)| norm 0.2674 (-0.79z)| lr 5.15e-04 | 4147.64 ms | 32.6% bf16 MFU | 126090 tok/s step 5340/19560 | loss 3.499304 (-1.12z)| norm 0.2769 (-0.37z)| lr 5.15e-04 | 4160.38 ms | 32.5% bf16 MFU | 126086 tok/s step 5341/19560 | loss 3.561393 (+0.38z)| norm 0.2612 (-1.04z)| lr 5.15e-04 | 4161.83 ms | 32.4% bf16 MFU | 126081 tok/s step 5342/19560 | loss 3.487454 (-1.39z)| norm 0.2754 (-0.41z)| lr 5.15e-04 | 4150.12 ms | 32.5% bf16 MFU | 126093 tok/s step 5343/19560 | loss 3.544806 (-0.01z)| norm 0.2874 (+0.14z)| lr 5.15e-04 | 4149.64 ms | 32.5% bf16 MFU | 126106 tok/s step 5344/19560 | loss 3.538504 (-0.16z)| norm 0.2414 (-1.90z)| lr 5.15e-04 | 4148.07 ms | 32.5% bf16 MFU | 126120 tok/s step 5345/19560 | loss 3.502598 (-1.01z)| norm 0.2558 (-1.23z)| lr 5.15e-04 | 4165.39 ms | 32.4% bf16 MFU | 126108 tok/s step 5346/19560 | loss 3.574344 (+0.71z)| norm 0.2765 (-0.31z)| lr 5.15e-04 | 4167.17 ms | 32.4% bf16 MFU | 126093 tok/s step 5347/19560 | loss 3.634034 (+2.12z)| norm 0.2518 (-1.42z)| lr 5.15e-04 | 4158.21 ms | 32.5% bf16 MFU | 126092 tok/s step 5348/19560 | loss 3.603662 (+1.38z)| norm 0.2684 (-0.65z)| lr 5.15e-04 | 4164.11 ms | 32.4% bf16 MFU | 126083 tok/s step 5349/19560 | loss 3.586942 (+0.97z)| norm 0.3822 (+4.21z)| lr 5.14e-04 | 4161.78 ms | 32.4% bf16 MFU | 126078 tok/s step 5350/19560 | loss 3.557797 (+0.27z)| norm 0.2887 (+0.23z)| lr 5.14e-04 | 4165.27 ms | 32.4% bf16 MFU | 126068 tok/s step 5351/19560 | loss 3.541244 (-0.12z)| norm 0.2829 (-0.02z)| lr 5.14e-04 | 4160.96 ms | 32.4% bf16 MFU | 126064 tok/s step 5352/19560 | loss 3.596682 (+1.19z)| norm 0.2663 (-0.72z)| lr 5.14e-04 | 4155.21 ms | 32.5% bf16 MFU | 126070 tok/s step 5353/19560 | loss 3.553718 (+0.16z)| norm 0.3352 (+2.16z)| lr 5.14e-04 | 4159.45 ms | 32.5% bf16 MFU | 126069 tok/s step 5354/19560 | loss 3.544369 (-0.06z)| norm 0.3181 (+1.42z)| lr 5.14e-04 | 4159.21 ms | 32.5% bf16 MFU | 126068 tok/s step 5355/19560 | loss 3.585297 (+0.90z)| norm 0.2997 (+0.65z)| lr 5.14e-04 | 4152.95 
ms | 32.5% bf16 MFU | 126077 tok/s step 5356/19560 | loss 3.478642 (-1.62z)| norm 0.3149 (+1.30z)| lr 5.14e-04 | 4158.06 ms | 32.5% bf16 MFU | 126077 tok/s step 5357/19560 | loss 3.586980 (+0.93z)| norm 0.3082 (+1.00z)| lr 5.14e-04 | 4168.18 ms | 32.4% bf16 MFU | 126063 tok/s step 5358/19560 | loss 3.544291 (-0.07z)| norm 0.3116 (+1.13z)| lr 5.14e-04 | 4153.95 ms | 32.5% bf16 MFU | 126070 tok/s step 5359/19560 | loss 3.567152 (+0.47z)| norm 0.2603 (-1.01z)| lr 5.14e-04 | 4155.03 ms | 32.5% bf16 MFU | 126076 tok/s step 5360/19560 | loss 3.523486 (-0.57z)| norm 0.2832 (-0.04z)| lr 5.14e-04 | 4163.27 ms | 32.4% bf16 MFU | 126069 tok/s step 5361/19560 | loss 3.539586 (-0.19z)| norm 0.2876 (+0.15z)| lr 5.14e-04 | 4150.93 ms | 32.5% bf16 MFU | 126081 tok/s step 5362/19560 | loss 3.575956 (+0.68z)| norm 0.3006 (+0.70z)| lr 5.14e-04 | 4155.28 ms | 32.5% bf16 MFU | 126085 tok/s step 5363/19560 | loss 3.588701 (+0.97z)| norm 0.2919 (+0.33z)| lr 5.14e-04 | 4162.41 ms | 32.4% bf16 MFU | 126079 tok/s step 5364/19560 | loss 3.483435 (-1.52z)| norm 0.2854 (+0.05z)| lr 5.14e-04 | 4152.17 ms | 32.5% bf16 MFU | 126088 tok/s step 5365/19560 | loss 3.724112 (+3.93z)| norm 0.3058 (+0.90z)| lr 5.14e-04 | 4163.32 ms | 32.4% bf16 MFU | 126080 tok/s step 5366/19560 | loss 3.538586 (-0.25z)| norm 0.2887 (+0.17z)| lr 5.14e-04 | 4151.14 ms | 32.5% bf16 MFU | 126091 tok/s step 5367/19560 | loss 3.515273 (-0.78z)| norm 0.3137 (+1.21z)| lr 5.14e-04 | 4151.80 ms | 32.5% bf16 MFU | 126101 tok/s step 5368/19560 | loss 3.587079 (+0.83z)| norm 0.3578 (+2.96z)| lr 5.14e-04 | 4156.47 ms | 32.5% bf16 MFU | 126103 tok/s step 5369/19560 | loss 3.501813 (-1.08z)| norm 0.3224 (+1.48z)| lr 5.14e-04 | 4149.65 ms | 32.5% bf16 MFU | 126115 tok/s step 5370/19560 | loss 3.569426 (+0.43z)| norm 0.2965 (+0.43z)| lr 5.14e-04 | 4156.14 ms | 32.5% bf16 MFU | 126116 tok/s step 5371/19560 | loss 3.626784 (+1.69z)| norm 0.3281 (+1.68z)| lr 5.14e-04 | 4154.11 ms | 32.5% bf16 MFU | 126121 tok/s step 5372/19560 | loss 
3.551532 (+0.01z)| norm 0.3049 (+0.74z)| lr 5.14e-04 | 4154.40 ms | 32.5% bf16 MFU | 126125 tok/s step 5373/19560 | loss 3.527184 (-0.54z)| norm 0.2835 (-0.11z)| lr 5.14e-04 | 4157.35 ms | 32.5% bf16 MFU | 126124 tok/s step 5374/19560 | loss 3.551677 (+0.00z)| norm 0.3054 (+0.77z)| lr 5.14e-04 | 4156.02 ms | 32.5% bf16 MFU | 126126 tok/s step 5375/19560 | loss 3.582403 (+0.68z)| norm 0.2801 (-0.25z)| lr 5.14e-04 | 4157.77 ms | 32.5% bf16 MFU | 126124 tok/s step 5376/19560 | loss 3.552329 (-0.00z)| norm 0.3125 (+1.05z)| lr 5.14e-04 | 4150.30 ms | 32.5% bf16 MFU | 126134 tok/s step 5377/19560 | loss 3.571421 (+0.42z)| norm 0.2991 (+0.51z)| lr 5.14e-04 | 4158.23 ms | 32.5% bf16 MFU | 126132 tok/s step 5378/19560 | loss 3.497335 (-1.23z)| norm 0.2562 (-1.21z)| lr 5.13e-04 | 4168.21 ms | 32.4% bf16 MFU | 126114 tok/s step 5379/19560 | loss 3.462920 (-2.00z)| norm 0.3036 (+0.69z)| lr 5.13e-04 | 4143.54 ms | 32.6% bf16 MFU | 126135 tok/s step 5380/19560 | loss 3.505935 (-1.03z)| norm 0.2520 (-1.37z)| lr 5.13e-04 | 4149.53 ms | 32.5% bf16 MFU | 126146 tok/s step 5381/19560 | loss 3.513799 (-0.85z)| norm 0.2691 (-0.68z)| lr 5.13e-04 | 4149.51 ms | 32.5% bf16 MFU | 126156 tok/s step 5382/19560 | loss 3.592363 (+0.89z)| norm 0.3035 (+0.68z)| lr 5.13e-04 | 4155.20 ms | 32.5% bf16 MFU | 126157 tok/s step 5383/19560 | loss 3.525893 (-0.58z)| norm 0.2957 (+0.37z)| lr 5.13e-04 | 4150.53 ms | 32.5% bf16 MFU | 126165 tok/s step 5384/19560 | loss 3.550439 (-0.02z)| norm 0.3235 (+1.46z)| lr 5.13e-04 | 4159.32 ms | 32.5% bf16 MFU | 126160 tok/s step 5385/19560 | loss 3.510487 (-0.91z)| norm 0.2998 (+0.53z)| lr 5.13e-04 | 4160.93 ms | 32.4% bf16 MFU | 126152 tok/s step 5386/19560 | loss 3.571432 (+0.45z)| norm 0.2902 (+0.16z)| lr 5.13e-04 | 4157.17 ms | 32.5% bf16 MFU | 126150 tok/s step 5387/19560 | loss 3.517725 (-0.75z)| norm 0.3011 (+0.59z)| lr 5.13e-04 | 4152.78 ms | 32.5% bf16 MFU | 126155 tok/s step 5388/19560 | loss 3.515651 (-0.78z)| norm 0.2797 (-0.27z)| lr 5.13e-04 | 4168.59 
ms | 32.4% bf16 MFU | 126136 tok/s step 5389/19560 | loss 3.560340 (+0.21z)| norm 0.2860 (-0.01z)| lr 5.13e-04 | 4194.77 ms | 32.2% bf16 MFU | 126078 tok/s step 5390/19560 | loss 3.486830 (-1.41z)| norm 0.3335 (+1.86z)| lr 5.13e-04 | 4151.97 ms | 32.5% bf16 MFU | 126088 tok/s step 5391/19560 | loss 3.555560 (+0.11z)| norm 0.3137 (+1.05z)| lr 5.13e-04 | 4157.21 ms | 32.5% bf16 MFU | 126089 tok/s step 5392/19560 | loss 3.514498 (-0.79z)| norm 0.3078 (+0.80z)| lr 5.13e-04 | 4163.20 ms | 32.4% bf16 MFU | 126082 tok/s step 5393/19560 | loss 3.536850 (-0.31z)| norm 0.2597 (-1.11z)| lr 5.13e-04 | 4139.53 ms | 32.6% bf16 MFU | 126110 tok/s step 5394/19560 | loss 3.556093 (+0.12z)| norm 0.3047 (+0.67z)| lr 5.13e-04 | 4149.94 ms | 32.5% bf16 MFU | 126122 tok/s step 5395/19560 | loss 3.505685 (-1.00z)| norm 0.3038 (+0.63z)| lr 5.13e-04 | 4153.18 ms | 32.5% bf16 MFU | 126127 tok/s step 5396/19560 | loss 3.592370 (+0.94z)| norm 0.3011 (+0.51z)| lr 5.13e-04 | 4149.73 ms | 32.5% bf16 MFU | 126138 tok/s step 5397/19560 | loss 3.552444 (+0.05z)| norm 0.2638 (-0.97z)| lr 5.13e-04 | 4159.92 ms | 32.5% bf16 MFU | 126133 tok/s step 5398/19560 | loss 3.564515 (+0.31z)| norm 0.2964 (+0.32z)| lr 5.13e-04 | 4155.21 ms | 32.5% bf16 MFU | 126135 tok/s step 5399/19560 | loss 3.494993 (-1.23z)| norm 0.2703 (-0.71z)| lr 5.13e-04 | 4158.88 ms | 32.5% bf16 MFU | 126132 tok/s step 5400/19560 | loss 3.564245 (+0.32z)| norm 0.2741 (-0.56z)| lr 5.13e-04 | 4144.43 ms | 32.6% bf16 MFU | 126150 tok/s step 5401/19560 | loss 3.486705 (-1.40z)| norm 0.2446 (-1.73z)| lr 5.13e-04 | 4158.33 ms | 32.5% bf16 MFU | 126147 tok/s step 5402/19560 | loss 3.566381 (+0.37z)| norm 0.2667 (-0.86z)| lr 5.13e-04 | 4155.21 ms | 32.5% bf16 MFU | 126148 tok/s step 5403/19560 | loss 3.573718 (+0.52z)| norm 0.2610 (-1.08z)| lr 5.13e-04 | 4145.57 ms | 32.6% bf16 MFU | 126164 tok/s step 5404/19560 | loss 3.551755 (+0.03z)| norm 0.2623 (-1.03z)| lr 5.13e-04 | 4157.00 ms | 32.5% bf16 MFU | 126162 tok/s step 5405/19560 | loss 
3.609787 (+1.30z)| norm 0.2583 (-1.17z)| lr 5.13e-04 | 4157.54 ms | 32.5% bf16 MFU | 126159 tok/s step 5406/19560 | loss 3.509737 (-0.90z)| norm 0.2897 (+0.12z)| lr 5.12e-04 | 4150.08 ms | 32.5% bf16 MFU | 126168 tok/s step 5407/19560 | loss 3.560654 (+0.22z)| norm 0.2591 (-1.15z)| lr 5.12e-04 | 4170.18 ms | 32.4% bf16 MFU | 126146 tok/s step 5408/19560 | loss 3.513140 (-0.84z)| norm 0.2891 (+0.13z)| lr 5.12e-04 | 4156.11 ms | 32.5% bf16 MFU | 126146 tok/s step 5409/19560 | loss 3.549685 (-0.03z)| norm 0.2730 (-0.55z)| lr 5.12e-04 | 4159.46 ms | 32.5% bf16 MFU | 126141 tok/s step 5410/19560 | loss 3.521467 (-0.66z)| norm 0.2861 (+0.01z)| lr 5.12e-04 | 4159.20 ms | 32.5% bf16 MFU | 126137 tok/s step 5411/19560 | loss 3.521777 (-0.64z)| norm 0.3193 (+1.41z)| lr 5.12e-04 | 4154.46 ms | 32.5% bf16 MFU | 126140 tok/s step 5412/19560 | loss 3.556417 (+0.18z)| norm 0.3098 (+1.00z)| lr 5.12e-04 | 4157.25 ms | 32.5% bf16 MFU | 126138 tok/s step 5413/19560 | loss 3.546347 (-0.06z)| norm 0.3337 (+2.01z)| lr 5.12e-04 | 4148.68 ms | 32.5% bf16 MFU | 126150 tok/s step 5414/19560 | loss 3.551710 (+0.06z)| norm 0.2856 (-0.01z)| lr 5.12e-04 | 4150.49 ms | 32.5% bf16 MFU | 126159 tok/s step 5415/19560 | loss 3.523982 (-0.60z)| norm 0.2623 (-0.99z)| lr 5.12e-04 | 4151.14 ms | 32.5% bf16 MFU | 126166 tok/s step 5416/19560 | loss 3.547482 (-0.03z)| norm 0.2917 (+0.26z)| lr 5.12e-04 | 4156.34 ms | 32.5% bf16 MFU | 126165 tok/s step 5417/19560 | loss 3.626560 (+1.87z)| norm 0.2692 (-0.69z)| lr 5.12e-04 | 4148.44 ms | 32.5% bf16 MFU | 126175 tok/s step 5418/19560 | loss 3.591506 (+1.01z)| norm 0.3030 (+0.75z)| lr 5.12e-04 | 4152.65 ms | 32.5% bf16 MFU | 126179 tok/s step 5419/19560 | loss 3.525445 (-0.58z)| norm 0.2942 (+0.36z)| lr 5.12e-04 | 4155.35 ms | 32.5% bf16 MFU | 126179 tok/s step 5420/19560 | loss 3.532460 (-0.39z)| norm 0.2620 (-1.00z)| lr 5.12e-04 | 4157.77 ms | 32.5% bf16 MFU | 126175 tok/s step 5421/19560 | loss 3.530824 (-0.43z)| norm 0.2688 (-0.71z)| lr 5.12e-04 | 4145.34 
ms | 32.6% bf16 MFU | 126190 tok/s step 5422/19560 | loss 3.488565 (-1.47z)| norm 0.2880 (+0.11z)| lr 5.12e-04 | 4160.12 ms | 32.5% bf16 MFU | 126182 tok/s step 5423/19560 | loss 3.567797 (+0.49z)| norm 0.3022 (+0.70z)| lr 5.12e-04 | 4152.06 ms | 32.5% bf16 MFU | 126186 tok/s step 5424/19560 | loss 3.610104 (+1.51z)| norm 0.2762 (-0.40z)| lr 5.12e-04 | 4154.32 ms | 32.5% bf16 MFU | 126187 tok/s step 5425/19560 | loss 3.576483 (+0.67z)| norm 0.2580 (-1.17z)| lr 5.12e-04 | 4143.93 ms | 32.6% bf16 MFU | 126204 tok/s step 5426/19560 | loss 3.511154 (-0.93z)| norm 0.2614 (-1.03z)| lr 5.12e-04 | 4150.48 ms | 32.5% bf16 MFU | 126210 tok/s step 5427/19560 | loss 3.505538 (-1.07z)| norm 0.2659 (-0.84z)| lr 5.12e-04 | 4153.53 ms | 32.5% bf16 MFU | 126210 tok/s step 5428/19560 | loss 3.532311 (-0.41z)| norm 0.2827 (-0.14z)| lr 5.12e-04 | 4155.90 ms | 32.5% bf16 MFU | 126208 tok/s step 5429/19560 | loss 3.593418 (+1.07z)| norm 0.2827 (-0.15z)| lr 5.12e-04 | 4152.09 ms | 32.5% bf16 MFU | 126211 tok/s step 5430/19560 | loss 3.576167 (+0.65z)| norm 0.3126 (+1.13z)| lr 5.12e-04 | 4150.49 ms | 32.5% bf16 MFU | 126216 tok/s step 5431/19560 | loss 3.539352 (-0.25z)| norm 0.2937 (+0.30z)| lr 5.12e-04 | 4149.39 ms | 32.5% bf16 MFU | 126223 tok/s step 5432/19560 | loss 3.582159 (+0.78z)| norm 0.2779 (-0.39z)| lr 5.12e-04 | 4161.14 ms | 32.4% bf16 MFU | 126212 tok/s step 5433/19560 | loss 3.544517 (-0.14z)| norm 0.2723 (-0.63z)| lr 5.12e-04 | 4152.02 ms | 32.5% bf16 MFU | 126215 tok/s step 5434/19560 | loss 3.566528 (+0.39z)| norm 0.2800 (-0.29z)| lr 5.11e-04 | 4151.29 ms | 32.5% bf16 MFU | 126219 tok/s step 5435/19560 | loss 3.552068 (+0.04z)| norm 0.2559 (-1.32z)| lr 5.11e-04 | 4151.16 ms | 32.5% bf16 MFU | 126223 tok/s step 5436/19560 | loss 3.561441 (+0.26z)| norm 0.2484 (-1.63z)| lr 5.11e-04 | 4147.15 ms | 32.6% bf16 MFU | 126233 tok/s step 5437/19560 | loss 3.531662 (-0.47z)| norm 0.2674 (-0.80z)| lr 5.11e-04 | 4159.29 ms | 32.5% bf16 MFU | 126224 tok/s step 5438/19560 | loss 
3.579224 (+0.69z)| norm 0.2575 (-1.22z)| lr 5.11e-04 | 4149.97 ms | 32.5% bf16 MFU | 126229 tok/s step 5439/19560 | loss 3.562713 (+0.29z)| norm 0.2775 (-0.35z)| lr 5.11e-04 | 4155.61 ms | 32.5% bf16 MFU | 126226 tok/s step 5440/19560 | loss 3.509575 (-1.02z)| norm 0.2853 (-0.02z)| lr 5.11e-04 | 4160.41 ms | 32.5% bf16 MFU | 126216 tok/s step 5441/19560 | loss 3.431944 (-2.82z)| norm 0.2791 (-0.29z)| lr 5.11e-04 | 4189.64 ms | 32.2% bf16 MFU | 126162 tok/s step 5442/19560 | loss 3.566001 (+0.38z)| norm 0.2458 (-1.70z)| lr 5.11e-04 | 4154.08 ms | 32.5% bf16 MFU | 126164 tok/s step 5443/19560 | loss 3.533493 (-0.41z)| norm 0.2876 (+0.08z)| lr 5.11e-04 | 4230.93 ms | 31.9% bf16 MFU | 126052 tok/s step 5444/19560 | loss 3.525984 (-0.58z)| norm 0.2913 (+0.23z)| lr 5.11e-04 | 4152.49 ms | 32.5% bf16 MFU | 126062 tok/s step 5445/19560 | loss 3.546351 (-0.11z)| norm 0.2928 (+0.29z)| lr 5.11e-04 | 4172.56 ms | 32.4% bf16 MFU | 126042 tok/s step 5446/19560 | loss 3.602563 (+1.24z)| norm 0.2559 (-1.29z)| lr 5.11e-04 | 4244.71 ms | 31.8% bf16 MFU | 125915 tok/s step 5447/19560 | loss 3.492614 (-1.40z)| norm 0.2739 (-0.51z)| lr 5.11e-04 | 4178.88 ms | 32.3% bf16 MFU | 125893 tok/s step 5448/19560 | loss 3.639560 (+2.08z)| norm 0.2714 (-0.61z)| lr 5.11e-04 | 4160.43 ms | 32.5% bf16 MFU | 125899 tok/s step 5449/19560 | loss 3.581798 (+0.72z)| norm 0.2862 (+0.03z)| lr 5.11e-04 | 4152.78 ms | 32.5% bf16 MFU | 125917 tok/s step 5450/19560 | loss 3.538288 (-0.31z)| norm 0.2913 (+0.25z)| lr 5.11e-04 | 4154.34 ms | 32.5% bf16 MFU | 125931 tok/s step 5451/19560 | loss 3.546329 (-0.13z)| norm 0.2772 (-0.35z)| lr 5.11e-04 | 4149.55 ms | 32.5% bf16 MFU | 125952 tok/s step 5452/19560 | loss 3.524142 (-0.65z)| norm 0.2663 (-0.81z)| lr 5.11e-04 | 4161.72 ms | 32.4% bf16 MFU | 125953 tok/s step 5453/19560 | loss 3.541686 (-0.23z)| norm 0.2755 (-0.41z)| lr 5.11e-04 | 4168.86 ms | 32.4% bf16 MFU | 125944 tok/s step 5454/19560 | loss 3.573066 (+0.51z)| norm 0.2562 (-1.23z)| lr 5.11e-04 | 4153.61 
ms | 32.5% bf16 MFU | 125958 tok/s step 5455/19560 | loss 3.541387 (-0.25z)| norm 0.2566 (-1.21z)| lr 5.11e-04 | 4152.68 ms | 32.5% bf16 MFU | 125972 tok/s step 5456/19560 | loss 3.497769 (-1.27z)| norm 0.2681 (-0.72z)| lr 5.11e-04 | 4160.76 ms | 32.5% bf16 MFU | 125974 tok/s step 5457/19560 | loss 3.568469 (+0.42z)| norm 0.2576 (-1.16z)| lr 5.11e-04 | 4161.75 ms | 32.4% bf16 MFU | 125974 tok/s step 5458/19560 | loss 3.582689 (+0.75z)| norm 0.3072 (+0.94z)| lr 5.11e-04 | 4151.22 ms | 32.5% bf16 MFU | 125990 tok/s step 5459/19560 | loss 3.596172 (+1.08z)| norm 0.3152 (+1.26z)| lr 5.11e-04 | 4154.36 ms | 32.5% bf16 MFU | 126001 tok/s step 5460/19560 | loss 3.569272 (+0.43z)| norm 0.3013 (+0.67z)| lr 5.11e-04 | 4163.37 ms | 32.4% bf16 MFU | 125997 tok/s step 5461/19560 | loss 3.568927 (+0.44z)| norm 0.2926 (+0.30z)| lr 5.11e-04 | 4164.63 ms | 32.4% bf16 MFU | 125992 tok/s step 5462/19560 | loss 3.526260 (-0.60z)| norm 0.2628 (-0.94z)| lr 5.11e-04 | 4149.95 ms | 32.5% bf16 MFU | 126009 tok/s step 5463/19560 | loss 3.541168 (-0.21z)| norm 0.2557 (-1.22z)| lr 5.10e-04 | 4159.01 ms | 32.5% bf16 MFU | 126012 tok/s step 5464/19560 | loss 3.489142 (-1.51z)| norm 0.2765 (-0.35z)| lr 5.10e-04 | 4153.55 ms | 32.5% bf16 MFU | 126023 tok/s step 5465/19560 | loss 3.512622 (-0.90z)| norm 0.2661 (-0.79z)| lr 5.10e-04 | 4160.38 ms | 32.5% bf16 MFU | 126022 tok/s step 5466/19560 | loss 3.520011 (-0.71z)| norm 0.2599 (-1.05z)| lr 5.10e-04 | 4176.90 ms | 32.3% bf16 MFU | 125997 tok/s step 5467/19560 | loss 3.533866 (-0.35z)| norm 0.2730 (-0.50z)| lr 5.10e-04 | 4156.72 ms | 32.5% bf16 MFU | 126004 tok/s step 5468/19560 | loss 3.545855 (-0.06z)| norm 0.2622 (-0.95z)| lr 5.10e-04 | 4163.22 ms | 32.4% bf16 MFU | 126000 tok/s step 5469/19560 | loss 3.538595 (-0.24z)| norm 0.2905 (+0.23z)| lr 5.10e-04 | 4164.13 ms | 32.4% bf16 MFU | 125996 tok/s step 5470/19560 | loss 3.522982 (-0.65z)| norm 0.2566 (-1.19z)| lr 5.10e-04 | 4164.84 ms | 32.4% bf16 MFU | 125990 tok/s step 5471/19560 | loss 
3.562627 (+0.36z)| norm 0.2590 (-1.07z)| lr 5.10e-04 | 4159.36 ms | 32.5% bf16 MFU | 125993 tok/s step 5472/19560 | loss 3.528530 (-0.51z)| norm 0.2930 (+0.33z)| lr 5.10e-04 | 4156.06 ms | 32.5% bf16 MFU | 126001 tok/s step 5473/19560 | loss 3.525603 (-0.59z)| norm 0.3082 (+0.96z)| lr 5.10e-04 | 4164.22 ms | 32.4% bf16 MFU | 125996 tok/s step 5474/19560 | loss 3.479158 (-1.74z)| norm 0.2766 (-0.38z)| lr 5.10e-04 | 4165.21 ms | 32.4% bf16 MFU | 125990 tok/s step 5475/19560 | loss 3.595253 (+1.22z)| norm 0.2651 (-0.87z)| lr 5.10e-04 | 4152.30 ms | 32.5% bf16 MFU | 126004 tok/s step 5476/19560 | loss 3.512886 (-0.87z)| norm 0.2715 (-0.61z)| lr 5.10e-04 | 4157.02 ms | 32.5% bf16 MFU | 126010 tok/s step 5477/19560 | loss 3.556609 (+0.26z)| norm 0.2741 (-0.49z)| lr 5.10e-04 | 4174.39 ms | 32.3% bf16 MFU | 125989 tok/s step 5478/19560 | loss 3.531235 (-0.39z)| norm 0.2878 (+0.14z)| lr 5.10e-04 | 4156.59 ms | 32.5% bf16 MFU | 125996 tok/s step 5479/19560 | loss 3.567898 (+0.55z)| norm 0.2993 (+0.66z)| lr 5.10e-04 | 4159.96 ms | 32.5% bf16 MFU | 125998 tok/s step 5480/19560 | loss 3.547062 (+0.02z)| norm 0.3030 (+0.81z)| lr 5.10e-04 | 4156.42 ms | 32.5% bf16 MFU | 126005 tok/s step 5481/19560 | loss 3.571701 (+0.66z)| norm 0.2917 (+0.32z)| lr 5.10e-04 | 4165.07 ms | 32.4% bf16 MFU | 125999 tok/s step 5482/19560 | loss 3.626382 (+2.03z)| norm 0.3125 (+1.29z)| lr 5.10e-04 | 4158.59 ms | 32.5% bf16 MFU | 126002 tok/s step 5483/19560 | loss 3.580847 (+0.87z)| norm 0.2790 (-0.27z)| lr 5.10e-04 | 4155.29 ms | 32.5% bf16 MFU | 126011 tok/s step 5484/19560 | loss 3.582358 (+0.89z)| norm 0.2617 (-1.06z)| lr 5.10e-04 | 4158.76 ms | 32.5% bf16 MFU | 126014 tok/s step 5485/19560 | loss 3.530992 (-0.42z)| norm 0.2717 (-0.58z)| lr 5.10e-04 | 4165.17 ms | 32.4% bf16 MFU | 126007 tok/s step 5486/19560 | loss 3.590945 (+1.11z)| norm 0.2879 (+0.19z)| lr 5.10e-04 | 4156.98 ms | 32.5% bf16 MFU | 126013 tok/s step 5487/19560 | loss 3.538160 (-0.24z)| norm 0.2690 (-0.71z)| lr 5.10e-04 | 4153.46 
ms | 32.5% bf16 MFU | 126023 tok/s step 5488/19560 | loss 3.558316 (+0.27z)| norm 0.2716 (-0.58z)| lr 5.10e-04 | 4156.91 ms | 32.5% bf16 MFU | 126028 tok/s step 5489/19560 | loss 3.546169 (-0.04z)| norm 0.2931 (+0.44z)| lr 5.10e-04 | 4152.36 ms | 32.5% bf16 MFU | 126040 tok/s step 5490/19560 | loss 3.513818 (-0.86z)| norm 0.2777 (-0.28z)| lr 5.10e-04 | 4153.34 ms | 32.5% bf16 MFU | 126050 tok/s step 5491/19560 | loss 3.544613 (-0.06z)| norm 0.2621 (-1.01z)| lr 5.09e-04 | 4155.92 ms | 32.5% bf16 MFU | 126055 tok/s step 5492/19560 | loss 3.537883 (-0.25z)| norm 0.3067 (+1.09z)| lr 5.09e-04 | 4164.48 ms | 32.4% bf16 MFU | 126047 tok/s step 5493/19560 | loss 3.532587 (-0.38z)| norm 0.3070 (+1.10z)| lr 5.09e-04 | 4181.46 ms | 32.3% bf16 MFU | 126014 tok/s step 5494/19560 | loss 3.546492 (+0.02z)| norm 0.2796 (-0.18z)| lr 5.09e-04 | 4155.38 ms | 32.5% bf16 MFU | 126022 tok/s step 5495/19560 | loss 3.530247 (-0.45z)| norm 0.2721 (-0.53z)| lr 5.09e-04 | 4161.81 ms | 32.4% bf16 MFU | 126019 tok/s step 5496/19560 | loss 3.566746 (+0.60z)| norm 0.2782 (-0.22z)| lr 5.09e-04 | 4156.08 ms | 32.5% bf16 MFU | 126026 tok/s step 5497/19560 | loss 3.591794 (+1.30z)| norm 0.2775 (-0.24z)| lr 5.09e-04 | 4156.67 ms | 32.5% bf16 MFU | 126031 tok/s step 5498/19560 | loss 3.550367 (+0.11z)| norm 0.2909 (+0.45z)| lr 5.09e-04 | 4155.27 ms | 32.5% bf16 MFU | 126038 tok/s step 5499/19560 | loss 3.563052 (+0.50z)| norm 0.2850 (+0.16z)| lr 5.09e-04 | 4164.10 ms | 32.4% bf16 MFU | 126032 tok/s step 5500/19560 | loss 3.579283 (+0.97z)| norm 0.2890 (+0.38z)| lr 5.09e-04 | 4150.57 ms | 32.5% bf16 MFU | 126046 tok/s val loss 3.526718 evaluating HellaSwag: 0/1256 evaluating HellaSwag: 10/1256 evaluating HellaSwag: 20/1256 evaluating HellaSwag: 30/1256 evaluating HellaSwag: 40/1256 evaluating HellaSwag: 50/1256 evaluating HellaSwag: 60/1256 evaluating HellaSwag: 70/1256 evaluating HellaSwag: 80/1256 evaluating HellaSwag: 90/1256 evaluating HellaSwag: 100/1256 evaluating HellaSwag: 110/1256 evaluating 
HellaSwag: 120/1256 evaluating HellaSwag: 130/1256 evaluating HellaSwag: 140/1256 evaluating HellaSwag: 150/1256 evaluating HellaSwag: 160/1256 evaluating HellaSwag: 170/1256 evaluating HellaSwag: 180/1256 evaluating HellaSwag: 190/1256 evaluating HellaSwag: 200/1256 evaluating HellaSwag: 210/1256 evaluating HellaSwag: 220/1256 evaluating HellaSwag: 230/1256 evaluating HellaSwag: 240/1256 evaluating HellaSwag: 250/1256 evaluating HellaSwag: 260/1256 evaluating HellaSwag: 270/1256 evaluating HellaSwag: 280/1256 evaluating HellaSwag: 290/1256 evaluating HellaSwag: 300/1256 evaluating HellaSwag: 310/1256 evaluating HellaSwag: 320/1256 evaluating HellaSwag: 330/1256 evaluating HellaSwag: 340/1256 evaluating HellaSwag: 350/1256 evaluating HellaSwag: 360/1256 evaluating HellaSwag: 370/1256 evaluating HellaSwag: 380/1256 evaluating HellaSwag: 390/1256 evaluating HellaSwag: 400/1256 evaluating HellaSwag: 410/1256 evaluating HellaSwag: 420/1256 evaluating HellaSwag: 430/1256 evaluating HellaSwag: 440/1256 evaluating HellaSwag: 450/1256 evaluating HellaSwag: 460/1256 evaluating HellaSwag: 470/1256 evaluating HellaSwag: 480/1256 evaluating HellaSwag: 490/1256 evaluating HellaSwag: 500/1256 evaluating HellaSwag: 510/1256 evaluating HellaSwag: 520/1256 evaluating HellaSwag: 530/1256 evaluating HellaSwag: 540/1256 evaluating HellaSwag: 550/1256 evaluating HellaSwag: 560/1256 evaluating HellaSwag: 570/1256 evaluating HellaSwag: 580/1256 evaluating HellaSwag: 590/1256 evaluating HellaSwag: 600/1256 evaluating HellaSwag: 610/1256 evaluating HellaSwag: 620/1256 evaluating HellaSwag: 630/1256 evaluating HellaSwag: 640/1256 evaluating HellaSwag: 650/1256 evaluating HellaSwag: 660/1256 evaluating HellaSwag: 670/1256 evaluating HellaSwag: 680/1256 evaluating HellaSwag: 690/1256 evaluating HellaSwag: 700/1256 evaluating HellaSwag: 710/1256 evaluating HellaSwag: 720/1256 evaluating HellaSwag: 730/1256 evaluating HellaSwag: 740/1256 evaluating HellaSwag: 750/1256 evaluating HellaSwag: 
760/1256 evaluating HellaSwag: 770/1256 evaluating HellaSwag: 780/1256 evaluating HellaSwag: 790/1256 evaluating HellaSwag: 800/1256 evaluating HellaSwag: 810/1256 evaluating HellaSwag: 820/1256 evaluating HellaSwag: 830/1256 evaluating HellaSwag: 840/1256 evaluating HellaSwag: 850/1256 evaluating HellaSwag: 860/1256 evaluating HellaSwag: 870/1256 evaluating HellaSwag: 880/1256 evaluating HellaSwag: 890/1256 evaluating HellaSwag: 900/1256 evaluating HellaSwag: 910/1256 evaluating HellaSwag: 920/1256 evaluating HellaSwag: 930/1256 evaluating HellaSwag: 940/1256 evaluating HellaSwag: 950/1256 evaluating HellaSwag: 960/1256 evaluating HellaSwag: 970/1256 evaluating HellaSwag: 980/1256 evaluating HellaSwag: 990/1256 evaluating HellaSwag: 1000/1256 evaluating HellaSwag: 1010/1256 evaluating HellaSwag: 1020/1256 evaluating HellaSwag: 1030/1256 evaluating HellaSwag: 1040/1256 evaluating HellaSwag: 1050/1256 evaluating HellaSwag: 1060/1256 evaluating HellaSwag: 1070/1256 evaluating HellaSwag: 1080/1256 evaluating HellaSwag: 1090/1256 evaluating HellaSwag: 1100/1256 evaluating HellaSwag: 1110/1256 evaluating HellaSwag: 1120/1256 evaluating HellaSwag: 1130/1256 evaluating HellaSwag: 1140/1256 evaluating HellaSwag: 1150/1256 evaluating HellaSwag: 1160/1256 evaluating HellaSwag: 1170/1256 evaluating HellaSwag: 1180/1256 evaluating HellaSwag: 1190/1256 evaluating HellaSwag: 1200/1256 evaluating HellaSwag: 1210/1256 evaluating HellaSwag: 1220/1256 evaluating HellaSwag: 1230/1256 evaluating HellaSwag: 1240/1256 evaluating HellaSwag: 1250/1256 HellaSwag: 2790/10042 = 0.277833 step 5501/19560 | loss 3.578588 (+0.94z)| norm 0.3127 (+1.59z)| lr 5.09e-04 | 4147.48 ms | 32.6% bf16 MFU | 126064 tok/s step 5502/19560 | loss 3.552864 (+0.18z)| norm 0.2852 (+0.18z)| lr 5.09e-04 | 4157.11 ms | 32.5% bf16 MFU | 126067 tok/s step 5503/19560 | loss 3.544630 (-0.05z)| norm 0.2672 (-0.75z)| lr 5.09e-04 | 4160.90 ms | 32.4% bf16 MFU | 126064 tok/s step 5504/19560 | loss 3.546224 (-0.00z)| norm 
0.2881 (+0.35z)| lr 5.09e-04 | 4158.05 ms | 32.5% bf16 MFU | 126065 tok/s step 5505/19560 | loss 3.627315 (+2.33z)| norm 0.2952 (+0.72z)| lr 5.09e-04 | 4161.21 ms | 32.4% bf16 MFU | 126062 tok/s step 5506/19560 | loss 3.534351 (-0.37z)| norm 0.2820 (+0.02z)| lr 5.09e-04 | 4160.61 ms | 32.5% bf16 MFU | 126059 tok/s step 5507/19560 | loss 3.551308 (+0.11z)| norm 0.2883 (+0.36z)| lr 5.09e-04 | 4161.05 ms | 32.4% bf16 MFU | 126056 tok/s step 5508/19560 | loss 3.529044 (-0.56z)| norm 0.2785 (-0.18z)| lr 5.09e-04 | 4159.14 ms | 32.5% bf16 MFU | 126056 tok/s step 5509/19560 | loss 3.578255 (+0.90z)| norm 0.2822 (+0.02z)| lr 5.09e-04 | 4160.62 ms | 32.5% bf16 MFU | 126054 tok/s step 5510/19560 | loss 3.556529 (+0.26z)| norm 0.2782 (-0.19z)| lr 5.09e-04 | 4157.13 ms | 32.5% bf16 MFU | 126057 tok/s step 5511/19560 | loss 3.579941 (+0.95z)| norm 0.3121 (+1.63z)| lr 5.09e-04 | 4182.29 ms | 32.3% bf16 MFU | 126022 tok/s step 5512/19560 | loss 3.567540 (+0.57z)| norm 0.2855 (+0.22z)| lr 5.09e-04 | 4165.28 ms | 32.4% bf16 MFU | 126015 tok/s step 5513/19560 | loss 3.499965 (-1.46z)| norm 0.2788 (-0.14z)| lr 5.09e-04 | 4165.72 ms | 32.4% bf16 MFU | 126007 tok/s step 5514/19560 | loss 3.678970 (+3.68z)| norm 0.3047 (+1.27z)| lr 5.09e-04 | 4158.53 ms | 32.5% bf16 MFU | 126010 tok/s step 5515/19560 | loss 3.531752 (-0.51z)| norm 0.3045 (+1.26z)| lr 5.09e-04 | 4184.53 ms | 32.3% bf16 MFU | 125974 tok/s step 5516/19560 | loss 3.555985 (+0.18z)| norm 0.2917 (+0.55z)| lr 5.09e-04 | 4153.34 ms | 32.5% bf16 MFU | 125987 tok/s step 5517/19560 | loss 3.481212 (-1.92z)| norm 0.2981 (+0.89z)| lr 5.09e-04 | 4151.34 ms | 32.5% bf16 MFU | 126003 tok/s step 5518/19560 | loss 3.546373 (-0.09z)| norm 0.2750 (-0.34z)| lr 5.08e-04 | 4165.44 ms | 32.4% bf16 MFU | 125996 tok/s step 5519/19560 | loss 3.546520 (-0.09z)| norm 0.3057 (+1.38z)| lr 5.08e-04 | 4160.57 ms | 32.5% bf16 MFU | 125997 tok/s step 5520/19560 | loss 3.628427 (+2.19z)| norm 0.3077 (+1.50z)| lr 5.08e-04 | 4162.33 ms | 32.4% bf16 MFU | 
125995 tok/s step 5521/19560 | loss 3.521531 (-0.81z)| norm 0.3198 (+2.13z)| lr 5.08e-04 | 4159.42 ms | 32.5% bf16 MFU | 125997 tok/s step 5522/19560 | loss 3.563657 (+0.37z)| norm 0.2689 (-0.70z)| lr 5.08e-04 | 4165.49 ms | 32.4% bf16 MFU | 125991 tok/s step 5523/19560 | loss 3.584701 (+0.95z)| norm 0.3308 (+2.70z)| lr 5.08e-04 | 4150.52 ms | 32.5% bf16 MFU | 126007 tok/s step 5524/19560 | loss 3.511802 (-1.09z)| norm 0.3216 (+2.15z)| lr 5.08e-04 | 4153.26 ms | 32.5% bf16 MFU | 126019 tok/s step 5525/19560 | loss 3.576216 (+0.72z)| norm 0.2576 (-1.30z)| lr 5.08e-04 | 4162.00 ms | 32.4% bf16 MFU | 126016 tok/s step 5526/19560 | loss 3.516359 (-0.95z)| norm 0.2672 (-0.76z)| lr 5.08e-04 | 4165.41 ms | 32.4% bf16 MFU | 126009 tok/s step 5527/19560 | loss 3.554206 (+0.10z)| norm 0.2942 (+0.68z)| lr 5.08e-04 | 4153.27 ms | 32.5% bf16 MFU | 126020 tok/s step 5528/19560 | loss 3.626497 (+2.09z)| norm 0.2999 (+0.97z)| lr 5.08e-04 | 4162.53 ms | 32.4% bf16 MFU | 126017 tok/s step 5529/19560 | loss 3.523029 (-0.80z)| norm 0.3107 (+1.53z)| lr 5.08e-04 | 4158.58 ms | 32.5% bf16 MFU | 126020 tok/s step 5530/19560 | loss 3.507395 (-1.22z)| norm 0.2995 (+0.91z)| lr 5.08e-04 | 4167.43 ms | 32.4% bf16 MFU | 126009 tok/s step 5531/19560 | loss 3.550505 (-0.01z)| norm 0.2800 (-0.15z)| lr 5.08e-04 | 4149.45 ms | 32.5% bf16 MFU | 126026 tok/s step 5532/19560 | loss 3.572662 (+0.61z)| norm 0.2883 (+0.29z)| lr 5.08e-04 | 4153.73 ms | 32.5% bf16 MFU | 126036 tok/s step 5533/19560 | loss 3.541937 (-0.24z)| norm 0.2774 (-0.31z)| lr 5.08e-04 | 4163.83 ms | 32.4% bf16 MFU | 126030 tok/s step 5534/19560 | loss 3.517347 (-0.94z)| norm 0.2682 (-0.80z)| lr 5.08e-04 | 4155.55 ms | 32.5% bf16 MFU | 126037 tok/s step 5535/19560 | loss 3.649073 (+2.69z)| norm 0.3178 (+1.87z)| lr 5.08e-04 | 4155.74 ms | 32.5% bf16 MFU | 126043 tok/s step 5536/19560 | loss 3.485092 (-1.79z)| norm 0.2932 (+0.53z)| lr 5.08e-04 | 4146.96 ms | 32.6% bf16 MFU | 126062 tok/s step 5537/19560 | loss 3.624756 (+1.96z)| norm 
0.3055 (+1.18z)| lr 5.08e-04 | 4157.76 ms | 32.5% bf16 MFU | 126064 tok/s step 5538/19560 | loss 3.574795 (+0.61z)| norm 0.2990 (+0.82z)| lr 5.08e-04 | 4156.14 ms | 32.5% bf16 MFU | 126068 tok/s step 5539/19560 | loss 3.532163 (-0.53z)| norm 0.2771 (-0.34z)| lr 5.08e-04 | 4167.12 ms | 32.4% bf16 MFU | 126055 tok/s step 5540/19560 | loss 3.528550 (-0.62z)| norm 0.2891 (+0.32z)| lr 5.08e-04 | 4164.23 ms | 32.4% bf16 MFU | 126048 tok/s step 5541/19560 | loss 3.618270 (+1.75z)| norm 0.2952 (+0.70z)| lr 5.08e-04 | 4145.54 ms | 32.6% bf16 MFU | 126069 tok/s step 5542/19560 | loss 3.554348 (+0.05z)| norm 0.3026 (+1.10z)| lr 5.08e-04 | 4150.58 ms | 32.5% bf16 MFU | 126081 tok/s step 5543/19560 | loss 3.564520 (+0.31z)| norm 0.2890 (+0.32z)| lr 5.08e-04 | 4156.80 ms | 32.5% bf16 MFU | 126084 tok/s step 5544/19560 | loss 3.547972 (-0.13z)| norm 0.2766 (-0.37z)| lr 5.08e-04 | 4148.26 ms | 32.5% bf16 MFU | 126099 tok/s step 5545/19560 | loss 3.584949 (+0.88z)| norm 0.3230 (+2.20z)| lr 5.08e-04 | 4162.80 ms | 32.4% bf16 MFU | 126091 tok/s step 5546/19560 | loss 3.586432 (+0.92z)| norm 0.2901 (+0.37z)| lr 5.07e-04 | 4154.94 ms | 32.5% bf16 MFU | 126096 tok/s step 5547/19560 | loss 3.526985 (-0.68z)| norm 0.2746 (-0.49z)| lr 5.07e-04 | 4652.50 ms | 29.0% bf16 MFU | 125425 tok/s step 5548/19560 | loss 3.526390 (-0.70z)| norm 0.3020 (+1.03z)| lr 5.07e-04 | 4157.35 ms | 32.5% bf16 MFU | 125460 tok/s step 5549/19560 | loss 3.554636 (+0.06z)| norm 0.3118 (+1.55z)| lr 5.07e-04 | 4150.84 ms | 32.5% bf16 MFU | 125502 tok/s step 5550/19560 | loss 3.555153 (+0.06z)| norm 0.2916 (+0.42z)| lr 5.07e-04 | 4155.34 ms | 32.5% bf16 MFU | 125536 tok/s step 5551/19560 | loss 3.503079 (-1.34z)| norm 0.2702 (-0.76z)| lr 5.07e-04 | 4159.35 ms | 32.5% bf16 MFU | 125561 tok/s step 5552/19560 | loss 3.598535 (+1.25z)| norm 0.2697 (-0.78z)| lr 5.07e-04 | 4156.71 ms | 32.5% bf16 MFU | 125590 tok/s step 5553/19560 | loss 3.535732 (-0.45z)| norm 0.3401 (+3.02z)| lr 5.07e-04 | 4157.62 ms | 32.5% bf16 MFU | 
125616 tok/s step 5554/19560 | loss 3.522878 (-0.80z)| norm 0.2653 (-1.04z)| lr 5.07e-04 | 4155.28 ms | 32.5% bf16 MFU | 125643 tok/s step 5555/19560 | loss 3.515100 (-1.02z)| norm 0.2555 (-1.56z)| lr 5.07e-04 | 4152.60 ms | 32.5% bf16 MFU | 125674 tok/s step 5556/19560 | loss 3.532009 (-0.55z)| norm 0.2650 (-1.03z)| lr 5.07e-04 | 4166.63 ms | 32.4% bf16 MFU | 125682 tok/s step 5557/19560 | loss 3.563045 (+0.30z)| norm 0.2642 (-1.06z)| lr 5.07e-04 | 4153.67 ms | 32.5% bf16 MFU | 125709 tok/s step 5558/19560 | loss 3.579961 (+0.77z)| norm 0.2781 (-0.31z)| lr 5.07e-04 | 4164.24 ms | 32.4% bf16 MFU | 125719 tok/s step 5559/19560 | loss 3.491930 (-1.62z)| norm 0.2968 (+0.70z)| lr 5.07e-04 | 4149.80 ms | 32.5% bf16 MFU | 125750 tok/s step 5560/19560 | loss 3.526494 (-0.67z)| norm 0.2944 (+0.57z)| lr 5.07e-04 | 4148.57 ms | 32.5% bf16 MFU | 125781 tok/s step 5561/19560 | loss 3.542410 (-0.24z)| norm 0.2690 (-0.80z)| lr 5.07e-04 | 4152.81 ms | 32.5% bf16 MFU | 125805 tok/s step 5562/19560 | loss 3.551370 (+0.01z)| norm 0.2702 (-0.73z)| lr 5.07e-04 | 4156.37 ms | 32.5% bf16 MFU | 125821 tok/s step 5563/19560 | loss 3.567635 (+0.45z)| norm 0.2629 (-1.13z)| lr 5.07e-04 | 4149.46 ms | 32.5% bf16 MFU | 125848 tok/s step 5564/19560 | loss 3.508265 (-1.15z)| norm 0.2339 (-2.65z)| lr 5.07e-04 | 4188.85 ms | 32.2% bf16 MFU | 125814 tok/s step 5565/19560 | loss 3.498927 (-1.39z)| norm 0.2716 (-0.65z)| lr 5.07e-04 | 4154.14 ms | 32.5% bf16 MFU | 125833 tok/s step 5566/19560 | loss 3.554559 (+0.11z)| norm 0.2690 (-0.80z)| lr 5.07e-04 | 4156.32 ms | 32.5% bf16 MFU | 125849 tok/s step 5567/19560 | loss 3.547494 (-0.08z)| norm 0.2876 (+0.19z)| lr 5.07e-04 | 4143.57 ms | 32.6% bf16 MFU | 125883 tok/s step 5568/19560 | loss 3.654191 (+2.70z)| norm 0.2895 (+0.29z)| lr 5.07e-04 | 4153.93 ms | 32.5% bf16 MFU | 125899 tok/s step 5569/19560 | loss 3.516871 (-0.96z)| norm 0.2980 (+0.74z)| lr 5.07e-04 | 4166.28 ms | 32.4% bf16 MFU | 125897 tok/s step 5570/19560 | loss 3.534559 (-0.47z)| norm 
0.2997 (+0.82z)| lr 5.07e-04 | 4160.71 ms | 32.5% bf16 MFU | 125902 tok/s step 5571/19560 | loss 3.596441 (+1.20z)| norm 0.2793 (-0.28z)| lr 5.07e-04 | 4150.87 ms | 32.5% bf16 MFU | 125922 tok/s step 5572/19560 | loss 3.526293 (-0.71z)| norm 0.2667 (-0.95z)| lr 5.07e-04 | 4152.17 ms | 32.5% bf16 MFU | 125940 tok/s step 5573/19560 | loss 3.574741 (+0.60z)| norm 0.2747 (-0.51z)| lr 5.07e-04 | 4161.84 ms | 32.4% bf16 MFU | 125942 tok/s step 5574/19560 | loss 3.517255 (-0.94z)| norm 0.2691 (-0.82z)| lr 5.06e-04 | 4151.39 ms | 32.5% bf16 MFU | 125959 tok/s step 5575/19560 | loss 3.509907 (-1.15z)| norm 0.2637 (-1.11z)| lr 5.06e-04 | 4165.30 ms | 32.4% bf16 MFU | 125955 tok/s step 5576/19560 | loss 3.513214 (-1.05z)| norm 0.2759 (-0.45z)| lr 5.06e-04 | 4151.18 ms | 32.5% bf16 MFU | 125972 tok/s step 5577/19560 | loss 3.559959 (+0.25z)| norm 0.2514 (-1.75z)| lr 5.06e-04 | 4160.37 ms | 32.5% bf16 MFU | 125974 tok/s step 5578/19560 | loss 3.509343 (-1.15z)| norm 0.2816 (-0.12z)| lr 5.06e-04 | 4146.96 ms | 32.6% bf16 MFU | 125997 tok/s step 5579/19560 | loss 3.501261 (-1.35z)| norm 0.2728 (-0.59z)| lr 5.06e-04 | 4150.06 ms | 32.5% bf16 MFU | 126014 tok/s step 5580/19560 | loss 3.528352 (-0.61z)| norm 0.2492 (-1.83z)| lr 5.06e-04 | 4147.19 ms | 32.6% bf16 MFU | 126034 tok/s step 5581/19560 | loss 3.515725 (-0.95z)| norm 0.2821 (-0.09z)| lr 5.06e-04 | 4162.38 ms | 32.4% bf16 MFU | 126030 tok/s step 5582/19560 | loss 3.572062 (+0.61z)| norm 0.2940 (+0.53z)| lr 5.06e-04 | 4151.50 ms | 32.5% bf16 MFU | 126043 tok/s step 5583/19560 | loss 3.538188 (-0.33z)| norm 0.2691 (-0.81z)| lr 5.06e-04 | 4182.54 ms | 32.3% bf16 MFU | 126009 tok/s step 5584/19560 | loss 3.612822 (+1.70z)| norm 0.2792 (-0.27z)| lr 5.06e-04 | 4155.99 ms | 32.5% bf16 MFU | 126016 tok/s step 5585/19560 | loss 3.572133 (+0.58z)| norm 0.2957 (+0.61z)| lr 5.06e-04 | 4154.63 ms | 32.5% bf16 MFU | 126025 tok/s step 5586/19560 | loss 3.555688 (+0.13z)| norm 0.2786 (-0.31z)| lr 5.06e-04 | 4154.55 ms | 32.5% bf16 MFU | 
126033 tok/s step 5587/19560 | loss 3.532423 (-0.49z)| norm 0.2668 (-0.95z)| lr 5.06e-04 | 4175.33 ms | 32.3% bf16 MFU | 126010 tok/s step 5588/19560 | loss 3.544157 (-0.16z)| norm 0.2699 (-0.76z)| lr 5.06e-04 | 4153.29 ms | 32.5% bf16 MFU | 126021 tok/s step 5589/19560 | loss 3.545865 (-0.11z)| norm 0.2561 (-1.50z)| lr 5.06e-04 | 4160.20 ms | 32.5% bf16 MFU | 126021 tok/s step 5590/19560 | loss 3.451900 (-2.63z)| norm 0.2774 (-0.34z)| lr 5.06e-04 | 4148.01 ms | 32.5% bf16 MFU | 126040 tok/s step 5591/19560 | loss 3.534666 (-0.39z)| norm 0.2919 (+0.45z)| lr 5.06e-04 | 4153.47 ms | 32.5% bf16 MFU | 126049 tok/s step 5592/19560 | loss 3.540700 (-0.24z)| norm 0.2721 (-0.64z)| lr 5.06e-04 | 4154.01 ms | 32.5% bf16 MFU | 126058 tok/s step 5593/19560 | loss 3.520009 (-0.81z)| norm 0.2620 (-1.20z)| lr 5.06e-04 | 4156.68 ms | 32.5% bf16 MFU | 126061 tok/s step 5594/19560 | loss 3.557093 (+0.19z)| norm 0.2938 (+0.54z)| lr 5.06e-04 | 4149.10 ms | 32.5% bf16 MFU | 126076 tok/s step 5595/19560 | loss 3.573260 (+0.63z)| norm 0.2596 (-1.34z)| lr 5.06e-04 | 4147.77 ms | 32.6% bf16 MFU | 126093 tok/s step 5596/19560 | loss 3.520812 (-0.80z)| norm 0.2736 (-0.57z)| lr 5.06e-04 | 4161.86 ms | 32.4% bf16 MFU | 126087 tok/s step 5597/19560 | loss 3.545534 (-0.13z)| norm 0.2927 (+0.48z)| lr 5.06e-04 | 4156.32 ms | 32.5% bf16 MFU | 126089 tok/s step 5598/19560 | loss 3.583232 (+0.89z)| norm 0.3186 (+1.88z)| lr 5.06e-04 | 4154.36 ms | 32.5% bf16 MFU | 126095 tok/s step 5599/19560 | loss 3.555161 (+0.12z)| norm 0.3081 (+1.28z)| lr 5.06e-04 | 4151.21 ms | 32.5% bf16 MFU | 126105 tok/s step 5600/19560 | loss 3.628838 (+2.08z)| norm 0.2808 (-0.22z)| lr 5.06e-04 | 4145.74 ms | 32.6% bf16 MFU | 126123 tok/s step 5601/19560 | loss 3.541303 (-0.27z)| norm 0.2900 (+0.30z)| lr 5.05e-04 | 4150.87 ms | 32.5% bf16 MFU | 126132 tok/s step 5602/19560 | loss 3.542624 (-0.26z)| norm 0.2820 (-0.15z)| lr 5.05e-04 | 4159.55 ms | 32.5% bf16 MFU | 126128 tok/s step 5603/19560 | loss 3.563872 (+0.33z)| norm 
0.2660 (-1.04z)| lr 5.05e-04 | 4146.52 ms | 32.6% bf16 MFU | 126144 tok/s step 5604/19560 | loss 3.535272 (-0.46z)| norm 0.2779 (-0.38z)| lr 5.05e-04 | 4155.17 ms | 32.5% bf16 MFU | 126145 tok/s step 5605/19560 | loss 3.548602 (-0.09z)| norm 0.3198 (+1.91z)| lr 5.05e-04 | 4143.16 ms | 32.6% bf16 MFU | 126165 tok/s step 5606/19560 | loss 3.585118 (+0.90z)| norm 1.0346 (+10.87z)| lr 5.05e-04 | 4163.32 ms | 32.4% bf16 MFU | 126153 tok/s step 5607/19560 | loss 3.625237 (+1.96z)| norm 0.4142 (+1.77z)| lr 5.05e-04 | 4151.43 ms | 32.5% bf16 MFU | 126160 tok/s step 5608/19560 | loss 3.511911 (-1.09z)| norm 0.4024 (+1.57z)| lr 5.05e-04 | 4147.64 ms | 32.6% bf16 MFU | 126173 tok/s step 5609/19560 | loss 3.589939 (+1.00z)| norm 0.3306 (+0.54z)| lr 5.05e-04 | 4157.43 ms | 32.5% bf16 MFU | 126169 tok/s step 5610/19560 | loss 3.571723 (+0.53z)| norm 0.3555 (+0.89z)| lr 5.05e-04 | 4153.20 ms | 32.5% bf16 MFU | 126173 tok/s step 5611/19560 | loss 3.555377 (+0.09z)| norm 0.3103 (+0.24z)| lr 5.05e-04 | 4152.03 ms | 32.5% bf16 MFU | 126178 tok/s step 5612/19560 | loss 3.624353 (+1.94z)| norm 0.3120 (+0.26z)| lr 5.05e-04 | 4150.65 ms | 32.5% bf16 MFU | 126185 tok/s step 5613/19560 | loss 3.636417 (+2.21z)| norm 0.3020 (+0.11z)| lr 5.05e-04 | 4154.45 ms | 32.5% bf16 MFU | 126185 tok/s step 5614/19560 | loss 3.569149 (+0.43z)| norm 0.2695 (-0.35z)| lr 5.05e-04 | 4151.67 ms | 32.5% bf16 MFU | 126190 tok/s step 5615/19560 | loss 3.570505 (+0.46z)| norm 0.2717 (-0.32z)| lr 5.05e-04 | 4158.87 ms | 32.5% bf16 MFU | 126184 tok/s step 5616/19560 | loss 3.581176 (+0.74z)| norm 0.2700 (-0.34z)| lr 5.05e-04 | 4145.37 ms | 32.6% bf16 MFU | 126199 tok/s step 5617/19560 | loss 3.490141 (-1.65z)| norm 0.2743 (-0.28z)| lr 5.05e-04 | 4146.00 ms | 32.6% bf16 MFU | 126212 tok/s step 5618/19560 | loss 3.607187 (+1.40z)| norm 0.2683 (-0.36z)| lr 5.05e-04 | 4150.30 ms | 32.5% bf16 MFU | 126217 tok/s step 5619/19560 | loss 3.529046 (-0.64z)| norm 0.2941 (+0.00z)| lr 5.05e-04 | 4162.13 ms | 32.4% bf16 MFU | 
126205 tok/s step 5620/19560 | loss 3.558302 (+0.12z)| norm 0.2830 (-0.15z)| lr 5.05e-04 | 4145.90 ms | 32.6% bf16 MFU | 126217 tok/s step 5621/19560 | loss 3.538203 (-0.41z)| norm 0.2792 (-0.20z)| lr 5.05e-04 | 4153.62 ms | 32.5% bf16 MFU | 126218 tok/s step 5622/19560 | loss 3.524522 (-0.76z)| norm 0.2374 (-0.79z)| lr 5.05e-04 | 4156.69 ms | 32.5% bf16 MFU | 126213 tok/s step 5623/19560 | loss 3.568064 (+0.37z)| norm 0.2802 (-0.19z)| lr 5.05e-04 | 4152.52 ms | 32.5% bf16 MFU | 126216 tok/s step 5624/19560 | loss 3.564008 (+0.26z)| norm 0.2522 (-0.58z)| lr 5.05e-04 | 4164.68 ms | 32.4% bf16 MFU | 126199 tok/s step 5625/19560 | loss 3.527392 (-0.68z)| norm 0.2867 (-0.09z)| lr 5.05e-04 | 4149.66 ms | 32.5% bf16 MFU | 126207 tok/s step 5626/19560 | loss 3.545687 (-0.20z)| norm 0.2693 (-0.34z)| lr 5.05e-04 | 4175.03 ms | 32.3% bf16 MFU | 126175 tok/s step 5627/19560 | loss 3.554741 (+0.04z)| norm 0.2998 (+0.09z)| lr 5.05e-04 | 4154.97 ms | 32.5% bf16 MFU | 126175 tok/s step 5628/19560 | loss 3.550313 (-0.07z)| norm 0.2764 (-0.24z)| lr 5.05e-04 | 4149.90 ms | 32.5% bf16 MFU | 126184 tok/s step 5629/19560 | loss 3.554060 (+0.03z)| norm 0.2689 (-0.34z)| lr 5.04e-04 | 4146.52 ms | 32.6% bf16 MFU | 126196 tok/s step 5630/19560 | loss 3.649846 (+2.47z)| norm 0.3014 (+0.12z)| lr 5.04e-04 | 4161.21 ms | 32.4% bf16 MFU | 126186 tok/s step 5631/19560 | loss 3.548587 (-0.13z)| norm 0.2381 (-0.77z)| lr 5.04e-04 | 4168.71 ms | 32.4% bf16 MFU | 126165 tok/s step 5632/19560 | loss 3.526983 (-0.68z)| norm 0.2787 (-0.20z)| lr 5.04e-04 | 4156.40 ms | 32.5% bf16 MFU | 126164 tok/s step 5633/19560 | loss 3.526911 (-0.67z)| norm 0.2944 (+0.03z)| lr 5.04e-04 | 4154.00 ms | 32.5% bf16 MFU | 126167 tok/s step 5634/19560 | loss 3.588456 (+0.91z)| norm 0.2810 (-0.16z)| lr 5.04e-04 | 4161.70 ms | 32.4% bf16 MFU | 126157 tok/s step 5635/19560 | loss 3.549422 (-0.10z)| norm 0.2818 (-0.15z)| lr 5.04e-04 | 4153.34 ms | 32.5% bf16 MFU | 126161 tok/s step 5636/19560 | loss 3.538767 (-0.37z)| norm 
0.2731 (-0.27z)| lr 5.04e-04 | 4173.51 ms | 32.4% bf16 MFU | 126134 tok/s step 5637/19560 | loss 3.490638 (-1.59z)| norm 0.2676 (-0.35z)| lr 5.04e-04 | 4210.47 ms | 32.1% bf16 MFU | 126053 tok/s step 5638/19560 | loss 3.527971 (-0.62z)| norm 0.2948 (+0.03z)| lr 5.04e-04 | 4183.80 ms | 32.3% bf16 MFU | 126016 tok/s step 5639/19560 | loss 3.541929 (-0.26z)| norm 0.2788 (-0.19z)| lr 5.04e-04 | 4161.52 ms | 32.4% bf16 MFU | 126015 tok/s step 5640/19560 | loss 3.564076 (+0.31z)| norm 0.2770 (-0.21z)| lr 5.04e-04 | 4147.94 ms | 32.6% bf16 MFU | 126034 tok/s step 5641/19560 | loss 3.514583 (-0.97z)| norm 0.3122 (+0.28z)| lr 5.04e-04 | 4155.72 ms | 32.5% bf16 MFU | 126040 tok/s step 5642/19560 | loss 3.552942 (+0.05z)| norm 0.2703 (-0.31z)| lr 5.04e-04 | 4172.81 ms | 32.4% bf16 MFU | 126020 tok/s step 5643/19560 | loss 3.482085 (-1.82z)| norm 0.2780 (-0.20z)| lr 5.04e-04 | 4158.45 ms | 32.5% bf16 MFU | 126023 tok/s step 5644/19560 | loss 3.504057 (-1.22z)| norm 0.2987 (+0.10z)| lr 5.04e-04 | 4161.03 ms | 32.4% bf16 MFU | 126022 tok/s step 5645/19560 | loss 3.514360 (-0.97z)| norm 0.2973 (+0.08z)| lr 5.04e-04 | 4152.33 ms | 32.5% bf16 MFU | 126034 tok/s step 5646/19560 | loss 3.473532 (-2.01z)| norm 0.2873 (-0.07z)| lr 5.04e-04 | 4170.32 ms | 32.4% bf16 MFU | 126018 tok/s step 5647/19560 | loss 3.526703 (-0.61z)| norm 0.2778 (-0.20z)| lr 5.04e-04 | 4164.25 ms | 32.4% bf16 MFU | 126013 tok/s step 5648/19560 | loss 3.529136 (-0.53z)| norm 0.2747 (-0.24z)| lr 5.04e-04 | 4169.55 ms | 32.4% bf16 MFU | 125999 tok/s step 5649/19560 | loss 3.486881 (-1.64z)| norm 0.2654 (-0.36z)| lr 5.04e-04 | 4156.35 ms | 32.5% bf16 MFU | 126006 tok/s step 5650/19560 | loss 3.464336 (-2.17z)| norm 0.2883 (-0.04z)| lr 5.04e-04 | 4162.88 ms | 32.4% bf16 MFU | 126003 tok/s step 5651/19560 | loss 3.555927 (+0.21z)| norm 0.2753 (-0.22z)| lr 5.04e-04 | 4162.68 ms | 32.4% bf16 MFU | 126000 tok/s step 5652/19560 | loss 3.482493 (-1.68z)| norm 0.2567 (-0.48z)| lr 5.04e-04 | 4160.72 ms | 32.5% bf16 MFU | 
126001 tok/s step 5653/19560 | loss 3.486504 (-1.55z)| norm 0.2752 (-0.22z)| lr 5.04e-04 | 4156.52 ms | 32.5% bf16 MFU | 126008 tok/s step 5654/19560 | loss 3.490989 (-1.42z)| norm 0.2778 (-0.18z)| lr 5.04e-04 | 4154.42 ms | 32.5% bf16 MFU | 126017 tok/s step 5655/19560 | loss 3.507491 (-0.99z)| norm 0.2966 (+0.09z)| lr 5.04e-04 | 4170.79 ms | 32.4% bf16 MFU | 126002 tok/s step 5656/19560 | loss 3.541539 (-0.11z)| norm 0.2771 (-0.19z)| lr 5.03e-04 | 4165.35 ms | 32.4% bf16 MFU | 125995 tok/s step 5657/19560 | loss 3.569157 (+0.60z)| norm 0.2574 (-0.46z)| lr 5.03e-04 | 4161.67 ms | 32.4% bf16 MFU | 125994 tok/s step 5658/19560 | loss 3.474517 (-1.82z)| norm 0.2814 (-0.12z)| lr 5.03e-04 | 4158.11 ms | 32.5% bf16 MFU | 125999 tok/s step 5659/19560 | loss 3.626521 (+2.02z)| norm 0.2568 (-0.47z)| lr 5.03e-04 | 4156.26 ms | 32.5% bf16 MFU | 126006 tok/s step 5660/19560 | loss 3.497039 (-1.22z)| norm 0.2538 (-0.50z)| lr 5.03e-04 | 4162.90 ms | 32.4% bf16 MFU | 126003 tok/s step 5661/19560 | loss 3.490046 (-1.38z)| norm 0.2763 (-0.19z)| lr 5.03e-04 | 4182.55 ms | 32.3% bf16 MFU | 125970 tok/s step 5662/19560 | loss 3.538005 (-0.19z)| norm 0.2844 (-0.07z)| lr 5.03e-04 | 4162.54 ms | 32.4% bf16 MFU | 125970 tok/s step 5663/19560 | loss 3.539107 (-0.14z)| norm 0.2841 (-0.07z)| lr 5.03e-04 | 4166.52 ms | 32.4% bf16 MFU | 125963 tok/s step 5664/19560 | loss 3.465419 (-2.01z)| norm 0.2764 (-0.18z)| lr 5.03e-04 | 4157.59 ms | 32.5% bf16 MFU | 125970 tok/s step 5665/19560 | loss 3.512922 (-0.79z)| norm 0.3722 (+1.16z)| lr 5.03e-04 | 4162.84 ms | 32.4% bf16 MFU | 125969 tok/s step 5666/19560 | loss 3.421286 (-3.02z)| norm 0.3262 (+0.51z)| lr 5.03e-04 | 4161.81 ms | 32.4% bf16 MFU | 125969 tok/s step 5667/19560 | loss 3.519814 (-0.56z)| norm 0.2925 (+0.04z)| lr 5.03e-04 | 4153.22 ms | 32.5% bf16 MFU | 125982 tok/s step 5668/19560 | loss 3.579230 (+0.91z)| norm 0.2968 (+0.10z)| lr 5.03e-04 | 4177.14 ms | 32.3% bf16 MFU | 125959 tok/s step 5669/19560 | loss 3.560232 (+0.45z)| norm 
0.2818 (-0.12z)| lr 5.03e-04 | 4164.07 ms | 32.4% bf16 MFU | 125956 tok/s step 5670/19560 | loss 3.540072 (-0.05z)| norm 0.2727 (-0.24z)| lr 5.03e-04 | 4159.72 ms | 32.5% bf16 MFU | 125960 tok/s step 5671/19560 | loss 3.573074 (+0.78z)| norm 0.3387 (+0.68z)| lr 5.03e-04 | 4150.03 ms | 32.5% bf16 MFU | 125979 tok/s step 5672/19560 | loss 3.488330 (-1.33z)| norm 0.3191 (+0.40z)| lr 5.03e-04 | 4191.06 ms | 32.2% bf16 MFU | 125935 tok/s step 5673/19560 | loss 3.691140 (+3.54z)| norm 0.2656 (-0.34z)| lr 5.03e-04 | 4167.14 ms | 32.4% bf16 MFU | 125929 tok/s step 5674/19560 | loss 3.545489 (+0.08z)| norm 0.3449 (+0.76z)| lr 5.03e-04 | 4154.00 ms | 32.5% bf16 MFU | 125943 tok/s step 5675/19560 | loss 3.539448 (-0.07z)| norm 0.3233 (+0.46z)| lr 5.03e-04 | 4164.68 ms | 32.4% bf16 MFU | 125940 tok/s step 5676/19560 | loss 3.506422 (-0.85z)| norm 0.2806 (-0.14z)| lr 5.03e-04 | 4155.20 ms | 32.5% bf16 MFU | 125952 tok/s step 5677/19560 | loss 3.515363 (-0.63z)| norm 0.2980 (+0.10z)| lr 5.03e-04 | 4152.78 ms | 32.5% bf16 MFU | 125967 tok/s step 5678/19560 | loss 3.520806 (-0.50z)| norm 0.3060 (+0.22z)| lr 5.03e-04 | 4160.52 ms | 32.5% bf16 MFU | 125970 tok/s step 5679/19560 | loss 3.540955 (-0.02z)| norm 0.2621 (-0.40z)| lr 5.03e-04 | 4153.98 ms | 32.5% bf16 MFU | 125982 tok/s step 5680/19560 | loss 3.466282 (-1.77z)| norm 0.2614 (-0.41z)| lr 5.03e-04 | 4164.67 ms | 32.4% bf16 MFU | 125977 tok/s step 5681/19560 | loss 3.558374 (+0.41z)| norm 0.2879 (-0.03z)| lr 5.03e-04 | 4172.26 ms | 32.4% bf16 MFU | 125961 tok/s step 5682/19560 | loss 3.501673 (-0.93z)| norm 0.2655 (-0.34z)| lr 5.03e-04 | 4158.98 ms | 32.5% bf16 MFU | 125966 tok/s step 5683/19560 | loss 3.510878 (-0.71z)| norm 0.2927 (+0.03z)| lr 5.02e-04 | 4163.46 ms | 32.4% bf16 MFU | 125964 tok/s step 5684/19560 | loss 3.558587 (+0.41z)| norm 0.2705 (-0.28z)| lr 5.02e-04 | 4158.66 ms | 32.5% bf16 MFU | 125970 tok/s step 5685/19560 | loss 3.517771 (-0.54z)| norm 0.2734 (-0.24z)| lr 5.02e-04 | 4170.25 ms | 32.4% bf16 MFU | 
125957 tok/s step 5686/19560 | loss 3.387293 (-3.44z)| norm 0.2727 (-0.25z)| lr 5.02e-04 | 4169.11 ms | 32.4% bf16 MFU | 125947 tok/s step 5687/19560 | loss 3.593122 (+1.20z)| norm 0.2807 (-0.14z)| lr 5.02e-04 | 4151.94 ms | 32.5% bf16 MFU | 125964 tok/s step 5688/19560 | loss 3.499663 (-0.90z)| norm 0.2670 (-0.32z)| lr 5.02e-04 | 4153.97 ms | 32.5% bf16 MFU | 125976 tok/s step 5689/19560 | loss 3.482560 (-1.27z)| norm 0.2474 (-0.60z)| lr 5.02e-04 | 4187.36 ms | 32.2% bf16 MFU | 125938 tok/s step 5690/19560 | loss 3.541379 (+0.05z)| norm 0.2780 (-0.17z)| lr 5.02e-04 | 4159.01 ms | 32.5% bf16 MFU | 125944 tok/s step 5691/19560 | loss 3.521484 (-0.39z)| norm 0.2855 (-0.06z)| lr 5.02e-04 | 4158.06 ms | 32.5% bf16 MFU | 125951 tok/s step 5692/19560 | loss 3.507890 (-0.69z)| norm 0.2928 (+0.03z)| lr 5.02e-04 | 4168.29 ms | 32.4% bf16 MFU | 125942 tok/s step 5693/19560 | loss 3.498572 (-0.90z)| norm 0.2591 (-0.44z)| lr 5.02e-04 | 4153.96 ms | 32.5% bf16 MFU | 125956 tok/s step 5694/19560 | loss 3.510219 (-0.63z)| norm 0.3188 (+0.39z)| lr 5.02e-04 | 4154.37 ms | 32.5% bf16 MFU | 125968 tok/s step 5695/19560 | loss 3.562556 (+0.53z)| norm 0.2917 (+0.01z)| lr 5.02e-04 | 4151.51 ms | 32.5% bf16 MFU | 125984 tok/s step 5696/19560 | loss 3.494565 (-0.98z)| norm 0.3123 (+0.30z)| lr 5.02e-04 | 4157.20 ms | 32.5% bf16 MFU | 125991 tok/s step 5697/19560 | loss 3.515915 (-0.49z)| norm 0.3448 (+0.75z)| lr 5.02e-04 | 4154.23 ms | 32.5% bf16 MFU | 126002 tok/s step 5698/19560 | loss 3.483642 (-1.21z)| norm 0.2846 (-0.09z)| lr 5.02e-04 | 4167.61 ms | 32.4% bf16 MFU | 125992 tok/s step 5699/19560 | loss 3.514724 (-0.50z)| norm 0.2978 (+0.09z)| lr 5.02e-04 | 4155.70 ms | 32.5% bf16 MFU | 126000 tok/s step 5700/19560 | loss 3.621811 (+1.91z)| norm 0.2991 (+0.10z)| lr 5.02e-04 | 4151.80 ms | 32.5% bf16 MFU | 126014 tok/s step 5701/19560 | loss 3.529988 (-0.15z)| norm 0.2813 (-0.15z)| lr 5.02e-04 | 4161.90 ms | 32.4% bf16 MFU | 126012 tok/s step 5702/19560 | loss 3.524668 (-0.28z)| norm 
0.2888 (-0.05z)| lr 5.02e-04 | 4165.66 ms | 32.4% bf16 MFU | 126004 tok/s step 5703/19560 | loss 3.543334 (+0.14z)| norm 0.2664 (-0.36z)| lr 5.02e-04 | 4154.39 ms | 32.5% bf16 MFU | 126014 tok/s step 5704/19560 | loss 3.477641 (-1.33z)| norm 0.2815 (-0.15z)| lr 5.02e-04 | 4171.60 ms | 32.4% bf16 MFU | 125998 tok/s step 5705/19560 | loss 3.543876 (+0.16z)| norm 0.2820 (-0.15z)| lr 5.02e-04 | 4161.11 ms | 32.4% bf16 MFU | 125998 tok/s step 5706/19560 | loss 3.486099 (-1.13z)| norm 0.2765 (-0.22z)| lr 5.02e-04 | 4151.49 ms | 32.5% bf16 MFU | 126012 tok/s step 5707/19560 | loss 3.499481 (-0.83z)| norm 0.2829 (-0.13z)| lr 5.02e-04 | 4166.52 ms | 32.4% bf16 MFU | 126003 tok/s step 5708/19560 | loss 3.432879 (-2.27z)| norm 0.2596 (-0.46z)| lr 5.02e-04 | 4167.17 ms | 32.4% bf16 MFU | 125994 tok/s step 5709/19560 | loss 3.507236 (-0.63z)| norm 0.2835 (-0.13z)| lr 5.02e-04 | 4151.21 ms | 32.5% bf16 MFU | 126009 tok/s step 5710/19560 | loss 3.498695 (-0.80z)| norm 0.2938 (+0.02z)| lr 5.01e-04 | 4164.79 ms | 32.4% bf16 MFU | 126003 tok/s step 5711/19560 | loss 3.478141 (-1.24z)| norm 0.2880 (-0.06z)| lr 5.01e-04 | 4163.87 ms | 32.4% bf16 MFU | 125998 tok/s step 5712/19560 | loss 3.495177 (-0.85z)| norm 0.3171 (+0.34z)| lr 5.01e-04 | 4159.29 ms | 32.5% bf16 MFU | 126001 tok/s step 5713/19560 | loss 3.457603 (-1.65z)| norm 0.2853 (-0.11z)| lr 5.01e-04 | 4174.46 ms | 32.3% bf16 MFU | 125981 tok/s step 5714/19560 | loss 3.477764 (-1.19z)| norm 0.3043 (+0.16z)| lr 5.01e-04 | 4176.85 ms | 32.3% bf16 MFU | 125958 tok/s step 5715/19560 | loss 3.570474 (+0.83z)| norm 0.3216 (+0.40z)| lr 5.01e-04 | 4164.50 ms | 32.4% bf16 MFU | 125955 tok/s step 5716/19560 | loss 3.497331 (-0.75z)| norm 0.3437 (+0.70z)| lr 5.01e-04 | 4162.94 ms | 32.4% bf16 MFU | 125954 tok/s step 5717/19560 | loss 3.535724 (+0.08z)| norm 0.3179 (+0.33z)| lr 5.01e-04 | 4161.51 ms | 32.4% bf16 MFU | 125955 tok/s step 5718/19560 | loss 3.539256 (+0.14z)| norm 0.2437 (-0.71z)| lr 5.01e-04 | 4155.53 ms | 32.5% bf16 MFU | 
125966 tok/s step 5719/19560 | loss 3.482278 (-1.09z)| norm 0.3017 (+0.10z)| lr 5.01e-04 | 4173.58 ms | 32.4% bf16 MFU | 125949 tok/s step 5720/19560 | loss 3.504858 (-0.59z)| norm 0.2724 (-0.31z)| lr 5.01e-04 | 4156.87 ms | 32.5% bf16 MFU | 125958 tok/s step 5721/19560 | loss 3.461317 (-1.52z)| norm 0.3440 (+0.69z)| lr 5.01e-04 | 4168.19 ms | 32.4% bf16 MFU | 125949 tok/s step 5722/19560 | loss 3.529664 (-0.04z)| norm 0.2980 (+0.04z)| lr 5.01e-04 | 4159.81 ms | 32.5% bf16 MFU | 125953 tok/s step 5723/19560 | loss 3.575229 (+0.95z)| norm 0.2866 (-0.12z)| lr 5.01e-04 | 4168.97 ms | 32.4% bf16 MFU | 125944 tok/s step 5724/19560 | loss 3.534220 (+0.06z)| norm 0.3379 (+0.59z)| lr 5.01e-04 | 4163.16 ms | 32.4% bf16 MFU | 125943 tok/s step 5725/19560 | loss 3.548157 (+0.36z)| norm 0.3286 (+0.46z)| lr 5.01e-04 | 4228.64 ms | 31.9% bf16 MFU | 125845 tok/s step 5726/19560 | loss 3.560451 (+0.63z)| norm 0.2959 (+0.00z)| lr 5.01e-04 | 4158.26 ms | 32.5% bf16 MFU | 125857 tok/s step 5727/19560 | loss 3.470043 (-1.31z)| norm 0.2637 (-0.44z)| lr 5.01e-04 | 4168.95 ms | 32.4% bf16 MFU | 125852 tok/s step 5728/19560 | loss 3.510576 (-0.42z)| norm 0.2865 (-0.13z)| lr 5.01e-04 | 4154.01 ms | 32.5% bf16 MFU | 125870 tok/s step 5729/19560 | loss 3.542757 (+0.29z)| norm 0.2775 (-0.25z)| lr 5.01e-04 | 4160.74 ms | 32.5% bf16 MFU | 125877 tok/s step 5730/19560 | loss 3.553443 (+0.52z)| norm 0.2847 (-0.15z)| lr 5.01e-04 | 4170.45 ms | 32.4% bf16 MFU | 125869 tok/s step 5731/19560 | loss 3.494412 (-0.77z)| norm 0.2538 (-0.58z)| lr 5.01e-04 | 4165.36 ms | 32.4% bf16 MFU | 125869 tok/s step 5732/19560 | loss 3.522061 (-0.16z)| norm 0.2770 (-0.26z)| lr 5.01e-04 | 4157.95 ms | 32.5% bf16 MFU | 125880 tok/s step 5733/19560 | loss 3.526395 (-0.06z)| norm 0.2692 (-0.36z)| lr 5.01e-04 | 4163.48 ms | 32.4% bf16 MFU | 125883 tok/s step 5734/19560 | loss 3.477762 (-1.11z)| norm 0.2864 (-0.09z)| lr 5.01e-04 | 4159.29 ms | 32.5% bf16 MFU | 125891 tok/s step 5735/19560 | loss 3.525145 (-0.05z)| norm 
0.2684 (-0.75z)| lr 5.01e-04 | 4164.22 ms | 32.4% bf16 MFU | 125892 tok/s step 5736/19560 | loss 3.471547 (-1.24z)| norm 0.2702 (-0.69z)| lr 5.01e-04 | 4162.14 ms | 32.4% bf16 MFU | 125895 tok/s step 5737/19560 | loss 3.480487 (-1.02z)| norm 0.2782 (-0.35z)| lr 5.00e-04 | 4174.33 ms | 32.3% bf16 MFU | 125880 tok/s step 5738/19560 | loss 3.505149 (-0.46z)| norm 0.2509 (-1.50z)| lr 5.00e-04 | 4153.35 ms | 32.5% bf16 MFU | 125898 tok/s step 5739/19560 | loss 3.537538 (+0.27z)| norm 0.2766 (-0.38z)| lr 5.00e-04 | 4181.05 ms | 32.3% bf16 MFU | 125873 tok/s step 5740/19560 | loss 3.642848 (+2.61z)| norm 0.2896 (+0.19z)| lr 5.00e-04 | 4157.40 ms | 32.5% bf16 MFU | 125885 tok/s step 5741/19560 | loss 3.536462 (+0.26z)| norm 0.2612 (-1.03z)| lr 5.00e-04 | 4160.14 ms | 32.5% bf16 MFU | 125892 tok/s step 5742/19560 | loss 3.577383 (+1.19z)| norm 0.2691 (-0.69z)| lr 5.00e-04 | 4160.73 ms | 32.5% bf16 MFU | 125898 tok/s step 5743/19560 | loss 3.444076 (-1.81z)| norm 0.2938 (+0.38z)| lr 5.00e-04 | 4161.36 ms | 32.4% bf16 MFU | 125902 tok/s step 5744/19560 | loss 3.522678 (-0.02z)| norm 0.2573 (-1.20z)| lr 5.00e-04 | 4158.61 ms | 32.5% bf16 MFU | 125911 tok/s step 5745/19560 | loss 3.490566 (-0.75z)| norm 0.2922 (+0.31z)| lr 5.00e-04 | 4158.50 ms | 32.5% bf16 MFU | 125919 tok/s step 5746/19560 | loss 3.503096 (-0.45z)| norm 0.2825 (-0.12z)| lr 5.00e-04 | 4153.64 ms | 32.5% bf16 MFU | 125934 tok/s step 5747/19560 | loss 3.497210 (-0.58z)| norm 0.2789 (-0.27z)| lr 5.00e-04 | 4150.56 ms | 32.5% bf16 MFU | 125954 tok/s step 5748/19560 | loss 3.523208 (+0.02z)| norm 0.2840 (-0.05z)| lr 5.00e-04 | 4168.63 ms | 32.4% bf16 MFU | 125944 tok/s step 5749/19560 | loss 3.496677 (-0.58z)| norm 0.2927 (+0.32z)| lr 5.00e-04 | 4166.83 ms | 32.4% bf16 MFU | 125938 tok/s step 5750/19560 | loss 3.466326 (-1.26z)| norm 0.2881 (+0.11z)| lr 5.00e-04 | 4148.26 ms | 32.5% bf16 MFU | 125961 tok/s val loss 3.517663 evaluating HellaSwag: 0/1256 evaluating HellaSwag: 10/1256 evaluating HellaSwag: 20/1256 
evaluating HellaSwag: 30/1256 evaluating HellaSwag: 40/1256 evaluating HellaSwag: 50/1256 evaluating HellaSwag: 60/1256 evaluating HellaSwag: 70/1256 evaluating HellaSwag: 80/1256 evaluating HellaSwag: 90/1256 evaluating HellaSwag: 100/1256 evaluating HellaSwag: 110/1256 evaluating HellaSwag: 120/1256 evaluating HellaSwag: 130/1256 evaluating HellaSwag: 140/1256 evaluating HellaSwag: 150/1256 evaluating HellaSwag: 160/1256 evaluating HellaSwag: 170/1256 evaluating HellaSwag: 180/1256 evaluating HellaSwag: 190/1256 evaluating HellaSwag: 200/1256 evaluating HellaSwag: 210/1256 evaluating HellaSwag: 220/1256 evaluating HellaSwag: 230/1256 evaluating HellaSwag: 240/1256 evaluating HellaSwag: 250/1256 evaluating HellaSwag: 260/1256 evaluating HellaSwag: 270/1256 evaluating HellaSwag: 280/1256 evaluating HellaSwag: 290/1256 evaluating HellaSwag: 300/1256 evaluating HellaSwag: 310/1256 evaluating HellaSwag: 320/1256 evaluating HellaSwag: 330/1256 evaluating HellaSwag: 340/1256 evaluating HellaSwag: 350/1256 evaluating HellaSwag: 360/1256 evaluating HellaSwag: 370/1256 evaluating HellaSwag: 380/1256 evaluating HellaSwag: 390/1256 evaluating HellaSwag: 400/1256 evaluating HellaSwag: 410/1256 evaluating HellaSwag: 420/1256 evaluating HellaSwag: 430/1256 evaluating HellaSwag: 440/1256 evaluating HellaSwag: 450/1256 evaluating HellaSwag: 460/1256 evaluating HellaSwag: 470/1256 evaluating HellaSwag: 480/1256 evaluating HellaSwag: 490/1256 evaluating HellaSwag: 500/1256 evaluating HellaSwag: 510/1256 evaluating HellaSwag: 520/1256 evaluating HellaSwag: 530/1256 evaluating HellaSwag: 540/1256 evaluating HellaSwag: 550/1256 evaluating HellaSwag: 560/1256 evaluating HellaSwag: 570/1256 evaluating HellaSwag: 580/1256 evaluating HellaSwag: 590/1256 evaluating HellaSwag: 600/1256 evaluating HellaSwag: 610/1256 evaluating HellaSwag: 620/1256 evaluating HellaSwag: 630/1256 evaluating HellaSwag: 640/1256 evaluating HellaSwag: 650/1256 evaluating HellaSwag: 660/1256 evaluating HellaSwag: 
670/1256 evaluating HellaSwag: 680/1256 evaluating HellaSwag: 690/1256 evaluating HellaSwag: 700/1256 evaluating HellaSwag: 710/1256 evaluating HellaSwag: 720/1256 evaluating HellaSwag: 730/1256 evaluating HellaSwag: 740/1256 evaluating HellaSwag: 750/1256 evaluating HellaSwag: 760/1256 evaluating HellaSwag: 770/1256 evaluating HellaSwag: 780/1256 evaluating HellaSwag: 790/1256 evaluating HellaSwag: 800/1256 evaluating HellaSwag: 810/1256 evaluating HellaSwag: 820/1256 evaluating HellaSwag: 830/1256 evaluating HellaSwag: 840/1256 evaluating HellaSwag: 850/1256 evaluating HellaSwag: 860/1256 evaluating HellaSwag: 870/1256 evaluating HellaSwag: 880/1256 evaluating HellaSwag: 890/1256 evaluating HellaSwag: 900/1256 evaluating HellaSwag: 910/1256 evaluating HellaSwag: 920/1256 evaluating HellaSwag: 930/1256 evaluating HellaSwag: 940/1256 evaluating HellaSwag: 950/1256 evaluating HellaSwag: 960/1256 evaluating HellaSwag: 970/1256 evaluating HellaSwag: 980/1256 evaluating HellaSwag: 990/1256 evaluating HellaSwag: 1000/1256 evaluating HellaSwag: 1010/1256 evaluating HellaSwag: 1020/1256 evaluating HellaSwag: 1030/1256 evaluating HellaSwag: 1040/1256 evaluating HellaSwag: 1050/1256 evaluating HellaSwag: 1060/1256 evaluating HellaSwag: 1070/1256 evaluating HellaSwag: 1080/1256 evaluating HellaSwag: 1090/1256 evaluating HellaSwag: 1100/1256 evaluating HellaSwag: 1110/1256 evaluating HellaSwag: 1120/1256 evaluating HellaSwag: 1130/1256 evaluating HellaSwag: 1140/1256 evaluating HellaSwag: 1150/1256 evaluating HellaSwag: 1160/1256 evaluating HellaSwag: 1170/1256 evaluating HellaSwag: 1180/1256 evaluating HellaSwag: 1190/1256 evaluating HellaSwag: 1200/1256 evaluating HellaSwag: 1210/1256 evaluating HellaSwag: 1220/1256 evaluating HellaSwag: 1230/1256 evaluating HellaSwag: 1240/1256 evaluating HellaSwag: 1250/1256 HellaSwag: 2805/10042 = 0.279327 step 5751/19560 | loss 3.487721 (-0.76z)| norm 0.3110 (+1.10z)| lr 5.00e-04 | 4213.46 ms | 32.0% bf16 MFU | 125884 tok/s step 
5752/19560 | loss 3.444635 (-1.72z)| norm 0.2923 (+0.27z)| lr 5.00e-04 | 4165.73 ms | 32.4% bf16 MFU | 125883 tok/s step 5753/19560 | loss 3.503162 (-0.38z)| norm 0.2446 (-1.80z)| lr 5.00e-04 | 4154.68 ms | 32.5% bf16 MFU | 125898 tok/s step 5754/19560 | loss 3.514962 (-0.10z)| norm 0.2779 (-0.35z)| lr 5.00e-04 | 4155.30 ms | 32.5% bf16 MFU | 125912 tok/s step 5755/19560 | loss 3.537308 (+0.41z)| norm 0.2589 (-1.16z)| lr 5.00e-04 | 4157.85 ms | 32.5% bf16 MFU | 125921 tok/s step 5756/19560 | loss 3.437606 (-1.83z)| norm 0.2673 (-0.79z)| lr 5.00e-04 | 4157.20 ms | 32.5% bf16 MFU | 125931 tok/s step 5757/19560 | loss 3.473987 (-0.99z)| norm 0.2927 (+0.31z)| lr 5.00e-04 | 4158.02 ms | 32.5% bf16 MFU | 125939 tok/s step 5758/19560 | loss 3.493170 (-0.55z)| norm 0.2712 (-0.62z)| lr 5.00e-04 | 4150.95 ms | 32.5% bf16 MFU | 125957 tok/s step 5759/19560 | loss 3.545382 (+0.67z)| norm 0.2814 (-0.19z)| lr 5.00e-04 | 4168.90 ms | 32.4% bf16 MFU | 125948 tok/s step 5760/19560 | loss 3.494986 (-0.50z)| norm 0.2764 (-0.42z)| lr 5.00e-04 | 4156.23 ms | 32.5% bf16 MFU | 125957 tok/s step 5761/19560 | loss 3.500615 (-0.36z)| norm 0.2682 (-0.77z)| lr 5.00e-04 | 4162.99 ms | 32.4% bf16 MFU | 125957 tok/s step 5762/19560 | loss 3.528213 (+0.30z)| norm 0.2567 (-1.26z)| lr 5.00e-04 | 4155.81 ms | 32.5% bf16 MFU | 125967 tok/s step 5763/19560 | loss 3.460431 (-1.28z)| norm 0.2731 (-0.54z)| lr 5.00e-04 | 4190.35 ms | 32.2% bf16 MFU | 125924 tok/s step 5764/19560 | loss 3.539880 (+0.59z)| norm 0.2641 (-0.93z)| lr 4.99e-04 | 4169.99 ms | 32.4% bf16 MFU | 125914 tok/s step 5765/19560 | loss 3.484027 (-0.73z)| norm 0.2907 (+0.23z)| lr 4.99e-04 | 4156.25 ms | 32.5% bf16 MFU | 125926 tok/s step 5766/19560 | loss 3.443546 (-1.65z)| norm 0.2961 (+0.47z)| lr 4.99e-04 | 4163.05 ms | 32.4% bf16 MFU | 125927 tok/s step 5767/19560 | loss 3.488939 (-0.58z)| norm 0.2624 (-1.00z)| lr 4.99e-04 | 4236.88 ms | 31.9% bf16 MFU | 125817 tok/s step 5768/19560 | loss 3.541562 (+0.65z)| norm 0.2832 (-0.09z)| lr 
4.99e-04 | 4160.27 ms | 32.5% bf16 MFU | 125828 tok/s step 5769/19560 | loss 3.523934 (+0.24z)| norm 0.2890 (+0.17z)| lr 4.99e-04 | 4159.27 ms | 32.5% bf16 MFU | 125839 tok/s step 5770/19560 | loss 3.511668 (-0.04z)| norm 0.2766 (-0.38z)| lr 4.99e-04 | 4151.64 ms | 32.5% bf16 MFU | 125861 tok/s step 5771/19560 | loss 3.549807 (+0.84z)| norm 0.2829 (-0.10z)| lr 4.99e-04 | 4161.36 ms | 32.4% bf16 MFU | 125868 tok/s step 5772/19560 | loss 3.558690 (+1.04z)| norm 0.3599 (+3.15z)| lr 4.99e-04 | 4158.74 ms | 32.5% bf16 MFU | 125878 tok/s step 5773/19560 | loss 3.589900 (+1.73z)| norm 0.3114 (+1.08z)| lr 4.99e-04 | 4154.15 ms | 32.5% bf16 MFU | 125894 tok/s step 5774/19560 | loss 3.479609 (-0.82z)| norm 0.2810 (-0.20z)| lr 4.99e-04 | 4188.55 ms | 32.2% bf16 MFU | 125858 tok/s step 5775/19560 | loss 3.564546 (+1.13z)| norm 0.2910 (+0.21z)| lr 4.99e-04 | 4153.52 ms | 32.5% bf16 MFU | 125877 tok/s step 5776/19560 | loss 3.534888 (+0.45z)| norm 0.3088 (+0.95z)| lr 4.99e-04 | 4169.40 ms | 32.4% bf16 MFU | 125870 tok/s step 5777/19560 | loss 3.525154 (+0.22z)| norm 0.2833 (-0.13z)| lr 4.99e-04 | 4148.58 ms | 32.5% bf16 MFU | 125895 tok/s step 5778/19560 | loss 3.474308 (-0.95z)| norm 0.2621 (-1.01z)| lr 4.99e-04 | 4153.90 ms | 32.5% bf16 MFU | 125911 tok/s step 5779/19560 | loss 3.514476 (-0.02z)| norm 0.2617 (-1.02z)| lr 4.99e-04 | 4161.20 ms | 32.4% bf16 MFU | 125916 tok/s step 5780/19560 | loss 3.535473 (+0.46z)| norm 0.2787 (-0.32z)| lr 4.99e-04 | 4160.45 ms | 32.5% bf16 MFU | 125921 tok/s step 5781/19560 | loss 3.574471 (+1.34z)| norm 0.2972 (+0.46z)| lr 4.99e-04 | 4158.55 ms | 32.5% bf16 MFU | 125928 tok/s step 5782/19560 | loss 3.507001 (-0.22z)| norm 0.2785 (-0.33z)| lr 4.99e-04 | 4168.51 ms | 32.4% bf16 MFU | 125921 tok/s step 5783/19560 | loss 3.513696 (-0.07z)| norm 0.2564 (-1.24z)| lr 4.99e-04 | 4156.72 ms | 32.5% bf16 MFU | 125931 tok/s step 5784/19560 | loss 3.458392 (-1.32z)| norm 0.2864 (+0.01z)| lr 4.99e-04 | 4168.32 ms | 32.4% bf16 MFU | 125924 tok/s step 
5785/19560 | loss 3.546947 (+0.72z)| norm 0.2792 (-0.30z)| lr 4.99e-04 | 4156.77 ms | 32.5% bf16 MFU | 125934 tok/s step 5786/19560 | loss 3.484553 (-0.72z)| norm 0.2893 (+0.12z)| lr 4.99e-04 | 4236.51 ms | 31.9% bf16 MFU | 125825 tok/s step 5787/19560 | loss 3.538682 (+0.55z)| norm 0.2851 (-0.06z)| lr 4.99e-04 | 4156.07 ms | 32.5% bf16 MFU | 125841 tok/s step 5788/19560 | loss 3.567404 (+1.21z)| norm 0.2779 (-0.38z)| lr 4.99e-04 | 4155.59 ms | 32.5% bf16 MFU | 125857 tok/s step 5789/19560 | loss 3.542662 (+0.62z)| norm 0.2777 (-0.39z)| lr 4.99e-04 | 4158.43 ms | 32.5% bf16 MFU | 125868 tok/s step 5790/19560 | loss 3.493765 (-0.52z)| norm 0.2883 (+0.06z)| lr 4.99e-04 | 4160.53 ms | 32.5% bf16 MFU | 125876 tok/s step 5791/19560 | loss 3.418067 (-2.24z)| norm 0.2806 (-0.26z)| lr 4.98e-04 | 4157.96 ms | 32.5% bf16 MFU | 125886 tok/s step 5792/19560 | loss 3.509927 (-0.12z)| norm 0.2781 (-0.37z)| lr 4.98e-04 | 4167.15 ms | 32.4% bf16 MFU | 125883 tok/s step 5793/19560 | loss 3.438563 (-1.74z)| norm 0.2817 (-0.20z)| lr 4.98e-04 | 4161.50 ms | 32.4% bf16 MFU | 125888 tok/s step 5794/19560 | loss 3.599247 (+1.91z)| norm 0.3099 (+1.09z)| lr 4.98e-04 | 4158.54 ms | 32.5% bf16 MFU | 125897 tok/s step 5795/19560 | loss 3.550727 (+0.79z)| norm 0.3622 (+3.31z)| lr 4.98e-04 | 4187.52 ms | 32.2% bf16 MFU | 125863 tok/s step 5796/19560 | loss 3.476770 (-0.89z)| norm 0.2987 (+0.53z)| lr 4.98e-04 | 4154.65 ms | 32.5% bf16 MFU | 125879 tok/s step 5797/19560 | loss 3.483527 (-0.73z)| norm 0.2697 (-0.73z)| lr 4.98e-04 | 4166.99 ms | 32.4% bf16 MFU | 125876 tok/s step 5798/19560 | loss 3.498995 (-0.36z)| norm 0.2998 (+0.57z)| lr 4.98e-04 | 4146.72 ms | 32.6% bf16 MFU | 125904 tok/s step 5799/19560 | loss 3.592966 (+1.80z)| norm 0.2850 (-0.05z)| lr 4.98e-04 | 4171.93 ms | 32.4% bf16 MFU | 125892 tok/s step 5800/19560 | loss 3.517390 (+0.06z)| norm 0.2734 (-0.56z)| lr 4.98e-04 | 4159.65 ms | 32.5% bf16 MFU | 125900 tok/s step 5801/19560 | loss 3.517477 (+0.10z)| norm 0.2837 (-0.10z)| lr 
4.98e-04 | 4155.77 ms | 32.5% bf16 MFU | 125913 tok/s step 5802/19560 | loss 3.480028 (-0.82z)| norm 0.2994 (+0.64z)| lr 4.98e-04 | 4172.79 ms | 32.4% bf16 MFU | 125899 tok/s step 5803/19560 | loss 3.457410 (-1.35z)| norm 0.3195 (+1.57z)| lr 4.98e-04 | 4163.94 ms | 32.4% bf16 MFU | 125900 tok/s step 5804/19560 | loss 3.507966 (-0.11z)| norm 0.2606 (-1.14z)| lr 4.98e-04 | 4153.49 ms | 32.5% bf16 MFU | 125916 tok/s step 5805/19560 | loss 3.510369 (-0.05z)| norm 0.2883 (+0.14z)| lr 4.98e-04 | 4168.96 ms | 32.4% bf16 MFU | 125909 tok/s step 5806/19560 | loss 3.527719 (+0.37z)| norm 0.2812 (-0.18z)| lr 4.98e-04 | 4199.78 ms | 32.1% bf16 MFU | 125855 tok/s step 5807/19560 | loss 3.513482 (+0.03z)| norm 0.2629 (-1.03z)| lr 4.98e-04 | 4158.12 ms | 32.5% bf16 MFU | 125867 tok/s step 5808/19560 | loss 3.558085 (+1.11z)| norm 0.3176 (+1.47z)| lr 4.98e-04 | 4157.43 ms | 32.5% bf16 MFU | 125879 tok/s step 5809/19560 | loss 3.447499 (-1.58z)| norm 0.3045 (+0.86z)| lr 4.98e-04 | 4161.44 ms | 32.4% bf16 MFU | 125884 tok/s step 5810/19560 | loss 3.544365 (+0.78z)| norm 0.3323 (+2.09z)| lr 4.98e-04 | 4171.38 ms | 32.4% bf16 MFU | 125874 tok/s step 5811/19560 | loss 3.498787 (-0.33z)| norm 0.2792 (-0.32z)| lr 4.98e-04 | 4164.47 ms | 32.4% bf16 MFU | 125875 tok/s step 5812/19560 | loss 3.474618 (-0.91z)| norm 0.2597 (-1.19z)| lr 4.98e-04 | 4160.56 ms | 32.5% bf16 MFU | 125882 tok/s step 5813/19560 | loss 3.521418 (+0.24z)| norm 0.2856 (-0.03z)| lr 4.98e-04 | 4164.82 ms | 32.4% bf16 MFU | 125882 tok/s step 5814/19560 | loss 3.523650 (+0.28z)| norm 0.3198 (+1.49z)| lr 4.98e-04 | 4160.22 ms | 32.5% bf16 MFU | 125889 tok/s step 5815/19560 | loss 3.536234 (+0.62z)| norm 0.2751 (-0.51z)| lr 4.98e-04 | 4154.77 ms | 32.5% bf16 MFU | 125904 tok/s step 5816/19560 | loss 3.571142 (+1.49z)| norm 0.2683 (-0.82z)| lr 4.98e-04 | 4166.76 ms | 32.4% bf16 MFU | 125901 tok/s step 5817/19560 | loss 3.558886 (+1.16z)| norm 0.2898 (+0.13z)| lr 4.97e-04 | 4163.74 ms | 32.4% bf16 MFU | 125901 tok/s step 
5818/19560 | loss 3.483462 (-0.75z)| norm 0.2629 (-1.08z)| lr 4.97e-04 | 4162.85 ms | 32.4% bf16 MFU | 125904 tok/s step 5819/19560 | loss 3.566775 (+1.35z)| norm 0.2797 (-0.32z)| lr 4.97e-04 | 4161.63 ms | 32.4% bf16 MFU | 125907 tok/s step 5820/19560 | loss 3.501375 (-0.30z)| norm 0.2543 (-1.44z)| lr 4.97e-04 | 4151.34 ms | 32.5% bf16 MFU | 125927 tok/s step 5821/19560 | loss 3.469007 (-1.11z)| norm 0.2707 (-0.71z)| lr 4.97e-04 | 4149.36 ms | 32.5% bf16 MFU | 125948 tok/s step 5822/19560 | loss 3.549798 (+0.91z)| norm 0.2971 (+0.49z)| lr 4.97e-04 | 4177.19 ms | 32.3% bf16 MFU | 125926 tok/s step 5823/19560 | loss 3.406823 (-2.59z)| norm 0.2646 (-0.97z)| lr 4.97e-04 | 4153.93 ms | 32.5% bf16 MFU | 125941 tok/s step 5824/19560 | loss 3.514280 (+0.05z)| norm 0.2565 (-1.32z)| lr 4.97e-04 | 4153.17 ms | 32.5% bf16 MFU | 125956 tok/s step 5825/19560 | loss 3.447810 (-1.56z)| norm 0.2565 (-1.31z)| lr 4.97e-04 | 4154.78 ms | 32.5% bf16 MFU | 125967 tok/s step 5826/19560 | loss 3.542769 (+0.74z)| norm 0.2818 (-0.14z)| lr 4.97e-04 | 4155.77 ms | 32.5% bf16 MFU | 125977 tok/s step 5827/19560 | loss 3.510386 (-0.05z)| norm 0.2728 (-0.55z)| lr 4.97e-04 | 4180.50 ms | 32.3% bf16 MFU | 125949 tok/s step 5828/19560 | loss 3.581551 (+1.72z)| norm 0.2780 (-0.31z)| lr 4.97e-04 | 4162.57 ms | 32.4% bf16 MFU | 125949 tok/s step 5829/19560 | loss 3.526394 (+0.36z)| norm 0.2674 (-0.79z)| lr 4.97e-04 | 4167.88 ms | 32.4% bf16 MFU | 125941 tok/s step 5830/19560 | loss 3.569084 (+1.40z)| norm 0.2756 (-0.40z)| lr 4.97e-04 | 4165.61 ms | 32.4% bf16 MFU | 125937 tok/s step 5831/19560 | loss 3.485943 (-0.64z)| norm 0.2841 (-0.02z)| lr 4.97e-04 | 4175.05 ms | 32.3% bf16 MFU | 125919 tok/s step 5832/19560 | loss 3.523004 (+0.27z)| norm 0.2694 (-0.69z)| lr 4.97e-04 | 4175.53 ms | 32.3% bf16 MFU | 125901 tok/s step 5833/19560 | loss 3.491443 (-0.50z)| norm 0.2494 (-1.59z)| lr 4.97e-04 | 4153.92 ms | 32.5% bf16 MFU | 125917 tok/s step 5834/19560 | loss 3.468537 (-1.06z)| norm 0.2549 (-1.32z)| lr 
4.97e-04 | 4154.78 ms | 32.5% bf16 MFU | 125930 tok/s step 5835/19560 | loss 3.583756 (+1.74z)| norm 0.2730 (-0.50z)| lr 4.97e-04 | 4164.18 ms | 32.4% bf16 MFU | 125929 tok/s step 5836/19560 | loss 3.561588 (+1.19z)| norm 0.2699 (-0.64z)| lr 4.97e-04 | 4157.46 ms | 32.5% bf16 MFU | 125938 tok/s step 5837/19560 | loss 3.550341 (+0.90z)| norm 0.2403 (-1.95z)| lr 4.97e-04 | 4161.54 ms | 32.4% bf16 MFU | 125940 tok/s step 5838/19560 | loss 3.497356 (-0.40z)| norm 0.2444 (-1.73z)| lr 4.97e-04 | 4157.23 ms | 32.5% bf16 MFU | 125949 tok/s step 5839/19560 | loss 3.495909 (-0.44z)| norm 0.2731 (-0.45z)| lr 4.97e-04 | 4153.82 ms | 32.5% bf16 MFU | 125963 tok/s step 5840/19560 | loss 3.478732 (-0.86z)| norm 0.2628 (-0.89z)| lr 4.97e-04 | 4153.93 ms | 32.5% bf16 MFU | 125975 tok/s step 5841/19560 | loss 3.514685 (+0.02z)| norm 0.2774 (-0.24z)| lr 4.97e-04 | 4151.70 ms | 32.5% bf16 MFU | 125991 tok/s step 5842/19560 | loss 3.509089 (-0.13z)| norm 0.2830 (+0.02z)| lr 4.97e-04 | 4162.63 ms | 32.4% bf16 MFU | 125989 tok/s step 5843/19560 | loss 3.461097 (-1.30z)| norm 0.2778 (-0.20z)| lr 4.97e-04 | 4157.23 ms | 32.5% bf16 MFU | 125995 tok/s step 5844/19560 | loss 3.504341 (-0.23z)| norm 0.2836 (+0.09z)| lr 4.96e-04 | 4155.34 ms | 32.5% bf16 MFU | 126004 tok/s step 5845/19560 | loss 3.557735 (+1.09z)| norm 0.2719 (-0.45z)| lr 4.96e-04 | 4162.36 ms | 32.4% bf16 MFU | 126002 tok/s step 5846/19560 | loss 3.536441 (+0.56z)| norm 0.2770 (-0.22z)| lr 4.96e-04 | 4154.40 ms | 32.5% bf16 MFU | 126011 tok/s step 5847/19560 | loss 3.511387 (-0.06z)| norm 0.2608 (-0.98z)| lr 4.96e-04 | 4159.03 ms | 32.5% bf16 MFU | 126014 tok/s step 5848/19560 | loss 3.536146 (+0.55z)| norm 0.2765 (-0.23z)| lr 4.96e-04 | 4167.75 ms | 32.4% bf16 MFU | 126003 tok/s step 5849/19560 | loss 3.511704 (-0.07z)| norm 0.3030 (+1.08z)| lr 4.96e-04 | 4170.41 ms | 32.4% bf16 MFU | 125989 tok/s step 5850/19560 | loss 3.472341 (-1.04z)| norm 0.2625 (-0.90z)| lr 4.96e-04 | 4163.83 ms | 32.4% bf16 MFU | 125985 tok/s step 
5851/19560 | loss 3.543094 (+0.73z)| norm 0.2814 (+0.04z)| lr 4.96e-04 | 4159.23 ms | 32.5% bf16 MFU | 125988 tok/s step 5852/19560 | loss 3.497044 (-0.41z)| norm 0.2927 (+0.63z)| lr 4.96e-04 | 4177.62 ms | 32.3% bf16 MFU | 125964 tok/s step 5853/19560 | loss 3.493140 (-0.50z)| norm 0.2727 (-0.38z)| lr 4.96e-04 | 4175.21 ms | 32.3% bf16 MFU | 125944 tok/s step 5854/19560 | loss 3.518765 (+0.15z)| norm 0.2469 (-1.69z)| lr 4.96e-04 | 4159.24 ms | 32.5% bf16 MFU | 125950 tok/s step 5855/19560 | loss 3.498862 (-0.36z)| norm 0.2714 (-0.42z)| lr 4.96e-04 | 4159.41 ms | 32.5% bf16 MFU | 125955 tok/s step 5856/19560 | loss 3.592652 (+1.97z)| norm 0.2584 (-1.08z)| lr 4.96e-04 | 4168.36 ms | 32.4% bf16 MFU | 125946 tok/s step 5857/19560 | loss 3.564079 (+1.25z)| norm 0.3230 (+2.19z)| lr 4.96e-04 | 4156.44 ms | 32.5% bf16 MFU | 125956 tok/s step 5858/19560 | loss 3.564011 (+1.24z)| norm 0.2962 (+0.83z)| lr 4.96e-04 | 4158.60 ms | 32.5% bf16 MFU | 125961 tok/s step 5859/19560 | loss 3.558359 (+1.08z)| norm 0.2962 (+0.81z)| lr 4.96e-04 | 4146.08 ms | 32.6% bf16 MFU | 125986 tok/s step 5860/19560 | loss 3.492916 (-0.52z)| norm 0.2532 (-1.35z)| lr 4.96e-04 | 4164.99 ms | 32.4% bf16 MFU | 125981 tok/s step 5861/19560 | loss 3.500170 (-0.34z)| norm 0.2810 (+0.05z)| lr 4.96e-04 | 4158.91 ms | 32.5% bf16 MFU | 125985 tok/s step 5862/19560 | loss 3.472051 (-1.03z)| norm 0.2804 (+0.02z)| lr 4.96e-04 | 4168.24 ms | 32.4% bf16 MFU | 125975 tok/s step 5863/19560 | loss 3.489850 (-0.59z)| norm 0.2594 (-1.03z)| lr 4.96e-04 | 4157.30 ms | 32.5% bf16 MFU | 125982 tok/s step 5864/19560 | loss 3.457511 (-1.37z)| norm 0.2766 (-0.17z)| lr 4.96e-04 | 4156.76 ms | 32.5% bf16 MFU | 125989 tok/s step 5865/19560 | loss 3.472897 (-0.99z)| norm 0.2514 (-1.41z)| lr 4.96e-04 | 4172.62 ms | 32.4% bf16 MFU | 125972 tok/s step 5866/19560 | loss 3.489886 (-0.57z)| norm 0.2836 (+0.18z)| lr 4.96e-04 | 4167.08 ms | 32.4% bf16 MFU | 125964 tok/s step 5867/19560 | loss 3.471600 (-1.01z)| norm 0.2465 (-1.66z)| lr 
4.96e-04 | 4159.19 ms | 32.5% bf16 MFU | 125969 tok/s step 5868/19560 | loss 3.532689 (+0.53z)| norm 0.2760 (-0.18z)| lr 4.96e-04 | 4160.01 ms | 32.5% bf16 MFU | 125972 tok/s step 5869/19560 | loss 3.506131 (-0.14z)| norm 0.3008 (+1.04z)| lr 4.96e-04 | 4173.03 ms | 32.4% bf16 MFU | 125955 tok/s step 5870/19560 | loss 3.518728 (+0.19z)| norm 0.2872 (+0.35z)| lr 4.95e-04 | 4145.16 ms | 32.6% bf16 MFU | 125982 tok/s step 5871/19560 | loss 3.517726 (+0.15z)| norm 0.2868 (+0.34z)| lr 4.95e-04 | 4157.27 ms | 32.5% bf16 MFU | 125988 tok/s step 5872/19560 | loss 3.543232 (+0.81z)| norm 0.2745 (-0.28z)| lr 4.95e-04 | 4157.16 ms | 32.5% bf16 MFU | 125995 tok/s step 5873/19560 | loss 3.557914 (+1.17z)| norm 0.2557 (-1.21z)| lr 4.95e-04 | 4155.22 ms | 32.5% bf16 MFU | 126004 tok/s step 5874/19560 | loss 3.559840 (+1.20z)| norm 0.3371 (+2.75z)| lr 4.95e-04 | 4161.38 ms | 32.4% bf16 MFU | 126003 tok/s step 5875/19560 | loss 3.496535 (-0.42z)| norm 0.3125 (+1.53z)| lr 4.95e-04 | 4176.04 ms | 32.3% bf16 MFU | 125980 tok/s step 5876/19560 | loss 3.498663 (-0.36z)| norm 0.2592 (-1.01z)| lr 4.95e-04 | 4170.18 ms | 32.4% bf16 MFU | 125967 tok/s step 5877/19560 | loss 3.499324 (-0.35z)| norm 0.2882 (+0.37z)| lr 4.95e-04 | 4164.04 ms | 32.4% bf16 MFU | 125964 tok/s step 5878/19560 | loss 3.512886 (-0.01z)| norm 0.2553 (-1.18z)| lr 4.95e-04 | 4152.68 ms | 32.5% bf16 MFU | 125979 tok/s step 5879/19560 | loss 3.496163 (-0.44z)| norm 0.2688 (-0.52z)| lr 4.95e-04 | 4178.14 ms | 32.3% bf16 MFU | 125954 tok/s step 5880/19560 | loss 3.489222 (-0.64z)| norm 0.2711 (-0.41z)| lr 4.95e-04 | 4159.61 ms | 32.5% bf16 MFU | 125958 tok/s step 5881/19560 | loss 3.544814 (+0.81z)| norm 0.2661 (-0.66z)| lr 4.95e-04 | 4161.35 ms | 32.4% bf16 MFU | 125960 tok/s step 5882/19560 | loss 3.512848 (-0.03z)| norm 0.2820 (+0.10z)| lr 4.95e-04 | 4170.40 ms | 32.4% bf16 MFU | 125948 tok/s step 5883/19560 | loss 3.520531 (+0.18z)| norm 0.2878 (+0.38z)| lr 4.95e-04 | 4157.61 ms | 32.5% bf16 MFU | 125956 tok/s step 
5884/19560 | loss 3.474174 (-1.05z)| norm 0.3114 (+1.50z)| lr 4.95e-04 | 4165.90 ms | 32.4% bf16 MFU | 125950 tok/s step 5885/19560 | loss 3.484833 (-0.78z)| norm 0.2783 (-0.10z)| lr 4.95e-04 | 4166.31 ms | 32.4% bf16 MFU | 125945 tok/s step 5886/19560 | loss 3.481238 (-0.87z)| norm 0.2994 (+0.91z)| lr 4.95e-04 | 4156.60 ms | 32.5% bf16 MFU | 125954 tok/s step 5887/19560 | loss 3.495130 (-0.49z)| norm 0.2823 (+0.09z)| lr 4.95e-04 | 4166.29 ms | 32.4% bf16 MFU | 125949 tok/s step 5888/19560 | loss 3.486204 (-0.72z)| norm 0.2978 (+0.82z)| lr 4.95e-04 | 4166.82 ms | 32.4% bf16 MFU | 125942 tok/s step 5889/19560 | loss 3.447755 (-1.71z)| norm 0.2945 (+0.65z)| lr 4.95e-04 | 4158.82 ms | 32.5% bf16 MFU | 125949 tok/s step 5890/19560 | loss 3.480310 (-0.85z)| norm 0.2549 (-1.24z)| lr 4.95e-04 | 4180.12 ms | 32.3% bf16 MFU | 125922 tok/s step 5891/19560 | loss 3.576480 (+1.63z)| norm 0.2756 (-0.25z)| lr 4.95e-04 | 4156.32 ms | 32.5% bf16 MFU | 125933 tok/s step 5892/19560 | loss 3.523611 (+0.26z)| norm 0.2633 (-0.84z)| lr 4.95e-04 | 4164.71 ms | 32.4% bf16 MFU | 125931 tok/s step 5893/19560 | loss 3.493111 (-0.54z)| norm 0.2743 (-0.31z)| lr 4.95e-04 | 4190.50 ms | 32.2% bf16 MFU | 125890 tok/s step 5894/19560 | loss 3.469887 (-1.16z)| norm 0.2950 (+0.69z)| lr 4.95e-04 | 4158.62 ms | 32.5% bf16 MFU | 125899 tok/s step 5895/19560 | loss 3.531955 (+0.47z)| norm 0.2674 (-0.64z)| lr 4.95e-04 | 4168.41 ms | 32.4% bf16 MFU | 125893 tok/s step 5896/19560 | loss 3.531843 (+0.47z)| norm 0.2751 (-0.27z)| lr 4.95e-04 | 4157.40 ms | 32.5% bf16 MFU | 125904 tok/s step 5897/19560 | loss 3.517316 (+0.09z)| norm 0.3317 (+2.38z)| lr 4.94e-04 | 4161.28 ms | 32.4% bf16 MFU | 125908 tok/s step 5898/19560 | loss 3.574324 (+1.56z)| norm 0.3551 (+3.30z)| lr 4.94e-04 | 4162.22 ms | 32.4% bf16 MFU | 125911 tok/s step 5899/19560 | loss 3.533520 (+0.50z)| norm 0.3278 (+2.03z)| lr 4.94e-04 | 4168.78 ms | 32.4% bf16 MFU | 125904 tok/s step 5900/19560 | loss 3.556341 (+1.10z)| norm 0.2752 (-0.29z)| lr 
4.94e-04 | 4156.49 ms | 32.5% bf16 MFU | 125916 tok/s step 5901/19560 | loss 3.522983 (+0.24z)| norm 0.2900 (+0.41z)| lr 4.94e-04 | 4159.09 ms | 32.5% bf16 MFU | 125923 tok/s step 5902/19560 | loss 3.495473 (-0.49z)| norm 0.2705 (-0.50z)| lr 4.94e-04 | 4159.07 ms | 32.5% bf16 MFU | 125929 tok/s step 5903/19560 | loss 3.481169 (-0.86z)| norm 0.2758 (-0.24z)| lr 4.94e-04 | 4150.25 ms | 32.5% bf16 MFU | 125949 tok/s step 5904/19560 | loss 3.534695 (+0.57z)| norm 0.2638 (-0.79z)| lr 4.94e-04 | 4164.09 ms | 32.4% bf16 MFU | 125947 tok/s step 5905/19560 | loss 3.512489 (-0.02z)| norm 0.2348 (-2.10z)| lr 4.94e-04 | 4164.61 ms | 32.4% bf16 MFU | 125944 tok/s step 5906/19560 | loss 3.538610 (+0.67z)| norm 0.2739 (-0.30z)| lr 4.94e-04 | 4157.62 ms | 32.5% bf16 MFU | 125952 tok/s step 5907/19560 | loss 3.535230 (+0.57z)| norm 0.2588 (-0.99z)| lr 4.94e-04 | 4165.09 ms | 32.4% bf16 MFU | 125949 tok/s step 5908/19560 | loss 3.532520 (+0.50z)| norm 0.2777 (-0.12z)| lr 4.94e-04 | 4150.17 ms | 32.5% bf16 MFU | 125968 tok/s step 5909/19560 | loss 3.534123 (+0.56z)| norm 0.2907 (+0.48z)| lr 4.94e-04 | 4159.95 ms | 32.5% bf16 MFU | 125971 tok/s step 5910/19560 | loss 3.539292 (+0.69z)| norm 0.2869 (+0.30z)| lr 4.94e-04 | 4167.00 ms | 32.4% bf16 MFU | 125963 tok/s step 5911/19560 | loss 3.531489 (+0.47z)| norm 0.2650 (-0.72z)| lr 4.94e-04 | 4165.35 ms | 32.4% bf16 MFU | 125959 tok/s step 5912/19560 | loss 3.471809 (-1.15z)| norm 0.2818 (+0.07z)| lr 4.94e-04 | 4161.19 ms | 32.4% bf16 MFU | 125960 tok/s step 5913/19560 | loss 3.550883 (+1.00z)| norm 0.3074 (+1.24z)| lr 4.94e-04 | 4153.07 ms | 32.5% bf16 MFU | 125974 tok/s step 5914/19560 | loss 3.493688 (-0.56z)| norm 0.2749 (-0.26z)| lr 4.94e-04 | 4159.70 ms | 32.5% bf16 MFU | 125978 tok/s step 5915/19560 | loss 3.587822 (+1.96z)| norm 0.2547 (-1.17z)| lr 4.94e-04 | 4157.61 ms | 32.5% bf16 MFU | 125984 tok/s step 5916/19560 | loss 3.512783 (-0.04z)| norm 0.3144 (+1.54z)| lr 4.94e-04 | 4159.04 ms | 32.5% bf16 MFU | 125988 tok/s step 
5917/19560 | loss 3.524383 (+0.28z)| norm 0.3215 (+1.82z)| lr 4.94e-04 | 4164.73 ms | 32.4% bf16 MFU | 125983 tok/s step 5918/19560 | loss 3.471127 (-1.15z)| norm 0.2751 (-0.25z)| lr 4.94e-04 | 4156.64 ms | 32.5% bf16 MFU | 125990 tok/s step 5919/19560 | loss 3.481974 (-0.89z)| norm 0.2884 (+0.34z)| lr 4.94e-04 | 4153.41 ms | 32.5% bf16 MFU | 126002 tok/s step 5920/19560 | loss 3.564268 (+1.35z)| norm 0.3031 (+0.99z)| lr 4.94e-04 | 4373.98 ms | 30.9% bf16 MFU | 125695 tok/s step 5921/19560 | loss 3.477078 (-1.05z)| norm 0.2759 (-0.23z)| lr 4.94e-04 | 10831.72 ms | 12.5% bf16 MFU | 121831 tok/s step 5922/19560 | loss 3.493988 (-0.57z)| norm 0.2885 (+0.35z)| lr 4.94e-04 | 4131.58 ms | 32.7% bf16 MFU | 122084 tok/s step 5923/19560 | loss 3.501848 (-0.34z)| norm 0.2577 (-1.05z)| lr 4.93e-04 | 4136.46 ms | 32.6% bf16 MFU | 122317 tok/s step 5924/19560 | loss 3.571768 (+1.61z)| norm 0.2826 (+0.13z)| lr 4.93e-04 | 4143.07 ms | 32.6% bf16 MFU | 122529 tok/s step 5925/19560 | loss 3.503952 (-0.31z)| norm 0.2916 (+0.55z)| lr 4.93e-04 | 4140.09 ms | 32.6% bf16 MFU | 122734 tok/s step 5926/19560 | loss 3.501365 (-0.38z)| norm 0.2921 (+0.58z)| lr 4.93e-04 | 4572.11 ms | 29.5% bf16 MFU | 122331 tok/s step 5927/19560 | loss 3.484354 (-0.85z)| norm 0.2754 (-0.21z)| lr 4.93e-04 | 4143.01 ms | 32.6% bf16 MFU | 122542 tok/s step 5928/19560 | loss 3.483953 (-0.85z)| norm 0.2632 (-0.79z)| lr 4.93e-04 | 4144.56 ms | 32.6% bf16 MFU | 122740 tok/s step 5929/19560 | loss 3.534772 (+0.60z)| norm 0.2649 (-0.70z)| lr 4.93e-04 | 4153.05 ms | 32.5% bf16 MFU | 122915 tok/s step 5930/19560 | loss 3.645020 (+3.55z)| norm 0.2889 (+0.44z)| lr 4.93e-04 | 4153.29 ms | 32.5% bf16 MFU | 123081 tok/s step 5931/19560 | loss 3.492754 (-0.62z)| norm 0.2642 (-0.72z)| lr 4.93e-04 | 4158.07 ms | 32.5% bf16 MFU | 123231 tok/s step 5932/19560 | loss 3.546692 (+0.85z)| norm 0.2749 (-0.21z)| lr 4.93e-04 | 4147.24 ms | 32.6% bf16 MFU | 123391 tok/s step 5933/19560 | loss 3.462222 (-1.45z)| norm 0.2788 (-0.02z)| lr 
4.93e-04 | 4155.22 ms | 32.5% bf16 MFU | 123530 tok/s step 5934/19560 | loss 3.502153 (-0.35z)| norm 0.2792 (+0.00z)| lr 4.93e-04 | 4153.62 ms | 32.5% bf16 MFU | 123665 tok/s step 5935/19560 | loss 3.488133 (-0.73z)| norm 0.2726 (-0.32z)| lr 4.93e-04 | 4149.21 ms | 32.5% bf16 MFU | 123799 tok/s step 5936/19560 | loss 3.478421 (-0.98z)| norm 0.2823 (+0.17z)| lr 4.93e-04 | 4162.09 ms | 32.4% bf16 MFU | 123908 tok/s step 5937/19560 | loss 3.576720 (+1.68z)| norm 0.2954 (+0.81z)| lr 4.93e-04 | 4143.42 ms | 32.6% bf16 MFU | 124039 tok/s step 5938/19560 | loss 3.508626 (-0.17z)| norm 0.2755 (-0.15z)| lr 4.93e-04 | 4153.17 ms | 32.5% bf16 MFU | 124149 tok/s step 5939/19560 | loss 3.525274 (+0.28z)| norm 0.2876 (+0.46z)| lr 4.93e-04 | 4153.37 ms | 32.5% bf16 MFU | 124253 tok/s step 5940/19560 | loss 3.496663 (-0.51z)| norm 0.2983 (+0.99z)| lr 4.93e-04 | 4159.47 ms | 32.5% bf16 MFU | 124343 tok/s step 5941/19560 | loss 3.511979 (-0.09z)| norm 0.2823 (+0.18z)| lr 4.93e-04 | 4152.30 ms | 32.5% bf16 MFU | 124439 tok/s step 5942/19560 | loss 3.472404 (-1.16z)| norm 0.2864 (+0.40z)| lr 4.93e-04 | 4173.48 ms | 32.4% bf16 MFU | 124498 tok/s step 5943/19560 | loss 3.532112 (+0.47z)| norm 0.2701 (-0.43z)| lr 4.93e-04 | 4195.06 ms | 32.2% bf16 MFU | 124522 tok/s step 5944/19560 | loss 3.538347 (+0.65z)| norm 0.2812 (+0.14z)| lr 4.93e-04 | 4152.15 ms | 32.5% bf16 MFU | 124609 tok/s step 5945/19560 | loss 3.505147 (-0.25z)| norm 0.3077 (+1.48z)| lr 4.93e-04 | 4166.47 ms | 32.4% bf16 MFU | 124671 tok/s step 5946/19560 | loss 3.569344 (+1.50z)| norm 0.3117 (+1.65z)| lr 4.93e-04 | 4145.69 ms | 32.6% bf16 MFU | 124760 tok/s step 5947/19560 | loss 3.486231 (-0.78z)| norm 0.2931 (+0.70z)| lr 4.93e-04 | 4169.06 ms | 32.4% bf16 MFU | 124810 tok/s step 5948/19560 | loss 3.512993 (-0.04z)| norm 0.2720 (-0.37z)| lr 4.93e-04 | 4149.40 ms | 32.5% bf16 MFU | 124887 tok/s step 5949/19560 | loss 3.518837 (+0.12z)| norm 0.2806 (+0.06z)| lr 4.92e-04 | 4162.52 ms | 32.4% bf16 MFU | 124941 tok/s step 
5950/19560 | loss 3.525990 (+0.32z)| norm 0.2947 (+0.78z)| lr 4.92e-04 | 4150.26 ms | 32.5% bf16 MFU | 125010 tok/s step 5951/19560 | loss 3.544942 (+0.85z)| norm 0.2491 (-1.53z)| lr 4.92e-04 | 4163.72 ms | 32.4% bf16 MFU | 125055 tok/s step 5952/19560 | loss 3.506873 (-0.25z)| norm 0.2892 (+0.49z)| lr 4.92e-04 | 4160.07 ms | 32.5% bf16 MFU | 125104 tok/s step 5953/19560 | loss 3.469219 (-1.36z)| norm 0.2833 (+0.18z)| lr 4.92e-04 | 4180.32 ms | 32.3% bf16 MFU | 125120 tok/s step 5954/19560 | loss 3.474983 (-1.17z)| norm 0.2794 (-0.02z)| lr 4.92e-04 | 4157.83 ms | 32.5% bf16 MFU | 125169 tok/s step 5955/19560 | loss 3.461310 (-1.54z)| norm 0.2653 (-0.73z)| lr 4.92e-04 | 4177.12 ms | 32.3% bf16 MFU | 125186 tok/s step 5956/19560 | loss 3.488440 (-0.75z)| norm 0.2923 (+0.64z)| lr 4.92e-04 | 4174.00 ms | 32.3% bf16 MFU | 125207 tok/s step 5957/19560 | loss 3.475790 (-1.10z)| norm 0.2800 (+0.01z)| lr 4.92e-04 | 4149.81 ms | 32.5% bf16 MFU | 125264 tok/s step 5958/19560 | loss 3.518419 (+0.15z)| norm 0.2714 (-0.43z)| lr 4.92e-04 | 4155.42 ms | 32.5% bf16 MFU | 125309 tok/s step 5959/19560 | loss 3.521347 (+0.23z)| norm 0.3342 (+2.67z)| lr 4.92e-04 | 4169.07 ms | 32.4% bf16 MFU | 125331 tok/s step 5960/19560 | loss 3.526607 (+0.38z)| norm 0.3162 (+1.74z)| lr 4.92e-04 | 4157.23 ms | 32.5% bf16 MFU | 125371 tok/s step 5961/19560 | loss 3.490958 (-0.67z)| norm 0.2946 (+0.67z)| lr 4.92e-04 | 4153.75 ms | 32.5% bf16 MFU | 125413 tok/s step 5962/19560 | loss 3.592253 (+2.26z)| norm 0.2856 (+0.22z)| lr 4.92e-04 | 4159.74 ms | 32.5% bf16 MFU | 125444 tok/s step 5963/19560 | loss 3.565063 (+1.48z)| norm 0.3061 (+1.22z)| lr 4.92e-04 | 4163.92 ms | 32.4% bf16 MFU | 125468 tok/s step 5964/19560 | loss 3.538194 (+0.71z)| norm 0.2826 (+0.05z)| lr 4.92e-04 | 4182.46 ms | 32.3% bf16 MFU | 125462 tok/s step 5965/19560 | loss 3.485026 (-0.84z)| norm 0.2780 (-0.20z)| lr 4.92e-04 | 4162.45 ms | 32.4% bf16 MFU | 125487 tok/s step 5966/19560 | loss 3.545560 (+0.93z)| norm 0.2934 (+0.57z)| lr 
4.92e-04 | 4152.66 ms | 32.5% bf16 MFU | 125525 tok/s step 5967/19560 | loss 3.476191 (-1.11z)| norm 0.2861 (+0.19z)| lr 4.92e-04 | 4164.63 ms | 32.4% bf16 MFU | 125543 tok/s step 5968/19560 | loss 3.476779 (-1.09z)| norm 0.2659 (-0.84z)| lr 4.92e-04 | 4161.30 ms | 32.4% bf16 MFU | 125566 tok/s step 5969/19560 | loss 3.491524 (-0.65z)| norm 0.2847 (+0.12z)| lr 4.92e-04 | 4154.39 ms | 32.5% bf16 MFU | 125598 tok/s step 5970/19560 | loss 3.429946 (-2.38z)| norm 0.2814 (-0.05z)| lr 4.92e-04 | 4157.80 ms | 32.5% bf16 MFU | 125623 tok/s step 5971/19560 | loss 3.508569 (-0.14z)| norm 0.2826 (+0.01z)| lr 4.92e-04 | 4157.26 ms | 32.5% bf16 MFU | 125647 tok/s step 5972/19560 | loss 3.452254 (-1.73z)| norm 0.2853 (+0.14z)| lr 4.92e-04 | 4153.21 ms | 32.5% bf16 MFU | 125677 tok/s step 5973/19560 | loss 3.606437 (+2.59z)| norm 0.2982 (+0.79z)| lr 4.92e-04 | 4161.61 ms | 32.4% bf16 MFU | 125692 tok/s step 5974/19560 | loss 3.493179 (-0.56z)| norm 0.2979 (+0.77z)| lr 4.92e-04 | 4151.06 ms | 32.5% bf16 MFU | 125722 tok/s step 5975/19560 | loss 3.555461 (+1.17z)| norm 0.2596 (-1.18z)| lr 4.91e-04 | 4152.99 ms | 32.5% bf16 MFU | 125748 tok/s step 5976/19560 | loss 3.517218 (+0.11z)| norm 0.2908 (+0.40z)| lr 4.91e-04 | 4165.61 ms | 32.4% bf16 MFU | 125754 tok/s step 5977/19560 | loss 3.512801 (-0.01z)| norm 0.2865 (+0.19z)| lr 4.91e-04 | 4155.42 ms | 32.5% bf16 MFU | 125775 tok/s step 5978/19560 | loss 3.500803 (-0.36z)| norm 0.2713 (-0.59z)| lr 4.91e-04 | 4159.65 ms | 32.5% bf16 MFU | 125788 tok/s step 5979/19560 | loss 3.575261 (+1.70z)| norm 0.3021 (+0.97z)| lr 4.91e-04 | 4163.15 ms | 32.4% bf16 MFU | 125796 tok/s step 5980/19560 | loss 3.505280 (-0.24z)| norm 0.2742 (-0.44z)| lr 4.91e-04 | 4150.66 ms | 32.5% bf16 MFU | 125821 tok/s step 5981/19560 | loss 3.460226 (-1.47z)| norm 0.2613 (-1.09z)| lr 4.91e-04 | 4160.84 ms | 32.4% bf16 MFU | 125831 tok/s step 5982/19560 | loss 3.522551 (+0.25z)| norm 0.2808 (-0.12z)| lr 4.91e-04 | 4175.51 ms | 32.3% bf16 MFU | 125817 tok/s step 
5983/19560 | loss 3.542512 (+0.78z)| norm 0.2690 (-0.72z)| lr 4.91e-04 | 4152.96 ms | 32.5% bf16 MFU | 125839 tok/s step 5984/19560 | loss 3.470719 (-1.18z)| norm 0.2838 (+0.03z)| lr 4.91e-04 | 4171.11 ms | 32.4% bf16 MFU | 125831 tok/s step 5985/19560 | loss 3.498256 (-0.40z)| norm 0.2562 (-1.39z)| lr 4.91e-04 | 4152.19 ms | 32.5% bf16 MFU | 125853 tok/s step 5986/19560 | loss 3.541900 (+0.83z)| norm 0.2837 (+0.06z)| lr 4.91e-04 | 4152.61 ms | 32.5% bf16 MFU | 125873 tok/s step 5987/19560 | loss 3.562382 (+1.41z)| norm 0.3081 (+1.32z)| lr 4.91e-04 | 4159.55 ms | 32.5% bf16 MFU | 125882 tok/s step 5988/19560 | loss 3.535354 (+0.64z)| norm 0.2819 (-0.06z)| lr 4.91e-04 | 4158.06 ms | 32.5% bf16 MFU | 125892 tok/s step 5989/19560 | loss 3.509879 (-0.08z)| norm 0.2744 (-0.45z)| lr 4.91e-04 | 4157.75 ms | 32.5% bf16 MFU | 125903 tok/s step 5990/19560 | loss 3.479139 (-0.95z)| norm 0.2778 (-0.26z)| lr 4.91e-04 | 4157.82 ms | 32.5% bf16 MFU | 125912 tok/s step 5991/19560 | loss 3.518427 (+0.15z)| norm 0.2612 (-1.14z)| lr 4.91e-04 | 4157.17 ms | 32.5% bf16 MFU | 125923 tok/s step 5992/19560 | loss 3.490197 (-0.66z)| norm 0.2697 (-0.69z)| lr 4.91e-04 | 4161.42 ms | 32.4% bf16 MFU | 125926 tok/s step 5993/19560 | loss 3.467881 (-1.29z)| norm 0.2888 (+0.30z)| lr 4.91e-04 | 4162.92 ms | 32.4% bf16 MFU | 125927 tok/s step 5994/19560 | loss 3.546932 (+0.94z)| norm 0.2740 (-0.48z)| lr 4.91e-04 | 4159.23 ms | 32.5% bf16 MFU | 125933 tok/s step 5995/19560 | loss 3.543806 (+0.84z)| norm 0.2713 (-0.64z)| lr 4.91e-04 | 4162.07 ms | 32.4% bf16 MFU | 125935 tok/s step 5996/19560 | loss 3.509470 (-0.13z)| norm 0.2886 (+0.28z)| lr 4.91e-04 | 4165.26 ms | 32.4% bf16 MFU | 125932 tok/s step 5997/19560 | loss 3.463877 (-1.41z)| norm 0.2804 (-0.15z)| lr 4.91e-04 | 4153.95 ms | 32.5% bf16 MFU | 125946 tok/s step 5998/19560 | loss 3.575525 (+1.71z)| norm 0.2879 (+0.25z)| lr 4.91e-04 | 4163.37 ms | 32.4% bf16 MFU | 125945 tok/s step 5999/19560 | loss 3.521036 (+0.19z)| norm 0.2997 (+0.88z)| lr 
4.91e-04 | 4163.32 ms | 32.4% bf16 MFU | 125944 tok/s step 6000/19560 | loss 3.457912 (-1.54z)| norm 0.3252 (+2.19z)| lr 4.91e-04 | 4158.09 ms | 32.5% bf16 MFU | 125951 tok/s val loss 3.503702 evaluating HellaSwag: 0/1256 evaluating HellaSwag: 10/1256 evaluating HellaSwag: 20/1256 evaluating HellaSwag: 30/1256 evaluating HellaSwag: 40/1256 evaluating HellaSwag: 50/1256 evaluating HellaSwag: 60/1256 evaluating HellaSwag: 70/1256 evaluating HellaSwag: 80/1256 evaluating HellaSwag: 90/1256 evaluating HellaSwag: 100/1256 evaluating HellaSwag: 110/1256 evaluating HellaSwag: 120/1256 evaluating HellaSwag: 130/1256 evaluating HellaSwag: 140/1256 evaluating HellaSwag: 150/1256 evaluating HellaSwag: 160/1256 evaluating HellaSwag: 170/1256 evaluating HellaSwag: 180/1256 evaluating HellaSwag: 190/1256 evaluating HellaSwag: 200/1256 evaluating HellaSwag: 210/1256 evaluating HellaSwag: 220/1256 evaluating HellaSwag: 230/1256 evaluating HellaSwag: 240/1256 evaluating HellaSwag: 250/1256 evaluating HellaSwag: 260/1256 evaluating HellaSwag: 270/1256 evaluating HellaSwag: 280/1256 evaluating HellaSwag: 290/1256 evaluating HellaSwag: 300/1256 evaluating HellaSwag: 310/1256 evaluating HellaSwag: 320/1256 evaluating HellaSwag: 330/1256 evaluating HellaSwag: 340/1256 evaluating HellaSwag: 350/1256 evaluating HellaSwag: 360/1256 evaluating HellaSwag: 370/1256 evaluating HellaSwag: 380/1256 evaluating HellaSwag: 390/1256 evaluating HellaSwag: 400/1256 evaluating HellaSwag: 410/1256 evaluating HellaSwag: 420/1256 evaluating HellaSwag: 430/1256 evaluating HellaSwag: 440/1256 evaluating HellaSwag: 450/1256 evaluating HellaSwag: 460/1256 evaluating HellaSwag: 470/1256 evaluating HellaSwag: 480/1256 evaluating HellaSwag: 490/1256 evaluating HellaSwag: 500/1256 evaluating HellaSwag: 510/1256 evaluating HellaSwag: 520/1256 evaluating HellaSwag: 530/1256 evaluating HellaSwag: 540/1256 evaluating HellaSwag: 550/1256 evaluating HellaSwag: 560/1256 evaluating HellaSwag: 570/1256 evaluating 
HellaSwag: 580/1256 evaluating HellaSwag: 590/1256 evaluating HellaSwag: 600/1256 evaluating HellaSwag: 610/1256 evaluating HellaSwag: 620/1256 evaluating HellaSwag: 630/1256 evaluating HellaSwag: 640/1256 evaluating HellaSwag: 650/1256 evaluating HellaSwag: 660/1256 evaluating HellaSwag: 670/1256 evaluating HellaSwag: 680/1256 evaluating HellaSwag: 690/1256 evaluating HellaSwag: 700/1256 evaluating HellaSwag: 710/1256 evaluating HellaSwag: 720/1256 evaluating HellaSwag: 730/1256 evaluating HellaSwag: 740/1256 evaluating HellaSwag: 750/1256 evaluating HellaSwag: 760/1256 evaluating HellaSwag: 770/1256 evaluating HellaSwag: 780/1256 evaluating HellaSwag: 790/1256 evaluating HellaSwag: 800/1256 evaluating HellaSwag: 810/1256 evaluating HellaSwag: 820/1256 evaluating HellaSwag: 830/1256 evaluating HellaSwag: 840/1256 evaluating HellaSwag: 850/1256 evaluating HellaSwag: 860/1256 evaluating HellaSwag: 870/1256 evaluating HellaSwag: 880/1256 evaluating HellaSwag: 890/1256 evaluating HellaSwag: 900/1256 evaluating HellaSwag: 910/1256 evaluating HellaSwag: 920/1256 evaluating HellaSwag: 930/1256 evaluating HellaSwag: 940/1256 evaluating HellaSwag: 950/1256 evaluating HellaSwag: 960/1256 evaluating HellaSwag: 970/1256 evaluating HellaSwag: 980/1256 evaluating HellaSwag: 990/1256 evaluating HellaSwag: 1000/1256 evaluating HellaSwag: 1010/1256 evaluating HellaSwag: 1020/1256 evaluating HellaSwag: 1030/1256 evaluating HellaSwag: 1040/1256 evaluating HellaSwag: 1050/1256 evaluating HellaSwag: 1060/1256 evaluating HellaSwag: 1070/1256 evaluating HellaSwag: 1080/1256 evaluating HellaSwag: 1090/1256 evaluating HellaSwag: 1100/1256 evaluating HellaSwag: 1110/1256 evaluating HellaSwag: 1120/1256 evaluating HellaSwag: 1130/1256 evaluating HellaSwag: 1140/1256 evaluating HellaSwag: 1150/1256 evaluating HellaSwag: 1160/1256 evaluating HellaSwag: 1170/1256 evaluating HellaSwag: 1180/1256 evaluating HellaSwag: 1190/1256 evaluating HellaSwag: 1200/1256 evaluating HellaSwag: 1210/1256 
evaluating HellaSwag: 1220/1256 evaluating HellaSwag: 1230/1256 evaluating HellaSwag: 1240/1256 evaluating HellaSwag: 1250/1256 HellaSwag: 2796/10042 = 0.278431 step 6001/19560 | loss 3.458653 (-1.50z)| norm 0.2931 (+0.49z)| lr 4.90e-04 | 4160.37 ms | 32.5% bf16 MFU | 125955 tok/s step 6002/19560 | loss 3.486090 (-0.73z)| norm 0.3045 (+1.13z)| lr 4.90e-04 | 4147.90 ms | 32.6% bf16 MFU | 125977 tok/s step 6003/19560 | loss 3.552058 (+1.09z)| norm 0.2886 (+0.28z)| lr 4.90e-04 | 4169.15 ms | 32.4% bf16 MFU | 125966 tok/s step 6004/19560 | loss 3.514489 (+0.05z)| norm 0.3108 (+1.48z)| lr 4.90e-04 | 4160.54 ms | 32.5% bf16 MFU | 125968 tok/s step 6005/19560 | loss 3.510016 (-0.08z)| norm 0.3511 (+3.50z)| lr 4.90e-04 | 4158.41 ms | 32.5% bf16 MFU | 125974 tok/s step 6006/19560 | loss 3.540352 (+0.75z)| norm 0.2897 (+0.27z)| lr 4.90e-04 | 4214.05 ms | 32.0% bf16 MFU | 125896 tok/s step 6007/19560 | loss 3.502729 (-0.29z)| norm 0.3128 (+1.46z)| lr 4.90e-04 | 4163.96 ms | 32.4% bf16 MFU | 125897 tok/s step 6008/19560 | loss 3.492195 (-0.58z)| norm 0.2860 (+0.04z)| lr 4.90e-04 | 4155.95 ms | 32.5% bf16 MFU | 125909 tok/s step 6009/19560 | loss 3.473790 (-1.07z)| norm 0.2785 (-0.36z)| lr 4.90e-04 | 4205.84 ms | 32.1% bf16 MFU | 125847 tok/s step 6010/19560 | loss 3.510628 (-0.05z)| norm 0.2608 (-1.28z)| lr 4.90e-04 | 4155.87 ms | 32.5% bf16 MFU | 125862 tok/s step 6011/19560 | loss 3.490703 (-0.60z)| norm 0.2825 (-0.13z)| lr 4.90e-04 | 4179.13 ms | 32.3% bf16 MFU | 125842 tok/s step 6012/19560 | loss 3.556800 (+1.21z)| norm 0.4242 (+6.14z)| lr 4.90e-04 | 4186.90 ms | 32.2% bf16 MFU | 125811 tok/s step 6013/19560 | loss 3.536141 (+0.63z)| norm 0.3341 (+2.08z)| lr 4.90e-04 | 4213.09 ms | 32.0% bf16 MFU | 125742 tok/s step 6014/19560 | loss 3.512347 (-0.04z)| norm 0.3196 (+1.44z)| lr 4.90e-04 | 4180.92 ms | 32.3% bf16 MFU | 125725 tok/s step 6015/19560 | loss 3.482173 (-0.87z)| norm 0.2854 (-0.05z)| lr 4.90e-04 | 4153.00 ms | 32.5% bf16 MFU | 125751 tok/s step 6016/19560 | loss 
3.543940 (+0.83z)| norm 0.2976 (+0.48z)| lr 4.90e-04 | 4157.15 ms | 32.5% bf16 MFU | 125769 tok/s step 6017/19560 | loss 3.494821 (-0.55z)| norm 0.3013 (+0.64z)| lr 4.90e-04 | 4160.82 ms | 32.4% bf16 MFU | 125781 tok/s step 6018/19560 | loss 3.489565 (-0.70z)| norm 0.2455 (-1.77z)| lr 4.90e-04 | 4248.97 ms | 31.8% bf16 MFU | 125662 tok/s step 6019/19560 | loss 3.446219 (-1.88z)| norm 0.3130 (+1.12z)| lr 4.90e-04 | 4164.60 ms | 32.4% bf16 MFU | 125673 tok/s step 6020/19560 | loss 3.492561 (-0.57z)| norm 0.2903 (+0.14z)| lr 4.90e-04 | 4159.13 ms | 32.5% bf16 MFU | 125692 tok/s step 6021/19560 | loss 3.490852 (-0.62z)| norm 0.2581 (-1.24z)| lr 4.90e-04 | 4153.12 ms | 32.5% bf16 MFU | 125720 tok/s step 6022/19560 | loss 3.498493 (-0.42z)| norm 0.3017 (+0.63z)| lr 4.90e-04 | 4155.52 ms | 32.5% bf16 MFU | 125742 tok/s step 6023/19560 | loss 3.534337 (+0.59z)| norm 0.2812 (-0.25z)| lr 4.90e-04 | 4153.17 ms | 32.5% bf16 MFU | 125767 tok/s step 6024/19560 | loss 3.493644 (-0.54z)| norm 0.2762 (-0.47z)| lr 4.90e-04 | 4161.40 ms | 32.4% bf16 MFU | 125778 tok/s step 6025/19560 | loss 3.505044 (-0.22z)| norm 0.2581 (-1.23z)| lr 4.90e-04 | 4154.87 ms | 32.5% bf16 MFU | 125798 tok/s step 6026/19560 | loss 3.452160 (-1.68z)| norm 0.2679 (-0.80z)| lr 4.90e-04 | 4147.96 ms | 32.6% bf16 MFU | 125828 tok/s step 6027/19560 | loss 3.588821 (+2.11z)| norm 0.2660 (-0.87z)| lr 4.89e-04 | 4154.50 ms | 32.5% bf16 MFU | 125847 tok/s step 6028/19560 | loss 3.525867 (+0.38z)| norm 0.5052 (+7.43z)| lr 4.89e-04 | 4164.46 ms | 32.4% bf16 MFU | 125849 tok/s step 6029/19560 | loss 3.529039 (+0.46z)| norm 0.3355 (+1.62z)| lr 4.89e-04 | 4165.10 ms | 32.4% bf16 MFU | 125851 tok/s step 6030/19560 | loss 3.531698 (+0.53z)| norm 0.3163 (+0.96z)| lr 4.89e-04 | 4167.22 ms | 32.4% bf16 MFU | 125849 tok/s step 6031/19560 | loss 3.498958 (-0.38z)| norm 0.2738 (-0.47z)| lr 4.89e-04 | 4153.85 ms | 32.5% bf16 MFU | 125867 tok/s step 6032/19560 | loss 3.461383 (-1.40z)| norm 0.3149 (+0.90z)| lr 4.89e-04 | 4142.95 
ms | 32.6% bf16 MFU | 125901 tok/s step 6033/19560 | loss 3.466892 (-1.23z)| norm 0.2865 (-0.07z)| lr 4.89e-04 | 4158.27 ms | 32.5% bf16 MFU | 125910 tok/s step 6034/19560 | loss 3.521935 (+0.28z)| norm 0.3070 (+0.61z)| lr 4.89e-04 | 4161.51 ms | 32.4% bf16 MFU | 125914 tok/s step 6035/19560 | loss 3.481050 (-0.83z)| norm 0.2707 (-0.62z)| lr 4.89e-04 | 4147.33 ms | 32.6% bf16 MFU | 125939 tok/s step 6036/19560 | loss 3.458255 (-1.43z)| norm 0.2966 (+0.26z)| lr 4.89e-04 | 4162.19 ms | 32.4% bf16 MFU | 125940 tok/s step 6037/19560 | loss 3.512650 (+0.06z)| norm 0.3339 (+1.50z)| lr 4.89e-04 | 4158.68 ms | 32.5% bf16 MFU | 125947 tok/s step 6038/19560 | loss 3.482626 (-0.75z)| norm 0.2980 (+0.28z)| lr 4.89e-04 | 4160.12 ms | 32.5% bf16 MFU | 125951 tok/s step 6039/19560 | loss 3.535808 (+0.70z)| norm 0.3175 (+0.93z)| lr 4.89e-04 | 4164.24 ms | 32.4% bf16 MFU | 125949 tok/s step 6040/19560 | loss 3.600200 (+2.39z)| norm 0.2839 (-0.21z)| lr 4.89e-04 | 4162.65 ms | 32.4% bf16 MFU | 125949 tok/s step 6041/19560 | loss 3.473599 (-0.99z)| norm 0.2990 (+0.31z)| lr 4.89e-04 | 4160.74 ms | 32.5% bf16 MFU | 125952 tok/s step 6042/19560 | loss 3.547541 (+0.98z)| norm 0.2943 (+0.14z)| lr 4.89e-04 | 4157.70 ms | 32.5% bf16 MFU | 125959 tok/s step 6043/19560 | loss 3.471193 (-1.05z)| norm 0.2982 (+0.26z)| lr 4.89e-04 | 4154.91 ms | 32.5% bf16 MFU | 125970 tok/s step 6044/19560 | loss 3.437043 (-1.93z)| norm 0.2924 (+0.07z)| lr 4.89e-04 | 4157.83 ms | 32.5% bf16 MFU | 125977 tok/s step 6045/19560 | loss 3.442324 (-1.75z)| norm 0.3235 (+1.14z)| lr 4.89e-04 | 4154.55 ms | 32.5% bf16 MFU | 125988 tok/s step 6046/19560 | loss 3.527336 (+0.48z)| norm 0.2702 (-0.68z)| lr 4.89e-04 | 4154.00 ms | 32.5% bf16 MFU | 125999 tok/s step 6047/19560 | loss 3.452475 (-1.48z)| norm 0.2590 (-1.05z)| lr 4.89e-04 | 4156.11 ms | 32.5% bf16 MFU | 126006 tok/s step 6048/19560 | loss 3.488338 (-0.53z)| norm 0.2924 (+0.08z)| lr 4.89e-04 | 4147.52 ms | 32.6% bf16 MFU | 126027 tok/s step 6049/19560 | loss 
3.590316 (+2.11z)| norm 0.2844 (-0.19z)| lr 4.89e-04 | 4151.79 ms | 32.5% bf16 MFU | 126039 tok/s step 6050/19560 | loss 3.530390 (+0.54z)| norm 0.3128 (+0.77z)| lr 4.89e-04 | 4155.49 ms | 32.5% bf16 MFU | 126046 tok/s step 6051/19560 | loss 3.496995 (-0.33z)| norm 0.2798 (-0.36z)| lr 4.89e-04 | 4157.02 ms | 32.5% bf16 MFU | 126049 tok/s step 6052/19560 | loss 3.577127 (+1.76z)| norm 0.2463 (-1.48z)| lr 4.89e-04 | 4161.37 ms | 32.4% bf16 MFU | 126046 tok/s step 6053/19560 | loss 3.574378 (+1.65z)| norm 0.2943 (+0.14z)| lr 4.88e-04 | 4158.43 ms | 32.5% bf16 MFU | 126048 tok/s step 6054/19560 | loss 3.453338 (-1.44z)| norm 0.2725 (-0.59z)| lr 4.88e-04 | 4159.95 ms | 32.5% bf16 MFU | 126047 tok/s step 6055/19560 | loss 3.495555 (-0.36z)| norm 0.2736 (-0.55z)| lr 4.88e-04 | 4157.49 ms | 32.5% bf16 MFU | 126050 tok/s step 6056/19560 | loss 3.442565 (-1.69z)| norm 0.2892 (-0.03z)| lr 4.88e-04 | 4166.19 ms | 32.4% bf16 MFU | 126040 tok/s step 6057/19560 | loss 3.504324 (-0.12z)| norm 0.2631 (-0.91z)| lr 4.88e-04 | 4156.99 ms | 32.5% bf16 MFU | 126044 tok/s step 6058/19560 | loss 3.459511 (-1.28z)| norm 0.2818 (-0.28z)| lr 4.88e-04 | 4162.84 ms | 32.4% bf16 MFU | 126039 tok/s step 6059/19560 | loss 3.619642 (+2.84z)| norm 0.2956 (+0.18z)| lr 4.88e-04 | 4158.09 ms | 32.5% bf16 MFU | 126041 tok/s step 6060/19560 | loss 3.482666 (-0.66z)| norm 0.2997 (+0.31z)| lr 4.88e-04 | 4159.75 ms | 32.5% bf16 MFU | 126041 tok/s step 6061/19560 | loss 3.527348 (+0.48z)| norm 0.2621 (-0.96z)| lr 4.88e-04 | 4158.93 ms | 32.5% bf16 MFU | 126042 tok/s step 6062/19560 | loss 3.539751 (+0.79z)| norm 0.2700 (-0.69z)| lr 4.88e-04 | 4165.27 ms | 32.4% bf16 MFU | 126034 tok/s step 6063/19560 | loss 3.510517 (+0.03z)| norm 0.2727 (-0.59z)| lr 4.88e-04 | 4148.96 ms | 32.5% bf16 MFU | 126050 tok/s step 6064/19560 | loss 3.507797 (-0.04z)| norm 0.2948 (+0.15z)| lr 4.88e-04 | 4149.93 ms | 32.5% bf16 MFU | 126065 tok/s step 6065/19560 | loss 3.547107 (+0.98z)| norm 0.3047 (+0.48z)| lr 4.88e-04 | 4156.10 
ms | 32.5% bf16 MFU | 126069 tok/s step 6066/19560 | loss 3.495123 (-0.36z)| norm 0.2652 (-0.85z)| lr 4.88e-04 | 4150.17 ms | 32.5% bf16 MFU | 126082 tok/s step 6067/19560 | loss 3.543870 (+0.90z)| norm 0.2881 (-0.08z)| lr 4.88e-04 | 4155.49 ms | 32.5% bf16 MFU | 126086 tok/s step 6068/19560 | loss 3.496195 (-0.34z)| norm 0.2723 (-0.60z)| lr 4.88e-04 | 4149.46 ms | 32.5% bf16 MFU | 126100 tok/s step 6069/19560 | loss 3.498401 (-0.28z)| norm 0.2410 (-1.63z)| lr 4.88e-04 | 4148.23 ms | 32.5% bf16 MFU | 126114 tok/s step 6070/19560 | loss 3.523936 (+0.37z)| norm 0.2824 (-0.25z)| lr 4.88e-04 | 4155.14 ms | 32.5% bf16 MFU | 126117 tok/s step 6071/19560 | loss 3.506702 (-0.07z)| norm 0.2622 (-0.92z)| lr 4.88e-04 | 4158.55 ms | 32.5% bf16 MFU | 126115 tok/s step 6072/19560 | loss 3.474401 (-0.90z)| norm 0.2748 (-0.49z)| lr 4.88e-04 | 4154.53 ms | 32.5% bf16 MFU | 126119 tok/s step 6073/19560 | loss 3.563334 (+1.39z)| norm 0.2701 (-0.64z)| lr 4.88e-04 | 4153.80 ms | 32.5% bf16 MFU | 126124 tok/s step 6074/19560 | loss 3.494342 (-0.38z)| norm 0.2495 (-1.31z)| lr 4.88e-04 | 4150.72 ms | 32.5% bf16 MFU | 126134 tok/s step 6075/19560 | loss 3.458905 (-1.29z)| norm 0.2733 (-0.51z)| lr 4.88e-04 | 4154.67 ms | 32.5% bf16 MFU | 126136 tok/s step 6076/19560 | loss 3.411635 (-2.43z)| norm 0.2775 (-0.37z)| lr 4.88e-04 | 4158.06 ms | 32.5% bf16 MFU | 126134 tok/s step 6077/19560 | loss 3.455019 (-1.31z)| norm 0.2566 (-1.06z)| lr 4.88e-04 | 4154.88 ms | 32.5% bf16 MFU | 126137 tok/s step 6078/19560 | loss 3.526087 (+0.47z)| norm 0.2807 (-0.26z)| lr 4.87e-04 | 4159.38 ms | 32.5% bf16 MFU | 126132 tok/s step 6079/19560 | loss 3.524806 (+0.45z)| norm 0.2937 (+0.16z)| lr 4.87e-04 | 4170.31 ms | 32.4% bf16 MFU | 126112 tok/s step 6080/19560 | loss 3.502367 (-0.12z)| norm 0.2982 (+0.31z)| lr 4.87e-04 | 4159.16 ms | 32.5% bf16 MFU | 126109 tok/s step 6081/19560 | loss 3.421572 (-2.11z)| norm 0.3462 (+1.86z)| lr 4.87e-04 | 4159.98 ms | 32.5% bf16 MFU | 126105 tok/s step 6082/19560 | loss 
3.542491 (+0.88z)| norm 0.2977 (+0.27z)| lr 4.87e-04 | 4156.42 ms | 32.5% bf16 MFU | 126107 tok/s step 6083/19560 | loss 3.500673 (-0.17z)| norm 0.3493 (+1.91z)| lr 4.87e-04 | 4159.19 ms | 32.5% bf16 MFU | 126104 tok/s step 6084/19560 | loss 3.466284 (-1.02z)| norm 0.3602 (+2.20z)| lr 4.87e-04 | 4146.84 ms | 32.6% bf16 MFU | 126121 tok/s step 6085/19560 | loss 3.508309 (+0.02z)| norm 0.3075 (+0.53z)| lr 4.87e-04 | 4155.97 ms | 32.5% bf16 MFU | 126122 tok/s step 6086/19560 | loss 3.479122 (-0.70z)| norm 0.3090 (+0.56z)| lr 4.87e-04 | 4148.34 ms | 32.5% bf16 MFU | 126135 tok/s step 6087/19560 | loss 3.483470 (-0.59z)| norm 0.2966 (+0.18z)| lr 4.87e-04 | 4151.69 ms | 32.5% bf16 MFU | 126143 tok/s step 6088/19560 | loss 3.522452 (+0.39z)| norm 0.2936 (+0.09z)| lr 4.87e-04 | 4157.54 ms | 32.5% bf16 MFU | 126141 tok/s step 6089/19560 | loss 3.540661 (+0.83z)| norm 0.2605 (-0.96z)| lr 4.87e-04 | 4155.32 ms | 32.5% bf16 MFU | 126142 tok/s step 6090/19560 | loss 3.476312 (-0.76z)| norm 0.2738 (-0.53z)| lr 4.87e-04 | 4147.70 ms | 32.6% bf16 MFU | 126156 tok/s step 6091/19560 | loss 3.549875 (+1.10z)| norm 0.2404 (-1.56z)| lr 4.87e-04 | 4157.06 ms | 32.5% bf16 MFU | 126154 tok/s step 6092/19560 | loss 3.477778 (-0.71z)| norm 0.2538 (-1.13z)| lr 4.87e-04 | 4162.05 ms | 32.4% bf16 MFU | 126144 tok/s step 6093/19560 | loss 3.508625 (+0.07z)| norm 0.2646 (-0.78z)| lr 4.87e-04 | 5042.85 ms | 26.8% bf16 MFU | 125036 tok/s step 6094/19560 | loss 3.556332 (+1.27z)| norm 0.2637 (-0.80z)| lr 4.87e-04 | 4682.32 ms | 28.8% bf16 MFU | 124382 tok/s step 6095/19560 | loss 3.472604 (-0.85z)| norm 0.2941 (+0.15z)| lr 4.87e-04 | 4149.65 ms | 32.5% bf16 MFU | 124481 tok/s step 6096/19560 | loss 3.495580 (-0.27z)| norm 0.2895 (-0.00z)| lr 4.87e-04 | 4153.53 ms | 32.5% bf16 MFU | 124568 tok/s step 6097/19560 | loss 3.419535 (-2.14z)| norm 0.2514 (-1.18z)| lr 4.87e-04 | 4163.12 ms | 32.4% bf16 MFU | 124636 tok/s step 6098/19560 | loss 3.519126 (+0.32z)| norm 0.2660 (-0.72z)| lr 4.87e-04 | 4155.72 
ms | 32.5% bf16 MFU | 124712 tok/s step 6099/19560 | loss 3.517045 (+0.27z)| norm 0.3160 (+0.82z)| lr 4.87e-04 | 4150.86 ms | 32.5% bf16 MFU | 124792 tok/s step 6100/19560 | loss 3.509084 (+0.06z)| norm 0.2718 (-0.54z)| lr 4.87e-04 | 4166.17 ms | 32.4% bf16 MFU | 124845 tok/s step 6101/19560 | loss 3.452812 (-1.37z)| norm 0.2806 (-0.27z)| lr 4.87e-04 | 4156.21 ms | 32.5% bf16 MFU | 124910 tok/s step 6102/19560 | loss 3.575739 (+1.77z)| norm 0.2986 (+0.29z)| lr 4.87e-04 | 4155.50 ms | 32.5% bf16 MFU | 124973 tok/s step 6103/19560 | loss 3.487286 (-0.48z)| norm 0.2598 (-0.91z)| lr 4.87e-04 | 4147.23 ms | 32.6% bf16 MFU | 125045 tok/s step 6104/19560 | loss 3.462521 (-1.10z)| norm 0.3019 (+0.39z)| lr 4.86e-04 | 4165.71 ms | 32.4% bf16 MFU | 125086 tok/s step 6105/19560 | loss 3.508488 (+0.08z)| norm 0.2787 (-0.33z)| lr 4.86e-04 | 4151.24 ms | 32.5% bf16 MFU | 125146 tok/s step 6106/19560 | loss 3.496968 (-0.21z)| norm 0.2835 (-0.18z)| lr 4.86e-04 | 4169.09 ms | 32.4% bf16 MFU | 125177 tok/s step 6107/19560 | loss 3.480749 (-0.62z)| norm 0.4100 (+3.53z)| lr 4.86e-04 | 4147.46 ms | 32.6% bf16 MFU | 125238 tok/s step 6108/19560 | loss 3.482890 (-0.56z)| norm 0.3049 (+0.42z)| lr 4.86e-04 | 4159.08 ms | 32.5% bf16 MFU | 125279 tok/s step 6109/19560 | loss 3.540675 (+0.92z)| norm 0.3005 (+0.29z)| lr 4.86e-04 | 4151.01 ms | 32.5% bf16 MFU | 125331 tok/s step 6110/19560 | loss 3.451466 (-1.36z)| norm 0.2819 (-0.26z)| lr 4.86e-04 | 4166.38 ms | 32.4% bf16 MFU | 125356 tok/s step 6111/19560 | loss 3.482809 (-0.55z)| norm 0.2569 (-1.00z)| lr 4.86e-04 | 4151.77 ms | 32.5% bf16 MFU | 125402 tok/s step 6112/19560 | loss 3.492290 (-0.31z)| norm 0.2978 (+0.21z)| lr 4.86e-04 | 4152.88 ms | 32.5% bf16 MFU | 125445 tok/s step 6113/19560 | loss 3.463068 (-1.05z)| norm 0.2603 (-0.90z)| lr 4.86e-04 | 4154.83 ms | 32.5% bf16 MFU | 125482 tok/s step 6114/19560 | loss 3.472964 (-0.78z)| norm 0.2667 (-0.71z)| lr 4.86e-04 | 4147.81 ms | 32.6% bf16 MFU | 125528 tok/s step 6115/19560 | loss 
3.489730 (-0.34z)| norm 0.2698 (-0.61z)| lr 4.86e-04 | 4160.05 ms | 32.5% bf16 MFU | 125553 tok/s step 6116/19560 | loss 3.511133 (+0.22z)| norm 0.2624 (-0.82z)| lr 4.86e-04 | 4160.36 ms | 32.5% bf16 MFU | 125576 tok/s step 6117/19560 | loss 3.500513 (-0.05z)| norm 0.2620 (-0.82z)| lr 4.86e-04 | 4144.51 ms | 32.6% bf16 MFU | 125622 tok/s step 6118/19560 | loss 3.515911 (+0.34z)| norm 0.3042 (+0.41z)| lr 4.86e-04 | 4146.26 ms | 32.6% bf16 MFU | 125664 tok/s step 6119/19560 | loss 3.510260 (+0.20z)| norm 0.2932 (+0.08z)| lr 4.86e-04 | 4161.17 ms | 32.4% bf16 MFU | 125680 tok/s step 6120/19560 | loss 3.490913 (-0.31z)| norm 0.3044 (+0.40z)| lr 4.86e-04 | 4152.80 ms | 32.5% bf16 MFU | 125709 tok/s step 6121/19560 | loss 3.510032 (+0.18z)| norm 0.2784 (-0.36z)| lr 4.86e-04 | 4151.94 ms | 32.5% bf16 MFU | 125737 tok/s step 6122/19560 | loss 3.490002 (-0.33z)| norm 0.2641 (-0.78z)| lr 4.86e-04 | 4149.10 ms | 32.5% bf16 MFU | 125768 tok/s step 6123/19560 | loss 3.498293 (-0.11z)| norm 0.2944 (+0.10z)| lr 4.86e-04 | 4152.31 ms | 32.5% bf16 MFU | 125793 tok/s step 6124/19560 | loss 3.481034 (-0.56z)| norm 0.2756 (-0.45z)| lr 4.86e-04 | 4158.07 ms | 32.5% bf16 MFU | 125808 tok/s step 6125/19560 | loss 3.522248 (+0.52z)| norm 0.2879 (-0.09z)| lr 4.86e-04 | 4149.86 ms | 32.5% bf16 MFU | 125834 tok/s step 6126/19560 | loss 3.512056 (+0.27z)| norm 0.3089 (+0.53z)| lr 4.86e-04 | 4152.95 ms | 32.5% bf16 MFU | 125855 tok/s step 6127/19560 | loss 3.452763 (-1.30z)| norm 0.2635 (-0.80z)| lr 4.86e-04 | 4166.89 ms | 32.4% bf16 MFU | 125853 tok/s step 6128/19560 | loss 3.497210 (-0.12z)| norm 0.2957 (+0.15z)| lr 4.86e-04 | 4164.56 ms | 32.4% bf16 MFU | 125855 tok/s step 6129/19560 | loss 3.494426 (-0.21z)| norm 0.2724 (-0.53z)| lr 4.86e-04 | 4150.97 ms | 32.5% bf16 MFU | 125878 tok/s step 6130/19560 | loss 3.459771 (-1.13z)| norm 0.2614 (-0.84z)| lr 4.85e-04 | 4150.37 ms | 32.5% bf16 MFU | 125900 tok/s step 6131/19560 | loss 3.567642 (+1.76z)| norm 0.2706 (-0.56z)| lr 4.85e-04 | 4154.80 
ms | 32.5% bf16 MFU | 125914 tok/s step 6132/19560 | loss 3.461594 (-1.07z)| norm 0.2765 (-0.38z)| lr 4.85e-04 | 4152.93 ms | 32.5% bf16 MFU | 125931 tok/s step 6133/19560 | loss 3.479166 (-0.59z)| norm 0.2702 (-0.56z)| lr 4.85e-04 | 4163.35 ms | 32.4% bf16 MFU | 125931 tok/s step 6134/19560 | loss 3.477878 (-0.61z)| norm 0.2760 (-0.38z)| lr 4.85e-04 | 4149.81 ms | 32.5% bf16 MFU | 125951 tok/s step 6135/19560 | loss 3.470708 (-0.80z)| norm 0.2674 (-0.62z)| lr 4.85e-04 | 4166.24 ms | 32.4% bf16 MFU | 125946 tok/s step 6136/19560 | loss 3.516426 (+0.42z)| norm 0.2646 (-0.70z)| lr 4.85e-04 | 4165.12 ms | 32.4% bf16 MFU | 125942 tok/s step 6137/19560 | loss 3.531415 (+0.80z)| norm 0.2533 (-1.03z)| lr 4.85e-04 | 4152.86 ms | 32.5% bf16 MFU | 125958 tok/s step 6138/19560 | loss 3.462323 (-1.02z)| norm 0.2737 (-0.43z)| lr 4.85e-04 | 4153.74 ms | 32.5% bf16 MFU | 125971 tok/s step 6139/19560 | loss 3.495289 (-0.15z)| norm 0.2928 (+0.13z)| lr 4.85e-04 | 4152.48 ms | 32.5% bf16 MFU | 125985 tok/s step 6140/19560 | loss 3.565567 (+1.71z)| norm 0.2659 (-0.67z)| lr 4.85e-04 | 4151.59 ms | 32.5% bf16 MFU | 126000 tok/s step 6141/19560 | loss 3.464586 (-0.95z)| norm 0.2944 (+0.24z)| lr 4.85e-04 | 4152.63 ms | 32.5% bf16 MFU | 126013 tok/s step 6142/19560 | loss 3.492412 (-0.21z)| norm 0.3096 (+0.73z)| lr 4.85e-04 | 4150.82 ms | 32.5% bf16 MFU | 126028 tok/s step 6143/19560 | loss 3.471377 (-0.76z)| norm 0.2915 (+0.15z)| lr 4.85e-04 | 4158.19 ms | 32.5% bf16 MFU | 126031 tok/s step 6144/19560 | loss 3.481336 (-0.49z)| norm 0.2799 (-0.21z)| lr 4.85e-04 | 4143.36 ms | 32.6% bf16 MFU | 126056 tok/s step 6145/19560 | loss 3.546618 (+1.23z)| norm 0.2908 (+0.14z)| lr 4.85e-04 | 4156.36 ms | 32.5% bf16 MFU | 126060 tok/s step 6146/19560 | loss 3.457365 (-1.11z)| norm 0.2705 (-0.52z)| lr 4.85e-04 | 4158.84 ms | 32.5% bf16 MFU | 126061 tok/s step 6147/19560 | loss 3.524938 (+0.65z)| norm 0.2916 (+0.16z)| lr 4.85e-04 | 4165.77 ms | 32.4% bf16 MFU | 126050 tok/s step 6148/19560 | loss 
3.504211 (+0.10z)| norm 0.2755 (-0.35z)| lr 4.85e-04 | 4155.70 ms | 32.5% bf16 MFU | 126056 tok/s step 6149/19560 | loss 3.517736 (+0.45z)| norm 0.2709 (-0.50z)| lr 4.85e-04 | 4167.92 ms | 32.4% bf16 MFU | 126043 tok/s step 6150/19560 | loss 3.459116 (-1.09z)| norm 0.2858 (-0.02z)| lr 4.85e-04 | 4147.27 ms | 32.6% bf16 MFU | 126061 tok/s step 6151/19560 | loss 3.482823 (-0.45z)| norm 0.3227 (+1.16z)| lr 4.85e-04 | 4161.19 ms | 32.4% bf16 MFU | 126058 tok/s step 6152/19560 | loss 3.407821 (-2.36z)| norm 0.3037 (+0.54z)| lr 4.85e-04 | 4173.54 ms | 32.4% bf16 MFU | 126036 tok/s step 6153/19560 | loss 3.499350 (+0.00z)| norm 0.2819 (-0.17z)| lr 4.85e-04 | 4151.25 ms | 32.5% bf16 MFU | 126049 tok/s step 6154/19560 | loss 3.510022 (+0.27z)| norm 0.3260 (+1.23z)| lr 4.85e-04 | 4147.67 ms | 32.6% bf16 MFU | 126067 tok/s step 6155/19560 | loss 3.534567 (+0.93z)| norm 0.3177 (+0.95z)| lr 4.84e-04 | 4150.13 ms | 32.5% bf16 MFU | 126080 tok/s step 6156/19560 | loss 3.538804 (+1.04z)| norm 0.2969 (+0.43z)| lr 4.84e-04 | 4153.55 ms | 32.5% bf16 MFU | 126088 tok/s step 6157/19560 | loss 3.559170 (+1.55z)| norm 0.3099 (+0.98z)| lr 4.84e-04 | 4159.24 ms | 32.5% bf16 MFU | 126086 tok/s step 6158/19560 | loss 3.498268 (-0.03z)| norm 0.3212 (+1.44z)| lr 4.84e-04 | 4154.88 ms | 32.5% bf16 MFU | 126091 tok/s step 6159/19560 | loss 3.461670 (-0.98z)| norm 0.3023 (+0.65z)| lr 4.84e-04 | 4157.59 ms | 32.5% bf16 MFU | 126091 tok/s step 6160/19560 | loss 3.614387 (+2.89z)| norm 0.3039 (+0.72z)| lr 4.84e-04 | 4159.44 ms | 32.5% bf16 MFU | 126089 tok/s step 6161/19560 | loss 3.529341 (+0.72z)| norm 0.2842 (-0.09z)| lr 4.84e-04 | 4167.60 ms | 32.4% bf16 MFU | 126075 tok/s step 6162/19560 | loss 3.492752 (-0.20z)| norm 0.2954 (+0.38z)| lr 4.84e-04 | 4150.27 ms | 32.5% bf16 MFU | 126087 tok/s step 6163/19560 | loss 3.549512 (+1.22z)| norm 0.2931 (+0.27z)| lr 4.84e-04 | 4148.48 ms | 32.5% bf16 MFU | 126102 tok/s step 6164/19560 | loss 3.509271 (+0.20z)| norm 0.2744 (-0.49z)| lr 4.84e-04 | 4151.70 
ms | 32.5% bf16 MFU | 126111 tok/s step 6165/19560 | loss 3.466610 (-0.87z)| norm 0.2628 (-0.96z)| lr 4.84e-04 | 4156.46 ms | 32.5% bf16 MFU | 126112 tok/s step 6166/19560 | loss 3.470702 (-0.77z)| norm 0.2617 (-0.99z)| lr 4.84e-04 | 4163.93 ms | 32.4% bf16 MFU | 126102 tok/s step 6167/19560 | loss 3.492115 (-0.22z)| norm 0.2581 (-1.12z)| lr 4.84e-04 | 4144.67 ms | 32.6% bf16 MFU | 126122 tok/s step 6168/19560 | loss 3.517405 (+0.45z)| norm 0.2648 (-0.83z)| lr 4.84e-04 | 4161.46 ms | 32.4% bf16 MFU | 126115 tok/s step 6169/19560 | loss 3.498426 (-0.05z)| norm 0.3316 (+1.92z)| lr 4.84e-04 | 4149.02 ms | 32.5% bf16 MFU | 126128 tok/s step 6170/19560 | loss 3.474801 (-0.65z)| norm 0.3287 (+1.76z)| lr 4.84e-04 | 4157.35 ms | 32.5% bf16 MFU | 126127 tok/s step 6171/19560 | loss 3.473384 (-0.69z)| norm 0.2793 (-0.24z)| lr 4.84e-04 | 4151.69 ms | 32.5% bf16 MFU | 126135 tok/s step 6172/19560 | loss 3.454866 (-1.18z)| norm 0.2985 (+0.54z)| lr 4.84e-04 | 4160.48 ms | 32.5% bf16 MFU | 126129 tok/s step 6173/19560 | loss 3.440372 (-1.56z)| norm 0.2836 (-0.05z)| lr 4.84e-04 | 4154.90 ms | 32.5% bf16 MFU | 126132 tok/s step 6174/19560 | loss 3.538112 (+1.00z)| norm 0.2705 (-0.59z)| lr 4.84e-04 | 4161.79 ms | 32.4% bf16 MFU | 126124 tok/s step 6175/19560 | loss 3.565605 (+1.69z)| norm 0.2608 (-0.99z)| lr 4.84e-04 | 4154.84 ms | 32.5% bf16 MFU | 126127 tok/s step 6176/19560 | loss 3.485204 (-0.41z)| norm 0.2659 (-0.77z)| lr 4.84e-04 | 4146.20 ms | 32.6% bf16 MFU | 126143 tok/s step 6177/19560 | loss 3.525153 (+0.66z)| norm 0.2740 (-0.43z)| lr 4.84e-04 | 4154.22 ms | 32.5% bf16 MFU | 126146 tok/s step 6178/19560 | loss 3.512135 (+0.32z)| norm 0.2672 (-0.70z)| lr 4.84e-04 | 4155.67 ms | 32.5% bf16 MFU | 126147 tok/s step 6179/19560 | loss 3.495977 (-0.11z)| norm 0.2634 (-0.85z)| lr 4.84e-04 | 4155.11 ms | 32.5% bf16 MFU | 126149 tok/s step 6180/19560 | loss 3.465648 (-0.91z)| norm 0.2649 (-0.80z)| lr 4.83e-04 | 4155.24 ms | 32.5% bf16 MFU | 126150 tok/s step 6181/19560 | loss 
3.525942 (+0.74z)| norm 0.2733 (-0.45z)| lr 4.83e-04 | 4149.49 ms | 32.5% bf16 MFU | 126160 tok/s step 6182/19560 | loss 3.507131 (+0.22z)| norm 0.2549 (-1.19z)| lr 4.83e-04 | 4159.88 ms | 32.5% bf16 MFU | 126154 tok/s step 6183/19560 | loss 3.523461 (+0.66z)| norm 0.2796 (-0.18z)| lr 4.83e-04 | 4150.98 ms | 32.5% bf16 MFU | 126161 tok/s step 6184/19560 | loss 3.462496 (-1.03z)| norm 0.2786 (-0.22z)| lr 4.83e-04 | 4151.26 ms | 32.5% bf16 MFU | 126168 tok/s step 6185/19560 | loss 3.434611 (-1.76z)| norm 0.2662 (-0.73z)| lr 4.83e-04 | 4163.44 ms | 32.4% bf16 MFU | 126156 tok/s step 6186/19560 | loss 3.485048 (-0.39z)| norm 0.2842 (+0.01z)| lr 4.83e-04 | 4163.92 ms | 32.4% bf16 MFU | 126144 tok/s step 6187/19560 | loss 3.521253 (+0.65z)| norm 0.2990 (+0.62z)| lr 4.83e-04 | 4149.27 ms | 32.5% bf16 MFU | 126154 tok/s step 6188/19560 | loss 3.496721 (-0.06z)| norm 0.2612 (-0.92z)| lr 4.83e-04 | 4144.63 ms | 32.6% bf16 MFU | 126172 tok/s step 6189/19560 | loss 3.503139 (+0.13z)| norm 0.2637 (-0.82z)| lr 4.83e-04 | 4149.73 ms | 32.5% bf16 MFU | 126180 tok/s step 6190/19560 | loss 3.585978 (+2.46z)| norm 0.2815 (-0.10z)| lr 4.83e-04 | 4155.50 ms | 32.5% bf16 MFU | 126180 tok/s step 6191/19560 | loss 3.491623 (-0.20z)| norm 0.2530 (-1.25z)| lr 4.83e-04 | 4152.08 ms | 32.5% bf16 MFU | 126184 tok/s step 6192/19560 | loss 3.476584 (-0.62z)| norm 0.2811 (-0.10z)| lr 4.83e-04 | 4156.73 ms | 32.5% bf16 MFU | 126181 tok/s step 6193/19560 | loss 3.571480 (+2.03z)| norm 0.2867 (+0.14z)| lr 4.83e-04 | 4153.61 ms | 32.5% bf16 MFU | 126184 tok/s step 6194/19560 | loss 3.478117 (-0.57z)| norm 0.2955 (+0.49z)| lr 4.83e-04 | 4164.88 ms | 32.4% bf16 MFU | 126169 tok/s step 6195/19560 | loss 3.472117 (-0.72z)| norm 0.2720 (-0.47z)| lr 4.83e-04 | 4152.30 ms | 32.5% bf16 MFU | 126173 tok/s step 6196/19560 | loss 3.512003 (+0.39z)| norm 0.2817 (-0.08z)| lr 4.83e-04 | 4158.21 ms | 32.5% bf16 MFU | 126169 tok/s step 6197/19560 | loss 3.444009 (-1.49z)| norm 0.3045 (+0.84z)| lr 4.83e-04 | 4153.69 
ms | 32.5% bf16 MFU | 126172 tok/s step 6198/19560 | loss 3.452167 (-1.24z)| norm 0.2920 (+0.32z)| lr 4.83e-04 | 4161.80 ms | 32.4% bf16 MFU | 126162 tok/s step 6199/19560 | loss 3.485035 (-0.33z)| norm 0.2387 (-1.85z)| lr 4.83e-04 | 4152.12 ms | 32.5% bf16 MFU | 126167 tok/s step 6200/19560 | loss 3.518321 (+0.58z)| norm 0.2823 (-0.07z)| lr 4.83e-04 | 4147.03 ms | 32.6% bf16 MFU | 126180 tok/s step 6201/19560 | loss 3.557230 (+1.66z)| norm 0.2967 (+0.51z)| lr 4.83e-04 | 4161.20 ms | 32.4% bf16 MFU | 126171 tok/s step 6202/19560 | loss 3.487607 (-0.26z)| norm 0.2588 (-1.05z)| lr 4.83e-04 | 4157.33 ms | 32.5% bf16 MFU | 126168 tok/s step 6203/19560 | loss 3.448878 (-1.33z)| norm 0.2748 (-0.39z)| lr 4.83e-04 | 4147.39 ms | 32.6% bf16 MFU | 126180 tok/s step 6204/19560 | loss 3.475057 (-0.64z)| norm 0.2892 (+0.20z)| lr 4.83e-04 | 4165.91 ms | 32.4% bf16 MFU | 126164 tok/s step 6205/19560 | loss 3.571036 (+2.02z)| norm 0.2593 (-1.04z)| lr 4.83e-04 | 4163.14 ms | 32.4% bf16 MFU | 126152 tok/s step 6206/19560 | loss 3.494743 (-0.10z)| norm 0.2830 (-0.06z)| lr 4.82e-04 | 4156.91 ms | 32.5% bf16 MFU | 126151 tok/s step 6207/19560 | loss 3.467409 (-0.85z)| norm 0.2722 (-0.50z)| lr 4.82e-04 | 4142.72 ms | 32.6% bf16 MFU | 126171 tok/s step 6208/19560 | loss 3.520851 (+0.64z)| norm 0.2885 (+0.18z)| lr 4.82e-04 | 4169.92 ms | 32.4% bf16 MFU | 126149 tok/s step 6209/19560 | loss 3.549686 (+1.43z)| norm 0.3212 (+1.56z)| lr 4.82e-04 | 4189.36 ms | 32.2% bf16 MFU | 126099 tok/s step 6210/19560 | loss 3.505390 (+0.19z)| norm 0.2809 (-0.13z)| lr 4.82e-04 | 4176.45 ms | 32.3% bf16 MFU | 126071 tok/s step 6211/19560 | loss 3.530023 (+0.88z)| norm 0.2686 (-0.63z)| lr 4.82e-04 | 4178.54 ms | 32.3% bf16 MFU | 126041 tok/s step 6212/19560 | loss 3.441323 (-1.61z)| norm 0.2653 (-0.78z)| lr 4.82e-04 | 4158.45 ms | 32.5% bf16 MFU | 126043 tok/s step 6213/19560 | loss 3.508783 (+0.28z)| norm 0.2609 (-0.96z)| lr 4.82e-04 | 4150.38 ms | 32.5% bf16 MFU | 126057 tok/s step 6214/19560 | loss 
3.547789 (+1.35z)| norm 0.3215 (+1.76z)| lr 4.82e-04 | 4160.11 ms | 32.5% bf16 MFU | 126055 tok/s step 6215/19560 | loss 3.542279 (+1.18z)| norm 0.2490 (-1.47z)| lr 4.82e-04 | 4158.02 ms | 32.5% bf16 MFU | 126057 tok/s step 6216/19560 | loss 3.598157 (+2.65z)| norm 0.2876 (+0.26z)| lr 4.82e-04 | 4164.86 ms | 32.4% bf16 MFU | 126048 tok/s step 6217/19560 | loss 3.545489 (+1.22z)| norm 0.2745 (-0.33z)| lr 4.82e-04 | 4149.92 ms | 32.5% bf16 MFU | 126063 tok/s step 6218/19560 | loss 3.546423 (+1.22z)| norm 0.3091 (+1.20z)| lr 4.82e-04 | 4162.90 ms | 32.4% bf16 MFU | 126057 tok/s step 6219/19560 | loss 3.526204 (+0.69z)| norm 0.2977 (+0.68z)| lr 4.82e-04 | 4160.14 ms | 32.5% bf16 MFU | 126055 tok/s step 6220/19560 | loss 3.509162 (+0.22z)| norm 0.2941 (+0.51z)| lr 4.82e-04 | 4159.92 ms | 32.5% bf16 MFU | 126054 tok/s step 6221/19560 | loss 3.549782 (+1.30z)| norm 0.2761 (-0.32z)| lr 4.82e-04 | 4151.60 ms | 32.5% bf16 MFU | 126066 tok/s step 6222/19560 | loss 3.514110 (+0.36z)| norm 0.2636 (-0.89z)| lr 4.82e-04 | 4159.71 ms | 32.5% bf16 MFU | 126064 tok/s step 6223/19560 | loss 3.527307 (+0.70z)| norm 0.3103 (+1.23z)| lr 4.82e-04 | 4157.55 ms | 32.5% bf16 MFU | 126066 tok/s step 6224/19560 | loss 3.572755 (+1.89z)| norm 0.3505 (+2.92z)| lr 4.82e-04 | 4162.83 ms | 32.4% bf16 MFU | 126060 tok/s step 6225/19560 | loss 3.517930 (+0.41z)| norm 0.2837 (-0.01z)| lr 4.82e-04 | 4158.07 ms | 32.5% bf16 MFU | 126062 tok/s step 6226/19560 | loss 3.491650 (-0.30z)| norm 0.2958 (+0.51z)| lr 4.82e-04 | 4153.99 ms | 32.5% bf16 MFU | 126069 tok/s step 6227/19560 | loss 3.498690 (-0.10z)| norm 0.3229 (+1.70z)| lr 4.82e-04 | 4158.53 ms | 32.5% bf16 MFU | 126070 tok/s step 6228/19560 | loss 3.488898 (-0.36z)| norm 0.3168 (+1.41z)| lr 4.82e-04 | 4166.39 ms | 32.4% bf16 MFU | 126058 tok/s step 6229/19560 | loss 3.503821 (+0.03z)| norm 0.2825 (-0.09z)| lr 4.82e-04 | 4161.64 ms | 32.4% bf16 MFU | 126054 tok/s step 6230/19560 | loss 3.489444 (-0.35z)| norm 0.2708 (-0.59z)| lr 4.82e-04 | 4153.85 
ms | 32.5% bf16 MFU | 126062 tok/s step 6231/19560 | loss 3.473456 (-0.79z)| norm 0.2785 (-0.26z)| lr 4.81e-04 | 4158.12 ms | 32.5% bf16 MFU | 126064 tok/s step 6232/19560 | loss 3.481472 (-0.57z)| norm 0.2525 (-1.38z)| lr 4.81e-04 | 4158.99 ms | 32.5% bf16 MFU | 126064 tok/s step 6233/19560 | loss 3.515152 (+0.37z)| norm 0.3004 (+0.70z)| lr 4.81e-04 | 4163.55 ms | 32.4% bf16 MFU | 126057 tok/s step 6234/19560 | loss 3.539292 (+1.03z)| norm 0.2792 (-0.22z)| lr 4.81e-04 | 4153.57 ms | 32.5% bf16 MFU | 126065 tok/s step 6235/19560 | loss 3.524096 (+0.60z)| norm 0.2788 (-0.22z)| lr 4.81e-04 | 4165.39 ms | 32.4% bf16 MFU | 126055 tok/s step 6236/19560 | loss 3.447855 (-1.50z)| norm 0.2521 (-1.52z)| lr 4.81e-04 | 4158.19 ms | 32.5% bf16 MFU | 126057 tok/s step 6237/19560 | loss 3.533144 (+0.85z)| norm 0.2607 (-1.08z)| lr 4.81e-04 | 4149.31 ms | 32.5% bf16 MFU | 126072 tok/s step 6238/19560 | loss 3.547972 (+1.24z)| norm 0.3214 (+1.88z)| lr 4.81e-04 | 4160.17 ms | 32.5% bf16 MFU | 126069 tok/s step 6239/19560 | loss 3.515545 (+0.34z)| norm 0.2898 (+0.33z)| lr 4.81e-04 | 4158.63 ms | 32.5% bf16 MFU | 126069 tok/s step 6240/19560 | loss 3.506138 (+0.07z)| norm 0.2715 (-0.56z)| lr 4.81e-04 | 4151.03 ms | 32.5% bf16 MFU | 126081 tok/s step 6241/19560 | loss 3.542580 (+1.07z)| norm 0.2988 (+0.77z)| lr 4.81e-04 | 4207.40 ms | 32.1% bf16 MFU | 126008 tok/s step 6242/19560 | loss 3.480034 (-0.67z)| norm 0.2718 (-0.56z)| lr 4.81e-04 | 4162.20 ms | 32.4% bf16 MFU | 126005 tok/s step 6243/19560 | loss 3.582648 (+2.12z)| norm 0.2521 (-1.52z)| lr 4.81e-04 | 4155.66 ms | 32.5% bf16 MFU | 126013 tok/s step 6244/19560 | loss 3.541768 (+1.00z)| norm 0.2770 (-0.31z)| lr 4.81e-04 | 4152.15 ms | 32.5% bf16 MFU | 126026 tok/s step 6245/19560 | loss 3.502108 (-0.08z)| norm 0.2782 (-0.25z)| lr 4.81e-04 | 4152.87 ms | 32.5% bf16 MFU | 126037 tok/s step 6246/19560 | loss 3.555478 (+1.35z)| norm 0.2790 (-0.20z)| lr 4.81e-04 | 4146.94 ms | 32.6% bf16 MFU | 126057 tok/s step 6247/19560 | loss 
3.548161 (+1.14z)| norm 0.2796 (-0.17z)| lr 4.81e-04 | 4165.35 ms | 32.4% bf16 MFU | 126047 tok/s step 6248/19560 | loss 3.559310 (+1.41z)| norm 0.2574 (-1.25z)| lr 4.81e-04 | 4164.87 ms | 32.4% bf16 MFU | 126039 tok/s step 6249/19560 | loss 3.504782 (-0.04z)| norm 0.2792 (-0.17z)| lr 4.81e-04 | 4159.14 ms | 32.5% bf16 MFU | 126040 tok/s step 6250/19560 | loss 3.475636 (-0.81z)| norm 0.2637 (-0.94z)| lr 4.81e-04 | 4162.46 ms | 32.4% bf16 MFU | 126036 tok/s val loss 3.496155
HellaSwag: 2809/10042 = 0.279725 step 6251/19560 | loss 3.533908 (+0.73z)| norm 0.2937 (+0.55z)| lr 4.81e-04 | 4150.58 ms | 32.5% bf16 MFU | 126050 tok/s step 6252/19560 | loss 3.607230 (+2.59z)| norm 0.3005 (+0.87z)| lr 4.81e-04 | 4160.50 ms | 32.5% bf16 MFU | 126048 tok/s step 6253/19560 | loss 3.552993 (+1.17z)| norm 0.2956 (+0.63z)| lr 4.81e-04 | 4161.18 ms | 32.4% bf16 MFU | 126045 tok/s step 6254/19560 | loss 3.470832 (-0.94z)| norm 0.2718 (-0.54z)| lr 4.81e-04 | 4150.82 ms | 32.5% bf16 MFU | 126059 tok/s step 6255/19560 | loss 3.515463 (+0.20z)| norm 0.3464 (+3.03z)| lr 4.81e-04 | 4153.65 ms | 32.5% bf16 MFU | 126067 tok/s step 6256/19560 | loss 3.522798 (+0.38z)| norm 0.3427 (+2.75z)| lr 4.80e-04 | 4162.82 ms | 32.4% bf16 MFU | 126061 tok/s step 6257/19560 | loss 3.517860 (+0.25z)| norm 0.3028 (+0.88z)| lr 4.80e-04 | 4151.85 ms | 32.5% bf16 MFU | 126072 tok/s step 6258/19560 | loss 3.472528 (-0.93z)| norm 0.3101 (+1.20z)| lr 4.80e-04 | 4156.77 ms | 32.5% bf16 MFU | 126075 tok/s step 6259/19560 | loss 3.537880 (+0.78z)| norm 0.2839 (-0.02z)| lr 4.80e-04 | 4150.91 ms | 32.5% bf16 MFU | 126086 tok/s step 6260/19560 | loss 3.518119 (+0.25z)| norm 0.3012 (+0.77z)| lr 4.80e-04 | 4153.33 ms | 32.5% bf16 MFU | 126094 tok/s step 6261/19560 | loss 3.513868 (+0.14z)| norm 0.2914 (+0.31z)| lr 4.80e-04 | 4159.91 ms | 32.5% bf16 MFU | 126090 tok/s step 6262/19560 | loss 3.542591 (+0.88z)| norm 0.2461 (-1.76z)| lr 4.80e-04 | 4151.77 ms | 32.5% bf16 MFU | 126100 tok/s step 6263/19560 | loss 3.487076 (-0.59z)| norm 0.2986 (+0.64z)| lr 4.80e-04 | 4157.39 ms | 32.5% bf16 MFU | 
126100 tok/s step 6264/19560 | loss 3.476879 (-0.85z)| norm 0.3009 (+0.73z)| lr 4.80e-04 | 4159.24 ms | 32.5% bf16 MFU | 126098 tok/s step 6265/19560 | loss 3.489451 (-0.51z)| norm 0.2703 (-0.69z)| lr 4.80e-04 | 4155.90 ms | 32.5% bf16 MFU | 126101 tok/s step 6266/19560 | loss 3.512073 (+0.08z)| norm 0.2705 (-0.67z)| lr 4.80e-04 | 4153.32 ms | 32.5% bf16 MFU | 126108 tok/s step 6267/19560 | loss 3.523805 (+0.38z)| norm 0.2513 (-1.54z)| lr 4.80e-04 | 4156.10 ms | 32.5% bf16 MFU | 126110 tok/s step 6268/19560 | loss 3.492741 (-0.43z)| norm 0.2701 (-0.67z)| lr 4.80e-04 | 4156.76 ms | 32.5% bf16 MFU | 126111 tok/s step 6269/19560 | loss 3.512614 (+0.09z)| norm 0.2733 (-0.52z)| lr 4.80e-04 | 4154.86 ms | 32.5% bf16 MFU | 126114 tok/s step 6270/19560 | loss 3.523794 (+0.39z)| norm 0.2812 (-0.15z)| lr 4.80e-04 | 4152.51 ms | 32.5% bf16 MFU | 126122 tok/s step 6271/19560 | loss 3.540309 (+0.82z)| norm 0.2551 (-1.33z)| lr 4.80e-04 | 4153.28 ms | 32.5% bf16 MFU | 126127 tok/s step 6272/19560 | loss 3.493004 (-0.46z)| norm 0.2671 (-0.77z)| lr 4.80e-04 | 4150.67 ms | 32.5% bf16 MFU | 126137 tok/s step 6273/19560 | loss 3.489195 (-0.55z)| norm 0.2957 (+0.53z)| lr 4.80e-04 | 4159.04 ms | 32.5% bf16 MFU | 126133 tok/s step 6274/19560 | loss 3.518937 (+0.24z)| norm 0.2956 (+0.52z)| lr 4.80e-04 | 4155.74 ms | 32.5% bf16 MFU | 126134 tok/s step 6275/19560 | loss 3.499131 (-0.29z)| norm 0.3093 (+1.13z)| lr 4.80e-04 | 4154.99 ms | 32.5% bf16 MFU | 126137 tok/s step 6276/19560 | loss 3.466909 (-1.16z)| norm 0.2689 (-0.70z)| lr 4.80e-04 | 4166.96 ms | 32.4% bf16 MFU | 126121 tok/s step 6277/19560 | loss 3.465223 (-1.18z)| norm 0.2846 (+0.01z)| lr 4.80e-04 | 4155.73 ms | 32.5% bf16 MFU | 126123 tok/s step 6278/19560 | loss 3.488864 (-0.56z)| norm 0.2810 (-0.16z)| lr 4.80e-04 | 4160.00 ms | 32.5% bf16 MFU | 126118 tok/s step 6279/19560 | loss 3.478591 (-0.83z)| norm 0.2866 (+0.11z)| lr 4.80e-04 | 4149.83 ms | 32.5% bf16 MFU | 126129 tok/s step 6280/19560 | loss 3.504514 (-0.16z)| norm 
0.2657 (-0.84z)| lr 4.80e-04 | 4172.74 ms | 32.4% bf16 MFU | 126105 tok/s step 6281/19560 | loss 3.552293 (+1.16z)| norm 0.3131 (+1.33z)| lr 4.79e-04 | 4162.55 ms | 32.4% bf16 MFU | 126097 tok/s step 6282/19560 | loss 3.471064 (-1.08z)| norm 0.3132 (+1.35z)| lr 4.79e-04 | 4167.44 ms | 32.4% bf16 MFU | 126083 tok/s step 6283/19560 | loss 3.505010 (-0.14z)| norm 0.3125 (+1.32z)| lr 4.79e-04 | 4158.97 ms | 32.5% bf16 MFU | 126082 tok/s step 6284/19560 | loss 3.522583 (+0.35z)| norm 0.2797 (-0.19z)| lr 4.79e-04 | 4155.95 ms | 32.5% bf16 MFU | 126085 tok/s step 6285/19560 | loss 3.506032 (-0.10z)| norm 0.2651 (-0.85z)| lr 4.79e-04 | 4178.64 ms | 32.3% bf16 MFU | 126055 tok/s step 6286/19560 | loss 3.520375 (+0.30z)| norm 0.2650 (-0.84z)| lr 4.79e-04 | 4155.82 ms | 32.5% bf16 MFU | 126060 tok/s step 6287/19560 | loss 3.502260 (-0.22z)| norm 0.2767 (-0.29z)| lr 4.79e-04 | 4186.88 ms | 32.2% bf16 MFU | 126018 tok/s step 6288/19560 | loss 3.517761 (+0.25z)| norm 0.2593 (-1.09z)| lr 4.79e-04 | 4156.98 ms | 32.5% bf16 MFU | 126023 tok/s step 6289/19560 | loss 3.519547 (+0.30z)| norm 0.2581 (-1.13z)| lr 4.79e-04 | 4148.14 ms | 32.5% bf16 MFU | 126041 tok/s step 6290/19560 | loss 3.497758 (-0.33z)| norm 0.2809 (-0.06z)| lr 4.79e-04 | 4160.94 ms | 32.4% bf16 MFU | 126039 tok/s step 6291/19560 | loss 3.617186 (+3.04z)| norm 0.2959 (+0.64z)| lr 4.79e-04 | 4157.08 ms | 32.5% bf16 MFU | 126043 tok/s step 6292/19560 | loss 3.537484 (+0.78z)| norm 0.2804 (-0.09z)| lr 4.79e-04 | 4158.77 ms | 32.5% bf16 MFU | 126045 tok/s step 6293/19560 | loss 3.522895 (+0.36z)| norm 0.2840 (+0.07z)| lr 4.79e-04 | 4154.06 ms | 32.5% bf16 MFU | 126053 tok/s step 6294/19560 | loss 3.455001 (-1.56z)| norm 0.2818 (-0.04z)| lr 4.79e-04 | 4153.42 ms | 32.5% bf16 MFU | 126062 tok/s step 6295/19560 | loss 3.513390 (+0.09z)| norm 0.2464 (-1.69z)| lr 4.79e-04 | 4154.63 ms | 32.5% bf16 MFU | 126068 tok/s step 6296/19560 | loss 3.506927 (-0.10z)| norm 0.2517 (-1.43z)| lr 4.79e-04 | 4149.54 ms | 32.5% bf16 MFU | 
126082 tok/s step 6297/19560 | loss 3.507886 (-0.07z)| norm 0.2490 (-1.54z)| lr 4.79e-04 | 4153.16 ms | 32.5% bf16 MFU | 126090 tok/s step 6298/19560 | loss 3.526816 (+0.45z)| norm 0.4470 (+6.47z)| lr 4.79e-04 | 4152.17 ms | 32.5% bf16 MFU | 126099 tok/s step 6299/19560 | loss 3.511902 (+0.02z)| norm 0.3330 (+1.94z)| lr 4.79e-04 | 4162.73 ms | 32.4% bf16 MFU | 126092 tok/s step 6300/19560 | loss 3.532851 (+0.61z)| norm 0.3170 (+1.31z)| lr 4.79e-04 | 4149.19 ms | 32.5% bf16 MFU | 126105 tok/s step 6301/19560 | loss 3.550072 (+1.09z)| norm 0.2973 (+0.54z)| lr 4.79e-04 | 4156.14 ms | 32.5% bf16 MFU | 126107 tok/s step 6302/19560 | loss 3.485422 (-0.78z)| norm 0.3012 (+0.68z)| lr 4.79e-04 | 4152.83 ms | 32.5% bf16 MFU | 126114 tok/s step 6303/19560 | loss 3.521068 (+0.27z)| norm 0.3070 (+0.89z)| lr 4.79e-04 | 4162.64 ms | 32.4% bf16 MFU | 126106 tok/s step 6304/19560 | loss 3.517365 (+0.16z)| norm 0.2908 (+0.26z)| lr 4.79e-04 | 4153.23 ms | 32.5% bf16 MFU | 126112 tok/s step 6305/19560 | loss 3.514925 (+0.09z)| norm 0.2743 (-0.38z)| lr 4.79e-04 | 4154.90 ms | 32.5% bf16 MFU | 126116 tok/s step 6306/19560 | loss 3.477229 (-1.01z)| norm 0.2694 (-0.57z)| lr 4.78e-04 | 4162.55 ms | 32.4% bf16 MFU | 126108 tok/s step 6307/19560 | loss 3.538280 (+0.77z)| norm 0.2791 (-0.20z)| lr 4.78e-04 | 4149.60 ms | 32.5% bf16 MFU | 126120 tok/s step 6308/19560 | loss 3.548955 (+1.07z)| norm 0.2823 (-0.08z)| lr 4.78e-04 | 4157.29 ms | 32.5% bf16 MFU | 126120 tok/s step 6309/19560 | loss 3.484575 (-0.81z)| norm 0.2888 (+0.17z)| lr 4.78e-04 | 4156.83 ms | 32.5% bf16 MFU | 126120 tok/s step 6310/19560 | loss 3.570830 (+1.68z)| norm 0.2726 (-0.47z)| lr 4.78e-04 | 4150.78 ms | 32.5% bf16 MFU | 126129 tok/s step 6311/19560 | loss 3.504210 (-0.25z)| norm 0.2830 (-0.06z)| lr 4.78e-04 | 4152.90 ms | 32.5% bf16 MFU | 126135 tok/s step 6312/19560 | loss 3.469879 (-1.25z)| norm 0.3020 (+0.67z)| lr 4.78e-04 | 4153.64 ms | 32.5% bf16 MFU | 126140 tok/s step 6313/19560 | loss 3.494042 (-0.57z)| norm 
0.2880 (+0.12z)| lr 4.78e-04 | 4154.96 ms | 32.5% bf16 MFU | 126142 tok/s step 6314/19560 | loss 3.476724 (-1.08z)| norm 0.2588 (-1.01z)| lr 4.78e-04 | 4155.78 ms | 32.5% bf16 MFU | 126143 tok/s step 6315/19560 | loss 3.505105 (-0.23z)| norm 0.2917 (+0.27z)| lr 4.78e-04 | 4194.87 ms | 32.2% bf16 MFU | 126085 tok/s step 6316/19560 | loss 3.497989 (-0.44z)| norm 0.2626 (-0.86z)| lr 4.78e-04 | 4151.26 ms | 32.5% bf16 MFU | 126095 tok/s step 6317/19560 | loss 3.432466 (-2.32z)| norm 0.2827 (-0.09z)| lr 4.78e-04 | 4152.98 ms | 32.5% bf16 MFU | 126103 tok/s step 6318/19560 | loss 3.496349 (-0.45z)| norm 0.2741 (-0.42z)| lr 4.78e-04 | 4156.38 ms | 32.5% bf16 MFU | 126105 tok/s step 6319/19560 | loss 3.474555 (-1.09z)| norm 0.2544 (-1.19z)| lr 4.78e-04 | 4157.81 ms | 32.5% bf16 MFU | 126104 tok/s step 6320/19560 | loss 3.537205 (+0.74z)| norm 0.2579 (-1.04z)| lr 4.78e-04 | 4158.32 ms | 32.5% bf16 MFU | 126103 tok/s step 6321/19560 | loss 3.473356 (-1.13z)| norm 0.2610 (-0.91z)| lr 4.78e-04 | 4151.77 ms | 32.5% bf16 MFU | 126112 tok/s step 6322/19560 | loss 3.554348 (+1.25z)| norm 0.2698 (-0.56z)| lr 4.78e-04 | 4157.80 ms | 32.5% bf16 MFU | 126111 tok/s step 6323/19560 | loss 3.489164 (-0.68z)| norm 0.2783 (-0.23z)| lr 4.78e-04 | 4149.78 ms | 32.5% bf16 MFU | 126123 tok/s step 6324/19560 | loss 3.510732 (-0.04z)| norm 0.2556 (-1.10z)| lr 4.78e-04 | 4154.77 ms | 32.5% bf16 MFU | 126126 tok/s step 6325/19560 | loss 3.455967 (-1.68z)| norm 0.2745 (-0.36z)| lr 4.78e-04 | 4159.92 ms | 32.5% bf16 MFU | 126121 tok/s step 6326/19560 | loss 3.519786 (+0.21z)| norm 0.2744 (-0.36z)| lr 4.78e-04 | 4156.19 ms | 32.5% bf16 MFU | 126123 tok/s step 6327/19560 | loss 3.486405 (-0.79z)| norm 0.2833 (-0.03z)| lr 4.78e-04 | 4162.37 ms | 32.4% bf16 MFU | 126115 tok/s step 6328/19560 | loss 3.527971 (+0.46z)| norm 0.3106 (+1.02z)| lr 4.78e-04 | 4160.71 ms | 32.5% bf16 MFU | 126109 tok/s step 6329/19560 | loss 3.483902 (-0.86z)| norm 0.2898 (+0.21z)| lr 4.78e-04 | 4143.44 ms | 32.6% bf16 MFU | 
126130 tok/s step 6330/19560 | loss 3.458697 (-1.60z)| norm 0.2805 (-0.15z)| lr 4.78e-04 | 4150.78 ms | 32.5% bf16 MFU | 126140 tok/s step 6331/19560 | loss 3.554663 (+1.27z)| norm 0.3224 (+1.45z)| lr 4.77e-04 | 4153.67 ms | 32.5% bf16 MFU | 126144 tok/s step 6332/19560 | loss 3.493226 (-0.60z)| norm 0.3089 (+0.92z)| lr 4.77e-04 | 4162.27 ms | 32.4% bf16 MFU | 126135 tok/s step 6333/19560 | loss 3.498543 (-0.43z)| norm 0.2716 (-0.52z)| lr 4.77e-04 | 4157.84 ms | 32.5% bf16 MFU | 126133 tok/s step 6334/19560 | loss 3.577707 (+1.96z)| norm 0.2599 (-0.96z)| lr 4.77e-04 | 4153.32 ms | 32.5% bf16 MFU | 126138 tok/s step 6335/19560 | loss 3.452359 (-1.83z)| norm 0.2694 (-0.59z)| lr 4.77e-04 | 4152.84 ms | 32.5% bf16 MFU | 126143 tok/s step 6336/19560 | loss 3.481703 (-0.93z)| norm 0.2725 (-0.47z)| lr 4.77e-04 | 4155.78 ms | 32.5% bf16 MFU | 126144 tok/s step 6337/19560 | loss 3.537581 (+0.76z)| norm 0.2487 (-1.37z)| lr 4.77e-04 | 4147.55 ms | 32.6% bf16 MFU | 126157 tok/s step 6338/19560 | loss 3.444432 (-2.01z)| norm 0.2837 (-0.02z)| lr 4.77e-04 | 4160.83 ms | 32.4% bf16 MFU | 126150 tok/s step 6339/19560 | loss 3.516530 (+0.14z)| norm 0.2605 (-0.91z)| lr 4.77e-04 | 4157.66 ms | 32.5% bf16 MFU | 126147 tok/s step 6340/19560 | loss 3.507777 (-0.14z)| norm 0.2647 (-0.75z)| lr 4.77e-04 | 4155.79 ms | 32.5% bf16 MFU | 126148 tok/s step 6341/19560 | loss 3.556900 (+1.32z)| norm 0.2680 (-0.63z)| lr 4.77e-04 | 4157.86 ms | 32.5% bf16 MFU | 126145 tok/s step 6342/19560 | loss 3.491721 (-0.62z)| norm 0.2783 (-0.21z)| lr 4.77e-04 | 4151.92 ms | 32.5% bf16 MFU | 126152 tok/s step 6343/19560 | loss 3.479860 (-0.96z)| norm 0.2790 (-0.20z)| lr 4.77e-04 | 4152.87 ms | 32.5% bf16 MFU | 126157 tok/s step 6344/19560 | loss 3.511875 (+0.02z)| norm 0.2719 (-0.47z)| lr 4.77e-04 | 4141.68 ms | 32.6% bf16 MFU | 126178 tok/s step 6345/19560 | loss 3.482485 (-0.88z)| norm 0.2852 (+0.05z)| lr 4.77e-04 | 4153.65 ms | 32.5% bf16 MFU | 126180 tok/s step 6346/19560 | loss 3.486529 (-0.74z)| norm 
0.2661 (-0.69z)| lr 4.77e-04 | 4167.84 ms | 32.4% bf16 MFU | 126161 tok/s step 6347/19560 | loss 3.509487 (-0.02z)| norm 0.2779 (-0.22z)| lr 4.77e-04 | 4155.59 ms | 32.5% bf16 MFU | 126161 tok/s step 6348/19560 | loss 3.525001 (+0.46z)| norm 0.2761 (-0.29z)| lr 4.77e-04 | 4169.33 ms | 32.4% bf16 MFU | 126141 tok/s step 6349/19560 | loss 3.489475 (-0.63z)| norm 0.2789 (-0.18z)| lr 4.77e-04 | 4152.22 ms | 32.5% bf16 MFU | 126147 tok/s step 6350/19560 | loss 3.510441 (+0.02z)| norm 0.2687 (-0.58z)| lr 4.77e-04 | 4151.39 ms | 32.5% bf16 MFU | 126154 tok/s step 6351/19560 | loss 3.526142 (+0.51z)| norm 0.2808 (-0.10z)| lr 4.77e-04 | 4161.47 ms | 32.4% bf16 MFU | 126146 tok/s step 6352/19560 | loss 3.571014 (+1.91z)| norm 0.3024 (+0.79z)| lr 4.77e-04 | 4145.61 ms | 32.6% bf16 MFU | 126162 tok/s step 6353/19560 | loss 3.474102 (-1.10z)| norm 0.2923 (+0.38z)| lr 4.77e-04 | 4152.54 ms | 32.5% bf16 MFU | 126167 tok/s step 6354/19560 | loss 3.501037 (-0.26z)| norm 0.2877 (+0.20z)| lr 4.77e-04 | 4150.37 ms | 32.5% bf16 MFU | 126174 tok/s step 6355/19560 | loss 3.623134 (+3.35z)| norm 0.2791 (-0.14z)| lr 4.76e-04 | 4157.04 ms | 32.5% bf16 MFU | 126172 tok/s step 6356/19560 | loss 3.479290 (-0.92z)| norm 0.3106 (+1.15z)| lr 4.76e-04 | 4155.66 ms | 32.5% bf16 MFU | 126171 tok/s step 6357/19560 | loss 3.500040 (-0.31z)| norm 0.2871 (+0.19z)| lr 4.76e-04 | 4146.72 ms | 32.6% bf16 MFU | 126184 tok/s step 6358/19560 | loss 3.462098 (-1.42z)| norm 0.2610 (-0.88z)| lr 4.76e-04 | 4152.42 ms | 32.5% bf16 MFU | 126188 tok/s step 6359/19560 | loss 3.527069 (+0.49z)| norm 0.2846 (+0.09z)| lr 4.76e-04 | 4151.80 ms | 32.5% bf16 MFU | 126193 tok/s step 6360/19560 | loss 3.585603 (+2.16z)| norm 0.2757 (-0.29z)| lr 4.76e-04 | 4147.60 ms | 32.6% bf16 MFU | 126204 tok/s step 6361/19560 | loss 3.513369 (+0.06z)| norm 0.2691 (-0.55z)| lr 4.76e-04 | 4157.75 ms | 32.5% bf16 MFU | 126198 tok/s step 6362/19560 | loss 3.530969 (+0.57z)| norm 0.2951 (+0.51z)| lr 4.76e-04 | 4158.68 ms | 32.5% bf16 MFU | 
126192 tok/s step 6363/19560 | loss 3.416067 (-2.67z)| norm 0.3006 (+0.73z)| lr 4.76e-04 | 4163.69 ms | 32.4% bf16 MFU | 126178 tok/s step 6364/19560 | loss 3.564792 (+1.52z)| norm 0.2910 (+0.33z)| lr 4.76e-04 | 4158.22 ms | 32.5% bf16 MFU | 126174 tok/s step 6365/19560 | loss 3.487370 (-0.67z)| norm 0.2997 (+0.68z)| lr 4.76e-04 | 4157.60 ms | 32.5% bf16 MFU | 126170 tok/s step 6366/19560 | loss 3.464063 (-1.31z)| norm 0.2896 (+0.27z)| lr 4.76e-04 | 4163.66 ms | 32.4% bf16 MFU | 126158 tok/s step 6367/19560 | loss 3.538703 (+0.80z)| norm 0.2872 (+0.17z)| lr 4.76e-04 | 4162.51 ms | 32.4% bf16 MFU | 126147 tok/s step 6368/19560 | loss 3.605555 (+2.60z)| norm 0.3035 (+0.84z)| lr 4.76e-04 | 4155.65 ms | 32.5% bf16 MFU | 126148 tok/s step 6369/19560 | loss 3.505954 (-0.14z)| norm 0.2740 (-0.38z)| lr 4.76e-04 | 4157.41 ms | 32.5% bf16 MFU | 126146 tok/s step 6370/19560 | loss 3.477324 (-0.93z)| norm 0.2734 (-0.41z)| lr 4.76e-04 | 4153.02 ms | 32.5% bf16 MFU | 126151 tok/s step 6371/19560 | loss 3.535539 (+0.70z)| norm 0.2706 (-0.53z)| lr 4.76e-04 | 4157.11 ms | 32.5% bf16 MFU | 126149 tok/s step 6372/19560 | loss 3.496296 (-0.39z)| norm 0.2992 (+0.66z)| lr 4.76e-04 | 4147.55 ms | 32.6% bf16 MFU | 126162 tok/s step 6373/19560 | loss 3.590258 (+2.18z)| norm 0.3054 (+0.91z)| lr 4.76e-04 | 4153.40 ms | 32.5% bf16 MFU | 126166 tok/s step 6374/19560 | loss 3.438871 (-1.94z)| norm 0.2657 (-0.74z)| lr 4.76e-04 | 4156.37 ms | 32.5% bf16 MFU | 126165 tok/s step 6375/19560 | loss 3.549978 (+1.09z)| norm 0.2589 (-1.02z)| lr 4.76e-04 | 4154.84 ms | 32.5% bf16 MFU | 126166 tok/s step 6376/19560 | loss 3.469839 (-1.08z)| norm 0.2535 (-1.24z)| lr 4.76e-04 | 4146.94 ms | 32.6% bf16 MFU | 126179 tok/s step 6377/19560 | loss 3.450360 (-1.58z)| norm 0.2525 (-1.27z)| lr 4.76e-04 | 4153.96 ms | 32.5% bf16 MFU | 126181 tok/s step 6378/19560 | loss 3.536785 (+0.74z)| norm 0.2543 (-1.18z)| lr 4.76e-04 | 4157.44 ms | 32.5% bf16 MFU | 126177 tok/s step 6379/19560 | loss 3.503560 (-0.15z)| norm 
0.2603 (-0.92z)| lr 4.76e-04 | 4164.39 ms | 32.4% bf16 MFU | 126163 tok/s step 6380/19560 | loss 3.575778 (+1.84z)| norm 0.2651 (-0.71z)| lr 4.75e-04 | 4157.72 ms | 32.5% bf16 MFU | 126160 tok/s step 6381/19560 | loss 3.483228 (-0.70z)| norm 0.2966 (+0.58z)| lr 4.75e-04 | 4158.78 ms | 32.5% bf16 MFU | 126155 tok/s step 6382/19560 | loss 3.520730 (+0.33z)| norm 0.2634 (-0.78z)| lr 4.75e-04 | 4155.06 ms | 32.5% bf16 MFU | 126157 tok/s step 6383/19560 | loss 3.526574 (+0.49z)| norm 0.2678 (-0.59z)| lr 4.75e-04 | 4161.71 ms | 32.4% bf16 MFU | 126148 tok/s step 6384/19560 | loss 3.576770 (+1.85z)| norm 0.2735 (-0.34z)| lr 4.75e-04 | 4157.14 ms | 32.5% bf16 MFU | 126146 tok/s step 6385/19560 | loss 3.512482 (+0.09z)| norm 0.2704 (-0.46z)| lr 4.75e-04 | 4160.65 ms | 32.5% bf16 MFU | 126139 tok/s step 6386/19560 | loss 3.451737 (-1.56z)| norm 0.2652 (-0.68z)| lr 4.75e-04 | 4150.57 ms | 32.5% bf16 MFU | 126148 tok/s step 6387/19560 | loss 3.523570 (+0.40z)| norm 0.3225 (+1.78z)| lr 4.75e-04 | 4158.10 ms | 32.5% bf16 MFU | 126145 tok/s step 6388/19560 | loss 3.510703 (+0.05z)| norm 0.3105 (+1.26z)| lr 4.75e-04 | 4149.89 ms | 32.5% bf16 MFU | 126155 tok/s step 6389/19560 | loss 3.454581 (-1.46z)| norm 0.2688 (-0.52z)| lr 4.75e-04 | 4156.29 ms | 32.5% bf16 MFU | 126154 tok/s step 6390/19560 | loss 3.531297 (+0.62z)| norm 0.2818 (+0.03z)| lr 4.75e-04 | 4163.15 ms | 32.4% bf16 MFU | 126143 tok/s step 6391/19560 | loss 3.499258 (-0.25z)| norm 0.2797 (-0.06z)| lr 4.75e-04 | 4153.00 ms | 32.5% bf16 MFU | 126148 tok/s step 6392/19560 | loss 3.540331 (+0.85z)| norm 0.2728 (-0.35z)| lr 4.75e-04 | 4153.92 ms | 32.5% bf16 MFU | 126152 tok/s step 6393/19560 | loss 3.501717 (-0.20z)| norm 0.2546 (-1.13z)| lr 4.75e-04 | 4154.52 ms | 32.5% bf16 MFU | 126154 tok/s step 6394/19560 | loss 3.487664 (-0.57z)| norm 0.2682 (-0.54z)| lr 4.75e-04 | 4148.53 ms | 32.5% bf16 MFU | 126165 tok/s step 6395/19560 | loss 3.515809 (+0.19z)| norm 0.2908 (+0.42z)| lr 4.75e-04 | 4154.22 ms | 32.5% bf16 MFU | 
126167 tok/s step 6396/19560 | loss 3.481402 (-0.74z)| norm 0.2804 (-0.03z)| lr 4.75e-04 | 4160.61 ms | 32.5% bf16 MFU | 126160 tok/s step 6397/19560 | loss 3.505648 (-0.08z)| norm 0.2594 (-0.94z)| lr 4.75e-04 | 4149.53 ms | 32.5% bf16 MFU | 126169 tok/s step 6398/19560 | loss 3.448426 (-1.60z)| norm 0.2827 (+0.07z)| lr 4.75e-04 | 4155.13 ms | 32.5% bf16 MFU | 126169 tok/s step 6399/19560 | loss 3.471808 (-0.96z)| norm 0.2760 (-0.22z)| lr 4.75e-04 | 4184.23 ms | 32.3% bf16 MFU | 126126 tok/s step 6400/19560 | loss 3.462465 (-1.20z)| norm 0.2834 (+0.09z)| lr 4.75e-04 | 4227.61 ms | 31.9% bf16 MFU | 126020 tok/s step 6401/19560 | loss 3.482156 (-0.67z)| norm 0.2583 (-0.99z)| lr 4.75e-04 | 4166.52 ms | 32.4% bf16 MFU | 126011 tok/s step 6402/19560 | loss 3.478726 (-0.75z)| norm 0.2633 (-0.76z)| lr 4.75e-04 | 4167.98 ms | 32.4% bf16 MFU | 126000 tok/s step 6403/19560 | loss 3.510909 (+0.10z)| norm 0.2661 (-0.62z)| lr 4.75e-04 | 4164.51 ms | 32.4% bf16 MFU | 125995 tok/s step 6404/19560 | loss 3.424626 (-2.15z)| norm 0.3194 (+1.67z)| lr 4.75e-04 | 4161.13 ms | 32.4% bf16 MFU | 125995 tok/s step 6405/19560 | loss 3.457680 (-1.28z)| norm 0.2817 (+0.04z)| lr 4.74e-04 | 4167.37 ms | 32.4% bf16 MFU | 125986 tok/s step 6406/19560 | loss 3.463815 (-1.11z)| norm 0.2818 (+0.04z)| lr 4.74e-04 | 4165.81 ms | 32.4% bf16 MFU | 125979 tok/s step 6407/19560 | loss 3.470347 (-0.94z)| norm 0.2959 (+0.65z)| lr 4.74e-04 | 4160.55 ms | 32.5% bf16 MFU | 125981 tok/s step 6408/19560 | loss 3.546469 (+1.03z)| norm 0.2822 (+0.05z)| lr 4.74e-04 | 4157.47 ms | 32.5% bf16 MFU | 125987 tok/s step 6409/19560 | loss 3.437368 (-1.76z)| norm 0.2825 (+0.08z)| lr 4.74e-04 | 4201.74 ms | 32.1% bf16 MFU | 125927 tok/s step 6410/19560 | loss 3.516416 (+0.27z)| norm 0.3125 (+1.38z)| lr 4.74e-04 | 4162.35 ms | 32.4% bf16 MFU | 125928 tok/s step 6411/19560 | loss 3.441492 (-1.64z)| norm 0.2747 (-0.25z)| lr 4.74e-04 | 4163.43 ms | 32.4% bf16 MFU | 125928 tok/s step 6412/19560 | loss 3.492187 (-0.34z)| norm 
0.2725 (-0.35z)| lr 4.74e-04 | 4222.41 ms | 32.0% bf16 MFU | 125840 tok/s step 6413/19560 | loss 3.496695 (-0.22z)| norm 0.2767 (-0.17z)| lr 4.74e-04 | 4194.24 ms | 32.2% bf16 MFU | 125798 tok/s step 6414/19560 | loss 3.509628 (+0.11z)| norm 0.2786 (-0.09z)| lr 4.74e-04 | 4158.76 ms | 32.5% bf16 MFU | 125812 tok/s step 6415/19560 | loss 3.488376 (-0.43z)| norm 0.2860 (+0.23z)| lr 4.74e-04 | 4213.20 ms | 32.0% bf16 MFU | 125743 tok/s step 6416/19560 | loss 3.512442 (+0.19z)| norm 0.2909 (+0.44z)| lr 4.74e-04 | 4168.55 ms | 32.4% bf16 MFU | 125745 tok/s step 6417/19560 | loss 3.434464 (-1.77z)| norm 0.2567 (-1.06z)| lr 4.74e-04 | 4161.85 ms | 32.4% bf16 MFU | 125756 tok/s step 6418/19560 | loss 3.457036 (-1.18z)| norm 0.2466 (-1.48z)| lr 4.74e-04 | 4161.07 ms | 32.4% bf16 MFU | 125768 tok/s step 6419/19560 | loss 3.485976 (-0.44z)| norm 0.2864 (+0.26z)| lr 4.74e-04 | 4165.30 ms | 32.4% bf16 MFU | 125773 tok/s step 6420/19560 | loss 3.463289 (-1.02z)| norm 0.2488 (-1.37z)| lr 4.74e-04 | 4153.58 ms | 32.5% bf16 MFU | 125796 tok/s step 6421/19560 | loss 3.451631 (-1.30z)| norm 0.2660 (-0.62z)| lr 4.74e-04 | 4157.90 ms | 32.5% bf16 MFU | 125811 tok/s step 6422/19560 | loss 3.508316 (+0.15z)| norm 0.2456 (-1.47z)| lr 4.74e-04 | 4170.80 ms | 32.4% bf16 MFU | 125806 tok/s step 6423/19560 | loss 3.532466 (+0.78z)| norm 0.2578 (-0.96z)| lr 4.74e-04 | 4203.23 ms | 32.1% bf16 MFU | 125752 tok/s step 6424/19560 | loss 3.504230 (+0.04z)| norm 0.2840 (+0.17z)| lr 4.74e-04 | 4148.11 ms | 32.5% bf16 MFU | 125784 tok/s step 6425/19560 | loss 3.488255 (-0.36z)| norm 0.3004 (+0.86z)| lr 4.74e-04 | 4174.12 ms | 32.3% bf16 MFU | 125775 tok/s step 6426/19560 | loss 3.498183 (-0.10z)| norm 0.2838 (+0.26z)| lr 4.74e-04 | 4158.76 ms | 32.5% bf16 MFU | 125790 tok/s step 6427/19560 | loss 3.499330 (-0.07z)| norm 0.2829 (+0.23z)| lr 4.74e-04 | 4158.80 ms | 32.5% bf16 MFU | 125804 tok/s step 6428/19560 | loss 3.443257 (-1.50z)| norm 0.2953 (+0.99z)| lr 4.74e-04 | 4162.64 ms | 32.4% bf16 MFU | 
125811 tok/s step 6429/19560 | loss 3.509219 (+0.21z)| norm 0.2780 (-0.04z)| lr 4.73e-04 | 4167.53 ms | 32.4% bf16 MFU | 125811 tok/s step 6430/19560 | loss 3.471726 (-0.75z)| norm 0.2622 (-0.98z)| lr 4.73e-04 | 4159.13 ms | 32.5% bf16 MFU | 125823 tok/s step 6431/19560 | loss 3.472659 (-0.72z)| norm 0.3171 (+2.33z)| lr 4.73e-04 | 4157.17 ms | 32.5% bf16 MFU | 125838 tok/s step 6432/19560 | loss 3.495910 (-0.11z)| norm 0.3209 (+2.49z)| lr 4.73e-04 | 4151.20 ms | 32.5% bf16 MFU | 125861 tok/s step 6433/19560 | loss 3.462280 (-0.97z)| norm 0.2844 (+0.33z)| lr 4.73e-04 | 4181.99 ms | 32.3% bf16 MFU | 125836 tok/s step 6434/19560 | loss 3.402893 (-2.43z)| norm 0.2789 (+0.01z)| lr 4.73e-04 | 4161.65 ms | 32.4% bf16 MFU | 125843 tok/s step 6435/19560 | loss 3.489940 (-0.23z)| norm 0.3015 (+1.32z)| lr 4.73e-04 | 4155.50 ms | 32.5% bf16 MFU | 125859 tok/s step 6436/19560 | loss 3.453780 (-1.13z)| norm 0.2784 (-0.03z)| lr 4.73e-04 | 4159.66 ms | 32.5% bf16 MFU | 125868 tok/s step 6437/19560 | loss 3.461012 (-0.94z)| norm 0.2642 (-0.85z)| lr 4.73e-04 | 4165.64 ms | 32.4% bf16 MFU | 125868 tok/s step 6438/19560 | loss 3.489017 (-0.21z)| norm 0.2785 (-0.02z)| lr 4.73e-04 | 4162.66 ms | 32.4% bf16 MFU | 125872 tok/s step 6439/19560 | loss 3.515510 (+0.46z)| norm 0.2745 (-0.25z)| lr 4.73e-04 | 4154.56 ms | 32.5% bf16 MFU | 125888 tok/s step 6440/19560 | loss 3.464075 (-0.85z)| norm 0.2862 (+0.45z)| lr 4.73e-04 | 4159.13 ms | 32.5% bf16 MFU | 125897 tok/s step 6441/19560 | loss 3.490511 (-0.18z)| norm 0.2685 (-0.58z)| lr 4.73e-04 | 4169.11 ms | 32.4% bf16 MFU | 125890 tok/s step 6442/19560 | loss 3.462130 (-0.90z)| norm 0.2619 (-0.98z)| lr 4.73e-04 | 4157.51 ms | 32.5% bf16 MFU | 125901 tok/s step 6443/19560 | loss 3.477708 (-0.49z)| norm 0.2618 (-0.97z)| lr 4.73e-04 | 4161.76 ms | 32.4% bf16 MFU | 125904 tok/s step 6444/19560 | loss 3.444937 (-1.31z)| norm 0.2808 (+0.14z)| lr 4.73e-04 | 4149.03 ms | 32.5% bf16 MFU | 125927 tok/s step 6445/19560 | loss 3.508400 (+0.28z)| norm 
0.2562 (-1.29z)| lr 4.73e-04 | 4159.00 ms | 32.5% bf16 MFU | 125934 tok/s step 6446/19560 | loss 3.463463 (-0.86z)| norm 0.2848 (+0.39z)| lr 4.73e-04 | 4163.88 ms | 32.4% bf16 MFU | 125933 tok/s step 6447/19560 | loss 3.478024 (-0.49z)| norm 0.2751 (-0.19z)| lr 4.73e-04 | 4164.99 ms | 32.4% bf16 MFU | 125930 tok/s step 6448/19560 | loss 3.437729 (-1.49z)| norm 0.2974 (+1.11z)| lr 4.73e-04 | 4159.36 ms | 32.5% bf16 MFU | 125936 tok/s step 6449/19560 | loss 3.468651 (-0.70z)| norm 0.2828 (+0.23z)| lr 4.73e-04 | 4168.44 ms | 32.4% bf16 MFU | 125928 tok/s step 6450/19560 | loss 3.524320 (+0.72z)| norm 0.2812 (+0.13z)| lr 4.73e-04 | 4158.91 ms | 32.5% bf16 MFU | 125935 tok/s step 6451/19560 | loss 3.444250 (-1.30z)| norm 0.3047 (+1.50z)| lr 4.73e-04 | 4171.37 ms | 32.4% bf16 MFU | 125923 tok/s step 6452/19560 | loss 3.508034 (+0.31z)| norm 0.2953 (+0.94z)| lr 4.73e-04 | 4172.48 ms | 32.4% bf16 MFU | 125909 tok/s step 6453/19560 | loss 3.466097 (-0.75z)| norm 0.2946 (+0.88z)| lr 4.73e-04 | 4159.46 ms | 32.5% bf16 MFU | 125916 tok/s step 6454/19560 | loss 3.435706 (-1.50z)| norm 0.2971 (+1.01z)| lr 4.72e-04 | 4156.53 ms | 32.5% bf16 MFU | 125927 tok/s step 6455/19560 | loss 3.462682 (-0.81z)| norm 0.3258 (+2.61z)| lr 4.72e-04 | 4170.77 ms | 32.4% bf16 MFU | 125916 tok/s step 6456/19560 | loss 3.424982 (-1.72z)| norm 0.2690 (-0.63z)| lr 4.72e-04 | 4164.65 ms | 32.4% bf16 MFU | 125915 tok/s step 6457/19560 | loss 3.476616 (-0.43z)| norm 0.2942 (+0.83z)| lr 4.72e-04 | 4153.79 ms | 32.5% bf16 MFU | 125930 tok/s step 6458/19560 | loss 3.497532 (+0.08z)| norm 0.2857 (+0.33z)| lr 4.72e-04 | 4160.55 ms | 32.5% bf16 MFU | 125934 tok/s step 6459/19560 | loss 3.489083 (-0.12z)| norm 0.2647 (-0.87z)| lr 4.72e-04 | 4166.51 ms | 32.4% bf16 MFU | 125929 tok/s step 6460/19560 | loss 3.464035 (-0.74z)| norm 0.3013 (+1.30z)| lr 4.72e-04 | 4816.70 ms | 28.0% bf16 MFU | 125075 tok/s step 6461/19560 | loss 3.459138 (-0.86z)| norm 0.2823 (+0.16z)| lr 4.72e-04 | 4190.14 ms | 32.2% bf16 MFU | 
125078 tok/s step 6462/19560 | loss 3.495530 (+0.07z)| norm 0.2966 (+1.00z)| lr 4.72e-04 | 4159.03 ms | 32.5% bf16 MFU | 125127 tok/s step 6463/19560 | loss 3.491708 (-0.03z)| norm 0.3023 (+1.32z)| lr 4.72e-04 | 4150.88 ms | 32.5% bf16 MFU | 125186 tok/s step 6464/19560 | loss 3.594090 (+2.51z)| norm 0.3145 (+1.99z)| lr 4.72e-04 | 4152.21 ms | 32.5% bf16 MFU | 125240 tok/s step 6465/19560 | loss 3.441985 (-1.28z)| norm 0.2593 (-1.24z)| lr 4.72e-04 | 4154.79 ms | 32.5% bf16 MFU | 125287 tok/s step 6466/19560 | loss 3.478029 (-0.38z)| norm 0.2961 (+0.91z)| lr 4.72e-04 | 6083.17 ms | 22.2% bf16 MFU | 123332 tok/s step 6467/19560 | loss 3.494982 (+0.05z)| norm 0.2849 (+0.25z)| lr 4.72e-04 | 4162.63 ms | 32.4% bf16 MFU | 123463 tok/s step 6468/19560 | loss 3.476426 (-0.41z)| norm 0.2892 (+0.48z)| lr 4.72e-04 | 4157.23 ms | 32.5% bf16 MFU | 123596 tok/s step 6469/19560 | loss 3.465294 (-0.68z)| norm 0.2872 (+0.36z)| lr 4.72e-04 | 4147.62 ms | 32.6% bf16 MFU | 123736 tok/s step 6470/19560 | loss 3.459610 (-0.82z)| norm 0.2939 (+0.75z)| lr 4.72e-04 | 4149.33 ms | 32.5% bf16 MFU | 123867 tok/s step 6471/19560 | loss 3.539279 (+1.18z)| norm 0.3072 (+1.50z)| lr 4.72e-04 | 4153.55 ms | 32.5% bf16 MFU | 123985 tok/s step 6472/19560 | loss 3.447880 (-1.10z)| norm 0.2896 (+0.47z)| lr 4.72e-04 | 4149.63 ms | 32.5% bf16 MFU | 124103 tok/s step 6473/19560 | loss 3.424442 (-1.66z)| norm 0.3090 (+1.58z)| lr 4.72e-04 | 4164.83 ms | 32.4% bf16 MFU | 124192 tok/s step 6474/19560 | loss 3.472335 (-0.47z)| norm 0.2940 (+0.70z)| lr 4.72e-04 | 4160.21 ms | 32.5% bf16 MFU | 124284 tok/s step 6475/19560 | loss 3.513454 (+0.55z)| norm 0.2647 (-0.99z)| lr 4.72e-04 | 4162.26 ms | 32.4% bf16 MFU | 124368 tok/s step 6476/19560 | loss 3.449345 (-1.02z)| norm 0.2848 (+0.16z)| lr 4.72e-04 | 4152.26 ms | 32.5% bf16 MFU | 124463 tok/s step 6477/19560 | loss 3.456899 (-0.83z)| norm 0.2871 (+0.30z)| lr 4.72e-04 | 4155.92 ms | 32.5% bf16 MFU | 124547 tok/s step 6478/19560 | loss 3.502398 (+0.29z)| norm 
0.2469 (-1.99z)| lr 4.71e-04 | 4145.75 ms | 32.6% bf16 MFU | 124643 tok/s step 6479/19560 | loss 3.436245 (-1.32z)| norm 0.2793 (-0.14z)| lr 4.71e-04 | 4172.51 ms | 32.4% bf16 MFU | 124694 tok/s step 6480/19560 | loss 3.394962 (-2.29z)| norm 0.2782 (-0.19z)| lr 4.71e-04 | 4174.10 ms | 32.3% bf16 MFU | 124739 tok/s step 6481/19560 | loss 3.534399 (+1.11z)| norm 0.2874 (+0.33z)| lr 4.71e-04 | 4177.45 ms | 32.3% bf16 MFU | 124777 tok/s step 6482/19560 | loss 3.496639 (+0.19z)| norm 0.3001 (+1.05z)| lr 4.71e-04 | 4148.75 ms | 32.5% bf16 MFU | 124857 tok/s step 6483/19560 | loss 3.480372 (-0.19z)| norm 0.2944 (+0.72z)| lr 4.71e-04 | 4161.82 ms | 32.4% bf16 MFU | 124913 tok/s step 6484/19560 | loss 3.486254 (-0.04z)| norm 0.2983 (+0.95z)| lr 4.71e-04 | 4158.15 ms | 32.5% bf16 MFU | 124972 tok/s step 6485/19560 | loss 3.458214 (-0.74z)| norm 0.2771 (-0.26z)| lr 4.71e-04 | 4152.77 ms | 32.5% bf16 MFU | 125036 tok/s step 6486/19560 | loss 3.497173 (+0.24z)| norm 0.3017 (+1.13z)| lr 4.71e-04 | 4179.51 ms | 32.3% bf16 MFU | 125056 tok/s step 6487/19560 | loss 3.530253 (+1.08z)| norm 0.3240 (+2.34z)| lr 4.71e-04 | 4178.60 ms | 32.3% bf16 MFU | 125077 tok/s step 6488/19560 | loss 3.464726 (-0.57z)| norm 0.2851 (+0.16z)| lr 4.71e-04 | 4162.73 ms | 32.4% bf16 MFU | 125120 tok/s step 6489/19560 | loss 3.494452 (+0.20z)| norm 0.2785 (-0.22z)| lr 4.71e-04 | 4152.03 ms | 32.5% bf16 MFU | 125178 tok/s step 6490/19560 | loss 3.574728 (+2.26z)| norm 0.3212 (+2.14z)| lr 4.71e-04 | 4163.28 ms | 32.4% bf16 MFU | 125216 tok/s step 6491/19560 | loss 3.512682 (+0.65z)| norm 0.2922 (+0.54z)| lr 4.71e-04 | 4165.98 ms | 32.4% bf16 MFU | 125247 tok/s step 6492/19560 | loss 3.440760 (-1.21z)| norm 0.2772 (-0.29z)| lr 4.71e-04 | 4162.55 ms | 32.4% bf16 MFU | 125283 tok/s step 6493/19560 | loss 3.490236 (+0.09z)| norm 0.2830 (+0.04z)| lr 4.71e-04 | 4167.03 ms | 32.4% bf16 MFU | 125309 tok/s step 6494/19560 | loss 3.474226 (-0.33z)| norm 0.2762 (-0.33z)| lr 4.71e-04 | 4168.12 ms | 32.4% bf16 MFU | 
125333 tok/s step 6495/19560 | loss 3.486506 (+0.00z)| norm 0.2714 (-0.59z)| lr 4.71e-04 | 4179.40 ms | 32.3% bf16 MFU | 125339 tok/s step 6496/19560 | loss 3.594009 (+2.87z)| norm 0.2863 (+0.25z)| lr 4.71e-04 | 4162.17 ms | 32.4% bf16 MFU | 125370 tok/s step 6497/19560 | loss 3.405111 (-2.11z)| norm 0.2751 (-0.38z)| lr 4.71e-04 | 4163.70 ms | 32.4% bf16 MFU | 125398 tok/s step 6498/19560 | loss 3.434827 (-1.31z)| norm 0.2848 (+0.16z)| lr 4.71e-04 | 4154.99 ms | 32.5% bf16 MFU | 125437 tok/s step 6499/19560 | loss 3.461060 (-0.62z)| norm 0.2629 (-1.06z)| lr 4.71e-04 | 4156.30 ms | 32.5% bf16 MFU | 125472 tok/s step 6500/19560 | loss 3.536043 (+1.33z)| norm 0.2660 (-0.88z)| lr 4.71e-04 | 4168.93 ms | 32.4% bf16 MFU | 125487 tok/s val loss 3.486369 evaluating HellaSwag: 0/1256 evaluating HellaSwag: 10/1256 evaluating HellaSwag: 20/1256 evaluating HellaSwag: 30/1256 evaluating HellaSwag: 40/1256 evaluating HellaSwag: 50/1256 evaluating HellaSwag: 60/1256 evaluating HellaSwag: 70/1256 evaluating HellaSwag: 80/1256 evaluating HellaSwag: 90/1256 evaluating HellaSwag: 100/1256 evaluating HellaSwag: 110/1256 evaluating HellaSwag: 120/1256 evaluating HellaSwag: 130/1256 evaluating HellaSwag: 140/1256 evaluating HellaSwag: 150/1256 evaluating HellaSwag: 160/1256 evaluating HellaSwag: 170/1256 evaluating HellaSwag: 180/1256 evaluating HellaSwag: 190/1256 evaluating HellaSwag: 200/1256 evaluating HellaSwag: 210/1256 evaluating HellaSwag: 220/1256 evaluating HellaSwag: 230/1256 evaluating HellaSwag: 240/1256 evaluating HellaSwag: 250/1256 evaluating HellaSwag: 260/1256 evaluating HellaSwag: 270/1256 evaluating HellaSwag: 280/1256 evaluating HellaSwag: 290/1256 evaluating HellaSwag: 300/1256 evaluating HellaSwag: 310/1256 evaluating HellaSwag: 320/1256 evaluating HellaSwag: 330/1256 evaluating HellaSwag: 340/1256 evaluating HellaSwag: 350/1256 evaluating HellaSwag: 360/1256 evaluating HellaSwag: 370/1256 evaluating HellaSwag: 380/1256 evaluating HellaSwag: 390/1256 evaluating 
HellaSwag: 400/1256 evaluating HellaSwag: 410/1256 evaluating HellaSwag: 420/1256 evaluating HellaSwag: 430/1256 evaluating HellaSwag: 440/1256 evaluating HellaSwag: 450/1256 evaluating HellaSwag: 460/1256 evaluating HellaSwag: 470/1256 evaluating HellaSwag: 480/1256 evaluating HellaSwag: 490/1256 evaluating HellaSwag: 500/1256 evaluating HellaSwag: 510/1256 evaluating HellaSwag: 520/1256 evaluating HellaSwag: 530/1256 evaluating HellaSwag: 540/1256 evaluating HellaSwag: 550/1256 evaluating HellaSwag: 560/1256 evaluating HellaSwag: 570/1256 evaluating HellaSwag: 580/1256 evaluating HellaSwag: 590/1256 evaluating HellaSwag: 600/1256 evaluating HellaSwag: 610/1256 evaluating HellaSwag: 620/1256 evaluating HellaSwag: 630/1256 evaluating HellaSwag: 640/1256 evaluating HellaSwag: 650/1256 evaluating HellaSwag: 660/1256 evaluating HellaSwag: 670/1256 evaluating HellaSwag: 680/1256 evaluating HellaSwag: 690/1256 evaluating HellaSwag: 700/1256 evaluating HellaSwag: 710/1256 evaluating HellaSwag: 720/1256 evaluating HellaSwag: 730/1256 evaluating HellaSwag: 740/1256 evaluating HellaSwag: 750/1256 evaluating HellaSwag: 760/1256 evaluating HellaSwag: 770/1256 evaluating HellaSwag: 780/1256 evaluating HellaSwag: 790/1256 evaluating HellaSwag: 800/1256 evaluating HellaSwag: 810/1256 evaluating HellaSwag: 820/1256 evaluating HellaSwag: 830/1256 evaluating HellaSwag: 840/1256 evaluating HellaSwag: 850/1256 evaluating HellaSwag: 860/1256 evaluating HellaSwag: 870/1256 evaluating HellaSwag: 880/1256 evaluating HellaSwag: 890/1256 evaluating HellaSwag: 900/1256 evaluating HellaSwag: 910/1256 evaluating HellaSwag: 920/1256 evaluating HellaSwag: 930/1256 evaluating HellaSwag: 940/1256 evaluating HellaSwag: 950/1256 evaluating HellaSwag: 960/1256 evaluating HellaSwag: 970/1256 evaluating HellaSwag: 980/1256 evaluating HellaSwag: 990/1256 evaluating HellaSwag: 1000/1256 evaluating HellaSwag: 1010/1256 evaluating HellaSwag: 1020/1256 evaluating HellaSwag: 1030/1256 evaluating HellaSwag: 
1040/1256 evaluating HellaSwag: 1050/1256 evaluating HellaSwag: 1060/1256 evaluating HellaSwag: 1070/1256 evaluating HellaSwag: 1080/1256 evaluating HellaSwag: 1090/1256 evaluating HellaSwag: 1100/1256 evaluating HellaSwag: 1110/1256 evaluating HellaSwag: 1120/1256 evaluating HellaSwag: 1130/1256 evaluating HellaSwag: 1140/1256 evaluating HellaSwag: 1150/1256 evaluating HellaSwag: 1160/1256 evaluating HellaSwag: 1170/1256 evaluating HellaSwag: 1180/1256 evaluating HellaSwag: 1190/1256 evaluating HellaSwag: 1200/1256 evaluating HellaSwag: 1210/1256 evaluating HellaSwag: 1220/1256 evaluating HellaSwag: 1230/1256 evaluating HellaSwag: 1240/1256 evaluating HellaSwag: 1250/1256 HellaSwag: 2787/10042 = 0.277534 step 6501/19560 | loss 3.468102 (-0.43z)| norm 0.2866 (+0.28z)| lr 4.71e-04 | 4172.47 ms | 32.4% bf16 MFU | 125495 tok/s step 6502/19560 | loss 3.514781 (+0.81z)| norm 0.2905 (+0.49z)| lr 4.71e-04 | 4168.46 ms | 32.4% bf16 MFU | 125509 tok/s step 6503/19560 | loss 3.457070 (-0.73z)| norm 0.2732 (-0.49z)| lr 4.70e-04 | 4157.20 ms | 32.5% bf16 MFU | 125539 tok/s step 6504/19560 | loss 3.460732 (-0.63z)| norm 0.2721 (-0.57z)| lr 4.70e-04 | 4164.06 ms | 32.4% bf16 MFU | 125558 tok/s step 6505/19560 | loss 3.496294 (+0.33z)| norm 0.2635 (-1.07z)| lr 4.70e-04 | 4162.18 ms | 32.4% bf16 MFU | 125578 tok/s step 6506/19560 | loss 3.468129 (-0.43z)| norm 0.2923 (+0.58z)| lr 4.70e-04 | 4165.11 ms | 32.4% bf16 MFU | 125593 tok/s step 6507/19560 | loss 3.501253 (+0.48z)| norm 0.2792 (-0.20z)| lr 4.70e-04 | 4152.55 ms | 32.5% bf16 MFU | 125626 tok/s step 6508/19560 | loss 3.424190 (-1.62z)| norm 0.2588 (-1.38z)| lr 4.70e-04 | 4160.35 ms | 32.5% bf16 MFU | 125646 tok/s step 6509/19560 | loss 3.462085 (-0.56z)| norm 0.2607 (-1.25z)| lr 4.70e-04 | 4163.40 ms | 32.4% bf16 MFU | 125660 tok/s step 6510/19560 | loss 3.490256 (+0.23z)| norm 0.2791 (-0.19z)| lr 4.70e-04 | 4172.54 ms | 32.4% bf16 MFU | 125660 tok/s step 6511/19560 | loss 3.454765 (-0.75z)| norm 0.2760 (-0.38z)| lr 
4.70e-04 | 4150.92 ms | 32.5% bf16 MFU | 125692 tok/s step 6512/19560 | loss 3.438692 (-1.20z)| norm 0.2630 (-1.13z)| lr 4.70e-04 | 4156.62 ms | 32.5% bf16 MFU | 125714 tok/s step 6513/19560 | loss 3.524318 (+1.26z)| norm 0.2828 (+0.02z)| lr 4.70e-04 | 4171.49 ms | 32.4% bf16 MFU | 125712 tok/s step 6514/19560 | loss 3.462431 (-0.52z)| norm 0.2791 (-0.20z)| lr 4.70e-04 | 4168.03 ms | 32.4% bf16 MFU | 125716 tok/s step 6515/19560 | loss 3.423990 (-1.59z)| norm 0.2734 (-0.53z)| lr 4.70e-04 | 4173.48 ms | 32.4% bf16 MFU | 125712 tok/s step 6516/19560 | loss 3.431304 (-1.36z)| norm 0.3127 (+1.82z)| lr 4.70e-04 | 4162.57 ms | 32.4% bf16 MFU | 125724 tok/s step 6517/19560 | loss 3.541259 (+1.73z)| norm 0.2983 (+0.95z)| lr 4.70e-04 | 4173.21 ms | 32.4% bf16 MFU | 125719 tok/s step 6518/19560 | loss 3.402249 (-2.14z)| norm 0.2621 (-1.20z)| lr 4.70e-04 | 4166.58 ms | 32.4% bf16 MFU | 125725 tok/s step 6519/19560 | loss 3.500305 (+0.60z)| norm 0.2722 (-0.59z)| lr 4.70e-04 | 4176.81 ms | 32.3% bf16 MFU | 125715 tok/s step 6520/19560 | loss 3.447703 (-0.86z)| norm 0.2825 (+0.02z)| lr 4.70e-04 | 4162.67 ms | 32.4% bf16 MFU | 125726 tok/s step 6521/19560 | loss 3.523840 (+1.27z)| norm 0.2817 (-0.04z)| lr 4.70e-04 | 4152.24 ms | 32.5% bf16 MFU | 125753 tok/s step 6522/19560 | loss 3.446818 (-0.87z)| norm 0.2920 (+0.56z)| lr 4.70e-04 | 4169.23 ms | 32.4% bf16 MFU | 125753 tok/s step 6523/19560 | loss 3.386073 (-2.49z)| norm 0.2727 (-0.59z)| lr 4.70e-04 | 4159.51 ms | 32.5% bf16 MFU | 125768 tok/s step 6524/19560 | loss 3.477232 (+0.01z)| norm 0.2887 (+0.37z)| lr 4.70e-04 | 4160.14 ms | 32.5% bf16 MFU | 125781 tok/s step 6525/19560 | loss 3.510644 (+0.92z)| norm 0.2866 (+0.23z)| lr 4.70e-04 | 4167.39 ms | 32.4% bf16 MFU | 125782 tok/s step 6526/19560 | loss 3.535434 (+1.57z)| norm 0.3208 (+2.24z)| lr 4.70e-04 | 4159.55 ms | 32.5% bf16 MFU | 125795 tok/s step 6527/19560 | loss 3.533069 (+1.48z)| norm 0.2966 (+0.79z)| lr 4.69e-04 | 4160.03 ms | 32.5% bf16 MFU | 125807 tok/s step 
6528/19560 | loss 3.472574 (-0.15z)| norm 0.2755 (-0.45z)| lr 4.69e-04 | 4171.54 ms | 32.4% bf16 MFU | 125801 tok/s step 6529/19560 | loss 3.466431 (-0.31z)| norm 0.2766 (-0.40z)| lr 4.69e-04 | 4163.80 ms | 32.4% bf16 MFU | 125807 tok/s step 6530/19560 | loss 3.456884 (-0.57z)| norm 0.2750 (-0.51z)| lr 4.69e-04 | 4253.98 ms | 31.7% bf16 MFU | 125679 tok/s step 6531/19560 | loss 3.443309 (-0.92z)| norm 0.2946 (+0.66z)| lr 4.69e-04 | 4190.86 ms | 32.2% bf16 MFU | 125650 tok/s step 6532/19560 | loss 3.493099 (+0.41z)| norm 0.5200 (+8.87z)| lr 4.69e-04 | 4165.05 ms | 32.4% bf16 MFU | 125661 tok/s step 6533/19560 | loss 3.469398 (-0.23z)| norm 0.3263 (+1.53z)| lr 4.69e-04 | 4165.30 ms | 32.4% bf16 MFU | 125672 tok/s step 6534/19560 | loss 3.547941 (+1.85z)| norm 0.3358 (+1.84z)| lr 4.69e-04 | 4165.67 ms | 32.4% bf16 MFU | 125681 tok/s step 6535/19560 | loss 3.467360 (-0.30z)| norm 0.3187 (+1.19z)| lr 4.69e-04 | 4177.38 ms | 32.3% bf16 MFU | 125672 tok/s step 6536/19560 | loss 3.499557 (+0.58z)| norm 0.2957 (+0.35z)| lr 4.69e-04 | 4176.22 ms | 32.3% bf16 MFU | 125666 tok/s step 6537/19560 | loss 3.497493 (+0.51z)| norm 0.3099 (+0.86z)| lr 4.69e-04 | 4168.37 ms | 32.4% bf16 MFU | 125671 tok/s step 6538/19560 | loss 3.483755 (+0.14z)| norm 0.2955 (+0.34z)| lr 4.69e-04 | 4172.00 ms | 32.4% bf16 MFU | 125671 tok/s step 6539/19560 | loss 3.467874 (-0.30z)| norm 0.3146 (+1.02z)| lr 4.69e-04 | 4163.70 ms | 32.4% bf16 MFU | 125684 tok/s step 6540/19560 | loss 3.478085 (-0.01z)| norm 0.2693 (-0.63z)| lr 4.69e-04 | 4159.74 ms | 32.5% bf16 MFU | 125701 tok/s step 6541/19560 | loss 3.465551 (-0.35z)| norm 0.2840 (-0.10z)| lr 4.69e-04 | 4165.42 ms | 32.4% bf16 MFU | 125710 tok/s step 6542/19560 | loss 3.481402 (+0.09z)| norm 0.3102 (+0.85z)| lr 4.69e-04 | 4228.36 ms | 31.9% bf16 MFU | 125624 tok/s step 6543/19560 | loss 3.441550 (-0.99z)| norm 0.2782 (-0.32z)| lr 4.69e-04 | 4155.81 ms | 32.5% bf16 MFU | 125650 tok/s step 6544/19560 | loss 3.494284 (+0.46z)| norm 0.2961 (+0.33z)| lr 
4.69e-04 | 4163.33 ms | 32.4% bf16 MFU | 125664 tok/s step 6545/19560 | loss 3.452652 (-0.69z)| norm 0.2830 (-0.15z)| lr 4.69e-04 | 4159.70 ms | 32.5% bf16 MFU | 125683 tok/s step 6546/19560 | loss 3.448663 (-0.80z)| norm 0.3295 (+1.53z)| lr 4.69e-04 | 4168.23 ms | 32.4% bf16 MFU | 125688 tok/s step 6547/19560 | loss 3.492007 (+0.39z)| norm 0.2811 (-0.24z)| lr 4.69e-04 | 4165.55 ms | 32.4% bf16 MFU | 125697 tok/s step 6548/19560 | loss 3.523978 (+1.25z)| norm 0.2641 (-0.87z)| lr 4.69e-04 | 4171.75 ms | 32.4% bf16 MFU | 125696 tok/s step 6549/19560 | loss 3.440850 (-1.02z)| norm 0.2949 (+0.25z)| lr 4.69e-04 | 4173.77 ms | 32.3% bf16 MFU | 125692 tok/s step 6550/19560 | loss 3.467651 (-0.28z)| norm 0.2763 (-0.45z)| lr 4.69e-04 | 4159.28 ms | 32.5% bf16 MFU | 125710 tok/s step 6551/19560 | loss 3.441834 (-0.97z)| norm 0.2697 (-0.70z)| lr 4.68e-04 | 4181.65 ms | 32.3% bf16 MFU | 125693 tok/s step 6552/19560 | loss 3.534351 (+1.56z)| norm 0.2805 (-0.29z)| lr 4.68e-04 | 4163.39 ms | 32.4% bf16 MFU | 125705 tok/s step 6553/19560 | loss 3.564187 (+2.31z)| norm 0.2783 (-0.37z)| lr 4.68e-04 | 4159.06 ms | 32.5% bf16 MFU | 125723 tok/s step 6554/19560 | loss 3.506318 (+0.76z)| norm 0.2692 (-0.70z)| lr 4.68e-04 | 4161.98 ms | 32.4% bf16 MFU | 125735 tok/s step 6555/19560 | loss 3.536060 (+1.53z)| norm 0.2756 (-0.47z)| lr 4.68e-04 | 4170.76 ms | 32.4% bf16 MFU | 125734 tok/s step 6556/19560 | loss 3.451311 (-0.72z)| norm 0.2591 (-1.06z)| lr 4.68e-04 | 4170.41 ms | 32.4% bf16 MFU | 125733 tok/s step 6557/19560 | loss 3.443096 (-0.92z)| norm 0.2819 (-0.22z)| lr 4.68e-04 | 4159.19 ms | 32.5% bf16 MFU | 125749 tok/s step 6558/19560 | loss 3.493625 (+0.41z)| norm 0.2605 (-1.01z)| lr 4.68e-04 | 4171.92 ms | 32.4% bf16 MFU | 125745 tok/s step 6559/19560 | loss 3.526839 (+1.27z)| norm 0.2711 (-0.61z)| lr 4.68e-04 | 4160.65 ms | 32.5% bf16 MFU | 125758 tok/s step 6560/19560 | loss 3.447307 (-0.81z)| norm 0.2837 (-0.13z)| lr 4.68e-04 | 4159.39 ms | 32.5% bf16 MFU | 125773 tok/s step 
6561/19560 | loss 3.443266 (-0.91z)| norm 0.2660 (-0.78z)| lr 4.68e-04 | 4163.93 ms | 32.4% bf16 MFU | 125780 tok/s step 6562/19560 | loss 3.388438 (-2.33z)| norm 0.2750 (-0.44z)| lr 4.68e-04 | 4156.96 ms | 32.5% bf16 MFU | 125797 tok/s step 6563/19560 | loss 3.453672 (-0.62z)| norm 0.2940 (+0.27z)| lr 4.68e-04 | 4168.98 ms | 32.4% bf16 MFU | 125795 tok/s step 6564/19560 | loss 3.482603 (+0.13z)| norm 0.3271 (+1.47z)| lr 4.68e-04 | 4158.16 ms | 32.5% bf16 MFU | 125810 tok/s step 6565/19560 | loss 3.460570 (-0.45z)| norm 0.2853 (-0.08z)| lr 4.68e-04 | 4174.54 ms | 32.3% bf16 MFU | 125799 tok/s step 6566/19560 | loss 3.455719 (-0.57z)| norm 0.3045 (+0.62z)| lr 4.68e-04 | 4162.22 ms | 32.4% bf16 MFU | 125807 tok/s step 6567/19560 | loss 3.487287 (+0.26z)| norm 0.3101 (+0.82z)| lr 4.68e-04 | 4171.93 ms | 32.4% bf16 MFU | 125800 tok/s step 6568/19560 | loss 3.452609 (-0.64z)| norm 0.2958 (+0.29z)| lr 4.68e-04 | 4161.40 ms | 32.4% bf16 MFU | 125810 tok/s step 6569/19560 | loss 3.502325 (+0.65z)| norm 0.2752 (-0.48z)| lr 4.68e-04 | 4155.88 ms | 32.5% bf16 MFU | 125827 tok/s step 6570/19560 | loss 3.500551 (+0.60z)| norm 0.3148 (+0.98z)| lr 4.68e-04 | 4751.17 ms | 28.4% bf16 MFU | 125053 tok/s step 6571/19560 | loss 3.458442 (-0.49z)| norm 0.2783 (-0.38z)| lr 4.68e-04 | 4176.58 ms | 32.3% bf16 MFU | 125077 tok/s step 6572/19560 | loss 3.491538 (+0.36z)| norm 0.2822 (-0.24z)| lr 4.68e-04 | 4170.74 ms | 32.4% bf16 MFU | 125108 tok/s step 6573/19560 | loss 3.463846 (-0.35z)| norm 0.3008 (+0.44z)| lr 4.68e-04 | 4154.98 ms | 32.5% bf16 MFU | 125162 tok/s step 6574/19560 | loss 3.410639 (-1.71z)| norm 0.2750 (-0.52z)| lr 4.68e-04 | 4150.64 ms | 32.5% bf16 MFU | 125220 tok/s step 6575/19560 | loss 3.458845 (-0.46z)| norm 0.2651 (-0.88z)| lr 4.67e-04 | 4160.36 ms | 32.5% bf16 MFU | 125260 tok/s step 6576/19560 | loss 3.422121 (-1.40z)| norm 0.2855 (-0.12z)| lr 4.67e-04 | 4158.13 ms | 32.5% bf16 MFU | 125301 tok/s step 6577/19560 | loss 3.429459 (-1.20z)| norm 0.2594 (-1.08z)| lr 
4.67e-04 | 4171.39 ms | 32.4% bf16 MFU | 125320 tok/s step 6578/19560 | loss 3.448343 (-0.71z)| norm 0.2615 (-0.99z)| lr 4.67e-04 | 4160.09 ms | 32.5% bf16 MFU | 125356 tok/s step 6579/19560 | loss 3.463191 (-0.33z)| norm 0.2812 (-0.26z)| lr 4.67e-04 | 4166.90 ms | 32.4% bf16 MFU | 125379 tok/s step 6580/19560 | loss 3.534892 (+1.50z)| norm 0.2791 (-0.33z)| lr 4.67e-04 | 4175.27 ms | 32.3% bf16 MFU | 125389 tok/s step 6581/19560 | loss 3.437478 (-0.98z)| norm 0.2621 (-0.95z)| lr 4.67e-04 | 4168.47 ms | 32.4% bf16 MFU | 125408 tok/s step 6582/19560 | loss 3.579234 (+2.55z)| norm 0.2623 (-0.93z)| lr 4.67e-04 | 4170.37 ms | 32.4% bf16 MFU | 125423 tok/s step 6583/19560 | loss 3.446322 (-0.76z)| norm 0.2637 (-0.86z)| lr 4.67e-04 | 4172.69 ms | 32.4% bf16 MFU | 125435 tok/s step 6584/19560 | loss 3.439704 (-0.93z)| norm 0.2544 (-1.20z)| lr 4.67e-04 | 4169.23 ms | 32.4% bf16 MFU | 125450 tok/s step 6585/19560 | loss 3.494472 (+0.43z)| norm 0.2571 (-1.08z)| lr 4.67e-04 | 4159.13 ms | 32.5% bf16 MFU | 125481 tok/s step 6586/19560 | loss 3.437434 (-0.98z)| norm 0.2630 (-0.86z)| lr 4.67e-04 | 4164.69 ms | 32.4% bf16 MFU | 125501 tok/s step 6587/19560 | loss 3.476763 (+0.00z)| norm 0.2688 (-0.65z)| lr 4.67e-04 | 4167.24 ms | 32.4% bf16 MFU | 125517 tok/s step 6588/19560 | loss 3.473469 (-0.08z)| norm 0.2984 (+0.44z)| lr 4.67e-04 | 4159.21 ms | 32.5% bf16 MFU | 125544 tok/s step 6589/19560 | loss 3.390707 (-2.09z)| norm 0.2701 (-0.59z)| lr 4.67e-04 | 4167.14 ms | 32.4% bf16 MFU | 125557 tok/s step 6590/19560 | loss 3.467800 (-0.20z)| norm 0.2535 (-1.19z)| lr 4.67e-04 | 4218.01 ms | 32.0% bf16 MFU | 125494 tok/s step 6591/19560 | loss 3.508808 (+0.80z)| norm 0.2916 (+0.21z)| lr 4.67e-04 | 4285.60 ms | 31.5% bf16 MFU | 125336 tok/s step 6592/19560 | loss 3.482218 (+0.18z)| norm 0.2866 (+0.03z)| lr 4.67e-04 | 4365.11 ms | 30.9% bf16 MFU | 125075 tok/s step 6593/19560 | loss 3.487797 (+0.31z)| norm 0.2385 (-1.71z)| lr 4.67e-04 | 4262.48 ms | 31.7% bf16 MFU | 124971 tok/s step 
6594/19560 | loss 3.492147 (+0.42z)| norm 0.2979 (+0.45z)| lr 4.67e-04 | 4213.13 ms | 32.0% bf16 MFU | 124945 tok/s step 6595/19560 | loss 3.455913 (-0.49z)| norm 0.2773 (-0.30z)| lr 4.67e-04 | 4230.25 ms | 31.9% bf16 MFU | 124894 tok/s step 6596/19560 | loss 3.474537 (-0.02z)| norm 0.2796 (-0.21z)| lr 4.67e-04 | 4263.35 ms | 31.7% bf16 MFU | 124798 tok/s step 6597/19560 | loss 3.603573 (+3.10z)| norm 0.2797 (-0.21z)| lr 4.67e-04 | 4248.57 ms | 31.8% bf16 MFU | 124729 tok/s step 6598/19560 | loss 3.500072 (+0.57z)| norm 0.3027 (+0.63z)| lr 4.67e-04 | 4157.57 ms | 32.5% bf16 MFU | 124797 tok/s step 6599/19560 | loss 3.464145 (-0.30z)| norm 0.2843 (-0.04z)| lr 4.66e-04 | 4198.68 ms | 32.2% bf16 MFU | 124801 tok/s step 6600/19560 | loss 3.461961 (-0.35z)| norm 0.2699 (-0.55z)| lr 4.66e-04 | 4173.66 ms | 32.3% bf16 MFU | 124842 tok/s step 6601/19560 | loss 3.479719 (+0.07z)| norm 0.2881 (+0.11z)| lr 4.66e-04 | 4179.05 ms | 32.3% bf16 MFU | 124873 tok/s step 6602/19560 | loss 3.467211 (-0.23z)| norm 0.3136 (+1.03z)| lr 4.66e-04 | 4172.44 ms | 32.4% bf16 MFU | 124912 tok/s step 6603/19560 | loss 3.534246 (+1.41z)| norm 0.2694 (-0.57z)| lr 4.66e-04 | 4171.60 ms | 32.4% bf16 MFU | 124950 tok/s step 6604/19560 | loss 3.507833 (+0.75z)| norm 0.2631 (-0.79z)| lr 4.66e-04 | 4171.35 ms | 32.4% bf16 MFU | 124987 tok/s step 6605/19560 | loss 3.595679 (+2.80z)| norm 0.2844 (-0.02z)| lr 4.66e-04 | 4166.18 ms | 32.4% bf16 MFU | 125030 tok/s step 6606/19560 | loss 3.441444 (-0.87z)| norm 0.2721 (-0.47z)| lr 4.66e-04 | 4188.71 ms | 32.2% bf16 MFU | 125037 tok/s step 6607/19560 | loss 3.500797 (+0.53z)| norm 0.2787 (-0.24z)| lr 4.66e-04 | 4210.97 ms | 32.1% bf16 MFU | 125010 tok/s step 6608/19560 | loss 3.531130 (+1.25z)| norm 0.2648 (-0.74z)| lr 4.66e-04 | 4182.56 ms | 32.3% bf16 MFU | 125027 tok/s step 6609/19560 | loss 3.512606 (+0.81z)| norm 0.2734 (-0.42z)| lr 4.66e-04 | 4169.81 ms | 32.4% bf16 MFU | 125063 tok/s step 6610/19560 | loss 3.480803 (+0.04z)| norm 0.2922 (+0.26z)| lr 
4.66e-04 | 4267.53 ms | 31.6% bf16 MFU | 124952 tok/s step 6611/19560 | loss 3.444999 (-0.82z)| norm 0.2960 (+0.41z)| lr 4.66e-04 | 4154.68 ms | 32.5% bf16 MFU | 125014 tok/s step 6612/19560 | loss 3.413836 (-1.55z)| norm 0.2970 (+0.44z)| lr 4.66e-04 | 4153.75 ms | 32.5% bf16 MFU | 125075 tok/s step 6613/19560 | loss 3.513921 (+0.84z)| norm 0.2693 (-0.56z)| lr 4.66e-04 | 4186.99 ms | 32.2% bf16 MFU | 125082 tok/s step 6614/19560 | loss 3.516517 (+0.90z)| norm 0.2837 (-0.03z)| lr 4.66e-04 | 4195.69 ms | 32.2% bf16 MFU | 125076 tok/s step 6615/19560 | loss 3.480568 (+0.05z)| norm 0.2796 (-0.17z)| lr 4.66e-04 | 4162.00 ms | 32.4% bf16 MFU | 125120 tok/s step 6616/19560 | loss 3.509740 (+0.74z)| norm 0.2791 (-0.19z)| lr 4.66e-04 | 4156.27 ms | 32.5% bf16 MFU | 125171 tok/s step 6617/19560 | loss 3.500140 (+0.51z)| norm 0.3074 (+0.84z)| lr 4.66e-04 | 4323.41 ms | 31.2% bf16 MFU | 124976 tok/s step 6618/19560 | loss 3.458338 (-0.48z)| norm 0.2662 (-0.66z)| lr 4.66e-04 | 4188.01 ms | 32.2% bf16 MFU | 124987 tok/s step 6619/19560 | loss 3.537148 (+1.43z)| norm 0.3073 (+0.85z)| lr 4.66e-04 | 4370.77 ms | 30.9% bf16 MFU | 124735 tok/s step 6620/19560 | loss 3.484045 (+0.13z)| norm 0.2540 (-1.10z)| lr 4.66e-04 | 4171.23 ms | 32.4% bf16 MFU | 124783 tok/s step 6621/19560 | loss 3.485836 (+0.18z)| norm 0.2741 (-0.36z)| lr 4.66e-04 | 4188.33 ms | 32.2% bf16 MFU | 124803 tok/s step 6622/19560 | loss 3.475976 (-0.06z)| norm 0.2847 (+0.03z)| lr 4.66e-04 | 4168.44 ms | 32.4% bf16 MFU | 124851 tok/s step 6623/19560 | loss 3.437057 (-1.00z)| norm 0.2852 (+0.04z)| lr 4.65e-04 | 4186.77 ms | 32.2% bf16 MFU | 124870 tok/s step 6624/19560 | loss 3.507845 (+0.76z)| norm 0.2778 (-0.23z)| lr 4.65e-04 | 4165.15 ms | 32.4% bf16 MFU | 124920 tok/s step 6625/19560 | loss 3.446590 (-0.79z)| norm 0.2696 (-0.53z)| lr 4.65e-04 | 4214.14 ms | 32.0% bf16 MFU | 124895 tok/s step 6626/19560 | loss 3.513069 (+0.88z)| norm 0.2674 (-0.60z)| lr 4.65e-04 | 4154.88 ms | 32.5% bf16 MFU | 124959 tok/s step 
6627/19560 | loss 3.548175 (+1.73z)| norm 0.2730 (-0.40z)| lr 4.65e-04 | 4173.00 ms | 32.4% bf16 MFU | 124993 tok/s step 6628/19560 | loss 3.498554 (+0.50z)| norm 0.2562 (-1.01z)| lr 4.65e-04 | 4180.74 ms | 32.3% bf16 MFU | 125014 tok/s step 6629/19560 | loss 3.513607 (+0.87z)| norm 0.2725 (-0.41z)| lr 4.65e-04 | 4174.62 ms | 32.3% bf16 MFU | 125043 tok/s step 6630/19560 | loss 3.524121 (+1.13z)| norm 0.2556 (-1.01z)| lr 4.65e-04 | 4183.24 ms | 32.3% bf16 MFU | 125057 tok/s step 6631/19560 | loss 3.466839 (-0.31z)| norm 0.2819 (-0.06z)| lr 4.65e-04 | 4163.49 ms | 32.4% bf16 MFU | 125101 tok/s step 6632/19560 | loss 3.591285 (+2.71z)| norm 0.2854 (+0.06z)| lr 4.65e-04 | 4161.82 ms | 32.4% bf16 MFU | 125144 tok/s step 6633/19560 | loss 3.565902 (+2.04z)| norm 0.3120 (+1.01z)| lr 4.65e-04 | 4176.32 ms | 32.3% bf16 MFU | 125164 tok/s step 6634/19560 | loss 3.526520 (+1.08z)| norm 0.2842 (+0.01z)| lr 4.65e-04 | 4163.11 ms | 32.4% bf16 MFU | 125203 tok/s step 6635/19560 | loss 3.523641 (+1.01z)| norm 0.2814 (-0.09z)| lr 4.65e-04 | 4170.46 ms | 32.4% bf16 MFU | 125228 tok/s step 6636/19560 | loss 3.574509 (+2.17z)| norm 0.3646 (+2.81z)| lr 4.65e-04 | 4161.82 ms | 32.4% bf16 MFU | 125266 tok/s step 6637/19560 | loss 3.488340 (+0.13z)| norm 0.2952 (+0.36z)| lr 4.65e-04 | 4193.08 ms | 32.2% bf16 MFU | 125254 tok/s step 6638/19560 | loss 3.456791 (-0.61z)| norm 0.2970 (+0.42z)| lr 4.65e-04 | 4198.88 ms | 32.2% bf16 MFU | 125235 tok/s step 6639/19560 | loss 3.424781 (-1.35z)| norm 0.2790 (-0.22z)| lr 4.65e-04 | 5236.35 ms | 25.8% bf16 MFU | 123979 tok/s step 6640/19560 | loss 3.545798 (+1.46z)| norm 0.2871 (+0.06z)| lr 4.65e-04 | 4174.76 ms | 32.3% bf16 MFU | 124059 tok/s step 6641/19560 | loss 3.516562 (+0.78z)| norm 0.3192 (+1.18z)| lr 4.65e-04 | 4159.49 ms | 32.5% bf16 MFU | 124159 tok/s step 6642/19560 | loss 3.519239 (+0.83z)| norm 0.2799 (-0.20z)| lr 4.65e-04 | 5513.75 ms | 24.5% bf16 MFU | 122705 tok/s step 6643/19560 | loss 3.500709 (+0.39z)| norm 0.2868 (+0.04z)| lr 
4.65e-04 | 4149.55 ms | 32.5% bf16 MFU | 122887 tok/s step 6644/19560 | loss 3.535056 (+1.18z)| norm 0.2641 (-0.75z)| lr 4.65e-04 | 4148.38 ms | 32.5% bf16 MFU | 123062 tok/s step 6645/19560 | loss 3.507239 (+0.53z)| norm 0.2909 (+0.19z)| lr 4.65e-04 | 4175.82 ms | 32.3% bf16 MFU | 123187 tok/s step 6646/19560 | loss 3.441275 (-1.05z)| norm 0.2701 (-0.54z)| lr 4.65e-04 | 4167.24 ms | 32.4% bf16 MFU | 123318 tok/s step 6647/19560 | loss 3.628075 (+3.25z)| norm 0.2824 (-0.11z)| lr 4.64e-04 | 4160.21 ms | 32.5% bf16 MFU | 123453 tok/s step 6648/19560 | loss 3.520776 (+0.78z)| norm 0.2784 (-0.25z)| lr 4.64e-04 | 4195.94 ms | 32.2% bf16 MFU | 123528 tok/s step 6649/19560 | loss 3.518802 (+0.74z)| norm 0.2747 (-0.38z)| lr 4.64e-04 | 4172.40 ms | 32.4% bf16 MFU | 123635 tok/s step 6650/19560 | loss 3.492593 (+0.13z)| norm 0.3117 (+0.92z)| lr 4.64e-04 | 4161.07 ms | 32.4% bf16 MFU | 123753 tok/s step 6651/19560 | loss 3.508745 (+0.49z)| norm 0.2554 (-1.06z)| lr 4.64e-04 | 4168.15 ms | 32.4% bf16 MFU | 123854 tok/s step 6652/19560 | loss 3.516689 (+0.67z)| norm 0.2936 (+0.28z)| lr 4.64e-04 | 4170.95 ms | 32.4% bf16 MFU | 123947 tok/s step 6653/19560 | loss 3.478147 (-0.23z)| norm 0.2884 (+0.10z)| lr 4.64e-04 | 4171.33 ms | 32.4% bf16 MFU | 124034 tok/s step 6654/19560 | loss 3.525073 (+0.88z)| norm 0.2773 (-0.28z)| lr 4.64e-04 | 4176.25 ms | 32.3% bf16 MFU | 124109 tok/s step 6655/19560 | loss 3.483628 (-0.09z)| norm 0.3144 (+1.02z)| lr 4.64e-04 | 4171.21 ms | 32.4% bf16 MFU | 124188 tok/s step 6656/19560 | loss 3.460897 (-0.63z)| norm 0.3290 (+1.51z)| lr 4.64e-04 | 4171.06 ms | 32.4% bf16 MFU | 124264 tok/s step 6657/19560 | loss 3.453944 (-0.79z)| norm 0.2723 (-0.47z)| lr 4.64e-04 | 4180.36 ms | 32.3% bf16 MFU | 124321 tok/s step 6658/19560 | loss 3.513901 (+0.62z)| norm 0.2950 (+0.32z)| lr 4.64e-04 | 4168.24 ms | 32.4% bf16 MFU | 124394 tok/s step 6659/19560 | loss 3.475950 (-0.29z)| norm 0.2712 (-0.51z)| lr 4.64e-04 | 4192.68 ms | 32.2% bf16 MFU | 124427 tok/s step 
6660/19560 | loss 3.532728 (+1.05z)| norm 0.2693 (-0.73z)| lr 4.64e-04 | 4168.69 ms | 32.4% bf16 MFU | 124494 tok/s step 6661/19560 | loss 3.500088 (+0.27z)| norm 0.3017 (+0.93z)| lr 4.64e-04 | 4181.82 ms | 32.3% bf16 MFU | 124538 tok/s step 6662/19560 | loss 3.539243 (+1.20z)| norm 0.2533 (-1.55z)| lr 4.64e-04 | 4170.94 ms | 32.4% bf16 MFU | 124596 tok/s step 6663/19560 | loss 3.474749 (-0.33z)| norm 0.2833 (+0.04z)| lr 4.64e-04 | 4179.93 ms | 32.3% bf16 MFU | 124638 tok/s step 6664/19560 | loss 3.492440 (+0.09z)| norm 0.2732 (-0.49z)| lr 4.64e-04 | 4190.82 ms | 32.2% bf16 MFU | 124661 tok/s step 6665/19560 | loss 3.578426 (+2.08z)| norm 0.2727 (-0.51z)| lr 4.64e-04 | 4162.05 ms | 32.4% bf16 MFU | 124726 tok/s step 6666/19560 | loss 3.481777 (-0.17z)| norm 0.3255 (+2.26z)| lr 4.64e-04 | 4195.08 ms | 32.2% bf16 MFU | 124739 tok/s step 6667/19560 | loss 3.557817 (+1.57z)| norm 0.3115 (+1.53z)| lr 4.64e-04 | 4168.32 ms | 32.4% bf16 MFU | 124791 tok/s step 6668/19560 | loss 3.503161 (+0.30z)| norm 0.2955 (+0.68z)| lr 4.64e-04 | 4165.61 ms | 32.4% bf16 MFU | 124844 tok/s step 6669/19560 | loss 3.534184 (+1.00z)| norm 0.2706 (-0.62z)| lr 4.64e-04 | 4163.80 ms | 32.4% bf16 MFU | 124898 tok/s step 6670/19560 | loss 3.468687 (-0.50z)| norm 0.3021 (+1.04z)| lr 4.64e-04 | 4163.14 ms | 32.4% bf16 MFU | 124950 tok/s step 6671/19560 | loss 3.474508 (-0.37z)| norm 0.2918 (+0.49z)| lr 4.63e-04 | 4176.08 ms | 32.3% bf16 MFU | 124980 tok/s step 6672/19560 | loss 3.470821 (-0.45z)| norm 0.3135 (+1.61z)| lr 4.63e-04 | 4179.38 ms | 32.3% bf16 MFU | 125003 tok/s step 6673/19560 | loss 3.437920 (-1.21z)| norm 0.2861 (+0.18z)| lr 4.63e-04 | 4175.38 ms | 32.3% bf16 MFU | 125031 tok/s step 6674/19560 | loss 3.436687 (-1.23z)| norm 0.2717 (-0.56z)| lr 4.63e-04 | 4168.45 ms | 32.4% bf16 MFU | 125068 tok/s step 6675/19560 | loss 3.501951 (+0.27z)| norm 0.2743 (-0.42z)| lr 4.63e-04 | 4166.78 ms | 32.4% bf16 MFU | 125106 tok/s step 6676/19560 | loss 3.469632 (-0.47z)| norm 0.2864 (+0.22z)| lr 
4.63e-04 | 4175.68 ms | 32.3% bf16 MFU | 125129 tok/s step 6677/19560 | loss 3.526938 (+0.84z)| norm 0.2426 (-2.08z)| lr 4.63e-04 | 4162.25 ms | 32.4% bf16 MFU | 125171 tok/s step 6678/19560 | loss 3.522131 (+0.71z)| norm 0.2991 (+0.90z)| lr 4.63e-04 | 4161.03 ms | 32.4% bf16 MFU | 125212 tok/s step 6679/19560 | loss 3.463411 (-0.64z)| norm 0.3032 (+1.10z)| lr 4.63e-04 | 4189.91 ms | 32.2% bf16 MFU | 125208 tok/s step 6680/19560 | loss 3.451087 (-0.91z)| norm 0.2707 (-0.61z)| lr 4.63e-04 | 4172.00 ms | 32.4% bf16 MFU | 125231 tok/s step 6681/19560 | loss 3.430762 (-1.36z)| norm 0.2923 (+0.52z)| lr 4.63e-04 | 4193.52 ms | 32.2% bf16 MFU | 125221 tok/s step 6682/19560 | loss 3.457398 (-0.74z)| norm 0.2804 (-0.11z)| lr 4.63e-04 | 4172.72 ms | 32.4% bf16 MFU | 125242 tok/s step 6683/19560 | loss 3.508380 (+0.45z)| norm 0.2657 (-0.88z)| lr 4.63e-04 | 4164.58 ms | 32.4% bf16 MFU | 125274 tok/s step 6684/19560 | loss 3.484176 (-0.12z)| norm 0.2944 (+0.62z)| lr 4.63e-04 | 4169.42 ms | 32.4% bf16 MFU | 125298 tok/s step 6685/19560 | loss 3.484423 (-0.12z)| norm 0.2681 (-0.76z)| lr 4.63e-04 | 4178.13 ms | 32.3% bf16 MFU | 125307 tok/s step 6686/19560 | loss 3.540052 (+1.17z)| norm 0.2961 (+0.69z)| lr 4.63e-04 | 4153.38 ms | 32.5% bf16 MFU | 125353 tok/s step 6687/19560 | loss 3.541644 (+1.20z)| norm 0.2588 (-1.25z)| lr 4.63e-04 | 4171.36 ms | 32.4% bf16 MFU | 125370 tok/s step 6688/19560 | loss 3.542181 (+1.19z)| norm 0.2709 (-0.61z)| lr 4.63e-04 | 4168.52 ms | 32.4% bf16 MFU | 125390 tok/s step 6689/19560 | loss 3.507260 (+0.37z)| norm 0.2826 (-0.01z)| lr 4.63e-04 | 4162.50 ms | 32.4% bf16 MFU | 125419 tok/s step 6690/19560 | loss 3.502343 (+0.24z)| norm 0.2524 (-1.57z)| lr 4.63e-04 | 4173.43 ms | 32.4% bf16 MFU | 125429 tok/s step 6691/19560 | loss 3.466156 (-0.62z)| norm 0.2740 (-0.44z)| lr 4.63e-04 | 4164.31 ms | 32.4% bf16 MFU | 125452 tok/s step 6692/19560 | loss 3.510744 (+0.44z)| norm 0.2660 (-0.85z)| lr 4.63e-04 | 4184.64 ms | 32.3% bf16 MFU | 125444 tok/s step 
6693/19560 | loss 3.506746 (+0.33z)| norm 0.3039 (+1.15z)| lr 4.63e-04 | 4215.60 ms | 32.0% bf16 MFU | 125390 tok/s step 6694/19560 | loss 3.469420 (-0.56z)| norm 0.3800 (+4.69z)| lr 4.63e-04 | 4159.62 ms | 32.5% bf16 MFU | 125423 tok/s step 6695/19560 | loss 3.476429 (-0.39z)| norm 0.3284 (+2.17z)| lr 4.62e-04 | 4163.85 ms | 32.4% bf16 MFU | 125448 tok/s step 6696/19560 | loss 3.500616 (+0.18z)| norm 0.3010 (+0.86z)| lr 4.62e-04 | 4166.03 ms | 32.4% bf16 MFU | 125468 tok/s step 6697/19560 | loss 3.528889 (+0.85z)| norm 0.2996 (+0.79z)| lr 4.62e-04 | 4158.46 ms | 32.5% bf16 MFU | 125498 tok/s step 6698/19560 | loss 3.634876 (+3.22z)| norm 0.2833 (+0.02z)| lr 4.62e-04 | 4187.58 ms | 32.2% bf16 MFU | 125483 tok/s step 6699/19560 | loss 3.494259 (-0.01z)| norm 0.2966 (+0.65z)| lr 4.62e-04 | 4184.73 ms | 32.3% bf16 MFU | 125473 tok/s step 6700/19560 | loss 3.476736 (-0.41z)| norm 0.2928 (+0.47z)| lr 4.62e-04 | 4180.77 ms | 32.3% bf16 MFU | 125470 tok/s step 6701/19560 | loss 3.547566 (+1.20z)| norm 0.2827 (-0.01z)| lr 4.62e-04 | 4156.29 ms | 32.5% bf16 MFU | 125504 tok/s step 6702/19560 | loss 3.507814 (+0.27z)| norm 0.2750 (-0.38z)| lr 4.62e-04 | 4157.90 ms | 32.5% bf16 MFU | 125533 tok/s step 6703/19560 | loss 3.483044 (-0.31z)| norm 0.2834 (+0.01z)| lr 4.62e-04 | 4181.94 ms | 32.3% bf16 MFU | 125525 tok/s step 6704/19560 | loss 3.438538 (-1.35z)| norm 0.2559 (-1.29z)| lr 4.62e-04 | 4194.99 ms | 32.2% bf16 MFU | 125498 tok/s step 6705/19560 | loss 3.475420 (-0.50z)| norm 0.2771 (-0.28z)| lr 4.62e-04 | 4165.20 ms | 32.4% bf16 MFU | 125516 tok/s step 6706/19560 | loss 3.532114 (+0.82z)| norm 0.2425 (-1.91z)| lr 4.62e-04 | 4234.14 ms | 31.9% bf16 MFU | 125432 tok/s step 6707/19560 | loss 3.457102 (-0.95z)| norm 0.2618 (-0.99z)| lr 4.62e-04 | 4177.32 ms | 32.3% bf16 MFU | 125436 tok/s step 6708/19560 | loss 3.452204 (-1.05z)| norm 0.2574 (-1.18z)| lr 4.62e-04 | 4184.13 ms | 32.3% bf16 MFU | 125429 tok/s step 6709/19560 | loss 3.485402 (-0.28z)| norm 0.2659 (-0.78z)| lr 
4.62e-04 | 4206.31 ms | 32.1% bf16 MFU | 125390 tok/s step 6710/19560 | loss 3.492975 (-0.08z)| norm 0.2431 (-1.83z)| lr 4.62e-04 | 4154.81 ms | 32.5% bf16 MFU | 125430 tok/s step 6711/19560 | loss 3.488577 (-0.20z)| norm 0.2572 (-1.17z)| lr 4.62e-04 | 4184.81 ms | 32.3% bf16 MFU | 125422 tok/s step 6712/19560 | loss 3.462594 (-0.84z)| norm 0.2634 (-0.89z)| lr 4.62e-04 | 4157.03 ms | 32.5% bf16 MFU | 125457 tok/s step 6713/19560 | loss 3.473060 (-0.58z)| norm 0.2536 (-1.34z)| lr 4.62e-04 | 4165.81 ms | 32.4% bf16 MFU | 125477 tok/s step 6714/19560 | loss 3.468865 (-0.69z)| norm 0.2682 (-0.66z)| lr 4.62e-04 | 4179.54 ms | 32.3% bf16 MFU | 125475 tok/s step 6715/19560 | loss 3.515795 (+0.46z)| norm 0.2793 (-0.15z)| lr 4.62e-04 | 4157.86 ms | 32.5% bf16 MFU | 125506 tok/s step 6716/19560 | loss 3.507874 (+0.25z)| norm 0.2816 (-0.04z)| lr 4.62e-04 | 4182.84 ms | 32.3% bf16 MFU | 125498 tok/s step 6717/19560 | loss 3.442266 (-1.40z)| norm 0.2810 (-0.07z)| lr 4.62e-04 | 4164.24 ms | 32.4% bf16 MFU | 125518 tok/s step 6718/19560 | loss 3.449445 (-1.21z)| norm 0.2546 (-1.31z)| lr 4.62e-04 | 4169.62 ms | 32.4% bf16 MFU | 125530 tok/s step 6719/19560 | loss 3.476819 (-0.52z)| norm 0.2722 (-0.48z)| lr 4.61e-04 | 4182.64 ms | 32.3% bf16 MFU | 125521 tok/s step 6720/19560 | loss 3.421869 (-1.85z)| norm 0.2968 (+0.68z)| lr 4.61e-04 | 4160.26 ms | 32.5% bf16 MFU | 125546 tok/s step 6721/19560 | loss 3.499724 (+0.06z)| norm 0.3109 (+1.32z)| lr 4.61e-04 | 4228.04 ms | 31.9% bf16 MFU | 125468 tok/s step 6722/19560 | loss 3.485378 (-0.29z)| norm 0.2830 (+0.01z)| lr 4.61e-04 | 4177.98 ms | 32.3% bf16 MFU | 125469 tok/s step 6723/19560 | loss 3.461620 (-0.88z)| norm 0.2764 (-0.31z)| lr 4.61e-04 | 4152.09 ms | 32.5% bf16 MFU | 125510 tok/s step 6724/19560 | loss 3.520193 (+0.56z)| norm 0.2998 (+0.80z)| lr 4.61e-04 | 4163.46 ms | 32.4% bf16 MFU | 125530 tok/s step 6725/19560 | loss 3.463537 (-0.83z)| norm 0.2805 (-0.12z)| lr 4.61e-04 | 4178.72 ms | 32.3% bf16 MFU | 125527 tok/s step 
6726/19560 | loss 3.523032 (+0.67z)| norm 0.2864 (+0.17z)| lr 4.61e-04 | 4171.63 ms | 32.4% bf16 MFU | 125535 tok/s step 6727/19560 | loss 3.506459 (+0.24z)| norm 0.2637 (-0.90z)| lr 4.61e-04 | 4167.46 ms | 32.4% bf16 MFU | 125548 tok/s step 6728/19560 | loss 3.446621 (-1.27z)| norm 0.2778 (-0.23z)| lr 4.61e-04 | 4170.29 ms | 32.4% bf16 MFU | 125557 tok/s step 6729/19560 | loss 3.486181 (-0.27z)| norm 0.2869 (+0.20z)| lr 4.61e-04 | 4163.63 ms | 32.4% bf16 MFU | 125575 tok/s step 6730/19560 | loss 3.493696 (-0.09z)| norm 0.2672 (-0.73z)| lr 4.61e-04 | 4166.96 ms | 32.4% bf16 MFU | 125587 tok/s step 6731/19560 | loss 3.523782 (+0.68z)| norm 0.2968 (+0.68z)| lr 4.61e-04 | 4155.62 ms | 32.5% bf16 MFU | 125616 tok/s step 6732/19560 | loss 3.492029 (-0.12z)| norm 0.2733 (-0.45z)| lr 4.61e-04 | 4156.62 ms | 32.5% bf16 MFU | 125642 tok/s step 6733/19560 | loss 3.521375 (+0.65z)| norm 0.2476 (-1.65z)| lr 4.61e-04 | 4169.07 ms | 32.4% bf16 MFU | 125648 tok/s step 6734/19560 | loss 3.540588 (+1.13z)| norm 0.2593 (-1.08z)| lr 4.61e-04 | 4194.72 ms | 32.2% bf16 MFU | 125615 tok/s step 6735/19560 | loss 3.512936 (+0.41z)| norm 0.2726 (-0.45z)| lr 4.61e-04 | 4173.33 ms | 32.4% bf16 MFU | 125615 tok/s step 6736/19560 | loss 3.492671 (-0.11z)| norm 0.2715 (-0.51z)| lr 4.61e-04 | 4186.14 ms | 32.3% bf16 MFU | 125597 tok/s step 6737/19560 | loss 3.478220 (-0.48z)| norm 0.2745 (-0.37z)| lr 4.61e-04 | 4161.94 ms | 32.4% bf16 MFU | 125616 tok/s step 6738/19560 | loss 3.486682 (-0.26z)| norm 0.2932 (+0.51z)| lr 4.61e-04 | 4156.86 ms | 32.5% bf16 MFU | 125641 tok/s step 6739/19560 | loss 3.492867 (-0.11z)| norm 0.2599 (-1.04z)| lr 4.61e-04 | 4151.10 ms | 32.5% bf16 MFU | 125674 tok/s step 6740/19560 | loss 3.470007 (-0.73z)| norm 0.2530 (-1.34z)| lr 4.61e-04 | 4176.27 ms | 32.3% bf16 MFU | 125667 tok/s step 6741/19560 | loss 3.595374 (+2.54z)| norm 0.2980 (+0.75z)| lr 4.61e-04 | 4177.21 ms | 32.3% bf16 MFU | 125660 tok/s step 6742/19560 | loss 3.511359 (+0.35z)| norm 0.2930 (+0.51z)| lr 
4.61e-04 | 4181.91 ms | 32.3% bf16 MFU | 125645 tok/s step 6743/19560 | loss 3.466585 (-0.82z)| norm 0.2924 (+0.48z)| lr 4.60e-04 | 4173.75 ms | 32.3% bf16 MFU | 125644 tok/s step 6744/19560 | loss 3.474164 (-0.61z)| norm 0.2599 (-1.02z)| lr 4.60e-04 | 4167.25 ms | 32.4% bf16 MFU | 125652 tok/s step 6745/19560 | loss 3.490479 (-0.18z)| norm 0.2785 (-0.15z)| lr 4.60e-04 | 4166.19 ms | 32.4% bf16 MFU | 125662 tok/s step 6746/19560 | loss 3.482267 (-0.40z)| norm 0.3228 (+1.88z)| lr 4.60e-04 | 4162.97 ms | 32.4% bf16 MFU | 125676 tok/s step 6747/19560 | loss 3.518302 (+0.54z)| norm 0.3144 (+1.48z)| lr 4.60e-04 | 4155.72 ms | 32.5% bf16 MFU | 125700 tok/s step 6748/19560 | loss 3.471123 (-0.69z)| norm 0.2777 (-0.22z)| lr 4.60e-04 | 4165.45 ms | 32.4% bf16 MFU | 125708 tok/s step 6749/19560 | loss 3.471300 (-0.68z)| norm 0.2990 (+0.76z)| lr 4.60e-04 | 4165.74 ms | 32.4% bf16 MFU | 125716 tok/s step 6750/19560 | loss 3.502909 (+0.14z)| norm 0.2827 (+0.01z)| lr 4.60e-04 | 4159.88 ms | 32.5% bf16 MFU | 125731 tok/s val loss 3.476414 evaluating HellaSwag: 0/1256 evaluating HellaSwag: 10/1256 evaluating HellaSwag: 20/1256 evaluating HellaSwag: 30/1256 evaluating HellaSwag: 40/1256 evaluating HellaSwag: 50/1256 evaluating HellaSwag: 60/1256 evaluating HellaSwag: 70/1256 evaluating HellaSwag: 80/1256 evaluating HellaSwag: 90/1256 evaluating HellaSwag: 100/1256 evaluating HellaSwag: 110/1256 evaluating HellaSwag: 120/1256 evaluating HellaSwag: 130/1256 evaluating HellaSwag: 140/1256 evaluating HellaSwag: 150/1256 evaluating HellaSwag: 160/1256 evaluating HellaSwag: 170/1256 evaluating HellaSwag: 180/1256 evaluating HellaSwag: 190/1256 evaluating HellaSwag: 200/1256 evaluating HellaSwag: 210/1256 evaluating HellaSwag: 220/1256 evaluating HellaSwag: 230/1256 evaluating HellaSwag: 240/1256 evaluating HellaSwag: 250/1256 evaluating HellaSwag: 260/1256 evaluating HellaSwag: 270/1256 evaluating HellaSwag: 280/1256 evaluating HellaSwag: 290/1256 evaluating HellaSwag: 300/1256 
evaluating HellaSwag: 310/1256 evaluating HellaSwag: 320/1256 evaluating HellaSwag: 330/1256 evaluating HellaSwag: 340/1256 evaluating HellaSwag: 350/1256 evaluating HellaSwag: 360/1256 evaluating HellaSwag: 370/1256 evaluating HellaSwag: 380/1256 evaluating HellaSwag: 390/1256 evaluating HellaSwag: 400/1256 evaluating HellaSwag: 410/1256 evaluating HellaSwag: 420/1256 evaluating HellaSwag: 430/1256 evaluating HellaSwag: 440/1256 evaluating HellaSwag: 450/1256 evaluating HellaSwag: 460/1256 evaluating HellaSwag: 470/1256 evaluating HellaSwag: 480/1256 evaluating HellaSwag: 490/1256 evaluating HellaSwag: 500/1256 evaluating HellaSwag: 510/1256 evaluating HellaSwag: 520/1256 evaluating HellaSwag: 530/1256 evaluating HellaSwag: 540/1256 evaluating HellaSwag: 550/1256 evaluating HellaSwag: 560/1256 evaluating HellaSwag: 570/1256 evaluating HellaSwag: 580/1256 evaluating HellaSwag: 590/1256 evaluating HellaSwag: 600/1256 evaluating HellaSwag: 610/1256 evaluating HellaSwag: 620/1256 evaluating HellaSwag: 630/1256 evaluating HellaSwag: 640/1256 evaluating HellaSwag: 650/1256 evaluating HellaSwag: 660/1256 evaluating HellaSwag: 670/1256 evaluating HellaSwag: 680/1256 evaluating HellaSwag: 690/1256 evaluating HellaSwag: 700/1256 evaluating HellaSwag: 710/1256 evaluating HellaSwag: 720/1256 evaluating HellaSwag: 730/1256 evaluating HellaSwag: 740/1256 evaluating HellaSwag: 750/1256 evaluating HellaSwag: 760/1256 evaluating HellaSwag: 770/1256 evaluating HellaSwag: 780/1256 evaluating HellaSwag: 790/1256 evaluating HellaSwag: 800/1256 evaluating HellaSwag: 810/1256 evaluating HellaSwag: 820/1256 evaluating HellaSwag: 830/1256 evaluating HellaSwag: 840/1256 evaluating HellaSwag: 850/1256 evaluating HellaSwag: 860/1256 evaluating HellaSwag: 870/1256 evaluating HellaSwag: 880/1256 evaluating HellaSwag: 890/1256 evaluating HellaSwag: 900/1256 evaluating HellaSwag: 910/1256 evaluating HellaSwag: 920/1256 evaluating HellaSwag: 930/1256 evaluating HellaSwag: 940/1256 evaluating 
HellaSwag: 950/1256 evaluating HellaSwag: 960/1256 evaluating HellaSwag: 970/1256 evaluating HellaSwag: 980/1256 evaluating HellaSwag: 990/1256 evaluating HellaSwag: 1000/1256 evaluating HellaSwag: 1010/1256 evaluating HellaSwag: 1020/1256 evaluating HellaSwag: 1030/1256 evaluating HellaSwag: 1040/1256 evaluating HellaSwag: 1050/1256 evaluating HellaSwag: 1060/1256 evaluating HellaSwag: 1070/1256 evaluating HellaSwag: 1080/1256 evaluating HellaSwag: 1090/1256 evaluating HellaSwag: 1100/1256 evaluating HellaSwag: 1110/1256 evaluating HellaSwag: 1120/1256 evaluating HellaSwag: 1130/1256 evaluating HellaSwag: 1140/1256 evaluating HellaSwag: 1150/1256 evaluating HellaSwag: 1160/1256 evaluating HellaSwag: 1170/1256 evaluating HellaSwag: 1180/1256 evaluating HellaSwag: 1190/1256 evaluating HellaSwag: 1200/1256 evaluating HellaSwag: 1210/1256 evaluating HellaSwag: 1220/1256 evaluating HellaSwag: 1230/1256 evaluating HellaSwag: 1240/1256 evaluating HellaSwag: 1250/1256 HellaSwag: 2810/10042 = 0.279825 step 6751/19560 | loss 3.536828 (+1.01z)| norm 0.3262 (+1.97z)| lr 4.60e-04 | 5137.13 ms | 26.3% bf16 MFU | 124548 tok/s step 6752/19560 | loss 3.473346 (-0.65z)| norm 0.2726 (-0.47z)| lr 4.60e-04 | 4352.58 ms | 31.0% bf16 MFU | 124343 tok/s step 6753/19560 | loss 3.487263 (-0.30z)| norm 0.3264 (+1.93z)| lr 4.60e-04 | 4289.15 ms | 31.5% bf16 MFU | 124238 tok/s step 6754/19560 | loss 3.475888 (-0.59z)| norm 0.2913 (+0.35z)| lr 4.60e-04 | 4359.25 ms | 31.0% bf16 MFU | 124039 tok/s step 6755/19560 | loss 3.486413 (-0.30z)| norm 0.3052 (+0.96z)| lr 4.60e-04 | 4382.34 ms | 30.8% bf16 MFU | 123819 tok/s step 6756/19560 | loss 3.470323 (-0.72z)| norm 0.2855 (+0.07z)| lr 4.60e-04 | 4219.23 ms | 32.0% bf16 MFU | 123841 tok/s step 6757/19560 | loss 3.455812 (-1.09z)| norm 0.2689 (-0.68z)| lr 4.60e-04 | 4254.08 ms | 31.7% bf16 MFU | 123812 tok/s step 6758/19560 | loss 3.536056 (+1.03z)| norm 0.2930 (+0.40z)| lr 4.60e-04 | 4224.95 ms | 32.0% bf16 MFU | 123826 tok/s step 6759/19560 | loss 
3.486331 (-0.29z)| norm 0.2618 (-1.01z)| lr 4.60e-04 | 4238.67 ms | 31.9% bf16 MFU | 123819 tok/s step 6760/19560 | loss 3.472719 (-0.64z)| norm 0.2908 (+0.30z)| lr 4.60e-04 | 4230.53 ms | 31.9% bf16 MFU | 123824 tok/s step 6761/19560 | loss 3.444729 (-1.38z)| norm 0.3250 (+1.83z)| lr 4.60e-04 | 4267.23 ms | 31.6% bf16 MFU | 123776 tok/s step 6762/19560 | loss 3.495978 (+0.02z)| norm 0.2619 (-0.99z)| lr 4.60e-04 | 4219.61 ms | 32.0% bf16 MFU | 123800 tok/s step 6763/19560 | loss 3.443670 (-1.38z)| norm 0.3096 (+1.12z)| lr 4.60e-04 | 4239.09 ms | 31.9% bf16 MFU | 123794 tok/s step 6764/19560 | loss 3.405159 (-2.38z)| norm 0.2749 (-0.41z)| lr 4.60e-04 | 4189.06 ms | 32.2% bf16 MFU | 123862 tok/s step 6765/19560 | loss 3.486076 (-0.19z)| norm 0.2689 (-0.68z)| lr 4.60e-04 | 4204.82 ms | 32.1% bf16 MFU | 123903 tok/s step 6766/19560 | loss 3.486375 (-0.19z)| norm 0.2884 (+0.24z)| lr 4.59e-04 | 4187.86 ms | 32.2% bf16 MFU | 123968 tok/s step 6767/19560 | loss 3.525153 (+0.85z)| norm 0.2712 (-0.56z)| lr 4.59e-04 | 4200.86 ms | 32.1% bf16 MFU | 124010 tok/s step 6768/19560 | loss 3.515350 (+0.59z)| norm 0.2948 (+0.54z)| lr 4.59e-04 | 4245.93 ms | 31.8% bf16 MFU | 123983 tok/s step 6769/19560 | loss 3.493265 (-0.01z)| norm 0.2829 (-0.00z)| lr 4.59e-04 | 4167.62 ms | 32.4% bf16 MFU | 124074 tok/s step 6770/19560 | loss 3.493073 (-0.01z)| norm 0.2734 (-0.45z)| lr 4.59e-04 | 4186.39 ms | 32.3% bf16 MFU | 124132 tok/s step 6771/19560 | loss 3.491665 (-0.05z)| norm 0.2667 (-0.76z)| lr 4.59e-04 | 4161.58 ms | 32.4% bf16 MFU | 124225 tok/s step 6772/19560 | loss 3.463575 (-0.82z)| norm 0.2911 (+0.38z)| lr 4.59e-04 | 4182.92 ms | 32.3% bf16 MFU | 124281 tok/s step 6773/19560 | loss 3.463655 (-0.80z)| norm 0.2811 (-0.08z)| lr 4.59e-04 | 4162.90 ms | 32.4% bf16 MFU | 124364 tok/s step 6774/19560 | loss 3.481487 (-0.32z)| norm 0.2794 (-0.17z)| lr 4.59e-04 | 4172.23 ms | 32.4% bf16 MFU | 124429 tok/s step 6775/19560 | loss 3.480031 (-0.35z)| norm 0.2997 (+0.78z)| lr 4.59e-04 | 4182.22 
ms | 32.3% bf16 MFU | 124475 tok/s step 6776/19560 | loss 3.521282 (+0.88z)| norm 0.2721 (-0.52z)| lr 4.59e-04 | 4163.44 ms | 32.4% bf16 MFU | 124548 tok/s step 6777/19560 | loss 3.463476 (-0.83z)| norm 0.2751 (-0.38z)| lr 4.59e-04 | 4159.53 ms | 32.5% bf16 MFU | 124623 tok/s step 6778/19560 | loss 3.456674 (-1.02z)| norm 0.2598 (-1.08z)| lr 4.59e-04 | 4174.88 ms | 32.3% bf16 MFU | 124671 tok/s step 6779/19560 | loss 3.498336 (+0.22z)| norm 0.2475 (-1.65z)| lr 4.59e-04 | 4180.11 ms | 32.3% bf16 MFU | 124708 tok/s step 6780/19560 | loss 3.519492 (+0.84z)| norm 0.2848 (+0.10z)| lr 4.59e-04 | 4177.76 ms | 32.3% bf16 MFU | 124748 tok/s step 6781/19560 | loss 3.485908 (-0.15z)| norm 0.2905 (+0.37z)| lr 4.59e-04 | 4184.03 ms | 32.3% bf16 MFU | 124776 tok/s step 6782/19560 | loss 3.517596 (+0.79z)| norm 0.2810 (-0.07z)| lr 4.59e-04 | 4183.21 ms | 32.3% bf16 MFU | 124803 tok/s step 6783/19560 | loss 3.511961 (+0.61z)| norm 0.2988 (+0.77z)| lr 4.59e-04 | 4157.13 ms | 32.5% bf16 MFU | 124869 tok/s step 6784/19560 | loss 3.508078 (+0.49z)| norm 0.2881 (+0.29z)| lr 4.59e-04 | 4176.83 ms | 32.3% bf16 MFU | 124902 tok/s step 6785/19560 | loss 3.500704 (+0.26z)| norm 0.2516 (-1.46z)| lr 4.59e-04 | 4191.24 ms | 32.2% bf16 MFU | 124911 tok/s step 6786/19560 | loss 3.457847 (-1.00z)| norm 0.2748 (-0.34z)| lr 4.59e-04 | 4163.07 ms | 32.4% bf16 MFU | 124963 tok/s step 6787/19560 | loss 3.471237 (-0.60z)| norm 0.2721 (-0.47z)| lr 4.59e-04 | 4161.94 ms | 32.4% bf16 MFU | 125013 tok/s step 6788/19560 | loss 3.556019 (+1.90z)| norm 0.2768 (-0.25z)| lr 4.59e-04 | 4163.68 ms | 32.4% bf16 MFU | 125058 tok/s step 6789/19560 | loss 3.484709 (-0.20z)| norm 0.2431 (-1.83z)| lr 4.59e-04 | 4165.90 ms | 32.4% bf16 MFU | 125098 tok/s step 6790/19560 | loss 3.505427 (+0.42z)| norm 0.2776 (-0.20z)| lr 4.58e-04 | 4173.69 ms | 32.3% bf16 MFU | 125124 tok/s step 6791/19560 | loss 3.524146 (+0.96z)| norm 0.2554 (-1.24z)| lr 4.58e-04 | 4162.10 ms | 32.4% bf16 MFU | 125166 tok/s step 6792/19560 | loss 
3.499901 (+0.24z)| norm 0.2801 (-0.06z)| lr 4.58e-04 | 4197.89 ms | 32.2% bf16 MFU | 125152 tok/s step 6793/19560 | loss 3.480590 (-0.31z)| norm 0.2939 (+0.58z)| lr 4.58e-04 | 4160.10 ms | 32.5% bf16 MFU | 125196 tok/s step 6794/19560 | loss 3.449829 (-1.23z)| norm 0.2625 (-0.90z)| lr 4.58e-04 | 4164.26 ms | 32.4% bf16 MFU | 125232 tok/s step 6795/19560 | loss 3.410300 (-2.38z)| norm 0.2628 (-0.87z)| lr 4.58e-04 | 4166.53 ms | 32.4% bf16 MFU | 125262 tok/s step 6796/19560 | loss 3.488042 (-0.04z)| norm 0.3001 (+0.94z)| lr 4.58e-04 | 4165.11 ms | 32.4% bf16 MFU | 125292 tok/s step 6797/19560 | loss 3.482482 (-0.20z)| norm 0.2915 (+0.51z)| lr 4.58e-04 | 4160.51 ms | 32.5% bf16 MFU | 125329 tok/s step 6798/19560 | loss 3.459081 (-0.90z)| norm 0.2505 (-1.46z)| lr 4.58e-04 | 4182.75 ms | 32.3% bf16 MFU | 125329 tok/s step 6799/19560 | loss 3.486293 (-0.08z)| norm 0.3044 (+1.14z)| lr 4.58e-04 | 4155.73 ms | 32.5% bf16 MFU | 125371 tok/s step 6800/19560 | loss 3.642625 (+4.26z)| norm 0.2783 (-0.10z)| lr 4.58e-04 | 4165.08 ms | 32.4% bf16 MFU | 125396 tok/s step 6801/19560 | loss 3.438826 (-1.44z)| norm 0.2616 (-0.90z)| lr 4.58e-04 | 4244.62 ms | 31.8% bf16 MFU | 125302 tok/s step 6802/19560 | loss 3.507203 (+0.46z)| norm 0.3208 (+1.92z)| lr 4.58e-04 | 4157.71 ms | 32.5% bf16 MFU | 125342 tok/s step 6803/19560 | loss 3.621547 (+3.49z)| norm 0.2639 (-0.79z)| lr 4.58e-04 | 4159.51 ms | 32.5% bf16 MFU | 125377 tok/s step 6804/19560 | loss 3.443397 (-1.29z)| norm 0.2709 (-0.45z)| lr 4.58e-04 | 4157.22 ms | 32.5% bf16 MFU | 125414 tok/s step 6805/19560 | loss 3.508295 (+0.45z)| norm 0.2848 (+0.20z)| lr 4.58e-04 | 4159.31 ms | 32.5% bf16 MFU | 125446 tok/s step 6806/19560 | loss 3.542638 (+1.36z)| norm 0.3077 (+1.30z)| lr 4.58e-04 | 4163.64 ms | 32.4% bf16 MFU | 125470 tok/s step 6807/19560 | loss 3.504467 (+0.33z)| norm 0.2869 (+0.30z)| lr 4.58e-04 | 4154.01 ms | 32.5% bf16 MFU | 125507 tok/s step 6808/19560 | loss 3.466790 (-0.68z)| norm 0.2453 (-1.68z)| lr 4.58e-04 | 4157.05 
ms | 32.5% bf16 MFU | 125538 tok/s step 6809/19560 | loss 3.487768 (-0.13z)| norm 0.2769 (-0.17z)| lr 4.58e-04 | 4155.51 ms | 32.5% bf16 MFU | 125569 tok/s step 6810/19560 | loss 3.547395 (+1.46z)| norm 0.2714 (-0.42z)| lr 4.58e-04 | 4166.27 ms | 32.4% bf16 MFU | 125583 tok/s step 6811/19560 | loss 3.493080 (-0.00z)| norm 0.2693 (-0.53z)| lr 4.58e-04 | 4155.24 ms | 32.5% bf16 MFU | 125612 tok/s step 6812/19560 | loss 3.505572 (+0.33z)| norm 0.2678 (-0.59z)| lr 4.58e-04 | 4171.13 ms | 32.4% bf16 MFU | 125616 tok/s step 6813/19560 | loss 3.463380 (-0.80z)| norm 0.2913 (+0.53z)| lr 4.57e-04 | 4160.88 ms | 32.4% bf16 MFU | 125636 tok/s step 6814/19560 | loss 3.564773 (+1.91z)| norm 0.2645 (-0.75z)| lr 4.57e-04 | 4167.37 ms | 32.4% bf16 MFU | 125644 tok/s step 6815/19560 | loss 3.486238 (-0.18z)| norm 0.2851 (+0.23z)| lr 4.57e-04 | 4166.20 ms | 32.4% bf16 MFU | 125654 tok/s step 6816/19560 | loss 3.498697 (+0.17z)| norm 0.2972 (+0.81z)| lr 4.57e-04 | 4159.18 ms | 32.5% bf16 MFU | 125674 tok/s step 6817/19560 | loss 3.532005 (+1.06z)| norm 0.2826 (+0.10z)| lr 4.57e-04 | 4161.54 ms | 32.4% bf16 MFU | 125690 tok/s step 6818/19560 | loss 3.505749 (+0.35z)| norm 0.2343 (-2.19z)| lr 4.57e-04 | 4163.66 ms | 32.4% bf16 MFU | 125701 tok/s step 6819/19560 | loss 3.513533 (+0.55z)| norm 0.2879 (+0.36z)| lr 4.57e-04 | 4159.69 ms | 32.5% bf16 MFU | 125718 tok/s step 6820/19560 | loss 3.512340 (+0.52z)| norm 0.2734 (-0.34z)| lr 4.57e-04 | 4149.25 ms | 32.5% bf16 MFU | 125750 tok/s step 6821/19560 | loss 3.445244 (-1.27z)| norm 0.2634 (-0.80z)| lr 4.57e-04 | 4160.15 ms | 32.5% bf16 MFU | 125764 tok/s step 6822/19560 | loss 3.552415 (+1.57z)| norm 0.2904 (+0.58z)| lr 4.57e-04 | 4150.18 ms | 32.5% bf16 MFU | 125792 tok/s step 6823/19560 | loss 3.533587 (+1.05z)| norm 0.2844 (+0.29z)| lr 4.57e-04 | 4163.10 ms | 32.4% bf16 MFU | 125800 tok/s step 6824/19560 | loss 3.577402 (+2.16z)| norm 0.2939 (+0.81z)| lr 4.57e-04 | 4157.32 ms | 32.5% bf16 MFU | 125815 tok/s step 6825/19560 | loss 
3.555769 (+1.58z)| norm 0.2914 (+0.67z)| lr 4.57e-04 | 4161.17 ms | 32.4% bf16 MFU | 125824 tok/s step 6826/19560 | loss 3.488046 (-0.15z)| norm 0.2920 (+0.70z)| lr 4.57e-04 | 4156.49 ms | 32.5% bf16 MFU | 125840 tok/s step 6827/19560 | loss 3.471390 (-0.60z)| norm 0.2879 (+0.49z)| lr 4.57e-04 | 4163.05 ms | 32.4% bf16 MFU | 125845 tok/s step 6828/19560 | loss 3.419810 (-1.96z)| norm 0.2959 (+0.92z)| lr 4.57e-04 | 4165.89 ms | 32.4% bf16 MFU | 125845 tok/s step 6829/19560 | loss 3.516053 (+0.63z)| norm 0.2792 (+0.01z)| lr 4.57e-04 | 4155.95 ms | 32.5% bf16 MFU | 125861 tok/s step 6830/19560 | loss 3.474687 (-0.48z)| norm 0.3084 (+1.57z)| lr 4.57e-04 | 4156.68 ms | 32.5% bf16 MFU | 125874 tok/s step 6831/19560 | loss 3.524180 (+0.85z)| norm 0.2908 (+0.62z)| lr 4.57e-04 | 4155.55 ms | 32.5% bf16 MFU | 125889 tok/s step 6832/19560 | loss 3.513958 (+0.56z)| norm 0.2790 (-0.03z)| lr 4.57e-04 | 4198.60 ms | 32.2% bf16 MFU | 125838 tok/s step 6833/19560 | loss 3.473213 (-0.54z)| norm 0.2808 (+0.07z)| lr 4.57e-04 | 4155.13 ms | 32.5% bf16 MFU | 125855 tok/s step 6834/19560 | loss 3.478760 (-0.38z)| norm 0.2709 (-0.49z)| lr 4.57e-04 | 4146.69 ms | 32.6% bf16 MFU | 125884 tok/s step 6835/19560 | loss 3.478796 (-0.39z)| norm 0.2775 (-0.13z)| lr 4.57e-04 | 4157.11 ms | 32.5% bf16 MFU | 125896 tok/s step 6836/19560 | loss 3.512486 (+0.52z)| norm 0.2726 (-0.41z)| lr 4.57e-04 | 4153.80 ms | 32.5% bf16 MFU | 125912 tok/s step 6837/19560 | loss 3.511083 (+0.48z)| norm 0.2673 (-0.70z)| lr 4.56e-04 | 4155.84 ms | 32.5% bf16 MFU | 125924 tok/s step 6838/19560 | loss 3.531236 (+1.02z)| norm 0.2823 (+0.12z)| lr 4.56e-04 | 4147.55 ms | 32.6% bf16 MFU | 125948 tok/s step 6839/19560 | loss 3.496333 (+0.06z)| norm 0.2492 (-1.74z)| lr 4.56e-04 | 4168.41 ms | 32.4% bf16 MFU | 125940 tok/s step 6840/19560 | loss 3.504851 (+0.29z)| norm 0.2740 (-0.36z)| lr 4.56e-04 | 4152.23 ms | 32.5% bf16 MFU | 125956 tok/s step 6841/19560 | loss 3.543754 (+1.33z)| norm 0.2757 (-0.27z)| lr 4.56e-04 | 4159.87 
ms | 32.5% bf16 MFU | 125960 tok/s step 6842/19560 | loss 3.504373 (+0.25z)| norm 0.2442 (-2.02z)| lr 4.56e-04 | 4164.29 ms | 32.4% bf16 MFU | 125957 tok/s step 6843/19560 | loss 3.540240 (+1.22z)| norm 0.2575 (-1.25z)| lr 4.56e-04 | 4165.77 ms | 32.4% bf16 MFU | 125952 tok/s step 6844/19560 | loss 3.606601 (+2.90z)| norm 0.2654 (-0.81z)| lr 4.56e-04 | 4161.39 ms | 32.4% bf16 MFU | 125954 tok/s step 6845/19560 | loss 3.529124 (+0.85z)| norm 0.2609 (-1.05z)| lr 4.56e-04 | 4153.79 ms | 32.5% bf16 MFU | 125967 tok/s step 6846/19560 | loss 3.471009 (-0.69z)| norm 0.2690 (-0.61z)| lr 4.56e-04 | 4161.68 ms | 32.4% bf16 MFU | 125968 tok/s step 6847/19560 | loss 3.574261 (+2.00z)| norm 0.2676 (-0.68z)| lr 4.56e-04 | 4164.53 ms | 32.4% bf16 MFU | 125964 tok/s step 6848/19560 | loss 3.492920 (-0.14z)| norm 0.2757 (-0.22z)| lr 4.56e-04 | 4161.08 ms | 32.4% bf16 MFU | 125966 tok/s step 6849/19560 | loss 3.436060 (-1.62z)| norm 0.2585 (-1.16z)| lr 4.56e-04 | 4155.38 ms | 32.5% bf16 MFU | 125976 tok/s step 6850/19560 | loss 3.482822 (-0.39z)| norm 0.2669 (-0.69z)| lr 4.56e-04 | 4173.26 ms | 32.4% bf16 MFU | 125959 tok/s step 6851/19560 | loss 3.503980 (+0.16z)| norm 0.2580 (-1.17z)| lr 4.56e-04 | 4152.26 ms | 32.5% bf16 MFU | 125974 tok/s step 6852/19560 | loss 3.460887 (-0.96z)| norm 0.2712 (-0.43z)| lr 4.56e-04 | 4157.46 ms | 32.5% bf16 MFU | 125981 tok/s step 6853/19560 | loss 3.454886 (-1.12z)| norm 0.2504 (-1.56z)| lr 4.56e-04 | 4157.80 ms | 32.5% bf16 MFU | 125987 tok/s step 6854/19560 | loss 3.479224 (-0.47z)| norm 0.2618 (-0.91z)| lr 4.56e-04 | 4149.02 ms | 32.5% bf16 MFU | 126005 tok/s step 6855/19560 | loss 3.473313 (-0.62z)| norm 0.2816 (+0.17z)| lr 4.56e-04 | 4151.98 ms | 32.5% bf16 MFU | 126019 tok/s step 6856/19560 | loss 3.545537 (+1.25z)| norm 0.2850 (+0.35z)| lr 4.56e-04 | 4165.52 ms | 32.4% bf16 MFU | 126011 tok/s step 6857/19560 | loss 3.486970 (-0.28z)| norm 0.3010 (+1.23z)| lr 4.56e-04 | 4158.55 ms | 32.5% bf16 MFU | 126014 tok/s step 6858/19560 | loss 
3.484592 (-0.34z)| norm 0.2843 (+0.30z)| lr 4.56e-04 | 4155.07 ms | 32.5% bf16 MFU | 126023 tok/s step 6859/19560 | loss 3.492702 (-0.12z)| norm 0.3112 (+1.76z)| lr 4.56e-04 | 4170.24 ms | 32.4% bf16 MFU | 126008 tok/s step 6860/19560 | loss 3.510386 (+0.34z)| norm 0.2778 (-0.06z)| lr 4.55e-04 | 4156.84 ms | 32.5% bf16 MFU | 126013 tok/s step 6861/19560 | loss 3.476272 (-0.55z)| norm 0.3317 (+2.79z)| lr 4.55e-04 | 4150.29 ms | 32.5% bf16 MFU | 126029 tok/s step 6862/19560 | loss 3.507632 (+0.28z)| norm 0.3174 (+1.97z)| lr 4.55e-04 | 4167.21 ms | 32.4% bf16 MFU | 126018 tok/s step 6863/19560 | loss 3.495295 (-0.04z)| norm 0.2928 (+0.66z)| lr 4.55e-04 | 4164.30 ms | 32.4% bf16 MFU | 126012 tok/s step 6864/19560 | loss 3.459851 (-0.97z)| norm 0.3013 (+1.09z)| lr 4.55e-04 | 4155.25 ms | 32.5% bf16 MFU | 126021 tok/s step 6865/19560 | loss 3.567054 (+1.81z)| norm 0.2786 (-0.10z)| lr 4.55e-04 | 4152.91 ms | 32.5% bf16 MFU | 126032 tok/s step 6866/19560 | loss 3.501134 (+0.10z)| norm 0.3196 (+2.02z)| lr 4.55e-04 | 4158.97 ms | 32.5% bf16 MFU | 126033 tok/s step 6867/19560 | loss 3.484790 (-0.33z)| norm 0.3204 (+2.01z)| lr 4.55e-04 | 4158.35 ms | 32.5% bf16 MFU | 126036 tok/s step 6868/19560 | loss 3.514550 (+0.44z)| norm 0.7156 (+10.06z)| lr 4.55e-04 | 4175.16 ms | 32.3% bf16 MFU | 126013 tok/s step 6869/19560 | loss 3.514783 (+0.47z)| norm 0.3706 (+1.96z)| lr 4.55e-04 | 4178.50 ms | 32.3% bf16 MFU | 125986 tok/s step 6870/19560 | loss 3.503050 (+0.16z)| norm 0.3315 (+1.05z)| lr 4.55e-04 | 4167.39 ms | 32.4% bf16 MFU | 125977 tok/s step 6871/19560 | loss 3.493305 (-0.10z)| norm 0.3149 (+0.66z)| lr 4.55e-04 | 4157.06 ms | 32.5% bf16 MFU | 125984 tok/s step 6872/19560 | loss 3.442802 (-1.44z)| norm 0.2963 (+0.23z)| lr 4.55e-04 | 4161.93 ms | 32.4% bf16 MFU | 125983 tok/s step 6873/19560 | loss 3.471568 (-0.67z)| norm 0.3125 (+0.60z)| lr 4.55e-04 | 4161.92 ms | 32.4% bf16 MFU | 125983 tok/s step 6874/19560 | loss 3.483025 (-0.36z)| norm 0.2869 (+0.02z)| lr 4.55e-04 | 4153.40 
ms | 32.5% bf16 MFU | 125995 tok/s step 6875/19560 | loss 3.489920 (-0.18z)| norm 0.2906 (+0.11z)| lr 4.55e-04 | 4158.96 ms | 32.5% bf16 MFU | 125998 tok/s step 6876/19560 | loss 3.473962 (-0.60z)| norm 0.2718 (-0.32z)| lr 4.55e-04 | 4152.56 ms | 32.5% bf16 MFU | 126011 tok/s step 6877/19560 | loss 3.480137 (-0.44z)| norm 0.2908 (+0.11z)| lr 4.55e-04 | 4164.37 ms | 32.4% bf16 MFU | 126006 tok/s step 6878/19560 | loss 3.497240 (+0.02z)| norm 0.2998 (+0.32z)| lr 4.55e-04 | 4163.09 ms | 32.4% bf16 MFU | 126002 tok/s step 6879/19560 | loss 3.513378 (+0.45z)| norm 0.2697 (-0.37z)| lr 4.55e-04 | 4154.73 ms | 32.5% bf16 MFU | 126012 tok/s step 6880/19560 | loss 3.525210 (+0.76z)| norm 0.2965 (+0.25z)| lr 4.55e-04 | 4152.54 ms | 32.5% bf16 MFU | 126024 tok/s step 6881/19560 | loss 3.501958 (+0.13z)| norm 0.2605 (-0.57z)| lr 4.55e-04 | 4151.91 ms | 32.5% bf16 MFU | 126037 tok/s step 6882/19560 | loss 3.507478 (+0.27z)| norm 0.2930 (+0.18z)| lr 4.55e-04 | 4166.06 ms | 32.4% bf16 MFU | 126027 tok/s step 6883/19560 | loss 3.521976 (+0.65z)| norm 0.2722 (-0.30z)| lr 4.55e-04 | 4162.25 ms | 32.4% bf16 MFU | 126024 tok/s step 6884/19560 | loss 3.530737 (+0.87z)| norm 0.2596 (-0.58z)| lr 4.54e-04 | 4154.77 ms | 32.5% bf16 MFU | 126032 tok/s step 6885/19560 | loss 3.465311 (-0.87z)| norm 0.2805 (-0.10z)| lr 4.54e-04 | 4157.26 ms | 32.5% bf16 MFU | 126036 tok/s step 6886/19560 | loss 3.456611 (-1.09z)| norm 0.2695 (-0.35z)| lr 4.54e-04 | 4168.74 ms | 32.4% bf16 MFU | 126023 tok/s step 6887/19560 | loss 3.473590 (-0.63z)| norm 0.2805 (-0.10z)| lr 4.54e-04 | 4150.25 ms | 32.5% bf16 MFU | 126038 tok/s step 6888/19560 | loss 3.474738 (-0.60z)| norm 0.2924 (+0.17z)| lr 4.54e-04 | 4159.89 ms | 32.5% bf16 MFU | 126038 tok/s step 6889/19560 | loss 3.480974 (-0.45z)| norm 0.2630 (-0.49z)| lr 4.54e-04 | 4155.38 ms | 32.5% bf16 MFU | 126044 tok/s step 6890/19560 | loss 3.484750 (-0.34z)| norm 0.2669 (-0.41z)| lr 4.54e-04 | 4160.50 ms | 32.5% bf16 MFU | 126043 tok/s step 6891/19560 | loss 
3.439968 (-1.54z)| norm 0.2801 (-0.09z)| lr 4.54e-04 | 4153.65 ms | 32.5% bf16 MFU | 126052 tok/s step 6892/19560 | loss 3.511746 (+0.37z)| norm 0.3019 (+0.41z)| lr 4.54e-04 | 4152.48 ms | 32.5% bf16 MFU | 126062 tok/s step 6893/19560 | loss 3.485019 (-0.36z)| norm 0.2488 (-0.82z)| lr 4.54e-04 | 4161.43 ms | 32.4% bf16 MFU | 126059 tok/s step 6894/19560 | loss 3.458848 (-1.07z)| norm 0.2578 (-0.60z)| lr 4.54e-04 | 4146.45 ms | 32.6% bf16 MFU | 126078 tok/s step 6895/19560 | loss 3.484615 (-0.36z)| norm 0.3261 (+0.96z)| lr 4.54e-04 | 4166.71 ms | 32.4% bf16 MFU | 126065 tok/s step 6896/19560 | loss 3.516516 (+0.51z)| norm 0.3082 (+0.55z)| lr 4.54e-04 | 4158.39 ms | 32.5% bf16 MFU | 126066 tok/s step 6897/19560 | loss 3.537445 (+1.07z)| norm 0.2626 (-0.50z)| lr 4.54e-04 | 4154.05 ms | 32.5% bf16 MFU | 126073 tok/s step 6898/19560 | loss 3.487422 (-0.29z)| norm 0.3384 (+1.22z)| lr 4.54e-04 | 4151.15 ms | 32.5% bf16 MFU | 126085 tok/s step 6899/19560 | loss 3.507266 (+0.25z)| norm 0.3107 (+0.58z)| lr 4.54e-04 | 4160.16 ms | 32.5% bf16 MFU | 126082 tok/s step 6900/19560 | loss 3.521115 (+0.61z)| norm 0.3140 (+0.65z)| lr 4.54e-04 | 4162.69 ms | 32.4% bf16 MFU | 126075 tok/s step 6901/19560 | loss 3.517425 (+0.50z)| norm 0.2919 (+0.15z)| lr 4.54e-04 | 4180.95 ms | 32.3% bf16 MFU | 126041 tok/s step 6902/19560 | loss 3.499346 (+0.00z)| norm 0.2826 (-0.07z)| lr 4.54e-04 | 4156.58 ms | 32.5% bf16 MFU | 126046 tok/s step 6903/19560 | loss 3.491755 (-0.21z)| norm 0.2857 (+0.01z)| lr 4.54e-04 | 4155.30 ms | 32.5% bf16 MFU | 126052 tok/s step 6904/19560 | loss 3.535468 (+0.99z)| norm 0.2643 (-0.48z)| lr 4.54e-04 | 4159.06 ms | 32.5% bf16 MFU | 126053 tok/s step 6905/19560 | loss 3.510699 (+0.30z)| norm 0.2610 (-0.55z)| lr 4.54e-04 | 4147.34 ms | 32.6% bf16 MFU | 126071 tok/s step 6906/19560 | loss 3.544579 (+1.21z)| norm 0.2434 (-0.95z)| lr 4.54e-04 | 4156.59 ms | 32.5% bf16 MFU | 126074 tok/s step 6907/19560 | loss 3.496100 (-0.12z)| norm 0.2506 (-0.78z)| lr 4.53e-04 | 4160.38 
ms | 32.5% bf16 MFU | 126071 tok/s step 6908/19560 | loss 3.446891 (-1.44z)| norm 0.2588 (-0.59z)| lr 4.53e-04 | 4150.77 ms | 32.5% bf16 MFU | 126083 tok/s step 6909/19560 | loss 3.521567 (+0.58z)| norm 0.2615 (-0.53z)| lr 4.53e-04 | 4152.16 ms | 32.5% bf16 MFU | 126092 tok/s step 6910/19560 | loss 3.452853 (-1.27z)| norm 0.2391 (-1.02z)| lr 4.53e-04 | 4152.70 ms | 32.5% bf16 MFU | 126100 tok/s step 6911/19560 | loss 3.482827 (-0.45z)| norm 0.2401 (-0.99z)| lr 4.53e-04 | 4174.36 ms | 32.3% bf16 MFU | 126075 tok/s step 6912/19560 | loss 3.417576 (-2.16z)| norm 0.2617 (-0.49z)| lr 4.53e-04 | 4159.21 ms | 32.5% bf16 MFU | 126074 tok/s step 6913/19560 | loss 3.495184 (-0.09z)| norm 0.2477 (-0.81z)| lr 4.53e-04 | 4162.12 ms | 32.4% bf16 MFU | 126069 tok/s step 6914/19560 | loss 3.467149 (-0.84z)| norm 0.2837 (-0.00z)| lr 4.53e-04 | 4155.31 ms | 32.5% bf16 MFU | 126074 tok/s step 6915/19560 | loss 3.566900 (+1.78z)| norm 0.2867 (+0.06z)| lr 4.53e-04 | 4163.59 ms | 32.4% bf16 MFU | 126066 tok/s step 6916/19560 | loss 3.452277 (-1.23z)| norm 0.2680 (-0.35z)| lr 4.53e-04 | 4158.86 ms | 32.5% bf16 MFU | 126066 tok/s step 6917/19560 | loss 3.501757 (+0.08z)| norm 0.2918 (+0.17z)| lr 4.53e-04 | 4160.87 ms | 32.4% bf16 MFU | 126063 tok/s step 6918/19560 | loss 3.457868 (-1.07z)| norm 0.2498 (-0.77z)| lr 4.53e-04 | 4162.35 ms | 32.4% bf16 MFU | 126058 tok/s step 6919/19560 | loss 3.467778 (-0.80z)| norm 0.2834 (-0.02z)| lr 4.53e-04 | 4157.17 ms | 32.5% bf16 MFU | 126061 tok/s step 6920/19560 | loss 3.499219 (+0.03z)| norm 0.2542 (-0.67z)| lr 4.53e-04 | 4155.59 ms | 32.5% bf16 MFU | 126066 tok/s step 6921/19560 | loss 3.467979 (-0.79z)| norm 0.2706 (-0.30z)| lr 4.53e-04 | 4154.17 ms | 32.5% bf16 MFU | 126073 tok/s step 6922/19560 | loss 3.458096 (-1.05z)| norm 0.2879 (+0.09z)| lr 4.53e-04 | 4157.73 ms | 32.5% bf16 MFU | 126075 tok/s step 6923/19560 | loss 3.645428 (+3.70z)| norm 1.1254 (+9.66z)| lr 4.53e-04 | 4170.16 ms | 32.4% bf16 MFU | 126057 tok/s step 6924/19560 | loss 
3.471766 (-0.71z)| norm 0.3447 (+0.62z)| lr 4.53e-04 | 4159.47 ms | 32.5% bf16 MFU | 126057 tok/s step 6925/19560 | loss 3.481440 (-0.46z)| norm 0.2921 (+0.01z)| lr 4.53e-04 | 4159.57 ms | 32.5% bf16 MFU | 126056 tok/s step 6926/19560 | loss 3.488463 (-0.29z)| norm 0.3274 (+0.41z)| lr 4.53e-04 | 4165.90 ms | 32.4% bf16 MFU | 126046 tok/s step 6927/19560 | loss 3.493089 (-0.17z)| norm 0.2568 (-0.40z)| lr 4.53e-04 | 4162.91 ms | 32.4% bf16 MFU | 126041 tok/s step 6928/19560 | loss 3.551503 (+1.40z)| norm 0.2955 (+0.05z)| lr 4.53e-04 | 4161.93 ms | 32.4% bf16 MFU | 126037 tok/s step 6929/19560 | loss 3.471220 (-0.76z)| norm 0.2767 (-0.17z)| lr 4.53e-04 | 4163.27 ms | 32.4% bf16 MFU | 126032 tok/s step 6930/19560 | loss 3.488674 (-0.29z)| norm 0.2574 (-0.39z)| lr 4.52e-04 | 4163.80 ms | 32.4% bf16 MFU | 126026 tok/s step 6931/19560 | loss 3.495963 (-0.07z)| norm 0.2777 (-0.16z)| lr 4.52e-04 | 4168.59 ms | 32.4% bf16 MFU | 126013 tok/s step 6932/19560 | loss 3.624245 (+3.37z)| norm 0.2529 (-0.44z)| lr 4.52e-04 | 4160.58 ms | 32.5% bf16 MFU | 126013 tok/s step 6933/19560 | loss 3.486290 (-0.36z)| norm 0.2607 (-0.35z)| lr 4.52e-04 | 4162.81 ms | 32.4% bf16 MFU | 126010 tok/s step 6934/19560 | loss 3.467967 (-0.84z)| norm 0.2590 (-0.36z)| lr 4.52e-04 | 4161.98 ms | 32.4% bf16 MFU | 126008 tok/s step 6935/19560 | loss 3.490140 (-0.24z)| norm 0.2590 (-0.36z)| lr 4.52e-04 | 4160.02 ms | 32.5% bf16 MFU | 126009 tok/s step 6936/19560 | loss 3.567586 (+1.82z)| norm 0.2966 (+0.07z)| lr 4.52e-04 | 4164.91 ms | 32.4% bf16 MFU | 126003 tok/s step 6937/19560 | loss 3.453742 (-1.22z)| norm 0.2951 (+0.05z)| lr 4.52e-04 | 4158.70 ms | 32.5% bf16 MFU | 126006 tok/s step 6938/19560 | loss 3.524916 (+0.69z)| norm 0.3007 (+0.11z)| lr 4.52e-04 | 4155.31 ms | 32.5% bf16 MFU | 126014 tok/s step 6939/19560 | loss 3.486865 (-0.33z)| norm 0.3022 (+0.13z)| lr 4.52e-04 | 4155.59 ms | 32.5% bf16 MFU | 126022 tok/s step 6940/19560 | loss 3.467168 (-0.85z)| norm 0.2924 (+0.01z)| lr 4.52e-04 | 4161.37 
ms | 32.4% bf16 MFU | 126020 tok/s step 6941/19560 | loss 3.511707 (+0.33z)| norm 0.2867 (-0.05z)| lr 4.52e-04 | 4164.79 ms | 32.4% bf16 MFU | 126014 tok/s step 6942/19560 | loss 3.500428 (+0.04z)| norm 0.2774 (-0.16z)| lr 4.52e-04 | 4160.26 ms | 32.5% bf16 MFU | 126014 tok/s step 6943/19560 | loss 3.465325 (-0.90z)| norm 0.2935 (+0.02z)| lr 4.52e-04 | 4153.75 ms | 32.5% bf16 MFU | 126024 tok/s step 6944/19560 | loss 3.480768 (-0.48z)| norm 0.2666 (-0.29z)| lr 4.52e-04 | 4172.46 ms | 32.4% bf16 MFU | 126006 tok/s step 6945/19560 | loss 3.612767 (+2.98z)| norm 0.2595 (-0.37z)| lr 4.52e-04 | 4157.73 ms | 32.5% bf16 MFU | 126011 tok/s step 6946/19560 | loss 3.501367 (+0.06z)| norm 0.2731 (-0.21z)| lr 4.52e-04 | 4153.72 ms | 32.5% bf16 MFU | 126021 tok/s step 6947/19560 | loss 3.490982 (-0.21z)| norm 0.2577 (-0.39z)| lr 4.52e-04 | 4159.34 ms | 32.5% bf16 MFU | 126023 tok/s step 6948/19560 | loss 3.468585 (-0.78z)| norm 0.2578 (-0.39z)| lr 4.52e-04 | 4159.50 ms | 32.5% bf16 MFU | 126024 tok/s step 6949/19560 | loss 3.505441 (+0.17z)| norm 0.2867 (-0.05z)| lr 4.52e-04 | 4149.77 ms | 32.5% bf16 MFU | 126040 tok/s step 6950/19560 | loss 3.417512 (-2.10z)| norm 0.2522 (-0.45z)| lr 4.52e-04 | 4147.45 ms | 32.6% bf16 MFU | 126058 tok/s step 6951/19560 | loss 3.477604 (-0.52z)| norm 0.3127 (+0.25z)| lr 4.52e-04 | 4168.14 ms | 32.4% bf16 MFU | 126045 tok/s step 6952/19560 | loss 3.483321 (-0.36z)| norm 0.2693 (-0.25z)| lr 4.52e-04 | 4162.20 ms | 32.4% bf16 MFU | 126041 tok/s step 6953/19560 | loss 3.504383 (+0.21z)| norm 0.2796 (-0.13z)| lr 4.51e-04 | 4156.93 ms | 32.5% bf16 MFU | 126045 tok/s step 6954/19560 | loss 3.473060 (-0.62z)| norm 0.2697 (-0.24z)| lr 4.51e-04 | 4161.98 ms | 32.4% bf16 MFU | 126041 tok/s step 6955/19560 | loss 3.546526 (+1.33z)| norm 0.2749 (-0.18z)| lr 4.51e-04 | 4165.35 ms | 32.4% bf16 MFU | 126032 tok/s step 6956/19560 | loss 3.487094 (-0.28z)| norm 0.2894 (-0.01z)| lr 4.51e-04 | 4161.24 ms | 32.4% bf16 MFU | 126030 tok/s step 6957/19560 | loss 
3.479797 (-0.47z)| norm 0.2851 (-0.06z)| lr 4.51e-04 | 4171.29 ms | 32.4% bf16 MFU | 126013 tok/s step 6958/19560 | loss 3.480189 (-0.46z)| norm 0.2960 (+0.06z)| lr 4.51e-04 | 4160.02 ms | 32.5% bf16 MFU | 126014 tok/s step 6959/19560 | loss 3.460456 (-0.98z)| norm 0.2606 (-0.34z)| lr 4.51e-04 | 4153.16 ms | 32.5% bf16 MFU | 126025 tok/s step 6960/19560 | loss 3.546356 (+1.33z)| norm 0.2675 (-0.26z)| lr 4.51e-04 | 4151.87 ms | 32.5% bf16 MFU | 126038 tok/s step 6961/19560 | loss 3.480435 (-0.45z)| norm 0.2742 (-0.18z)| lr 4.51e-04 | 4300.83 ms | 31.4% bf16 MFU | 125831 tok/s step 6962/19560 | loss 3.521733 (+0.66z)| norm 0.2838 (-0.08z)| lr 4.51e-04 | 4159.52 ms | 32.5% bf16 MFU | 125842 tok/s step 6963/19560 | loss 3.527654 (+0.81z)| norm 0.2891 (-0.01z)| lr 4.51e-04 | 4155.95 ms | 32.5% bf16 MFU | 125858 tok/s step 6964/19560 | loss 3.518296 (+0.55z)| norm 0.2858 (-0.05z)| lr 4.51e-04 | 4158.12 ms | 32.5% bf16 MFU | 125869 tok/s step 6965/19560 | loss 3.484409 (-0.35z)| norm 0.2566 (-0.39z)| lr 4.51e-04 | 4159.99 ms | 32.5% bf16 MFU | 125877 tok/s step 6966/19560 | loss 3.576433 (+2.08z)| norm 0.3029 (+0.14z)| lr 4.51e-04 | 4153.65 ms | 32.5% bf16 MFU | 125895 tok/s step 6967/19560 | loss 3.472959 (-0.66z)| norm 0.3054 (+0.17z)| lr 4.51e-04 | 4154.40 ms | 32.5% bf16 MFU | 125910 tok/s step 6968/19560 | loss 3.480822 (-0.44z)| norm 0.2711 (-0.23z)| lr 4.51e-04 | 4161.10 ms | 32.4% bf16 MFU | 125914 tok/s step 6969/19560 | loss 3.533524 (+0.96z)| norm 0.2734 (-0.20z)| lr 4.51e-04 | 4151.53 ms | 32.5% bf16 MFU | 125933 tok/s step 6970/19560 | loss 3.434440 (-1.64z)| norm 0.2825 (-0.10z)| lr 4.51e-04 | 4158.21 ms | 32.5% bf16 MFU | 125940 tok/s step 6971/19560 | loss 3.490009 (-0.17z)| norm 0.2938 (+0.03z)| lr 4.51e-04 | 4346.13 ms | 31.1% bf16 MFU | 125675 tok/s step 6972/19560 | loss 3.456532 (-1.05z)| norm 0.2663 (-0.29z)| lr 4.51e-04 | 4405.64 ms | 30.6% bf16 MFU | 125342 tok/s step 6973/19560 | loss 3.500328 (+0.14z)| norm 0.2605 (-0.36z)| lr 4.51e-04 | 4385.03 
ms | 30.8% bf16 MFU | 125053 tok/s step 6974/19560 | loss 3.535851 (+1.09z)| norm 0.2791 (-0.15z)| lr 4.51e-04 | 4233.35 ms | 31.9% bf16 MFU | 124992 tok/s step 6975/19560 | loss 3.512517 (+0.48z)| norm 0.2891 (-0.03z)| lr 4.51e-04 | 4206.58 ms | 32.1% bf16 MFU | 124975 tok/s step 6976/19560 | loss 3.465711 (-0.81z)| norm 0.2685 (-0.27z)| lr 4.51e-04 | 4170.85 ms | 32.4% bf16 MFU | 125011 tok/s step 6977/19560 | loss 3.502567 (+0.20z)| norm 0.2680 (-0.27z)| lr 4.50e-04 | 4203.73 ms | 32.1% bf16 MFU | 124996 tok/s step 6978/19560 | loss 3.461805 (-0.93z)| norm 0.2855 (-0.07z)| lr 4.50e-04 | 4174.50 ms | 32.3% bf16 MFU | 125026 tok/s step 6979/19560 | loss 3.454008 (-1.13z)| norm 0.3113 (+0.22z)| lr 4.50e-04 | 4173.38 ms | 32.4% bf16 MFU | 125056 tok/s step 6980/19560 | loss 3.513847 (+0.51z)| norm 0.2585 (-0.39z)| lr 4.50e-04 | 4191.49 ms | 32.2% bf16 MFU | 125058 tok/s step 6981/19560 | loss 3.456794 (-1.07z)| norm 0.3093 (+0.19z)| lr 4.50e-04 | 4164.87 ms | 32.4% bf16 MFU | 125099 tok/s step 6982/19560 | loss 3.524211 (+0.79z)| norm 0.3359 (+0.49z)| lr 4.50e-04 | 4170.69 ms | 32.4% bf16 MFU | 125129 tok/s step 6983/19560 | loss 3.485841 (-0.27z)| norm 0.3083 (+0.17z)| lr 4.50e-04 | 4186.07 ms | 32.3% bf16 MFU | 125135 tok/s step 6984/19560 | loss 3.512448 (+0.47z)| norm 0.3263 (+0.38z)| lr 4.50e-04 | 4223.24 ms | 32.0% bf16 MFU | 125086 tok/s step 6985/19560 | loss 3.529891 (+0.95z)| norm 0.2927 (-0.01z)| lr 4.50e-04 | 4158.29 ms | 32.5% bf16 MFU | 125135 tok/s step 6986/19560 | loss 3.536346 (+1.11z)| norm 0.3183 (+0.28z)| lr 4.50e-04 | 4159.79 ms | 32.5% bf16 MFU | 125181 tok/s step 6987/19560 | loss 3.487735 (-0.23z)| norm 0.3122 (+0.21z)| lr 4.50e-04 | 4176.80 ms | 32.3% bf16 MFU | 125198 tok/s step 6988/19560 | loss 3.541273 (+1.24z)| norm 0.2806 (-0.16z)| lr 4.50e-04 | 4201.76 ms | 32.1% bf16 MFU | 125177 tok/s step 6989/19560 | loss 3.498238 (+0.05z)| norm 0.3007 (+0.08z)| lr 4.50e-04 | 4169.99 ms | 32.4% bf16 MFU | 125204 tok/s step 6990/19560 | loss 
3.459906 (-1.00z)| norm 0.2778 (-0.18z)| lr 4.50e-04 | 4164.97 ms | 32.4% bf16 MFU | 125238 tok/s step 6991/19560 | loss 3.450383 (-1.24z)| norm 0.2730 (-0.24z)| lr 4.50e-04 | 4170.59 ms | 32.4% bf16 MFU | 125262 tok/s step 6992/19560 | loss 3.477554 (-0.51z)| norm 0.2976 (+0.05z)| lr 4.50e-04 | 4165.44 ms | 32.4% bf16 MFU | 125292 tok/s step 6993/19560 | loss 3.491431 (-0.11z)| norm 0.2887 (-0.05z)| lr 4.50e-04 | 4171.34 ms | 32.4% bf16 MFU | 125312 tok/s step 6994/19560 | loss 3.519781 (+0.67z)| norm 0.2879 (-0.06z)| lr 4.50e-04 | 4162.62 ms | 32.4% bf16 MFU | 125344 tok/s step 6995/19560 | loss 3.440425 (-1.51z)| norm 0.2770 (-0.18z)| lr 4.50e-04 | 4207.06 ms | 32.1% bf16 MFU | 125308 tok/s step 6996/19560 | loss 3.437923 (-1.55z)| norm 0.2922 (+0.03z)| lr 4.50e-04 | 4161.08 ms | 32.4% bf16 MFU | 125342 tok/s step 6997/19560 | loss 3.416629 (-2.07z)| norm 0.2835 (-0.07z)| lr 4.50e-04 | 4162.23 ms | 32.4% bf16 MFU | 125373 tok/s step 6998/19560 | loss 3.485678 (-0.22z)| norm 0.2743 (-0.18z)| lr 4.50e-04 | 4174.45 ms | 32.3% bf16 MFU | 125384 tok/s step 6999/19560 | loss 3.439456 (-1.43z)| norm 0.2681 (-0.26z)| lr 4.50e-04 | 4181.46 ms | 32.3% bf16 MFU | 125384 tok/s step 7000/19560 | loss 3.508019 (+0.38z)| norm 0.2776 (-0.13z)| lr 4.49e-04 | 4176.02 ms | 32.3% bf16 MFU | 125392 tok/s val loss 3.470341 evaluating HellaSwag: 0/1256 evaluating HellaSwag: 10/1256 evaluating HellaSwag: 20/1256 evaluating HellaSwag: 30/1256 evaluating HellaSwag: 40/1256 evaluating HellaSwag: 50/1256 evaluating HellaSwag: 60/1256 evaluating HellaSwag: 70/1256 evaluating HellaSwag: 80/1256 evaluating HellaSwag: 90/1256 evaluating HellaSwag: 100/1256 evaluating HellaSwag: 110/1256 evaluating HellaSwag: 120/1256 evaluating HellaSwag: 130/1256 evaluating HellaSwag: 140/1256 evaluating HellaSwag: 150/1256 evaluating HellaSwag: 160/1256 evaluating HellaSwag: 170/1256 evaluating HellaSwag: 180/1256 evaluating HellaSwag: 190/1256 evaluating HellaSwag: 200/1256 evaluating HellaSwag: 210/1256 
evaluating HellaSwag: 220/1256 evaluating HellaSwag: 230/1256 evaluating HellaSwag: 240/1256 evaluating HellaSwag: 250/1256 evaluating HellaSwag: 260/1256 evaluating HellaSwag: 270/1256 evaluating HellaSwag: 280/1256 evaluating HellaSwag: 290/1256 evaluating HellaSwag: 300/1256 evaluating HellaSwag: 310/1256 evaluating HellaSwag: 320/1256 evaluating HellaSwag: 330/1256 evaluating HellaSwag: 340/1256 evaluating HellaSwag: 350/1256 evaluating HellaSwag: 360/1256 evaluating HellaSwag: 370/1256 evaluating HellaSwag: 380/1256 evaluating HellaSwag: 390/1256 evaluating HellaSwag: 400/1256 evaluating HellaSwag: 410/1256 evaluating HellaSwag: 420/1256 evaluating HellaSwag: 430/1256 evaluating HellaSwag: 440/1256 evaluating HellaSwag: 450/1256 evaluating HellaSwag: 460/1256 evaluating HellaSwag: 470/1256 evaluating HellaSwag: 480/1256 evaluating HellaSwag: 490/1256 evaluating HellaSwag: 500/1256 evaluating HellaSwag: 510/1256 evaluating HellaSwag: 520/1256 evaluating HellaSwag: 530/1256 evaluating HellaSwag: 540/1256 evaluating HellaSwag: 550/1256 evaluating HellaSwag: 560/1256 evaluating HellaSwag: 570/1256 evaluating HellaSwag: 580/1256 evaluating HellaSwag: 590/1256 evaluating HellaSwag: 600/1256 evaluating HellaSwag: 610/1256 evaluating HellaSwag: 620/1256 evaluating HellaSwag: 630/1256 evaluating HellaSwag: 640/1256 evaluating HellaSwag: 650/1256 evaluating HellaSwag: 660/1256 evaluating HellaSwag: 670/1256 evaluating HellaSwag: 680/1256 evaluating HellaSwag: 690/1256 evaluating HellaSwag: 700/1256 evaluating HellaSwag: 710/1256 evaluating HellaSwag: 720/1256 evaluating HellaSwag: 730/1256 evaluating HellaSwag: 740/1256 evaluating HellaSwag: 750/1256 evaluating HellaSwag: 760/1256 evaluating HellaSwag: 770/1256 evaluating HellaSwag: 780/1256 evaluating HellaSwag: 790/1256 evaluating HellaSwag: 800/1256 evaluating HellaSwag: 810/1256 evaluating HellaSwag: 820/1256 evaluating HellaSwag: 830/1256 evaluating HellaSwag: 840/1256 evaluating HellaSwag: 850/1256 evaluating 
HellaSwag: 860/1256 evaluating HellaSwag: 870/1256 evaluating HellaSwag: 880/1256 evaluating HellaSwag: 890/1256 evaluating HellaSwag: 900/1256 evaluating HellaSwag: 910/1256 evaluating HellaSwag: 920/1256 evaluating HellaSwag: 930/1256 evaluating HellaSwag: 940/1256 evaluating HellaSwag: 950/1256 evaluating HellaSwag: 960/1256 evaluating HellaSwag: 970/1256 evaluating HellaSwag: 980/1256 evaluating HellaSwag: 990/1256 evaluating HellaSwag: 1000/1256 evaluating HellaSwag: 1010/1256 evaluating HellaSwag: 1020/1256 evaluating HellaSwag: 1030/1256 evaluating HellaSwag: 1040/1256 evaluating HellaSwag: 1050/1256 evaluating HellaSwag: 1060/1256 evaluating HellaSwag: 1070/1256 evaluating HellaSwag: 1080/1256 evaluating HellaSwag: 1090/1256 evaluating HellaSwag: 1100/1256 evaluating HellaSwag: 1110/1256 evaluating HellaSwag: 1120/1256 evaluating HellaSwag: 1130/1256 evaluating HellaSwag: 1140/1256 evaluating HellaSwag: 1150/1256 evaluating HellaSwag: 1160/1256 evaluating HellaSwag: 1170/1256 evaluating HellaSwag: 1180/1256 evaluating HellaSwag: 1190/1256 evaluating HellaSwag: 1200/1256 evaluating HellaSwag: 1210/1256 evaluating HellaSwag: 1220/1256 evaluating HellaSwag: 1230/1256 evaluating HellaSwag: 1240/1256 evaluating HellaSwag: 1250/1256 HellaSwag: 2802/10042 = 0.279028 step 7001/19560 | loss 3.473511 (-0.54z)| norm 0.2622 (-0.33z)| lr 4.49e-04 | 4161.21 ms | 32.4% bf16 MFU | 125422 tok/s step 7002/19560 | loss 3.514997 (+0.56z)| norm 0.2731 (-0.18z)| lr 4.49e-04 | 4158.70 ms | 32.5% bf16 MFU | 125455 tok/s step 7003/19560 | loss 3.554286 (+1.58z)| norm 0.2842 (-0.04z)| lr 4.49e-04 | 4169.25 ms | 32.4% bf16 MFU | 125470 tok/s step 7004/19560 | loss 3.441978 (-1.38z)| norm 0.2952 (+0.10z)| lr 4.49e-04 | 4164.31 ms | 32.4% bf16 MFU | 125491 tok/s step 7005/19560 | loss 3.446217 (-1.25z)| norm 0.2653 (-0.28z)| lr 4.49e-04 | 4161.81 ms | 32.4% bf16 MFU | 125515 tok/s step 7006/19560 | loss 3.525656 (+0.82z)| norm 0.3006 (+0.17z)| lr 4.49e-04 | 4250.22 ms | 31.8% bf16 MFU 
| 125407 tok/s step 7007/19560 | loss 3.497736 (+0.09z)| norm 0.3210 (+0.43z)| lr 4.49e-04 | 4152.38 ms | 32.5% bf16 MFU | 125450 tok/s step 7008/19560 | loss 3.567508 (+1.89z)| norm 0.2786 (-0.12z)| lr 4.49e-04 | 4179.87 ms | 32.3% bf16 MFU | 125449 tok/s step 7009/19560 | loss 3.464382 (-0.77z)| norm 0.2969 (+0.12z)| lr 4.49e-04 | 4161.68 ms | 32.4% bf16 MFU | 125476 tok/s step 7010/19560 | loss 3.725788 (+5.25z)| norm 0.3140 (+0.34z)| lr 4.49e-04 | 4154.22 ms | 32.5% bf16 MFU | 125512 tok/s step 7011/19560 | loss 3.489194 (-0.15z)| norm 0.2952 (+0.09z)| lr 4.49e-04 | 4174.46 ms | 32.3% bf16 MFU | 125516 tok/s step 7012/19560 | loss 3.486716 (-0.20z)| norm 0.3125 (+0.31z)| lr 4.49e-04 | 4180.56 ms | 32.3% bf16 MFU | 125511 tok/s step 7013/19560 | loss 3.498213 (+0.06z)| norm 0.2906 (+0.02z)| lr 4.49e-04 | 4221.69 ms | 32.0% bf16 MFU | 125445 tok/s step 7014/19560 | loss 3.517797 (+0.50z)| norm 0.2830 (-0.07z)| lr 4.49e-04 | 4171.69 ms | 32.4% bf16 MFU | 125457 tok/s step 7015/19560 | loss 3.470667 (-0.58z)| norm 0.3264 (+0.48z)| lr 4.49e-04 | 4145.68 ms | 32.6% bf16 MFU | 125507 tok/s step 7016/19560 | loss 3.516790 (+0.47z)| norm 0.2828 (-0.08z)| lr 4.49e-04 | 4182.70 ms | 32.3% bf16 MFU | 125499 tok/s step 7017/19560 | loss 3.573644 (+1.74z)| norm 0.2857 (-0.05z)| lr 4.49e-04 | 4172.48 ms | 32.4% bf16 MFU | 125507 tok/s step 7018/19560 | loss 3.534887 (+0.85z)| norm 0.2930 (+0.05z)| lr 4.49e-04 | 4174.68 ms | 32.3% bf16 MFU | 125511 tok/s step 7019/19560 | loss 3.521459 (+0.53z)| norm 0.3037 (+0.18z)| lr 4.49e-04 | 4162.31 ms | 32.4% bf16 MFU | 125533 tok/s step 7020/19560 | loss 3.493422 (-0.10z)| norm 0.2861 (-0.04z)| lr 4.49e-04 | 4176.03 ms | 32.3% bf16 MFU | 125534 tok/s step 7021/19560 | loss 3.482858 (-0.34z)| norm 0.2872 (-0.03z)| lr 4.49e-04 | 4175.19 ms | 32.3% bf16 MFU | 125536 tok/s step 7022/19560 | loss 3.460735 (-0.85z)| norm 0.3015 (+0.15z)| lr 4.49e-04 | 4177.04 ms | 32.3% bf16 MFU | 125535 tok/s step 7023/19560 | loss 3.431949 (-1.49z)| norm 
0.2703 (-0.25z)| lr 4.48e-04 | 4176.53 ms | 32.3% bf16 MFU | 125535 tok/s step 7024/19560 | loss 3.477295 (-0.45z)| norm 0.2813 (-0.11z)| lr 4.48e-04 | 4164.47 ms | 32.4% bf16 MFU | 125553 tok/s step 7025/19560 | loss 3.506829 (+0.22z)| norm 0.2812 (-0.11z)| lr 4.48e-04 | 4174.49 ms | 32.3% bf16 MFU | 125555 tok/s step 7026/19560 | loss 3.557115 (+1.34z)| norm 0.3001 (+0.14z)| lr 4.48e-04 | 4168.00 ms | 32.4% bf16 MFU | 125567 tok/s step 7027/19560 | loss 3.528698 (+0.70z)| norm 0.2802 (-0.12z)| lr 4.48e-04 | 4174.15 ms | 32.3% bf16 MFU | 125568 tok/s step 7028/19560 | loss 3.490946 (-0.15z)| norm 0.2532 (-0.46z)| lr 4.48e-04 | 4172.32 ms | 32.4% bf16 MFU | 125573 tok/s step 7029/19560 | loss 3.472371 (-0.56z)| norm 0.2550 (-0.43z)| lr 4.48e-04 | 4187.95 ms | 32.2% bf16 MFU | 125554 tok/s step 7030/19560 | loss 3.481286 (-0.35z)| norm 0.2751 (-0.17z)| lr 4.48e-04 | 4157.25 ms | 32.5% bf16 MFU | 125582 tok/s step 7031/19560 | loss 3.501498 (+0.10z)| norm 0.2868 (-0.02z)| lr 4.48e-04 | 4162.22 ms | 32.4% bf16 MFU | 125601 tok/s step 7032/19560 | loss 3.527651 (+0.69z)| norm 0.4765 (+2.37z)| lr 4.48e-04 | 4168.27 ms | 32.4% bf16 MFU | 125610 tok/s step 7033/19560 | loss 3.486803 (-0.23z)| norm 0.3041 (+0.18z)| lr 4.48e-04 | 4169.08 ms | 32.4% bf16 MFU | 125617 tok/s step 7034/19560 | loss 3.515408 (+0.43z)| norm 0.3040 (+0.17z)| lr 4.48e-04 | 4156.51 ms | 32.5% bf16 MFU | 125643 tok/s step 7035/19560 | loss 3.468647 (-0.63z)| norm 0.2762 (-0.19z)| lr 4.48e-04 | 4156.98 ms | 32.5% bf16 MFU | 125667 tok/s step 7036/19560 | loss 3.479302 (-0.39z)| norm 0.3286 (+0.47z)| lr 4.48e-04 | 4179.24 ms | 32.3% bf16 MFU | 125656 tok/s step 7037/19560 | loss 3.487597 (-0.20z)| norm 0.2973 (+0.07z)| lr 4.48e-04 | 4161.76 ms | 32.4% bf16 MFU | 125672 tok/s step 7038/19560 | loss 3.444122 (-1.18z)| norm 0.2740 (-0.23z)| lr 4.48e-04 | 4162.65 ms | 32.4% bf16 MFU | 125686 tok/s step 7039/19560 | loss 3.481189 (-0.34z)| norm 0.2836 (-0.11z)| lr 4.48e-04 | 4173.81 ms | 32.3% bf16 MFU | 
125683 tok/s step 7040/19560 | loss 3.506010 (+0.21z)| norm 0.2873 (-0.07z)| lr 4.48e-04 | 4160.87 ms | 32.4% bf16 MFU | 125699 tok/s step 7041/19560 | loss 3.672491 (+3.77z)| norm 0.2920 (-0.01z)| lr 4.48e-04 | 4169.57 ms | 32.4% bf16 MFU | 125701 tok/s step 7042/19560 | loss 3.466216 (-0.70z)| norm 0.3189 (+0.33z)| lr 4.48e-04 | 4173.53 ms | 32.4% bf16 MFU | 125697 tok/s step 7043/19560 | loss 3.461942 (-0.78z)| norm 0.3053 (+0.15z)| lr 4.48e-04 | 4165.95 ms | 32.4% bf16 MFU | 125705 tok/s step 7044/19560 | loss 3.487725 (-0.22z)| norm 0.2915 (-0.03z)| lr 4.48e-04 | 4204.84 ms | 32.1% bf16 MFU | 125654 tok/s step 7045/19560 | loss 3.494561 (-0.07z)| norm 0.2881 (-0.07z)| lr 4.48e-04 | 4193.74 ms | 32.2% bf16 MFU | 125622 tok/s step 7046/19560 | loss 3.458716 (-0.85z)| norm 0.3146 (+0.26z)| lr 4.47e-04 | 4179.79 ms | 32.3% bf16 MFU | 125612 tok/s step 7047/19560 | loss 3.526087 (+0.61z)| norm 0.2725 (-0.27z)| lr 4.47e-04 | 4171.00 ms | 32.4% bf16 MFU | 125617 tok/s step 7048/19560 | loss 3.485499 (-0.28z)| norm 0.2990 (+0.06z)| lr 4.47e-04 | 4163.36 ms | 32.4% bf16 MFU | 125632 tok/s step 7049/19560 | loss 3.480127 (-0.40z)| norm 0.2887 (-0.07z)| lr 4.47e-04 | 4240.96 ms | 31.8% bf16 MFU | 125532 tok/s step 7050/19560 | loss 3.480468 (-0.39z)| norm 0.2689 (-0.33z)| lr 4.47e-04 | 4170.50 ms | 32.4% bf16 MFU | 125541 tok/s step 7051/19560 | loss 3.446285 (-1.15z)| norm 0.2716 (-0.63z)| lr 4.47e-04 | 4159.63 ms | 32.5% bf16 MFU | 125566 tok/s step 7052/19560 | loss 3.490354 (-0.15z)| norm 0.2798 (-0.29z)| lr 4.47e-04 | 4179.99 ms | 32.3% bf16 MFU | 125559 tok/s step 7053/19560 | loss 3.431221 (-1.48z)| norm 0.3013 (+0.57z)| lr 4.47e-04 | 4192.20 ms | 32.2% bf16 MFU | 125534 tok/s step 7054/19560 | loss 3.453848 (-0.96z)| norm 0.2881 (+0.05z)| lr 4.47e-04 | 4152.73 ms | 32.5% bf16 MFU | 125570 tok/s step 7055/19560 | loss 3.530513 (+0.76z)| norm 0.2888 (+0.07z)| lr 4.47e-04 | 4156.93 ms | 32.5% bf16 MFU | 125598 tok/s step 7056/19560 | loss 3.448761 (-1.06z)| norm 
0.2953 (+0.33z)| lr 4.47e-04 | 4160.79 ms | 32.4% bf16 MFU | 125618 tok/s step 7057/19560 | loss 3.486824 (-0.20z)| norm 0.2730 (-0.58z)| lr 4.47e-04 | 4173.06 ms | 32.4% bf16 MFU | 125619 tok/s step 7058/19560 | loss 3.523591 (+0.62z)| norm 0.2793 (-0.32z)| lr 4.47e-04 | 4150.02 ms | 32.5% bf16 MFU | 125655 tok/s step 7059/19560 | loss 3.487537 (-0.19z)| norm 0.3142 (+1.09z)| lr 4.47e-04 | 4216.39 ms | 32.0% bf16 MFU | 125589 tok/s step 7060/19560 | loss 3.429504 (-1.50z)| norm 0.2771 (-0.44z)| lr 4.47e-04 | 4176.96 ms | 32.3% bf16 MFU | 125586 tok/s step 7061/19560 | loss 3.455862 (-0.89z)| norm 0.2752 (-0.52z)| lr 4.47e-04 | 4153.64 ms | 32.5% bf16 MFU | 125618 tok/s step 7062/19560 | loss 3.512615 (+0.41z)| norm 0.2955 (+0.31z)| lr 4.47e-04 | 4180.26 ms | 32.3% bf16 MFU | 125608 tok/s step 7063/19560 | loss 3.495121 (+0.01z)| norm 0.2620 (-1.09z)| lr 4.47e-04 | 4172.82 ms | 32.4% bf16 MFU | 125610 tok/s step 7064/19560 | loss 3.548100 (+1.24z)| norm 0.2910 (+0.12z)| lr 4.47e-04 | 4184.60 ms | 32.3% bf16 MFU | 125594 tok/s step 7065/19560 | loss 3.472354 (-0.52z)| norm 0.2686 (-0.80z)| lr 4.47e-04 | 4171.19 ms | 32.4% bf16 MFU | 125599 tok/s step 7066/19560 | loss 3.623676 (+2.88z)| norm 0.2585 (-1.20z)| lr 4.47e-04 | 4446.93 ms | 30.4% bf16 MFU | 125214 tok/s step 7067/19560 | loss 3.458519 (-0.82z)| norm 0.2767 (-0.44z)| lr 4.47e-04 | 4164.42 ms | 32.4% bf16 MFU | 125248 tok/s step 7068/19560 | loss 3.463174 (-0.72z)| norm 0.2533 (-1.38z)| lr 4.47e-04 | 4159.66 ms | 32.5% bf16 MFU | 125287 tok/s step 7069/19560 | loss 3.497251 (+0.05z)| norm 0.2857 (-0.06z)| lr 4.46e-04 | 4165.79 ms | 32.4% bf16 MFU | 125316 tok/s step 7070/19560 | loss 3.492436 (-0.06z)| norm 0.2820 (-0.21z)| lr 4.46e-04 | 4172.78 ms | 32.4% bf16 MFU | 125332 tok/s step 7071/19560 | loss 3.539323 (+0.98z)| norm 0.2789 (-0.33z)| lr 4.46e-04 | 4166.88 ms | 32.4% bf16 MFU | 125357 tok/s step 7072/19560 | loss 3.547052 (+1.13z)| norm 0.2744 (-0.52z)| lr 4.46e-04 | 4168.85 ms | 32.4% bf16 MFU | 
125377 tok/s step 7073/19560 | loss 3.485139 (-0.23z)| norm 0.2699 (-0.71z)| lr 4.46e-04 | 4158.38 ms | 32.5% bf16 MFU | 125412 tok/s step 7074/19560 | loss 3.508308 (+0.30z)| norm 0.2921 (+0.20z)| lr 4.46e-04 | 4161.09 ms | 32.4% bf16 MFU | 125442 tok/s step 7075/19560 | loss 3.501498 (+0.14z)| norm 0.2802 (-0.30z)| lr 4.46e-04 | 4154.24 ms | 32.5% bf16 MFU | 125480 tok/s step 7076/19560 | loss 3.488042 (-0.17z)| norm 0.2668 (-0.86z)| lr 4.46e-04 | 4153.52 ms | 32.5% bf16 MFU | 125517 tok/s step 7077/19560 | loss 3.577672 (+1.85z)| norm 0.2935 (+0.25z)| lr 4.46e-04 | 4174.83 ms | 32.3% bf16 MFU | 125520 tok/s step 7078/19560 | loss 3.476896 (-0.45z)| norm 0.2759 (-0.50z)| lr 4.46e-04 | 4178.52 ms | 32.3% bf16 MFU | 125518 tok/s step 7079/19560 | loss 3.599216 (+2.28z)| norm 0.2635 (-1.00z)| lr 4.46e-04 | 4152.13 ms | 32.5% bf16 MFU | 125556 tok/s step 7080/19560 | loss 3.440505 (-1.26z)| norm 0.3148 (+1.13z)| lr 4.46e-04 | 4172.37 ms | 32.4% bf16 MFU | 125561 tok/s step 7081/19560 | loss 3.433541 (-1.39z)| norm 0.3018 (+0.58z)| lr 4.46e-04 | 4163.20 ms | 32.4% bf16 MFU | 125579 tok/s step 7082/19560 | loss 3.498078 (+0.03z)| norm 0.2895 (+0.06z)| lr 4.46e-04 | 4167.62 ms | 32.4% bf16 MFU | 125590 tok/s step 7083/19560 | loss 3.509553 (+0.29z)| norm 0.3159 (+1.15z)| lr 4.46e-04 | 4164.43 ms | 32.4% bf16 MFU | 125606 tok/s step 7084/19560 | loss 3.455897 (-0.89z)| norm 0.2625 (-1.06z)| lr 4.46e-04 | 4166.09 ms | 32.4% bf16 MFU | 125618 tok/s step 7085/19560 | loss 3.471912 (-0.54z)| norm 0.2987 (+0.43z)| lr 4.46e-04 | 4162.89 ms | 32.4% bf16 MFU | 125634 tok/s step 7086/19560 | loss 3.536919 (+0.89z)| norm 0.2910 (+0.12z)| lr 4.46e-04 | 4188.37 ms | 32.2% bf16 MFU | 125611 tok/s step 7087/19560 | loss 3.424825 (-1.57z)| norm 0.2926 (+0.17z)| lr 4.46e-04 | 4162.59 ms | 32.4% bf16 MFU | 125628 tok/s step 7088/19560 | loss 3.485788 (-0.22z)| norm 0.2918 (+0.13z)| lr 4.46e-04 | 4177.29 ms | 32.3% bf16 MFU | 125622 tok/s step 7089/19560 | loss 3.432467 (-1.38z)| norm 
0.2589 (-1.23z)| lr 4.46e-04 | 4173.70 ms | 32.3% bf16 MFU | 125622 tok/s step 7090/19560 | loss 3.457630 (-0.82z)| norm 0.2612 (-1.13z)| lr 4.46e-04 | 4163.36 ms | 32.4% bf16 MFU | 125637 tok/s step 7091/19560 | loss 3.512072 (+0.38z)| norm 0.2592 (-1.19z)| lr 4.46e-04 | 4159.06 ms | 32.5% bf16 MFU | 125658 tok/s step 7092/19560 | loss 3.483745 (-0.24z)| norm 0.2814 (-0.28z)| lr 4.45e-04 | 4168.43 ms | 32.4% bf16 MFU | 125664 tok/s step 7093/19560 | loss 3.507517 (+0.28z)| norm 0.2750 (-0.55z)| lr 4.45e-04 | 4173.58 ms | 32.4% bf16 MFU | 125662 tok/s step 7094/19560 | loss 3.471696 (-0.49z)| norm 0.2641 (-0.98z)| lr 4.45e-04 | 4165.64 ms | 32.4% bf16 MFU | 125672 tok/s step 7095/19560 | loss 3.458374 (-0.78z)| norm 0.2583 (-1.21z)| lr 4.45e-04 | 4177.72 ms | 32.3% bf16 MFU | 125663 tok/s step 7096/19560 | loss 3.488282 (-0.12z)| norm 0.2696 (-0.74z)| lr 4.45e-04 | 4213.55 ms | 32.0% bf16 MFU | 125602 tok/s step 7097/19560 | loss 3.458575 (-0.77z)| norm 0.2877 (-0.00z)| lr 4.45e-04 | 4156.59 ms | 32.5% bf16 MFU | 125628 tok/s step 7098/19560 | loss 3.566616 (+1.60z)| norm 0.2537 (-1.38z)| lr 4.45e-04 | 4163.40 ms | 32.4% bf16 MFU | 125643 tok/s step 7099/19560 | loss 3.492923 (-0.03z)| norm 0.3139 (+1.07z)| lr 4.45e-04 | 4168.98 ms | 32.4% bf16 MFU | 125649 tok/s step 7100/19560 | loss 3.479097 (-0.34z)| norm 0.2910 (+0.13z)| lr 4.45e-04 | 4172.48 ms | 32.4% bf16 MFU | 125649 tok/s step 7101/19560 | loss 3.484457 (-0.22z)| norm 0.2969 (+0.36z)| lr 4.45e-04 | 4151.21 ms | 32.5% bf16 MFU | 125682 tok/s step 7102/19560 | loss 3.439388 (-1.20z)| norm 0.3336 (+1.83z)| lr 4.45e-04 | 4244.59 ms | 31.8% bf16 MFU | 125573 tok/s step 7103/19560 | loss 3.527360 (+0.74z)| norm 0.2860 (-0.10z)| lr 4.45e-04 | 4182.50 ms | 32.3% bf16 MFU | 125562 tok/s step 7104/19560 | loss 3.481321 (-0.28z)| norm 0.2974 (+0.35z)| lr 4.45e-04 | 4151.35 ms | 32.5% bf16 MFU | 125599 tok/s step 7105/19560 | loss 3.495745 (+0.04z)| norm 0.2750 (-0.56z)| lr 4.45e-04 | 4168.26 ms | 32.4% bf16 MFU | 
125608 tok/s step 7106/19560 | loss 3.518747 (+0.54z)| norm 0.2635 (-1.02z)| lr 4.45e-04 | 4171.42 ms | 32.4% bf16 MFU | 125612 tok/s step 7107/19560 | loss 3.535142 (+0.89z)| norm 0.3012 (+0.51z)| lr 4.45e-04 | 4164.93 ms | 32.4% bf16 MFU | 125625 tok/s step 7108/19560 | loss 3.543169 (+1.06z)| norm 0.2774 (-0.46z)| lr 4.45e-04 | 4158.14 ms | 32.5% bf16 MFU | 125648 tok/s step 7109/19560 | loss 3.583272 (+1.90z)| norm 0.3083 (+0.80z)| lr 4.45e-04 | 4148.46 ms | 32.5% bf16 MFU | 125685 tok/s step 7110/19560 | loss 3.477359 (-0.40z)| norm 0.2833 (-0.21z)| lr 4.45e-04 | 4165.81 ms | 32.4% bf16 MFU | 125694 tok/s step 7111/19560 | loss 3.550994 (+1.18z)| norm 0.2696 (-0.76z)| lr 4.45e-04 | 4153.31 ms | 32.5% bf16 MFU | 125721 tok/s step 7112/19560 | loss 3.473181 (-0.49z)| norm 0.2726 (-0.63z)| lr 4.45e-04 | 4163.60 ms | 32.4% bf16 MFU | 125731 tok/s step 7113/19560 | loss 3.511244 (+0.33z)| norm 0.2887 (+0.05z)| lr 4.45e-04 | 4152.87 ms | 32.5% bf16 MFU | 125757 tok/s step 7114/19560 | loss 3.477543 (-0.39z)| norm 0.2637 (-0.98z)| lr 4.44e-04 | 4158.88 ms | 32.5% bf16 MFU | 125772 tok/s step 7115/19560 | loss 3.467628 (-0.60z)| norm 0.2653 (-0.90z)| lr 4.44e-04 | 4164.81 ms | 32.4% bf16 MFU | 125778 tok/s step 7116/19560 | loss 3.479574 (-0.33z)| norm 0.2641 (-0.94z)| lr 4.44e-04 | 4159.72 ms | 32.5% bf16 MFU | 125791 tok/s step 7117/19560 | loss 3.515891 (+0.46z)| norm 0.2629 (-0.98z)| lr 4.44e-04 | 4157.60 ms | 32.5% bf16 MFU | 125806 tok/s step 7118/19560 | loss 3.449781 (-0.98z)| norm 0.2361 (-2.05z)| lr 4.44e-04 | 4180.94 ms | 32.3% bf16 MFU | 125786 tok/s step 7119/19560 | loss 3.472334 (-0.49z)| norm 0.2594 (-1.08z)| lr 4.44e-04 | 4178.53 ms | 32.3% bf16 MFU | 125770 tok/s step 7120/19560 | loss 3.497434 (+0.05z)| norm 0.2659 (-0.80z)| lr 4.44e-04 | 4155.40 ms | 32.5% bf16 MFU | 125790 tok/s step 7121/19560 | loss 3.473595 (-0.47z)| norm 0.2726 (-0.52z)| lr 4.44e-04 | 4156.95 ms | 32.5% bf16 MFU | 125807 tok/s step 7122/19560 | loss 3.464535 (-0.65z)| norm 
0.2646 (-0.84z)| lr 4.44e-04 | 4235.58 ms | 31.9% bf16 MFU | 125706 tok/s step 7123/19560 | loss 3.421496 (-1.58z)| norm 0.2815 (-0.16z)| lr 4.44e-04 | 4152.71 ms | 32.5% bf16 MFU | 125733 tok/s step 7124/19560 | loss 3.498400 (+0.08z)| norm 0.2535 (-1.27z)| lr 4.44e-04 | 4162.65 ms | 32.4% bf16 MFU | 125744 tok/s step 7125/19560 | loss 3.482566 (-0.28z)| norm 0.2643 (-0.83z)| lr 4.44e-04 | 4183.43 ms | 32.3% bf16 MFU | 125723 tok/s step 7126/19560 | loss 3.463068 (-0.71z)| norm 0.2777 (-0.29z)| lr 4.44e-04 | 4158.75 ms | 32.5% bf16 MFU | 125740 tok/s step 7127/19560 | loss 3.470838 (-0.54z)| norm 0.2854 (+0.01z)| lr 4.44e-04 | 4171.74 ms | 32.4% bf16 MFU | 125737 tok/s step 7128/19560 | loss 3.507130 (+0.26z)| norm 0.2759 (-0.37z)| lr 4.44e-04 | 4176.19 ms | 32.3% bf16 MFU | 125727 tok/s step 7129/19560 | loss 3.593554 (+2.11z)| norm 0.3173 (+1.28z)| lr 4.44e-04 | 4155.44 ms | 32.5% bf16 MFU | 125749 tok/s step 7130/19560 | loss 3.464692 (-0.68z)| norm 0.3276 (+1.66z)| lr 4.44e-04 | 4161.41 ms | 32.4% bf16 MFU | 125761 tok/s step 7131/19560 | loss 3.468689 (-0.58z)| norm 0.2799 (-0.24z)| lr 4.44e-04 | 4171.43 ms | 32.4% bf16 MFU | 125757 tok/s step 7132/19560 | loss 3.539587 (+0.95z)| norm 0.2829 (-0.11z)| lr 4.44e-04 | 4152.58 ms | 32.5% bf16 MFU | 125782 tok/s step 7133/19560 | loss 3.571381 (+1.62z)| norm 0.3040 (+0.72z)| lr 4.44e-04 | 4157.30 ms | 32.5% bf16 MFU | 125799 tok/s step 7134/19560 | loss 3.530844 (+0.73z)| norm 0.2686 (-0.69z)| lr 4.44e-04 | 4154.60 ms | 32.5% bf16 MFU | 125819 tok/s step 7135/19560 | loss 3.487108 (-0.22z)| norm 0.2686 (-0.68z)| lr 4.44e-04 | 4177.72 ms | 32.3% bf16 MFU | 125803 tok/s step 7136/19560 | loss 3.482369 (-0.31z)| norm 0.2798 (-0.22z)| lr 4.44e-04 | 4162.18 ms | 32.4% bf16 MFU | 125811 tok/s step 7137/19560 | loss 3.540028 (+0.94z)| norm 0.2800 (-0.21z)| lr 4.43e-04 | 4153.83 ms | 32.5% bf16 MFU | 125831 tok/s step 7138/19560 | loss 3.491577 (-0.09z)| norm 0.2687 (-0.66z)| lr 4.43e-04 | 4155.02 ms | 32.5% bf16 MFU | 
125849 tok/s step 7139/19560 | loss 3.504121 (+0.22z)| norm 0.2924 (+0.30z)| lr 4.43e-04 | 4168.61 ms | 32.4% bf16 MFU | 125845 tok/s step 7140/19560 | loss 3.576438 (+1.94z)| norm 0.2665 (-0.73z)| lr 4.43e-04 | 4177.85 ms | 32.3% bf16 MFU | 125827 tok/s step 7141/19560 | loss 3.470709 (-0.60z)| norm 0.2749 (-0.38z)| lr 4.43e-04 | 4172.25 ms | 32.4% bf16 MFU | 125819 tok/s step 7142/19560 | loss 3.500462 (+0.12z)| norm 0.2447 (-1.58z)| lr 4.43e-04 | 4170.25 ms | 32.4% bf16 MFU | 125814 tok/s step 7143/19560 | loss 3.496931 (+0.03z)| norm 0.3087 (+1.00z)| lr 4.43e-04 | 4170.86 ms | 32.4% bf16 MFU | 125808 tok/s step 7144/19560 | loss 3.566001 (+1.67z)| norm 0.2780 (-0.24z)| lr 4.43e-04 | 4167.94 ms | 32.4% bf16 MFU | 125807 tok/s step 7145/19560 | loss 3.446388 (-1.17z)| norm 0.2819 (-0.08z)| lr 4.43e-04 | 4182.05 ms | 32.3% bf16 MFU | 125785 tok/s step 7146/19560 | loss 3.464202 (-0.73z)| norm 0.2899 (+0.24z)| lr 4.43e-04 | 4199.10 ms | 32.2% bf16 MFU | 125739 tok/s step 7147/19560 | loss 3.554507 (+1.43z)| norm 0.2510 (-1.31z)| lr 4.43e-04 | 4234.80 ms | 31.9% bf16 MFU | 125642 tok/s step 7148/19560 | loss 3.455418 (-0.93z)| norm 0.2985 (+0.60z)| lr 4.43e-04 | 4366.98 ms | 30.9% bf16 MFU | 125363 tok/s step 7149/19560 | loss 3.457886 (-0.87z)| norm 0.2826 (-0.04z)| lr 4.43e-04 | 4199.54 ms | 32.2% bf16 MFU | 125337 tok/s step 7150/19560 | loss 3.573913 (+1.85z)| norm 0.2807 (-0.11z)| lr 4.43e-04 | 4310.52 ms | 31.3% bf16 MFU | 125152 tok/s step 7151/19560 | loss 3.508629 (+0.30z)| norm 0.2755 (-0.32z)| lr 4.43e-04 | 4171.83 ms | 32.4% bf16 MFU | 125178 tok/s step 7152/19560 | loss 3.437252 (-1.37z)| norm 0.2601 (-0.93z)| lr 4.43e-04 | 4211.10 ms | 32.1% bf16 MFU | 125144 tok/s step 7153/19560 | loss 3.511945 (+0.39z)| norm 0.2658 (-0.69z)| lr 4.43e-04 | 4179.91 ms | 32.3% bf16 MFU | 125158 tok/s step 7154/19560 | loss 3.546954 (+1.22z)| norm 0.2800 (-0.12z)| lr 4.43e-04 | 4173.60 ms | 32.4% bf16 MFU | 125181 tok/s step 7155/19560 | loss 3.529510 (+0.80z)| norm 
0.3039 (+0.83z)| lr 4.43e-04 | 4233.98 ms | 31.9% bf16 MFU | 125114 tok/s step 7156/19560 | loss 3.496731 (+0.03z)| norm 0.2907 (+0.29z)| lr 4.43e-04 | 4230.46 ms | 31.9% bf16 MFU | 125055 tok/s step 7157/19560 | loss 3.512165 (+0.38z)| norm 0.2689 (-0.59z)| lr 4.43e-04 | 4213.83 ms | 32.0% bf16 MFU | 125023 tok/s step 7158/19560 | loss 3.532492 (+0.85z)| norm 0.2846 (+0.04z)| lr 4.43e-04 | 4163.23 ms | 32.4% bf16 MFU | 125068 tok/s step 7159/19560 | loss 3.543448 (+1.10z)| norm 0.2501 (-1.33z)| lr 4.43e-04 | 4154.91 ms | 32.5% bf16 MFU | 125124 tok/s step 7160/19560 | loss 3.549397 (+1.23z)| norm 0.2876 (+0.31z)| lr 4.42e-04 | 4170.55 ms | 32.4% bf16 MFU | 125154 tok/s step 7161/19560 | loss 3.447012 (-1.15z)| norm 0.2759 (-0.32z)| lr 4.42e-04 | 4177.17 ms | 32.3% bf16 MFU | 125172 tok/s step 7162/19560 | loss 3.427817 (-1.57z)| norm 0.2904 (+0.49z)| lr 4.42e-04 | 4187.40 ms | 32.2% bf16 MFU | 125173 tok/s step 7163/19560 | loss 3.484899 (-0.25z)| norm 0.3165 (+1.90z)| lr 4.42e-04 | 4172.91 ms | 32.4% bf16 MFU | 125197 tok/s step 7164/19560 | loss 3.477414 (-0.43z)| norm 0.2935 (+0.67z)| lr 4.42e-04 | 4161.80 ms | 32.4% bf16 MFU | 125236 tok/s step 7165/19560 | loss 3.441759 (-1.23z)| norm 0.3034 (+1.22z)| lr 4.42e-04 | 4159.94 ms | 32.5% bf16 MFU | 125276 tok/s step 7166/19560 | loss 3.450499 (-1.04z)| norm 0.3269 (+2.46z)| lr 4.42e-04 | 4169.94 ms | 32.4% bf16 MFU | 125298 tok/s step 7167/19560 | loss 3.500408 (+0.11z)| norm 0.3103 (+1.52z)| lr 4.42e-04 | 4167.31 ms | 32.4% bf16 MFU | 125324 tok/s step 7168/19560 | loss 3.433141 (-1.41z)| norm 0.2723 (-0.54z)| lr 4.42e-04 | 4173.18 ms | 32.4% bf16 MFU | 125339 tok/s step 7169/19560 | loss 3.489439 (-0.11z)| norm 0.3149 (+1.75z)| lr 4.42e-04 | 4181.12 ms | 32.3% bf16 MFU | 125342 tok/s step 7170/19560 | loss 3.410351 (-2.00z)| norm 0.2934 (+0.62z)| lr 4.42e-04 | 4171.13 ms | 32.4% bf16 MFU | 125360 tok/s step 7171/19560 | loss 3.428229 (-1.55z)| norm 0.2702 (-0.64z)| lr 4.42e-04 | 4174.93 ms | 32.3% bf16 MFU | 
125371 tok/s step 7172/19560 | loss 3.443455 (-1.17z)| norm 0.2581 (-1.28z)| lr 4.42e-04 | 4165.94 ms | 32.4% bf16 MFU | 125395 tok/s step 7173/19560 | loss 3.419346 (-1.71z)| norm 0.2579 (-1.27z)| lr 4.42e-04 | 4162.28 ms | 32.4% bf16 MFU | 125423 tok/s step 7174/19560 | loss 3.446782 (-1.06z)| norm 0.2738 (-0.40z)| lr 4.42e-04 | 4150.66 ms | 32.5% bf16 MFU | 125468 tok/s step 7175/19560 | loss 3.500713 (+0.21z)| norm 0.2949 (+0.75z)| lr 4.42e-04 | 4185.92 ms | 32.3% bf16 MFU | 125457 tok/s step 7176/19560 | loss 3.504689 (+0.30z)| norm 0.2808 (-0.01z)| lr 4.42e-04 | 4169.91 ms | 32.4% bf16 MFU | 125470 tok/s step 7177/19560 | loss 3.475687 (-0.38z)| norm 0.2930 (+0.65z)| lr 4.42e-04 | 4168.31 ms | 32.4% bf16 MFU | 125486 tok/s step 7178/19560 | loss 3.466060 (-0.60z)| norm 0.2618 (-1.05z)| lr 4.42e-04 | 4170.21 ms | 32.4% bf16 MFU | 125498 tok/s step 7179/19560 | loss 3.418437 (-1.71z)| norm 0.2907 (+0.52z)| lr 4.42e-04 | 4164.34 ms | 32.4% bf16 MFU | 125518 tok/s step 7180/19560 | loss 3.537873 (+1.07z)| norm 0.2986 (+0.94z)| lr 4.42e-04 | 4159.30 ms | 32.5% bf16 MFU | 125544 tok/s step 7181/19560 | loss 3.468290 (-0.56z)| norm 0.2773 (-0.21z)| lr 4.42e-04 | 4172.19 ms | 32.4% bf16 MFU | 125550 tok/s step 7182/19560 | loss 3.512059 (+0.45z)| norm 0.2669 (-0.77z)| lr 4.42e-04 | 4229.30 ms | 31.9% bf16 MFU | 125471 tok/s step 7183/19560 | loss 3.447001 (-1.05z)| norm 0.2591 (-1.18z)| lr 4.41e-04 | 4162.94 ms | 32.4% bf16 MFU | 125495 tok/s step 7184/19560 | loss 3.427847 (-1.49z)| norm 0.2700 (-0.57z)| lr 4.41e-04 | 4180.14 ms | 32.3% bf16 MFU | 125491 tok/s step 7185/19560 | loss 3.470686 (-0.49z)| norm 0.2543 (-1.41z)| lr 4.41e-04 | 4161.12 ms | 32.4% bf16 MFU | 125516 tok/s step 7186/19560 | loss 3.521104 (+0.68z)| norm 0.3027 (+1.19z)| lr 4.41e-04 | 4162.59 ms | 32.4% bf16 MFU | 125538 tok/s step 7187/19560 | loss 3.485120 (-0.15z)| norm 0.2566 (-1.27z)| lr 4.41e-04 | 4161.74 ms | 32.4% bf16 MFU | 125560 tok/s step 7188/19560 | loss 3.404934 (-2.00z)| norm 
0.2755 (-0.25z)| lr 4.41e-04 | 4162.52 ms | 32.4% bf16 MFU | 125580 tok/s step 7189/19560 | loss 3.483387 (-0.19z)| norm 0.2536 (-1.42z)| lr 4.41e-04 | 4160.57 ms | 32.5% bf16 MFU | 125602 tok/s step 7190/19560 | loss 3.529316 (+0.86z)| norm 0.2632 (-0.89z)| lr 4.41e-04 | 4166.59 ms | 32.4% bf16 MFU | 125613 tok/s step 7191/19560 | loss 3.458233 (-0.77z)| norm 0.2724 (-0.40z)| lr 4.41e-04 | 4155.42 ms | 32.5% bf16 MFU | 125641 tok/s step 7192/19560 | loss 3.439879 (-1.17z)| norm 0.2391 (-2.13z)| lr 4.41e-04 | 4159.31 ms | 32.5% bf16 MFU | 125661 tok/s step 7193/19560 | loss 3.448777 (-0.96z)| norm 0.2663 (-0.69z)| lr 4.41e-04 | 4173.09 ms | 32.4% bf16 MFU | 125660 tok/s step 7194/19560 | loss 3.398300 (-2.12z)| norm 0.2548 (-1.29z)| lr 4.41e-04 | 4159.48 ms | 32.5% bf16 MFU | 125679 tok/s step 7195/19560 | loss 3.460082 (-0.67z)| norm 0.2507 (-1.49z)| lr 4.41e-04 | 4162.57 ms | 32.4% bf16 MFU | 125693 tok/s step 7196/19560 | loss 3.485497 (-0.08z)| norm 0.2653 (-0.74z)| lr 4.41e-04 | 4163.50 ms | 32.4% bf16 MFU | 125705 tok/s step 7197/19560 | loss 3.461132 (-0.65z)| norm 0.2864 (+0.38z)| lr 4.41e-04 | 4168.49 ms | 32.4% bf16 MFU | 125708 tok/s step 7198/19560 | loss 3.436052 (-1.22z)| norm 0.2939 (+0.77z)| lr 4.41e-04 | 4160.41 ms | 32.5% bf16 MFU | 125724 tok/s step 7199/19560 | loss 3.529759 (+0.97z)| norm 0.2987 (+1.00z)| lr 4.41e-04 | 4161.33 ms | 32.4% bf16 MFU | 125737 tok/s step 7200/19560 | loss 3.466800 (-0.49z)| norm 0.2825 (+0.16z)| lr 4.41e-04 | 4160.67 ms | 32.5% bf16 MFU | 125751 tok/s step 7201/19560 | loss 3.487510 (-0.00z)| norm 0.3008 (+1.10z)| lr 4.41e-04 | 4161.88 ms | 32.4% bf16 MFU | 125762 tok/s step 7202/19560 | loss 3.433924 (-1.24z)| norm 0.3224 (+2.17z)| lr 4.41e-04 | 4167.10 ms | 32.4% bf16 MFU | 125765 tok/s step 7203/19560 | loss 3.469336 (-0.41z)| norm 0.2458 (-1.72z)| lr 4.41e-04 | 4166.75 ms | 32.4% bf16 MFU | 125768 tok/s step 7204/19560 | loss 3.455511 (-0.72z)| norm 0.2613 (-0.93z)| lr 4.41e-04 | 4163.68 ms | 32.4% bf16 MFU | 
125775 tok/s step 7205/19560 | loss 3.460064 (-0.61z)| norm 0.2870 (+0.37z)| lr 4.40e-04 | 4231.15 ms | 31.9% bf16 MFU | 125682 tok/s step 7206/19560 | loss 3.409079 (-1.78z)| norm 0.2524 (-1.36z)| lr 4.40e-04 | 4155.33 ms | 32.5% bf16 MFU | 125707 tok/s step 7207/19560 | loss 3.459507 (-0.59z)| norm 0.2836 (+0.20z)| lr 4.40e-04 | 4162.58 ms | 32.4% bf16 MFU | 125719 tok/s step 7208/19560 | loss 3.402073 (-1.95z)| norm 0.2744 (-0.25z)| lr 4.40e-04 | 4175.78 ms | 32.3% bf16 MFU | 125711 tok/s step 7209/19560 | loss 3.587853 (+2.41z)| norm 0.3134 (+1.72z)| lr 4.40e-04 | 4157.19 ms | 32.5% bf16 MFU | 125731 tok/s step 7210/19560 | loss 3.409194 (-1.74z)| norm 0.2832 (+0.20z)| lr 4.40e-04 | 4167.44 ms | 32.4% bf16 MFU | 125735 tok/s step 7211/19560 | loss 3.543881 (+1.37z)| norm 0.3369 (+2.85z)| lr 4.40e-04 | 4159.71 ms | 32.5% bf16 MFU | 125750 tok/s step 7212/19560 | loss 3.440048 (-1.02z)| norm 0.2955 (+0.78z)| lr 4.40e-04 | 4190.84 ms | 32.2% bf16 MFU | 125718 tok/s step 7213/19560 | loss 3.449513 (-0.79z)| norm 0.2803 (+0.03z)| lr 4.40e-04 | 4166.14 ms | 32.4% bf16 MFU | 125724 tok/s step 7214/19560 | loss 3.472041 (-0.27z)| norm 0.2884 (+0.44z)| lr 4.40e-04 | 4165.33 ms | 32.4% bf16 MFU | 125731 tok/s step 7215/19560 | loss 3.514450 (+0.70z)| norm 0.2761 (-0.17z)| lr 4.40e-04 | 4160.36 ms | 32.5% bf16 MFU | 125746 tok/s step 7216/19560 | loss 3.398941 (-1.93z)| norm 0.2615 (-0.89z)| lr 4.40e-04 | 4168.18 ms | 32.4% bf16 MFU | 125748 tok/s step 7217/19560 | loss 3.486983 (+0.07z)| norm 0.2880 (+0.43z)| lr 4.40e-04 | 4176.07 ms | 32.3% bf16 MFU | 125737 tok/s step 7218/19560 | loss 3.471408 (-0.29z)| norm 0.2718 (-0.39z)| lr 4.40e-04 | 4174.57 ms | 32.3% bf16 MFU | 125730 tok/s step 7219/19560 | loss 3.464600 (-0.44z)| norm 0.2773 (-0.12z)| lr 4.40e-04 | 4165.73 ms | 32.4% bf16 MFU | 125736 tok/s step 7220/19560 | loss 3.417842 (-1.49z)| norm 0.2760 (-0.18z)| lr 4.40e-04 | 4168.06 ms | 32.4% bf16 MFU | 125739 tok/s step 7221/19560 | loss 3.508258 (+0.57z)| norm 
0.2655 (-0.71z)| lr 4.40e-04 | 4167.06 ms | 32.4% bf16 MFU | 125743 tok/s step 7222/19560 | loss 3.469959 (-0.30z)| norm 0.2843 (+0.23z)| lr 4.40e-04 | 4172.34 ms | 32.4% bf16 MFU | 125739 tok/s step 7223/19560 | loss 3.582631 (+2.20z)| norm 0.3373 (+2.80z)| lr 4.40e-04 | 4167.56 ms | 32.4% bf16 MFU | 125742 tok/s step 7224/19560 | loss 3.565919 (+1.79z)| norm 0.3029 (+1.09z)| lr 4.40e-04 | 4162.77 ms | 32.4% bf16 MFU | 125752 tok/s step 7225/19560 | loss 3.512956 (+0.61z)| norm 0.3205 (+1.91z)| lr 4.40e-04 | 4164.95 ms | 32.4% bf16 MFU | 125759 tok/s step 7226/19560 | loss 3.504254 (+0.43z)| norm 0.2889 (+0.37z)| lr 4.40e-04 | 4162.60 ms | 32.4% bf16 MFU | 125768 tok/s step 7227/19560 | loss 3.510811 (+0.58z)| norm 0.3242 (+2.07z)| lr 4.40e-04 | 4173.26 ms | 32.4% bf16 MFU | 125761 tok/s step 7228/19560 | loss 3.441695 (-0.96z)| norm 0.3108 (+1.41z)| lr 4.39e-04 | 4159.55 ms | 32.5% bf16 MFU | 125775 tok/s step 7229/19560 | loss 3.502282 (+0.39z)| norm 0.2641 (-0.81z)| lr 4.39e-04 | 4341.59 ms | 31.1% bf16 MFU | 125525 tok/s step 7230/19560 | loss 3.553026 (+1.49z)| norm 0.3321 (+2.44z)| lr 4.39e-04 | 4163.71 ms | 32.4% bf16 MFU | 125544 tok/s step 7231/19560 | loss 3.474122 (-0.25z)| norm 0.3047 (+1.12z)| lr 4.39e-04 | 4172.79 ms | 32.4% bf16 MFU | 125549 tok/s step 7232/19560 | loss 3.473671 (-0.26z)| norm 0.2878 (+0.31z)| lr 4.39e-04 | 4161.42 ms | 32.4% bf16 MFU | 125571 tok/s step 7233/19560 | loss 3.515664 (+0.67z)| norm 0.2938 (+0.59z)| lr 4.39e-04 | 4167.57 ms | 32.4% bf16 MFU | 125583 tok/s step 7234/19560 | loss 3.392315 (-2.01z)| norm 0.2719 (-0.45z)| lr 4.39e-04 | 4167.18 ms | 32.4% bf16 MFU | 125594 tok/s step 7235/19560 | loss 3.542272 (+1.26z)| norm 0.2871 (+0.28z)| lr 4.39e-04 | 4165.52 ms | 32.4% bf16 MFU | 125608 tok/s step 7236/19560 | loss 3.655580 (+3.55z)| norm 0.3004 (+0.91z)| lr 4.39e-04 | 4158.17 ms | 32.5% bf16 MFU | 125632 tok/s step 7237/19560 | loss 3.490052 (+0.12z)| norm 0.2754 (-0.28z)| lr 4.39e-04 | 4161.62 ms | 32.4% bf16 MFU | 
125649 tok/s step 7238/19560 | loss 3.428230 (-1.18z)| norm 0.2927 (+0.55z)| lr 4.39e-04 | 4153.83 ms | 32.5% bf16 MFU | 125678 tok/s step 7239/19560 | loss 3.456173 (-0.58z)| norm 0.2927 (+0.54z)| lr 4.39e-04 | 4168.76 ms | 32.4% bf16 MFU | 125682 tok/s step 7240/19560 | loss 3.551531 (+1.42z)| norm 0.2726 (-0.43z)| lr 4.39e-04 | 4156.10 ms | 32.5% bf16 MFU | 125705 tok/s step 7241/19560 | loss 3.566495 (+1.71z)| norm 0.2671 (-0.68z)| lr 4.39e-04 | 4152.58 ms | 32.5% bf16 MFU | 125733 tok/s step 7242/19560 | loss 3.404643 (-1.64z)| norm 0.2768 (-0.22z)| lr 4.39e-04 | 4158.42 ms | 32.5% bf16 MFU | 125750 tok/s step 7243/19560 | loss 3.476990 (-0.14z)| norm 0.2656 (-0.76z)| lr 4.39e-04 | 4170.61 ms | 32.4% bf16 MFU | 125748 tok/s step 7244/19560 | loss 3.502040 (+0.37z)| norm 0.2964 (+0.71z)| lr 4.39e-04 | 4168.58 ms | 32.4% bf16 MFU | 125749 tok/s step 7245/19560 | loss 3.531860 (+0.98z)| norm 0.2787 (-0.15z)| lr 4.39e-04 | 4165.41 ms | 32.4% bf16 MFU | 125755 tok/s step 7246/19560 | loss 3.490727 (+0.13z)| norm 0.2711 (-0.54z)| lr 4.39e-04 | 4156.66 ms | 32.5% bf16 MFU | 125774 tok/s step 7247/19560 | loss 3.416280 (-1.39z)| norm 0.2599 (-1.09z)| lr 4.39e-04 | 4154.36 ms | 32.5% bf16 MFU | 125796 tok/s step 7248/19560 | loss 3.426123 (-1.17z)| norm 0.2754 (-0.33z)| lr 4.39e-04 | 4158.15 ms | 32.5% bf16 MFU | 125810 tok/s step 7249/19560 | loss 3.460452 (-0.47z)| norm 0.2794 (-0.14z)| lr 4.39e-04 | 4166.29 ms | 32.4% bf16 MFU | 125812 tok/s step 7250/19560 | loss 3.414587 (-1.39z)| norm 0.2815 (-0.04z)| lr 4.39e-04 | 4157.84 ms | 32.5% bf16 MFU | 125826 tok/s val loss 3.461809 evaluating HellaSwag: 0/1256 evaluating HellaSwag: 10/1256 evaluating HellaSwag: 20/1256 evaluating HellaSwag: 30/1256 evaluating HellaSwag: 40/1256 evaluating HellaSwag: 50/1256 evaluating HellaSwag: 60/1256 evaluating HellaSwag: 70/1256 evaluating HellaSwag: 80/1256 evaluating HellaSwag: 90/1256 evaluating HellaSwag: 100/1256 evaluating HellaSwag: 110/1256 evaluating HellaSwag: 120/1256 
evaluating HellaSwag: 130/1256 evaluating HellaSwag: 140/1256 evaluating HellaSwag: 150/1256 evaluating HellaSwag: 160/1256 evaluating HellaSwag: 170/1256 evaluating HellaSwag: 180/1256 evaluating HellaSwag: 190/1256 evaluating HellaSwag: 200/1256 evaluating HellaSwag: 210/1256 evaluating HellaSwag: 220/1256 evaluating HellaSwag: 230/1256 evaluating HellaSwag: 240/1256 evaluating HellaSwag: 250/1256 evaluating HellaSwag: 260/1256 evaluating HellaSwag: 270/1256 evaluating HellaSwag: 280/1256 evaluating HellaSwag: 290/1256 evaluating HellaSwag: 300/1256 evaluating HellaSwag: 310/1256 evaluating HellaSwag: 320/1256 evaluating HellaSwag: 330/1256 evaluating HellaSwag: 340/1256 evaluating HellaSwag: 350/1256 evaluating HellaSwag: 360/1256 evaluating HellaSwag: 370/1256 evaluating HellaSwag: 380/1256 evaluating HellaSwag: 390/1256 evaluating HellaSwag: 400/1256 evaluating HellaSwag: 410/1256 evaluating HellaSwag: 420/1256 evaluating HellaSwag: 430/1256 evaluating HellaSwag: 440/1256 evaluating HellaSwag: 450/1256 evaluating HellaSwag: 460/1256 evaluating HellaSwag: 470/1256 evaluating HellaSwag: 480/1256 evaluating HellaSwag: 490/1256 evaluating HellaSwag: 500/1256 evaluating HellaSwag: 510/1256 evaluating HellaSwag: 520/1256 evaluating HellaSwag: 530/1256 evaluating HellaSwag: 540/1256 evaluating HellaSwag: 550/1256 evaluating HellaSwag: 560/1256 evaluating HellaSwag: 570/1256 evaluating HellaSwag: 580/1256 evaluating HellaSwag: 590/1256 evaluating HellaSwag: 600/1256 evaluating HellaSwag: 610/1256 evaluating HellaSwag: 620/1256 evaluating HellaSwag: 630/1256 evaluating HellaSwag: 640/1256 evaluating HellaSwag: 650/1256 evaluating HellaSwag: 660/1256 evaluating HellaSwag: 670/1256 evaluating HellaSwag: 680/1256 evaluating HellaSwag: 690/1256 evaluating HellaSwag: 700/1256 evaluating HellaSwag: 710/1256 evaluating HellaSwag: 720/1256 evaluating HellaSwag: 730/1256 evaluating HellaSwag: 740/1256 evaluating HellaSwag: 750/1256 evaluating HellaSwag: 760/1256 evaluating 
HellaSwag: 770/1256 evaluating HellaSwag: 780/1256 evaluating HellaSwag: 790/1256 evaluating HellaSwag: 800/1256 evaluating HellaSwag: 810/1256 evaluating HellaSwag: 820/1256 evaluating HellaSwag: 830/1256 evaluating HellaSwag: 840/1256 evaluating HellaSwag: 850/1256 evaluating HellaSwag: 860/1256 evaluating HellaSwag: 870/1256 evaluating HellaSwag: 880/1256 evaluating HellaSwag: 890/1256 evaluating HellaSwag: 900/1256 evaluating HellaSwag: 910/1256 evaluating HellaSwag: 920/1256 evaluating HellaSwag: 930/1256 evaluating HellaSwag: 940/1256 evaluating HellaSwag: 950/1256 evaluating HellaSwag: 960/1256 evaluating HellaSwag: 970/1256 evaluating HellaSwag: 980/1256 evaluating HellaSwag: 990/1256 evaluating HellaSwag: 1000/1256 evaluating HellaSwag: 1010/1256 evaluating HellaSwag: 1020/1256 evaluating HellaSwag: 1030/1256 evaluating HellaSwag: 1040/1256 evaluating HellaSwag: 1050/1256 evaluating HellaSwag: 1060/1256 evaluating HellaSwag: 1070/1256 evaluating HellaSwag: 1080/1256 evaluating HellaSwag: 1090/1256 evaluating HellaSwag: 1100/1256 evaluating HellaSwag: 1110/1256 evaluating HellaSwag: 1120/1256 evaluating HellaSwag: 1130/1256 evaluating HellaSwag: 1140/1256 evaluating HellaSwag: 1150/1256 evaluating HellaSwag: 1160/1256 evaluating HellaSwag: 1170/1256 evaluating HellaSwag: 1180/1256 evaluating HellaSwag: 1190/1256 evaluating HellaSwag: 1200/1256 evaluating HellaSwag: 1210/1256 evaluating HellaSwag: 1220/1256 evaluating HellaSwag: 1230/1256 evaluating HellaSwag: 1240/1256 evaluating HellaSwag: 1250/1256 HellaSwag: 2817/10042 = 0.280522 step 7251/19560 | loss 3.466894 (-0.34z)| norm 0.2586 (-1.16z)| lr 4.38e-04 | 4156.74 ms | 32.5% bf16 MFU | 125841 tok/s step 7252/19560 | loss 3.497893 (+0.29z)| norm 0.2636 (-0.92z)| lr 4.38e-04 | 4192.29 ms | 32.2% bf16 MFU | 125802 tok/s step 7253/19560 | loss 3.385542 (-1.95z)| norm 0.2866 (+0.21z)| lr 4.38e-04 | 4158.65 ms | 32.5% bf16 MFU | 125815 tok/s step 7254/19560 | loss 3.404992 (-1.53z)| norm 0.2982 (+0.77z)| lr 
4.38e-04 | 4153.05 ms | 32.5% bf16 MFU | 125837 tok/s step 7255/19560 | loss 3.425398 (-1.12z)| norm 0.2678 (-0.72z)| lr 4.38e-04 | 4166.94 ms | 32.4% bf16 MFU | 125836 tok/s step 7256/19560 | loss 3.453027 (-0.56z)| norm 0.2720 (-0.51z)| lr 4.38e-04 | 4156.32 ms | 32.5% bf16 MFU | 125851 tok/s step 7257/19560 | loss 3.482162 (+0.03z)| norm 0.2878 (+0.28z)| lr 4.38e-04 | 4163.17 ms | 32.4% bf16 MFU | 125855 tok/s step 7258/19560 | loss 3.438763 (-0.84z)| norm 0.3013 (+0.98z)| lr 4.38e-04 | 4188.10 ms | 32.2% bf16 MFU | 125822 tok/s step 7259/19560 | loss 3.525202 (+0.89z)| norm 0.2923 (+0.52z)| lr 4.38e-04 | 4175.80 ms | 32.3% bf16 MFU | 125809 tok/s step 7260/19560 | loss 3.468539 (-0.24z)| norm 0.3392 (+2.79z)| lr 4.38e-04 | 4160.15 ms | 32.5% bf16 MFU | 125819 tok/s step 7261/19560 | loss 3.440250 (-0.79z)| norm 0.3201 (+1.83z)| lr 4.38e-04 | 4159.95 ms | 32.5% bf16 MFU | 125830 tok/s step 7262/19560 | loss 3.449481 (-0.60z)| norm 0.2678 (-0.72z)| lr 4.38e-04 | 4187.45 ms | 32.2% bf16 MFU | 125799 tok/s step 7263/19560 | loss 3.455329 (-0.47z)| norm 0.3113 (+1.37z)| lr 4.38e-04 | 4164.22 ms | 32.4% bf16 MFU | 125804 tok/s step 7264/19560 | loss 3.476670 (-0.03z)| norm 0.2949 (+0.58z)| lr 4.38e-04 | 4260.40 ms | 31.7% bf16 MFU | 125667 tok/s step 7265/19560 | loss 3.447662 (-0.61z)| norm 0.3117 (+1.36z)| lr 4.38e-04 | 4173.80 ms | 32.3% bf16 MFU | 125664 tok/s step 7266/19560 | loss 3.477752 (+0.00z)| norm 0.3714 (+3.94z)| lr 4.38e-04 | 4151.48 ms | 32.5% bf16 MFU | 125695 tok/s step 7267/19560 | loss 3.439576 (-0.77z)| norm 0.2995 (+0.69z)| lr 4.38e-04 | 4161.99 ms | 32.4% bf16 MFU | 125709 tok/s step 7268/19560 | loss 3.426192 (-1.03z)| norm 0.3196 (+1.57z)| lr 4.38e-04 | 4161.90 ms | 32.4% bf16 MFU | 125722 tok/s step 7269/19560 | loss 3.459370 (-0.34z)| norm 0.3310 (+2.03z)| lr 4.38e-04 | 4164.08 ms | 32.4% bf16 MFU | 125732 tok/s step 7270/19560 | loss 3.516591 (+0.84z)| norm 0.2784 (-0.31z)| lr 4.38e-04 | 4167.92 ms | 32.4% bf16 MFU | 125735 tok/s step 
7271/19560 | loss 3.455331 (-0.42z)| norm 0.3235 (+1.69z)| lr 4.38e-04 | 4176.40 ms | 32.3% bf16 MFU | 125725 tok/s step 7272/19560 | loss 3.412000 (-1.30z)| norm 0.3047 (+0.84z)| lr 4.38e-04 | 4182.72 ms | 32.3% bf16 MFU | 125706 tok/s step 7273/19560 | loss 3.476632 (+0.04z)| norm 0.3031 (+0.76z)| lr 4.37e-04 | 4212.47 ms | 32.1% bf16 MFU | 125644 tok/s step 7274/19560 | loss 3.457026 (-0.37z)| norm 0.2967 (+0.48z)| lr 4.37e-04 | 4205.15 ms | 32.1% bf16 MFU | 125595 tok/s step 7275/19560 | loss 3.440097 (-0.71z)| norm 0.3055 (+0.85z)| lr 4.37e-04 | 4159.71 ms | 32.5% bf16 MFU | 125617 tok/s step 7276/19560 | loss 3.512779 (+0.81z)| norm 0.2788 (-0.32z)| lr 4.37e-04 | 4175.73 ms | 32.3% bf16 MFU | 125614 tok/s step 7277/19560 | loss 3.398328 (-1.57z)| norm 0.2588 (-1.20z)| lr 4.37e-04 | 4166.80 ms | 32.4% bf16 MFU | 125625 tok/s step 7278/19560 | loss 3.513022 (+0.84z)| norm 0.2685 (-0.76z)| lr 4.37e-04 | 4165.80 ms | 32.4% bf16 MFU | 125636 tok/s step 7279/19560 | loss 3.489594 (+0.35z)| norm 0.2819 (-0.18z)| lr 4.37e-04 | 4169.26 ms | 32.4% bf16 MFU | 125642 tok/s step 7280/19560 | loss 3.451733 (-0.45z)| norm 0.2554 (-1.34z)| lr 4.37e-04 | 4159.03 ms | 32.5% bf16 MFU | 125663 tok/s step 7281/19560 | loss 3.445488 (-0.58z)| norm 0.2798 (-0.27z)| lr 4.37e-04 | 4156.28 ms | 32.5% bf16 MFU | 125687 tok/s step 7282/19560 | loss 3.427369 (-0.95z)| norm 0.2841 (-0.08z)| lr 4.37e-04 | 4163.38 ms | 32.4% bf16 MFU | 125699 tok/s step 7283/19560 | loss 3.458550 (-0.27z)| norm 0.2656 (-0.88z)| lr 4.37e-04 | 4170.47 ms | 32.4% bf16 MFU | 125700 tok/s step 7284/19560 | loss 3.443370 (-0.59z)| norm 0.2999 (+0.63z)| lr 4.37e-04 | 4163.97 ms | 32.4% bf16 MFU | 125710 tok/s step 7285/19560 | loss 3.420071 (-1.07z)| norm 0.2739 (-0.52z)| lr 4.37e-04 | 4170.09 ms | 32.4% bf16 MFU | 125711 tok/s step 7286/19560 | loss 3.483367 (+0.30z)| norm 0.2945 (+0.38z)| lr 4.37e-04 | 4233.36 ms | 31.9% bf16 MFU | 125618 tok/s step 7287/19560 | loss 3.481247 (+0.26z)| norm 0.2792 (-0.31z)| lr 
4.37e-04 | 4171.71 ms | 32.4% bf16 MFU | 125621 tok/s step 7288/19560 | loss 3.461352 (-0.16z)| norm 0.3012 (+0.67z)| lr 4.37e-04 | 4171.34 ms | 32.4% bf16 MFU | 125624 tok/s step 7289/19560 | loss 3.469615 (+0.02z)| norm 0.2885 (+0.10z)| lr 4.37e-04 | 4166.66 ms | 32.4% bf16 MFU | 125635 tok/s step 7290/19560 | loss 3.412273 (-1.24z)| norm 0.2896 (+0.15z)| lr 4.37e-04 | 4155.64 ms | 32.5% bf16 MFU | 125661 tok/s step 7291/19560 | loss 3.470937 (+0.06z)| norm 0.3056 (+0.86z)| lr 4.37e-04 | 4147.70 ms | 32.6% bf16 MFU | 125698 tok/s step 7292/19560 | loss 3.454022 (-0.31z)| norm 0.3115 (+1.12z)| lr 4.37e-04 | 4164.33 ms | 32.4% bf16 MFU | 125708 tok/s step 7293/19560 | loss 3.472534 (+0.09z)| norm 0.2934 (+0.32z)| lr 4.37e-04 | 4161.56 ms | 32.4% bf16 MFU | 125722 tok/s step 7294/19560 | loss 3.451696 (-0.37z)| norm 0.3008 (+0.66z)| lr 4.37e-04 | 4159.76 ms | 32.5% bf16 MFU | 125738 tok/s step 7295/19560 | loss 3.414153 (-1.18z)| norm 0.2880 (+0.10z)| lr 4.37e-04 | 4191.16 ms | 32.2% bf16 MFU | 125706 tok/s step 7296/19560 | loss 3.490551 (+0.49z)| norm 0.2855 (-0.02z)| lr 4.36e-04 | 4191.87 ms | 32.2% bf16 MFU | 125674 tok/s step 7297/19560 | loss 3.449401 (-0.41z)| norm 0.2833 (-0.11z)| lr 4.36e-04 | 4157.60 ms | 32.5% bf16 MFU | 125695 tok/s step 7298/19560 | loss 3.467542 (-0.02z)| norm 0.2736 (-0.55z)| lr 4.36e-04 | 4299.92 ms | 31.4% bf16 MFU | 125507 tok/s step 7299/19560 | loss 3.428377 (-0.89z)| norm 0.2770 (-0.39z)| lr 4.36e-04 | 4148.75 ms | 32.5% bf16 MFU | 125550 tok/s step 7300/19560 | loss 3.439050 (-0.65z)| norm 0.2842 (-0.08z)| lr 4.36e-04 | 4154.79 ms | 32.5% bf16 MFU | 125582 tok/s step 7301/19560 | loss 3.556733 (+1.92z)| norm 0.2658 (-0.93z)| lr 4.36e-04 | 4220.83 ms | 32.0% bf16 MFU | 125514 tok/s step 7302/19560 | loss 3.456058 (-0.29z)| norm 0.2774 (-0.39z)| lr 4.36e-04 | 4157.26 ms | 32.5% bf16 MFU | 125544 tok/s step 7303/19560 | loss 3.450888 (-0.40z)| norm 0.2656 (-0.92z)| lr 4.36e-04 | 4159.72 ms | 32.5% bf16 MFU | 125569 tok/s step 
7304/19560 | loss 3.479171 (+0.23z)| norm 0.2791 (-0.30z)| lr 4.36e-04 | 4162.00 ms | 32.4% bf16 MFU | 125589 tok/s step 7305/19560 | loss 3.473241 (+0.10z)| norm 0.2659 (-0.90z)| lr 4.36e-04 | 4165.83 ms | 32.4% bf16 MFU | 125602 tok/s step 7306/19560 | loss 3.439344 (-0.64z)| norm 0.2936 (+0.36z)| lr 4.36e-04 | 4164.94 ms | 32.4% bf16 MFU | 125616 tok/s step 7307/19560 | loss 3.483702 (+0.32z)| norm 0.2921 (+0.29z)| lr 4.36e-04 | 4168.15 ms | 32.4% bf16 MFU | 125624 tok/s step 7308/19560 | loss 3.473254 (+0.10z)| norm 0.2634 (-1.01z)| lr 4.36e-04 | 4392.32 ms | 30.7% bf16 MFU | 125311 tok/s step 7309/19560 | loss 3.481092 (+0.28z)| norm 0.2623 (-1.05z)| lr 4.36e-04 | 4161.55 ms | 32.4% bf16 MFU | 125345 tok/s step 7310/19560 | loss 3.439699 (-0.64z)| norm 0.2584 (-1.22z)| lr 4.36e-04 | 4159.83 ms | 32.5% bf16 MFU | 125380 tok/s step 7311/19560 | loss 3.428451 (-0.88z)| norm 0.2772 (-0.38z)| lr 4.36e-04 | 4382.87 ms | 30.8% bf16 MFU | 125092 tok/s step 7312/19560 | loss 3.487725 (+0.43z)| norm 0.2765 (-0.41z)| lr 4.36e-04 | 4165.49 ms | 32.4% bf16 MFU | 125130 tok/s step 7313/19560 | loss 3.441879 (-0.59z)| norm 0.2649 (-0.95z)| lr 4.36e-04 | 4151.22 ms | 32.5% bf16 MFU | 125189 tok/s step 7314/19560 | loss 3.444418 (-0.52z)| norm 0.2707 (-0.67z)| lr 4.36e-04 | 4168.54 ms | 32.4% bf16 MFU | 125218 tok/s step 7315/19560 | loss 3.404981 (-1.38z)| norm 0.2773 (-0.38z)| lr 4.36e-04 | 4155.18 ms | 32.5% bf16 MFU | 125266 tok/s step 7316/19560 | loss 3.466042 (-0.03z)| norm 0.2769 (-0.40z)| lr 4.36e-04 | 4163.09 ms | 32.4% bf16 MFU | 125299 tok/s step 7317/19560 | loss 3.508935 (+0.92z)| norm 0.2569 (-1.33z)| lr 4.36e-04 | 4168.90 ms | 32.4% bf16 MFU | 125323 tok/s step 7318/19560 | loss 3.439060 (-0.63z)| norm 0.2620 (-1.09z)| lr 4.35e-04 | 4153.72 ms | 32.5% bf16 MFU | 125367 tok/s step 7319/19560 | loss 3.405230 (-1.37z)| norm 0.2748 (-0.50z)| lr 4.35e-04 | 4168.49 ms | 32.4% bf16 MFU | 125388 tok/s step 7320/19560 | loss 3.506610 (+0.88z)| norm 0.2727 (-0.62z)| lr 
4.35e-04 | 4184.55 ms | 32.3% bf16 MFU | 125383 tok/s step 7321/19560 | loss 3.412363 (-1.21z)| norm 0.3159 (+1.40z)| lr 4.35e-04 | 4172.87 ms | 32.4% bf16 MFU | 125396 tok/s step 7322/19560 | loss 3.488958 (+0.48z)| norm 0.2572 (-1.36z)| lr 4.35e-04 | 4169.73 ms | 32.4% bf16 MFU | 125413 tok/s step 7323/19560 | loss 3.486139 (+0.41z)| norm 0.2504 (-1.69z)| lr 4.35e-04 | 4153.17 ms | 32.5% bf16 MFU | 125454 tok/s step 7324/19560 | loss 3.463253 (-0.10z)| norm 0.2891 (+0.13z)| lr 4.35e-04 | 4169.11 ms | 32.4% bf16 MFU | 125469 tok/s step 7325/19560 | loss 3.391326 (-1.67z)| norm 0.2727 (-0.64z)| lr 4.35e-04 | 4162.85 ms | 32.4% bf16 MFU | 125493 tok/s step 7326/19560 | loss 3.429012 (-0.84z)| norm 0.2937 (+0.35z)| lr 4.35e-04 | 4163.76 ms | 32.4% bf16 MFU | 125514 tok/s step 7327/19560 | loss 3.424330 (-0.93z)| norm 0.2748 (-0.53z)| lr 4.35e-04 | 4171.82 ms | 32.4% bf16 MFU | 125522 tok/s step 7328/19560 | loss 3.499655 (+0.74z)| norm 0.2848 (-0.06z)| lr 4.35e-04 | 4157.04 ms | 32.5% bf16 MFU | 125552 tok/s step 7329/19560 | loss 3.445599 (-0.45z)| norm 0.2675 (-0.86z)| lr 4.35e-04 | 4174.42 ms | 32.3% bf16 MFU | 125554 tok/s step 7330/19560 | loss 3.399601 (-1.46z)| norm 0.2991 (+0.64z)| lr 4.35e-04 | 4163.48 ms | 32.4% bf16 MFU | 125573 tok/s step 7331/19560 | loss 3.547912 (+1.77z)| norm 0.2873 (+0.07z)| lr 4.35e-04 | 4164.71 ms | 32.4% bf16 MFU | 125589 tok/s step 7332/19560 | loss 3.438283 (-0.61z)| norm 0.2840 (-0.10z)| lr 4.35e-04 | 4174.93 ms | 32.3% bf16 MFU | 125588 tok/s step 7333/19560 | loss 3.452873 (-0.29z)| norm 0.2750 (-0.54z)| lr 4.35e-04 | 4160.99 ms | 32.4% bf16 MFU | 125609 tok/s step 7334/19560 | loss 3.491225 (+0.53z)| norm 0.2567 (-1.43z)| lr 4.35e-04 | 4165.78 ms | 32.4% bf16 MFU | 125621 tok/s step 7335/19560 | loss 3.451294 (-0.34z)| norm 0.2648 (-1.02z)| lr 4.35e-04 | 4158.60 ms | 32.5% bf16 MFU | 125644 tok/s step 7336/19560 | loss 3.455501 (-0.26z)| norm 0.2590 (-1.29z)| lr 4.35e-04 | 4155.27 ms | 32.5% bf16 MFU | 125670 tok/s step 
7337/19560 | loss 3.415877 (-1.12z)| norm 0.2483 (-1.77z)| lr 4.35e-04 | 4151.29 ms | 32.5% bf16 MFU | 125702 tok/s step 7338/19560 | loss 3.477134 (+0.24z)| norm 0.2724 (-0.61z)| lr 4.35e-04 | 4167.32 ms | 32.4% bf16 MFU | 125707 tok/s step 7339/19560 | loss 3.392566 (-1.65z)| norm 0.2735 (-0.55z)| lr 4.35e-04 | 4172.73 ms | 32.4% bf16 MFU | 125704 tok/s step 7340/19560 | loss 3.401024 (-1.44z)| norm 0.2691 (-0.75z)| lr 4.35e-04 | 4154.80 ms | 32.5% bf16 MFU | 125728 tok/s step 7341/19560 | loss 3.431803 (-0.74z)| norm 0.2526 (-1.54z)| lr 4.34e-04 | 4158.81 ms | 32.5% bf16 MFU | 125745 tok/s step 7342/19560 | loss 3.448981 (-0.35z)| norm 0.2672 (-0.82z)| lr 4.34e-04 | 4159.88 ms | 32.5% bf16 MFU | 125760 tok/s step 7343/19560 | loss 3.415058 (-1.10z)| norm 0.3013 (+0.82z)| lr 4.34e-04 | 4167.44 ms | 32.4% bf16 MFU | 125762 tok/s step 7344/19560 | loss 3.456653 (-0.17z)| norm 0.2571 (-1.31z)| lr 4.34e-04 | 4167.85 ms | 32.4% bf16 MFU | 125763 tok/s step 7345/19560 | loss 3.485787 (+0.49z)| norm 0.2640 (-0.97z)| lr 4.34e-04 | 4162.11 ms | 32.4% bf16 MFU | 125774 tok/s step 7346/19560 | loss 3.455709 (-0.19z)| norm 0.2603 (-1.14z)| lr 4.34e-04 | 4166.96 ms | 32.4% bf16 MFU | 125776 tok/s step 7347/19560 | loss 3.398405 (-1.46z)| norm 0.2886 (+0.21z)| lr 4.34e-04 | 4163.85 ms | 32.4% bf16 MFU | 125783 tok/s step 7348/19560 | loss 3.551531 (+1.93z)| norm 0.2751 (-0.43z)| lr 4.34e-04 | 4151.26 ms | 32.5% bf16 MFU | 125808 tok/s step 7349/19560 | loss 3.419585 (-0.99z)| norm 0.2684 (-0.76z)| lr 4.34e-04 | 4154.65 ms | 32.5% bf16 MFU | 125828 tok/s step 7350/19560 | loss 3.464469 (+0.01z)| norm 0.2745 (-0.46z)| lr 4.34e-04 | 4180.18 ms | 32.3% bf16 MFU | 125807 tok/s step 7351/19560 | loss 3.427112 (-0.81z)| norm 0.2836 (+0.00z)| lr 4.34e-04 | 4156.13 ms | 32.5% bf16 MFU | 125824 tok/s step 7352/19560 | loss 3.472426 (+0.24z)| norm 0.2725 (-0.54z)| lr 4.34e-04 | 4174.38 ms | 32.3% bf16 MFU | 125813 tok/s step 7353/19560 | loss 3.460021 (-0.04z)| norm 0.2507 (-1.59z)| lr 
4.34e-04 | 4201.23 ms | 32.1% bf16 MFU | 125762 tok/s step 7354/19560 | loss 3.484829 (+0.55z)| norm 0.2984 (+0.76z)| lr 4.34e-04 | 4179.70 ms | 32.3% bf16 MFU | 125746 tok/s step 7355/19560 | loss 3.517634 (+1.32z)| norm 0.2933 (+0.53z)| lr 4.34e-04 | 4186.44 ms | 32.3% bf16 MFU | 125720 tok/s step 7356/19560 | loss 3.565318 (+2.36z)| norm 0.3009 (+0.92z)| lr 4.34e-04 | 4157.62 ms | 32.5% bf16 MFU | 125739 tok/s step 7357/19560 | loss 3.443094 (-0.43z)| norm 0.2746 (-0.41z)| lr 4.34e-04 | 4168.10 ms | 32.4% bf16 MFU | 125742 tok/s step 7358/19560 | loss 3.445265 (-0.37z)| norm 0.3142 (+1.62z)| lr 4.34e-04 | 4157.96 ms | 32.5% bf16 MFU | 125759 tok/s step 7359/19560 | loss 3.493988 (+0.76z)| norm 0.3015 (+0.97z)| lr 4.34e-04 | 4171.78 ms | 32.4% bf16 MFU | 125755 tok/s step 7360/19560 | loss 3.468828 (+0.18z)| norm 0.2522 (-1.53z)| lr 4.34e-04 | 4167.28 ms | 32.4% bf16 MFU | 125758 tok/s step 7361/19560 | loss 3.425762 (-0.82z)| norm 0.2700 (-0.61z)| lr 4.34e-04 | 4171.13 ms | 32.4% bf16 MFU | 125755 tok/s step 7362/19560 | loss 3.436085 (-0.59z)| norm 0.3079 (+1.29z)| lr 4.34e-04 | 4157.77 ms | 32.5% bf16 MFU | 125772 tok/s step 7363/19560 | loss 3.462792 (+0.06z)| norm 0.2591 (-1.16z)| lr 4.33e-04 | 4171.83 ms | 32.4% bf16 MFU | 125767 tok/s step 7364/19560 | loss 3.491452 (+0.85z)| norm 0.3158 (+1.68z)| lr 4.33e-04 | 4169.78 ms | 32.4% bf16 MFU | 125765 tok/s step 7365/19560 | loss 3.488844 (+0.79z)| norm 0.2793 (-0.15z)| lr 4.33e-04 | 4157.74 ms | 32.5% bf16 MFU | 125782 tok/s step 7366/19560 | loss 3.532490 (+1.89z)| norm 0.3024 (+1.00z)| lr 4.33e-04 | 4160.44 ms | 32.5% bf16 MFU | 125794 tok/s step 7367/19560 | loss 3.432062 (-0.72z)| norm 0.3021 (+0.98z)| lr 4.33e-04 | 4163.76 ms | 32.4% bf16 MFU | 125800 tok/s step 7368/19560 | loss 3.489403 (+0.80z)| norm 0.2987 (+0.80z)| lr 4.33e-04 | 4166.69 ms | 32.4% bf16 MFU | 125801 tok/s step 7369/19560 | loss 3.499817 (+1.12z)| norm 0.2603 (-1.10z)| lr 4.33e-04 | 4213.33 ms | 32.0% bf16 MFU | 125733 tok/s step 
7370/19560 | loss 3.465504 (+0.18z)| norm 0.2763 (-0.31z)| lr 4.33e-04 | 4212.35 ms | 32.1% bf16 MFU | 125670 tok/s step 7371/19560 | loss 3.505680 (+1.27z)| norm 0.2723 (-0.51z)| lr 4.33e-04 | 4193.41 ms | 32.2% bf16 MFU | 125638 tok/s step 7372/19560 | loss 3.465180 (+0.17z)| norm 0.2794 (-0.15z)| lr 4.33e-04 | 4173.63 ms | 32.4% bf16 MFU | 125637 tok/s step 7373/19560 | loss 3.421134 (-1.03z)| norm 0.3060 (+1.15z)| lr 4.33e-04 | 4161.24 ms | 32.4% bf16 MFU | 125654 tok/s step 7374/19560 | loss 3.487161 (+0.81z)| norm 0.2890 (+0.31z)| lr 4.33e-04 | 4216.55 ms | 32.0% bf16 MFU | 125589 tok/s step 7375/19560 | loss 3.553562 (+2.57z)| norm 0.2597 (-1.14z)| lr 4.33e-04 | 4158.00 ms | 32.5% bf16 MFU | 125614 tok/s step 7376/19560 | loss 3.632045 (+4.32z)| norm 0.3087 (+1.26z)| lr 4.33e-04 | 4160.51 ms | 32.5% bf16 MFU | 125634 tok/s step 7377/19560 | loss 3.494364 (+0.84z)| norm 0.3325 (+2.35z)| lr 4.33e-04 | 4217.84 ms | 32.0% bf16 MFU | 125567 tok/s step 7378/19560 | loss 3.524779 (+1.57z)| norm 0.3231 (+1.86z)| lr 4.33e-04 | 4173.15 ms | 32.4% bf16 MFU | 125571 tok/s step 7379/19560 | loss 3.448564 (-0.33z)| norm 0.2901 (+0.29z)| lr 4.33e-04 | 4168.35 ms | 32.4% bf16 MFU | 125581 tok/s step 7380/19560 | loss 3.421868 (-0.99z)| norm 0.3112 (+1.27z)| lr 4.33e-04 | 4166.94 ms | 32.4% bf16 MFU | 125593 tok/s step 7381/19560 | loss 3.487584 (+0.65z)| norm 0.2683 (-0.76z)| lr 4.33e-04 | 4166.82 ms | 32.4% bf16 MFU | 125605 tok/s step 7382/19560 | loss 3.509399 (+1.18z)| norm 0.2903 (+0.29z)| lr 4.33e-04 | 4159.84 ms | 32.5% bf16 MFU | 125626 tok/s step 7383/19560 | loss 3.512283 (+1.24z)| norm 0.2880 (+0.17z)| lr 4.33e-04 | 4155.49 ms | 32.5% bf16 MFU | 125653 tok/s step 7384/19560 | loss 3.505547 (+1.05z)| norm 0.2715 (-0.61z)| lr 4.33e-04 | 4164.37 ms | 32.4% bf16 MFU | 125665 tok/s step 7385/19560 | loss 3.482699 (+0.47z)| norm 0.2813 (-0.15z)| lr 4.32e-04 | 4160.43 ms | 32.5% bf16 MFU | 125683 tok/s step 7386/19560 | loss 3.429713 (-0.86z)| norm 0.2912 (+0.33z)| lr 
4.32e-04 | 4161.85 ms | 32.4% bf16 MFU | 125698 tok/s step 7387/19560 | loss 3.530754 (+1.68z)| norm 0.2760 (-0.39z)| lr 4.32e-04 | 4155.75 ms | 32.5% bf16 MFU | 125721 tok/s step 7388/19560 | loss 3.509201 (+1.13z)| norm 0.2973 (+0.66z)| lr 4.32e-04 | 4156.94 ms | 32.5% bf16 MFU | 125741 tok/s step 7389/19560 | loss 3.542239 (+1.91z)| norm 0.2648 (-0.92z)| lr 4.32e-04 | 4165.12 ms | 32.4% bf16 MFU | 125748 tok/s step 7390/19560 | loss 3.584259 (+2.83z)| norm 0.3031 (+0.96z)| lr 4.32e-04 | 4171.01 ms | 32.4% bf16 MFU | 125745 tok/s step 7391/19560 | loss 3.480276 (+0.34z)| norm 0.2923 (+0.44z)| lr 4.32e-04 | 4160.80 ms | 32.4% bf16 MFU | 125758 tok/s step 7392/19560 | loss 3.573146 (+2.48z)| norm 0.2622 (-1.04z)| lr 4.32e-04 | 4154.73 ms | 32.5% bf16 MFU | 125780 tok/s step 7393/19560 | loss 3.433835 (-0.77z)| norm 0.3328 (+2.40z)| lr 4.32e-04 | 4165.67 ms | 32.4% bf16 MFU | 125784 tok/s step 7394/19560 | loss 3.451562 (-0.35z)| norm 0.2964 (+0.71z)| lr 4.32e-04 | 4171.96 ms | 32.4% bf16 MFU | 125778 tok/s step 7395/19560 | loss 3.498189 (+0.72z)| norm 0.2749 (-0.41z)| lr 4.32e-04 | 4163.23 ms | 32.4% bf16 MFU | 125786 tok/s step 7396/19560 | loss 3.507692 (+0.93z)| norm 0.2754 (-0.37z)| lr 4.32e-04 | 4173.67 ms | 32.3% bf16 MFU | 125777 tok/s step 7397/19560 | loss 3.498757 (+0.71z)| norm 0.2880 (+0.33z)| lr 4.32e-04 | 4160.92 ms | 32.4% bf16 MFU | 125789 tok/s step 7398/19560 | loss 3.508046 (+0.93z)| norm 0.2752 (-0.37z)| lr 4.32e-04 | 4155.98 ms | 32.5% bf16 MFU | 125807 tok/s step 7399/19560 | loss 3.434681 (-0.77z)| norm 0.2875 (+0.33z)| lr 4.32e-04 | 4165.83 ms | 32.4% bf16 MFU | 125809 tok/s step 7400/19560 | loss 3.506267 (+0.88z)| norm 0.2772 (-0.24z)| lr 4.32e-04 | 4163.10 ms | 32.4% bf16 MFU | 125816 tok/s step 7401/19560 | loss 3.610662 (+3.16z)| norm 0.2903 (+0.51z)| lr 4.32e-04 | 4183.42 ms | 32.3% bf16 MFU | 125791 tok/s step 7402/19560 | loss 3.420008 (-1.10z)| norm 0.2806 (-0.03z)| lr 4.32e-04 | 4157.08 ms | 32.5% bf16 MFU | 125808 tok/s step 
7403/19560 | loss 3.453710 (-0.35z)| norm 0.3110 (+1.68z)| lr 4.32e-04 | 4165.56 ms | 32.4% bf16 MFU | 125810 tok/s step 7404/19560 | loss 3.416877 (-1.16z)| norm 0.2939 (+0.71z)| lr 4.32e-04 | 4145.79 ms | 32.6% bf16 MFU | 125843 tok/s step 7405/19560 | loss 3.439647 (-0.66z)| norm 0.2758 (-0.33z)| lr 4.32e-04 | 4153.59 ms | 32.5% bf16 MFU | 125862 tok/s step 7406/19560 | loss 3.497459 (+0.64z)| norm 0.2747 (-0.39z)| lr 4.32e-04 | 4171.18 ms | 32.4% bf16 MFU | 125854 tok/s step 7407/19560 | loss 3.480421 (+0.26z)| norm 0.2763 (-0.30z)| lr 4.32e-04 | 4161.22 ms | 32.4% bf16 MFU | 125861 tok/s step 7408/19560 | loss 3.419633 (-1.10z)| norm 0.2610 (-1.17z)| lr 4.31e-04 | 4165.72 ms | 32.4% bf16 MFU | 125860 tok/s step 7409/19560 | loss 3.485908 (+0.38z)| norm 0.2593 (-1.25z)| lr 4.31e-04 | 4160.19 ms | 32.5% bf16 MFU | 125869 tok/s step 7410/19560 | loss 3.458581 (-0.24z)| norm 0.2792 (-0.12z)| lr 4.31e-04 | 4159.03 ms | 32.5% bf16 MFU | 125878 tok/s step 7411/19560 | loss 3.513574 (+0.99z)| norm 0.3154 (+1.89z)| lr 4.31e-04 | 4168.60 ms | 32.4% bf16 MFU | 125873 tok/s step 7412/19560 | loss 3.417375 (-1.16z)| norm 0.2715 (-0.57z)| lr 4.31e-04 | 4174.02 ms | 32.3% bf16 MFU | 125860 tok/s step 7413/19560 | loss 3.522115 (+1.16z)| norm 0.2790 (-0.15z)| lr 4.31e-04 | 4180.91 ms | 32.3% bf16 MFU | 125837 tok/s step 7414/19560 | loss 3.486396 (+0.36z)| norm 0.2677 (-0.77z)| lr 4.31e-04 | 4158.18 ms | 32.5% bf16 MFU | 125849 tok/s step 7415/19560 | loss 3.520679 (+1.12z)| norm 0.2623 (-1.06z)| lr 4.31e-04 | 4170.68 ms | 32.4% bf16 MFU | 125842 tok/s step 7416/19560 | loss 3.499960 (+0.65z)| norm 0.2739 (-0.40z)| lr 4.31e-04 | 4163.94 ms | 32.4% bf16 MFU | 125846 tok/s step 7417/19560 | loss 3.477909 (+0.16z)| norm 0.2617 (-1.07z)| lr 4.31e-04 | 4174.96 ms | 32.3% bf16 MFU | 125832 tok/s step 7418/19560 | loss 3.439501 (-0.71z)| norm 0.2729 (-0.44z)| lr 4.31e-04 | 4167.41 ms | 32.4% bf16 MFU | 125831 tok/s step 7419/19560 | loss 3.475299 (+0.09z)| norm 0.2775 (-0.17z)| lr 
4.31e-04 | 4168.98 ms | 32.4% bf16 MFU | 125827 tok/s step 7420/19560 | loss 3.545859 (+1.63z)| norm 0.2642 (-0.91z)| lr 4.31e-04 | 4169.81 ms | 32.4% bf16 MFU | 125823 tok/s step 7421/19560 | loss 3.508047 (+0.79z)| norm 0.2678 (-0.69z)| lr 4.31e-04 | 4163.38 ms | 32.4% bf16 MFU | 125828 tok/s step 7422/19560 | loss 3.469758 (-0.06z)| norm 0.2867 (+0.40z)| lr 4.31e-04 | 4169.34 ms | 32.4% bf16 MFU | 125824 tok/s step 7423/19560 | loss 3.426550 (-1.01z)| norm 0.2632 (-0.94z)| lr 4.31e-04 | 4157.91 ms | 32.5% bf16 MFU | 125838 tok/s step 7424/19560 | loss 3.478550 (+0.14z)| norm 0.2777 (-0.10z)| lr 4.31e-04 | 4155.69 ms | 32.5% bf16 MFU | 125854 tok/s step 7425/19560 | loss 3.517937 (+1.00z)| norm 0.2544 (-1.42z)| lr 4.31e-04 | 4165.66 ms | 32.4% bf16 MFU | 125854 tok/s step 7426/19560 | loss 3.470260 (-0.06z)| norm 0.2602 (-1.08z)| lr 4.31e-04 | 4161.24 ms | 32.4% bf16 MFU | 125861 tok/s step 7427/19560 | loss 3.431459 (-0.91z)| norm 0.2438 (-1.96z)| lr 4.31e-04 | 4164.63 ms | 32.4% bf16 MFU | 125862 tok/s step 7428/19560 | loss 3.485474 (+0.27z)| norm 0.2580 (-1.15z)| lr 4.31e-04 | 4167.92 ms | 32.4% bf16 MFU | 125859 tok/s step 7429/19560 | loss 3.543046 (+1.55z)| norm 0.2546 (-1.33z)| lr 4.31e-04 | 4163.43 ms | 32.4% bf16 MFU | 125862 tok/s step 7430/19560 | loss 3.613824 (+2.99z)| norm 0.2663 (-0.67z)| lr 4.30e-04 | 4169.39 ms | 32.4% bf16 MFU | 125857 tok/s step 7431/19560 | loss 3.462585 (-0.25z)| norm 0.2639 (-0.81z)| lr 4.30e-04 | 4162.05 ms | 32.4% bf16 MFU | 125862 tok/s step 7432/19560 | loss 3.459484 (-0.32z)| norm 0.2748 (-0.20z)| lr 4.30e-04 | 4165.76 ms | 32.4% bf16 MFU | 125862 tok/s step 7433/19560 | loss 3.459393 (-0.32z)| norm 0.2497 (-1.57z)| lr 4.30e-04 | 4168.14 ms | 32.4% bf16 MFU | 125858 tok/s step 7434/19560 | loss 3.477057 (+0.06z)| norm 0.2891 (+0.59z)| lr 4.30e-04 | 4153.75 ms | 32.5% bf16 MFU | 125876 tok/s step 7435/19560 | loss 3.468459 (-0.13z)| norm 0.2876 (+0.51z)| lr 4.30e-04 | 4159.73 ms | 32.5% bf16 MFU | 125884 tok/s step 
7436/19560 | loss 3.520715 (+0.98z)| norm 0.2823 (+0.21z)| lr 4.30e-04 | 4165.33 ms | 32.4% bf16 MFU | 125884 tok/s step 7437/19560 | loss 3.467317 (-0.16z)| norm 0.2708 (-0.43z)| lr 4.30e-04 | 4167.98 ms | 32.4% bf16 MFU | 125879 tok/s step 7438/19560 | loss 3.448145 (-0.57z)| norm 0.2900 (+0.62z)| lr 4.30e-04 | 4168.20 ms | 32.4% bf16 MFU | 125874 tok/s step 7439/19560 | loss 3.458632 (-0.35z)| norm 0.2892 (+0.57z)| lr 4.30e-04 | 4164.56 ms | 32.4% bf16 MFU | 125875 tok/s step 7440/19560 | loss 3.454092 (-0.44z)| norm 0.3128 (+1.83z)| lr 4.30e-04 | 4172.06 ms | 32.4% bf16 MFU | 125865 tok/s step 7441/19560 | loss 3.492945 (+0.38z)| norm 0.2940 (+0.79z)| lr 4.30e-04 | 4168.93 ms | 32.4% bf16 MFU | 125859 tok/s step 7442/19560 | loss 3.524869 (+1.05z)| norm 0.2865 (+0.38z)| lr 4.30e-04 | 4171.14 ms | 32.4% bf16 MFU | 125851 tok/s step 7443/19560 | loss 3.393401 (-1.76z)| norm 0.3106 (+1.66z)| lr 4.30e-04 | 4167.86 ms | 32.4% bf16 MFU | 125848 tok/s step 7444/19560 | loss 3.386377 (-1.87z)| norm 0.2856 (+0.31z)| lr 4.30e-04 | 4158.93 ms | 32.5% bf16 MFU | 125859 tok/s step 7445/19560 | loss 3.573379 (+2.03z)| norm 0.2895 (+0.51z)| lr 4.30e-04 | 4164.69 ms | 32.4% bf16 MFU | 125860 tok/s step 7446/19560 | loss 3.472552 (-0.07z)| norm 0.3003 (+1.07z)| lr 4.30e-04 | 4158.66 ms | 32.5% bf16 MFU | 125871 tok/s step 7447/19560 | loss 3.470955 (-0.11z)| norm 0.2626 (-0.95z)| lr 4.30e-04 | 4163.15 ms | 32.4% bf16 MFU | 125874 tok/s step 7448/19560 | loss 3.463965 (-0.25z)| norm 0.3067 (+1.40z)| lr 4.30e-04 | 4167.10 ms | 32.4% bf16 MFU | 125871 tok/s step 7449/19560 | loss 3.490807 (+0.30z)| norm 0.2780 (-0.12z)| lr 4.30e-04 | 4167.38 ms | 32.4% bf16 MFU | 125868 tok/s step 7450/19560 | loss 3.461371 (-0.32z)| norm 0.2741 (-0.34z)| lr 4.30e-04 | 4153.65 ms | 32.5% bf16 MFU | 125886 tok/s step 7451/19560 | loss 3.563035 (+1.80z)| norm 0.2807 (+0.00z)| lr 4.30e-04 | 4183.02 ms | 32.3% bf16 MFU | 125858 tok/s step 7452/19560 | loss 3.477880 (+0.02z)| norm 0.2986 (+0.99z)| lr 
4.29e-04 | 4158.52 ms | 32.5% bf16 MFU | 125869 tok/s step 7453/19560 | loss 3.434977 (-0.89z)| norm 0.2732 (-0.41z)| lr 4.29e-04 | 4163.89 ms | 32.4% bf16 MFU | 125871 tok/s step 7454/19560 | loss 3.464762 (-0.27z)| norm 0.2614 (-1.04z)| lr 4.29e-04 | 4153.64 ms | 32.5% bf16 MFU | 125889 tok/s step 7455/19560 | loss 3.435100 (-0.90z)| norm 0.3297 (+2.60z)| lr 4.29e-04 | 4158.62 ms | 32.5% bf16 MFU | 125898 tok/s step 7456/19560 | loss 3.456024 (-0.45z)| norm 0.3092 (+1.49z)| lr 4.29e-04 | 4153.12 ms | 32.5% bf16 MFU | 125915 tok/s step 7457/19560 | loss 3.517086 (+0.83z)| norm 0.3025 (+1.12z)| lr 4.29e-04 | 4163.80 ms | 32.4% bf16 MFU | 125915 tok/s step 7458/19560 | loss 3.446375 (-0.68z)| norm 0.3064 (+1.32z)| lr 4.29e-04 | 4177.03 ms | 32.3% bf16 MFU | 125895 tok/s step 7459/19560 | loss 3.473101 (-0.10z)| norm 0.2756 (-0.30z)| lr 4.29e-04 | 4164.59 ms | 32.4% bf16 MFU | 125895 tok/s step 7460/19560 | loss 3.399337 (-1.67z)| norm 0.2812 (-0.00z)| lr 4.29e-04 | 4159.53 ms | 32.5% bf16 MFU | 125903 tok/s step 7461/19560 | loss 3.454885 (-0.48z)| norm 0.2892 (+0.41z)| lr 4.29e-04 | 4155.51 ms | 32.5% bf16 MFU | 125916 tok/s step 7462/19560 | loss 3.504953 (+0.59z)| norm 0.2789 (-0.14z)| lr 4.29e-04 | 4160.95 ms | 32.4% bf16 MFU | 125920 tok/s step 7463/19560 | loss 3.464906 (-0.27z)| norm 0.2838 (+0.11z)| lr 4.29e-04 | 4177.55 ms | 32.3% bf16 MFU | 125899 tok/s step 7464/19560 | loss 3.488349 (+0.22z)| norm 0.2908 (+0.47z)| lr 4.29e-04 | 4165.35 ms | 32.4% bf16 MFU | 125898 tok/s step 7465/19560 | loss 3.483587 (+0.11z)| norm 0.2737 (-0.46z)| lr 4.29e-04 | 4169.69 ms | 32.4% bf16 MFU | 125890 tok/s step 7466/19560 | loss 3.462289 (-0.35z)| norm 0.2848 (+0.14z)| lr 4.29e-04 | 4158.89 ms | 32.5% bf16 MFU | 125899 tok/s step 7467/19560 | loss 3.448570 (-0.66z)| norm 0.2978 (+0.83z)| lr 4.29e-04 | 4169.88 ms | 32.4% bf16 MFU | 125890 tok/s step 7468/19560 | loss 3.530808 (+1.12z)| norm 0.3066 (+1.28z)| lr 4.29e-04 | 4151.81 ms | 32.5% bf16 MFU | 125910 tok/s step 
7469/19560 | loss 3.480589 (+0.01z)| norm 0.3174 (+1.83z)| lr 4.29e-04 | 4182.95 ms | 32.3% bf16 MFU | 125881 tok/s step 7470/19560 | loss 3.573214 (+2.00z)| norm 0.2784 (-0.26z)| lr 4.29e-04 | 4161.77 ms | 32.4% bf16 MFU | 125886 tok/s step 7471/19560 | loss 3.466246 (-0.33z)| norm 0.2927 (+0.51z)| lr 4.29e-04 | 4165.68 ms | 32.4% bf16 MFU | 125885 tok/s step 7472/19560 | loss 3.475824 (-0.13z)| norm 0.2795 (-0.21z)| lr 4.29e-04 | 4166.21 ms | 32.4% bf16 MFU | 125882 tok/s step 7473/19560 | loss 3.506474 (+0.54z)| norm 0.2790 (-0.25z)| lr 4.29e-04 | 4171.51 ms | 32.4% bf16 MFU | 125873 tok/s step 7474/19560 | loss 3.456187 (-0.56z)| norm 0.2740 (-0.53z)| lr 4.28e-04 | 4160.19 ms | 32.5% bf16 MFU | 125880 tok/s step 7475/19560 | loss 3.462192 (-0.45z)| norm 0.2571 (-1.43z)| lr 4.28e-04 | 4161.00 ms | 32.4% bf16 MFU | 125886 tok/s step 7476/19560 | loss 3.542886 (+1.35z)| norm 0.2841 (+0.03z)| lr 4.28e-04 | 4166.74 ms | 32.4% bf16 MFU | 125883 tok/s step 7477/19560 | loss 3.485158 (+0.05z)| norm 0.3131 (+1.58z)| lr 4.28e-04 | 4170.88 ms | 32.4% bf16 MFU | 125874 tok/s step 7478/19560 | loss 3.493348 (+0.23z)| norm 0.2671 (-0.90z)| lr 4.28e-04 | 4163.42 ms | 32.4% bf16 MFU | 125877 tok/s step 7479/19560 | loss 3.490814 (+0.17z)| norm 0.2875 (+0.20z)| lr 4.28e-04 | 4162.13 ms | 32.4% bf16 MFU | 125881 tok/s step 7480/19560 | loss 3.490649 (+0.16z)| norm 0.2649 (-1.01z)| lr 4.28e-04 | 4163.50 ms | 32.4% bf16 MFU | 125883 tok/s step 7481/19560 | loss 3.483239 (-0.01z)| norm 0.2667 (-0.93z)| lr 4.28e-04 | 4161.78 ms | 32.4% bf16 MFU | 125888 tok/s step 7482/19560 | loss 3.422071 (-1.37z)| norm 0.2619 (-1.18z)| lr 4.28e-04 | 4180.36 ms | 32.3% bf16 MFU | 125865 tok/s step 7483/19560 | loss 3.430256 (-1.17z)| norm 0.2621 (-1.15z)| lr 4.28e-04 | 4158.19 ms | 32.5% bf16 MFU | 125876 tok/s step 7484/19560 | loss 3.493546 (+0.26z)| norm 0.2694 (-0.74z)| lr 4.28e-04 | 4170.06 ms | 32.4% bf16 MFU | 125868 tok/s step 7485/19560 | loss 3.472417 (-0.23z)| norm 0.2702 (-0.69z)| lr 
4.28e-04 | 4162.21 ms | 32.4% bf16 MFU | 125873 tok/s step 7486/19560 | loss 3.533047 (+1.13z)| norm 0.2730 (-0.53z)| lr 4.28e-04 | 4169.65 ms | 32.4% bf16 MFU | 125866 tok/s step 7487/19560 | loss 3.558048 (+1.67z)| norm 0.2690 (-0.74z)| lr 4.28e-04 | 4164.54 ms | 32.4% bf16 MFU | 125868 tok/s step 7488/19560 | loss 3.500781 (+0.38z)| norm 0.2809 (-0.10z)| lr 4.28e-04 | 4162.12 ms | 32.4% bf16 MFU | 125873 tok/s step 7489/19560 | loss 3.450228 (-0.76z)| norm 0.2828 (-0.00z)| lr 4.28e-04 | 4154.64 ms | 32.5% bf16 MFU | 125889 tok/s step 7490/19560 | loss 3.521816 (+0.84z)| norm 0.2853 (+0.15z)| lr 4.28e-04 | 4157.97 ms | 32.5% bf16 MFU | 125899 tok/s step 7491/19560 | loss 3.465455 (-0.43z)| norm 0.2935 (+0.59z)| lr 4.28e-04 | 4153.50 ms | 32.5% bf16 MFU | 125915 tok/s step 7492/19560 | loss 3.487988 (+0.08z)| norm 0.2978 (+0.85z)| lr 4.28e-04 | 4167.46 ms | 32.4% bf16 MFU | 125910 tok/s step 7493/19560 | loss 3.435227 (-1.10z)| norm 0.2643 (-1.03z)| lr 4.28e-04 | 4165.36 ms | 32.4% bf16 MFU | 125908 tok/s step 7494/19560 | loss 3.371592 (-2.45z)| norm 0.2676 (-0.83z)| lr 4.28e-04 | 4164.53 ms | 32.4% bf16 MFU | 125907 tok/s step 7495/19560 | loss 3.503576 (+0.44z)| norm 0.2800 (-0.13z)| lr 4.28e-04 | 4163.63 ms | 32.4% bf16 MFU | 125908 tok/s step 7496/19560 | loss 3.413194 (-1.53z)| norm 0.2931 (+0.62z)| lr 4.27e-04 | 4175.36 ms | 32.3% bf16 MFU | 125891 tok/s step 7497/19560 | loss 3.460277 (-0.49z)| norm 0.2801 (-0.13z)| lr 4.27e-04 | 4183.49 ms | 32.3% bf16 MFU | 125862 tok/s step 7498/19560 | loss 3.487854 (+0.11z)| norm 0.2732 (-0.52z)| lr 4.27e-04 | 4193.47 ms | 32.2% bf16 MFU | 125820 tok/s step 7499/19560 | loss 3.488637 (+0.13z)| norm 0.2879 (+0.32z)| lr 4.27e-04 | 4172.61 ms | 32.4% bf16 MFU | 125812 tok/s step 7500/19560 | loss 3.458989 (-0.52z)| norm 0.2656 (-0.96z)| lr 4.27e-04 | 4172.35 ms | 32.4% bf16 MFU | 125804 tok/s val loss 3.455716 evaluating HellaSwag: 0/1256 evaluating HellaSwag: 10/1256 evaluating HellaSwag: 20/1256 evaluating HellaSwag: 
30/1256 ... evaluating HellaSwag: 1250/1256 HellaSwag: 2855/10042 = 0.284306 step 7501/19560 | loss 3.665709 (+3.77z)| norm 0.2795 (-0.15z)| lr 4.27e-04 | 4165.25 ms | 32.4% bf16 MFU | 125808 tok/s step 7502/19560 | loss 3.559073 
(+1.53z)| norm 0.3201 (+2.13z)| lr 4.27e-04 | 4160.24 ms | 32.5% bf16 MFU | 125818 tok/s step 7503/19560 | loss 3.439543 (-0.92z)| norm 0.2690 (-0.76z)| lr 4.27e-04 | 4151.28 ms | 32.5% bf16 MFU | 125842 tok/s step 7504/19560 | loss 3.556583 (+1.56z)| norm 0.8491 (+10.63z)| lr 4.27e-04 | 4168.40 ms | 32.4% bf16 MFU | 125839 tok/s step 7505/19560 | loss 3.475759 (-0.16z)| norm 0.3358 (+0.93z)| lr 4.27e-04 | 4165.47 ms | 32.4% bf16 MFU | 125840 tok/s step 7506/19560 | loss 3.532620 (+1.05z)| norm 0.3353 (+0.92z)| lr 4.27e-04 | 4167.36 ms | 32.4% bf16 MFU | 125839 tok/s step 7507/19560 | loss 3.488667 (+0.10z)| norm 0.3344 (+0.89z)| lr 4.27e-04 | 4159.21 ms | 32.5% bf16 MFU | 125849 tok/s step 7508/19560 | loss 3.497630 (+0.28z)| norm 0.2634 (-0.44z)| lr 4.27e-04 | 4171.99 ms | 32.4% bf16 MFU | 125840 tok/s step 7509/19560 | loss 3.519614 (+0.75z)| norm 0.3161 (+0.54z)| lr 4.27e-04 | 4179.00 ms | 32.3% bf16 MFU | 125821 tok/s step 7510/19560 | loss 3.432625 (-1.10z)| norm 0.2864 (-0.01z)| lr 4.27e-04 | 4177.73 ms | 32.3% bf16 MFU | 125805 tok/s step 7511/19560 | loss 3.510911 (+0.57z)| norm 0.2732 (-0.26z)| lr 4.27e-04 | 4166.25 ms | 32.4% bf16 MFU | 125807 tok/s step 7512/19560 | loss 3.424428 (-1.25z)| norm 0.2685 (-0.35z)| lr 4.27e-04 | 4160.15 ms | 32.5% bf16 MFU | 125818 tok/s step 7513/19560 | loss 3.458118 (-0.53z)| norm 0.2668 (-0.38z)| lr 4.27e-04 | 4246.74 ms | 31.8% bf16 MFU | 125700 tok/s step 7514/19560 | loss 3.556814 (+1.53z)| norm 0.2645 (-0.41z)| lr 4.27e-04 | 4181.69 ms | 32.3% bf16 MFU | 125684 tok/s step 7515/19560 | loss 3.504536 (+0.44z)| norm 0.2618 (-0.46z)| lr 4.27e-04 | 4165.60 ms | 32.4% bf16 MFU | 125693 tok/s step 7516/19560 | loss 3.497370 (+0.29z)| norm 0.2671 (-0.36z)| lr 4.27e-04 | 4181.12 ms | 32.3% bf16 MFU | 125678 tok/s step 7517/19560 | loss 3.508095 (+0.52z)| norm 0.2579 (-0.53z)| lr 4.27e-04 | 4177.43 ms | 32.3% bf16 MFU | 125669 tok/s step 7518/19560 | loss 3.603429 (+2.53z)| norm 0.2612 (-0.46z)| lr 4.26e-04 | 4160.60 ms | 
32.5% bf16 MFU | 125686 tok/s step 7519/19560 | loss 3.395728 (-1.83z)| norm 0.2602 (-0.48z)| lr 4.26e-04 | 4162.26 ms | 32.4% bf16 MFU | 125700 tok/s step 7520/19560 | loss 3.443234 (-0.82z)| norm 0.2557 (-0.56z)| lr 4.26e-04 | 4155.92 ms | 32.5% bf16 MFU | 125723 tok/s step 7521/19560 | loss 3.500065 (+0.37z)| norm 0.2675 (-0.33z)| lr 4.26e-04 | 4168.43 ms | 32.4% bf16 MFU | 125725 tok/s step 7522/19560 | loss 3.461742 (-0.44z)| norm 0.2992 (+0.26z)| lr 4.26e-04 | 4165.14 ms | 32.4% bf16 MFU | 125733 tok/s step 7523/19560 | loss 3.434676 (-1.01z)| norm 0.2800 (-0.10z)| lr 4.26e-04 | 4151.59 ms | 32.5% bf16 MFU | 125761 tok/s step 7524/19560 | loss 3.524476 (+0.89z)| norm 0.2812 (-0.07z)| lr 4.26e-04 | 4159.01 ms | 32.5% bf16 MFU | 125776 tok/s step 7525/19560 | loss 3.475480 (-0.14z)| norm 0.2931 (+0.15z)| lr 4.26e-04 | 4171.26 ms | 32.4% bf16 MFU | 125771 tok/s step 7526/19560 | loss 3.486477 (+0.09z)| norm 0.3034 (+0.34z)| lr 4.26e-04 | 4163.67 ms | 32.4% bf16 MFU | 125779 tok/s step 7527/19560 | loss 3.581568 (+2.05z)| norm 0.2469 (-0.72z)| lr 4.26e-04 | 4166.68 ms | 32.4% bf16 MFU | 125781 tok/s step 7528/19560 | loss 3.543709 (+1.25z)| norm 0.2929 (+0.14z)| lr 4.26e-04 | 4174.75 ms | 32.3% bf16 MFU | 125771 tok/s step 7529/19560 | loss 3.487117 (+0.10z)| norm 0.2975 (+0.23z)| lr 4.26e-04 | 4168.43 ms | 32.4% bf16 MFU | 125772 tok/s step 7530/19560 | loss 3.477466 (-0.12z)| norm 0.2843 (-0.02z)| lr 4.26e-04 | 4188.07 ms | 32.2% bf16 MFU | 125742 tok/s step 7531/19560 | loss 3.483212 (+0.00z)| norm 0.2772 (-0.15z)| lr 4.26e-04 | 4161.47 ms | 32.4% bf16 MFU | 125755 tok/s step 7532/19560 | loss 3.523939 (+0.86z)| norm 0.2720 (-0.24z)| lr 4.26e-04 | 4160.67 ms | 32.5% bf16 MFU | 125767 tok/s step 7533/19560 | loss 3.478420 (-0.13z)| norm 0.3077 (+0.42z)| lr 4.26e-04 | 4166.54 ms | 32.4% bf16 MFU | 125771 tok/s step 7534/19560 | loss 3.424963 (-1.27z)| norm 0.2647 (-0.38z)| lr 4.26e-04 | 4159.86 ms | 32.5% bf16 MFU | 125784 tok/s step 7535/19560 | loss 3.489151 
(+0.12z)| norm 0.2646 (-0.38z)| lr 4.26e-04 | 4163.80 ms | 32.4% bf16 MFU | 125790 tok/s step 7536/19560 | loss 3.450381 (-0.73z)| norm 0.2905 (+0.10z)| lr 4.26e-04 | 4155.95 ms | 32.5% bf16 MFU | 125809 tok/s step 7537/19560 | loss 3.484157 (+0.00z)| norm 0.2827 (-0.05z)| lr 4.26e-04 | 4156.74 ms | 32.5% bf16 MFU | 125825 tok/s step 7538/19560 | loss 3.490927 (+0.14z)| norm 0.2810 (-0.08z)| lr 4.26e-04 | 4152.41 ms | 32.5% bf16 MFU | 125846 tok/s step 7539/19560 | loss 3.480522 (-0.08z)| norm 0.3002 (+0.28z)| lr 4.26e-04 | 4170.07 ms | 32.4% bf16 MFU | 125840 tok/s step 7540/19560 | loss 3.603703 (+2.53z)| norm 0.2646 (-0.39z)| lr 4.25e-04 | 4168.42 ms | 32.4% bf16 MFU | 125837 tok/s step 7541/19560 | loss 3.446741 (-0.82z)| norm 0.2845 (-0.01z)| lr 4.25e-04 | 4174.72 ms | 32.3% bf16 MFU | 125825 tok/s step 7542/19560 | loss 3.569464 (+1.77z)| norm 0.7802 (+7.13z)| lr 4.25e-04 | 4165.66 ms | 32.4% bf16 MFU | 125826 tok/s step 7543/19560 | loss 3.474717 (-0.22z)| norm 0.3315 (+0.60z)| lr 4.25e-04 | 4269.55 ms | 31.6% bf16 MFU | 125675 tok/s step 7544/19560 | loss 3.417831 (-1.40z)| norm 0.3143 (+0.35z)| lr 4.25e-04 | 4169.27 ms | 32.4% bf16 MFU | 125679 tok/s step 7545/19560 | loss 3.479116 (-0.11z)| norm 0.3165 (+0.38z)| lr 4.25e-04 | 4237.69 ms | 31.9% bf16 MFU | 125581 tok/s step 7546/19560 | loss 3.462970 (-0.46z)| norm 0.3176 (+0.39z)| lr 4.25e-04 | 4251.81 ms | 31.8% bf16 MFU | 125467 tok/s step 7547/19560 | loss 3.473976 (-0.23z)| norm 0.3009 (+0.14z)| lr 4.25e-04 | 4274.25 ms | 31.6% bf16 MFU | 125327 tok/s step 7548/19560 | loss 3.482090 (-0.04z)| norm 0.2946 (+0.05z)| lr 4.25e-04 | 4223.20 ms | 32.0% bf16 MFU | 125268 tok/s step 7549/19560 | loss 3.490399 (+0.13z)| norm 0.2761 (-0.22z)| lr 4.25e-04 | 4190.16 ms | 32.2% bf16 MFU | 125261 tok/s step 7550/19560 | loss 3.496876 (+0.27z)| norm 0.3215 (+0.43z)| lr 4.25e-04 | 4270.80 ms | 31.6% bf16 MFU | 125136 tok/s step 7551/19560 | loss 3.547613 (+1.32z)| norm 0.2603 (-0.45z)| lr 4.25e-04 | 4207.44 ms | 
32.1% bf16 MFU | 125109 tok/s step 7552/19560 | loss 3.470926 (-0.30z)| norm 0.2710 (-0.30z)| lr 4.25e-04 | 4175.87 ms | 32.3% bf16 MFU | 125132 tok/s step 7553/19560 | loss 3.467108 (-0.37z)| norm 0.2583 (-0.48z)| lr 4.25e-04 | 4223.72 ms | 32.0% bf16 MFU | 125081 tok/s step 7554/19560 | loss 3.460936 (-0.50z)| norm 0.2746 (-0.25z)| lr 4.25e-04 | 4199.84 ms | 32.1% bf16 MFU | 125069 tok/s step 7555/19560 | loss 3.492733 (+0.16z)| norm 0.2816 (-0.15z)| lr 4.25e-04 | 4287.44 ms | 31.5% bf16 MFU | 124930 tok/s step 7556/19560 | loss 3.458236 (-0.57z)| norm 0.2594 (-0.47z)| lr 4.25e-04 | 4163.08 ms | 32.4% bf16 MFU | 124980 tok/s step 7557/19560 | loss 3.446027 (-0.82z)| norm 0.2668 (-0.37z)| lr 4.25e-04 | 4183.00 ms | 32.3% bf16 MFU | 124998 tok/s step 7558/19560 | loss 3.414293 (-1.49z)| norm 0.2495 (-0.62z)| lr 4.25e-04 | 4167.98 ms | 32.4% bf16 MFU | 125038 tok/s step 7559/19560 | loss 3.484113 (+0.03z)| norm 0.2827 (-0.14z)| lr 4.25e-04 | 4164.63 ms | 32.4% bf16 MFU | 125080 tok/s step 7560/19560 | loss 3.492008 (+0.20z)| norm 0.2660 (-0.38z)| lr 4.25e-04 | 4167.30 ms | 32.4% bf16 MFU | 125117 tok/s step 7561/19560 | loss 3.463486 (-0.43z)| norm 0.2712 (-0.31z)| lr 4.25e-04 | 4177.21 ms | 32.3% bf16 MFU | 125137 tok/s step 7562/19560 | loss 3.450147 (-0.72z)| norm 0.2924 (+0.00z)| lr 4.24e-04 | 4190.71 ms | 32.2% bf16 MFU | 125135 tok/s step 7563/19560 | loss 3.495842 (+0.28z)| norm 0.2831 (-0.13z)| lr 4.24e-04 | 4161.09 ms | 32.4% bf16 MFU | 125178 tok/s step 7564/19560 | loss 3.482233 (-0.01z)| norm 0.2860 (-0.09z)| lr 4.24e-04 | 4176.31 ms | 32.3% bf16 MFU | 125196 tok/s step 7565/19560 | loss 3.474256 (-0.19z)| norm 0.2739 (-0.27z)| lr 4.24e-04 | 6221.43 ms | 21.7% bf16 MFU | 123150 tok/s step 7566/19560 | loss 3.476339 (-0.15z)| norm 0.2696 (-0.33z)| lr 4.24e-04 | 4181.30 ms | 32.3% bf16 MFU | 123262 tok/s step 7567/19560 | loss 3.483197 (-0.00z)| norm 0.2821 (-0.15z)| lr 4.24e-04 | 4178.57 ms | 32.3% bf16 MFU | 123372 tok/s step 7568/19560 | loss 3.475022 
(-0.19z)| norm 0.2933 (+0.02z)| lr 4.24e-04 | 4157.31 ms | 32.5% bf16 MFU | 123509 tok/s step 7569/19560 | loss 3.504171 (+0.45z)| norm 0.2668 (-0.36z)| lr 4.24e-04 | 4165.16 ms | 32.4% bf16 MFU | 123628 tok/s step 7570/19560 | loss 3.471932 (-0.25z)| norm 0.2797 (-0.17z)| lr 4.24e-04 | 4171.12 ms | 32.4% bf16 MFU | 123731 tok/s step 7571/19560 | loss 3.446338 (-0.83z)| norm 0.2683 (-0.34z)| lr 4.24e-04 | 4160.62 ms | 32.5% bf16 MFU | 123845 tok/s step 7572/19560 | loss 3.517637 (+0.75z)| norm 0.2973 (+0.09z)| lr 4.24e-04 | 4156.60 ms | 32.5% bf16 MFU | 123959 tok/s step 7573/19560 | loss 3.500740 (+0.39z)| norm 0.2646 (-0.39z)| lr 4.24e-04 | 4169.52 ms | 32.4% bf16 MFU | 124049 tok/s step 7574/19560 | loss 3.452109 (-0.73z)| norm 0.2732 (-0.26z)| lr 4.24e-04 | 4163.21 ms | 32.4% bf16 MFU | 124143 tok/s step 7575/19560 | loss 3.489780 (+0.13z)| norm 0.2863 (-0.07z)| lr 4.24e-04 | 4172.16 ms | 32.4% bf16 MFU | 124219 tok/s step 7576/19560 | loss 3.435755 (-1.10z)| norm 0.2632 (-0.40z)| lr 4.24e-04 | 4164.21 ms | 32.4% bf16 MFU | 124303 tok/s step 7577/19560 | loss 3.435175 (-1.10z)| norm 0.2792 (-0.17z)| lr 4.24e-04 | 4164.03 ms | 32.4% bf16 MFU | 124383 tok/s step 7578/19560 | loss 3.455033 (-0.64z)| norm 0.2902 (-0.01z)| lr 4.24e-04 | 4182.24 ms | 32.3% bf16 MFU | 124432 tok/s step 7579/19560 | loss 3.490455 (+0.18z)| norm 0.2834 (-0.11z)| lr 4.24e-04 | 4182.77 ms | 32.3% bf16 MFU | 124478 tok/s step 7580/19560 | loss 3.453749 (-0.66z)| norm 0.2508 (-0.58z)| lr 4.24e-04 | 4170.19 ms | 32.4% bf16 MFU | 124540 tok/s step 7581/19560 | loss 3.534232 (+1.18z)| norm 0.2618 (-0.42z)| lr 4.24e-04 | 4189.50 ms | 32.2% bf16 MFU | 124570 tok/s step 7582/19560 | loss 3.467888 (-0.36z)| norm 0.2571 (-0.49z)| lr 4.24e-04 | 4159.09 ms | 32.5% bf16 MFU | 124645 tok/s step 7583/19560 | loss 3.512678 (+0.67z)| norm 0.2817 (-0.12z)| lr 4.24e-04 | 4191.21 ms | 32.2% bf16 MFU | 124667 tok/s step 7584/19560 | loss 3.441859 (-0.97z)| norm 0.2575 (-0.47z)| lr 4.23e-04 | 4230.25 ms | 
31.9% bf16 MFU | 124631 tok/s step 7585/19560 | loss 3.464828 (-0.43z)| norm 0.2531 (-0.53z)| lr 4.23e-04 | 4168.66 ms | 32.4% bf16 MFU | 124688 tok/s step 7586/19560 | loss 3.445374 (-0.88z)| norm 0.2550 (-0.49z)| lr 4.23e-04 | 4179.30 ms | 32.3% bf16 MFU | 124726 tok/s step 7587/19560 | loss 3.453834 (-0.68z)| norm 0.2807 (-0.12z)| lr 4.23e-04 | 4162.99 ms | 32.4% bf16 MFU | 124786 tok/s step 7588/19560 | loss 3.472201 (-0.27z)| norm 0.2966 (+0.11z)| lr 4.23e-04 | 4166.31 ms | 32.4% bf16 MFU | 124839 tok/s step 7589/19560 | loss 3.448061 (-0.84z)| norm 0.2873 (-0.03z)| lr 4.23e-04 | 5191.27 ms | 26.0% bf16 MFU | 123647 tok/s step 7590/19560 | loss 3.472168 (-0.27z)| norm 0.2848 (-0.06z)| lr 4.23e-04 | 4163.38 ms | 32.4% bf16 MFU | 123761 tok/s step 7591/19560 | loss 3.458709 (-0.58z)| norm 0.2715 (-0.25z)| lr 4.23e-04 | 4169.00 ms | 32.4% bf16 MFU | 123861 tok/s step 7592/19560 | loss 3.502625 (+0.45z)| norm 0.2651 (-0.34z)| lr 4.23e-04 | 4161.84 ms | 32.4% bf16 MFU | 123966 tok/s step 7593/19560 | loss 3.463890 (-0.46z)| norm 0.2734 (-0.22z)| lr 4.23e-04 | 4158.13 ms | 32.5% bf16 MFU | 124072 tok/s step 7594/19560 | loss 3.442132 (-0.96z)| norm 0.2667 (-0.32z)| lr 4.23e-04 | 4180.21 ms | 32.3% bf16 MFU | 124140 tok/s step 7595/19560 | loss 3.419636 (-1.47z)| norm 0.2770 (-0.17z)| lr 4.23e-04 | 4168.73 ms | 32.4% bf16 MFU | 124221 tok/s step 7596/19560 | loss 3.456676 (-0.60z)| norm 0.2444 (-0.63z)| lr 4.23e-04 | 4154.08 ms | 32.5% bf16 MFU | 124321 tok/s step 7597/19560 | loss 3.507387 (+0.58z)| norm 0.2512 (-0.53z)| lr 4.23e-04 | 4158.17 ms | 32.5% bf16 MFU | 124409 tok/s step 7598/19560 | loss 3.493325 (+0.27z)| norm 0.2596 (-0.40z)| lr 4.23e-04 | 4178.48 ms | 32.3% bf16 MFU | 124462 tok/s step 7599/19560 | loss 3.474796 (-0.17z)| norm 0.2536 (-0.48z)| lr 4.23e-04 | 4155.13 ms | 32.5% bf16 MFU | 124548 tok/s step 7600/19560 | loss 3.463157 (-0.45z)| norm 0.2490 (-0.55z)| lr 4.23e-04 | 4166.31 ms | 32.4% bf16 MFU | 124613 tok/s step 7601/19560 | loss 3.424634 
(-1.33z)| norm 0.2689 (-0.26z)| lr 4.23e-04 | 4178.08 ms | 32.3% bf16 MFU | 124656 tok/s step 7602/19560 | loss 3.462255 (-0.45z)| norm 0.2483 (-0.55z)| lr 4.23e-04 | 4172.69 ms | 32.4% bf16 MFU | 124706 tok/s step 7603/19560 | loss 3.492121 (+0.25z)| norm 0.2622 (-0.35z)| lr 4.23e-04 | 4242.00 ms | 31.8% bf16 MFU | 124650 tok/s step 7604/19560 | loss 3.553113 (+1.68z)| norm 0.2722 (-0.21z)| lr 4.23e-04 | 4158.18 ms | 32.5% bf16 MFU | 124722 tok/s step 7605/19560 | loss 3.500453 (+0.44z)| norm 0.2691 (-0.25z)| lr 4.23e-04 | 4297.80 ms | 31.4% bf16 MFU | 124585 tok/s step 7606/19560 | loss 3.402544 (-1.82z)| norm 0.3249 (+0.55z)| lr 4.22e-04 | 4252.47 ms | 31.8% bf16 MFU | 124521 tok/s step 7607/19560 | loss 3.474065 (-0.16z)| norm 0.2862 (-0.01z)| lr 4.22e-04 | 4181.69 ms | 32.3% bf16 MFU | 124563 tok/s step 7608/19560 | loss 3.487058 (+0.14z)| norm 0.2618 (-0.36z)| lr 4.22e-04 | 4170.56 ms | 32.4% bf16 MFU | 124621 tok/s step 7609/19560 | loss 3.523632 (+0.98z)| norm 0.2856 (-0.02z)| lr 4.22e-04 | 4182.33 ms | 32.3% bf16 MFU | 124658 tok/s step 7610/19560 | loss 3.474219 (-0.17z)| norm 0.2863 (-0.01z)| lr 4.22e-04 | 4168.04 ms | 32.4% bf16 MFU | 124714 tok/s step 7611/19560 | loss 3.507189 (+0.58z)| norm 0.2920 (+0.07z)| lr 4.22e-04 | 4160.64 ms | 32.5% bf16 MFU | 124779 tok/s step 7612/19560 | loss 3.461433 (-0.48z)| norm 0.2950 (+0.11z)| lr 4.22e-04 | 4179.97 ms | 32.3% bf16 MFU | 124812 tok/s step 7613/19560 | loss 3.503423 (+0.49z)| norm 0.2688 (-0.27z)| lr 4.22e-04 | 4183.12 ms | 32.3% bf16 MFU | 124838 tok/s step 7614/19560 | loss 3.457385 (-0.57z)| norm 0.2787 (-0.13z)| lr 4.22e-04 | 4154.37 ms | 32.5% bf16 MFU | 124906 tok/s step 7615/19560 | loss 3.455272 (-0.61z)| norm 0.2819 (-0.08z)| lr 4.22e-04 | 4176.98 ms | 32.3% bf16 MFU | 124937 tok/s step 7616/19560 | loss 3.465943 (-0.35z)| norm 0.2758 (-0.17z)| lr 4.22e-04 | 4158.08 ms | 32.5% bf16 MFU | 124994 tok/s step 7617/19560 | loss 3.443023 (-0.89z)| norm 0.2551 (-0.46z)| lr 4.22e-04 | 4170.58 ms | 
32.4% bf16 MFU | 125030 tok/s step 7618/19560 | loss 3.488584 (+0.20z)| norm 0.2825 (-0.07z)| lr 4.22e-04 | 4188.16 ms | 32.2% bf16 MFU | 125038 tok/s step 7619/19560 | loss 3.489959 (+0.22z)| norm 0.2944 (+0.10z)| lr 4.22e-04 | 4155.70 ms | 32.5% bf16 MFU | 125094 tok/s step 7620/19560 | loss 3.479764 (-0.02z)| norm 0.2848 (-0.03z)| lr 4.22e-04 | 4181.81 ms | 32.3% bf16 MFU | 125108 tok/s step 7621/19560 | loss 3.491997 (+0.27z)| norm 0.2875 (+0.00z)| lr 4.22e-04 | 4256.00 ms | 31.7% bf16 MFU | 125012 tok/s step 7622/19560 | loss 3.505433 (+0.58z)| norm 0.3189 (+0.45z)| lr 4.22e-04 | 4168.62 ms | 32.4% bf16 MFU | 125050 tok/s step 7623/19560 | loss 3.432126 (-1.20z)| norm 0.2710 (-0.24z)| lr 4.22e-04 | 4219.39 ms | 32.0% bf16 MFU | 125010 tok/s step 7624/19560 | loss 3.419306 (-1.52z)| norm 0.2695 (-0.26z)| lr 4.22e-04 | 4159.15 ms | 32.5% bf16 MFU | 125062 tok/s step 7625/19560 | loss 3.470847 (-0.26z)| norm 0.2761 (-0.16z)| lr 4.22e-04 | 4152.96 ms | 32.5% bf16 MFU | 125122 tok/s step 7626/19560 | loss 3.549584 (+1.64z)| norm 0.2875 (-0.00z)| lr 4.22e-04 | 4164.48 ms | 32.4% bf16 MFU | 125160 tok/s step 7627/19560 | loss 3.478074 (-0.09z)| norm 0.2729 (-0.21z)| lr 4.22e-04 | 4173.69 ms | 32.3% bf16 MFU | 125183 tok/s step 7628/19560 | loss 3.469517 (-0.30z)| norm 0.2833 (-0.06z)| lr 4.21e-04 | 4172.59 ms | 32.4% bf16 MFU | 125206 tok/s step 7629/19560 | loss 3.550820 (+1.81z)| norm 0.2687 (-0.27z)| lr 4.21e-04 | 4182.12 ms | 32.3% bf16 MFU | 125214 tok/s step 7630/19560 | loss 3.523957 (+1.14z)| norm 0.3195 (+0.46z)| lr 4.21e-04 | 4149.78 ms | 32.5% bf16 MFU | 125271 tok/s step 7631/19560 | loss 3.430341 (-1.32z)| norm 0.2581 (-0.42z)| lr 4.21e-04 | 4172.36 ms | 32.4% bf16 MFU | 125290 tok/s step 7632/19560 | loss 3.429360 (-1.33z)| norm 0.2983 (+0.31z)| lr 4.21e-04 | 4209.73 ms | 32.1% bf16 MFU | 125253 tok/s step 7633/19560 | loss 3.439946 (-1.04z)| norm 0.2720 (-0.22z)| lr 4.21e-04 | 4167.67 ms | 32.4% bf16 MFU | 125280 tok/s step 7634/19560 | loss 3.375595 
(-2.65z)| norm 0.2848 (+0.05z)| lr 4.21e-04 | 4161.63 ms | 32.4% bf16 MFU | 125315 tok/s step 7635/19560 | loss 3.494483 (+0.42z)| norm 0.3096 (+0.58z)| lr 4.21e-04 | 4200.72 ms | 32.1% bf16 MFU | 125290 tok/s step 7636/19560 | loss 3.456337 (-0.56z)| norm 0.2624 (-0.41z)| lr 4.21e-04 | 4168.28 ms | 32.4% bf16 MFU | 125314 tok/s step 7637/19560 | loss 3.497290 (+0.51z)| norm 0.2869 (+0.11z)| lr 4.21e-04 | 4194.88 ms | 32.2% bf16 MFU | 125298 tok/s step 7638/19560 | loss 3.439157 (-1.00z)| norm 0.2760 (-0.12z)| lr 4.21e-04 | 4162.37 ms | 32.4% bf16 MFU | 125331 tok/s step 7639/19560 | loss 3.512083 (+0.89z)| norm 0.2915 (+0.20z)| lr 4.21e-04 | 4175.06 ms | 32.3% bf16 MFU | 125343 tok/s step 7640/19560 | loss 3.484012 (+0.15z)| norm 0.2966 (+0.31z)| lr 4.21e-04 | 4172.66 ms | 32.4% bf16 MFU | 125358 tok/s step 7641/19560 | loss 3.418458 (-1.54z)| norm 0.2776 (-0.10z)| lr 4.21e-04 | 4181.97 ms | 32.3% bf16 MFU | 125359 tok/s step 7642/19560 | loss 3.503856 (+0.69z)| norm 0.2775 (-0.10z)| lr 4.21e-04 | 4163.61 ms | 32.4% bf16 MFU | 125387 tok/s step 7643/19560 | loss 3.444539 (-0.86z)| norm 0.2669 (-0.32z)| lr 4.21e-04 | 4160.90 ms | 32.4% bf16 MFU | 125418 tok/s step 7644/19560 | loss 3.464441 (-0.33z)| norm 0.2691 (-0.28z)| lr 4.21e-04 | 4166.54 ms | 32.4% bf16 MFU | 125438 tok/s step 7645/19560 | loss 3.369385 (-2.72z)| norm 0.2567 (-0.54z)| lr 4.21e-04 | 4164.73 ms | 32.4% bf16 MFU | 125461 tok/s step 7646/19560 | loss 3.445415 (-0.78z)| norm 0.2542 (-0.59z)| lr 4.21e-04 | 4226.04 ms | 31.9% bf16 MFU | 125391 tok/s step 7647/19560 | loss 3.384915 (-2.38z)| norm 0.2660 (-0.34z)| lr 4.21e-04 | 4165.64 ms | 32.4% bf16 MFU | 125414 tok/s step 7648/19560 | loss 3.502113 (+0.72z)| norm 0.2653 (-0.36z)| lr 4.21e-04 | 4230.58 ms | 31.9% bf16 MFU | 125340 tok/s step 7649/19560 | loss 3.461858 (-0.34z)| norm 0.2805 (-0.04z)| lr 4.21e-04 | 4173.00 ms | 32.4% bf16 MFU | 125355 tok/s step 7650/19560 | loss 3.442370 (-0.85z)| norm 0.2716 (-0.22z)| lr 4.20e-04 | 4154.08 ms | 
32.5% bf16 MFU | 125398 tok/s step 7651/19560 | loss 3.463299 (-0.30z)| norm 0.2760 (-0.13z)| lr 4.20e-04 | 4157.49 ms | 32.5% bf16 MFU | 125433 tok/s step 7652/19560 | loss 3.399143 (-1.97z)| norm 0.2717 (-0.22z)| lr 4.20e-04 | 4178.80 ms | 32.3% bf16 MFU | 125435 tok/s step 7653/19560 | loss 3.570890 (+2.48z)| norm 0.2805 (-0.03z)| lr 4.20e-04 | 4415.75 ms | 30.6% bf16 MFU | 125100 tok/s step 7654/19560 | loss 3.521541 (+1.20z)| norm 0.3306 (+1.01z)| lr 4.20e-04 | 4147.57 ms | 32.6% bf16 MFU | 125165 tok/s step 7655/19560 | loss 3.412188 (-1.60z)| norm 0.3141 (+0.66z)| lr 4.20e-04 | 4174.79 ms | 32.3% bf16 MFU | 125186 tok/s step 7656/19560 | loss 3.451121 (-0.57z)| norm 0.2611 (-0.45z)| lr 4.20e-04 | 4150.96 ms | 32.5% bf16 MFU | 125242 tok/s step 7657/19560 | loss 3.541521 (+1.79z)| norm 0.3339 (+1.07z)| lr 4.20e-04 | 4159.46 ms | 32.5% bf16 MFU | 125282 tok/s step 7658/19560 | loss 3.453077 (-0.52z)| norm 0.3222 (+0.81z)| lr 4.20e-04 | 4179.36 ms | 32.3% bf16 MFU | 125290 tok/s step 7659/19560 | loss 3.455210 (-0.46z)| norm 0.2642 (-0.39z)| lr 4.20e-04 | 4170.22 ms | 32.4% bf16 MFU | 125312 tok/s step 7660/19560 | loss 3.543541 (+1.83z)| norm 0.3259 (+0.88z)| lr 4.20e-04 | 4171.50 ms | 32.4% bf16 MFU | 125331 tok/s step 7661/19560 | loss 3.447701 (-0.65z)| norm 0.2954 (+0.25z)| lr 4.20e-04 | 4173.53 ms | 32.4% bf16 MFU | 125345 tok/s step 7662/19560 | loss 3.497653 (+0.64z)| norm 0.2900 (+0.14z)| lr 4.20e-04 | 4174.35 ms | 32.3% bf16 MFU | 125358 tok/s step 7663/19560 | loss 3.543638 (+1.80z)| norm 0.3121 (+0.58z)| lr 4.20e-04 | 4162.06 ms | 32.4% bf16 MFU | 125388 tok/s step 7664/19560 | loss 3.443360 (-0.77z)| norm 0.2760 (-0.16z)| lr 4.20e-04 | 4157.62 ms | 32.5% bf16 MFU | 125424 tok/s step 7665/19560 | loss 3.449054 (-0.62z)| norm 0.3259 (+0.87z)| lr 4.20e-04 | 4164.40 ms | 32.4% bf16 MFU | 125448 tok/s step 7666/19560 | loss 3.457514 (-0.40z)| norm 0.2695 (-0.30z)| lr 4.20e-04 | 4193.42 ms | 32.2% bf16 MFU | 125427 tok/s step 7667/19560 | loss 3.392542 
(-2.01z)| norm 0.2881 (+0.09z)| lr 4.20e-04 | 4180.84 ms | 32.3% bf16 MFU | 125425 tok/s step 7668/19560 | loss 3.412112 (-1.54z)| norm 0.3087 (+0.51z)| lr 4.20e-04 | 4163.89 ms | 32.4% bf16 MFU | 125450 tok/s step 7669/19560 | loss 3.444694 (-0.68z)| norm 0.2821 (-0.04z)| lr 4.20e-04 | 4168.33 ms | 32.4% bf16 MFU | 125466 tok/s step 7670/19560 | loss 3.443114 (-0.72z)| norm 0.2568 (-1.16z)| lr 4.20e-04 | 4200.09 ms | 32.1% bf16 MFU | 125434 tok/s step 7671/19560 | loss 3.500735 (+0.82z)| norm 0.2731 (-0.34z)| lr 4.20e-04 | 4168.00 ms | 32.4% bf16 MFU | 125452 tok/s step 7672/19560 | loss 3.487030 (+0.44z)| norm 0.2711 (-0.43z)| lr 4.19e-04 | 4178.93 ms | 32.3% bf16 MFU | 125452 tok/s step 7673/19560 | loss 3.455223 (-0.41z)| norm 0.2667 (-0.64z)| lr 4.19e-04 | 4169.01 ms | 32.4% bf16 MFU | 125468 tok/s step 7674/19560 | loss 3.420551 (-1.32z)| norm 0.2981 (+1.02z)| lr 4.19e-04 | 4191.71 ms | 32.2% bf16 MFU | 125448 tok/s step 7675/19560 | loss 3.393148 (-2.01z)| norm 0.2740 (-0.24z)| lr 4.19e-04 | 4210.96 ms | 32.1% bf16 MFU | 125401 tok/s step 7676/19560 | loss 3.450078 (-0.50z)| norm 0.2518 (-1.40z)| lr 4.19e-04 | 4174.00 ms | 32.3% bf16 MFU | 125411 tok/s step 7677/19560 | loss 3.436715 (-0.84z)| norm 0.2651 (-0.69z)| lr 4.19e-04 | 4159.06 ms | 32.5% bf16 MFU | 125444 tok/s step 7678/19560 | loss 3.448685 (-0.52z)| norm 0.2677 (-0.54z)| lr 4.19e-04 | 4163.02 ms | 32.4% bf16 MFU | 125469 tok/s step 7679/19560 | loss 3.456320 (-0.30z)| norm 0.2676 (-0.55z)| lr 4.19e-04 | 4156.42 ms | 32.5% bf16 MFU | 125502 tok/s step 7680/19560 | loss 3.553353 (+2.23z)| norm 0.2700 (-0.42z)| lr 4.19e-04 | 4170.26 ms | 32.4% bf16 MFU | 125513 tok/s step 7681/19560 | loss 3.462467 (-0.15z)| norm 0.2620 (-0.86z)| lr 4.19e-04 | 4184.52 ms | 32.3% bf16 MFU | 125502 tok/s step 7682/19560 | loss 3.481141 (+0.33z)| norm 0.2717 (-0.33z)| lr 4.19e-04 | 4166.98 ms | 32.4% bf16 MFU | 125518 tok/s step 7683/19560 | loss 3.456944 (-0.29z)| norm 0.2839 (+0.33z)| lr 4.19e-04 | 4180.63 ms | 
32.3% bf16 MFU | 125512 tok/s step 7684/19560 | loss 3.508684 (+1.05z)| norm 0.2972 (+1.03z)| lr 4.19e-04 | 4164.19 ms | 32.4% bf16 MFU | 125532 tok/s step 7685/19560 | loss 3.448333 (-0.53z)| norm 0.2717 (-0.35z)| lr 4.19e-04 | 4195.35 ms | 32.2% bf16 MFU | 125504 tok/s step 7686/19560 | loss 3.517232 (+1.25z)| norm 0.2786 (+0.01z)| lr 4.19e-04 | 4175.46 ms | 32.3% bf16 MFU | 125507 tok/s step 7687/19560 | loss 3.482757 (+0.35z)| norm 0.2626 (-0.85z)| lr 4.19e-04 | 4165.95 ms | 32.4% bf16 MFU | 125524 tok/s step 7688/19560 | loss 3.441054 (-0.73z)| norm 0.2687 (-0.52z)| lr 4.19e-04 | 4186.75 ms | 32.2% bf16 MFU | 125509 tok/s step 7689/19560 | loss 3.474018 (+0.13z)| norm 0.2673 (-0.59z)| lr 4.19e-04 | 4191.46 ms | 32.2% bf16 MFU | 125488 tok/s step 7690/19560 | loss 3.473650 (+0.12z)| norm 0.2821 (+0.21z)| lr 4.19e-04 | 4166.02 ms | 32.4% bf16 MFU | 125506 tok/s step 7691/19560 | loss 3.517172 (+1.25z)| norm 0.2559 (-1.19z)| lr 4.19e-04 | 4167.90 ms | 32.4% bf16 MFU | 125520 tok/s step 7692/19560 | loss 3.475588 (+0.16z)| norm 0.2751 (-0.15z)| lr 4.19e-04 | 4168.56 ms | 32.4% bf16 MFU | 125533 tok/s step 7693/19560 | loss 3.434304 (-0.90z)| norm 0.2739 (-0.21z)| lr 4.19e-04 | 4165.23 ms | 32.4% bf16 MFU | 125550 tok/s step 7694/19560 | loss 3.452184 (-0.43z)| norm 0.2532 (-1.32z)| lr 4.18e-04 | 4215.72 ms | 32.0% bf16 MFU | 125491 tok/s step 7695/19560 | loss 3.501637 (+0.85z)| norm 0.2670 (-0.57z)| lr 4.18e-04 | 4164.59 ms | 32.4% bf16 MFU | 125511 tok/s step 7696/19560 | loss 3.496329 (+0.70z)| norm 0.2821 (+0.25z)| lr 4.18e-04 | 4177.61 ms | 32.3% bf16 MFU | 125510 tok/s step 7697/19560 | loss 3.498425 (+0.76z)| norm 0.2879 (+0.55z)| lr 4.18e-04 | 4157.62 ms | 32.5% bf16 MFU | 125540 tok/s step 7698/19560 | loss 3.454160 (-0.38z)| norm 0.2609 (-0.90z)| lr 4.18e-04 | 4173.91 ms | 32.3% bf16 MFU | 125543 tok/s step 7699/19560 | loss 3.502929 (+0.87z)| norm 0.2750 (-0.14z)| lr 4.18e-04 | 4164.60 ms | 32.4% bf16 MFU | 125561 tok/s step 7700/19560 | loss 3.479898 
(+0.28z)| norm 0.2788 (+0.07z)| lr 4.18e-04 | 4184.78 ms | 32.3% bf16 MFU | 125547 tok/s step 7701/19560 | loss 3.585118 (+2.91z)| norm 0.3002 (+1.21z)| lr 4.18e-04 | 4158.93 ms | 32.5% bf16 MFU | 125573 tok/s step 7702/19560 | loss 3.498484 (+0.71z)| norm 0.3036 (+1.37z)| lr 4.18e-04 | 4168.07 ms | 32.4% bf16 MFU | 125583 tok/s step 7703/19560 | loss 3.405221 (-1.60z)| norm 0.2752 (-0.14z)| lr 4.18e-04 | 4160.72 ms | 32.5% bf16 MFU | 125605 tok/s step 7704/19560 | loss 3.530433 (+1.49z)| norm 0.2930 (+0.79z)| lr 4.18e-04 | 4166.38 ms | 32.4% bf16 MFU | 125616 tok/s step 7705/19560 | loss 3.496827 (+0.65z)| norm 0.2869 (+0.47z)| lr 4.18e-04 | 4357.81 ms | 31.0% bf16 MFU | 125351 tok/s step 7706/19560 | loss 3.475672 (+0.12z)| norm 0.2844 (+0.34z)| lr 4.18e-04 | 4160.37 ms | 32.5% bf16 MFU | 125384 tok/s step 7707/19560 | loss 3.456370 (-0.35z)| norm 0.2669 (-0.59z)| lr 4.18e-04 | 4174.51 ms | 32.3% bf16 MFU | 125395 tok/s step 7708/19560 | loss 3.490751 (+0.49z)| norm 0.2620 (-0.87z)| lr 4.18e-04 | 4223.61 ms | 32.0% bf16 MFU | 125332 tok/s step 7709/19560 | loss 3.488672 (+0.45z)| norm 0.2745 (-0.20z)| lr 4.18e-04 | 4161.88 ms | 32.4% bf16 MFU | 125364 tok/s step 7710/19560 | loss 3.485562 (+0.37z)| norm 0.2688 (-0.51z)| lr 4.18e-04 | 4151.57 ms | 32.5% bf16 MFU | 125410 tok/s step 7711/19560 | loss 3.453967 (-0.41z)| norm 0.2680 (-0.55z)| lr 4.18e-04 | 4158.63 ms | 32.5% bf16 MFU | 125443 tok/s step 7712/19560 | loss 3.491508 (+0.53z)| norm 0.2712 (-0.39z)| lr 4.18e-04 | 4164.31 ms | 32.4% bf16 MFU | 125466 tok/s step 7713/19560 | loss 3.480559 (+0.25z)| norm 0.2818 (+0.18z)| lr 4.18e-04 | 4162.02 ms | 32.4% bf16 MFU | 125491 tok/s step 7714/19560 | loss 3.535566 (+1.60z)| norm 0.2721 (-0.36z)| lr 4.18e-04 | 4193.34 ms | 32.2% bf16 MFU | 125468 tok/s step 7715/19560 | loss 3.405700 (-1.61z)| norm 0.2723 (-0.34z)| lr 4.18e-04 | 4179.57 ms | 32.3% bf16 MFU | 125467 tok/s step 7716/19560 | loss 3.490694 (+0.48z)| norm 0.3054 (+1.47z)| lr 4.17e-04 | 4166.16 ms | 
32.4% bf16 MFU | 125486 tok/s step 7717/19560 | loss 3.432813 (-0.94z)| norm 0.2790 (+0.02z)| lr 4.17e-04 | 4174.70 ms | 32.3% bf16 MFU | 125491 tok/s step 7718/19560 | loss 3.477367 (+0.15z)| norm 0.2673 (-0.61z)| lr 4.17e-04 | 4162.27 ms | 32.4% bf16 MFU | 125514 tok/s step 7719/19560 | loss 3.469524 (-0.04z)| norm 0.2824 (+0.21z)| lr 4.17e-04 | 4188.43 ms | 32.2% bf16 MFU | 125497 tok/s step 7720/19560 | loss 3.442381 (-0.70z)| norm 0.2909 (+0.67z)| lr 4.17e-04 | 4163.83 ms | 32.4% bf16 MFU | 125518 tok/s step 7721/19560 | loss 3.433996 (-0.90z)| norm 0.2805 (+0.09z)| lr 4.17e-04 | 4162.93 ms | 32.4% bf16 MFU | 125539 tok/s step 7722/19560 | loss 3.460802 (-0.24z)| norm 0.2891 (+0.55z)| lr 4.17e-04 | 4161.08 ms | 32.4% bf16 MFU | 125562 tok/s step 7723/19560 | loss 3.474377 (+0.08z)| norm 0.2612 (-0.96z)| lr 4.17e-04 | 4150.22 ms | 32.5% bf16 MFU | 125601 tok/s step 7724/19560 | loss 3.458107 (-0.32z)| norm 0.2551 (-1.31z)| lr 4.17e-04 | 4177.47 ms | 32.3% bf16 MFU | 125596 tok/s step 7725/19560 | loss 3.445199 (-0.63z)| norm 0.2731 (-0.33z)| lr 4.17e-04 | 4163.60 ms | 32.4% bf16 MFU | 125612 tok/s step 7726/19560 | loss 3.431726 (-0.95z)| norm 0.2719 (-0.41z)| lr 4.17e-04 | 4159.55 ms | 32.5% bf16 MFU | 125634 tok/s step 7727/19560 | loss 3.433911 (-0.88z)| norm 0.2941 (+0.82z)| lr 4.17e-04 | 4150.53 ms | 32.5% bf16 MFU | 125668 tok/s step 7728/19560 | loss 3.425215 (-1.09z)| norm 0.2551 (-1.38z)| lr 4.17e-04 | 4191.81 ms | 32.2% bf16 MFU | 125638 tok/s step 7729/19560 | loss 3.457801 (-0.29z)| norm 0.2598 (-1.10z)| lr 4.17e-04 | 4202.64 ms | 32.1% bf16 MFU | 125594 tok/s step 7730/19560 | loss 3.441272 (-0.70z)| norm 0.2735 (-0.35z)| lr 4.17e-04 | 4190.66 ms | 32.2% bf16 MFU | 125570 tok/s step 7731/19560 | loss 3.468308 (-0.03z)| norm 0.2645 (-0.86z)| lr 4.17e-04 | 4161.29 ms | 32.4% bf16 MFU | 125591 tok/s step 7732/19560 | loss 3.458809 (-0.25z)| norm 0.2710 (-0.50z)| lr 4.17e-04 | 4157.26 ms | 32.5% bf16 MFU | 125617 tok/s step 7733/19560 | loss 3.443193 
(-0.63z)| norm 0.2945 (+0.83z)| lr 4.17e-04 | 4165.07 ms | 32.4% bf16 MFU | 125630 tok/s step 7734/19560 | loss 3.481116 (+0.31z)| norm 0.2721 (-0.43z)| lr 4.17e-04 | 4415.83 ms | 30.6% bf16 MFU | 125285 tok/s step 7735/19560 | loss 3.491163 (+0.56z)| norm 0.2902 (+0.62z)| lr 4.17e-04 | 4344.80 ms | 31.1% bf16 MFU | 125054 tok/s step 7736/19560 | loss 3.413029 (-1.39z)| norm 0.2762 (-0.20z)| lr 4.17e-04 | 4242.18 ms | 31.8% bf16 MFU | 124981 tok/s step 7737/19560 | loss 3.469127 (+0.03z)| norm 0.2705 (-0.53z)| lr 4.16e-04 | 4174.12 ms | 32.3% bf16 MFU | 125012 tok/s step 7738/19560 | loss 3.369533 (-2.42z)| norm 0.2865 (+0.41z)| lr 4.16e-04 | 4220.74 ms | 32.0% bf16 MFU | 124972 tok/s step 7739/19560 | loss 3.430796 (-0.89z)| norm 0.2680 (-0.66z)| lr 4.16e-04 | 4238.81 ms | 31.9% bf16 MFU | 124908 tok/s step 7740/19560 | loss 3.437637 (-0.71z)| norm 0.2556 (-1.36z)| lr 4.16e-04 | 4172.23 ms | 32.4% bf16 MFU | 124946 tok/s step 7741/19560 | loss 3.449546 (-0.41z)| norm 0.2625 (-0.95z)| lr 4.16e-04 | 4232.19 ms | 31.9% bf16 MFU | 124892 tok/s step 7742/19560 | loss 3.455016 (-0.27z)| norm 0.2744 (-0.26z)| lr 4.16e-04 | 4227.57 ms | 31.9% bf16 MFU | 124849 tok/s step 7743/19560 | loss 3.510714 (+1.09z)| norm 0.2702 (-0.50z)| lr 4.16e-04 | 4156.34 ms | 32.5% bf16 MFU | 124913 tok/s step 7744/19560 | loss 3.503864 (+0.92z)| norm 0.2800 (+0.07z)| lr 4.16e-04 | 4158.31 ms | 32.5% bf16 MFU | 124972 tok/s step 7745/19560 | loss 3.460241 (-0.16z)| norm 0.2578 (-1.22z)| lr 4.16e-04 | 4153.24 ms | 32.5% bf16 MFU | 125035 tok/s step 7746/19560 | loss 3.483377 (+0.41z)| norm 0.2864 (+0.43z)| lr 4.16e-04 | 4170.80 ms | 32.4% bf16 MFU | 125068 tok/s step 7747/19560 | loss 3.394403 (-1.74z)| norm 0.3015 (+1.30z)| lr 4.16e-04 | 4157.25 ms | 32.5% bf16 MFU | 125121 tok/s step 7748/19560 | loss 3.480719 (+0.36z)| norm 0.2823 (+0.20z)| lr 4.16e-04 | 4183.21 ms | 32.3% bf16 MFU | 125131 tok/s step 7749/19560 | loss 3.476339 (+0.26z)| norm 0.2558 (-1.32z)| lr 4.16e-04 | 4255.75 ms | 
31.7% bf16 MFU | 125034 tok/s step 7750/19560 | loss 3.468488 (+0.07z)| norm 0.2959 (+1.02z)| lr 4.16e-04 | 4224.21 ms | 32.0% bf16 MFU | 124988 tok/s val loss 3.445715
HellaSwag: 2874/10042 = 0.286198 step 7751/19560 | loss 3.434403 (-0.76z)| norm 0.2783 (-0.02z)| lr 4.16e-04 | 4174.01 ms | 32.3% bf16 MFU | 125019 tok/s step 7752/19560 | loss 3.471983 (+0.15z)| norm 0.2671 (-0.67z)| lr 4.16e-04 | 4152.57 ms | 32.5% bf16 MFU | 125081 tok/s step 7753/19560 | loss 3.529016 (+1.53z)| norm 0.3111 (+1.86z)| lr 4.16e-04 | 4237.62 ms | 31.9% bf16 MFU | 125013 tok/s step 7754/19560 | loss 3.468814 (+0.07z)| norm 0.3013 (+1.28z)| lr 4.16e-04 | 4158.46 ms | 32.5% bf16 MFU | 125067 tok/s step 7755/19560 | loss 3.466226 (+0.01z)| norm 0.2752 (-0.22z)| lr 4.16e-04 | 4193.77 ms | 32.2% bf16 MFU | 125064 tok/s step 7756/19560 | loss 3.581321 (+2.76z)| norm 0.3129 (+1.91z)| lr 4.16e-04 | 4150.87 ms | 32.5% bf16 MFU | 125126 tok/s step 7757/19560 | loss 3.440796 (-0.61z)| norm 0.2678 (-0.65z)| lr 4.16e-04 | 4153.31 ms | 32.5% bf16 MFU | 125182 tok/s step 7758/19560 | loss 3.445009 (-0.50z)| norm 0.2743 (-0.26z)| lr 4.16e-04 | 4150.41 ms | 32.5% bf16 MFU | 125239 tok/s step 7759/19560 | loss 3.467865 (+0.06z)| norm 0.2810 (+0.11z)| lr 4.15e-04 | 4150.33 ms | 32.5% bf16 MFU | 125293 tok/s step 7760/19560 | loss 3.481936 (+0.40z)| norm 0.2901 (+0.65z)| lr 4.15e-04 | 4155.79 ms | 32.5% bf16 MFU | 125336 tok/s step 7761/19560 | loss 3.451555 (-0.36z)| norm 0.2858 (+0.39z)| lr 4.15e-04 | 4156.99 ms | 32.5% bf16 MFU | 125375 tok/s step 7762/19560 | loss 3.435523 (-0.78z)| norm 0.2694 (-0.55z)| lr 4.15e-04 | 4158.03 ms | 32.5% bf16 MFU | 125411 tok/s step 7763/19560 | loss 3.460711 (-0.14z)| norm 0.2874 (+0.51z)| lr 4.15e-04 | 4153.08 ms | 32.5% bf16 MFU | 125453 tok/s step 7764/19560 | loss 3.490297 (+0.60z)| norm 0.2705 (-0.49z)| lr 4.15e-04 | 4227.64 ms | 31.9% bf16 MFU | 125381 tok/s step 7765/19560 | loss 3.467322 (+0.03z)| norm 0.3080 (+1.70z)| lr 4.15e-04 | 4156.19 ms | 32.5% bf16 MFU | 125419 tok/s step 7766/19560 | loss 3.488877 (+0.56z)| norm 0.2972
(+1.05z)| lr 4.15e-04 | 4139.25 ms | 32.6% bf16 MFU | 125481 tok/s step 7767/19560 | loss 3.467714 (+0.04z)| norm 0.2841 (+0.29z)| lr 4.15e-04 | 4245.17 ms | 31.8% bf16 MFU | 125382 tok/s step 7768/19560 | loss 3.519953 (+1.35z)| norm 0.2626 (-0.94z)| lr 4.15e-04 | 4154.63 ms | 32.5% bf16 MFU | 125423 tok/s step 7769/19560 | loss 3.442103 (-0.62z)| norm 0.3121 (+1.90z)| lr 4.15e-04 | 4148.01 ms | 32.5% bf16 MFU | 125471 tok/s step 7770/19560 | loss 3.513488 (+1.18z)| norm 0.2763 (-0.16z)| lr 4.15e-04 | 4150.19 ms | 32.5% bf16 MFU | 125514 tok/s step 7771/19560 | loss 3.504754 (+0.95z)| norm 0.2667 (-0.71z)| lr 4.15e-04 | 4160.17 ms | 32.5% bf16 MFU | 125540 tok/s step 7772/19560 | loss 3.454409 (-0.32z)| norm 0.2586 (-1.17z)| lr 4.15e-04 | 4151.91 ms | 32.5% bf16 MFU | 125577 tok/s step 7773/19560 | loss 3.485938 (+0.46z)| norm 0.2764 (-0.16z)| lr 4.15e-04 | 4255.63 ms | 31.7% bf16 MFU | 125458 tok/s step 7774/19560 | loss 3.469563 (+0.03z)| norm 0.2917 (+0.71z)| lr 4.15e-04 | 4196.27 ms | 32.2% bf16 MFU | 125432 tok/s step 7775/19560 | loss 3.499350 (+0.79z)| norm 0.2807 (+0.06z)| lr 4.15e-04 | 4196.56 ms | 32.2% bf16 MFU | 125407 tok/s step 7776/19560 | loss 3.450325 (-0.49z)| norm 0.2731 (-0.38z)| lr 4.15e-04 | 4260.56 ms | 31.7% bf16 MFU | 125289 tok/s step 7777/19560 | loss 3.418924 (-1.30z)| norm 0.2884 (+0.51z)| lr 4.15e-04 | 4184.74 ms | 32.3% bf16 MFU | 125289 tok/s step 7778/19560 | loss 3.428867 (-1.03z)| norm 0.2813 (+0.09z)| lr 4.15e-04 | 4159.76 ms | 32.5% bf16 MFU | 125327 tok/s step 7779/19560 | loss 3.480451 (+0.31z)| norm 0.2829 (+0.18z)| lr 4.15e-04 | 4181.77 ms | 32.3% bf16 MFU | 125329 tok/s step 7780/19560 | loss 3.501426 (+0.85z)| norm 0.2954 (+0.89z)| lr 4.15e-04 | 4182.99 ms | 32.3% bf16 MFU | 125330 tok/s step 7781/19560 | loss 3.489465 (+0.57z)| norm 0.2982 (+1.05z)| lr 4.14e-04 | 4159.54 ms | 32.5% bf16 MFU | 125365 tok/s step 7782/19560 | loss 3.444979 (-0.63z)| norm 0.3335 (+3.07z)| lr 4.14e-04 | 4147.13 ms | 32.6% bf16 MFU | 125418 
tok/s step 7783/19560 | loss 3.494774 (+0.72z)| norm 0.3019 (+1.27z)| lr 4.14e-04 | 4157.93 ms | 32.5% bf16 MFU | 125452 tok/s step 7784/19560 | loss 3.447161 (-0.59z)| norm 0.2934 (+0.76z)| lr 4.14e-04 | 4150.21 ms | 32.5% bf16 MFU | 125496 tok/s step 7785/19560 | loss 3.443350 (-0.68z)| norm 0.2847 (+0.29z)| lr 4.14e-04 | 4157.40 ms | 32.5% bf16 MFU | 125526 tok/s step 7786/19560 | loss 3.457938 (-0.28z)| norm 0.2930 (+0.83z)| lr 4.14e-04 | 4141.25 ms | 32.6% bf16 MFU | 125580 tok/s step 7787/19560 | loss 3.451710 (-0.45z)| norm 0.2880 (+0.51z)| lr 4.14e-04 | 4154.23 ms | 32.5% bf16 MFU | 125611 tok/s step 7788/19560 | loss 3.467082 (-0.00z)| norm 0.2987 (+1.21z)| lr 4.14e-04 | 4158.43 ms | 32.5% bf16 MFU | 125635 tok/s step 7789/19560 | loss 3.436258 (-0.88z)| norm 0.3116 (+2.01z)| lr 4.14e-04 | 4172.33 ms | 32.4% bf16 MFU | 125636 tok/s step 7790/19560 | loss 3.427554 (-1.11z)| norm 0.2876 (+0.50z)| lr 4.14e-04 | 4171.70 ms | 32.4% bf16 MFU | 125638 tok/s step 7791/19560 | loss 3.531784 (+1.86z)| norm 0.3133 (+2.12z)| lr 4.14e-04 | 4163.85 ms | 32.4% bf16 MFU | 125652 tok/s step 7792/19560 | loss 3.442379 (-0.69z)| norm 0.2884 (+0.53z)| lr 4.14e-04 | 4142.90 ms | 32.6% bf16 MFU | 125697 tok/s step 7793/19560 | loss 3.484042 (+0.49z)| norm 0.3007 (+1.36z)| lr 4.14e-04 | 4163.10 ms | 32.4% bf16 MFU | 125709 tok/s step 7794/19560 | loss 3.515231 (+1.36z)| norm 0.2623 (-1.12z)| lr 4.14e-04 | 4165.83 ms | 32.4% bf16 MFU | 125716 tok/s step 7795/19560 | loss 3.434151 (-0.96z)| norm 0.2881 (+0.55z)| lr 4.14e-04 | 4194.12 ms | 32.2% bf16 MFU | 125681 tok/s step 7796/19560 | loss 3.460262 (-0.22z)| norm 0.2669 (-0.81z)| lr 4.14e-04 | 4160.82 ms | 32.4% bf16 MFU | 125697 tok/s step 7797/19560 | loss 3.454077 (-0.40z)| norm 0.2752 (-0.27z)| lr 4.14e-04 | 4161.93 ms | 32.4% bf16 MFU | 125711 tok/s step 7798/19560 | loss 3.445927 (-0.64z)| norm 0.2850 (+0.36z)| lr 4.14e-04 | 4158.07 ms | 32.5% bf16 MFU | 125730 tok/s step 7799/19560 | loss 3.509370 (+1.19z)| norm 0.2748 
(-0.31z)| lr 4.14e-04 | 4161.40 ms | 32.4% bf16 MFU | 125742 tok/s step 7800/19560 | loss 3.432817 (-1.01z)| norm 0.2734 (-0.40z)| lr 4.14e-04 | 4145.25 ms | 32.6% bf16 MFU | 125779 tok/s step 7801/19560 | loss 3.475121 (+0.21z)| norm 0.2846 (+0.33z)| lr 4.14e-04 | 4193.24 ms | 32.2% bf16 MFU | 125742 tok/s step 7802/19560 | loss 3.470069 (+0.05z)| norm 0.3084 (+1.88z)| lr 4.13e-04 | 4191.28 ms | 32.2% bf16 MFU | 125709 tok/s step 7803/19560 | loss 3.454752 (-0.41z)| norm 0.2644 (-1.00z)| lr 4.13e-04 | 4156.62 ms | 32.5% bf16 MFU | 125731 tok/s step 7804/19560 | loss 3.453863 (-0.44z)| norm 0.2603 (-1.28z)| lr 4.13e-04 | 4150.79 ms | 32.5% bf16 MFU | 125760 tok/s step 7805/19560 | loss 3.492389 (+0.69z)| norm 0.2669 (-0.84z)| lr 4.13e-04 | 4169.73 ms | 32.4% bf16 MFU | 125758 tok/s step 7806/19560 | loss 3.470985 (+0.05z)| norm 0.3095 (+1.92z)| lr 4.13e-04 | 4158.83 ms | 32.5% bf16 MFU | 125774 tok/s step 7807/19560 | loss 3.456797 (-0.37z)| norm 0.3192 (+2.47z)| lr 4.13e-04 | 4180.72 ms | 32.3% bf16 MFU | 125755 tok/s step 7808/19560 | loss 3.488679 (+0.60z)| norm 0.3087 (+1.76z)| lr 4.13e-04 | 4178.31 ms | 32.3% bf16 MFU | 125742 tok/s step 7809/19560 | loss 3.484041 (+0.46z)| norm 0.3126 (+1.96z)| lr 4.13e-04 | 4161.08 ms | 32.4% bf16 MFU | 125754 tok/s step 7810/19560 | loss 3.493634 (+0.74z)| norm 0.2677 (-0.84z)| lr 4.13e-04 | 4160.78 ms | 32.5% bf16 MFU | 125767 tok/s step 7811/19560 | loss 3.433940 (-1.06z)| norm 0.3115 (+1.85z)| lr 4.13e-04 | 4151.49 ms | 32.5% bf16 MFU | 125793 tok/s step 7812/19560 | loss 3.487033 (+0.56z)| norm 0.2636 (-1.08z)| lr 4.13e-04 | 4160.01 ms | 32.5% bf16 MFU | 125805 tok/s step 7813/19560 | loss 3.470603 (+0.05z)| norm 0.3307 (+2.92z)| lr 4.13e-04 | 4161.20 ms | 32.4% bf16 MFU | 125814 tok/s step 7814/19560 | loss 3.478039 (+0.29z)| norm 0.2905 (+0.53z)| lr 4.13e-04 | 4146.75 ms | 32.6% bf16 MFU | 125845 tok/s step 7815/19560 | loss 3.495948 (+0.83z)| norm 0.2701 (-0.69z)| lr 4.13e-04 | 4156.24 ms | 32.5% bf16 MFU | 125860 
tok/s step 7816/19560 | loss 3.463139 (-0.18z)| norm 0.2744 (-0.44z)| lr 4.13e-04 | 4205.14 ms | 32.1% bf16 MFU | 125801 tok/s step 7817/19560 | loss 3.457718 (-0.34z)| norm 0.3031 (+1.25z)| lr 4.13e-04 | 4154.99 ms | 32.5% bf16 MFU | 125820 tok/s step 7818/19560 | loss 3.417688 (-1.54z)| norm 0.2715 (-0.62z)| lr 4.13e-04 | 4156.42 ms | 32.5% bf16 MFU | 125836 tok/s step 7819/19560 | loss 3.516744 (+1.47z)| norm 0.2676 (-0.86z)| lr 4.13e-04 | 4163.39 ms | 32.4% bf16 MFU | 125841 tok/s step 7820/19560 | loss 3.500138 (+0.96z)| norm 0.2991 (+1.01z)| lr 4.13e-04 | 4156.23 ms | 32.5% bf16 MFU | 125856 tok/s step 7821/19560 | loss 3.432873 (-1.08z)| norm 0.2799 (-0.14z)| lr 4.13e-04 | 4152.82 ms | 32.5% bf16 MFU | 125876 tok/s step 7822/19560 | loss 3.465509 (-0.09z)| norm 0.2762 (-0.37z)| lr 4.13e-04 | 4169.88 ms | 32.4% bf16 MFU | 125869 tok/s step 7823/19560 | loss 3.426189 (-1.27z)| norm 0.2661 (-0.98z)| lr 4.13e-04 | 4153.54 ms | 32.5% bf16 MFU | 125886 tok/s step 7824/19560 | loss 3.475232 (+0.22z)| norm 0.2860 (+0.21z)| lr 4.12e-04 | 4169.46 ms | 32.4% bf16 MFU | 125879 tok/s step 7825/19560 | loss 3.502354 (+1.04z)| norm 0.2679 (-0.87z)| lr 4.12e-04 | 4155.70 ms | 32.5% bf16 MFU | 125893 tok/s step 7826/19560 | loss 3.444133 (-0.72z)| norm 0.2696 (-0.77z)| lr 4.12e-04 | 4149.35 ms | 32.5% bf16 MFU | 125916 tok/s step 7827/19560 | loss 3.430648 (-1.11z)| norm 0.2702 (-0.73z)| lr 4.12e-04 | 4146.50 ms | 32.6% bf16 MFU | 125943 tok/s step 7828/19560 | loss 3.448254 (-0.57z)| norm 0.2716 (-0.65z)| lr 4.12e-04 | 4160.88 ms | 32.4% bf16 MFU | 125946 tok/s step 7829/19560 | loss 3.476039 (+0.31z)| norm 0.2640 (-1.09z)| lr 4.12e-04 | 4156.38 ms | 32.5% bf16 MFU | 125956 tok/s step 7830/19560 | loss 3.459774 (-0.19z)| norm 0.2659 (-0.96z)| lr 4.12e-04 | 4157.94 ms | 32.5% bf16 MFU | 125962 tok/s step 7831/19560 | loss 3.403044 (-2.01z)| norm 0.2647 (-1.02z)| lr 4.12e-04 | 4156.82 ms | 32.5% bf16 MFU | 125971 tok/s step 7832/19560 | loss 3.556087 (+2.83z)| norm 0.2901 
(+0.51z)| lr 4.12e-04 | 4163.09 ms | 32.4% bf16 MFU | 125969 tok/s step 7833/19560 | loss 3.451982 (-0.43z)| norm 0.3306 (+2.83z)| lr 4.12e-04 | 4156.48 ms | 32.5% bf16 MFU | 125977 tok/s step 7834/19560 | loss 3.438197 (-0.86z)| norm 0.2916 (+0.56z)| lr 4.12e-04 | 4146.19 ms | 32.6% bf16 MFU | 126001 tok/s step 7835/19560 | loss 3.545142 (+2.43z)| norm 0.2940 (+0.69z)| lr 4.12e-04 | 4150.93 ms | 32.5% bf16 MFU | 126016 tok/s step 7836/19560 | loss 3.492850 (+0.82z)| norm 0.2986 (+0.94z)| lr 4.12e-04 | 4153.40 ms | 32.5% bf16 MFU | 126027 tok/s step 7837/19560 | loss 3.518156 (+1.58z)| norm 0.2860 (+0.20z)| lr 4.12e-04 | 4152.76 ms | 32.5% bf16 MFU | 126038 tok/s step 7838/19560 | loss 3.450374 (-0.48z)| norm 0.3100 (+1.57z)| lr 4.12e-04 | 4166.39 ms | 32.4% bf16 MFU | 126028 tok/s step 7839/19560 | loss 3.488438 (+0.67z)| norm 0.2971 (+0.81z)| lr 4.12e-04 | 4140.95 ms | 32.6% bf16 MFU | 126057 tok/s step 7840/19560 | loss 3.442480 (-0.71z)| norm 0.2673 (-0.91z)| lr 4.12e-04 | 4156.23 ms | 32.5% bf16 MFU | 126062 tok/s step 7841/19560 | loss 3.445783 (-0.61z)| norm 0.2767 (-0.37z)| lr 4.12e-04 | 4158.00 ms | 32.5% bf16 MFU | 126063 tok/s step 7842/19560 | loss 3.448108 (-0.52z)| norm 0.2894 (+0.36z)| lr 4.12e-04 | 4171.36 ms | 32.4% bf16 MFU | 126044 tok/s step 7843/19560 | loss 3.557653 (+2.78z)| norm 0.2723 (-0.63z)| lr 4.12e-04 | 4160.64 ms | 32.5% bf16 MFU | 126043 tok/s step 7844/19560 | loss 3.448151 (-0.54z)| norm 0.2840 (+0.05z)| lr 4.12e-04 | 4150.64 ms | 32.5% bf16 MFU | 126056 tok/s step 7845/19560 | loss 3.444918 (-0.64z)| norm 0.2954 (+0.71z)| lr 4.11e-04 | 4159.52 ms | 32.5% bf16 MFU | 126056 tok/s step 7846/19560 | loss 3.500144 (+1.03z)| norm 0.2884 (+0.29z)| lr 4.11e-04 | 4170.52 ms | 32.4% bf16 MFU | 126039 tok/s step 7847/19560 | loss 3.473389 (+0.22z)| norm 0.3245 (+2.33z)| lr 4.11e-04 | 4151.45 ms | 32.5% bf16 MFU | 126051 tok/s step 7848/19560 | loss 3.442197 (-0.73z)| norm 0.3179 (+1.92z)| lr 4.11e-04 | 4162.09 ms | 32.4% bf16 MFU | 126047 
tok/s step 7849/19560 | loss 3.458389 (-0.24z)| norm 0.2654 (-1.03z)| lr 4.11e-04 | 4161.58 ms | 32.4% bf16 MFU | 126044 tok/s step 7850/19560 | loss 3.474796 (+0.25z)| norm 0.2980 (+0.79z)| lr 4.11e-04 | 4158.87 ms | 32.5% bf16 MFU | 126045 tok/s step 7851/19560 | loss 3.434316 (-0.97z)| norm 0.2734 (-0.59z)| lr 4.11e-04 | 4153.72 ms | 32.5% bf16 MFU | 126054 tok/s step 7852/19560 | loss 3.453916 (-0.37z)| norm 0.3039 (+1.11z)| lr 4.11e-04 | 4153.60 ms | 32.5% bf16 MFU | 126062 tok/s step 7853/19560 | loss 3.454306 (-0.36z)| norm 0.2925 (+0.45z)| lr 4.11e-04 | 4163.01 ms | 32.4% bf16 MFU | 126056 tok/s step 7854/19560 | loss 3.583308 (+3.38z)| norm 0.2906 (+0.34z)| lr 4.11e-04 | 4159.78 ms | 32.5% bf16 MFU | 126055 tok/s step 7855/19560 | loss 3.437063 (-0.88z)| norm 0.2670 (-0.99z)| lr 4.11e-04 | 4165.91 ms | 32.4% bf16 MFU | 126045 tok/s step 7856/19560 | loss 3.467979 (+0.01z)| norm 0.3239 (+2.19z)| lr 4.11e-04 | 4161.53 ms | 32.4% bf16 MFU | 126042 tok/s step 7857/19560 | loss 3.463700 (-0.12z)| norm 0.2709 (-0.80z)| lr 4.11e-04 | 4197.23 ms | 32.2% bf16 MFU | 125986 tok/s step 7858/19560 | loss 3.480591 (+0.37z)| norm 0.2956 (+0.59z)| lr 4.11e-04 | 4156.43 ms | 32.5% bf16 MFU | 125993 tok/s step 7859/19560 | loss 3.470943 (+0.08z)| norm 0.2659 (-1.09z)| lr 4.11e-04 | 4159.74 ms | 32.5% bf16 MFU | 125995 tok/s step 7860/19560 | loss 3.435492 (-0.95z)| norm 0.2563 (-1.61z)| lr 4.11e-04 | 4157.70 ms | 32.5% bf16 MFU | 126001 tok/s step 7861/19560 | loss 3.415611 (-1.52z)| norm 0.2702 (-0.83z)| lr 4.11e-04 | 4152.75 ms | 32.5% bf16 MFU | 126013 tok/s step 7862/19560 | loss 3.450303 (-0.50z)| norm 0.2533 (-1.74z)| lr 4.11e-04 | 4150.83 ms | 32.5% bf16 MFU | 126028 tok/s step 7863/19560 | loss 3.485394 (+0.52z)| norm 0.2505 (-1.85z)| lr 4.11e-04 | 4186.24 ms | 32.3% bf16 MFU | 125989 tok/s step 7864/19560 | loss 3.528903 (+1.76z)| norm 0.2483 (-1.94z)| lr 4.11e-04 | 4152.01 ms | 32.5% bf16 MFU | 126003 tok/s step 7865/19560 | loss 3.441696 (-0.76z)| norm 0.2810 
(-0.18z)| lr 4.11e-04 | 4169.34 ms | 32.4% bf16 MFU | 125990 tok/s step 7866/19560 | loss 3.416887 (-1.53z)| norm 0.2615 (-1.22z)| lr 4.11e-04 | 4164.90 ms | 32.4% bf16 MFU | 125985 tok/s step 7867/19560 | loss 3.458513 (-0.30z)| norm 0.2532 (-1.64z)| lr 4.10e-04 | 4160.11 ms | 32.5% bf16 MFU | 125987 tok/s step 7868/19560 | loss 3.406267 (-1.83z)| norm 0.2627 (-1.14z)| lr 4.10e-04 | 4160.58 ms | 32.5% bf16 MFU | 125988 tok/s step 7869/19560 | loss 3.425889 (-1.24z)| norm 0.2535 (-1.62z)| lr 4.10e-04 | 4147.54 ms | 32.6% bf16 MFU | 126009 tok/s step 7870/19560 | loss 3.426823 (-1.20z)| norm 0.2545 (-1.55z)| lr 4.10e-04 | 4150.21 ms | 32.5% bf16 MFU | 126025 tok/s step 7871/19560 | loss 3.426938 (-1.18z)| norm 0.2458 (-1.97z)| lr 4.10e-04 | 4147.40 ms | 32.6% bf16 MFU | 126045 tok/s step 7872/19560 | loss 3.434600 (-0.94z)| norm 0.2559 (-1.42z)| lr 4.10e-04 | 4155.92 ms | 32.5% bf16 MFU | 126050 tok/s step 7873/19560 | loss 3.427152 (-1.15z)| norm 0.2762 (-0.38z)| lr 4.10e-04 | 4149.58 ms | 32.5% bf16 MFU | 126065 tok/s step 7874/19560 | loss 3.453665 (-0.37z)| norm 0.2653 (-0.94z)| lr 4.10e-04 | 4154.17 ms | 32.5% bf16 MFU | 126072 tok/s step 7875/19560 | loss 3.446822 (-0.59z)| norm 0.2909 (+0.39z)| lr 4.10e-04 | 4158.96 ms | 32.5% bf16 MFU | 126072 tok/s step 7876/19560 | loss 3.443804 (-0.67z)| norm 0.2825 (-0.04z)| lr 4.10e-04 | 4154.61 ms | 32.5% bf16 MFU | 126078 tok/s step 7877/19560 | loss 3.468291 (+0.06z)| norm 0.2673 (-0.84z)| lr 4.10e-04 | 4159.82 ms | 32.5% bf16 MFU | 126076 tok/s step 7878/19560 | loss 3.456327 (-0.30z)| norm 0.2751 (-0.43z)| lr 4.10e-04 | 4144.50 ms | 32.6% bf16 MFU | 126097 tok/s step 7879/19560 | loss 3.497972 (+0.92z)| norm 0.2584 (-1.29z)| lr 4.10e-04 | 4136.16 ms | 32.6% bf16 MFU | 126130 tok/s step 7880/19560 | loss 3.445723 (-0.62z)| norm 0.2449 (-1.95z)| lr 4.10e-04 | 4148.77 ms | 32.5% bf16 MFU | 126142 tok/s step 7881/19560 | loss 3.446234 (-0.59z)| norm 0.2701 (-0.64z)| lr 4.10e-04 | 4152.22 ms | 32.5% bf16 MFU | 126148 
tok/s step 7882/19560 | loss 3.415221 (-1.49z)| norm 0.2769 (-0.29z)| lr 4.10e-04 | 4146.81 ms | 32.6% bf16 MFU | 126162 tok/s step 7883/19560 | loss 3.449035 (-0.48z)| norm 0.2732 (-0.48z)| lr 4.10e-04 | 4152.17 ms | 32.5% bf16 MFU | 126168 tok/s step 7884/19560 | loss 3.522933 (+1.78z)| norm 0.2846 (+0.13z)| lr 4.10e-04 | 4148.13 ms | 32.5% bf16 MFU | 126179 tok/s step 7885/19560 | loss 3.452809 (-0.38z)| norm 0.2862 (+0.20z)| lr 4.10e-04 | 4147.09 ms | 32.6% bf16 MFU | 126191 tok/s step 7886/19560 | loss 3.503557 (+1.17z)| norm 0.2981 (+0.81z)| lr 4.10e-04 | 4192.13 ms | 32.2% bf16 MFU | 126135 tok/s step 7887/19560 | loss 3.495927 (+0.92z)| norm 0.2673 (-0.79z)| lr 4.10e-04 | 4114.81 ms | 32.8% bf16 MFU | 126199 tok/s step 7888/19560 | loss 3.439938 (-0.78z)| norm 0.4496 (+6.87z)| lr 4.09e-04 | 4131.78 ms | 32.7% bf16 MFU | 126233 tok/s step 7889/19560 | loss 3.440587 (-0.75z)| norm 0.2815 (-0.09z)| lr 4.09e-04 | 4133.28 ms | 32.7% bf16 MFU | 126264 tok/s step 7890/19560 | loss 3.453997 (-0.35z)| norm 0.3044 (+0.85z)| lr 4.09e-04 | 4133.28 ms | 32.7% bf16 MFU | 126293 tok/s step 7891/19560 | loss 3.471117 (+0.17z)| norm 0.3221 (+1.55z)| lr 4.09e-04 | 4143.10 ms | 32.6% bf16 MFU | 126306 tok/s step 7892/19560 | loss 3.461912 (-0.10z)| norm 0.2826 (-0.07z)| lr 4.09e-04 | 4140.13 ms | 32.6% bf16 MFU | 126322 tok/s step 7893/19560 | loss 3.448661 (-0.50z)| norm 0.2819 (-0.09z)| lr 4.09e-04 | 4128.31 ms | 32.7% bf16 MFU | 126356 tok/s step 7894/19560 | loss 3.458244 (-0.20z)| norm 0.3009 (+0.69z)| lr 4.09e-04 | 4345.05 ms | 31.1% bf16 MFU | 126071 tok/s step 7895/19560 | loss 3.426641 (-1.16z)| norm 0.2581 (-1.05z)| lr 4.09e-04 | 4128.92 ms | 32.7% bf16 MFU | 126117 tok/s step 7896/19560 | loss 3.439890 (-0.74z)| norm 0.2645 (-0.79z)| lr 4.09e-04 | 4139.51 ms | 32.6% bf16 MFU | 126144 tok/s step 7897/19560 | loss 3.505470 (+1.25z)| norm 0.2846 (+0.04z)| lr 4.09e-04 | 4158.53 ms | 32.5% bf16 MFU | 126140 tok/s step 7898/19560 | loss 3.384042 (-2.40z)| norm 0.2477 
(-1.46z)| lr 4.09e-04 | 4138.37 ms | 32.6% bf16 MFU | 126168 tok/s step 7899/19560 | loss 3.437556 (-0.77z)| norm 0.2676 (-0.65z)| lr 4.09e-04 | 4146.67 ms | 32.6% bf16 MFU | 126181 tok/s step 7900/19560 | loss 3.438975 (-0.72z)| norm 0.2363 (-1.90z)| lr 4.09e-04 | 4144.96 ms | 32.6% bf16 MFU | 126196 tok/s step 7901/19560 | loss 3.442766 (-0.60z)| norm 0.2497 (-1.34z)| lr 4.09e-04 | 4150.20 ms | 32.5% bf16 MFU | 126203 tok/s step 7902/19560 | loss 3.450231 (-0.37z)| norm 0.2635 (-0.78z)| lr 4.09e-04 | 4166.59 ms | 32.4% bf16 MFU | 126184 tok/s step 7903/19560 | loss 3.469963 (+0.24z)| norm 0.2458 (-1.46z)| lr 4.09e-04 | 4149.50 ms | 32.5% bf16 MFU | 126193 tok/s step 7904/19560 | loss 3.453060 (-0.28z)| norm 0.2532 (-1.15z)| lr 4.09e-04 | 4152.44 ms | 32.5% bf16 MFU | 126196 tok/s step 7905/19560 | loss 3.430456 (-0.97z)| norm 0.2952 (+0.50z)| lr 4.09e-04 | 4147.11 ms | 32.6% bf16 MFU | 126207 tok/s step 7906/19560 | loss 3.488194 (+0.78z)| norm 0.2527 (-1.16z)| lr 4.09e-04 | 4149.46 ms | 32.5% bf16 MFU | 126215 tok/s step 7907/19560 | loss 3.444514 (-0.55z)| norm 0.2893 (+0.27z)| lr 4.09e-04 | 4143.43 ms | 32.6% bf16 MFU | 126231 tok/s step 7908/19560 | loss 3.572996 (+3.24z)| norm 0.2829 (+0.03z)| lr 4.09e-04 | 4153.25 ms | 32.5% bf16 MFU | 126231 tok/s step 7909/19560 | loss 3.413652 (-1.42z)| norm 0.2583 (-0.93z)| lr 4.09e-04 | 4153.22 ms | 32.5% bf16 MFU | 126231 tok/s step 7910/19560 | loss 3.479531 (+0.49z)| norm 0.2877 (+0.25z)| lr 4.08e-04 | 4164.03 ms | 32.4% bf16 MFU | 126215 tok/s step 7911/19560 | loss 3.490992 (+0.83z)| norm 0.2627 (-0.74z)| lr 4.08e-04 | 4156.48 ms | 32.5% bf16 MFU | 126211 tok/s step 7912/19560 | loss 3.449658 (-0.38z)| norm 0.3011 (+0.79z)| lr 4.08e-04 | 4148.37 ms | 32.5% bf16 MFU | 126220 tok/s step 7913/19560 | loss 3.451973 (-0.31z)| norm 0.2984 (+0.68z)| lr 4.08e-04 | 4143.77 ms | 32.6% bf16 MFU | 126235 tok/s step 7914/19560 | loss 3.511099 (+1.40z)| norm 0.2801 (-0.05z)| lr 4.08e-04 | 4145.62 ms | 32.6% bf16 MFU | 126247 
tok/s step 7915/19560 | loss 3.454264 (-0.26z)| norm 0.3196 (+1.50z)| lr 4.08e-04 | 4158.03 ms | 32.5% bf16 MFU | 126239 tok/s step 7916/19560 | loss 3.477459 (+0.41z)| norm 0.2785 (-0.12z)| lr 4.08e-04 | 4180.04 ms | 32.3% bf16 MFU | 126198 tok/s step 7917/19560 | loss 3.461447 (-0.06z)| norm 0.2894 (+0.33z)| lr 4.08e-04 | 4164.46 ms | 32.4% bf16 MFU | 126183 tok/s step 7918/19560 | loss 3.450949 (-0.37z)| norm 0.2630 (-0.72z)| lr 4.08e-04 | 4144.47 ms | 32.6% bf16 MFU | 126199 tok/s step 7919/19560 | loss 3.495538 (+0.95z)| norm 0.2765 (-0.17z)| lr 4.08e-04 | 4148.01 ms | 32.5% bf16 MFU | 126209 tok/s step 7920/19560 | loss 3.453875 (-0.28z)| norm 0.2937 (+0.52z)| lr 4.08e-04 | 4145.37 ms | 32.6% bf16 MFU | 126222 tok/s step 7921/19560 | loss 3.489600 (+0.78z)| norm 0.2891 (+0.34z)| lr 4.08e-04 | 4158.47 ms | 32.5% bf16 MFU | 126215 tok/s step 7922/19560 | loss 3.402112 (-1.78z)| norm 0.2616 (-0.76z)| lr 4.08e-04 | 4152.33 ms | 32.5% bf16 MFU | 126217 tok/s step 7923/19560 | loss 3.465681 (+0.09z)| norm 0.3141 (+1.32z)| lr 4.08e-04 | 4149.50 ms | 32.5% bf16 MFU | 126224 tok/s step 7924/19560 | loss 3.424459 (-1.12z)| norm 0.2965 (+0.61z)| lr 4.08e-04 | 4379.57 ms | 30.8% bf16 MFU | 125898 tok/s step 7925/19560 | loss 3.448065 (-0.42z)| norm 0.2914 (+0.40z)| lr 4.08e-04 | 4281.75 ms | 31.5% bf16 MFU | 125726 tok/s step 7926/19560 | loss 3.476004 (+0.39z)| norm 0.2839 (+0.10z)| lr 4.08e-04 | 4186.58 ms | 32.3% bf16 MFU | 125701 tok/s step 7927/19560 | loss 3.446178 (-0.48z)| norm 0.2966 (+0.60z)| lr 4.08e-04 | 4272.94 ms | 31.6% bf16 MFU | 125551 tok/s step 7928/19560 | loss 3.370510 (-2.64z)| norm 0.2679 (-0.53z)| lr 4.08e-04 | 4193.44 ms | 32.2% bf16 MFU | 125525 tok/s step 7929/19560 | loss 3.530280 (+1.94z)| norm 0.2996 (+0.72z)| lr 4.08e-04 | 4184.03 ms | 32.3% bf16 MFU | 125514 tok/s step 7930/19560 | loss 3.460020 (-0.06z)| norm 0.3006 (+0.76z)| lr 4.08e-04 | 4194.60 ms | 32.2% bf16 MFU | 125488 tok/s step 7931/19560 | loss 3.491590 (+0.83z)| norm 0.2800 
(-0.06z)| lr 4.07e-04 | 4201.00 ms | 32.1% bf16 MFU | 125453 tok/s step 7932/19560 | loss 3.451849 (-0.30z)| norm 0.3188 (+1.45z)| lr 4.07e-04 | 4142.36 ms | 32.6% bf16 MFU | 125509 tok/s step 7933/19560 | loss 3.481978 (+0.56z)| norm 0.2835 (+0.05z)| lr 4.07e-04 | 4162.93 ms | 32.4% bf16 MFU | 125531 tok/s step 7934/19560 | loss 3.428180 (-0.96z)| norm 0.3040 (+0.87z)| lr 4.07e-04 | 4146.85 ms | 32.6% bf16 MFU | 125576 tok/s step 7935/19560 | loss 3.490239 (+0.79z)| norm 0.3098 (+1.10z)| lr 4.07e-04 | 4148.76 ms | 32.5% bf16 MFU | 125616 tok/s step 7936/19560 | loss 3.373115 (-2.44z)| norm 0.2923 (+0.41z)| lr 4.07e-04 | 5555.88 ms | 24.3% bf16 MFU | 124053 tok/s step 7937/19560 | loss 3.375588 (-2.30z)| norm 0.2767 (-0.20z)| lr 4.07e-04 | 4181.48 ms | 32.3% bf16 MFU | 124120 tok/s step 7938/19560 | loss 3.465150 (+0.13z)| norm 0.2782 (-0.14z)| lr 4.07e-04 | 4144.97 ms | 32.6% bf16 MFU | 124238 tok/s step 7939/19560 | loss 3.439396 (-0.57z)| norm 0.2761 (-0.21z)| lr 4.07e-04 | 4157.18 ms | 32.5% bf16 MFU | 124332 tok/s step 7940/19560 | loss 3.395718 (-1.72z)| norm 0.2781 (-0.14z)| lr 4.07e-04 | 4156.77 ms | 32.5% bf16 MFU | 124422 tok/s step 7941/19560 | loss 3.411197 (-1.28z)| norm 0.2893 (+0.33z)| lr 4.07e-04 | 4151.31 ms | 32.5% bf16 MFU | 124515 tok/s step 7942/19560 | loss 3.416281 (-1.13z)| norm 0.3002 (+0.78z)| lr 4.07e-04 | 4156.99 ms | 32.5% bf16 MFU | 124596 tok/s step 7943/19560 | loss 3.480659 (+0.59z)| norm 0.2850 (+0.15z)| lr 4.07e-04 | 4159.59 ms | 32.5% bf16 MFU | 124668 tok/s step 7944/19560 | loss 3.412308 (-1.22z)| norm 0.2764 (-0.20z)| lr 4.07e-04 | 4156.67 ms | 32.5% bf16 MFU | 124741 tok/s step 7945/19560 | loss 3.485150 (+0.71z)| norm 0.3088 (+1.12z)| lr 4.07e-04 | 4142.67 ms | 32.6% bf16 MFU | 124832 tok/s step 7946/19560 | loss 3.394058 (-1.69z)| norm 0.3096 (+1.13z)| lr 4.07e-04 | 4152.62 ms | 32.5% bf16 MFU | 124903 tok/s step 7947/19560 | loss 3.448630 (-0.24z)| norm 0.2915 (+0.39z)| lr 4.07e-04 | 4209.24 ms | 32.1% bf16 MFU | 124886 
tok/s step 7948/19560 | loss 3.499333 (+1.11z)| norm 0.7120 (+9.47z)| lr 4.07e-04 | 4146.77 ms | 32.6% bf16 MFU | 124963 tok/s step 7949/19560 | loss 3.467217 (+0.25z)| norm 0.4008 (+2.48z)| lr 4.07e-04 | 4158.88 ms | 32.5% bf16 MFU | 125018 tok/s step 7950/19560 | loss 3.576352 (+3.02z)| norm 0.3556 (+1.48z)| lr 4.07e-04 | 4151.18 ms | 32.5% bf16 MFU | 125082 tok/s step 7951/19560 | loss 3.451703 (-0.19z)| norm 0.3285 (+0.88z)| lr 4.07e-04 | 4152.10 ms | 32.5% bf16 MFU | 125142 tok/s step 7952/19560 | loss 3.437670 (-0.54z)| norm 0.3181 (+0.65z)| lr 4.07e-04 | 4226.79 ms | 31.9% bf16 MFU | 125087 tok/s step 7953/19560 | loss 3.445209 (-0.34z)| norm 0.3180 (+0.64z)| lr 4.06e-04 | 4155.49 ms | 32.5% bf16 MFU | 125141 tok/s step 7954/19560 | loss 3.457652 (-0.02z)| norm 0.3213 (+0.70z)| lr 4.06e-04 | 4156.05 ms | 32.5% bf16 MFU | 125191 tok/s step 7955/19560 | loss 3.426576 (-0.82z)| norm 0.3005 (+0.26z)| lr 4.06e-04 | 4148.51 ms | 32.5% bf16 MFU | 125251 tok/s step 7956/19560 | loss 3.465453 (+0.18z)| norm 0.3064 (+0.38z)| lr 4.06e-04 | 4144.84 ms | 32.6% bf16 MFU | 125313 tok/s step 7957/19560 | loss 3.449466 (-0.23z)| norm 0.2998 (+0.23z)| lr 4.06e-04 | 4152.31 ms | 32.5% bf16 MFU | 125360 tok/s step 7958/19560 | loss 3.457998 (-0.00z)| norm 0.2782 (-0.23z)| lr 4.06e-04 | 4145.69 ms | 32.6% bf16 MFU | 125415 tok/s step 7959/19560 | loss 3.511079 (+1.35z)| norm 0.2894 (+0.00z)| lr 4.06e-04 | 4160.57 ms | 32.5% bf16 MFU | 125445 tok/s step 7960/19560 | loss 3.406634 (-1.35z)| norm 0.2801 (-0.20z)| lr 4.06e-04 | 4182.58 ms | 32.3% bf16 MFU | 125441 tok/s step 7961/19560 | loss 3.459103 (+0.03z)| norm 0.3086 (+0.42z)| lr 4.06e-04 | 4148.88 ms | 32.5% bf16 MFU | 125487 tok/s step 7962/19560 | loss 3.400013 (-1.51z)| norm 0.2704 (-0.40z)| lr 4.06e-04 | 4143.80 ms | 32.6% bf16 MFU | 125539 tok/s step 7963/19560 | loss 3.445472 (-0.30z)| norm 0.2660 (-0.49z)| lr 4.06e-04 | 4143.07 ms | 32.6% bf16 MFU | 125589 tok/s step 7964/19560 | loss 3.403325 (-1.41z)| norm 0.2878 
(-0.02z)| lr 4.06e-04 | 4173.98 ms | 32.3% bf16 MFU | 125590 tok/s step 7965/19560 | loss 3.381103 (-1.96z)| norm 0.2546 (-0.72z)| lr 4.06e-04 | 4157.46 ms | 32.5% bf16 MFU | 125616 tok/s step 7966/19560 | loss 3.456841 (+0.05z)| norm 0.2776 (-0.22z)| lr 4.06e-04 | 4149.07 ms | 32.5% bf16 MFU | 125653 tok/s step 7967/19560 | loss 3.446140 (-0.23z)| norm 0.2887 (+0.01z)| lr 4.06e-04 | 4149.08 ms | 32.5% bf16 MFU | 125689 tok/s step 7968/19560 | loss 3.425637 (-0.77z)| norm 0.2730 (-0.32z)| lr 4.06e-04 | 4160.34 ms | 32.5% bf16 MFU | 125705 tok/s step 7969/19560 | loss 3.440286 (-0.38z)| norm 0.2926 (+0.09z)| lr 4.06e-04 | 4153.10 ms | 32.5% bf16 MFU | 125732 tok/s step 7970/19560 | loss 3.418326 (-0.95z)| norm 0.2779 (-0.22z)| lr 4.06e-04 | 4139.10 ms | 32.6% bf16 MFU | 125779 tok/s step 7971/19560 | loss 3.410260 (-1.16z)| norm 0.2913 (+0.06z)| lr 4.06e-04 | 4154.49 ms | 32.5% bf16 MFU | 125800 tok/s step 7972/19560 | loss 3.507907 (+1.46z)| norm 0.2507 (-0.80z)| lr 4.06e-04 | 4161.91 ms | 32.4% bf16 MFU | 125809 tok/s step 7973/19560 | loss 3.445484 (-0.22z)| norm 0.2757 (-0.26z)| lr 4.06e-04 | 4170.16 ms | 32.4% bf16 MFU | 125804 tok/s step 7974/19560 | loss 3.454276 (+0.03z)| norm 0.2675 (-0.43z)| lr 4.05e-04 | 4151.00 ms | 32.5% bf16 MFU | 125829 tok/s step 7975/19560 | loss 3.382645 (-1.87z)| norm 0.2839 (-0.08z)| lr 4.05e-04 | 4177.16 ms | 32.3% bf16 MFU | 125813 tok/s step 7976/19560 | loss 3.439875 (-0.34z)| norm 0.2741 (-0.28z)| lr 4.05e-04 | 4162.35 ms | 32.4% bf16 MFU | 125821 tok/s step 7977/19560 | loss 3.389005 (-1.67z)| norm 0.2620 (-0.54z)| lr 4.05e-04 | 4146.64 ms | 32.6% bf16 MFU | 125852 tok/s step 7978/19560 | loss 3.464848 (+0.34z)| norm 0.2887 (+0.04z)| lr 4.05e-04 | 4165.05 ms | 32.4% bf16 MFU | 125853 tok/s step 7979/19560 | loss 3.483855 (+0.83z)| norm 0.2646 (-0.48z)| lr 4.05e-04 | 4152.64 ms | 32.5% bf16 MFU | 125873 tok/s step 7980/19560 | loss 3.444592 (-0.20z)| norm 0.2819 (-0.10z)| lr 4.05e-04 | 4163.97 ms | 32.4% bf16 MFU | 125875 
tok/s step 7981/19560 | loss 3.421498 (-0.80z)| norm 0.2725 (-0.30z)| lr 4.05e-04 | 4156.79 ms | 32.5% bf16 MFU | 125887 tok/s step 7982/19560 | loss 3.465944 (+0.41z)| norm 0.2835 (-0.07z)| lr 4.05e-04 | 4154.49 ms | 32.5% bf16 MFU | 125903 tok/s step 7983/19560 | loss 3.436615 (-0.40z)| norm 0.2680 (-0.40z)| lr 4.05e-04 | 4152.44 ms | 32.5% bf16 MFU | 125921 tok/s step 7984/19560 | loss 3.444346 (-0.18z)| norm 0.2725 (-0.29z)| lr 4.05e-04 | 4150.69 ms | 32.5% bf16 MFU | 125940 tok/s step 7985/19560 | loss 3.460075 (+0.25z)| norm 0.2915 (+0.11z)| lr 4.05e-04 | 4146.34 ms | 32.6% bf16 MFU | 125966 tok/s step 7986/19560 | loss 3.417271 (-0.92z)| norm 0.2840 (-0.05z)| lr 4.05e-04 | 4162.30 ms | 32.4% bf16 MFU | 125966 tok/s step 7987/19560 | loss 3.426352 (-0.65z)| norm 0.2815 (-0.10z)| lr 4.05e-04 | 4154.67 ms | 32.5% bf16 MFU | 125977 tok/s step 7988/19560 | loss 3.457754 (+0.21z)| norm 0.2847 (-0.04z)| lr 4.05e-04 | 4145.85 ms | 32.6% bf16 MFU | 126001 tok/s step 7989/19560 | loss 3.388006 (-1.70z)| norm 0.2656 (-0.45z)| lr 4.05e-04 | 4163.07 ms | 32.4% bf16 MFU | 125998 tok/s step 7990/19560 | loss 3.393360 (-1.53z)| norm 0.2934 (+0.14z)| lr 4.05e-04 | 4153.92 ms | 32.5% bf16 MFU | 126009 tok/s step 7991/19560 | loss 3.427814 (-0.58z)| norm 0.2726 (-0.31z)| lr 4.05e-04 | 4148.70 ms | 32.5% bf16 MFU | 126027 tok/s step 7992/19560 | loss 3.446687 (-0.05z)| norm 0.2489 (-0.82z)| lr 4.05e-04 | 4154.48 ms | 32.5% bf16 MFU | 126036 tok/s step 7993/19560 | loss 3.439856 (-0.24z)| norm 0.2542 (-0.70z)| lr 4.05e-04 | 4158.67 ms | 32.5% bf16 MFU | 126037 tok/s step 7994/19560 | loss 3.469318 (+0.57z)| norm 0.2615 (-0.55z)| lr 4.05e-04 | 4153.44 ms | 32.5% bf16 MFU | 126047 tok/s step 7995/19560 | loss 3.419090 (-0.82z)| norm 0.2797 (-0.16z)| lr 4.05e-04 | 4151.13 ms | 32.5% bf16 MFU | 126060 tok/s step 7996/19560 | loss 3.401882 (-1.29z)| norm 0.2889 (+0.04z)| lr 4.04e-04 | 4174.19 ms | 32.3% bf16 MFU | 126037 tok/s step 7997/19560 | loss 3.420838 (-0.77z)| norm 0.2869 
(-0.01z)| lr 4.04e-04 | 4158.51 ms | 32.5% bf16 MFU | 126039 tok/s step 7998/19560 | loss 3.437053 (-0.32z)| norm 0.2979 (+0.22z)| lr 4.04e-04 | 4147.17 ms | 32.6% bf16 MFU | 126058 tok/s step 7999/19560 | loss 3.446553 (-0.06z)| norm 0.2771 (-0.24z)| lr 4.04e-04 | 4158.55 ms | 32.5% bf16 MFU | 126059 tok/s step 8000/19560 | loss 3.448921 (+0.00z)| norm 0.3025 (+0.31z)| lr 4.04e-04 | 4158.55 ms | 32.5% bf16 MFU | 126059 tok/s val loss 3.437401 evaluating HellaSwag: 0/1256 evaluating HellaSwag: 10/1256 evaluating HellaSwag: 20/1256 evaluating HellaSwag: 30/1256 evaluating HellaSwag: 40/1256 evaluating HellaSwag: 50/1256 evaluating HellaSwag: 60/1256 evaluating HellaSwag: 70/1256 evaluating HellaSwag: 80/1256 evaluating HellaSwag: 90/1256 evaluating HellaSwag: 100/1256 evaluating HellaSwag: 110/1256 evaluating HellaSwag: 120/1256 evaluating HellaSwag: 130/1256 evaluating HellaSwag: 140/1256 evaluating HellaSwag: 150/1256 evaluating HellaSwag: 160/1256 evaluating HellaSwag: 170/1256 evaluating HellaSwag: 180/1256 evaluating HellaSwag: 190/1256 evaluating HellaSwag: 200/1256 evaluating HellaSwag: 210/1256 evaluating HellaSwag: 220/1256 evaluating HellaSwag: 230/1256 evaluating HellaSwag: 240/1256 evaluating HellaSwag: 250/1256 evaluating HellaSwag: 260/1256 evaluating HellaSwag: 270/1256 evaluating HellaSwag: 280/1256 evaluating HellaSwag: 290/1256 evaluating HellaSwag: 300/1256 evaluating HellaSwag: 310/1256 evaluating HellaSwag: 320/1256 evaluating HellaSwag: 330/1256 evaluating HellaSwag: 340/1256 evaluating HellaSwag: 350/1256 evaluating HellaSwag: 360/1256 evaluating HellaSwag: 370/1256 evaluating HellaSwag: 380/1256 evaluating HellaSwag: 390/1256 evaluating HellaSwag: 400/1256 evaluating HellaSwag: 410/1256 evaluating HellaSwag: 420/1256 evaluating HellaSwag: 430/1256 evaluating HellaSwag: 440/1256 evaluating HellaSwag: 450/1256 evaluating HellaSwag: 460/1256 evaluating HellaSwag: 470/1256 evaluating HellaSwag: 480/1256 evaluating HellaSwag: 490/1256 evaluating 
HellaSwag: 500/1256 evaluating HellaSwag: 510/1256 evaluating HellaSwag: 520/1256 evaluating HellaSwag: 530/1256 evaluating HellaSwag: 540/1256 evaluating HellaSwag: 550/1256 evaluating HellaSwag: 560/1256 evaluating HellaSwag: 570/1256 evaluating HellaSwag: 580/1256 evaluating HellaSwag: 590/1256 evaluating HellaSwag: 600/1256 evaluating HellaSwag: 610/1256 evaluating HellaSwag: 620/1256 evaluating HellaSwag: 630/1256 evaluating HellaSwag: 640/1256 evaluating HellaSwag: 650/1256 evaluating HellaSwag: 660/1256 evaluating HellaSwag: 670/1256 evaluating HellaSwag: 680/1256 evaluating HellaSwag: 690/1256 evaluating HellaSwag: 700/1256 evaluating HellaSwag: 710/1256 evaluating HellaSwag: 720/1256 evaluating HellaSwag: 730/1256 evaluating HellaSwag: 740/1256 evaluating HellaSwag: 750/1256 evaluating HellaSwag: 760/1256 evaluating HellaSwag: 770/1256 evaluating HellaSwag: 780/1256 evaluating HellaSwag: 790/1256 evaluating HellaSwag: 800/1256 evaluating HellaSwag: 810/1256 evaluating HellaSwag: 820/1256 evaluating HellaSwag: 830/1256 evaluating HellaSwag: 840/1256 evaluating HellaSwag: 850/1256 evaluating HellaSwag: 860/1256 evaluating HellaSwag: 870/1256 evaluating HellaSwag: 880/1256 evaluating HellaSwag: 890/1256 evaluating HellaSwag: 900/1256 evaluating HellaSwag: 910/1256 evaluating HellaSwag: 920/1256 evaluating HellaSwag: 930/1256 evaluating HellaSwag: 940/1256 evaluating HellaSwag: 950/1256 evaluating HellaSwag: 960/1256 evaluating HellaSwag: 970/1256 evaluating HellaSwag: 980/1256 evaluating HellaSwag: 990/1256 evaluating HellaSwag: 1000/1256 evaluating HellaSwag: 1010/1256 evaluating HellaSwag: 1020/1256 evaluating HellaSwag: 1030/1256 evaluating HellaSwag: 1040/1256 evaluating HellaSwag: 1050/1256 evaluating HellaSwag: 1060/1256 evaluating HellaSwag: 1070/1256 evaluating HellaSwag: 1080/1256 evaluating HellaSwag: 1090/1256 evaluating HellaSwag: 1100/1256 evaluating HellaSwag: 1110/1256 evaluating HellaSwag: 1120/1256 evaluating HellaSwag: 1130/1256 evaluating 
HellaSwag: 1140/1256 evaluating HellaSwag: 1150/1256 evaluating HellaSwag: 1160/1256 evaluating HellaSwag: 1170/1256 evaluating HellaSwag: 1180/1256 evaluating HellaSwag: 1190/1256 evaluating HellaSwag: 1200/1256 evaluating HellaSwag: 1210/1256 evaluating HellaSwag: 1220/1256 evaluating HellaSwag: 1230/1256 evaluating HellaSwag: 1240/1256 evaluating HellaSwag: 1250/1256 HellaSwag: 2878/10042 = 0.286596 step 8001/19560 | loss 3.445506 (-0.10z)| norm 0.3192 (+0.66z)| lr 4.04e-04 | 4313.59 ms | 31.3% bf16 MFU | 125834 tok/s step 8002/19560 | loss 3.472582 (+0.65z)| norm 0.2881 (-0.02z)| lr 4.04e-04 | 4251.15 ms | 31.8% bf16 MFU | 125708 tok/s step 8003/19560 | loss 3.481338 (+0.88z)| norm 0.2933 (+0.10z)| lr 4.04e-04 | 4199.47 ms | 32.2% bf16 MFU | 125665 tok/s step 8004/19560 | loss 3.418643 (-0.84z)| norm 0.3228 (+0.73z)| lr 4.04e-04 | 4155.33 ms | 32.5% bf16 MFU | 125691 tok/s step 8005/19560 | loss 3.403417 (-1.24z)| norm 0.3031 (+0.29z)| lr 4.04e-04 | 4250.56 ms | 31.8% bf16 MFU | 125573 tok/s step 8006/19560 | loss 3.434691 (-0.38z)| norm 0.3222 (+0.70z)| lr 4.04e-04 | 4469.62 ms | 30.2% bf16 MFU | 125160 tok/s step 8007/19560 | loss 3.441061 (-0.19z)| norm 0.3270 (+0.79z)| lr 4.04e-04 | 4159.79 ms | 32.5% bf16 MFU | 125204 tok/s step 8008/19560 | loss 3.577220 (+3.38z)| norm 0.2954 (+0.10z)| lr 4.04e-04 | 4143.19 ms | 32.6% bf16 MFU | 125271 tok/s step 8009/19560 | loss 3.421878 (-0.71z)| norm 0.3459 (+1.18z)| lr 4.04e-04 | 4170.73 ms | 32.4% bf16 MFU | 125292 tok/s step 8010/19560 | loss 3.454325 (+0.14z)| norm 0.3143 (+0.49z)| lr 4.04e-04 | 4143.68 ms | 32.6% bf16 MFU | 125354 tok/s step 8011/19560 | loss 3.449213 (-0.00z)| norm 0.2931 (+0.03z)| lr 4.04e-04 | 4166.47 ms | 32.4% bf16 MFU | 125378 tok/s step 8012/19560 | loss 3.496602 (+1.27z)| norm 0.3202 (+0.61z)| lr 4.04e-04 | 4165.74 ms | 32.4% bf16 MFU | 125402 tok/s step 8013/19560 | loss 3.460558 (+0.31z)| norm 0.3229 (+0.66z)| lr 4.04e-04 | 4160.45 ms | 32.5% bf16 MFU | 125433 tok/s step 8014/19560 | 
loss 3.465932 (+0.46z)| norm 0.2943 (+0.04z)| lr 4.04e-04 | 4159.42 ms | 32.5% bf16 MFU | 125464 tok/s step 8015/19560 | loss 3.414943 (-0.89z)| norm 0.2928 (+0.01z)| lr 4.04e-04 | 4140.40 ms | 32.6% bf16 MFU | 125522 tok/s step 8016/19560 | loss 3.423513 (-0.66z)| norm 0.3020 (+0.24z)| lr 4.04e-04 | 4148.80 ms | 32.5% bf16 MFU | 125564 tok/s step 8017/19560 | loss 3.458482 (+0.28z)| norm 0.2876 (-0.09z)| lr 4.03e-04 | 4163.00 ms | 32.4% bf16 MFU | 125583 tok/s step 8018/19560 | loss 3.457305 (+0.25z)| norm 0.3057 (+0.32z)| lr 4.03e-04 | 4220.41 ms | 32.0% bf16 MFU | 125515 tok/s step 8019/19560 | loss 3.382611 (-1.73z)| norm 0.2773 (-0.31z)| lr 4.03e-04 | 4185.80 ms | 32.3% bf16 MFU | 125502 tok/s step 8020/19560 | loss 3.413152 (-0.90z)| norm 0.2914 (+0.00z)| lr 4.03e-04 | 4138.57 ms | 32.6% bf16 MFU | 125561 tok/s step 8021/19560 | loss 3.439615 (-0.20z)| norm 0.2925 (+0.03z)| lr 4.03e-04 | 4152.21 ms | 32.5% bf16 MFU | 125597 tok/s step 8022/19560 | loss 3.423452 (-0.62z)| norm 0.2771 (-0.32z)| lr 4.03e-04 | 4141.36 ms | 32.6% bf16 MFU | 125647 tok/s step 8023/19560 | loss 3.450356 (+0.09z)| norm 0.2880 (-0.08z)| lr 4.03e-04 | 4208.43 ms | 32.1% bf16 MFU | 125593 tok/s step 8024/19560 | loss 3.420444 (-0.70z)| norm 0.2971 (+0.13z)| lr 4.03e-04 | 4148.11 ms | 32.5% bf16 MFU | 125633 tok/s step 8025/19560 | loss 3.401938 (-1.17z)| norm 0.2829 (-0.20z)| lr 4.03e-04 | 4151.64 ms | 32.5% bf16 MFU | 125666 tok/s step 8026/19560 | loss 3.449192 (+0.07z)| norm 0.2862 (-0.13z)| lr 4.03e-04 | 4152.20 ms | 32.5% bf16 MFU | 125696 tok/s step 8027/19560 | loss 3.404076 (-1.13z)| norm 0.2798 (-0.28z)| lr 4.03e-04 | 4187.11 ms | 32.2% bf16 MFU | 125672 tok/s step 8028/19560 | loss 3.408892 (-0.99z)| norm 0.3135 (+0.48z)| lr 4.03e-04 | 4157.27 ms | 32.5% bf16 MFU | 125694 tok/s step 8029/19560 | loss 3.395325 (-1.33z)| norm 0.3249 (+0.73z)| lr 4.03e-04 | 4162.78 ms | 32.4% bf16 MFU | 125707 tok/s step 8030/19560 | loss 3.490597 (+1.17z)| norm 0.3194 (+0.60z)| lr 4.03e-04 | 
4142.91 ms | 32.6% bf16 MFU | 125749 tok/s step 8031/19560 | loss 3.465649 (+0.52z)| norm 0.2908 (-0.07z)| lr 4.03e-04 | 4147.59 ms | 32.6% bf16 MFU | 125782 tok/s step 8032/19560 | loss 3.481664 (+0.93z)| norm 0.2933 (-0.02z)| lr 4.03e-04 | 4453.08 ms | 30.3% bf16 MFU | 125379 tok/s step 8033/19560 | loss 3.448137 (+0.05z)| norm 0.2741 (-0.47z)| lr 4.03e-04 | 4148.21 ms | 32.5% bf16 MFU | 125430 tok/s step 8034/19560 | loss 3.475190 (+0.76z)| norm 0.2990 (+0.11z)| lr 4.03e-04 | 4142.15 ms | 32.6% bf16 MFU | 125487 tok/s step 8035/19560 | loss 3.440281 (-0.15z)| norm 0.2863 (-0.19z)| lr 4.03e-04 | 4145.50 ms | 32.6% bf16 MFU | 125536 tok/s step 8036/19560 | loss 3.490799 (+1.24z)| norm 0.2698 (-0.57z)| lr 4.03e-04 | 4144.40 ms | 32.6% bf16 MFU | 125585 tok/s step 8037/19560 | loss 3.517669 (+1.93z)| norm 0.3254 (+0.71z)| lr 4.03e-04 | 4148.49 ms | 32.5% bf16 MFU | 125625 tok/s step 8038/19560 | loss 3.498172 (+1.39z)| norm 0.2706 (-0.56z)| lr 4.02e-04 | 4147.89 ms | 32.6% bf16 MFU | 125663 tok/s step 8039/19560 | loss 3.453450 (+0.20z)| norm 0.3254 (+0.70z)| lr 4.02e-04 | 4296.07 ms | 31.4% bf16 MFU | 125482 tok/s step 8040/19560 | loss 3.425787 (-0.55z)| norm 0.2638 (-0.72z)| lr 4.02e-04 | 4137.41 ms | 32.6% bf16 MFU | 125544 tok/s step 8041/19560 | loss 3.432469 (-0.36z)| norm 0.3008 (+0.14z)| lr 4.02e-04 | 4140.64 ms | 32.6% bf16 MFU | 125598 tok/s step 8042/19560 | loss 3.565090 (+3.13z)| norm 0.2985 (+0.08z)| lr 4.02e-04 | 4195.66 ms | 32.2% bf16 MFU | 125566 tok/s step 8043/19560 | loss 3.404831 (-1.07z)| norm 0.3175 (+0.52z)| lr 4.02e-04 | 4148.16 ms | 32.5% bf16 MFU | 125607 tok/s step 8044/19560 | loss 3.451354 (+0.15z)| norm 0.2695 (-0.59z)| lr 4.02e-04 | 4147.65 ms | 32.6% bf16 MFU | 125647 tok/s step 8045/19560 | loss 3.476891 (+0.82z)| norm 0.3004 (+0.12z)| lr 4.02e-04 | 4160.97 ms | 32.4% bf16 MFU | 125665 tok/s step 8046/19560 | loss 3.440694 (-0.13z)| norm 0.2818 (-0.31z)| lr 4.02e-04 | 4144.64 ms | 32.6% bf16 MFU | 125706 tok/s step 8047/19560 | 
loss 3.433662 (-0.30z)| norm 0.2949 (-0.01z)| lr 4.02e-04 | 4161.23 ms | 32.4% bf16 MFU | 125721 tok/s step 8048/19560 | loss 3.425122 (-0.52z)| norm 0.2796 (-0.37z)| lr 4.02e-04 | 4139.02 ms | 32.6% bf16 MFU | 125768 tok/s step 8049/19560 | loss 3.427554 (-0.45z)| norm 0.2918 (-0.08z)| lr 4.02e-04 | 4149.37 ms | 32.5% bf16 MFU | 125797 tok/s step 8050/19560 | loss 3.398439 (-1.22z)| norm 0.2878 (-0.18z)| lr 4.02e-04 | 4144.50 ms | 32.6% bf16 MFU | 125833 tok/s step 8051/19560 | loss 3.449958 (+0.15z)| norm 0.2657 (-0.69z)| lr 4.02e-04 | 4167.72 ms | 32.4% bf16 MFU | 125831 tok/s step 8052/19560 | loss 3.393049 (-1.34z)| norm 0.3459 (+1.17z)| lr 4.02e-04 | 4143.98 ms | 32.6% bf16 MFU | 125865 tok/s step 8053/19560 | loss 3.413862 (-0.79z)| norm 0.3094 (+0.32z)| lr 4.02e-04 | 4143.63 ms | 32.6% bf16 MFU | 125898 tok/s step 8054/19560 | loss 3.463715 (+0.53z)| norm 0.2925 (-0.08z)| lr 4.02e-04 | 4151.50 ms | 32.5% bf16 MFU | 125918 tok/s step 8055/19560 | loss 3.413309 (-0.79z)| norm 0.2998 (+0.09z)| lr 4.02e-04 | 4163.43 ms | 32.4% bf16 MFU | 125918 tok/s step 8056/19560 | loss 3.435599 (-0.22z)| norm 0.3205 (+0.57z)| lr 4.02e-04 | 4212.40 ms | 32.1% bf16 MFU | 125846 tok/s step 8057/19560 | loss 3.425459 (-0.48z)| norm 0.2904 (-0.13z)| lr 4.02e-04 | 4150.11 ms | 32.5% bf16 MFU | 125870 tok/s step 8058/19560 | loss 3.426570 (-0.44z)| norm 0.2648 (-0.72z)| lr 4.02e-04 | 4159.69 ms | 32.5% bf16 MFU | 125878 tok/s step 8059/19560 | loss 3.484536 (+1.13z)| norm 0.2951 (-0.02z)| lr 4.01e-04 | 4139.99 ms | 32.6% bf16 MFU | 125916 tok/s step 8060/19560 | loss 3.419784 (-0.62z)| norm 0.2458 (-1.15z)| lr 4.01e-04 | 4146.53 ms | 32.6% bf16 MFU | 125943 tok/s step 8061/19560 | loss 3.467826 (+0.69z)| norm 0.3036 (+0.19z)| lr 4.01e-04 | 4157.36 ms | 32.5% bf16 MFU | 125951 tok/s step 8062/19560 | loss 3.447590 (+0.13z)| norm 0.3095 (+0.32z)| lr 4.01e-04 | 4157.47 ms | 32.5% bf16 MFU | 125959 tok/s step 8063/19560 | loss 3.497029 (+1.48z)| norm 0.2730 (-0.51z)| lr 4.01e-04 | 
4161.76 ms | 32.4% bf16 MFU | 125960 tok/s step 8064/19560 | loss 3.411102 (-0.88z)| norm 0.3376 (+0.97z)| lr 4.01e-04 | 4145.05 ms | 32.6% bf16 MFU | 125986 tok/s step 8065/19560 | loss 3.384094 (-1.63z)| norm 0.2871 (-0.20z)| lr 4.01e-04 | 4156.37 ms | 32.5% bf16 MFU | 125994 tok/s step 8066/19560 | loss 3.514259 (+1.93z)| norm 0.2821 (-0.31z)| lr 4.01e-04 | 4153.37 ms | 32.5% bf16 MFU | 126006 tok/s step 8067/19560 | loss 3.408390 (-0.95z)| norm 0.3273 (+0.72z)| lr 4.01e-04 | 4134.13 ms | 32.7% bf16 MFU | 126046 tok/s step 8068/19560 | loss 3.455461 (+0.32z)| norm 0.2511 (-1.03z)| lr 4.01e-04 | 4181.74 ms | 32.3% bf16 MFU | 126013 tok/s step 8069/19560 | loss 3.456360 (+0.34z)| norm 0.2914 (-0.10z)| lr 4.01e-04 | 4147.22 ms | 32.6% bf16 MFU | 126033 tok/s step 8070/19560 | loss 3.451041 (+0.19z)| norm 0.2603 (-0.81z)| lr 4.01e-04 | 4342.50 ms | 31.1% bf16 MFU | 125768 tok/s step 8071/19560 | loss 3.428774 (-0.42z)| norm 0.2832 (-0.28z)| lr 4.01e-04 | 4158.70 ms | 32.5% bf16 MFU | 125783 tok/s step 8072/19560 | loss 3.515495 (+1.93z)| norm 0.2913 (-0.10z)| lr 4.01e-04 | 4215.46 ms | 32.0% bf16 MFU | 125713 tok/s step 8073/19560 | loss 3.433777 (-0.29z)| norm 0.2938 (-0.04z)| lr 4.01e-04 | 4154.36 ms | 32.5% bf16 MFU | 125737 tok/s step 8074/19560 | loss 3.450469 (+0.16z)| norm 0.2826 (-0.29z)| lr 4.01e-04 | 4152.79 ms | 32.5% bf16 MFU | 125763 tok/s step 8075/19560 | loss 3.430808 (-0.38z)| norm 0.3237 (+0.64z)| lr 4.01e-04 | 4143.33 ms | 32.6% bf16 MFU | 125802 tok/s step 8076/19560 | loss 3.425088 (-0.53z)| norm 0.2776 (-0.63z)| lr 4.01e-04 | 4146.92 ms | 32.6% bf16 MFU | 125833 tok/s step 8077/19560 | loss 3.462257 (+0.51z)| norm 0.2856 (-0.27z)| lr 4.01e-04 | 4153.47 ms | 32.5% bf16 MFU | 125853 tok/s step 8078/19560 | loss 3.365046 (-2.23z)| norm 0.2797 (-0.54z)| lr 4.01e-04 | 4177.42 ms | 32.3% bf16 MFU | 125835 tok/s step 8079/19560 | loss 3.461805 (+0.56z)| norm 0.2783 (-0.59z)| lr 4.01e-04 | 4142.22 ms | 32.6% bf16 MFU | 125872 tok/s step 8080/19560 | 
loss 3.482627 (+1.14z)| norm 0.3113 (+1.03z)| lr 4.01e-04 | 4158.21 ms | 32.5% bf16 MFU | 125883 tok/s step 8081/19560 | loss 3.465390 (+0.64z)| norm 0.2734 (-0.82z)| lr 4.00e-04 | 4141.35 ms | 32.6% bf16 MFU | 125919 tok/s step 8082/19560 | loss 3.480335 (+1.06z)| norm 0.2786 (-0.55z)| lr 4.00e-04 | 4139.44 ms | 32.6% bf16 MFU | 125956 tok/s step 8083/19560 | loss 3.447620 (+0.12z)| norm 0.3001 (+0.52z)| lr 4.00e-04 | 4143.96 ms | 32.6% bf16 MFU | 125984 tok/s step 8084/19560 | loss 3.525986 (+2.30z)| norm 0.3008 (+0.56z)| lr 4.00e-04 | 4155.56 ms | 32.5% bf16 MFU | 125993 tok/s step 8085/19560 | loss 3.426850 (-0.47z)| norm 0.2719 (-0.87z)| lr 4.00e-04 | 4142.47 ms | 32.6% bf16 MFU | 126021 tok/s step 8086/19560 | loss 3.441684 (-0.05z)| norm 0.2830 (-0.32z)| lr 4.00e-04 | 4160.87 ms | 32.4% bf16 MFU | 126020 tok/s step 8087/19560 | loss 3.408149 (-0.97z)| norm 0.2532 (-1.77z)| lr 4.00e-04 | 4154.42 ms | 32.5% bf16 MFU | 126029 tok/s step 8088/19560 | loss 3.482806 (+1.12z)| norm 0.2597 (-1.43z)| lr 4.00e-04 | 4216.88 ms | 32.0% bf16 MFU | 125944 tok/s step 8089/19560 | loss 3.383046 (-1.66z)| norm 0.2547 (-1.64z)| lr 4.00e-04 | 4143.31 ms | 32.6% bf16 MFU | 125974 tok/s step 8090/19560 | loss 3.438791 (-0.12z)| norm 0.2627 (-1.25z)| lr 4.00e-04 | 4142.13 ms | 32.6% bf16 MFU | 126004 tok/s step 8091/19560 | loss 3.500680 (+1.59z)| norm 0.2534 (-1.68z)| lr 4.00e-04 | 4161.11 ms | 32.4% bf16 MFU | 126004 tok/s step 8092/19560 | loss 3.463446 (+0.55z)| norm 0.2627 (-1.22z)| lr 4.00e-04 | 4149.62 ms | 32.5% bf16 MFU | 126021 tok/s step 8093/19560 | loss 3.407755 (-1.02z)| norm 0.2715 (-0.81z)| lr 4.00e-04 | 4154.78 ms | 32.5% bf16 MFU | 126029 tok/s step 8094/19560 | loss 3.414762 (-0.81z)| norm 0.2715 (-0.80z)| lr 4.00e-04 | 4159.48 ms | 32.5% bf16 MFU | 126030 tok/s step 8095/19560 | loss 3.444841 (+0.03z)| norm 0.2562 (-1.52z)| lr 4.00e-04 | 4161.03 ms | 32.4% bf16 MFU | 126029 tok/s step 8096/19560 | loss 3.435733 (-0.23z)| norm 0.2717 (-0.78z)| lr 4.00e-04 | 
4157.13 ms | 32.5% bf16 MFU | 126033 tok/s step 8097/19560 | loss 3.471739 (+0.78z)| norm 0.2706 (-0.82z)| lr 4.00e-04 | 4141.65 ms | 32.6% bf16 MFU | 126061 tok/s step 8098/19560 | loss 3.426594 (-0.49z)| norm 0.2836 (-0.20z)| lr 4.00e-04 | 4193.47 ms | 32.2% bf16 MFU | 126009 tok/s step 8099/19560 | loss 3.468428 (+0.67z)| norm 0.2731 (-0.69z)| lr 4.00e-04 | 4146.05 ms | 32.6% bf16 MFU | 126031 tok/s step 8100/19560 | loss 3.456604 (+0.35z)| norm 0.2861 (-0.09z)| lr 4.00e-04 | 4147.57 ms | 32.6% bf16 MFU | 126050 tok/s step 8101/19560 | loss 3.432629 (-0.32z)| norm 0.2917 (+0.17z)| lr 4.00e-04 | 4151.17 ms | 32.5% bf16 MFU | 126063 tok/s step 8102/19560 | loss 3.419681 (-0.68z)| norm 0.2652 (-1.10z)| lr 3.99e-04 | 4141.33 ms | 32.6% bf16 MFU | 126090 tok/s step 8103/19560 | loss 3.442545 (-0.05z)| norm 0.2619 (-1.25z)| lr 3.99e-04 | 4143.82 ms | 32.6% bf16 MFU | 126111 tok/s step 8104/19560 | loss 3.414208 (-0.85z)| norm 0.2649 (-1.10z)| lr 3.99e-04 | 4611.53 ms | 29.3% bf16 MFU | 125490 tok/s step 8105/19560 | loss 3.438315 (-0.18z)| norm 0.2597 (-1.34z)| lr 3.99e-04 | 4360.22 ms | 31.0% bf16 MFU | 125228 tok/s step 8106/19560 | loss 3.422019 (-0.64z)| norm 0.2884 (+0.02z)| lr 3.99e-04 | 4152.68 ms | 32.5% bf16 MFU | 125279 tok/s step 8107/19560 | loss 3.424858 (-0.55z)| norm 0.2857 (-0.11z)| lr 3.99e-04 | 4146.89 ms | 32.6% bf16 MFU | 125337 tok/s step 8108/19560 | loss 3.408808 (-1.00z)| norm 0.2701 (-0.85z)| lr 3.99e-04 | 4189.61 ms | 32.2% bf16 MFU | 125327 tok/s step 8109/19560 | loss 3.440518 (-0.09z)| norm 0.2708 (-0.81z)| lr 3.99e-04 | 4158.28 ms | 32.5% bf16 MFU | 125365 tok/s step 8110/19560 | loss 3.474878 (+0.91z)| norm 0.2855 (-0.12z)| lr 3.99e-04 | 4241.33 ms | 31.8% bf16 MFU | 125277 tok/s step 8111/19560 | loss 3.429882 (-0.39z)| norm 0.2570 (-1.46z)| lr 3.99e-04 | 4164.75 ms | 32.4% bf16 MFU | 125308 tok/s step 8112/19560 | loss 3.462618 (+0.55z)| norm 0.2690 (-0.89z)| lr 3.99e-04 | 4225.32 ms | 32.0% bf16 MFU | 125246 tok/s step 8113/19560 | 
loss 3.397741 (-1.30z)| norm 0.2793 (-0.40z)| lr 3.99e-04 | 4171.03 ms | 32.4% bf16 MFU | 125269 tok/s step 8114/19560 | loss 3.451990 (+0.25z)| norm 0.2804 (-0.35z)| lr 3.99e-04 | 4150.18 ms | 32.5% bf16 MFU | 125322 tok/s step 8115/19560 | loss 3.504193 (+1.71z)| norm 0.2854 (-0.11z)| lr 3.99e-04 | 4170.44 ms | 32.4% bf16 MFU | 125342 tok/s step 8116/19560 | loss 3.479220 (+0.99z)| norm 0.2703 (-0.82z)| lr 3.99e-04 | 4174.26 ms | 32.3% bf16 MFU | 125354 tok/s step 8117/19560 | loss 3.477902 (+0.94z)| norm 0.3146 (+1.25z)| lr 3.99e-04 | 4174.65 ms | 32.3% bf16 MFU | 125366 tok/s step 8118/19560 | loss 3.550177 (+2.90z)| norm 0.2707 (-0.81z)| lr 3.99e-04 | 4224.36 ms | 32.0% bf16 MFU | 125303 tok/s step 8119/19560 | loss 3.365649 (-2.19z)| norm 0.2921 (+0.19z)| lr 3.99e-04 | 4164.76 ms | 32.4% bf16 MFU | 125333 tok/s step 8120/19560 | loss 3.468855 (+0.63z)| norm 0.2796 (-0.41z)| lr 3.99e-04 | 4165.29 ms | 32.4% bf16 MFU | 125359 tok/s step 8121/19560 | loss 3.473309 (+0.74z)| norm 0.2903 (+0.09z)| lr 3.99e-04 | 4153.88 ms | 32.5% bf16 MFU | 125402 tok/s step 8122/19560 | loss 3.498286 (+1.41z)| norm 0.3192 (+1.46z)| lr 3.99e-04 | 4168.40 ms | 32.4% bf16 MFU | 125421 tok/s step 8123/19560 | loss 3.480417 (+0.91z)| norm 0.2863 (-0.13z)| lr 3.98e-04 | 4158.20 ms | 32.5% bf16 MFU | 125454 tok/s step 8124/19560 | loss 3.490777 (+1.17z)| norm 0.2756 (-0.64z)| lr 3.98e-04 | 4185.47 ms | 32.3% bf16 MFU | 125445 tok/s step 8125/19560 | loss 3.449683 (+0.05z)| norm 0.3083 (+0.92z)| lr 3.98e-04 | 4157.68 ms | 32.5% bf16 MFU | 125478 tok/s step 8126/19560 | loss 3.496604 (+1.30z)| norm 0.3025 (+0.64z)| lr 3.98e-04 | 4180.10 ms | 32.3% bf16 MFU | 125475 tok/s step 8127/19560 | loss 3.454275 (+0.16z)| norm 0.2791 (-0.48z)| lr 3.98e-04 | 4170.89 ms | 32.4% bf16 MFU | 125486 tok/s step 8128/19560 | loss 3.418280 (-0.80z)| norm 0.2594 (-1.40z)| lr 3.98e-04 | 4170.09 ms | 32.4% bf16 MFU | 125498 tok/s step 8129/19560 | loss 3.470039 (+0.59z)| norm 0.2783 (-0.49z)| lr 3.98e-04 | 
4157.26 ms | 32.5% bf16 MFU | 125529 tok/s step 8130/19560 | loss 3.599142 (+3.79z)| norm 0.2606 (-1.32z)| lr 3.98e-04 | 4160.26 ms | 32.5% bf16 MFU | 125554 tok/s step 8131/19560 | loss 3.391988 (-1.42z)| norm 0.3091 (+0.98z)| lr 3.98e-04 | 4186.27 ms | 32.3% bf16 MFU | 125538 tok/s step 8132/19560 | loss 3.460471 (+0.29z)| norm 0.2779 (-0.49z)| lr 3.98e-04 | 4164.03 ms | 32.4% bf16 MFU | 125557 tok/s step 8133/19560 | loss 3.468475 (+0.49z)| norm 0.3011 (+0.63z)| lr 3.98e-04 | 4165.40 ms | 32.4% bf16 MFU | 125572 tok/s step 8134/19560 | loss 3.456198 (+0.17z)| norm 0.2849 (-0.13z)| lr 3.98e-04 | 4160.81 ms | 32.4% bf16 MFU | 125594 tok/s step 8135/19560 | loss 3.448498 (-0.03z)| norm 0.3011 (+0.67z)| lr 3.98e-04 | 4160.53 ms | 32.5% bf16 MFU | 125615 tok/s step 8136/19560 | loss 3.464272 (+0.41z)| norm 0.2817 (-0.28z)| lr 3.98e-04 | 4159.83 ms | 32.5% bf16 MFU | 125636 tok/s step 8137/19560 | loss 3.438438 (-0.27z)| norm 0.2747 (-0.62z)| lr 3.98e-04 | 4156.19 ms | 32.5% bf16 MFU | 125661 tok/s step 8138/19560 | loss 3.414718 (-0.89z)| norm 0.2996 (+0.66z)| lr 3.98e-04 | 4178.72 ms | 32.3% bf16 MFU | 125652 tok/s step 8139/19560 | loss 3.442865 (-0.15z)| norm 0.2692 (-0.88z)| lr 3.98e-04 | 4153.62 ms | 32.5% bf16 MFU | 125680 tok/s step 8140/19560 | loss 3.471344 (+0.61z)| norm 0.3135 (+1.38z)| lr 3.98e-04 | 4155.58 ms | 32.5% bf16 MFU | 125705 tok/s step 8141/19560 | loss 3.443783 (-0.11z)| norm 0.3109 (+1.26z)| lr 3.98e-04 | 4170.26 ms | 32.4% bf16 MFU | 125705 tok/s step 8142/19560 | loss 3.453277 (+0.14z)| norm 0.2757 (-0.54z)| lr 3.98e-04 | 4154.62 ms | 32.5% bf16 MFU | 125730 tok/s step 8143/19560 | loss 3.409609 (-1.02z)| norm 0.3283 (+2.11z)| lr 3.98e-04 | 4168.49 ms | 32.4% bf16 MFU | 125732 tok/s step 8144/19560 | loss 3.451447 (+0.09z)| norm 0.2735 (-0.65z)| lr 3.97e-04 | 4173.47 ms | 32.4% bf16 MFU | 125727 tok/s step 8145/19560 | loss 3.462610 (+0.38z)| norm 0.3749 (+4.14z)| lr 3.97e-04 | 4163.08 ms | 32.4% bf16 MFU | 125737 tok/s step 8146/19560 | 
loss 3.436851 (-0.30z)| norm 0.3153 (+1.32z)| lr 3.97e-04 | 4161.18 ms | 32.4% bf16 MFU | 125750 tok/s step 8147/19560 | loss 3.493143 (+1.18z)| norm 0.2957 (+0.40z)| lr 3.97e-04 | 4158.90 ms | 32.5% bf16 MFU | 125766 tok/s step 8148/19560 | loss 3.491855 (+1.13z)| norm 0.2948 (+0.36z)| lr 3.97e-04 | 4154.87 ms | 32.5% bf16 MFU | 125787 tok/s step 8149/19560 | loss 3.487505 (+1.00z)| norm 0.2914 (+0.20z)| lr 3.97e-04 | 4159.10 ms | 32.5% bf16 MFU | 125800 tok/s step 8150/19560 | loss 3.425469 (-0.65z)| norm 0.2879 (+0.03z)| lr 3.97e-04 | 4173.97 ms | 32.3% bf16 MFU | 125791 tok/s step 8151/19560 | loss 3.423165 (-0.70z)| norm 0.2809 (-0.30z)| lr 3.97e-04 | 4167.86 ms | 32.4% bf16 MFU | 125791 tok/s step 8152/19560 | loss 3.521157 (+1.86z)| norm 0.3155 (+1.31z)| lr 3.97e-04 | 4166.76 ms | 32.4% bf16 MFU | 125793 tok/s step 8153/19560 | loss 3.477687 (+0.70z)| norm 0.2651 (-1.03z)| lr 3.97e-04 | 4158.48 ms | 32.5% bf16 MFU | 125807 tok/s step 8154/19560 | loss 3.477264 (+0.68z)| norm 0.2984 (+0.51z)| lr 3.97e-04 | 4158.95 ms | 32.5% bf16 MFU | 125820 tok/s step 8155/19560 | loss 3.490106 (+1.01z)| norm 0.3162 (+1.32z)| lr 3.97e-04 | 4172.95 ms | 32.4% bf16 MFU | 125811 tok/s step 8156/19560 | loss 3.503824 (+1.35z)| norm 0.3091 (+0.99z)| lr 3.97e-04 | 4161.16 ms | 32.4% bf16 MFU | 125820 tok/s step 8157/19560 | loss 3.521189 (+1.77z)| norm 0.3051 (+0.82z)| lr 3.97e-04 | 4163.63 ms | 32.4% bf16 MFU | 125825 tok/s step 8158/19560 | loss 3.475280 (+0.57z)| norm 0.3111 (+1.11z)| lr 3.97e-04 | 4154.21 ms | 32.5% bf16 MFU | 125844 tok/s step 8159/19560 | loss 3.409350 (-1.15z)| norm 0.3045 (+0.80z)| lr 3.97e-04 | 4171.53 ms | 32.4% bf16 MFU | 125836 tok/s step 8160/19560 | loss 3.487367 (+0.89z)| norm 0.3499 (+2.80z)| lr 3.97e-04 | 4170.38 ms | 32.4% bf16 MFU | 125830 tok/s step 8161/19560 | loss 3.495368 (+1.09z)| norm 0.3192 (+1.39z)| lr 3.97e-04 | 4164.68 ms | 32.4% bf16 MFU | 125833 tok/s step 8162/19560 | loss 3.533669 (+2.04z)| norm 0.2998 (+0.52z)| lr 3.97e-04 | 
4166.75 ms | 32.4% bf16 MFU | 125833 tok/s step 8163/19560 | loss 3.518439 (+1.62z)| norm 0.3351 (+2.05z)| lr 3.97e-04 | 4166.70 ms | 32.4% bf16 MFU | 125832 tok/s step 8164/19560 | loss 3.477121 (+0.57z)| norm 0.3050 (+0.71z)| lr 3.97e-04 | 4167.96 ms | 32.4% bf16 MFU | 125830 tok/s step 8165/19560 | loss 3.468957 (+0.38z)| norm 0.3209 (+1.42z)| lr 3.96e-04 | 4167.60 ms | 32.4% bf16 MFU | 125829 tok/s step 8166/19560 | loss 3.463980 (+0.26z)| norm 0.2772 (-0.52z)| lr 3.96e-04 | 4178.43 ms | 32.3% bf16 MFU | 125811 tok/s step 8167/19560 | loss 3.474596 (+0.53z)| norm 0.2999 (+0.50z)| lr 3.96e-04 | 4161.10 ms | 32.4% bf16 MFU | 125820 tok/s step 8168/19560 | loss 3.451978 (-0.06z)| norm 0.2662 (-1.01z)| lr 3.96e-04 | 4165.81 ms | 32.4% bf16 MFU | 125822 tok/s step 8169/19560 | loss 3.422857 (-0.81z)| norm 0.2935 (+0.21z)| lr 3.96e-04 | 4159.69 ms | 32.5% bf16 MFU | 125833 tok/s step 8170/19560 | loss 3.458927 (+0.15z)| norm 0.2876 (-0.05z)| lr 3.96e-04 | 4161.22 ms | 32.4% bf16 MFU | 125841 tok/s step 8171/19560 | loss 3.495771 (+1.11z)| norm 0.2646 (-1.06z)| lr 3.96e-04 | 4442.65 ms | 30.4% bf16 MFU | 125450 tok/s step 8172/19560 | loss 3.531614 (+2.02z)| norm 0.3961 (+4.44z)| lr 3.96e-04 | 4159.49 ms | 32.5% bf16 MFU | 125479 tok/s step 8173/19560 | loss 3.542239 (+2.24z)| norm 0.2894 (+0.01z)| lr 3.96e-04 | 4173.77 ms | 32.3% bf16 MFU | 125486 tok/s step 8174/19560 | loss 3.525544 (+1.78z)| norm 0.2593 (-1.22z)| lr 3.96e-04 | 4160.82 ms | 32.4% bf16 MFU | 125512 tok/s step 8175/19560 | loss 3.427608 (-0.72z)| norm 0.2740 (-0.61z)| lr 3.96e-04 | 4168.90 ms | 32.4% bf16 MFU | 125525 tok/s step 8176/19560 | loss 3.498463 (+1.07z)| norm 0.2639 (-1.02z)| lr 3.96e-04 | 4155.35 ms | 32.5% bf16 MFU | 125557 tok/s step 8177/19560 | loss 3.443326 (-0.34z)| norm 0.2789 (-0.40z)| lr 3.96e-04 | 4166.50 ms | 32.4% bf16 MFU | 125571 tok/s step 8178/19560 | loss 3.487353 (+0.77z)| norm 0.2569 (-1.28z)| lr 3.96e-04 | 4160.08 ms | 32.5% bf16 MFU | 125594 tok/s step 8179/19560 | 
loss 3.450553 (-0.17z)| norm 0.3242 (+1.44z)| lr 3.96e-04 | 4160.47 ms | 32.5% bf16 MFU | 125615 tok/s step 8180/19560 | loss 3.499511 (+1.07z)| norm 0.3025 (+0.59z)| lr 3.96e-04 | 4155.68 ms | 32.5% bf16 MFU | 125642 tok/s step 8181/19560 | loss 3.415122 (-1.11z)| norm 0.2809 (-0.31z)| lr 3.96e-04 | 4158.99 ms | 32.5% bf16 MFU | 125663 tok/s step 8182/19560 | loss 3.398133 (-1.52z)| norm 0.2766 (-0.48z)| lr 3.96e-04 | 4181.63 ms | 32.3% bf16 MFU | 125649 tok/s step 8183/19560 | loss 3.512732 (+1.39z)| norm 0.2733 (-0.61z)| lr 3.96e-04 | 4162.16 ms | 32.4% bf16 MFU | 125665 tok/s step 8184/19560 | loss 3.521472 (+1.58z)| norm 0.2739 (-0.57z)| lr 3.96e-04 | 4172.18 ms | 32.4% bf16 MFU | 125665 tok/s step 8185/19560 | loss 3.468230 (+0.23z)| norm 0.2966 (+0.38z)| lr 3.96e-04 | 4172.47 ms | 32.4% bf16 MFU | 125664 tok/s step 8186/19560 | loss 3.437210 (-0.56z)| norm 0.2979 (+0.42z)| lr 3.96e-04 | 4170.99 ms | 32.4% bf16 MFU | 125666 tok/s step 8187/19560 | loss 3.415474 (-1.10z)| norm 0.2870 (-0.03z)| lr 3.95e-04 | 4159.99 ms | 32.5% bf16 MFU | 125684 tok/s step 8188/19560 | loss 3.537613 (+1.95z)| norm 0.2830 (-0.21z)| lr 3.95e-04 | 4165.04 ms | 32.4% bf16 MFU | 125694 tok/s step 8189/19560 | loss 3.494398 (+0.86z)| norm 0.2751 (-0.54z)| lr 3.95e-04 | 4151.59 ms | 32.5% bf16 MFU | 125723 tok/s step 8190/19560 | loss 3.465828 (+0.14z)| norm 0.2798 (-0.33z)| lr 3.95e-04 | 4161.08 ms | 32.4% bf16 MFU | 125737 tok/s step 8191/19560 | loss 3.545063 (+2.08z)| norm 0.2775 (-0.43z)| lr 3.95e-04 | 4189.19 ms | 32.2% bf16 MFU | 125708 tok/s step 8192/19560 | loss 3.396926 (-1.55z)| norm 0.2602 (-1.16z)| lr 3.95e-04 | 4159.16 ms | 32.5% bf16 MFU | 125725 tok/s step 8193/19560 | loss 3.464884 (+0.10z)| norm 0.2856 (-0.06z)| lr 3.95e-04 | 4150.85 ms | 32.5% bf16 MFU | 125755 tok/s step 8194/19560 | loss 3.466073 (+0.14z)| norm 0.2670 (-0.85z)| lr 3.95e-04 | 4171.45 ms | 32.4% bf16 MFU | 125751 tok/s step 8195/19560 | loss 3.441150 (-0.50z)| norm 0.2684 (-0.78z)| lr 3.95e-04 | 
4164.70 ms | 32.4% bf16 MFU | 125758 tok/s step 8196/19560 | loss 3.435976 (-0.62z)| norm 0.2816 (-0.22z)| lr 3.95e-04 | 4166.95 ms | 32.4% bf16 MFU | 125761 tok/s step 8197/19560 | loss 3.417681 (-1.07z)| norm 0.2790 (-0.33z)| lr 3.95e-04 | 4170.99 ms | 32.4% bf16 MFU | 125758 tok/s step 8198/19560 | loss 3.386700 (-1.81z)| norm 0.2874 (+0.02z)| lr 3.95e-04 | 4172.95 ms | 32.4% bf16 MFU | 125752 tok/s step 8199/19560 | loss 3.443222 (-0.42z)| norm 0.2610 (-1.12z)| lr 3.95e-04 | 4172.71 ms | 32.4% bf16 MFU | 125747 tok/s step 8200/19560 | loss 3.564065 (+2.52z)| norm 0.2725 (-0.61z)| lr 3.95e-04 | 4153.57 ms | 32.5% bf16 MFU | 125771 tok/s step 8201/19560 | loss 3.470255 (+0.23z)| norm 0.2567 (-1.28z)| lr 3.95e-04 | 4163.16 ms | 32.4% bf16 MFU | 125779 tok/s step 8202/19560 | loss 3.426551 (-0.82z)| norm 0.2682 (-0.78z)| lr 3.95e-04 | 4179.39 ms | 32.3% bf16 MFU | 125762 tok/s step 8203/19560 | loss 3.455489 (-0.13z)| norm 0.2594 (-1.14z)| lr 3.95e-04 | 4155.19 ms | 32.5% bf16 MFU | 125783 tok/s step 8204/19560 | loss 3.553520 (+2.20z)| norm 0.2725 (-0.57z)| lr 3.95e-04 | 4172.31 ms | 32.4% bf16 MFU | 125777 tok/s step 8205/19560 | loss 3.624344 (+3.65z)| norm 0.2604 (-1.08z)| lr 3.95e-04 | 4165.02 ms | 32.4% bf16 MFU | 125782 tok/s step 8206/19560 | loss 3.475481 (+0.27z)| norm 0.2861 (+0.03z)| lr 3.95e-04 | 4174.67 ms | 32.3% bf16 MFU | 125772 tok/s step 8207/19560 | loss 3.411057 (-1.20z)| norm 0.2474 (-1.62z)| lr 3.95e-04 | 4169.22 ms | 32.4% bf16 MFU | 125771 tok/s step 8208/19560 | loss 3.441351 (-0.50z)| norm 0.2648 (-0.86z)| lr 3.94e-04 | 4170.14 ms | 32.4% bf16 MFU | 125769 tok/s step 8209/19560 | loss 3.515768 (+1.19z)| norm 0.2595 (-1.08z)| lr 3.94e-04 | 4164.42 ms | 32.4% bf16 MFU | 125775 tok/s step 8210/19560 | loss 3.495192 (+0.72z)| norm 0.2816 (-0.13z)| lr 3.94e-04 | 4158.96 ms | 32.5% bf16 MFU | 125790 tok/s step 8211/19560 | loss 3.471364 (+0.17z)| norm 0.2891 (+0.19z)| lr 3.94e-04 | 4170.19 ms | 32.4% bf16 MFU | 125786 tok/s step 8212/19560 | 
loss 3.507689 (+1.01z)| norm 0.2746 (-0.42z)| lr 3.94e-04 | 4171.08 ms | 32.4% bf16 MFU | 125782 tok/s step 8213/19560 | loss 3.434446 (-0.67z)| norm 0.2853 (+0.03z)| lr 3.94e-04 | 4167.18 ms | 32.4% bf16 MFU | 125783 tok/s step 8214/19560 | loss 3.467674 (+0.09z)| norm 0.2720 (-0.53z)| lr 3.94e-04 | 4181.22 ms | 32.3% bf16 MFU | 125764 tok/s step 8215/19560 | loss 3.431556 (-0.75z)| norm 0.2697 (-0.64z)| lr 3.94e-04 | 4170.75 ms | 32.4% bf16 MFU | 125761 tok/s step 8216/19560 | loss 3.499790 (+0.82z)| norm 0.2853 (+0.02z)| lr 3.94e-04 | 4163.53 ms | 32.4% bf16 MFU | 125769 tok/s step 8217/19560 | loss 3.466253 (+0.03z)| norm 0.2633 (-0.94z)| lr 3.94e-04 | 4168.67 ms | 32.4% bf16 MFU | 125769 tok/s step 8218/19560 | loss 3.470872 (+0.13z)| norm 0.3080 (+1.00z)| lr 3.94e-04 | 4179.80 ms | 32.3% bf16 MFU | 125752 tok/s step 8219/19560 | loss 3.445600 (-0.45z)| norm 0.2821 (-0.14z)| lr 3.94e-04 | 4154.42 ms | 32.5% bf16 MFU | 125775 tok/s step 8220/19560 | loss 3.394571 (-1.61z)| norm 0.2651 (-0.89z)| lr 3.94e-04 | 4164.92 ms | 32.4% bf16 MFU | 125780 tok/s step 8221/19560 | loss 3.413823 (-1.17z)| norm 0.2757 (-0.43z)| lr 3.94e-04 | 4161.58 ms | 32.4% bf16 MFU | 125790 tok/s step 8222/19560 | loss 3.404090 (-1.39z)| norm 0.2696 (-0.70z)| lr 3.94e-04 | 4165.87 ms | 32.4% bf16 MFU | 125793 tok/s step 8223/19560 | loss 3.472149 (+0.18z)| norm 0.2889 (+0.14z)| lr 3.94e-04 | 4168.78 ms | 32.4% bf16 MFU | 125792 tok/s step 8224/19560 | loss 3.451238 (-0.31z)| norm 0.2822 (-0.16z)| lr 3.94e-04 | 4168.48 ms | 32.4% bf16 MFU | 125791 tok/s step 8225/19560 | loss 3.508181 (+1.00z)| norm 0.2930 (+0.31z)| lr 3.94e-04 | 4170.75 ms | 32.4% bf16 MFU | 125787 tok/s step 8226/19560 | loss 3.482368 (+0.40z)| norm 0.3274 (+1.80z)| lr 3.94e-04 | 4161.01 ms | 32.4% bf16 MFU | 125797 tok/s step 8227/19560 | loss 3.470624 (+0.12z)| norm 0.2825 (-0.17z)| lr 3.94e-04 | 4177.53 ms | 32.3% bf16 MFU | 125783 tok/s step 8228/19560 | loss 3.540398 (+1.70z)| norm 0.3365 (+2.14z)| lr 3.94e-04 | 
4177.02 ms | 32.3% bf16 MFU | 125769 tok/s step 8229/19560 | loss 3.454381 (-0.27z)| norm 0.2713 (-0.66z)| lr 3.93e-04 | 4163.24 ms | 32.4% bf16 MFU | 125778 tok/s step 8230/19560 | loss 3.480690 (+0.33z)| norm 0.3180 (+1.33z)| lr 3.93e-04 | 4179.54 ms | 32.3% bf16 MFU | 125761 tok/s step 8231/19560 | loss 3.367734 (-2.22z)| norm 0.3346 (+1.99z)| lr 3.93e-04 | 4156.00 ms | 32.5% bf16 MFU | 125780 tok/s step 8232/19560 | loss 3.467275 (+0.02z)| norm 0.3224 (+1.45z)| lr 3.93e-04 | 4173.29 ms | 32.4% bf16 MFU | 125773 tok/s step 8233/19560 | loss 3.439179 (-0.62z)| norm 0.3103 (+0.93z)| lr 3.93e-04 | 4171.44 ms | 32.4% bf16 MFU | 125768 tok/s step 8234/19560 | loss 3.533795 (+1.50z)| norm 0.2821 (-0.27z)| lr 3.93e-04 | 4156.25 ms | 32.5% bf16 MFU | 125787 tok/s step 8235/19560 | loss 3.471844 (+0.10z)| norm 0.2797 (-0.36z)| lr 3.93e-04 | 4166.67 ms | 32.4% bf16 MFU | 125789 tok/s step 8236/19560 | loss 3.473455 (+0.12z)| norm 0.2723 (-0.68z)| lr 3.93e-04 | 4165.42 ms | 32.4% bf16 MFU | 125793 tok/s step 8237/19560 | loss 3.434793 (-0.76z)| norm 0.2677 (-0.87z)| lr 3.93e-04 | 4164.13 ms | 32.4% bf16 MFU | 125799 tok/s step 8238/19560 | loss 3.476302 (+0.19z)| norm 0.2823 (-0.25z)| lr 3.93e-04 | 4168.33 ms | 32.4% bf16 MFU | 125798 tok/s step 8239/19560 | loss 3.464865 (-0.08z)| norm 0.2684 (-0.85z)| lr 3.93e-04 | 4172.18 ms | 32.4% bf16 MFU | 125791 tok/s step 8240/19560 | loss 3.410323 (-1.31z)| norm 0.2856 (-0.12z)| lr 3.93e-04 | 4162.48 ms | 32.4% bf16 MFU | 125799 tok/s step 8241/19560 | loss 3.462520 (-0.14z)| norm 0.2635 (-1.05z)| lr 3.93e-04 | 4157.74 ms | 32.5% bf16 MFU | 125814 tok/s step 8242/19560 | loss 3.490545 (+0.50z)| norm 0.2629 (-1.07z)| lr 3.93e-04 | 4159.28 ms | 32.5% bf16 MFU | 125826 tok/s step 8243/19560 | loss 3.476760 (+0.19z)| norm 0.2653 (-0.96z)| lr 3.93e-04 | 4147.92 ms | 32.6% bf16 MFU | 125855 tok/s step 8244/19560 | loss 3.470633 (+0.05z)| norm 0.2847 (-0.15z)| lr 3.93e-04 | 4164.28 ms | 32.4% bf16 MFU | 125857 tok/s step 8245/19560 | 
loss 3.424320 (-1.00z)| norm 0.2744 (-0.57z)| lr 3.93e-04 | 4159.96 ms | 32.5% bf16 MFU | 125866 tok/s step 8246/19560 | loss 3.507306 (+0.92z)| norm 0.2856 (-0.10z)| lr 3.93e-04 | 4169.24 ms | 32.4% bf16 MFU | 125860 tok/s step 8247/19560 | loss 3.490193 (+0.51z)| norm 0.2795 (-0.36z)| lr 3.93e-04 | 4164.96 ms | 32.4% bf16 MFU | 125861 tok/s step 8248/19560 | loss 3.407843 (-1.42z)| norm 0.2767 (-0.47z)| lr 3.93e-04 | 4221.10 ms | 32.0% bf16 MFU | 125778 tok/s step 8249/19560 | loss 3.460372 (-0.18z)| norm 0.2973 (+0.40z)| lr 3.93e-04 | 4156.61 ms | 32.5% bf16 MFU | 125796 tok/s step 8250/19560 | loss 3.470148 (+0.05z)| norm 0.2917 (+0.17z)| lr 3.92e-04 | 4154.86 ms | 32.5% bf16 MFU | 125816 tok/s val loss 3.430561 evaluating HellaSwag: 0/1256 evaluating HellaSwag: 10/1256 evaluating HellaSwag: 20/1256 evaluating HellaSwag: 30/1256 evaluating HellaSwag: 40/1256 evaluating HellaSwag: 50/1256 evaluating HellaSwag: 60/1256 evaluating HellaSwag: 70/1256 evaluating HellaSwag: 80/1256 evaluating HellaSwag: 90/1256 evaluating HellaSwag: 100/1256 evaluating HellaSwag: 110/1256 evaluating HellaSwag: 120/1256 evaluating HellaSwag: 130/1256 evaluating HellaSwag: 140/1256 evaluating HellaSwag: 150/1256 evaluating HellaSwag: 160/1256 evaluating HellaSwag: 170/1256 evaluating HellaSwag: 180/1256 evaluating HellaSwag: 190/1256 evaluating HellaSwag: 200/1256 evaluating HellaSwag: 210/1256 evaluating HellaSwag: 220/1256 evaluating HellaSwag: 230/1256 evaluating HellaSwag: 240/1256 evaluating HellaSwag: 250/1256 evaluating HellaSwag: 260/1256 evaluating HellaSwag: 270/1256 evaluating HellaSwag: 280/1256 evaluating HellaSwag: 290/1256 evaluating HellaSwag: 300/1256 evaluating HellaSwag: 310/1256 evaluating HellaSwag: 320/1256 evaluating HellaSwag: 330/1256 evaluating HellaSwag: 340/1256 evaluating HellaSwag: 350/1256 evaluating HellaSwag: 360/1256 evaluating HellaSwag: 370/1256 evaluating HellaSwag: 380/1256 evaluating HellaSwag: 390/1256 evaluating HellaSwag: 400/1256 evaluating 
HellaSwag: 410/1256 evaluating HellaSwag: 420/1256 evaluating HellaSwag: 430/1256 evaluating HellaSwag: 440/1256 evaluating HellaSwag: 450/1256 evaluating HellaSwag: 460/1256 evaluating HellaSwag: 470/1256 evaluating HellaSwag: 480/1256 evaluating HellaSwag: 490/1256 evaluating HellaSwag: 500/1256 evaluating HellaSwag: 510/1256 evaluating HellaSwag: 520/1256 evaluating HellaSwag: 530/1256 evaluating HellaSwag: 540/1256 evaluating HellaSwag: 550/1256 evaluating HellaSwag: 560/1256 evaluating HellaSwag: 570/1256 evaluating HellaSwag: 580/1256 evaluating HellaSwag: 590/1256 evaluating HellaSwag: 600/1256 evaluating HellaSwag: 610/1256 evaluating HellaSwag: 620/1256 evaluating HellaSwag: 630/1256 evaluating HellaSwag: 640/1256 evaluating HellaSwag: 650/1256 evaluating HellaSwag: 660/1256 evaluating HellaSwag: 670/1256 evaluating HellaSwag: 680/1256 evaluating HellaSwag: 690/1256 evaluating HellaSwag: 700/1256 evaluating HellaSwag: 710/1256 evaluating HellaSwag: 720/1256 evaluating HellaSwag: 730/1256 evaluating HellaSwag: 740/1256 evaluating HellaSwag: 750/1256 evaluating HellaSwag: 760/1256 evaluating HellaSwag: 770/1256 evaluating HellaSwag: 780/1256 evaluating HellaSwag: 790/1256 evaluating HellaSwag: 800/1256 evaluating HellaSwag: 810/1256 evaluating HellaSwag: 820/1256 evaluating HellaSwag: 830/1256 evaluating HellaSwag: 840/1256 evaluating HellaSwag: 850/1256 evaluating HellaSwag: 860/1256 evaluating HellaSwag: 870/1256 evaluating HellaSwag: 880/1256 evaluating HellaSwag: 890/1256 evaluating HellaSwag: 900/1256 evaluating HellaSwag: 910/1256 evaluating HellaSwag: 920/1256 evaluating HellaSwag: 930/1256 evaluating HellaSwag: 940/1256 evaluating HellaSwag: 950/1256 evaluating HellaSwag: 960/1256 evaluating HellaSwag: 970/1256 evaluating HellaSwag: 980/1256 evaluating HellaSwag: 990/1256 evaluating HellaSwag: 1000/1256 evaluating HellaSwag: 1010/1256 evaluating HellaSwag: 1020/1256 evaluating HellaSwag: 1030/1256 evaluating HellaSwag: 1040/1256 evaluating HellaSwag: 
1050/1256 evaluating HellaSwag: 1060/1256 evaluating HellaSwag: 1070/1256 evaluating HellaSwag: 1080/1256 evaluating HellaSwag: 1090/1256 evaluating HellaSwag: 1100/1256 evaluating HellaSwag: 1110/1256 evaluating HellaSwag: 1120/1256 evaluating HellaSwag: 1130/1256 evaluating HellaSwag: 1140/1256 evaluating HellaSwag: 1150/1256 evaluating HellaSwag: 1160/1256 evaluating HellaSwag: 1170/1256 evaluating HellaSwag: 1180/1256 evaluating HellaSwag: 1190/1256 evaluating HellaSwag: 1200/1256 evaluating HellaSwag: 1210/1256 evaluating HellaSwag: 1220/1256 evaluating HellaSwag: 1230/1256 evaluating HellaSwag: 1240/1256 evaluating HellaSwag: 1250/1256 HellaSwag: 2852/10042 = 0.284007 step 8251/19560 | loss 3.448609 (-0.45z)| norm 0.2820 (-0.24z)| lr 3.92e-04 | 4165.02 ms | 32.4% bf16 MFU | 125819 tok/s step 8252/19560 | loss 3.489233 (+0.51z)| norm 0.2957 (+0.34z)| lr 3.92e-04 | 4180.08 ms | 32.3% bf16 MFU | 125799 tok/s step 8253/19560 | loss 3.519990 (+1.21z)| norm 0.2838 (-0.16z)| lr 3.92e-04 | 4173.62 ms | 32.4% bf16 MFU | 125790 tok/s step 8254/19560 | loss 3.479082 (+0.26z)| norm 0.3285 (+1.72z)| lr 3.92e-04 | 4166.67 ms | 32.4% bf16 MFU | 125792 tok/s step 8255/19560 | loss 3.488195 (+0.47z)| norm 0.2854 (-0.10z)| lr 3.92e-04 | 4173.63 ms | 32.4% bf16 MFU | 125783 tok/s step 8256/19560 | loss 3.474766 (+0.14z)| norm 0.2886 (+0.02z)| lr 3.92e-04 | 4168.02 ms | 32.4% bf16 MFU | 125784 tok/s step 8257/19560 | loss 3.519500 (+1.18z)| norm 0.2951 (+0.29z)| lr 3.92e-04 | 4164.72 ms | 32.4% bf16 MFU | 125789 tok/s step 8258/19560 | loss 3.417868 (-1.21z)| norm 0.2974 (+0.38z)| lr 3.92e-04 | 4166.93 ms | 32.4% bf16 MFU | 125791 tok/s step 8259/19560 | loss 3.378830 (-2.14z)| norm 0.2757 (-0.54z)| lr 3.92e-04 | 4170.04 ms | 32.4% bf16 MFU | 125787 tok/s step 8260/19560 | loss 3.458005 (-0.23z)| norm 0.2999 (+0.49z)| lr 3.92e-04 | 4163.79 ms | 32.4% bf16 MFU | 125794 tok/s step 8261/19560 | loss 3.588612 (+2.80z)| norm 0.2946 (+0.26z)| lr 3.92e-04 | 4169.25 ms | 32.4% bf16 MFU 
| 125792 tok/s step 8262/19560 | loss 3.446922 (-0.50z)| norm 0.3323 (+1.84z)| lr 3.92e-04 | 4177.86 ms | 32.3% bf16 MFU | 125777 tok/s step 8263/19560 | loss 3.531917 (+1.45z)| norm 0.3042 (+0.65z)| lr 3.92e-04 | 4167.89 ms | 32.4% bf16 MFU | 125777 tok/s step 8264/19560 | loss 3.435190 (-0.78z)| norm 0.2700 (-0.79z)| lr 3.92e-04 | 4169.93 ms | 32.4% bf16 MFU | 125775 tok/s step 8265/19560 | loss 3.457254 (-0.27z)| norm 0.2891 (+0.01z)| lr 3.92e-04 | 4174.85 ms | 32.3% bf16 MFU | 125765 tok/s step 8266/19560 | loss 3.468028 (-0.03z)| norm 0.2848 (-0.17z)| lr 3.92e-04 | 4195.25 ms | 32.2% bf16 MFU | 125726 tok/s step 8267/19560 | loss 3.446285 (-0.54z)| norm 0.2596 (-1.22z)| lr 3.92e-04 | 4165.56 ms | 32.4% bf16 MFU | 125733 tok/s step 8268/19560 | loss 3.528080 (+1.34z)| norm 0.2804 (-0.34z)| lr 3.92e-04 | 4161.05 ms | 32.4% bf16 MFU | 125746 tok/s step 8269/19560 | loss 3.424517 (-1.04z)| norm 0.2669 (-0.90z)| lr 3.92e-04 | 4174.90 ms | 32.3% bf16 MFU | 125738 tok/s step 8270/19560 | loss 3.415545 (-1.24z)| norm 0.2553 (-1.37z)| lr 3.92e-04 | 4169.11 ms | 32.4% bf16 MFU | 125739 tok/s step 8271/19560 | loss 3.473189 (+0.07z)| norm 0.2737 (-0.58z)| lr 3.91e-04 | 4175.43 ms | 32.3% bf16 MFU | 125730 tok/s step 8272/19560 | loss 3.489034 (+0.43z)| norm 0.2915 (+0.17z)| lr 3.91e-04 | 4175.26 ms | 32.3% bf16 MFU | 125722 tok/s step 8273/19560 | loss 3.460963 (-0.21z)| norm 0.2507 (-1.60z)| lr 3.91e-04 | 4161.88 ms | 32.4% bf16 MFU | 125734 tok/s step 8274/19560 | loss 3.388643 (-1.85z)| norm 0.2651 (-0.94z)| lr 3.91e-04 | 4165.03 ms | 32.4% bf16 MFU | 125742 tok/s step 8275/19560 | loss 3.492349 (+0.51z)| norm 0.2744 (-0.52z)| lr 3.91e-04 | 4169.59 ms | 32.4% bf16 MFU | 125742 tok/s step 8276/19560 | loss 3.514071 (+1.00z)| norm 0.2876 (+0.07z)| lr 3.91e-04 | 4169.53 ms | 32.4% bf16 MFU | 125742 tok/s step 8277/19560 | loss 3.477044 (+0.16z)| norm 0.3049 (+0.84z)| lr 3.91e-04 | 4167.40 ms | 32.4% bf16 MFU | 125745 tok/s step 8278/19560 | loss 3.474105 (+0.09z)| norm 
0.2654 (-0.92z)| lr 3.91e-04 | 4164.48 ms | 32.4% bf16 MFU | 125752 tok/s step 8279/19560 | loss 3.439086 (-0.72z)| norm 0.3328 (+2.03z)| lr 3.91e-04 | 4180.55 ms | 32.3% bf16 MFU | 125735 tok/s step 8280/19560 | loss 3.458109 (-0.27z)| norm 0.2933 (+0.31z)| lr 3.91e-04 | 4169.09 ms | 32.4% bf16 MFU | 125736 tok/s step 8281/19560 | loss 3.396065 (-1.67z)| norm 0.2789 (-0.33z)| lr 3.91e-04 | 4168.05 ms | 32.4% bf16 MFU | 125739 tok/s step 8282/19560 | loss 3.469672 (+0.01z)| norm 0.3109 (+1.08z)| lr 3.91e-04 | 4170.71 ms | 32.4% bf16 MFU | 125737 tok/s step 8283/19560 | loss 3.417769 (-1.16z)| norm 0.2545 (-1.38z)| lr 3.91e-04 | 4160.42 ms | 32.5% bf16 MFU | 125751 tok/s step 8284/19560 | loss 3.416121 (-1.17z)| norm 0.3075 (+0.95z)| lr 3.91e-04 | 4163.29 ms | 32.4% bf16 MFU | 125760 tok/s step 8285/19560 | loss 3.421387 (-1.04z)| norm 0.2442 (-1.80z)| lr 3.91e-04 | 4166.56 ms | 32.4% bf16 MFU | 125764 tok/s step 8286/19560 | loss 3.434392 (-0.74z)| norm 0.2775 (-0.34z)| lr 3.91e-04 | 4157.07 ms | 32.5% bf16 MFU | 125782 tok/s step 8287/19560 | loss 3.470398 (+0.07z)| norm 0.2660 (-0.82z)| lr 3.91e-04 | 4173.98 ms | 32.3% bf16 MFU | 125773 tok/s step 8288/19560 | loss 3.515741 (+1.09z)| norm 0.2901 (+0.26z)| lr 3.91e-04 | 4168.21 ms | 32.4% bf16 MFU | 125774 tok/s step 8289/19560 | loss 3.439544 (-0.63z)| norm 0.2571 (-1.22z)| lr 3.91e-04 | 4179.54 ms | 32.3% bf16 MFU | 125757 tok/s step 8290/19560 | loss 3.528250 (+1.39z)| norm 0.2789 (-0.22z)| lr 3.91e-04 | 4183.38 ms | 32.3% bf16 MFU | 125735 tok/s step 8291/19560 | loss 3.562316 (+2.12z)| norm 0.2945 (+0.51z)| lr 3.91e-04 | 4167.37 ms | 32.4% bf16 MFU | 125739 tok/s step 8292/19560 | loss 3.442074 (-0.56z)| norm 0.2705 (-0.59z)| lr 3.90e-04 | 4154.17 ms | 32.5% bf16 MFU | 125763 tok/s step 8293/19560 | loss 3.458678 (-0.19z)| norm 0.2990 (+0.76z)| lr 3.90e-04 | 4162.37 ms | 32.4% bf16 MFU | 125772 tok/s step 8294/19560 | loss 3.406617 (-1.33z)| norm 0.2838 (+0.04z)| lr 3.90e-04 | 4178.09 ms | 32.3% bf16 MFU | 
125758 tok/s step 8295/19560 | loss 3.478213 (+0.26z)| norm 0.2753 (-0.35z)| lr 3.90e-04 | 4159.47 ms | 32.5% bf16 MFU | 125772 tok/s step 8296/19560 | loss 3.501496 (+0.76z)| norm 0.3279 (+2.07z)| lr 3.90e-04 | 4162.20 ms | 32.4% bf16 MFU | 125782 tok/s step 8297/19560 | loss 3.442118 (-0.56z)| norm 0.3075 (+1.11z)| lr 3.90e-04 | 4171.21 ms | 32.4% bf16 MFU | 125778 tok/s step 8298/19560 | loss 3.443816 (-0.52z)| norm 0.2888 (+0.25z)| lr 3.90e-04 | 4157.21 ms | 32.5% bf16 MFU | 125794 tok/s step 8299/19560 | loss 3.380211 (-1.89z)| norm 0.2845 (+0.04z)| lr 3.90e-04 | 4173.40 ms | 32.4% bf16 MFU | 125786 tok/s step 8300/19560 | loss 3.440842 (-0.55z)| norm 0.2824 (-0.01z)| lr 3.90e-04 | 4163.74 ms | 32.4% bf16 MFU | 125793 tok/s step 8301/19560 | loss 3.442021 (-0.51z)| norm 0.2637 (-0.98z)| lr 3.90e-04 | 4159.68 ms | 32.5% bf16 MFU | 125805 tok/s step 8302/19560 | loss 3.401101 (-1.40z)| norm 0.2627 (-1.04z)| lr 3.90e-04 | 4180.17 ms | 32.3% bf16 MFU | 125786 tok/s step 8303/19560 | loss 3.475412 (+0.25z)| norm 0.2840 (+0.07z)| lr 3.90e-04 | 4161.07 ms | 32.4% bf16 MFU | 125796 tok/s step 8304/19560 | loss 3.508637 (+0.99z)| norm 0.2895 (+0.35z)| lr 3.90e-04 | 4160.61 ms | 32.5% bf16 MFU | 125807 tok/s step 8305/19560 | loss 3.472556 (+0.18z)| norm 0.2650 (-0.93z)| lr 3.90e-04 | 4154.08 ms | 32.5% bf16 MFU | 125827 tok/s step 8306/19560 | loss 3.467628 (+0.07z)| norm 0.2754 (-0.39z)| lr 3.90e-04 | 4376.56 ms | 30.9% bf16 MFU | 125526 tok/s step 8307/19560 | loss 3.450968 (-0.30z)| norm 0.2811 (-0.08z)| lr 3.90e-04 | 4261.00 ms | 31.7% bf16 MFU | 125402 tok/s step 8308/19560 | loss 3.467817 (+0.08z)| norm 0.2823 (-0.00z)| lr 3.90e-04 | 4244.89 ms | 31.8% bf16 MFU | 125307 tok/s step 8309/19560 | loss 3.474139 (+0.22z)| norm 0.3095 (+1.44z)| lr 3.90e-04 | 4260.99 ms | 31.7% bf16 MFU | 125194 tok/s step 8310/19560 | loss 3.414665 (-1.13z)| norm 0.2638 (-1.00z)| lr 3.90e-04 | 4302.79 ms | 31.4% bf16 MFU | 125027 tok/s step 8311/19560 | loss 3.410699 (-1.20z)| norm 
0.2838 (+0.06z)| lr 3.90e-04 | 4514.63 ms | 29.9% bf16 MFU | 124582 tok/s step 8312/19560 | loss 3.449551 (-0.31z)| norm 0.3041 (+1.13z)| lr 3.90e-04 | 4192.26 ms | 32.2% bf16 MFU | 124606 tok/s step 8313/19560 | loss 3.418993 (-0.99z)| norm 0.2858 (+0.16z)| lr 3.89e-04 | 4175.56 ms | 32.3% bf16 MFU | 124654 tok/s step 8314/19560 | loss 3.488765 (+0.58z)| norm 0.2804 (-0.12z)| lr 3.89e-04 | 4159.44 ms | 32.5% bf16 MFU | 124723 tok/s step 8315/19560 | loss 3.492339 (+0.65z)| norm 0.2865 (+0.21z)| lr 3.89e-04 | 4155.28 ms | 32.5% bf16 MFU | 124796 tok/s step 8316/19560 | loss 3.464227 (+0.02z)| norm 0.2656 (-0.89z)| lr 3.89e-04 | 4179.03 ms | 32.3% bf16 MFU | 124829 tok/s step 8317/19560 | loss 3.499756 (+0.83z)| norm 0.2654 (-0.90z)| lr 3.89e-04 | 4182.62 ms | 32.3% bf16 MFU | 124855 tok/s step 8318/19560 | loss 3.431669 (-0.72z)| norm 0.2796 (-0.15z)| lr 3.89e-04 | 4168.17 ms | 32.4% bf16 MFU | 124901 tok/s step 8319/19560 | loss 3.470339 (+0.18z)| norm 0.2476 (-1.81z)| lr 3.89e-04 | 4189.68 ms | 32.2% bf16 MFU | 124913 tok/s step 8320/19560 | loss 3.517267 (+1.25z)| norm 0.2613 (-1.09z)| lr 3.89e-04 | 4170.69 ms | 32.4% bf16 MFU | 124953 tok/s step 8321/19560 | loss 3.473363 (+0.23z)| norm 0.2822 (+0.01z)| lr 3.89e-04 | 4161.39 ms | 32.4% bf16 MFU | 125005 tok/s step 8322/19560 | loss 3.463088 (-0.01z)| norm 0.2612 (-1.09z)| lr 3.89e-04 | 4227.47 ms | 31.9% bf16 MFU | 124955 tok/s step 8323/19560 | loss 3.440373 (-0.54z)| norm 0.2860 (+0.20z)| lr 3.89e-04 | 5139.95 ms | 26.3% bf16 MFU | 123808 tok/s step 8324/19560 | loss 3.538218 (+1.70z)| norm 0.2723 (-0.51z)| lr 3.89e-04 | 4163.66 ms | 32.4% bf16 MFU | 123913 tok/s step 8325/19560 | loss 3.501002 (+0.83z)| norm 0.2777 (-0.23z)| lr 3.89e-04 | 4176.55 ms | 32.3% bf16 MFU | 123994 tok/s step 8326/19560 | loss 3.437426 (-0.65z)| norm 0.2717 (-0.54z)| lr 3.89e-04 | 4168.43 ms | 32.4% bf16 MFU | 124083 tok/s step 8327/19560 | loss 3.500262 (+0.80z)| norm 0.2849 (+0.15z)| lr 3.89e-04 | 4227.10 ms | 31.9% bf16 MFU | 
124081 tok/s step 8328/19560 | loss 3.436413 (-0.67z)| norm 0.2957 (+0.70z)| lr 3.89e-04 | 4164.65 ms | 32.4% bf16 MFU | 124171 tok/s step 8329/19560 | loss 3.476104 (+0.27z)| norm 0.2999 (+0.91z)| lr 3.89e-04 | 4169.56 ms | 32.4% bf16 MFU | 124250 tok/s step 8330/19560 | loss 3.428879 (-0.86z)| norm 0.2894 (+0.34z)| lr 3.89e-04 | 4164.11 ms | 32.4% bf16 MFU | 124333 tok/s step 8331/19560 | loss 3.519696 (+1.28z)| norm 0.2809 (-0.11z)| lr 3.89e-04 | 4170.57 ms | 32.4% bf16 MFU | 124401 tok/s step 8332/19560 | loss 3.443718 (-0.50z)| norm 0.2724 (-0.56z)| lr 3.89e-04 | 4167.73 ms | 32.4% bf16 MFU | 124471 tok/s step 8333/19560 | loss 3.487401 (+0.61z)| norm 0.2791 (-0.22z)| lr 3.89e-04 | 4167.62 ms | 32.4% bf16 MFU | 124538 tok/s step 8334/19560 | loss 3.459475 (-0.10z)| norm 0.2974 (+0.75z)| lr 3.88e-04 | 4173.69 ms | 32.3% bf16 MFU | 124592 tok/s step 8335/19560 | loss 3.502379 (+0.98z)| norm 0.2735 (-0.54z)| lr 3.88e-04 | 4215.65 ms | 32.0% bf16 MFU | 124580 tok/s step 8336/19560 | loss 3.616944 (+3.67z)| norm 0.2915 (+0.43z)| lr 3.88e-04 | 4175.24 ms | 32.3% bf16 MFU | 124630 tok/s step 8337/19560 | loss 3.527892 (+1.51z)| norm 0.2851 (+0.07z)| lr 3.88e-04 | 4167.32 ms | 32.4% bf16 MFU | 124689 tok/s step 8338/19560 | loss 3.487698 (+0.54z)| norm 0.2634 (-1.10z)| lr 3.88e-04 | 4212.63 ms | 32.1% bf16 MFU | 124677 tok/s step 8339/19560 | loss 3.464517 (-0.02z)| norm 0.2745 (-0.50z)| lr 3.88e-04 | 4160.27 ms | 32.5% bf16 MFU | 124745 tok/s step 8340/19560 | loss 3.537373 (+1.72z)| norm 0.2748 (-0.48z)| lr 3.88e-04 | 4195.18 ms | 32.2% bf16 MFU | 124756 tok/s step 8341/19560 | loss 3.464088 (-0.04z)| norm 0.2755 (-0.44z)| lr 3.88e-04 | 4163.77 ms | 32.4% bf16 MFU | 124814 tok/s step 8342/19560 | loss 3.386487 (-1.87z)| norm 0.2623 (-1.15z)| lr 3.88e-04 | 4255.88 ms | 31.7% bf16 MFU | 124733 tok/s step 8343/19560 | loss 3.454923 (-0.25z)| norm 0.2813 (-0.12z)| lr 3.88e-04 | 4168.83 ms | 32.4% bf16 MFU | 124784 tok/s step 8344/19560 | loss 3.464407 (-0.02z)| norm 
0.2843 (+0.04z)| lr 3.88e-04 | 4171.63 ms | 32.4% bf16 MFU | 124829 tok/s step 8345/19560 | loss 3.532905 (+1.59z)| norm 0.2639 (-1.06z)| lr 3.88e-04 | 4158.97 ms | 32.5% bf16 MFU | 124891 tok/s step 8346/19560 | loss 3.520303 (+1.27z)| norm 0.2903 (+0.38z)| lr 3.88e-04 | 4168.71 ms | 32.4% bf16 MFU | 124935 tok/s step 8347/19560 | loss 3.464651 (-0.04z)| norm 0.2620 (-1.15z)| lr 3.88e-04 | 4172.74 ms | 32.4% bf16 MFU | 124970 tok/s step 8348/19560 | loss 3.447543 (-0.45z)| norm 0.3122 (+1.54z)| lr 3.88e-04 | 4163.72 ms | 32.4% bf16 MFU | 125018 tok/s step 8349/19560 | loss 3.480645 (+0.32z)| norm 0.2965 (+0.68z)| lr 3.88e-04 | 4166.26 ms | 32.4% bf16 MFU | 125059 tok/s step 8350/19560 | loss 3.523978 (+1.34z)| norm 0.2651 (-1.00z)| lr 3.88e-04 | 4170.74 ms | 32.4% bf16 MFU | 125091 tok/s step 8351/19560 | loss 3.474196 (+0.15z)| norm 0.3222 (+2.02z)| lr 3.88e-04 | 4183.93 ms | 32.3% bf16 MFU | 125102 tok/s step 8352/19560 | loss 3.416822 (-1.21z)| norm 0.2774 (-0.35z)| lr 3.88e-04 | 4258.08 ms | 31.7% bf16 MFU | 125003 tok/s step 8353/19560 | loss 3.478458 (+0.26z)| norm 0.2790 (-0.26z)| lr 3.88e-04 | 4164.89 ms | 32.4% bf16 MFU | 125047 tok/s step 8354/19560 | loss 3.434930 (-0.77z)| norm 0.2677 (-0.85z)| lr 3.88e-04 | 4323.18 ms | 31.2% bf16 MFU | 124859 tok/s step 8355/19560 | loss 3.406802 (-1.41z)| norm 0.2623 (-1.12z)| lr 3.87e-04 | 4290.65 ms | 31.5% bf16 MFU | 124725 tok/s step 8356/19560 | loss 3.441854 (-0.57z)| norm 0.2590 (-1.31z)| lr 3.87e-04 | 4163.64 ms | 32.4% bf16 MFU | 124785 tok/s step 8357/19560 | loss 3.466316 (+0.01z)| norm 0.2816 (-0.06z)| lr 3.87e-04 | 4172.72 ms | 32.4% bf16 MFU | 124828 tok/s step 8358/19560 | loss 3.450247 (-0.37z)| norm 0.2676 (-0.82z)| lr 3.87e-04 | 4179.56 ms | 32.3% bf16 MFU | 124859 tok/s step 8359/19560 | loss 3.505203 (+0.93z)| norm 0.2790 (-0.17z)| lr 3.87e-04 | 4169.06 ms | 32.4% bf16 MFU | 124904 tok/s step 8360/19560 | loss 3.489829 (+0.55z)| norm 0.2646 (-0.99z)| lr 3.87e-04 | 4168.13 ms | 32.4% bf16 MFU | 
124948 tok/s step 8361/19560 | loss 3.440240 (-0.65z)| norm 0.2779 (-0.19z)| lr 3.87e-04 | 4174.88 ms | 32.3% bf16 MFU | 124980 tok/s step 8362/19560 | loss 3.496529 (+0.73z)| norm 0.2547 (-1.55z)| lr 3.87e-04 | 4164.58 ms | 32.4% bf16 MFU | 125025 tok/s step 8363/19560 | loss 3.424225 (-1.03z)| norm 0.2803 (-0.04z)| lr 3.87e-04 | 4265.19 ms | 31.7% bf16 MFU | 124920 tok/s step 8364/19560 | loss 3.425640 (-0.98z)| norm 0.2637 (-1.02z)| lr 3.87e-04 | 4198.94 ms | 32.2% bf16 MFU | 124917 tok/s step 8365/19560 | loss 3.498317 (+0.77z)| norm 0.2565 (-1.43z)| lr 3.87e-04 | 4168.07 ms | 32.4% bf16 MFU | 124961 tok/s step 8366/19560 | loss 3.446170 (-0.49z)| norm 0.2766 (-0.24z)| lr 3.87e-04 | 4222.83 ms | 32.0% bf16 MFU | 124920 tok/s step 8367/19560 | loss 3.428431 (-0.91z)| norm 0.2794 (-0.08z)| lr 3.87e-04 | 4443.24 ms | 30.4% bf16 MFU | 124574 tok/s step 8368/19560 | loss 3.453932 (-0.30z)| norm 0.2801 (-0.04z)| lr 3.87e-04 | 4176.99 ms | 32.3% bf16 MFU | 124621 tok/s step 8369/19560 | loss 3.447113 (-0.47z)| norm 0.2896 (+0.51z)| lr 3.87e-04 | 4307.52 ms | 31.3% bf16 MFU | 124476 tok/s step 8370/19560 | loss 3.443923 (-0.53z)| norm 0.2865 (+0.31z)| lr 3.87e-04 | 4241.25 ms | 31.8% bf16 MFU | 124433 tok/s step 8371/19560 | loss 3.430861 (-0.84z)| norm 0.2886 (+0.43z)| lr 3.87e-04 | 4162.38 ms | 32.4% bf16 MFU | 124509 tok/s step 8372/19560 | loss 3.435492 (-0.72z)| norm 0.2910 (+0.57z)| lr 3.87e-04 | 4161.11 ms | 32.4% bf16 MFU | 124584 tok/s step 8373/19560 | loss 3.421254 (-1.07z)| norm 0.3202 (+2.23z)| lr 3.87e-04 | 4213.24 ms | 32.0% bf16 MFU | 124576 tok/s step 8374/19560 | loss 3.430019 (-0.84z)| norm 0.2790 (-0.16z)| lr 3.87e-04 | 4289.69 ms | 31.5% bf16 MFU | 124459 tok/s step 8375/19560 | loss 3.458269 (-0.15z)| norm 0.2982 (+0.94z)| lr 3.87e-04 | 4167.06 ms | 32.4% bf16 MFU | 124527 tok/s step 8376/19560 | loss 3.535745 (+1.71z)| norm 0.2940 (+0.69z)| lr 3.86e-04 | 4268.86 ms | 31.6% bf16 MFU | 124441 tok/s step 8377/19560 | loss 3.464635 (-0.02z)| norm 
0.3205 (+2.18z)| lr 3.86e-04 | 4171.17 ms | 32.4% bf16 MFU | 124504 tok/s step 8378/19560 | loss 3.480225 (+0.36z)| norm 0.3021 (+1.12z)| lr 3.86e-04 | 4233.74 ms | 31.9% bf16 MFU | 124470 tok/s step 8379/19560 | loss 3.467584 (+0.05z)| norm 0.3056 (+1.30z)| lr 3.86e-04 | 4340.74 ms | 31.1% bf16 MFU | 124286 tok/s step 8380/19560 | loss 3.459343 (-0.15z)| norm 0.3044 (+1.23z)| lr 3.86e-04 | 4224.87 ms | 32.0% bf16 MFU | 124276 tok/s step 8381/19560 | loss 3.439644 (-0.61z)| norm 0.2918 (+0.52z)| lr 3.86e-04 | 4188.56 ms | 32.2% bf16 MFU | 124321 tok/s step 8382/19560 | loss 3.396977 (-1.62z)| norm 0.2840 (+0.10z)| lr 3.86e-04 | 4171.59 ms | 32.4% bf16 MFU | 124389 tok/s step 8383/19560 | loss 3.400412 (-1.51z)| norm 0.2727 (-0.55z)| lr 3.86e-04 | 4171.39 ms | 32.4% bf16 MFU | 124454 tok/s step 8384/19560 | loss 3.422331 (-0.97z)| norm 0.3053 (+1.32z)| lr 3.86e-04 | 4262.45 ms | 31.7% bf16 MFU | 124381 tok/s step 8385/19560 | loss 3.482028 (+0.46z)| norm 0.2806 (-0.09z)| lr 3.86e-04 | 4176.99 ms | 32.3% bf16 MFU | 124438 tok/s step 8386/19560 | loss 3.365197 (-2.29z)| norm 0.2828 (+0.04z)| lr 3.86e-04 | 4192.41 ms | 32.2% bf16 MFU | 124469 tok/s step 8387/19560 | loss 3.483057 (+0.48z)| norm 0.3097 (+1.56z)| lr 3.86e-04 | 4383.83 ms | 30.8% bf16 MFU | 124225 tok/s step 8388/19560 | loss 3.417337 (-1.08z)| norm 0.3015 (+1.09z)| lr 3.86e-04 | 4203.52 ms | 32.1% bf16 MFU | 124251 tok/s step 8389/19560 | loss 3.541682 (+1.93z)| norm 0.2991 (+0.95z)| lr 3.86e-04 | 4179.19 ms | 32.3% bf16 MFU | 124311 tok/s step 8390/19560 | loss 3.556869 (+2.24z)| norm 0.3055 (+1.36z)| lr 3.86e-04 | 4206.57 ms | 32.1% bf16 MFU | 124327 tok/s step 8391/19560 | loss 3.404189 (-1.39z)| norm 0.3089 (+1.55z)| lr 3.86e-04 | 4223.85 ms | 32.0% bf16 MFU | 124317 tok/s step 8392/19560 | loss 3.453135 (-0.22z)| norm 0.2684 (-0.80z)| lr 3.86e-04 | 4212.11 ms | 32.1% bf16 MFU | 124324 tok/s step 8393/19560 | loss 3.483135 (+0.49z)| norm 0.2873 (+0.30z)| lr 3.86e-04 | 4256.12 ms | 31.7% bf16 MFU | 
124267 tok/s step 8394/19560 | loss 3.465664 (+0.07z)| norm 0.2991 (+0.97z)| lr 3.86e-04 | 4174.01 ms | 32.3% bf16 MFU | 124335 tok/s step 8395/19560 | loss 3.443201 (-0.47z)| norm 0.2663 (-0.93z)| lr 3.86e-04 | 4191.25 ms | 32.2% bf16 MFU | 124372 tok/s step 8396/19560 | loss 3.502955 (+0.98z)| norm 0.2742 (-0.47z)| lr 3.86e-04 | 4221.83 ms | 32.0% bf16 MFU | 124363 tok/s step 8397/19560 | loss 3.504594 (+1.00z)| norm 0.2757 (-0.39z)| lr 3.85e-04 | 4181.39 ms | 32.3% bf16 MFU | 124414 tok/s step 8398/19560 | loss 3.498539 (+0.84z)| norm 0.2817 (-0.05z)| lr 3.85e-04 | 4217.59 ms | 32.0% bf16 MFU | 124409 tok/s step 8399/19560 | loss 3.477735 (+0.34z)| norm 0.2511 (-1.81z)| lr 3.85e-04 | 4165.87 ms | 32.4% bf16 MFU | 124481 tok/s step 8400/19560 | loss 3.460866 (-0.06z)| norm 0.2894 (+0.41z)| lr 3.85e-04 | 4197.19 ms | 32.2% bf16 MFU | 124503 tok/s step 8401/19560 | loss 3.512387 (+1.17z)| norm 0.2708 (-0.68z)| lr 3.85e-04 | 4166.99 ms | 32.4% bf16 MFU | 124569 tok/s step 8402/19560 | loss 3.487503 (+0.56z)| norm 0.2814 (-0.07z)| lr 3.85e-04 | 4166.75 ms | 32.4% bf16 MFU | 124631 tok/s step 8403/19560 | loss 3.445154 (-0.47z)| norm 0.2750 (-0.45z)| lr 3.85e-04 | 4162.19 ms | 32.4% bf16 MFU | 124698 tok/s step 8404/19560 | loss 3.463994 (+0.00z)| norm 0.2471 (-2.04z)| lr 3.85e-04 | 4166.96 ms | 32.4% bf16 MFU | 124754 tok/s step 8405/19560 | loss 3.467119 (+0.08z)| norm 0.2821 (-0.01z)| lr 3.85e-04 | 4177.20 ms | 32.3% bf16 MFU | 124792 tok/s step 8406/19560 | loss 3.486813 (+0.56z)| norm 0.2547 (-1.59z)| lr 3.85e-04 | 4277.95 ms | 31.6% bf16 MFU | 124680 tok/s step 8407/19560 | loss 3.498669 (+0.84z)| norm 0.2639 (-1.05z)| lr 3.85e-04 | 4161.46 ms | 32.4% bf16 MFU | 124746 tok/s step 8408/19560 | loss 3.450294 (-0.34z)| norm 0.2830 (+0.09z)| lr 3.85e-04 | 4161.00 ms | 32.4% bf16 MFU | 124808 tok/s step 8409/19560 | loss 3.431395 (-0.82z)| norm 0.2776 (-0.23z)| lr 3.85e-04 | 4163.22 ms | 32.4% bf16 MFU | 124865 tok/s step 8410/19560 | loss 3.490109 (+0.63z)| norm 
0.2885 (+0.44z)| lr 3.85e-04 | 4183.77 ms | 32.3% bf16 MFU | 124887 tok/s step 8411/19560 | loss 3.481695 (+0.41z)| norm 0.2878 (+0.39z)| lr 3.85e-04 | 4175.62 ms | 32.3% bf16 MFU | 124921 tok/s step 8412/19560 | loss 3.429860 (-0.88z)| norm 0.2816 (+0.02z)| lr 3.85e-04 | 4161.43 ms | 32.4% bf16 MFU | 124974 tok/s step 8413/19560 | loss 3.374337 (-2.22z)| norm 0.2938 (+0.76z)| lr 3.85e-04 | 4167.69 ms | 32.4% bf16 MFU | 125015 tok/s step 8414/19560 | loss 3.447304 (-0.44z)| norm 0.2808 (-0.06z)| lr 3.85e-04 | 4182.52 ms | 32.3% bf16 MFU | 125032 tok/s step 8415/19560 | loss 3.428774 (-0.88z)| norm 0.2748 (-0.44z)| lr 3.85e-04 | 4159.84 ms | 32.5% bf16 MFU | 125082 tok/s step 8416/19560 | loss 3.511903 (+1.15z)| norm 0.2689 (-0.81z)| lr 3.85e-04 | 4175.26 ms | 32.3% bf16 MFU | 125107 tok/s step 8417/19560 | loss 3.428009 (-0.89z)| norm 0.2760 (-0.37z)| lr 3.84e-04 | 4169.33 ms | 32.4% bf16 MFU | 125139 tok/s step 8418/19560 | loss 3.499481 (+0.86z)| norm 0.2848 (+0.19z)| lr 3.84e-04 | 4197.75 ms | 32.2% bf16 MFU | 125127 tok/s step 8419/19560 | loss 3.486865 (+0.58z)| norm 0.2676 (-0.89z)| lr 3.84e-04 | 4174.05 ms | 32.3% bf16 MFU | 125151 tok/s step 8420/19560 | loss 3.484962 (+0.52z)| norm 0.2655 (-1.02z)| lr 3.84e-04 | 4166.40 ms | 32.4% bf16 MFU | 125185 tok/s step 8421/19560 | loss 3.466746 (+0.06z)| norm 0.2874 (+0.38z)| lr 3.84e-04 | 4160.18 ms | 32.5% bf16 MFU | 125227 tok/s step 8422/19560 | loss 3.497394 (+0.82z)| norm 0.2899 (+0.53z)| lr 3.84e-04 | 4191.26 ms | 32.2% bf16 MFU | 125220 tok/s step 8423/19560 | loss 3.447274 (-0.44z)| norm 0.2739 (-0.48z)| lr 3.84e-04 | 4168.96 ms | 32.4% bf16 MFU | 125247 tok/s step 8424/19560 | loss 3.503424 (+0.98z)| norm 0.2587 (-1.46z)| lr 3.84e-04 | 4173.26 ms | 32.4% bf16 MFU | 125266 tok/s step 8425/19560 | loss 3.582759 (+2.86z)| norm 0.2843 (+0.23z)| lr 3.84e-04 | 4170.41 ms | 32.4% bf16 MFU | 125289 tok/s step 8426/19560 | loss 3.408728 (-1.38z)| norm 0.2739 (-0.45z)| lr 3.84e-04 | 4161.34 ms | 32.4% bf16 MFU | 
125324 tok/s step 8427/19560 | loss 3.520936 (+1.33z)| norm 0.2883 (+0.50z)| lr 3.84e-04 | 4618.18 ms | 29.2% bf16 MFU | 124734 tok/s step 8428/19560 | loss 3.472111 (+0.13z)| norm 0.2913 (+0.69z)| lr 3.84e-04 | 4185.55 ms | 32.3% bf16 MFU | 124760 tok/s step 8429/19560 | loss 3.466054 (-0.02z)| norm 0.2798 (-0.07z)| lr 3.84e-04 | 4207.28 ms | 32.1% bf16 MFU | 124753 tok/s step 8430/19560 | loss 3.410938 (-1.39z)| norm 0.3076 (+1.74z)| lr 3.84e-04 | 4179.41 ms | 32.3% bf16 MFU | 124788 tok/s step 8431/19560 | loss 3.517335 (+1.23z)| norm 0.2705 (-0.70z)| lr 3.84e-04 | 4165.98 ms | 32.4% bf16 MFU | 124841 tok/s step 8432/19560 | loss 3.422222 (-1.09z)| norm 0.2826 (+0.10z)| lr 3.84e-04 | 4232.19 ms | 31.9% bf16 MFU | 124793 tok/s step 8433/19560 | loss 3.490967 (+0.59z)| norm 0.2768 (-0.29z)| lr 3.84e-04 | 4263.97 ms | 31.7% bf16 MFU | 124701 tok/s step 8434/19560 | loss 3.516723 (+1.20z)| norm 0.3084 (+1.76z)| lr 3.84e-04 | 4173.31 ms | 32.4% bf16 MFU | 124747 tok/s step 8435/19560 | loss 3.420099 (-1.14z)| norm 0.2802 (-0.08z)| lr 3.84e-04 | 4159.58 ms | 32.5% bf16 MFU | 124812 tok/s step 8436/19560 | loss 3.451598 (-0.37z)| norm 0.3462 (+3.94z)| lr 3.84e-04 | 4163.62 ms | 32.4% bf16 MFU | 124868 tok/s step 8437/19560 | loss 3.427140 (-0.95z)| norm 0.2851 (+0.21z)| lr 3.84e-04 | 4177.25 ms | 32.3% bf16 MFU | 124900 tok/s step 8438/19560 | loss 3.414995 (-1.25z)| norm 0.2787 (-0.20z)| lr 3.83e-04 | 4174.75 ms | 32.3% bf16 MFU | 124934 tok/s step 8439/19560 | loss 3.461643 (-0.13z)| norm 0.2730 (-0.55z)| lr 3.83e-04 | 4159.45 ms | 32.5% bf16 MFU | 124990 tok/s step 8440/19560 | loss 3.409352 (-1.39z)| norm 0.2833 (+0.11z)| lr 3.83e-04 | 4198.27 ms | 32.2% bf16 MFU | 124984 tok/s step 8441/19560 | loss 3.494066 (+0.65z)| norm 0.2787 (-0.18z)| lr 3.83e-04 | 4212.44 ms | 32.1% bf16 MFU | 124958 tok/s step 8442/19560 | loss 3.545351 (+1.86z)| norm 0.2898 (+0.51z)| lr 3.83e-04 | 4191.29 ms | 32.2% bf16 MFU | 124965 tok/s step 8443/19560 | loss 3.424852 (-1.01z)| norm 
0.2978 (+1.00z)| lr 3.83e-04 | 4154.20 ms | 32.5% bf16 MFU | 125027 tok/s step 8444/19560 | loss 3.502920 (+0.84z)| norm 0.2816 (-0.02z)| lr 3.83e-04 | 4175.78 ms | 32.3% bf16 MFU | 125053 tok/s step 8445/19560 | loss 3.532391 (+1.53z)| norm 0.2726 (-0.59z)| lr 3.83e-04 | 4165.56 ms | 32.4% bf16 MFU | 125094 tok/s step 8446/19560 | loss 3.560754 (+2.14z)| norm 0.2860 (+0.25z)| lr 3.83e-04 | 4163.67 ms | 32.4% bf16 MFU | 125135 tok/s step 8447/19560 | loss 3.426830 (-0.96z)| norm 0.3014 (+1.21z)| lr 3.83e-04 | 4168.87 ms | 32.4% bf16 MFU | 125166 tok/s step 8448/19560 | loss 3.433051 (-0.81z)| norm 0.2980 (+0.98z)| lr 3.83e-04 | 4186.01 ms | 32.3% bf16 MFU | 125170 tok/s step 8449/19560 | loss 3.470593 (+0.07z)| norm 0.2848 (+0.14z)| lr 3.83e-04 | 4167.28 ms | 32.4% bf16 MFU | 125202 tok/s step 8450/19560 | loss 3.464103 (-0.08z)| norm 0.2623 (-1.31z)| lr 3.83e-04 | 4158.98 ms | 32.5% bf16 MFU | 125245 tok/s step 8451/19560 | loss 3.447063 (-0.48z)| norm 0.2743 (-0.53z)| lr 3.83e-04 | 4165.46 ms | 32.4% bf16 MFU | 125276 tok/s step 8452/19560 | loss 3.484164 (+0.40z)| norm 0.2872 (+0.29z)| lr 3.83e-04 | 4160.81 ms | 32.4% bf16 MFU | 125313 tok/s step 8453/19560 | loss 3.494766 (+0.65z)| norm 0.2722 (-0.67z)| lr 3.83e-04 | 4262.41 ms | 31.7% bf16 MFU | 125197 tok/s step 8454/19560 | loss 3.456515 (-0.26z)| norm 0.2739 (-0.56z)| lr 3.83e-04 | 4178.85 ms | 32.3% bf16 MFU | 125211 tok/s step 8455/19560 | loss 3.449231 (-0.42z)| norm 0.2742 (-0.54z)| lr 3.83e-04 | 4171.46 ms | 32.4% bf16 MFU | 125234 tok/s step 8456/19560 | loss 3.524384 (+1.33z)| norm 0.2576 (-1.57z)| lr 3.83e-04 | 4166.54 ms | 32.4% bf16 MFU | 125264 tok/s step 8457/19560 | loss 3.410401 (-1.33z)| norm 0.2929 (+0.68z)| lr 3.83e-04 | 4165.62 ms | 32.4% bf16 MFU | 125294 tok/s step 8458/19560 | loss 3.395610 (-1.65z)| norm 0.2814 (-0.05z)| lr 3.83e-04 | 4165.42 ms | 32.4% bf16 MFU | 125323 tok/s step 8459/19560 | loss 3.467055 (+0.01z)| norm 0.2858 (+0.23z)| lr 3.82e-04 | 4161.69 ms | 32.4% bf16 MFU | 
125356 tok/s step 8460/19560 | loss 3.461515 (-0.12z)| norm 0.2898 (+0.47z)| lr 3.82e-04 | 4179.86 ms | 32.3% bf16 MFU | 125359 tok/s step 8461/19560 | loss 3.463764 (-0.06z)| norm 0.2791 (-0.21z)| lr 3.82e-04 | 4196.96 ms | 32.2% bf16 MFU | 125337 tok/s step 8462/19560 | loss 3.492109 (+0.59z)| norm 0.2632 (-1.20z)| lr 3.82e-04 | 4172.71 ms | 32.4% bf16 MFU | 125353 tok/s step 8463/19560 | loss 3.476996 (+0.24z)| norm 0.2740 (-0.52z)| lr 3.82e-04 | 4158.94 ms | 32.5% bf16 MFU | 125388 tok/s step 8464/19560 | loss 3.450917 (-0.35z)| norm 0.2949 (+0.81z)| lr 3.82e-04 | 4179.63 ms | 32.3% bf16 MFU | 125391 tok/s step 8465/19560 | loss 3.489517 (+0.61z)| norm 0.2838 (+0.11z)| lr 3.82e-04 | 4164.75 ms | 32.4% bf16 MFU | 125416 tok/s step 8466/19560 | loss 3.473232 (+0.21z)| norm 0.2725 (-0.62z)| lr 3.82e-04 | 4193.55 ms | 32.2% bf16 MFU | 125396 tok/s step 8467/19560 | loss 3.426602 (-0.94z)| norm 0.2764 (-0.37z)| lr 3.82e-04 | 4168.98 ms | 32.4% bf16 MFU | 125414 tok/s step 8468/19560 | loss 3.419453 (-1.10z)| norm 0.2799 (-0.15z)| lr 3.82e-04 | 4189.14 ms | 32.2% bf16 MFU | 125401 tok/s step 8469/19560 | loss 3.442945 (-0.51z)| norm 0.2644 (-1.13z)| lr 3.82e-04 | 4166.58 ms | 32.4% bf16 MFU | 125423 tok/s step 8470/19560 | loss 3.485191 (+0.53z)| norm 0.2685 (-0.88z)| lr 3.82e-04 | 4170.16 ms | 32.4% bf16 MFU | 125438 tok/s step 8471/19560 | loss 3.538442 (+1.83z)| norm 0.2741 (-0.51z)| lr 3.82e-04 | 4164.60 ms | 32.4% bf16 MFU | 125460 tok/s step 8472/19560 | loss 3.450492 (-0.36z)| norm 0.2504 (-1.98z)| lr 3.82e-04 | 4172.69 ms | 32.4% bf16 MFU | 125470 tok/s step 8473/19560 | loss 3.462094 (-0.05z)| norm 0.2644 (-1.10z)| lr 3.82e-04 | 4159.93 ms | 32.5% bf16 MFU | 125498 tok/s step 8474/19560 | loss 3.417472 (-1.16z)| norm 0.2585 (-1.44z)| lr 3.82e-04 | 4167.66 ms | 32.4% bf16 MFU | 125513 tok/s step 8475/19560 | loss 3.489807 (+0.66z)| norm 0.2720 (-0.61z)| lr 3.82e-04 | 4161.60 ms | 32.4% bf16 MFU | 125536 tok/s step 8476/19560 | loss 3.427397 (-0.91z)| norm 
0.2719 (-0.61z)| lr 3.82e-04 | 4173.41 ms | 32.4% bf16 MFU | 125541 tok/s step 8477/19560 | loss 3.436134 (-0.68z)| norm 0.2451 (-2.24z)| lr 3.82e-04 | 4173.11 ms | 32.4% bf16 MFU | 125546 tok/s step 8478/19560 | loss 3.489283 (+0.67z)| norm 0.2958 (+0.91z)| lr 3.82e-04 | 4171.18 ms | 32.4% bf16 MFU | 125553 tok/s step 8479/19560 | loss 3.517264 (+1.36z)| norm 0.2662 (-0.94z)| lr 3.82e-04 | 4161.69 ms | 32.4% bf16 MFU | 125574 tok/s step 8480/19560 | loss 3.509793 (+1.15z)| norm 0.2547 (-1.64z)| lr 3.81e-04 | 4162.38 ms | 32.4% bf16 MFU | 125594 tok/s step 8481/19560 | loss 3.419937 (-1.09z)| norm 0.2737 (-0.44z)| lr 3.81e-04 | 4174.15 ms | 32.3% bf16 MFU | 125594 tok/s step 8482/19560 | loss 3.453688 (-0.25z)| norm 0.2680 (-0.79z)| lr 3.81e-04 | 4171.06 ms | 32.4% bf16 MFU | 125599 tok/s step 8483/19560 | loss 3.452897 (-0.28z)| norm 0.2437 (-2.29z)| lr 3.81e-04 | 4169.07 ms | 32.4% bf16 MFU | 125607 tok/s step 8484/19560 | loss 3.436317 (-0.70z)| norm 0.2577 (-1.41z)| lr 3.81e-04 | 4167.37 ms | 32.4% bf16 MFU | 125617 tok/s step 8485/19560 | loss 3.480367 (+0.41z)| norm 0.2589 (-1.32z)| lr 3.81e-04 | 4164.51 ms | 32.4% bf16 MFU | 125631 tok/s step 8486/19560 | loss 3.474836 (+0.27z)| norm 0.2510 (-1.78z)| lr 3.81e-04 | 4160.71 ms | 32.5% bf16 MFU | 125650 tok/s step 8487/19560 | loss 3.478809 (+0.37z)| norm 0.2673 (-0.78z)| lr 3.81e-04 | 4218.59 ms | 32.0% bf16 MFU | 125581 tok/s step 8488/19560 | loss 3.446654 (-0.43z)| norm 0.2715 (-0.52z)| lr 3.81e-04 | 4173.98 ms | 32.3% bf16 MFU | 125583 tok/s step 8489/19560 | loss 3.493545 (+0.75z)| norm 0.2767 (-0.21z)| lr 3.81e-04 | 4159.48 ms | 32.5% bf16 MFU | 125606 tok/s step 8490/19560 | loss 3.462528 (-0.03z)| norm 0.2806 (+0.02z)| lr 3.81e-04 | 4172.58 ms | 32.4% bf16 MFU | 125608 tok/s step 8491/19560 | loss 3.485263 (+0.53z)| norm 0.3039 (+1.43z)| lr 3.81e-04 | 4164.81 ms | 32.4% bf16 MFU | 125622 tok/s step 8492/19560 | loss 3.471972 (+0.19z)| norm 0.3023 (+1.31z)| lr 3.81e-04 | 4158.35 ms | 32.5% bf16 MFU | 
125645 tok/s step 8493/19560 | loss 3.403609 (-1.53z)| norm 0.2585 (-1.36z)| lr 3.81e-04 | 4163.51 ms | 32.4% bf16 MFU | 125659 tok/s step 8494/19560 | loss 3.450492 (-0.34z)| norm 0.3151 (+2.04z)| lr 3.81e-04 | 4170.38 ms | 32.4% bf16 MFU | 125662 tok/s step 8495/19560 | loss 3.493606 (+0.74z)| norm 0.2843 (+0.19z)| lr 3.81e-04 | 4168.02 ms | 32.4% bf16 MFU | 125668 tok/s step 8496/19560 | loss 3.498155 (+0.85z)| norm 0.2926 (+0.68z)| lr 3.81e-04 | 4172.30 ms | 32.4% bf16 MFU | 125668 tok/s step 8497/19560 | loss 3.450053 (-0.38z)| norm 0.3203 (+2.28z)| lr 3.81e-04 | 4207.38 ms | 32.1% bf16 MFU | 125615 tok/s step 8498/19560 | loss 3.415992 (-1.23z)| norm 0.3061 (+1.42z)| lr 3.81e-04 | 4182.11 ms | 32.3% bf16 MFU | 125602 tok/s step 8499/19560 | loss 3.433285 (-0.79z)| norm 0.2520 (-1.69z)| lr 3.81e-04 | 4174.33 ms | 32.3% bf16 MFU | 125602 tok/s step 8500/19560 | loss 3.473520 (+0.22z)| norm 0.3157 (+1.94z)| lr 3.81e-04 | 4165.48 ms | 32.4% bf16 MFU | 125615 tok/s val loss 3.424425
HellaSwag: 2893/10042 = 0.288090 step 8501/19560 | loss 3.422142 (-1.08z)| norm 0.2645 (-0.96z)| lr 3.80e-04 | 4190.84 ms | 32.2% bf16 MFU | 125590 tok/s step 8502/19560 | loss 3.470343 (+0.13z)| norm 0.2897 (+0.49z)| lr 3.80e-04 | 4178.41 ms | 32.3% bf16 MFU | 125584 tok/s step 8503/19560 | loss 3.437703 (-0.69z)| norm 0.2928 (+0.67z)| lr 3.80e-04 | 4167.67 ms | 32.4% bf16 MFU | 125595 tok/s step 8504/19560 | loss 3.444096 (-0.52z)| norm 0.2818 (+0.04z)| lr 3.80e-04 | 4186.23 ms | 32.3% bf16 MFU | 125577 tok/s step 8505/19560 | loss 3.448843 (-0.40z)| norm 0.2886 (+0.46z)| lr 3.80e-04 | 4170.83 ms | 32.4% bf16 MFU | 125583 tok/s step 8506/19560 | loss 3.387080 (-1.93z)| norm 0.2780 (-0.16z)| lr 3.80e-04 | 4178.70 ms | 32.3% bf16 MFU | 125578 tok/s step 8507/19560 | loss 3.488510 (+0.63z)| norm 0.2834 (+0.18z)| lr 3.80e-04 | 4163.10 ms | 32.4% bf16 MFU | 125596 tok/s step 8508/19560 | loss 3.483755 (+0.50z)| norm 0.2846 (+0.26z)| lr 3.80e-04 | 4170.38 ms | 32.4% bf16 MFU | 125602 tok/s step 8509/19560 | loss 3.389878 (-1.83z)| norm 0.2915 (+0.68z)| 
lr 3.80e-04 | 4160.12 ms | 32.5% bf16 MFU | 125623 tok/s step 8510/19560 | loss 3.453655 (-0.26z)| norm 0.2815 (+0.08z)| lr 3.80e-04 | 4168.37 ms | 32.4% bf16 MFU | 125631 tok/s step 8511/19560 | loss 3.395290 (-1.72z)| norm 0.2847 (+0.27z)| lr 3.80e-04 | 4171.75 ms | 32.4% bf16 MFU | 125633 tok/s step 8512/19560 | loss 3.412840 (-1.28z)| norm 0.2575 (-1.36z)| lr 3.80e-04 | 4176.56 ms | 32.3% bf16 MFU | 125628 tok/s step 8513/19560 | loss 3.423115 (-1.01z)| norm 0.2795 (-0.03z)| lr 3.80e-04 | 4184.32 ms | 32.3% bf16 MFU | 125611 tok/s step 8514/19560 | loss 3.485041 (+0.53z)| norm 0.2766 (-0.20z)| lr 3.80e-04 | 4156.10 ms | 32.5% bf16 MFU | 125638 tok/s step 8515/19560 | loss 3.450328 (-0.35z)| norm 0.2822 (+0.15z)| lr 3.80e-04 | 4178.81 ms | 32.3% bf16 MFU | 125629 tok/s step 8516/19560 | loss 3.446601 (-0.45z)| norm 0.2784 (-0.07z)| lr 3.80e-04 | 4194.22 ms | 32.2% bf16 MFU | 125598 tok/s step 8517/19560 | loss 3.466051 (+0.06z)| norm 0.2741 (-0.32z)| lr 3.80e-04 | 4200.63 ms | 32.1% bf16 MFU | 125559 tok/s step 8518/19560 | loss 3.426516 (-0.96z)| norm 0.2663 (-0.80z)| lr 3.80e-04 | 4169.29 ms | 32.4% bf16 MFU | 125568 tok/s step 8519/19560 | loss 3.423089 (-1.07z)| norm 0.2785 (-0.02z)| lr 3.80e-04 | 4157.43 ms | 32.5% bf16 MFU | 125595 tok/s step 8520/19560 | loss 3.453199 (-0.26z)| norm 0.2862 (+0.46z)| lr 3.80e-04 | 4179.34 ms | 32.3% bf16 MFU | 125588 tok/s step 8521/19560 | loss 3.535430 (+1.91z)| norm 0.2727 (-0.39z)| lr 3.79e-04 | 4163.26 ms | 32.4% bf16 MFU | 125605 tok/s step 8522/19560 | loss 3.428528 (-0.91z)| norm 0.2685 (-0.64z)| lr 3.79e-04 | 4176.63 ms | 32.3% bf16 MFU | 125601 tok/s step 8523/19560 | loss 3.465266 (+0.06z)| norm 0.2848 (+0.39z)| lr 3.79e-04 | 4174.70 ms | 32.3% bf16 MFU | 125601 tok/s step 8524/19560 | loss 3.439239 (-0.62z)| norm 0.2599 (-1.19z)| lr 3.79e-04 | 4159.25 ms | 32.5% bf16 MFU | 125623 tok/s step 8525/19560 | loss 3.454521 (-0.20z)| norm 0.3001 (+1.35z)| lr 3.79e-04 | 4163.31 ms | 32.4% bf16 MFU | 125639 tok/s step 
8526/19560 | loss 3.456500 (-0.14z)| norm 0.2796 (+0.05z)| lr 3.79e-04 | 4173.40 ms | 32.4% bf16 MFU | 125638 tok/s step 8527/19560 | loss 3.480976 (+0.51z)| norm 0.2755 (-0.22z)| lr 3.79e-04 | 4182.40 ms | 32.3% bf16 MFU | 125624 tok/s step 8528/19560 | loss 3.487414 (+0.67z)| norm 0.2867 (+0.49z)| lr 3.79e-04 | 4189.26 ms | 32.2% bf16 MFU | 125600 tok/s step 8529/19560 | loss 3.469849 (+0.22z)| norm 0.2583 (-1.31z)| lr 3.79e-04 | 4160.69 ms | 32.5% bf16 MFU | 125621 tok/s step 8530/19560 | loss 3.391680 (-1.84z)| norm 0.2742 (-0.30z)| lr 3.79e-04 | 4178.52 ms | 32.3% bf16 MFU | 125613 tok/s step 8531/19560 | loss 3.429210 (-0.84z)| norm 0.2512 (-1.73z)| lr 3.79e-04 | 4181.81 ms | 32.3% bf16 MFU | 125601 tok/s step 8532/19560 | loss 3.504677 (+1.14z)| norm 0.2801 (+0.08z)| lr 3.79e-04 | 4162.50 ms | 32.4% bf16 MFU | 125619 tok/s step 8533/19560 | loss 3.426976 (-0.89z)| norm 0.2683 (-0.67z)| lr 3.79e-04 | 4171.98 ms | 32.4% bf16 MFU | 125621 tok/s step 8534/19560 | loss 3.446493 (-0.37z)| norm 0.2799 (+0.06z)| lr 3.79e-04 | 4174.25 ms | 32.3% bf16 MFU | 125620 tok/s step 8535/19560 | loss 3.421674 (-1.01z)| norm 0.2697 (-0.60z)| lr 3.79e-04 | 4182.52 ms | 32.3% bf16 MFU | 125607 tok/s step 8536/19560 | loss 3.584033 (+3.11z)| norm 0.3016 (+1.44z)| lr 3.79e-04 | 4176.99 ms | 32.3% bf16 MFU | 125603 tok/s step 8537/19560 | loss 3.442472 (-0.47z)| norm 0.2803 (+0.07z)| lr 3.79e-04 | 4203.13 ms | 32.1% bf16 MFU | 125559 tok/s step 8538/19560 | loss 3.539578 (+1.95z)| norm 0.3175 (+2.40z)| lr 3.79e-04 | 4163.27 ms | 32.4% bf16 MFU | 125578 tok/s step 8539/19560 | loss 3.415694 (-1.13z)| norm 0.2688 (-0.66z)| lr 3.79e-04 | 4167.47 ms | 32.4% bf16 MFU | 125589 tok/s step 8540/19560 | loss 3.436290 (-0.62z)| norm 0.3438 (+3.79z)| lr 3.79e-04 | 4167.20 ms | 32.4% bf16 MFU | 125600 tok/s step 8541/19560 | loss 3.519555 (+1.44z)| norm 0.2781 (-0.09z)| lr 3.79e-04 | 4161.21 ms | 32.4% bf16 MFU | 125620 tok/s step 8542/19560 | loss 3.518597 (+1.39z)| norm 0.2948 (+0.89z)| lr 
3.78e-04 | 4181.89 ms | 32.3% bf16 MFU | 125608 tok/s step 8543/19560 | loss 3.420111 (-1.06z)| norm 0.2474 (-1.87z)| lr 3.78e-04 | 4171.40 ms | 32.4% bf16 MFU | 125612 tok/s step 8544/19560 | loss 3.459052 (-0.08z)| norm 0.2757 (-0.23z)| lr 3.78e-04 | 4165.33 ms | 32.4% bf16 MFU | 125625 tok/s step 8545/19560 | loss 3.491108 (+0.71z)| norm 0.2722 (-0.43z)| lr 3.78e-04 | 4165.17 ms | 32.4% bf16 MFU | 125637 tok/s step 8546/19560 | loss 3.478301 (+0.40z)| norm 0.2682 (-0.66z)| lr 3.78e-04 | 4173.85 ms | 32.3% bf16 MFU | 125636 tok/s step 8547/19560 | loss 3.517066 (+1.36z)| norm 0.2887 (+0.53z)| lr 3.78e-04 | 4155.50 ms | 32.5% bf16 MFU | 125662 tok/s step 8548/19560 | loss 3.511601 (+1.21z)| norm 0.2617 (-1.04z)| lr 3.78e-04 | 4165.74 ms | 32.4% bf16 MFU | 125672 tok/s step 8549/19560 | loss 3.489717 (+0.66z)| norm 0.2823 (+0.16z)| lr 3.78e-04 | 4180.77 ms | 32.3% bf16 MFU | 125659 tok/s step 8550/19560 | loss 3.442240 (-0.51z)| norm 0.2970 (+1.01z)| lr 3.78e-04 | 4155.31 ms | 32.5% bf16 MFU | 125684 tok/s step 8551/19560 | loss 3.387331 (-1.84z)| norm 0.2820 (+0.14z)| lr 3.78e-04 | 4161.87 ms | 32.4% bf16 MFU | 125699 tok/s step 8552/19560 | loss 3.463506 (+0.04z)| norm 0.2707 (-0.53z)| lr 3.78e-04 | 4161.54 ms | 32.4% bf16 MFU | 125713 tok/s step 8553/19560 | loss 3.506168 (+1.14z)| norm 0.2971 (+1.00z)| lr 3.78e-04 | 4172.80 ms | 32.4% bf16 MFU | 125710 tok/s step 8554/19560 | loss 3.515014 (+1.34z)| norm 0.3017 (+1.26z)| lr 3.78e-04 | 4190.98 ms | 32.2% bf16 MFU | 125679 tok/s step 8555/19560 | loss 3.451463 (-0.26z)| norm 0.2964 (+0.94z)| lr 3.78e-04 | 4200.37 ms | 32.1% bf16 MFU | 125636 tok/s step 8556/19560 | loss 3.484768 (+0.59z)| norm 0.2952 (+0.87z)| lr 3.78e-04 | 4169.18 ms | 32.4% bf16 MFU | 125642 tok/s step 8557/19560 | loss 3.444643 (-0.44z)| norm 0.2778 (-0.13z)| lr 3.78e-04 | 4156.19 ms | 32.5% bf16 MFU | 125667 tok/s step 8558/19560 | loss 3.381138 (-2.04z)| norm 0.2711 (-0.51z)| lr 3.78e-04 | 4177.32 ms | 32.3% bf16 MFU | 125659 tok/s step 
8559/19560 | loss 3.422010 (-0.99z)| norm 0.2797 (-0.01z)| lr 3.78e-04 | 4158.95 ms | 32.5% bf16 MFU | 125679 tok/s step 8560/19560 | loss 3.416560 (-1.12z)| norm 0.2851 (+0.30z)| lr 3.78e-04 | 4172.34 ms | 32.4% bf16 MFU | 125678 tok/s step 8561/19560 | loss 3.454201 (-0.16z)| norm 0.2763 (-0.21z)| lr 3.78e-04 | 4199.99 ms | 32.1% bf16 MFU | 125636 tok/s step 8562/19560 | loss 3.425288 (-0.88z)| norm 0.2745 (-0.30z)| lr 3.78e-04 | 4169.43 ms | 32.4% bf16 MFU | 125642 tok/s step 8563/19560 | loss 3.474985 (+0.39z)| norm 0.3072 (+1.59z)| lr 3.77e-04 | 4174.53 ms | 32.3% bf16 MFU | 125639 tok/s step 8564/19560 | loss 3.420407 (-1.01z)| norm 0.3301 (+3.01z)| lr 3.77e-04 | 4162.26 ms | 32.4% bf16 MFU | 125655 tok/s step 8565/19560 | loss 3.471138 (+0.28z)| norm 0.2926 (+0.77z)| lr 3.77e-04 | 4195.63 ms | 32.2% bf16 MFU | 125620 tok/s step 8566/19560 | loss 3.434221 (-0.67z)| norm 0.2673 (-0.74z)| lr 3.77e-04 | 4161.50 ms | 32.4% bf16 MFU | 125639 tok/s step 8567/19560 | loss 3.452347 (-0.20z)| norm 0.2839 (+0.24z)| lr 3.77e-04 | 4174.98 ms | 32.3% bf16 MFU | 125636 tok/s step 8568/19560 | loss 3.423599 (-0.95z)| norm 0.2641 (-0.92z)| lr 3.77e-04 | 4165.86 ms | 32.4% bf16 MFU | 125647 tok/s step 8569/19560 | loss 3.429109 (-0.80z)| norm 0.2939 (+0.84z)| lr 3.77e-04 | 4169.30 ms | 32.4% bf16 MFU | 125652 tok/s step 8570/19560 | loss 3.496193 (+0.97z)| norm 0.2752 (-0.27z)| lr 3.77e-04 | 4165.48 ms | 32.4% bf16 MFU | 125662 tok/s step 8571/19560 | loss 3.460381 (+0.02z)| norm 0.2962 (+0.98z)| lr 3.77e-04 | 4167.20 ms | 32.4% bf16 MFU | 125670 tok/s step 8572/19560 | loss 3.391593 (-1.76z)| norm 0.2776 (-0.12z)| lr 3.77e-04 | 4160.19 ms | 32.5% bf16 MFU | 125688 tok/s step 8573/19560 | loss 3.469232 (+0.29z)| norm 0.2789 (-0.04z)| lr 3.77e-04 | 4196.00 ms | 32.2% bf16 MFU | 125651 tok/s step 8574/19560 | loss 3.411494 (-1.24z)| norm 0.2876 (+0.47z)| lr 3.77e-04 | 4172.26 ms | 32.4% bf16 MFU | 125651 tok/s step 8575/19560 | loss 3.410855 (-1.25z)| norm 0.2931 (+0.81z)| lr 
3.77e-04 | 4161.71 ms | 32.4% bf16 MFU | 125668 tok/s step 8576/19560 | loss 3.451484 (-0.16z)| norm 0.2747 (-0.28z)| lr 3.77e-04 | 4171.68 ms | 32.4% bf16 MFU | 125668 tok/s step 8577/19560 | loss 3.429119 (-0.75z)| norm 0.2710 (-0.49z)| lr 3.77e-04 | 4173.37 ms | 32.4% bf16 MFU | 125666 tok/s step 8578/19560 | loss 3.429037 (-0.75z)| norm 0.2764 (-0.18z)| lr 3.77e-04 | 4165.08 ms | 32.4% bf16 MFU | 125677 tok/s step 8579/19560 | loss 3.487828 (+0.84z)| norm 0.2895 (+0.60z)| lr 3.77e-04 | 4164.44 ms | 32.4% bf16 MFU | 125688 tok/s step 8580/19560 | loss 3.441431 (-0.41z)| norm 0.2680 (-0.68z)| lr 3.77e-04 | 4176.03 ms | 32.3% bf16 MFU | 125681 tok/s step 8581/19560 | loss 3.471717 (+0.42z)| norm 0.2953 (+0.94z)| lr 3.77e-04 | 4195.70 ms | 32.2% bf16 MFU | 125644 tok/s step 8582/19560 | loss 3.409429 (-1.26z)| norm 0.2387 (-2.37z)| lr 3.77e-04 | 4171.20 ms | 32.4% bf16 MFU | 125647 tok/s step 8583/19560 | loss 3.437949 (-0.49z)| norm 0.2939 (+0.84z)| lr 3.77e-04 | 4163.57 ms | 32.4% bf16 MFU | 125661 tok/s step 8584/19560 | loss 3.402467 (-1.42z)| norm 0.2844 (+0.28z)| lr 3.76e-04 | 4173.70 ms | 32.3% bf16 MFU | 125658 tok/s step 8585/19560 | loss 3.481874 (+0.72z)| norm 0.2767 (-0.17z)| lr 3.76e-04 | 4392.24 ms | 30.7% bf16 MFU | 125344 tok/s step 8586/19560 | loss 3.392456 (-1.71z)| norm 0.2643 (-0.88z)| lr 3.76e-04 | 4182.81 ms | 32.3% bf16 MFU | 125344 tok/s step 8587/19560 | loss 3.457752 (+0.06z)| norm 0.2953 (+0.92z)| lr 3.76e-04 | 4158.55 ms | 32.5% bf16 MFU | 125380 tok/s step 8588/19560 | loss 3.436277 (-0.52z)| norm 0.2828 (+0.20z)| lr 3.76e-04 | 4183.09 ms | 32.3% bf16 MFU | 125378 tok/s step 8589/19560 | loss 3.429390 (-0.69z)| norm 0.2634 (-0.93z)| lr 3.76e-04 | 4167.86 ms | 32.4% bf16 MFU | 125399 tok/s step 8590/19560 | loss 3.415230 (-1.06z)| norm 0.2734 (-0.35z)| lr 3.76e-04 | 4158.84 ms | 32.5% bf16 MFU | 125432 tok/s step 8591/19560 | loss 3.424602 (-0.80z)| norm 0.2649 (-0.84z)| lr 3.76e-04 | 4194.75 ms | 32.2% bf16 MFU | 125410 tok/s step 
8592/19560 | loss 3.410356 (-1.17z)| norm 0.2456 (-1.92z)| lr 3.76e-04 | 4154.80 ms | 32.5% bf16 MFU | 125449 tok/s step 8593/19560 | loss 3.465546 (+0.33z)| norm 0.2869 (+0.46z)| lr 3.76e-04 | 4167.85 ms | 32.4% bf16 MFU | 125466 tok/s step 8594/19560 | loss 3.447469 (-0.16z)| norm 0.2533 (-1.46z)| lr 3.76e-04 | 4186.88 ms | 32.2% bf16 MFU | 125454 tok/s step 8595/19560 | loss 3.447340 (-0.17z)| norm 0.2674 (-0.65z)| lr 3.76e-04 | 4160.29 ms | 32.5% bf16 MFU | 125482 tok/s step 8596/19560 | loss 3.399283 (-1.46z)| norm 0.2647 (-0.79z)| lr 3.76e-04 | 4213.88 ms | 32.0% bf16 MFU | 125429 tok/s step 8597/19560 | loss 3.489250 (+0.96z)| norm 0.2799 (+0.07z)| lr 3.76e-04 | 4178.82 ms | 32.3% bf16 MFU | 125431 tok/s step 8598/19560 | loss 3.393047 (-1.60z)| norm 0.2697 (-0.51z)| lr 3.76e-04 | 4185.19 ms | 32.3% bf16 MFU | 125423 tok/s step 8599/19560 | loss 3.459857 (+0.21z)| norm 0.2830 (+0.24z)| lr 3.76e-04 | 4162.66 ms | 32.4% bf16 MFU | 125449 tok/s step 8600/19560 | loss 3.411224 (-1.11z)| norm 0.2876 (+0.49z)| lr 3.76e-04 | 4158.62 ms | 32.5% bf16 MFU | 125480 tok/s step 8601/19560 | loss 3.430254 (-0.58z)| norm 0.2831 (+0.22z)| lr 3.76e-04 | 4168.58 ms | 32.4% bf16 MFU | 125495 tok/s step 8602/19560 | loss 3.406680 (-1.22z)| norm 0.2610 (-1.06z)| lr 3.76e-04 | 4164.91 ms | 32.4% bf16 MFU | 125514 tok/s step 8603/19560 | loss 3.444276 (-0.19z)| norm 0.2943 (+0.86z)| lr 3.76e-04 | 4168.58 ms | 32.4% bf16 MFU | 125527 tok/s step 8604/19560 | loss 3.394826 (-1.51z)| norm 0.2803 (+0.05z)| lr 3.75e-04 | 4167.33 ms | 32.4% bf16 MFU | 125541 tok/s step 8605/19560 | loss 3.457633 (+0.17z)| norm 0.2729 (-0.40z)| lr 3.75e-04 | 4165.07 ms | 32.4% bf16 MFU | 125558 tok/s step 8606/19560 | loss 3.388577 (-1.66z)| norm 0.2716 (-0.47z)| lr 3.75e-04 | 4199.03 ms | 32.2% bf16 MFU | 125523 tok/s step 8607/19560 | loss 3.454919 (+0.13z)| norm 0.2856 (+0.35z)| lr 3.75e-04 | 4174.26 ms | 32.3% bf16 MFU | 125527 tok/s step 8608/19560 | loss 3.462249 (+0.35z)| norm 0.2546 (-1.48z)| lr 
3.75e-04 | 4171.81 ms | 32.4% bf16 MFU | 125534 tok/s step 8609/19560 | loss 3.426675 (-0.63z)| norm 0.2926 (+0.75z)| lr 3.75e-04 | 4168.99 ms | 32.4% bf16 MFU | 125546 tok/s step 8610/19560 | loss 3.420545 (-0.79z)| norm 0.2559 (-1.39z)| lr 3.75e-04 | 4151.95 ms | 32.5% bf16 MFU | 125582 tok/s step 8611/19560 | loss 3.462802 (+0.36z)| norm 0.3067 (+1.56z)| lr 3.75e-04 | 4171.19 ms | 32.4% bf16 MFU | 125588 tok/s step 8612/19560 | loss 3.409041 (-1.09z)| norm 0.2708 (-0.56z)| lr 3.75e-04 | 4163.00 ms | 32.4% bf16 MFU | 125605 tok/s step 8613/19560 | loss 3.478345 (+0.79z)| norm 0.2768 (-0.22z)| lr 3.75e-04 | 4163.99 ms | 32.4% bf16 MFU | 125620 tok/s step 8614/19560 | loss 3.453865 (+0.13z)| norm 0.3099 (+1.73z)| lr 3.75e-04 | 4172.62 ms | 32.4% bf16 MFU | 125622 tok/s step 8615/19560 | loss 3.426725 (-0.60z)| norm 0.2812 (+0.01z)| lr 3.75e-04 | 4169.98 ms | 32.4% bf16 MFU | 125627 tok/s step 8616/19560 | loss 3.425425 (-0.63z)| norm 0.2769 (-0.25z)| lr 3.75e-04 | 4171.22 ms | 32.4% bf16 MFU | 125630 tok/s step 8617/19560 | loss 3.420042 (-0.76z)| norm 0.2906 (+0.56z)| lr 3.75e-04 | 4150.97 ms | 32.5% bf16 MFU | 125664 tok/s step 8618/19560 | loss 3.425528 (-0.60z)| norm 0.2640 (-1.02z)| lr 3.75e-04 | 4179.47 ms | 32.3% bf16 MFU | 125653 tok/s step 8619/19560 | loss 3.390745 (-1.53z)| norm 0.2731 (-0.46z)| lr 3.75e-04 | 4161.60 ms | 32.4% bf16 MFU | 125670 tok/s step 8620/19560 | loss 3.406976 (-1.07z)| norm 0.2870 (+0.38z)| lr 3.75e-04 | 4178.17 ms | 32.3% bf16 MFU | 125660 tok/s step 8621/19560 | loss 3.410074 (-0.99z)| norm 0.2727 (-0.49z)| lr 3.75e-04 | 4165.85 ms | 32.4% bf16 MFU | 125670 tok/s step 8622/19560 | loss 3.423260 (-0.62z)| norm 0.2760 (-0.28z)| lr 3.75e-04 | 4177.30 ms | 32.3% bf16 MFU | 125662 tok/s step 8623/19560 | loss 3.451121 (+0.14z)| norm 0.3132 (+1.98z)| lr 3.75e-04 | 4169.96 ms | 32.4% bf16 MFU | 125665 tok/s step 8624/19560 | loss 3.459734 (+0.39z)| norm 0.2620 (-1.12z)| lr 3.75e-04 | 4172.39 ms | 32.4% bf16 MFU | 125665 tok/s step 
8625/19560 | loss 3.527651 (+2.20z)| norm 0.2836 (+0.21z)| lr 3.74e-04 | 4159.24 ms | 32.5% bf16 MFU | 125684 tok/s step 8626/19560 | loss 3.514678 (+1.81z)| norm 0.2937 (+0.85z)| lr 3.74e-04 | 4176.82 ms | 32.3% bf16 MFU | 125676 tok/s step 8627/19560 | loss 3.430249 (-0.45z)| norm 0.2603 (-1.25z)| lr 3.74e-04 | 4168.32 ms | 32.4% bf16 MFU | 125681 tok/s step 8628/19560 | loss 3.427312 (-0.51z)| norm 0.2622 (-1.12z)| lr 3.74e-04 | 4176.23 ms | 32.3% bf16 MFU | 125674 tok/s step 8629/19560 | loss 3.403020 (-1.16z)| norm 0.2622 (-1.12z)| lr 3.74e-04 | 4188.51 ms | 32.2% bf16 MFU | 125649 tok/s step 8630/19560 | loss 3.437358 (-0.23z)| norm 0.2624 (-1.09z)| lr 3.74e-04 | 4164.66 ms | 32.4% bf16 MFU | 125661 tok/s step 8631/19560 | loss 3.458080 (+0.31z)| norm 0.2552 (-1.52z)| lr 3.74e-04 | 4172.43 ms | 32.4% bf16 MFU | 125661 tok/s step 8632/19560 | loss 3.412071 (-0.90z)| norm 0.2529 (-1.63z)| lr 3.74e-04 | 4167.35 ms | 32.4% bf16 MFU | 125668 tok/s step 8633/19560 | loss 3.449943 (+0.10z)| norm 0.2684 (-0.65z)| lr 3.74e-04 | 4213.63 ms | 32.0% bf16 MFU | 125606 tok/s step 8634/19560 | loss 3.467363 (+0.55z)| norm 0.2592 (-1.21z)| lr 3.74e-04 | 4147.88 ms | 32.6% bf16 MFU | 125646 tok/s step 8635/19560 | loss 3.405739 (-1.08z)| norm 0.2575 (-1.30z)| lr 3.74e-04 | 4154.98 ms | 32.5% bf16 MFU | 125673 tok/s step 8636/19560 | loss 3.461403 (+0.42z)| norm 0.2557 (-1.38z)| lr 3.74e-04 | 4161.52 ms | 32.4% bf16 MFU | 125688 tok/s step 8637/19560 | loss 3.407133 (-1.05z)| norm 0.2660 (-0.74z)| lr 3.74e-04 | 4177.17 ms | 32.3% bf16 MFU | 125680 tok/s step 8638/19560 | loss 3.392405 (-1.42z)| norm 0.2535 (-1.48z)| lr 3.74e-04 | 4187.21 ms | 32.2% bf16 MFU | 125656 tok/s step 8639/19560 | loss 3.430661 (-0.41z)| norm 0.2788 (+0.06z)| lr 3.74e-04 | 4169.39 ms | 32.4% bf16 MFU | 125661 tok/s step 8640/19560 | loss 3.472418 (+0.71z)| norm 0.2704 (-0.46z)| lr 3.74e-04 | 4154.27 ms | 32.5% bf16 MFU | 125688 tok/s step 8641/19560 | loss 3.439736 (-0.18z)| norm 0.3014 (+1.42z)| lr 
3.74e-04 | 4169.79 ms | 32.4% bf16 MFU | 125690 tok/s step 8642/19560 | loss 3.402848 (-1.16z)| norm 0.2697 (-0.50z)| lr 3.74e-04 | 4154.54 ms | 32.5% bf16 MFU | 125716 tok/s step 8643/19560 | loss 3.450442 (+0.13z)| norm 0.3130 (+2.07z)| lr 3.74e-04 | 4173.82 ms | 32.3% bf16 MFU | 125710 tok/s step 8644/19560 | loss 3.436836 (-0.24z)| norm 0.2788 (+0.03z)| lr 3.74e-04 | 4157.01 ms | 32.5% bf16 MFU | 125731 tok/s step 8645/19560 | loss 3.509745 (+1.70z)| norm 0.2594 (-1.11z)| lr 3.74e-04 | 4169.58 ms | 32.4% bf16 MFU | 125732 tok/s step 8646/19560 | loss 3.458212 (+0.32z)| norm 0.2634 (-0.87z)| lr 3.73e-04 | 4166.66 ms | 32.4% bf16 MFU | 125736 tok/s step 8647/19560 | loss 3.384285 (-1.64z)| norm 0.2960 (+1.05z)| lr 3.73e-04 | 4170.77 ms | 32.4% bf16 MFU | 125735 tok/s step 8648/19560 | loss 3.425644 (-0.53z)| norm 0.2809 (+0.16z)| lr 3.73e-04 | 4172.15 ms | 32.4% bf16 MFU | 125731 tok/s step 8649/19560 | loss 3.487309 (+1.13z)| norm 0.2718 (-0.38z)| lr 3.73e-04 | 4152.40 ms | 32.5% bf16 MFU | 125758 tok/s step 8650/19560 | loss 3.470297 (+0.66z)| norm 0.2658 (-0.73z)| lr 3.73e-04 | 4186.62 ms | 32.2% bf16 MFU | 125731 tok/s step 8651/19560 | loss 3.377705 (-1.79z)| norm 0.2807 (+0.15z)| lr 3.73e-04 | 4168.35 ms | 32.4% bf16 MFU | 125734 tok/s step 8652/19560 | loss 3.412750 (-0.85z)| norm 0.2620 (-0.95z)| lr 3.73e-04 | 4161.78 ms | 32.4% bf16 MFU | 125746 tok/s step 8653/19560 | loss 3.488666 (+1.15z)| norm 0.2829 (+0.29z)| lr 3.73e-04 | 4159.35 ms | 32.5% bf16 MFU | 125761 tok/s step 8654/19560 | loss 3.440352 (-0.12z)| norm 0.2591 (-1.11z)| lr 3.73e-04 | 4168.89 ms | 32.4% bf16 MFU | 125761 tok/s step 8655/19560 | loss 3.441812 (-0.08z)| norm 0.2601 (-1.04z)| lr 3.73e-04 | 4165.81 ms | 32.4% bf16 MFU | 125766 tok/s step 8656/19560 | loss 3.445341 (+0.03z)| norm 0.2894 (+0.68z)| lr 3.73e-04 | 4164.95 ms | 32.4% bf16 MFU | 125772 tok/s step 8657/19560 | loss 3.513223 (+1.81z)| norm 0.2713 (-0.39z)| lr 3.73e-04 | 4168.20 ms | 32.4% bf16 MFU | 125772 tok/s step 
8658/19560 | loss 3.432670 (-0.33z)| norm 0.2831 (+0.31z)| lr 3.73e-04 | 4172.56 ms | 32.4% bf16 MFU | 125766 tok/s step 8659/19560 | loss 3.443000 (-0.06z)| norm 0.2894 (+0.67z)| lr 3.73e-04 | 4158.91 ms | 32.5% bf16 MFU | 125781 tok/s step 8660/19560 | loss 3.477691 (+0.88z)| norm 0.3012 (+1.35z)| lr 3.73e-04 | 4178.83 ms | 32.3% bf16 MFU | 125765 tok/s step 8661/19560 | loss 3.413441 (-0.84z)| norm 0.2848 (+0.37z)| lr 3.73e-04 | 4178.86 ms | 32.3% bf16 MFU | 125750 tok/s step 8662/19560 | loss 3.431362 (-0.36z)| norm 0.2731 (-0.32z)| lr 3.73e-04 | 4170.89 ms | 32.4% bf16 MFU | 125748 tok/s step 8663/19560 | loss 3.492469 (+1.26z)| norm 0.2955 (+1.00z)| lr 3.73e-04 | 4153.34 ms | 32.5% bf16 MFU | 125772 tok/s step 8664/19560 | loss 3.418018 (-0.73z)| norm 0.2699 (-0.51z)| lr 3.73e-04 | 4178.69 ms | 32.3% bf16 MFU | 125757 tok/s step 8665/19560 | loss 3.453798 (+0.27z)| norm 0.2866 (+0.49z)| lr 3.73e-04 | 4173.94 ms | 32.3% bf16 MFU | 125749 tok/s step 8666/19560 | loss 3.384998 (-1.65z)| norm 0.2787 (+0.03z)| lr 3.72e-04 | 4184.29 ms | 32.3% bf16 MFU | 125727 tok/s step 8667/19560 | loss 3.463296 (+0.58z)| norm 0.2681 (-0.61z)| lr 3.72e-04 | 4167.95 ms | 32.4% bf16 MFU | 125730 tok/s step 8668/19560 | loss 3.402092 (-1.16z)| norm 0.3056 (+1.78z)| lr 3.72e-04 | 4161.50 ms | 32.4% bf16 MFU | 125743 tok/s step 8669/19560 | loss 3.471781 (+0.85z)| norm 0.2552 (-1.43z)| lr 3.72e-04 | 4159.80 ms | 32.5% bf16 MFU | 125757 tok/s step 8670/19560 | loss 3.427957 (-0.41z)| norm 0.2841 (+0.42z)| lr 3.72e-04 | 4168.72 ms | 32.4% bf16 MFU | 125758 tok/s step 8671/19560 | loss 3.488708 (+1.36z)| norm 0.3301 (+3.22z)| lr 3.72e-04 | 4178.23 ms | 32.3% bf16 MFU | 125744 tok/s step 8672/19560 | loss 3.409814 (-0.94z)| norm 0.3016 (+1.43z)| lr 3.72e-04 | 4190.03 ms | 32.2% bf16 MFU | 125713 tok/s step 8673/19560 | loss 3.433658 (-0.23z)| norm 0.2692 (-0.57z)| lr 3.72e-04 | 4179.77 ms | 32.3% bf16 MFU | 125699 tok/s step 8674/19560 | loss 3.417966 (-0.68z)| norm 0.2887 (+0.62z)| lr 
3.72e-04 | 4164.70 ms | 32.4% bf16 MFU | 125709 tok/s step 8675/19560 | loss 3.467785 (+0.82z)| norm 0.2654 (-0.80z)| lr 3.72e-04 | 4161.47 ms | 32.4% bf16 MFU | 125723 tok/s step 8676/19560 | loss 3.428359 (-0.36z)| norm 0.2706 (-0.49z)| lr 3.72e-04 | 4160.68 ms | 32.5% bf16 MFU | 125737 tok/s step 8677/19560 | loss 3.429272 (-0.32z)| norm 0.2857 (+0.44z)| lr 3.72e-04 | 4162.26 ms | 32.4% bf16 MFU | 125748 tok/s step 8678/19560 | loss 3.432895 (-0.20z)| norm 0.2833 (+0.31z)| lr 3.72e-04 | 4182.98 ms | 32.3% bf16 MFU | 125728 tok/s step 8679/19560 | loss 3.455335 (+0.48z)| norm 0.2689 (-0.58z)| lr 3.72e-04 | 4203.03 ms | 32.1% bf16 MFU | 125678 tok/s step 8680/19560 | loss 3.457989 (+0.56z)| norm 0.2904 (+0.74z)| lr 3.72e-04 | 4174.11 ms | 32.3% bf16 MFU | 125675 tok/s step 8681/19560 | loss 3.459392 (+0.63z)| norm 0.2611 (-1.06z)| lr 3.72e-04 | 4159.14 ms | 32.5% bf16 MFU | 125694 tok/s step 8682/19560 | loss 3.428411 (-0.34z)| norm 0.2826 (+0.29z)| lr 3.72e-04 | 4167.20 ms | 32.4% bf16 MFU | 125700 tok/s step 8683/19560 | loss 3.386502 (-1.66z)| norm 0.3199 (+2.55z)| lr 3.72e-04 | 4169.35 ms | 32.4% bf16 MFU | 125702 tok/s step 8684/19560 | loss 3.384873 (-1.69z)| norm 0.2972 (+1.16z)| lr 3.72e-04 | 4157.22 ms | 32.5% bf16 MFU | 125723 tok/s step 8685/19560 | loss 3.427297 (-0.33z)| norm 0.2659 (-0.74z)| lr 3.72e-04 | 4191.04 ms | 32.2% bf16 MFU | 125692 tok/s step 8686/19560 | loss 3.413661 (-0.78z)| norm 0.2596 (-1.12z)| lr 3.72e-04 | 4179.00 ms | 32.3% bf16 MFU | 125680 tok/s step 8687/19560 | loss 3.481879 (+1.40z)| norm 0.2903 (+0.74z)| lr 3.71e-04 | 4175.06 ms | 32.3% bf16 MFU | 125675 tok/s step 8688/19560 | loss 3.457759 (+0.62z)| norm 0.2844 (+0.38z)| lr 3.71e-04 | 4176.84 ms | 32.3% bf16 MFU | 125667 tok/s step 8689/19560 | loss 3.455153 (+0.53z)| norm 0.2782 (+0.01z)| lr 3.71e-04 | 4171.94 ms | 32.4% bf16 MFU | 125667 tok/s step 8690/19560 | loss 3.442214 (+0.11z)| norm 0.2636 (-0.87z)| lr 3.71e-04 | 4188.63 ms | 32.2% bf16 MFU | 125642 tok/s step 
8691/19560 | loss 3.452853 (+0.46z)| norm 0.3106 (+1.96z)| lr 3.71e-04 | 4177.77 ms | 32.3% bf16 MFU | 125635 tok/s step 8692/19560 | loss 3.455307 (+0.53z)| norm 0.2736 (-0.25z)| lr 3.71e-04 | 4170.43 ms | 32.4% bf16 MFU | 125639 tok/s step 8693/19560 | loss 3.443458 (+0.16z)| norm 0.2729 (-0.29z)| lr 3.71e-04 | 4170.54 ms | 32.4% bf16 MFU | 125643 tok/s step 8694/19560 | loss 3.438186 (-0.01z)| norm 0.2569 (-1.28z)| lr 3.71e-04 | 4164.84 ms | 32.4% bf16 MFU | 125655 tok/s step 8695/19560 | loss 3.551222 (+3.45z)| norm 0.2777 (+0.02z)| lr 3.71e-04 | 4167.19 ms | 32.4% bf16 MFU | 125663 tok/s step 8696/19560 | loss 3.497484 (+1.76z)| norm 0.2513 (-1.61z)| lr 3.71e-04 | 4153.54 ms | 32.5% bf16 MFU | 125691 tok/s step 8697/19560 | loss 3.402839 (-1.12z)| norm 0.2644 (-0.78z)| lr 3.71e-04 | 4165.85 ms | 32.4% bf16 MFU | 125699 tok/s step 8698/19560 | loss 3.443698 (+0.13z)| norm 0.2655 (-0.71z)| lr 3.71e-04 | 4164.37 ms | 32.4% bf16 MFU | 125709 tok/s step 8699/19560 | loss 3.438892 (-0.01z)| norm 0.2652 (-0.72z)| lr 3.71e-04 | 4160.15 ms | 32.5% bf16 MFU | 125725 tok/s step 8700/19560 | loss 3.435053 (-0.14z)| norm 0.2874 (+0.66z)| lr 3.71e-04 | 4173.25 ms | 32.4% bf16 MFU | 125720 tok/s step 8701/19560 | loss 3.472228 (+1.02z)| norm 0.3044 (+1.69z)| lr 3.71e-04 | 4158.98 ms | 32.5% bf16 MFU | 125737 tok/s step 8702/19560 | loss 3.491269 (+1.58z)| norm 0.2663 (-0.65z)| lr 3.71e-04 | 4154.09 ms | 32.5% bf16 MFU | 125761 tok/s step 8703/19560 | loss 3.474914 (+1.05z)| norm 0.2759 (-0.05z)| lr 3.71e-04 | 4160.22 ms | 32.5% bf16 MFU | 125774 tok/s step 8704/19560 | loss 3.406011 (-1.05z)| norm 0.2722 (-0.27z)| lr 3.71e-04 | 4157.84 ms | 32.5% bf16 MFU | 125790 tok/s step 8705/19560 | loss 3.466616 (+0.80z)| norm 0.2690 (-0.47z)| lr 3.71e-04 | 4169.02 ms | 32.4% bf16 MFU | 125789 tok/s step 8706/19560 | loss 3.515449 (+2.23z)| norm 0.2869 (+0.62z)| lr 3.71e-04 | 4169.61 ms | 32.4% bf16 MFU | 125786 tok/s step 8707/19560 | loss 3.410918 (-0.90z)| norm 0.2569 (-1.20z)| lr 
3.70e-04 | 4166.28 ms | 32.4% bf16 MFU | 125789 tok/s step 8708/19560 | loss 3.456462 (+0.47z)| norm 0.2731 (-0.21z)| lr 3.70e-04 | 4162.94 ms | 32.4% bf16 MFU | 125796 tok/s step 8709/19560 | loss 3.369968 (-2.08z)| norm 0.2655 (-0.66z)| lr 3.70e-04 | 4164.24 ms | 32.4% bf16 MFU | 125802 tok/s step 8710/19560 | loss 3.470149 (+0.88z)| norm 0.2897 (+0.82z)| lr 3.70e-04 | 4166.76 ms | 32.4% bf16 MFU | 125803 tok/s step 8711/19560 | loss 3.540520 (+2.85z)| norm 0.3159 (+2.41z)| lr 3.70e-04 | 4162.24 ms | 32.4% bf16 MFU | 125811 tok/s step 8712/19560 | loss 3.426307 (-0.44z)| norm 0.2743 (-0.16z)| lr 3.70e-04 | 4163.21 ms | 32.4% bf16 MFU | 125817 tok/s step 8713/19560 | loss 3.445594 (+0.13z)| norm 0.2760 (-0.05z)| lr 3.70e-04 | 4171.06 ms | 32.4% bf16 MFU | 125811 tok/s step 8714/19560 | loss 3.490953 (+1.42z)| norm 0.2764 (-0.03z)| lr 3.70e-04 | 4166.28 ms | 32.4% bf16 MFU | 125813 tok/s step 8715/19560 | loss 3.450555 (+0.25z)| norm 0.3054 (+1.75z)| lr 3.70e-04 | 4171.73 ms | 32.4% bf16 MFU | 125806 tok/s step 8716/19560 | loss 3.512767 (+2.01z)| norm 0.2722 (-0.29z)| lr 3.70e-04 | 4164.52 ms | 32.4% bf16 MFU | 125810 tok/s step 8717/19560 | loss 3.454311 (+0.33z)| norm 0.2813 (+0.26z)| lr 3.70e-04 | 4159.71 ms | 32.5% bf16 MFU | 125822 tok/s step 8718/19560 | loss 3.396362 (-1.31z)| norm 0.2784 (+0.09z)| lr 3.70e-04 | 4159.18 ms | 32.5% bf16 MFU | 125833 tok/s step 8719/19560 | loss 3.448746 (+0.17z)| norm 0.2645 (-0.77z)| lr 3.70e-04 | 4169.43 ms | 32.4% bf16 MFU | 125829 tok/s step 8720/19560 | loss 3.412724 (-0.85z)| norm 0.2654 (-0.74z)| lr 3.70e-04 | 4168.01 ms | 32.4% bf16 MFU | 125827 tok/s step 8721/19560 | loss 3.478371 (+1.01z)| norm 0.2590 (-1.12z)| lr 3.70e-04 | 4169.31 ms | 32.4% bf16 MFU | 125823 tok/s step 8722/19560 | loss 3.416949 (-0.73z)| norm 0.2649 (-0.76z)| lr 3.70e-04 | 4168.15 ms | 32.4% bf16 MFU | 125821 tok/s step 8723/19560 | loss 3.486263 (+1.22z)| norm 0.2478 (-1.80z)| lr 3.70e-04 | 4172.89 ms | 32.4% bf16 MFU | 125812 tok/s step 
8724/19560 | loss 3.401565 (-1.17z)| norm 0.2703 (-0.41z)| lr 3.70e-04 | 4159.42 ms | 32.5% bf16 MFU | 125824 tok/s step 8725/19560 | loss 3.408932 (-0.94z)| norm 0.2587 (-1.12z)| lr 3.70e-04 | 4168.77 ms | 32.4% bf16 MFU | 125821 tok/s step 8726/19560 | loss 3.454904 (+0.35z)| norm 0.2963 (+1.19z)| lr 3.70e-04 | 4168.47 ms | 32.4% bf16 MFU | 125819 tok/s step 8727/19560 | loss 3.430706 (-0.34z)| norm 0.2553 (-1.31z)| lr 3.70e-04 | 4165.81 ms | 32.4% bf16 MFU | 125820 tok/s step 8728/19560 | loss 3.437034 (-0.16z)| norm 0.2683 (-0.51z)| lr 3.69e-04 | 4167.56 ms | 32.4% bf16 MFU | 125820 tok/s step 8729/19560 | loss 3.489793 (+1.32z)| norm 0.3036 (+1.62z)| lr 3.69e-04 | 4164.30 ms | 32.4% bf16 MFU | 125824 tok/s step 8730/19560 | loss 3.408017 (-1.00z)| norm 0.3012 (+1.45z)| lr 3.69e-04 | 4179.08 ms | 32.3% bf16 MFU | 125805 tok/s step 8731/19560 | loss 3.534463 (+2.51z)| norm 0.2820 (+0.30z)| lr 3.69e-04 | 4160.53 ms | 32.5% bf16 MFU | 125816 tok/s step 8732/19560 | loss 3.433455 (-0.30z)| norm 0.3225 (+2.65z)| lr 3.69e-04 | 4157.35 ms | 32.5% bf16 MFU | 125830 tok/s step 8733/19560 | loss 3.451570 (+0.21z)| norm 0.3059 (+1.65z)| lr 3.69e-04 | 4163.55 ms | 32.4% bf16 MFU | 125835 tok/s step 8734/19560 | loss 3.406059 (-1.07z)| norm 0.2960 (+1.05z)| lr 3.69e-04 | 4151.30 ms | 32.5% bf16 MFU | 125858 tok/s step 8735/19560 | loss 3.389641 (-1.51z)| norm 0.2964 (+1.07z)| lr 3.69e-04 | 4164.05 ms | 32.4% bf16 MFU | 125861 tok/s step 8736/19560 | loss 3.428138 (-0.43z)| norm 0.3162 (+2.16z)| lr 3.69e-04 | 4155.37 ms | 32.5% bf16 MFU | 125876 tok/s step 8737/19560 | loss 3.365566 (-2.12z)| norm 0.2724 (-0.33z)| lr 3.69e-04 | 4166.95 ms | 32.4% bf16 MFU | 125873 tok/s step 8738/19560 | loss 3.479494 (+0.98z)| norm 0.2931 (+0.84z)| lr 3.69e-04 | 4168.12 ms | 32.4% bf16 MFU | 125869 tok/s step 8739/19560 | loss 3.441203 (-0.06z)| norm 0.2831 (+0.28z)| lr 3.69e-04 | 4154.02 ms | 32.5% bf16 MFU | 125886 tok/s step 8740/19560 | loss 3.442501 (-0.03z)| norm 0.2769 (-0.08z)| lr 
3.69e-04 | 4165.66 ms | 32.4% bf16 MFU | 125885 tok/s step 8741/19560 | loss 3.366834 (-2.05z)| norm 0.2887 (+0.60z)| lr 3.69e-04 | 4165.20 ms | 32.4% bf16 MFU | 125884 tok/s step 8742/19560 | loss 3.441959 (-0.02z)| norm 0.2567 (-1.24z)| lr 3.69e-04 | 4167.19 ms | 32.4% bf16 MFU | 125881 tok/s step 8743/19560 | loss 3.489092 (+1.24z)| norm 0.2981 (+1.16z)| lr 3.69e-04 | 4166.69 ms | 32.4% bf16 MFU | 125878 tok/s step 8744/19560 | loss 3.385307 (-1.54z)| norm 0.2717 (-0.37z)| lr 3.69e-04 | 4154.73 ms | 32.5% bf16 MFU | 125894 tok/s step 8745/19560 | loss 3.409337 (-0.89z)| norm 0.2628 (-0.87z)| lr 3.69e-04 | 4165.57 ms | 32.4% bf16 MFU | 125892 tok/s step 8746/19560 | loss 3.536637 (+2.42z)| norm 0.2778 (-0.01z)| lr 3.69e-04 | 4168.79 ms | 32.4% bf16 MFU | 125886 tok/s step 8747/19560 | loss 3.412078 (-0.83z)| norm 0.3058 (+1.58z)| lr 3.69e-04 | 4167.27 ms | 32.4% bf16 MFU | 125882 tok/s step 8748/19560 | loss 3.446331 (+0.06z)| norm 0.3094 (+1.76z)| lr 3.69e-04 | 4159.48 ms | 32.5% bf16 MFU | 125890 tok/s step 8749/19560 | loss 3.458634 (+0.37z)| norm 0.2983 (+1.11z)| lr 3.68e-04 | 4158.02 ms | 32.5% bf16 MFU | 125900 tok/s step 8750/19560 | loss 3.459581 (+0.39z)| norm 0.3018 (+1.29z)| lr 3.68e-04 | 4148.30 ms | 32.5% bf16 MFU | 125925 tok/s val loss 3.417643
HellaSwag: 2867/10042 = 0.285501 step 8751/19560 | loss 3.417146 (-0.72z)| norm 0.2722 (-0.36z)| lr 3.68e-04 | 4197.36 ms | 32.2% bf16 MFU | 125874 tok/s step 8752/19560 | loss 3.456316 (+0.31z)| norm 0.3108 (+1.80z)| lr 3.68e-04 | 4295.73 ms | 31.4% bf16 MFU | 125683 tok/s step 8753/19560 | loss 3.436871 (-0.18z)| norm 0.2528 (-1.45z)| lr 3.68e-04 | 4304.40 ms | 31.4% bf16 MFU | 125489 tok/s step 8754/19560 | loss 3.424179 (-0.51z)| norm 0.2830 (+0.25z)| lr 3.68e-04 | 4222.39 ms | 32.0% bf16 MFU | 125423 tok/s step 8755/19560 | loss 3.371589 (-1.90z)| norm 0.2731 (-0.32z)| lr 3.68e-04 | 4151.00 ms | 32.5% bf16 MFU | 125467 tok/s step 8756/19560 | loss 3.379284 (-1.67z)| norm 0.3060 (+1.52z)| lr 3.68e-04 | 4153.43 ms | 32.5% bf16 MFU | 125505 tok/s step 8757/19560 | loss 
3.443850 (+0.04z)| norm 0.2880 (+0.50z)| lr 3.68e-04 | 4166.20 ms | 32.4% bf16 MFU | 125522 tok/s step 8758/19560 | loss 3.437768 (-0.13z)| norm 0.2911 (+0.66z)| lr 3.68e-04 | 4159.17 ms | 32.5% bf16 MFU | 125548 tok/s step 8759/19560 | loss 3.437250 (-0.14z)| norm 0.2683 (-0.63z)| lr 3.68e-04 | 4161.26 ms | 32.4% bf16 MFU | 125571 tok/s step 8760/19560 | loss 3.399573 (-1.14z)| norm 0.3022 (+1.26z)| lr 3.68e-04 | 4158.89 ms | 32.5% bf16 MFU | 125595 tok/s step 8761/19560 | loss 3.441814 (-0.01z)| norm 0.2781 (-0.11z)| lr 3.68e-04 | 4157.35 ms | 32.5% bf16 MFU | 125621 tok/s step 8762/19560 | loss 3.409726 (-0.85z)| norm 0.2875 (+0.42z)| lr 3.68e-04 | 4174.55 ms | 32.3% bf16 MFU | 125620 tok/s step 8763/19560 | loss 3.433732 (-0.22z)| norm 0.3008 (+1.16z)| lr 3.68e-04 | 4161.16 ms | 32.4% bf16 MFU | 125638 tok/s step 8764/19560 | loss 3.438779 (-0.08z)| norm 0.2696 (-0.64z)| lr 3.68e-04 | 4161.63 ms | 32.4% bf16 MFU | 125656 tok/s step 8765/19560 | loss 3.507172 (+1.71z)| norm 0.3129 (+1.82z)| lr 3.68e-04 | 4163.87 ms | 32.4% bf16 MFU | 125668 tok/s step 8766/19560 | loss 3.471063 (+0.74z)| norm 0.2560 (-1.43z)| lr 3.68e-04 | 4190.40 ms | 32.2% bf16 MFU | 125641 tok/s step 8767/19560 | loss 3.477532 (+0.90z)| norm 0.3391 (+3.16z)| lr 3.68e-04 | 4270.82 ms | 31.6% bf16 MFU | 125497 tok/s step 8768/19560 | loss 3.490062 (+1.23z)| norm 0.2785 (-0.17z)| lr 3.68e-04 | 4177.95 ms | 32.3% bf16 MFU | 125496 tok/s step 8769/19560 | loss 3.452229 (+0.22z)| norm 0.2958 (+0.78z)| lr 3.67e-04 | 4161.47 ms | 32.4% bf16 MFU | 125521 tok/s step 8770/19560 | loss 3.378975 (-1.70z)| norm 0.2932 (+0.63z)| lr 3.67e-04 | 4159.57 ms | 32.5% bf16 MFU | 125547 tok/s step 8771/19560 | loss 3.474326 (+0.80z)| norm 0.3040 (+1.24z)| lr 3.67e-04 | 4163.59 ms | 32.4% bf16 MFU | 125566 tok/s step 8772/19560 | loss 3.445529 (+0.04z)| norm 0.2830 (+0.07z)| lr 3.67e-04 | 4147.78 ms | 32.6% bf16 MFU | 125608 tok/s step 8773/19560 | loss 3.414211 (-0.76z)| norm 0.2892 (+0.40z)| lr 3.67e-04 | 4162.03 
ms | 32.4% bf16 MFU | 125626 tok/s step 8774/19560 | loss 3.423391 (-0.51z)| norm 0.2964 (+0.79z)| lr 3.67e-04 | 4162.93 ms | 32.4% bf16 MFU | 125642 tok/s step 8775/19560 | loss 3.510062 (+1.75z)| norm 0.2963 (+0.79z)| lr 3.67e-04 | 4162.39 ms | 32.4% bf16 MFU | 125657 tok/s step 8776/19560 | loss 3.444080 (+0.00z)| norm 0.2692 (-0.72z)| lr 3.67e-04 | 4211.04 ms | 32.1% bf16 MFU | 125600 tok/s step 8777/19560 | loss 3.372841 (-1.84z)| norm 0.2826 (+0.02z)| lr 3.67e-04 | 4165.86 ms | 32.4% bf16 MFU | 125612 tok/s step 8778/19560 | loss 3.430156 (-0.33z)| norm 0.2814 (-0.05z)| lr 3.67e-04 | 4161.98 ms | 32.4% bf16 MFU | 125630 tok/s step 8779/19560 | loss 3.450882 (+0.20z)| norm 0.2663 (-0.89z)| lr 3.67e-04 | 4159.30 ms | 32.5% bf16 MFU | 125651 tok/s step 8780/19560 | loss 3.462461 (+0.50z)| norm 0.3112 (+1.59z)| lr 3.67e-04 | 4163.15 ms | 32.4% bf16 MFU | 125666 tok/s step 8781/19560 | loss 3.459225 (+0.42z)| norm 0.2718 (-0.59z)| lr 3.67e-04 | 4159.26 ms | 32.5% bf16 MFU | 125685 tok/s step 8782/19560 | loss 3.457827 (+0.38z)| norm 0.3234 (+2.21z)| lr 3.67e-04 | 4188.24 ms | 32.2% bf16 MFU | 125660 tok/s step 8783/19560 | loss 3.453221 (+0.25z)| norm 0.3096 (+1.43z)| lr 3.67e-04 | 4160.86 ms | 32.4% bf16 MFU | 125677 tok/s step 8784/19560 | loss 3.372876 (-1.85z)| norm 0.2611 (-1.20z)| lr 3.67e-04 | 4155.84 ms | 32.5% bf16 MFU | 125701 tok/s step 8785/19560 | loss 3.444211 (+0.04z)| norm 0.3040 (+1.11z)| lr 3.67e-04 | 4176.93 ms | 32.3% bf16 MFU | 125692 tok/s step 8786/19560 | loss 3.489756 (+1.24z)| norm 0.2730 (-0.56z)| lr 3.67e-04 | 4152.91 ms | 32.5% bf16 MFU | 125720 tok/s step 8787/19560 | loss 3.483630 (+1.06z)| norm 0.2776 (-0.31z)| lr 3.67e-04 | 4157.52 ms | 32.5% bf16 MFU | 125739 tok/s step 8788/19560 | loss 3.523903 (+2.09z)| norm 0.3074 (+1.30z)| lr 3.67e-04 | 4161.21 ms | 32.4% bf16 MFU | 125752 tok/s step 8789/19560 | loss 3.472119 (+0.73z)| norm 0.2618 (-1.14z)| lr 3.67e-04 | 4157.03 ms | 32.5% bf16 MFU | 125770 tok/s step 8790/19560 | loss 
3.430572 (-0.35z)| norm 0.2898 (+0.35z)| lr 3.66e-04 | 4156.46 ms | 32.5% bf16 MFU | 125789 tok/s step 8791/19560 | loss 3.428734 (-0.39z)| norm 0.2703 (-0.68z)| lr 3.66e-04 | 4161.82 ms | 32.4% bf16 MFU | 125798 tok/s step 8792/19560 | loss 3.450973 (+0.18z)| norm 0.2802 (-0.16z)| lr 3.66e-04 | 4157.40 ms | 32.5% bf16 MFU | 125814 tok/s step 8793/19560 | loss 3.433140 (-0.28z)| norm 0.2792 (-0.21z)| lr 3.66e-04 | 4151.71 ms | 32.5% bf16 MFU | 125837 tok/s step 8794/19560 | loss 3.444721 (+0.01z)| norm 0.2857 (+0.14z)| lr 3.66e-04 | 4159.55 ms | 32.5% bf16 MFU | 125847 tok/s step 8795/19560 | loss 3.416314 (-0.73z)| norm 0.2908 (+0.40z)| lr 3.66e-04 | 4160.53 ms | 32.5% bf16 MFU | 125856 tok/s step 8796/19560 | loss 3.441961 (-0.06z)| norm 0.2804 (-0.15z)| lr 3.66e-04 | 4157.20 ms | 32.5% bf16 MFU | 125869 tok/s step 8797/19560 | loss 3.435765 (-0.22z)| norm 0.2781 (-0.28z)| lr 3.66e-04 | 4158.27 ms | 32.5% bf16 MFU | 125879 tok/s step 8798/19560 | loss 3.510171 (+1.73z)| norm 0.2786 (-0.26z)| lr 3.66e-04 | 4159.94 ms | 32.5% bf16 MFU | 125887 tok/s step 8799/19560 | loss 3.444812 (+0.02z)| norm 0.2681 (-0.82z)| lr 3.66e-04 | 4194.73 ms | 32.2% bf16 MFU | 125842 tok/s step 8800/19560 | loss 3.444709 (+0.01z)| norm 0.2621 (-1.14z)| lr 3.66e-04 | 4167.57 ms | 32.4% bf16 MFU | 125840 tok/s step 8801/19560 | loss 3.502522 (+1.51z)| norm 0.2805 (-0.11z)| lr 3.66e-04 | 4258.97 ms | 31.7% bf16 MFU | 125703 tok/s step 8802/19560 | loss 3.452892 (+0.20z)| norm 0.2676 (-0.83z)| lr 3.66e-04 | 4168.07 ms | 32.4% bf16 MFU | 125707 tok/s step 8803/19560 | loss 3.438667 (-0.17z)| norm 0.2693 (-0.73z)| lr 3.66e-04 | 4158.96 ms | 32.5% bf16 MFU | 125725 tok/s step 8804/19560 | loss 3.384733 (-1.57z)| norm 0.2801 (-0.13z)| lr 3.66e-04 | 4157.94 ms | 32.5% bf16 MFU | 125743 tok/s step 8805/19560 | loss 3.454845 (+0.26z)| norm 0.2902 (+0.43z)| lr 3.66e-04 | 4174.38 ms | 32.3% bf16 MFU | 125736 tok/s step 8806/19560 | loss 3.448489 (+0.09z)| norm 0.2722 (-0.57z)| lr 3.66e-04 | 4169.93 
ms | 32.4% bf16 MFU | 125736 tok/s step 8807/19560 | loss 3.397137 (-1.23z)| norm 0.2782 (-0.24z)| lr 3.66e-04 | 4154.48 ms | 32.5% bf16 MFU | 125759 tok/s step 8808/19560 | loss 3.442262 (-0.06z)| norm 0.2704 (-0.67z)| lr 3.66e-04 | 4165.35 ms | 32.4% bf16 MFU | 125764 tok/s step 8809/19560 | loss 3.468796 (+0.63z)| norm 0.2872 (+0.26z)| lr 3.66e-04 | 4169.96 ms | 32.4% bf16 MFU | 125763 tok/s step 8810/19560 | loss 3.417916 (-0.69z)| norm 0.2903 (+0.43z)| lr 3.65e-04 | 4152.83 ms | 32.5% bf16 MFU | 125787 tok/s step 8811/19560 | loss 3.466973 (+0.57z)| norm 0.3038 (+1.22z)| lr 3.65e-04 | 4169.25 ms | 32.4% bf16 MFU | 125785 tok/s step 8812/19560 | loss 3.417243 (-0.74z)| norm 0.2760 (-0.36z)| lr 3.65e-04 | 4168.64 ms | 32.4% bf16 MFU | 125784 tok/s step 8813/19560 | loss 3.439866 (-0.15z)| norm 0.2798 (-0.15z)| lr 3.65e-04 | 4163.17 ms | 32.4% bf16 MFU | 125792 tok/s step 8814/19560 | loss 3.470846 (+0.66z)| norm 0.2948 (+0.70z)| lr 3.65e-04 | 4207.91 ms | 32.1% bf16 MFU | 125732 tok/s step 8815/19560 | loss 3.429383 (-0.43z)| norm 0.2878 (+0.29z)| lr 3.65e-04 | 4166.00 ms | 32.4% bf16 MFU | 125738 tok/s step 8816/19560 | loss 3.456964 (+0.30z)| norm 0.2904 (+0.44z)| lr 3.65e-04 | 4156.45 ms | 32.5% bf16 MFU | 125758 tok/s step 8817/19560 | loss 3.365336 (-2.07z)| norm 0.2815 (-0.07z)| lr 3.65e-04 | 4155.04 ms | 32.5% bf16 MFU | 125779 tok/s step 8818/19560 | loss 3.413918 (-0.80z)| norm 0.2525 (-1.73z)| lr 3.65e-04 | 4155.69 ms | 32.5% bf16 MFU | 125798 tok/s step 8819/19560 | loss 3.478207 (+0.87z)| norm 0.3128 (+1.72z)| lr 3.65e-04 | 4158.63 ms | 32.5% bf16 MFU | 125812 tok/s step 8820/19560 | loss 3.445008 (+0.01z)| norm 0.2862 (+0.20z)| lr 3.65e-04 | 4154.78 ms | 32.5% bf16 MFU | 125831 tok/s step 8821/19560 | loss 3.493418 (+1.25z)| norm 0.2736 (-0.52z)| lr 3.65e-04 | 4192.53 ms | 32.2% bf16 MFU | 125792 tok/s step 8822/19560 | loss 3.447502 (+0.06z)| norm 0.2955 (+0.71z)| lr 3.65e-04 | 4163.22 ms | 32.4% bf16 MFU | 125799 tok/s step 8823/19560 | loss 
3.462234 (+0.47z)| norm 0.2784 (-0.27z)| lr 3.65e-04 | 4163.86 ms | 32.4% bf16 MFU | 125805 tok/s step 8824/19560 | loss 3.434508 (-0.25z)| norm 0.2691 (-0.82z)| lr 3.65e-04 | 4151.71 ms | 32.5% bf16 MFU | 125829 tok/s step 8825/19560 | loss 3.388984 (-1.46z)| norm 0.2522 (-1.78z)| lr 3.65e-04 | 4151.89 ms | 32.5% bf16 MFU | 125851 tok/s step 8826/19560 | loss 3.407360 (-0.96z)| norm 0.2790 (-0.24z)| lr 3.65e-04 | 4167.46 ms | 32.4% bf16 MFU | 125849 tok/s step 8827/19560 | loss 3.473925 (+0.80z)| norm 0.2669 (-0.94z)| lr 3.65e-04 | 4164.21 ms | 32.4% bf16 MFU | 125851 tok/s step 8828/19560 | loss 3.408566 (-0.93z)| norm 0.2850 (+0.10z)| lr 3.65e-04 | 4155.65 ms | 32.5% bf16 MFU | 125867 tok/s step 8829/19560 | loss 3.475793 (+0.85z)| norm 0.2812 (-0.11z)| lr 3.65e-04 | 4148.22 ms | 32.5% bf16 MFU | 125893 tok/s step 8830/19560 | loss 3.416202 (-0.71z)| norm 0.2528 (-1.74z)| lr 3.65e-04 | 4154.32 ms | 32.5% bf16 MFU | 125909 tok/s step 8831/19560 | loss 3.392442 (-1.32z)| norm 0.2696 (-0.77z)| lr 3.64e-04 | 4809.23 ms | 28.1% bf16 MFU | 125064 tok/s step 8832/19560 | loss 3.404478 (-1.00z)| norm 0.2704 (-0.72z)| lr 3.64e-04 | 4157.28 ms | 32.5% bf16 MFU | 125116 tok/s step 8833/19560 | loss 3.495736 (+1.39z)| norm 0.2720 (-0.63z)| lr 3.64e-04 | 4153.13 ms | 32.5% bf16 MFU | 125173 tok/s step 8834/19560 | loss 3.446209 (+0.11z)| norm 0.2789 (-0.23z)| lr 3.64e-04 | 4158.52 ms | 32.5% bf16 MFU | 125218 tok/s step 8835/19560 | loss 3.467440 (+0.66z)| norm 0.2743 (-0.51z)| lr 3.64e-04 | 4164.32 ms | 32.4% bf16 MFU | 125252 tok/s step 8836/19560 | loss 3.456503 (+0.37z)| norm 0.2915 (+0.49z)| lr 3.64e-04 | 4155.05 ms | 32.5% bf16 MFU | 125298 tok/s step 8837/19560 | loss 3.484413 (+1.10z)| norm 0.2689 (-0.83z)| lr 3.64e-04 | 4150.99 ms | 32.5% bf16 MFU | 125349 tok/s step 8838/19560 | loss 3.486131 (+1.14z)| norm 0.3143 (+1.78z)| lr 3.64e-04 | 4149.92 ms | 32.5% bf16 MFU | 125398 tok/s step 8839/19560 | loss 3.486837 (+1.20z)| norm 0.2683 (-0.86z)| lr 3.64e-04 | 4151.51 
ms | 32.5% bf16 MFU | 125443 tok/s step 8840/19560 | loss 3.426062 (-0.47z)| norm 0.2799 (-0.18z)| lr 3.64e-04 | 4156.73 ms | 32.5% bf16 MFU | 125477 tok/s step 8841/19560 | loss 3.486382 (+1.17z)| norm 0.2736 (-0.55z)| lr 3.64e-04 | 4159.13 ms | 32.5% bf16 MFU | 125506 tok/s step 8842/19560 | loss 3.440550 (-0.07z)| norm 0.2672 (-0.91z)| lr 3.64e-04 | 4162.23 ms | 32.4% bf16 MFU | 125529 tok/s step 8843/19560 | loss 3.393780 (-1.33z)| norm 0.2716 (-0.65z)| lr 3.64e-04 | 4150.89 ms | 32.5% bf16 MFU | 125568 tok/s step 8844/19560 | loss 3.407906 (-0.93z)| norm 0.2809 (-0.11z)| lr 3.64e-04 | 4173.47 ms | 32.4% bf16 MFU | 125571 tok/s step 8845/19560 | loss 3.443796 (+0.06z)| norm 0.2632 (-1.13z)| lr 3.64e-04 | 4160.25 ms | 32.5% bf16 MFU | 125593 tok/s step 8846/19560 | loss 3.435915 (-0.17z)| norm 0.2737 (-0.52z)| lr 3.64e-04 | 4149.66 ms | 32.5% bf16 MFU | 125631 tok/s step 8847/19560 | loss 3.392162 (-1.36z)| norm 0.2863 (+0.21z)| lr 3.64e-04 | 4160.07 ms | 32.5% bf16 MFU | 125651 tok/s step 8848/19560 | loss 3.443572 (+0.05z)| norm 0.2852 (+0.14z)| lr 3.64e-04 | 4154.79 ms | 32.5% bf16 MFU | 125678 tok/s step 8849/19560 | loss 3.442619 (+0.03z)| norm 0.3029 (+1.15z)| lr 3.64e-04 | 4156.37 ms | 32.5% bf16 MFU | 125701 tok/s step 8850/19560 | loss 3.374761 (-1.82z)| norm 0.3660 (+4.45z)| lr 3.64e-04 | 4160.45 ms | 32.5% bf16 MFU | 125717 tok/s step 8851/19560 | loss 3.417539 (-0.64z)| norm 0.3167 (+1.75z)| lr 3.63e-04 | 4166.52 ms | 32.4% bf16 MFU | 125722 tok/s step 8852/19560 | loss 3.486895 (+1.25z)| norm 0.2950 (+0.56z)| lr 3.63e-04 | 4203.41 ms | 32.1% bf16 MFU | 125673 tok/s step 8853/19560 | loss 3.468018 (+0.72z)| norm 0.3121 (+1.47z)| lr 3.63e-04 | 4165.07 ms | 32.4% bf16 MFU | 125683 tok/s step 8854/19560 | loss 3.564239 (+3.21z)| norm 0.3284 (+2.30z)| lr 3.63e-04 | 4173.89 ms | 32.3% bf16 MFU | 125679 tok/s step 8855/19560 | loss 3.481128 (+1.00z)| norm 0.2865 (+0.05z)| lr 3.63e-04 | 4157.60 ms | 32.5% bf16 MFU | 125701 tok/s step 8856/19560 | loss 
3.410168 (-0.86z)| norm 0.2840 (-0.10z)| lr 3.63e-04 | 4168.36 ms | 32.4% bf16 MFU | 125704 tok/s step 8857/19560 | loss 3.488761 (+1.20z)| norm 0.3193 (+1.79z)| lr 3.63e-04 | 4161.72 ms | 32.4% bf16 MFU | 125718 tok/s step 8858/19560 | loss 3.440771 (-0.06z)| norm 0.2939 (+0.43z)| lr 3.63e-04 | 4164.35 ms | 32.4% bf16 MFU | 125727 tok/s step 8859/19560 | loss 3.506751 (+1.70z)| norm 0.3160 (+1.59z)| lr 3.63e-04 | 4168.50 ms | 32.4% bf16 MFU | 125730 tok/s step 8860/19560 | loss 3.510297 (+1.76z)| norm 0.3327 (+2.45z)| lr 3.63e-04 | 4154.32 ms | 32.5% bf16 MFU | 125753 tok/s step 8861/19560 | loss 3.441067 (-0.06z)| norm 0.2783 (-0.41z)| lr 3.63e-04 | 4157.38 ms | 32.5% bf16 MFU | 125771 tok/s step 8862/19560 | loss 3.467675 (+0.63z)| norm 0.3050 (+1.00z)| lr 3.63e-04 | 4160.29 ms | 32.5% bf16 MFU | 125784 tok/s step 8863/19560 | loss 3.421812 (-0.59z)| norm 0.3043 (+0.96z)| lr 3.63e-04 | 4157.16 ms | 32.5% bf16 MFU | 125800 tok/s step 8864/19560 | loss 3.473135 (+0.76z)| norm 0.2895 (+0.19z)| lr 3.63e-04 | 4163.70 ms | 32.4% bf16 MFU | 125806 tok/s step 8865/19560 | loss 3.432114 (-0.35z)| norm 0.3065 (+1.08z)| lr 3.63e-04 | 4165.42 ms | 32.4% bf16 MFU | 125809 tok/s step 8866/19560 | loss 3.400621 (-1.18z)| norm 0.2814 (-0.25z)| lr 3.63e-04 | 4165.69 ms | 32.4% bf16 MFU | 125812 tok/s step 8867/19560 | loss 3.440592 (-0.10z)| norm 0.2845 (-0.08z)| lr 3.63e-04 | 4170.97 ms | 32.4% bf16 MFU | 125806 tok/s step 8868/19560 | loss 3.436569 (-0.21z)| norm 0.2808 (-0.28z)| lr 3.63e-04 | 4145.74 ms | 32.6% bf16 MFU | 125839 tok/s step 8869/19560 | loss 3.432980 (-0.32z)| norm 0.2724 (-0.72z)| lr 3.63e-04 | 4157.02 ms | 32.5% bf16 MFU | 125853 tok/s step 8870/19560 | loss 3.417668 (-0.74z)| norm 0.2705 (-0.83z)| lr 3.63e-04 | 4155.36 ms | 32.5% bf16 MFU | 125869 tok/s step 8871/19560 | loss 3.478565 (+0.93z)| norm 0.2733 (-0.67z)| lr 3.63e-04 | 4159.53 ms | 32.5% bf16 MFU | 125878 tok/s step 8872/19560 | loss 3.426983 (-0.50z)| norm 0.2885 (+0.13z)| lr 3.62e-04 | 4166.76 
ms | 32.4% bf16 MFU | 125875 tok/s step 8873/19560 | loss 3.418981 (-0.72z)| norm 0.2871 (+0.05z)| lr 3.62e-04 | 4162.52 ms | 32.4% bf16 MFU | 125879 tok/s step 8874/19560 | loss 3.415290 (-0.81z)| norm 0.2694 (-0.90z)| lr 3.62e-04 | 4152.13 ms | 32.5% bf16 MFU | 125899 tok/s step 8875/19560 | loss 3.459460 (+0.43z)| norm 0.2971 (+0.59z)| lr 3.62e-04 | 4157.35 ms | 32.5% bf16 MFU | 125909 tok/s step 8876/19560 | loss 3.452740 (+0.24z)| norm 0.2833 (-0.14z)| lr 3.62e-04 | 4159.14 ms | 32.5% bf16 MFU | 125917 tok/s step 8877/19560 | loss 3.437478 (-0.19z)| norm 0.2709 (-0.80z)| lr 3.62e-04 | 4150.26 ms | 32.5% bf16 MFU | 125937 tok/s step 8878/19560 | loss 3.417942 (-0.74z)| norm 0.2963 (+0.58z)| lr 3.62e-04 | 14526.02 ms | 9.3% bf16 MFU | 121445 tok/s step 8879/19560 | loss 3.390279 (-1.51z)| norm 0.2951 (+0.51z)| lr 3.62e-04 | 6610.53 ms | 20.4% bf16 MFU | 119338 tok/s step 8880/19560 | loss 3.478331 (+0.97z)| norm 0.2796 (-0.33z)| lr 3.62e-04 | 4235.20 ms | 31.9% bf16 MFU | 119561 tok/s step 8881/19560 | loss 3.523538 (+2.18z)| norm 0.2631 (-1.24z)| lr 3.62e-04 | 4290.90 ms | 31.5% bf16 MFU | 119692 tok/s step 8882/19560 | loss 3.406054 (-1.06z)| norm 0.2606 (-1.35z)| lr 3.62e-04 | 4165.77 ms | 32.4% bf16 MFU | 120000 tok/s step 8883/19560 | loss 3.474756 (+0.82z)| norm 0.2615 (-1.30z)| lr 3.62e-04 | 4178.30 ms | 32.3% bf16 MFU | 120274 tok/s step 8884/19560 | loss 3.398413 (-1.32z)| norm 0.2805 (-0.26z)| lr 3.62e-04 | 4147.53 ms | 32.6% bf16 MFU | 120581 tok/s step 8885/19560 | loss 3.431161 (-0.40z)| norm 0.2518 (-1.78z)| lr 3.62e-04 | 4165.14 ms | 32.4% bf16 MFU | 120846 tok/s step 8886/19560 | loss 3.430384 (-0.42z)| norm 0.2834 (-0.07z)| lr 3.62e-04 | 4153.15 ms | 32.5% bf16 MFU | 121115 tok/s step 8887/19560 | loss 3.485785 (+1.12z)| norm 0.2878 (+0.15z)| lr 3.62e-04 | 4145.94 ms | 32.6% bf16 MFU | 121383 tok/s step 8888/19560 | loss 3.521400 (+2.06z)| norm 0.2814 (-0.18z)| lr 3.62e-04 | 4176.54 ms | 32.3% bf16 MFU | 121590 tok/s step 8889/19560 | loss 
3.481704 (+0.96z)| norm 0.3073 (+1.20z)| lr 3.62e-04 | 4143.31 ms | 32.6% bf16 MFU | 121837 tok/s step 8890/19560 | loss 3.589824 (+3.69z)| norm 0.2902 (+0.28z)| lr 3.62e-04 | 4153.80 ms | 32.5% bf16 MFU | 122057 tok/s step 8891/19560 | loss 3.410057 (-0.99z)| norm 0.3103 (+1.35z)| lr 3.62e-04 | 4147.66 ms | 32.6% bf16 MFU | 122274 tok/s step 8892/19560 | loss 3.398980 (-1.26z)| norm 0.2933 (+0.43z)| lr 3.61e-04 | 4155.63 ms | 32.5% bf16 MFU | 122468 tok/s step 8893/19560 | loss 3.425632 (-0.56z)| norm 0.2894 (+0.23z)| lr 3.61e-04 | 4178.16 ms | 32.3% bf16 MFU | 122619 tok/s step 8894/19560 | loss 3.523776 (+1.96z)| norm 0.2962 (+0.59z)| lr 3.61e-04 | 4157.76 ms | 32.5% bf16 MFU | 122793 tok/s step 8895/19560 | loss 3.463766 (+0.42z)| norm 0.3069 (+1.22z)| lr 3.61e-04 | 4151.41 ms | 32.5% bf16 MFU | 122968 tok/s step 8896/19560 | loss 3.412436 (-0.89z)| norm 0.2725 (-0.71z)| lr 3.61e-04 | 4154.34 ms | 32.5% bf16 MFU | 123130 tok/s step 8897/19560 | loss 3.411724 (-0.89z)| norm 0.2920 (+0.39z)| lr 3.61e-04 | 4148.99 ms | 32.5% bf16 MFU | 123292 tok/s step 8898/19560 | loss 3.415533 (-0.81z)| norm 0.2573 (-1.54z)| lr 3.61e-04 | 4157.17 ms | 32.5% bf16 MFU | 123433 tok/s step 8899/19560 | loss 3.461601 (+0.39z)| norm 0.3110 (+1.45z)| lr 3.61e-04 | 4153.58 ms | 32.5% bf16 MFU | 123572 tok/s step 8900/19560 | loss 3.409132 (-0.97z)| norm 0.2872 (+0.13z)| lr 3.61e-04 | 4191.69 ms | 32.2% bf16 MFU | 123648 tok/s step 8901/19560 | loss 3.501841 (+1.41z)| norm 0.3346 (+2.67z)| lr 3.61e-04 | 4160.63 ms | 32.5% bf16 MFU | 123766 tok/s step 8902/19560 | loss 3.568832 (+3.00z)| norm 0.2793 (-0.31z)| lr 3.61e-04 | 4167.92 ms | 32.4% bf16 MFU | 123867 tok/s step 8903/19560 | loss 3.446274 (-0.04z)| norm 0.3002 (+0.81z)| lr 3.61e-04 | 4148.65 ms | 32.5% bf16 MFU | 123993 tok/s step 8904/19560 | loss 3.404964 (-1.06z)| norm 0.2657 (-1.05z)| lr 3.61e-04 | 4150.75 ms | 32.5% bf16 MFU | 124109 tok/s step 8905/19560 | loss 3.579748 (+3.18z)| norm 0.3198 (+1.83z)| lr 3.61e-04 | 4155.79 
ms | 32.5% bf16 MFU | 124211 tok/s step 8906/19560 | loss 3.425091 (-0.58z)| norm 0.2881 (+0.14z)| lr 3.61e-04 | 4149.02 ms | 32.5% bf16 MFU | 124319 tok/s step 8907/19560 | loss 3.424380 (-0.59z)| norm 0.2990 (+0.70z)| lr 3.61e-04 | 4150.93 ms | 32.5% bf16 MFU | 124418 tok/s step 8908/19560 | loss 3.455261 (+0.16z)| norm 0.2738 (-0.62z)| lr 3.61e-04 | 4155.69 ms | 32.5% bf16 MFU | 124505 tok/s step 8909/19560 | loss 3.387839 (-1.45z)| norm 0.2799 (-0.30z)| lr 3.61e-04 | 4149.23 ms | 32.5% bf16 MFU | 124598 tok/s step 8910/19560 | loss 3.424089 (-0.57z)| norm 0.2590 (-1.41z)| lr 3.61e-04 | 4153.12 ms | 32.5% bf16 MFU | 124680 tok/s step 8911/19560 | loss 3.457208 (+0.22z)| norm 0.2786 (-0.34z)| lr 3.61e-04 | 4159.71 ms | 32.5% bf16 MFU | 124748 tok/s step 8912/19560 | loss 3.426958 (-0.52z)| norm 0.3082 (+1.26z)| lr 3.60e-04 | 4168.33 ms | 32.4% bf16 MFU | 124800 tok/s step 8913/19560 | loss 3.406887 (-1.00z)| norm 0.2906 (+0.30z)| lr 3.60e-04 | 4162.89 ms | 32.4% bf16 MFU | 124857 tok/s step 8914/19560 | loss 3.470030 (+0.54z)| norm 0.2902 (+0.28z)| lr 3.60e-04 | 4157.37 ms | 32.5% bf16 MFU | 124919 tok/s step 8915/19560 | loss 3.416212 (-0.76z)| norm 0.3251 (+2.13z)| lr 3.60e-04 | 4158.46 ms | 32.5% bf16 MFU | 124977 tok/s step 8916/19560 | loss 3.406484 (-0.98z)| norm 0.2595 (-1.38z)| lr 3.60e-04 | 4168.68 ms | 32.4% bf16 MFU | 125017 tok/s step 8917/19560 | loss 3.461373 (+0.37z)| norm 0.3182 (+1.74z)| lr 3.60e-04 | 4150.00 ms | 32.5% bf16 MFU | 125083 tok/s step 8918/19560 | loss 3.430397 (-0.39z)| norm 0.3133 (+1.46z)| lr 3.60e-04 | 4161.09 ms | 32.4% bf16 MFU | 125128 tok/s step 8919/19560 | loss 3.443445 (-0.08z)| norm 0.3065 (+1.08z)| lr 3.60e-04 | 4159.56 ms | 32.5% bf16 MFU | 125174 tok/s step 8920/19560 | loss 3.393009 (-1.30z)| norm 0.2950 (+0.47z)| lr 3.60e-04 | 4153.74 ms | 32.5% bf16 MFU | 125227 tok/s step 8921/19560 | loss 3.481565 (+0.86z)| norm 0.2700 (-0.85z)| lr 3.60e-04 | 4152.68 ms | 32.5% bf16 MFU | 125278 tok/s step 8922/19560 | loss 
3.413054 (-0.81z)| norm 0.3053 (+1.00z)| lr 3.60e-04 | 4152.71 ms | 32.5% bf16 MFU | 125327 tok/s step 8923/19560 | loss 3.416128 (-0.73z)| norm 0.2573 (-1.50z)| lr 3.60e-04 | 4175.72 ms | 32.3% bf16 MFU | 125338 tok/s step 8924/19560 | loss 3.466020 (+0.48z)| norm 0.2709 (-0.78z)| lr 3.60e-04 | 4158.03 ms | 32.5% bf16 MFU | 125376 tok/s step 8925/19560 | loss 3.414566 (-0.77z)| norm 0.2922 (+0.32z)| lr 3.60e-04 | 4176.88 ms | 32.3% bf16 MFU | 125383 tok/s step 8926/19560 | loss 3.445882 (+0.00z)| norm 0.2685 (-0.91z)| lr 3.60e-04 | 4158.59 ms | 32.5% bf16 MFU | 125418 tok/s step 8927/19560 | loss 3.394602 (-1.24z)| norm 0.2734 (-0.66z)| lr 3.60e-04 | 4159.71 ms | 32.5% bf16 MFU | 125449 tok/s step 8928/19560 | loss 3.456403 (+0.27z)| norm 0.2750 (-0.58z)| lr 3.60e-04 | 4153.55 ms | 32.5% bf16 MFU | 125488 tok/s step 8929/19560 | loss 3.445745 (+0.02z)| norm 0.3080 (+1.13z)| lr 3.60e-04 | 4157.29 ms | 32.5% bf16 MFU | 125519 tok/s step 8930/19560 | loss 3.457176 (+0.30z)| norm 0.2686 (-0.93z)| lr 3.60e-04 | 4193.59 ms | 32.2% bf16 MFU | 125494 tok/s step 8931/19560 | loss 3.442361 (-0.07z)| norm 0.2909 (+0.23z)| lr 3.60e-04 | 4154.51 ms | 32.5% bf16 MFU | 125529 tok/s step 8932/19560 | loss 3.398523 (-1.15z)| norm 0.2794 (-0.37z)| lr 3.60e-04 | 4159.49 ms | 32.5% bf16 MFU | 125555 tok/s step 8933/19560 | loss 3.450166 (+0.12z)| norm 0.2964 (+0.51z)| lr 3.59e-04 | 4159.38 ms | 32.5% bf16 MFU | 125580 tok/s step 8934/19560 | loss 3.397742 (-1.15z)| norm 0.2621 (-1.27z)| lr 3.59e-04 | 4161.94 ms | 32.4% bf16 MFU | 125599 tok/s step 8935/19560 | loss 3.477626 (+0.79z)| norm 0.2942 (+0.39z)| lr 3.59e-04 | 4175.38 ms | 32.3% bf16 MFU | 125598 tok/s step 8936/19560 | loss 3.431726 (-0.33z)| norm 0.3388 (+2.62z)| lr 3.59e-04 | 4166.44 ms | 32.4% bf16 MFU | 125610 tok/s step 8937/19560 | loss 3.421604 (-0.57z)| norm 0.2960 (+0.45z)| lr 3.59e-04 | 4156.54 ms | 32.5% bf16 MFU | 125636 tok/s step 8938/19560 | loss 3.422148 (-0.56z)| norm 0.2768 (-0.52z)| lr 3.59e-04 | 4164.62 
ms | 32.4% bf16 MFU | 125649 tok/s step 8939/19560 | loss 3.399569 (-1.10z)| norm 0.2960 (+0.46z)| lr 3.59e-04 | 4718.62 ms | 28.6% bf16 MFU | 124922 tok/s step 8940/19560 | loss 3.536928 (+2.20z)| norm 0.3160 (+1.44z)| lr 3.59e-04 | 4151.94 ms | 32.5% bf16 MFU | 124989 tok/s step 8941/19560 | loss 3.498182 (+1.25z)| norm 0.2963 (+0.44z)| lr 3.59e-04 | 4152.30 ms | 32.5% bf16 MFU | 125053 tok/s step 8942/19560 | loss 3.431681 (-0.33z)| norm 0.2949 (+0.37z)| lr 3.59e-04 | 4151.50 ms | 32.5% bf16 MFU | 125115 tok/s step 8943/19560 | loss 3.430986 (-0.35z)| norm 0.2802 (-0.36z)| lr 3.59e-04 | 4162.44 ms | 32.4% bf16 MFU | 125157 tok/s step 8944/19560 | loss 3.414875 (-0.72z)| norm 0.2974 (+0.50z)| lr 3.59e-04 | 4166.95 ms | 32.4% bf16 MFU | 125190 tok/s step 8945/19560 | loss 3.459771 (+0.34z)| norm 0.3003 (+0.64z)| lr 3.59e-04 | 4155.81 ms | 32.5% bf16 MFU | 125239 tok/s step 8946/19560 | loss 3.412816 (-0.80z)| norm 0.2923 (+0.22z)| lr 3.59e-04 | 4162.53 ms | 32.4% bf16 MFU | 125274 tok/s step 8947/19560 | loss 3.445074 (-0.01z)| norm 0.2956 (+0.40z)| lr 3.59e-04 | 4152.00 ms | 32.5% bf16 MFU | 125324 tok/s step 8948/19560 | loss 3.445410 (-0.01z)| norm 0.2721 (-0.79z)| lr 3.59e-04 | 4148.20 ms | 32.5% bf16 MFU | 125378 tok/s step 8949/19560 | loss 3.583770 (+3.21z)| norm 0.2876 (-0.01z)| lr 3.59e-04 | 4157.77 ms | 32.5% bf16 MFU | 125414 tok/s step 8950/19560 | loss 3.426391 (-0.46z)| norm 0.2769 (-0.55z)| lr 3.59e-04 | 4156.12 ms | 32.5% bf16 MFU | 125450 tok/s step 8951/19560 | loss 3.436386 (-0.22z)| norm 0.2836 (-0.21z)| lr 3.59e-04 | 4157.92 ms | 32.5% bf16 MFU | 125483 tok/s step 8952/19560 | loss 3.461662 (+0.36z)| norm 0.2949 (+0.36z)| lr 3.59e-04 | 4172.80 ms | 32.4% bf16 MFU | 125491 tok/s step 8953/19560 | loss 3.472951 (+0.61z)| norm 0.2641 (-1.23z)| lr 3.58e-04 | 4283.98 ms | 31.5% bf16 MFU | 125335 tok/s step 8954/19560 | loss 3.451343 (+0.10z)| norm 0.2849 (-0.16z)| lr 3.58e-04 | 4199.06 ms | 32.2% bf16 MFU | 125311 tok/s step 8955/19560 | loss 
3.416320 (-0.72z)| norm 0.2544 (-1.72z)| lr 3.58e-04 | 4180.29 ms | 32.3% bf16 MFU | 125317 tok/s step 8956/19560 | loss 3.379989 (-1.56z)| norm 0.2705 (-0.89z)| lr 3.58e-04 | 4160.97 ms | 32.4% bf16 MFU | 125351 tok/s step 8957/19560 | loss 3.404225 (-0.98z)| norm 0.2820 (-0.30z)| lr 3.58e-04 | 4163.52 ms | 32.4% bf16 MFU | 125380 tok/s step 8958/19560 | loss 3.483144 (+0.86z)| norm 0.2690 (-0.98z)| lr 3.58e-04 | 4146.81 ms | 32.6% bf16 MFU | 125432 tok/s step 8959/19560 | loss 3.432128 (-0.34z)| norm 0.2858 (-0.12z)| lr 3.58e-04 | 4158.98 ms | 32.5% bf16 MFU | 125464 tok/s step 8960/19560 | loss 3.464706 (+0.41z)| norm 0.2591 (-1.49z)| lr 3.58e-04 | 4151.23 ms | 32.5% bf16 MFU | 125505 tok/s step 8961/19560 | loss 3.590010 (+3.22z)| norm 0.2946 (+0.33z)| lr 3.58e-04 | 4145.87 ms | 32.6% bf16 MFU | 125553 tok/s step 8962/19560 | loss 3.421473 (-0.60z)| norm 0.2861 (-0.11z)| lr 3.58e-04 | 4170.75 ms | 32.4% bf16 MFU | 125561 tok/s step 8963/19560 | loss 3.421481 (-0.59z)| norm 0.2798 (-0.44z)| lr 3.58e-04 | 4164.91 ms | 32.4% bf16 MFU | 125577 tok/s step 8964/19560 | loss 3.395918 (-1.15z)| norm 0.2825 (-0.30z)| lr 3.58e-04 | 4162.65 ms | 32.4% bf16 MFU | 125596 tok/s step 8965/19560 | loss 3.456586 (+0.22z)| norm 0.3068 (+0.95z)| lr 3.58e-04 | 4157.62 ms | 32.5% bf16 MFU | 125621 tok/s step 8966/19560 | loss 3.540531 (+2.08z)| norm 0.2844 (-0.20z)| lr 3.58e-04 | 4164.01 ms | 32.4% bf16 MFU | 125635 tok/s step 8967/19560 | loss 3.435523 (-0.25z)| norm 0.3136 (+1.30z)| lr 3.58e-04 | 4155.06 ms | 32.5% bf16 MFU | 125663 tok/s step 8968/19560 | loss 3.490342 (+0.96z)| norm 0.2887 (-0.00z)| lr 3.58e-04 | 4154.18 ms | 32.5% bf16 MFU | 125690 tok/s step 8969/19560 | loss 3.448496 (+0.03z)| norm 0.3784 (+4.29z)| lr 3.58e-04 | 4154.05 ms | 32.5% bf16 MFU | 125716 tok/s step 8970/19560 | loss 3.403505 (-0.96z)| norm 0.3054 (+0.75z)| lr 3.58e-04 | 4158.12 ms | 32.5% bf16 MFU | 125734 tok/s step 8971/19560 | loss 3.400310 (-1.04z)| norm 0.3086 (+0.89z)| lr 3.58e-04 | 4166.99 
ms | 32.4% bf16 MFU | 125739 tok/s step 8972/19560 | loss 3.468080 (+0.47z)| norm 0.3204 (+1.44z)| lr 3.58e-04 | 4166.41 ms | 32.4% bf16 MFU | 125744 tok/s step 8973/19560 | loss 3.482275 (+0.77z)| norm 0.2996 (+0.43z)| lr 3.58e-04 | 4225.71 ms | 32.0% bf16 MFU | 125660 tok/s step 8974/19560 | loss 3.490091 (+0.94z)| norm 0.3123 (+1.03z)| lr 3.57e-04 | 4174.49 ms | 32.3% bf16 MFU | 125657 tok/s step 8975/19560 | loss 3.445977 (-0.05z)| norm 0.2890 (-0.09z)| lr 3.57e-04 | 4168.83 ms | 32.4% bf16 MFU | 125662 tok/s step 8976/19560 | loss 3.387855 (-1.33z)| norm 0.2804 (-0.51z)| lr 3.57e-04 | 4158.52 ms | 32.5% bf16 MFU | 125683 tok/s step 8977/19560 | loss 3.426308 (-0.48z)| norm 0.2899 (-0.04z)| lr 3.57e-04 | 4157.91 ms | 32.5% bf16 MFU | 125703 tok/s step 8978/19560 | loss 3.411219 (-0.82z)| norm 0.2871 (-0.16z)| lr 3.57e-04 | 4192.76 ms | 32.2% bf16 MFU | 125670 tok/s step 8979/19560 | loss 3.490507 (+0.93z)| norm 0.2660 (-1.22z)| lr 3.57e-04 | 4160.00 ms | 32.5% bf16 MFU | 125688 tok/s step 8980/19560 | loss 3.438074 (-0.23z)| norm 0.2870 (-0.14z)| lr 3.57e-04 | 4148.81 ms | 32.5% bf16 MFU | 125723 tok/s step 8981/19560 | loss 3.422220 (-0.57z)| norm 0.2694 (-1.02z)| lr 3.57e-04 | 4147.93 ms | 32.6% bf16 MFU | 125756 tok/s step 8982/19560 | loss 3.469014 (+0.50z)| norm 0.2906 (+0.08z)| lr 3.57e-04 | 4160.66 ms | 32.5% bf16 MFU | 125769 tok/s step 8983/19560 | loss 3.519804 (+1.65z)| norm 0.3148 (+1.31z)| lr 3.57e-04 | 4152.23 ms | 32.5% bf16 MFU | 125794 tok/s step 8984/19560 | loss 3.468226 (+0.46z)| norm 0.2848 (-0.24z)| lr 3.57e-04 | 4188.34 ms | 32.2% bf16 MFU | 125763 tok/s step 8985/19560 | loss 3.412004 (-0.81z)| norm 0.2764 (-0.66z)| lr 3.57e-04 | 4160.66 ms | 32.5% bf16 MFU | 125775 tok/s step 8986/19560 | loss 3.501066 (+1.21z)| norm 0.2875 (-0.08z)| lr 3.57e-04 | 4157.60 ms | 32.5% bf16 MFU | 125792 tok/s step 8987/19560 | loss 3.374151 (-1.64z)| norm 0.3053 (+0.86z)| lr 3.57e-04 | 4154.21 ms | 32.5% bf16 MFU | 125813 tok/s step 8988/19560 | loss 
3.531678 (+1.90z)| norm 0.2949 (+0.33z)| lr 3.57e-04 | 4163.29 ms | 32.4% bf16 MFU | 125819 tok/s step 8989/19560 | loss 3.470646 (+0.53z)| norm 0.6550 (+9.74z)| lr 3.57e-04 | 4147.81 ms | 32.6% bf16 MFU | 125848 tok/s step 8990/19560 | loss 3.456403 (+0.21z)| norm 0.3883 (+2.51z)| lr 3.57e-04 | 4149.06 ms | 32.5% bf16 MFU | 125873 tok/s step 8991/19560 | loss 3.432784 (-0.32z)| norm 0.3441 (+1.34z)| lr 3.57e-04 | 4158.58 ms | 32.5% bf16 MFU | 125883 tok/s step 8992/19560 | loss 3.482056 (+0.78z)| norm 0.3076 (+0.39z)| lr 3.57e-04 | 4745.11 ms | 28.5% bf16 MFU | 125114 tok/s step 8993/19560 | loss 3.496877 (+1.10z)| norm 0.3188 (+0.67z)| lr 3.57e-04 | 4152.67 ms | 32.5% bf16 MFU | 125171 tok/s step 8994/19560 | loss 3.428758 (-0.43z)| norm 0.3118 (+0.49z)| lr 3.56e-04 | 4160.78 ms | 32.4% bf16 MFU | 125213 tok/s step 8995/19560 | loss 3.404513 (-0.97z)| norm 0.3047 (+0.30z)| lr 3.56e-04 | 4151.89 ms | 32.5% bf16 MFU | 125266 tok/s step 8996/19560 | loss 3.441995 (-0.13z)| norm 0.3124 (+0.49z)| lr 3.56e-04 | 4181.60 ms | 32.3% bf16 MFU | 125272 tok/s step 8997/19560 | loss 3.462693 (+0.33z)| norm 0.2886 (-0.13z)| lr 3.56e-04 | 4165.64 ms | 32.4% bf16 MFU | 125301 tok/s step 8998/19560 | loss 3.403202 (-1.00z)| norm 0.3015 (+0.20z)| lr 3.56e-04 | 4159.16 ms | 32.5% bf16 MFU | 125339 tok/s step 8999/19560 | loss 3.464080 (+0.37z)| norm 0.2915 (-0.06z)| lr 3.56e-04 | 4165.99 ms | 32.4% bf16 MFU | 125364 tok/s step 9000/19560 | loss 3.450782 (+0.06z)| norm 0.2963 (+0.06z)| lr 3.56e-04 | 4154.07 ms | 32.5% bf16 MFU | 125407 tok/s val loss 3.414019 evaluating HellaSwag: 0/1256 evaluating HellaSwag: 10/1256 evaluating HellaSwag: 20/1256 evaluating HellaSwag: 30/1256 evaluating HellaSwag: 40/1256 evaluating HellaSwag: 50/1256 evaluating HellaSwag: 60/1256 evaluating HellaSwag: 70/1256 evaluating HellaSwag: 80/1256 evaluating HellaSwag: 90/1256 evaluating HellaSwag: 100/1256 evaluating HellaSwag: 110/1256 evaluating HellaSwag: 120/1256 evaluating HellaSwag: 130/1256 
evaluating HellaSwag: 140/1256 evaluating HellaSwag: 150/1256 evaluating HellaSwag: 160/1256 evaluating HellaSwag: 170/1256 evaluating HellaSwag: 180/1256 evaluating HellaSwag: 190/1256 evaluating HellaSwag: 200/1256 evaluating HellaSwag: 210/1256 evaluating HellaSwag: 220/1256 evaluating HellaSwag: 230/1256 evaluating HellaSwag: 240/1256 evaluating HellaSwag: 250/1256 evaluating HellaSwag: 260/1256 evaluating HellaSwag: 270/1256 evaluating HellaSwag: 280/1256 evaluating HellaSwag: 290/1256 evaluating HellaSwag: 300/1256 evaluating HellaSwag: 310/1256 evaluating HellaSwag: 320/1256 evaluating HellaSwag: 330/1256 evaluating HellaSwag: 340/1256 evaluating HellaSwag: 350/1256 evaluating HellaSwag: 360/1256 evaluating HellaSwag: 370/1256 evaluating HellaSwag: 380/1256 evaluating HellaSwag: 390/1256 evaluating HellaSwag: 400/1256 evaluating HellaSwag: 410/1256 evaluating HellaSwag: 420/1256 evaluating HellaSwag: 430/1256 evaluating HellaSwag: 440/1256 evaluating HellaSwag: 450/1256 evaluating HellaSwag: 460/1256 evaluating HellaSwag: 470/1256 evaluating HellaSwag: 480/1256 evaluating HellaSwag: 490/1256 evaluating HellaSwag: 500/1256 evaluating HellaSwag: 510/1256 evaluating HellaSwag: 520/1256 evaluating HellaSwag: 530/1256 evaluating HellaSwag: 540/1256 evaluating HellaSwag: 550/1256 evaluating HellaSwag: 560/1256 evaluating HellaSwag: 570/1256 evaluating HellaSwag: 580/1256 evaluating HellaSwag: 590/1256 evaluating HellaSwag: 600/1256 evaluating HellaSwag: 610/1256 evaluating HellaSwag: 620/1256 evaluating HellaSwag: 630/1256 evaluating HellaSwag: 640/1256 evaluating HellaSwag: 650/1256 evaluating HellaSwag: 660/1256 evaluating HellaSwag: 670/1256 evaluating HellaSwag: 680/1256 evaluating HellaSwag: 690/1256 evaluating HellaSwag: 700/1256 evaluating HellaSwag: 710/1256 evaluating HellaSwag: 720/1256 evaluating HellaSwag: 730/1256 evaluating HellaSwag: 740/1256 evaluating HellaSwag: 750/1256 evaluating HellaSwag: 760/1256 evaluating HellaSwag: 770/1256 evaluating 
HellaSwag: 780/1256 evaluating HellaSwag: 790/1256 evaluating HellaSwag: 800/1256 evaluating HellaSwag: 810/1256 evaluating HellaSwag: 820/1256 evaluating HellaSwag: 830/1256 evaluating HellaSwag: 840/1256 evaluating HellaSwag: 850/1256 evaluating HellaSwag: 860/1256 evaluating HellaSwag: 870/1256 evaluating HellaSwag: 880/1256 evaluating HellaSwag: 890/1256 evaluating HellaSwag: 900/1256 evaluating HellaSwag: 910/1256 evaluating HellaSwag: 920/1256 evaluating HellaSwag: 930/1256 evaluating HellaSwag: 940/1256 evaluating HellaSwag: 950/1256 evaluating HellaSwag: 960/1256 evaluating HellaSwag: 970/1256 evaluating HellaSwag: 980/1256 evaluating HellaSwag: 990/1256 evaluating HellaSwag: 1000/1256 evaluating HellaSwag: 1010/1256 evaluating HellaSwag: 1020/1256 evaluating HellaSwag: 1030/1256 evaluating HellaSwag: 1040/1256 evaluating HellaSwag: 1050/1256 evaluating HellaSwag: 1060/1256 evaluating HellaSwag: 1070/1256 evaluating HellaSwag: 1080/1256 evaluating HellaSwag: 1090/1256 evaluating HellaSwag: 1100/1256 evaluating HellaSwag: 1110/1256 evaluating HellaSwag: 1120/1256 evaluating HellaSwag: 1130/1256 evaluating HellaSwag: 1140/1256 evaluating HellaSwag: 1150/1256 evaluating HellaSwag: 1160/1256 evaluating HellaSwag: 1170/1256 evaluating HellaSwag: 1180/1256 evaluating HellaSwag: 1190/1256 evaluating HellaSwag: 1200/1256 evaluating HellaSwag: 1210/1256 evaluating HellaSwag: 1220/1256 evaluating HellaSwag: 1230/1256 evaluating HellaSwag: 1240/1256 evaluating HellaSwag: 1250/1256 HellaSwag: 2858/10042 = 0.284605 step 9001/19560 | loss 3.451039 (+0.06z)| norm 0.2754 (-0.48z)| lr 3.56e-04 | 4165.42 ms | 32.4% bf16 MFU | 125430 tok/s step 9002/19560 | loss 3.439990 (-0.19z)| norm 0.2989 (+0.12z)| lr 3.56e-04 | 4159.65 ms | 32.5% bf16 MFU | 125460 tok/s step 9003/19560 | loss 3.413393 (-0.78z)| norm 0.2750 (-0.49z)| lr 3.56e-04 | 4179.21 ms | 32.3% bf16 MFU | 125460 tok/s step 9004/19560 | loss 3.421810 (-0.58z)| norm 0.2835 (-0.27z)| lr 3.56e-04 | 4157.54 ms | 32.5% 
bf16 MFU | 125492 tok/s step 9005/19560 | loss 3.403580 (-0.98z)| norm 0.2718 (-0.58z)| lr 3.56e-04 | 4171.02 ms | 32.4% bf16 MFU | 125502 tok/s step 9006/19560 | loss 3.395021 (-1.16z)| norm 0.2617 (-0.83z)| lr 3.56e-04 | 4162.76 ms | 32.4% bf16 MFU | 125525 tok/s step 9007/19560 | loss 3.443302 (-0.10z)| norm 0.2809 (-0.33z)| lr 3.56e-04 | 4159.40 ms | 32.5% bf16 MFU | 125551 tok/s step 9008/19560 | loss 3.449722 (+0.05z)| norm 0.2755 (-0.47z)| lr 3.56e-04 | 4200.53 ms | 32.1% bf16 MFU | 125514 tok/s step 9009/19560 | loss 3.450061 (+0.07z)| norm 0.2713 (-0.58z)| lr 3.56e-04 | 4164.36 ms | 32.4% bf16 MFU | 125533 tok/s step 9010/19560 | loss 3.386857 (-1.35z)| norm 0.4760 (+4.34z)| lr 3.56e-04 | 4161.31 ms | 32.4% bf16 MFU | 125556 tok/s step 9011/19560 | loss 3.481401 (+0.78z)| norm 0.3443 (+1.16z)| lr 3.56e-04 | 4165.00 ms | 32.4% bf16 MFU | 125572 tok/s step 9012/19560 | loss 3.469669 (+0.50z)| norm 0.3000 (+0.09z)| lr 3.56e-04 | 4162.97 ms | 32.4% bf16 MFU | 125591 tok/s step 9013/19560 | loss 3.580672 (+2.89z)| norm 0.3093 (+0.31z)| lr 3.56e-04 | 4151.87 ms | 32.5% bf16 MFU | 125625 tok/s step 9014/19560 | loss 3.390975 (-1.25z)| norm 0.3208 (+0.58z)| lr 3.55e-04 | 4153.62 ms | 32.5% bf16 MFU | 125655 tok/s step 9015/19560 | loss 3.463385 (+0.33z)| norm 0.2993 (+0.06z)| lr 3.55e-04 | 4165.40 ms | 32.4% bf16 MFU | 125666 tok/s step 9016/19560 | loss 3.447126 (-0.01z)| norm 0.2990 (+0.05z)| lr 3.55e-04 | 4164.82 ms | 32.4% bf16 MFU | 125677 tok/s step 9017/19560 | loss 3.453705 (+0.14z)| norm 0.3008 (+0.09z)| lr 3.55e-04 | 4154.10 ms | 32.5% bf16 MFU | 125703 tok/s step 9018/19560 | loss 3.526718 (+1.81z)| norm 0.2773 (-0.47z)| lr 3.55e-04 | 4156.32 ms | 32.5% bf16 MFU | 125725 tok/s step 9019/19560 | loss 3.452533 (+0.12z)| norm 0.3070 (+0.24z)| lr 3.55e-04 | 4160.51 ms | 32.5% bf16 MFU | 125740 tok/s step 9020/19560 | loss 3.575619 (+2.81z)| norm 0.2773 (-0.47z)| lr 3.55e-04 | 4166.23 ms | 32.4% bf16 MFU | 125745 tok/s step 9021/19560 | loss 3.447355 
(-0.03z)| norm 0.3117 (+0.35z)| lr 3.55e-04 | 4164.72 ms | 32.4% bf16 MFU | 125752 tok/s step 9022/19560 | loss 3.436591 (-0.26z)| norm 0.2928 (-0.10z)| lr 3.55e-04 | 4169.19 ms | 32.4% bf16 MFU | 125752 tok/s step 9023/19560 | loss 3.409686 (-0.85z)| norm 0.2784 (-0.44z)| lr 3.55e-04 | 4165.90 ms | 32.4% bf16 MFU | 125757 tok/s step 9024/19560 | loss 3.505118 (+1.27z)| norm 0.3084 (+0.27z)| lr 3.55e-04 | 4155.51 ms | 32.5% bf16 MFU | 125778 tok/s step 9025/19560 | loss 3.431327 (-0.38z)| norm 0.2788 (-0.43z)| lr 3.55e-04 | 4162.82 ms | 32.4% bf16 MFU | 125786 tok/s step 9026/19560 | loss 3.465957 (+0.38z)| norm 0.2976 (+0.01z)| lr 3.55e-04 | 4157.74 ms | 32.5% bf16 MFU | 125802 tok/s step 9027/19560 | loss 3.425743 (-0.51z)| norm 0.2977 (+0.02z)| lr 3.55e-04 | 4158.24 ms | 32.5% bf16 MFU | 125816 tok/s step 9028/19560 | loss 3.402676 (-1.02z)| norm 0.2681 (-0.69z)| lr 3.55e-04 | 4161.45 ms | 32.4% bf16 MFU | 125824 tok/s step 9029/19560 | loss 3.371928 (-1.68z)| norm 0.2948 (-0.04z)| lr 3.55e-04 | 4164.58 ms | 32.4% bf16 MFU | 125828 tok/s step 9030/19560 | loss 3.359014 (-1.95z)| norm 0.2691 (-0.66z)| lr 3.55e-04 | 4162.50 ms | 32.4% bf16 MFU | 125834 tok/s step 9031/19560 | loss 3.553784 (+2.36z)| norm 0.2943 (-0.05z)| lr 3.55e-04 | 4224.53 ms | 32.0% bf16 MFU | 125748 tok/s step 9032/19560 | loss 3.453897 (+0.15z)| norm 0.2771 (-0.47z)| lr 3.55e-04 | 4156.24 ms | 32.5% bf16 MFU | 125767 tok/s step 9033/19560 | loss 3.466681 (+0.47z)| norm 0.2585 (-0.91z)| lr 3.55e-04 | 4158.16 ms | 32.5% bf16 MFU | 125783 tok/s step 9034/19560 | loss 3.457283 (+0.25z)| norm 0.2700 (-0.62z)| lr 3.55e-04 | 4160.02 ms | 32.5% bf16 MFU | 125796 tok/s step 9035/19560 | loss 3.454637 (+0.18z)| norm 0.2740 (-0.52z)| lr 3.54e-04 | 4163.10 ms | 32.4% bf16 MFU | 125803 tok/s step 9036/19560 | loss 3.479635 (+0.75z)| norm 0.2626 (-0.79z)| lr 3.54e-04 | 4157.75 ms | 32.5% bf16 MFU | 125818 tok/s step 9037/19560 | loss 3.422343 (-0.57z)| norm 0.3038 (+0.19z)| lr 3.54e-04 | 4162.73 ms | 
32.4% bf16 MFU | 125824 tok/s step 9038/19560 | loss 3.462431 (+0.34z)| norm 0.2734 (-0.54z)| lr 3.54e-04 | 4181.24 ms | 32.3% bf16 MFU | 125802 tok/s step 9039/19560 | loss 3.435034 (-0.28z)| norm 0.2750 (-0.50z)| lr 3.54e-04 | 4172.85 ms | 32.4% bf16 MFU | 125794 tok/s step 9040/19560 | loss 3.451640 (+0.10z)| norm 0.2589 (-0.88z)| lr 3.54e-04 | 4170.61 ms | 32.4% bf16 MFU | 125790 tok/s step 9041/19560 | loss 3.429085 (-0.43z)| norm 0.2657 (-0.71z)| lr 3.54e-04 | 4164.19 ms | 32.4% bf16 MFU | 125796 tok/s step 9042/19560 | loss 3.431841 (-0.36z)| norm 0.2549 (-0.96z)| lr 3.54e-04 | 4161.39 ms | 32.4% bf16 MFU | 125806 tok/s step 9043/19560 | loss 3.407300 (-0.92z)| norm 0.2860 (-0.21z)| lr 3.54e-04 | 4166.43 ms | 32.4% bf16 MFU | 125807 tok/s step 9044/19560 | loss 3.393732 (-1.23z)| norm 0.2570 (-0.90z)| lr 3.54e-04 | 4152.21 ms | 32.5% bf16 MFU | 125830 tok/s step 9045/19560 | loss 3.423684 (-0.53z)| norm 0.2955 (+0.02z)| lr 3.54e-04 | 4162.16 ms | 32.4% bf16 MFU | 125837 tok/s step 9046/19560 | loss 3.427318 (-0.45z)| norm 0.2612 (-0.79z)| lr 3.54e-04 | 4174.73 ms | 32.3% bf16 MFU | 125824 tok/s step 9047/19560 | loss 3.445696 (-0.03z)| norm 0.2822 (-0.28z)| lr 3.54e-04 | 4160.54 ms | 32.5% bf16 MFU | 125834 tok/s step 9048/19560 | loss 3.411981 (-0.81z)| norm 0.2504 (-1.03z)| lr 3.54e-04 | 4157.18 ms | 32.5% bf16 MFU | 125848 tok/s step 9049/19560 | loss 3.383759 (-1.43z)| norm 0.2752 (-0.44z)| lr 3.54e-04 | 4163.46 ms | 32.4% bf16 MFU | 125852 tok/s step 9050/19560 | loss 3.438625 (-0.18z)| norm 0.2635 (-0.71z)| lr 3.54e-04 | 4177.37 ms | 32.3% bf16 MFU | 125835 tok/s step 9051/19560 | loss 3.400834 (-1.04z)| norm 0.2694 (-0.57z)| lr 3.54e-04 | 6215.69 ms | 21.7% bf16 MFU | 123760 tok/s step 9052/19560 | loss 3.457613 (+0.26z)| norm 0.2626 (-0.73z)| lr 3.54e-04 | 4160.29 ms | 32.5% bf16 MFU | 123873 tok/s step 9053/19560 | loss 3.428136 (-0.42z)| norm 0.2697 (-0.56z)| lr 3.54e-04 | 4158.92 ms | 32.5% bf16 MFU | 123983 tok/s step 9054/19560 | loss 3.401110 
(-1.03z)| norm 0.2969 (+0.08z)| lr 3.54e-04 | 4169.44 ms | 32.4% bf16 MFU | 124071 tok/s step 9055/19560 | loss 3.446999 (+0.01z)| norm 0.2715 (-0.52z)| lr 3.53e-04 | 4202.87 ms | 32.1% bf16 MFU | 124105 tok/s step 9056/19560 | loss 3.496897 (+1.15z)| norm 0.2727 (-0.49z)| lr 3.53e-04 | 4132.32 ms | 32.7% bf16 MFU | 124243 tok/s step 9057/19560 | loss 3.500391 (+1.21z)| norm 0.2938 (+0.01z)| lr 3.53e-04 | 5649.46 ms | 23.9% bf16 MFU | 122671 tok/s step 9058/19560 | loss 3.417022 (-0.68z)| norm 0.2887 (-0.11z)| lr 3.53e-04 | 4138.78 ms | 32.6% bf16 MFU | 122872 tok/s step 9059/19560 | loss 3.431373 (-0.35z)| norm 0.2861 (-0.17z)| lr 3.53e-04 | 4162.11 ms | 32.4% bf16 MFU | 123026 tok/s step 9060/19560 | loss 3.456146 (+0.20z)| norm 0.2779 (-0.37z)| lr 3.53e-04 | 4139.59 ms | 32.6% bf16 MFU | 123208 tok/s step 9061/19560 | loss 3.479195 (+0.72z)| norm 0.2861 (-0.17z)| lr 3.53e-04 | 4146.65 ms | 32.6% bf16 MFU | 123369 tok/s step 9062/19560 | loss 3.503902 (+1.26z)| norm 0.2813 (-0.29z)| lr 3.53e-04 | 4139.13 ms | 32.6% bf16 MFU | 123534 tok/s step 9063/19560 | loss 3.443354 (-0.11z)| norm 0.2894 (-0.10z)| lr 3.53e-04 | 4153.07 ms | 32.5% bf16 MFU | 123669 tok/s step 9064/19560 | loss 3.306173 (-3.08z)| norm 0.2928 (-0.01z)| lr 3.53e-04 | 4157.49 ms | 32.5% bf16 MFU | 123791 tok/s step 9065/19560 | loss 3.456562 (+0.20z)| norm 0.2841 (-0.21z)| lr 3.53e-04 | 5538.34 ms | 24.4% bf16 MFU | 122335 tok/s step 9066/19560 | loss 3.427976 (-0.43z)| norm 0.3464 (+1.26z)| lr 3.53e-04 | 4157.33 ms | 32.5% bf16 MFU | 122524 tok/s step 9067/19560 | loss 3.375222 (-1.57z)| norm 0.2947 (+0.03z)| lr 3.53e-04 | 4159.49 ms | 32.5% bf16 MFU | 122700 tok/s step 9068/19560 | loss 3.534787 (+1.91z)| norm 0.3197 (+0.62z)| lr 3.53e-04 | 4160.22 ms | 32.5% bf16 MFU | 122866 tok/s step 9069/19560 | loss 3.475472 (+0.62z)| norm 0.3056 (+0.28z)| lr 3.53e-04 | 4202.49 ms | 32.1% bf16 MFU | 122961 tok/s step 9070/19560 | loss 3.417839 (-0.63z)| norm 0.2859 (-0.18z)| lr 3.53e-04 | 4159.00 ms | 
32.5% bf16 MFU | 123116 tok/s step 9071/19560 | loss 3.477671 (+0.67z)| norm 0.3035 (+0.23z)| lr 3.53e-04 | 4157.78 ms | 32.5% bf16 MFU | 123265 tok/s step 9072/19560 | loss 3.428806 (-0.41z)| norm 0.2576 (-0.85z)| lr 3.53e-04 | 4155.05 ms | 32.5% bf16 MFU | 123410 tok/s step 9073/19560 | loss 3.500272 (+1.15z)| norm 0.2873 (-0.14z)| lr 3.53e-04 | 4156.69 ms | 32.5% bf16 MFU | 123547 tok/s step 9074/19560 | loss 3.442244 (-0.12z)| norm 0.2605 (-0.77z)| lr 3.53e-04 | 4158.84 ms | 32.5% bf16 MFU | 123673 tok/s step 9075/19560 | loss 3.450755 (+0.06z)| norm 0.2654 (-0.65z)| lr 3.52e-04 | 4155.84 ms | 32.5% bf16 MFU | 123797 tok/s step 9076/19560 | loss 3.401055 (-1.01z)| norm 0.2819 (-0.26z)| lr 3.52e-04 | 4160.33 ms | 32.5% bf16 MFU | 123908 tok/s step 9077/19560 | loss 3.454701 (+0.18z)| norm 0.2989 (+0.14z)| lr 3.52e-04 | 4158.34 ms | 32.5% bf16 MFU | 124017 tok/s step 9078/19560 | loss 3.443512 (-0.07z)| norm 0.2949 (+0.04z)| lr 3.52e-04 | 4161.26 ms | 32.4% bf16 MFU | 124115 tok/s step 9079/19560 | loss 3.460654 (+0.31z)| norm 0.2665 (-0.63z)| lr 3.52e-04 | 4157.20 ms | 32.5% bf16 MFU | 124215 tok/s step 9080/19560 | loss 3.542379 (+2.10z)| norm 0.3171 (+0.57z)| lr 3.52e-04 | 4152.50 ms | 32.5% bf16 MFU | 124318 tok/s step 9081/19560 | loss 3.439042 (-0.18z)| norm 0.2711 (-0.52z)| lr 3.52e-04 | 4164.18 ms | 32.4% bf16 MFU | 124397 tok/s step 9082/19560 | loss 3.461493 (+0.31z)| norm 0.2597 (-0.79z)| lr 3.52e-04 | 4164.35 ms | 32.4% bf16 MFU | 124472 tok/s step 9083/19560 | loss 3.468896 (+0.47z)| norm 0.3065 (+0.31z)| lr 3.52e-04 | 4172.08 ms | 32.4% bf16 MFU | 124532 tok/s step 9084/19560 | loss 3.445524 (-0.06z)| norm 0.2585 (-0.82z)| lr 3.52e-04 | 4162.83 ms | 32.4% bf16 MFU | 124602 tok/s step 9085/19560 | loss 3.474389 (+0.57z)| norm 0.3051 (+0.27z)| lr 3.52e-04 | 4159.35 ms | 32.5% bf16 MFU | 124675 tok/s step 9086/19560 | loss 3.456389 (+0.18z)| norm 0.2669 (-0.63z)| lr 3.52e-04 | 4161.60 ms | 32.4% bf16 MFU | 124740 tok/s step 9087/19560 | loss 3.439308 
(-0.21z)| norm 0.3473 (+1.25z)| lr 3.52e-04 | 4176.12 ms | 32.3% bf16 MFU | 124780 tok/s step 9088/19560 | loss 3.456059 (+0.17z)| norm 0.2704 (-0.56z)| lr 3.52e-04 | 4170.45 ms | 32.4% bf16 MFU | 124827 tok/s step 9089/19560 | loss 3.465420 (+0.42z)| norm 0.3015 (+0.17z)| lr 3.52e-04 | 4153.71 ms | 32.5% bf16 MFU | 124897 tok/s step 9090/19560 | loss 3.453248 (+0.13z)| norm 0.2742 (-0.46z)| lr 3.52e-04 | 4147.21 ms | 32.6% bf16 MFU | 124973 tok/s step 9091/19560 | loss 3.504202 (+1.30z)| norm 0.2950 (+0.02z)| lr 3.52e-04 | 4162.44 ms | 32.4% bf16 MFU | 125022 tok/s step 9092/19560 | loss 3.432950 (-0.37z)| norm 0.2913 (-0.07z)| lr 3.52e-04 | 4165.95 ms | 32.4% bf16 MFU | 125064 tok/s step 9093/19560 | loss 3.373707 (-1.73z)| norm 0.2845 (-0.22z)| lr 3.52e-04 | 5509.66 ms | 24.5% bf16 MFU | 123568 tok/s step 9094/19560 | loss 3.492636 (+1.06z)| norm 0.2736 (-0.48z)| lr 3.52e-04 | 4171.42 ms | 32.4% bf16 MFU | 123674 tok/s step 9095/19560 | loss 3.389232 (-1.36z)| norm 0.2825 (-0.26z)| lr 3.52e-04 | 4158.18 ms | 32.5% bf16 MFU | 123795 tok/s step 9096/19560 | loss 3.453028 (+0.14z)| norm 0.2681 (-0.60z)| lr 3.51e-04 | 4161.41 ms | 32.4% bf16 MFU | 123904 tok/s step 9097/19560 | loss 3.440856 (-0.15z)| norm 0.2857 (-0.17z)| lr 3.51e-04 | 4172.49 ms | 32.4% bf16 MFU | 123992 tok/s step 9098/19560 | loss 3.436692 (-0.25z)| norm 0.2762 (-0.39z)| lr 3.51e-04 | 4147.71 ms | 32.6% bf16 MFU | 124112 tok/s step 9099/19560 | loss 3.486068 (+0.90z)| norm 0.2704 (-0.52z)| lr 3.51e-04 | 4150.42 ms | 32.5% bf16 MFU | 124223 tok/s step 9100/19560 | loss 3.521909 (+1.72z)| norm 0.3004 (+0.20z)| lr 3.51e-04 | 4166.50 ms | 32.4% bf16 MFU | 124304 tok/s step 9101/19560 | loss 3.484881 (+0.85z)| norm 0.3052 (+0.31z)| lr 3.51e-04 | 4155.40 ms | 32.5% bf16 MFU | 124397 tok/s step 9102/19560 | loss 3.448423 (+0.01z)| norm 0.2907 (-0.03z)| lr 3.51e-04 | 4151.73 ms | 32.5% bf16 MFU | 124491 tok/s step 9103/19560 | loss 3.469168 (+0.49z)| norm 0.3325 (+0.95z)| lr 3.51e-04 | 4168.97 ms | 
32.4% bf16 MFU | 124555 tok/s step 9104/19560 | loss 3.435919 (-0.30z)| norm 0.2978 (+0.13z)| lr 3.51e-04 | 4151.32 ms | 32.5% bf16 MFU | 124642 tok/s step 9105/19560 | loss 3.457398 (+0.20z)| norm 0.2754 (-0.40z)| lr 3.51e-04 | 4160.52 ms | 32.5% bf16 MFU | 124710 tok/s step 9106/19560 | loss 3.447946 (-0.03z)| norm 0.2993 (+0.16z)| lr 3.51e-04 | 4146.15 ms | 32.6% bf16 MFU | 124797 tok/s step 9107/19560 | loss 3.386553 (-1.46z)| norm 0.3007 (+0.19z)| lr 3.51e-04 | 4161.83 ms | 32.4% bf16 MFU | 124856 tok/s step 9108/19560 | loss 3.477147 (+0.67z)| norm 0.2944 (+0.04z)| lr 3.51e-04 | 4154.73 ms | 32.5% bf16 MFU | 124923 tok/s step 9109/19560 | loss 3.553049 (+2.38z)| norm 0.3027 (+0.23z)| lr 3.51e-04 | 4154.85 ms | 32.5% bf16 MFU | 124986 tok/s step 9110/19560 | loss 3.437293 (-0.28z)| norm 0.3259 (+0.77z)| lr 3.51e-04 | 4153.98 ms | 32.5% bf16 MFU | 125047 tok/s step 9111/19560 | loss 3.422931 (-0.60z)| norm 0.2757 (-0.41z)| lr 3.51e-04 | 4151.68 ms | 32.5% bf16 MFU | 125109 tok/s step 9112/19560 | loss 3.461470 (+0.30z)| norm 0.2816 (-0.27z)| lr 3.51e-04 | 4163.81 ms | 32.4% bf16 MFU | 125150 tok/s step 9113/19560 | loss 3.409475 (-0.91z)| norm 0.2879 (-0.12z)| lr 3.51e-04 | 4166.13 ms | 32.4% bf16 MFU | 125184 tok/s step 9114/19560 | loss 3.443030 (-0.12z)| norm 0.2856 (-0.18z)| lr 3.51e-04 | 4153.95 ms | 32.5% bf16 MFU | 125236 tok/s step 9115/19560 | loss 3.476667 (+0.65z)| norm 0.2919 (-0.03z)| lr 3.51e-04 | 4166.25 ms | 32.4% bf16 MFU | 125266 tok/s step 9116/19560 | loss 3.438098 (-0.24z)| norm 0.2739 (-0.45z)| lr 3.50e-04 | 4169.85 ms | 32.4% bf16 MFU | 125289 tok/s step 9117/19560 | loss 3.419827 (-0.67z)| norm 0.2707 (-0.70z)| lr 3.50e-04 | 4153.76 ms | 32.5% bf16 MFU | 125336 tok/s step 9118/19560 | loss 3.424309 (-0.56z)| norm 0.2729 (-0.62z)| lr 3.50e-04 | 4172.94 ms | 32.4% bf16 MFU | 125351 tok/s step 9119/19560 | loss 3.479759 (+0.76z)| norm 0.2611 (-1.06z)| lr 3.50e-04 | 4161.86 ms | 32.4% bf16 MFU | 125382 tok/s step 9120/19560 | loss 3.387081 
(-1.43z)| norm 0.2648 (-0.91z)| lr 3.50e-04 | 4153.79 ms | 32.5% bf16 MFU | 125424 tok/s step 9121/19560 | loss 3.406419 (-0.96z)| norm 0.2936 (+0.23z)| lr 3.50e-04 | 4172.44 ms | 32.4% bf16 MFU | 125436 tok/s step 9122/19560 | loss 3.488338 (+0.98z)| norm 0.2663 (-0.83z)| lr 3.50e-04 | 4152.42 ms | 32.5% bf16 MFU | 125477 tok/s step 9123/19560 | loss 3.365349 (-1.91z)| norm 0.2583 (-1.13z)| lr 3.50e-04 | 4162.10 ms | 32.4% bf16 MFU | 125502 tok/s step 9124/19560 | loss 3.451344 (+0.11z)| norm 0.2666 (-0.79z)| lr 3.50e-04 | 4151.79 ms | 32.5% bf16 MFU | 125540 tok/s step 9125/19560 | loss 3.453194 (+0.15z)| norm 0.2697 (-0.66z)| lr 3.50e-04 | 4158.46 ms | 32.5% bf16 MFU | 125567 tok/s step 9126/19560 | loss 3.400278 (-1.09z)| norm 0.2780 (-0.33z)| lr 3.50e-04 | 4150.07 ms | 32.5% bf16 MFU | 125606 tok/s step 9127/19560 | loss 3.396210 (-1.17z)| norm 0.2712 (-0.59z)| lr 3.50e-04 | 4161.46 ms | 32.4% bf16 MFU | 125625 tok/s step 9128/19560 | loss 3.473342 (+0.63z)| norm 0.2824 (-0.14z)| lr 3.50e-04 | 4150.23 ms | 32.5% bf16 MFU | 125660 tok/s step 9129/19560 | loss 3.361487 (-1.93z)| norm 0.2549 (-1.22z)| lr 3.50e-04 | 4155.65 ms | 32.5% bf16 MFU | 125685 tok/s step 9130/19560 | loss 3.476355 (+0.70z)| norm 0.3258 (+1.54z)| lr 3.50e-04 | 4151.05 ms | 32.5% bf16 MFU | 125716 tok/s step 9131/19560 | loss 3.435107 (-0.25z)| norm 0.2736 (-0.48z)| lr 3.50e-04 | 4146.98 ms | 32.6% bf16 MFU | 125751 tok/s step 9132/19560 | loss 3.425386 (-0.48z)| norm 0.2837 (-0.09z)| lr 3.50e-04 | 4160.98 ms | 32.4% bf16 MFU | 125764 tok/s step 9133/19560 | loss 3.408813 (-0.86z)| norm 0.2719 (-0.55z)| lr 3.50e-04 | 5796.27 ms | 23.3% bf16 MFU | 123998 tok/s step 9134/19560 | loss 3.391025 (-1.27z)| norm 0.2767 (-0.37z)| lr 3.50e-04 | 4163.52 ms | 32.4% bf16 MFU | 124095 tok/s step 9135/19560 | loss 3.451891 (+0.13z)| norm 0.2756 (-0.41z)| lr 3.50e-04 | 4154.16 ms | 32.5% bf16 MFU | 124200 tok/s step 9136/19560 | loss 3.428830 (-0.39z)| norm 0.2867 (+0.02z)| lr 3.49e-04 | 4174.94 ms | 
32.3% bf16 MFU | 124269 tok/s step 9137/19560 | loss 3.416535 (-0.67z)| norm 0.2712 (-0.58z)| lr 3.49e-04 | 4176.73 ms | 32.3% bf16 MFU | 124332 tok/s step 9138/19560 | loss 3.441656 (-0.10z)| norm 0.2652 (-1.00z)| lr 3.49e-04 | 4157.31 ms | 32.5% bf16 MFU | 124421 tok/s step 9139/19560 | loss 3.506835 (+1.39z)| norm 0.2809 (-0.17z)| lr 3.49e-04 | 4159.25 ms | 32.5% bf16 MFU | 124503 tok/s step 9140/19560 | loss 3.490180 (+1.00z)| norm 0.2575 (-1.40z)| lr 3.49e-04 | 4157.15 ms | 32.5% bf16 MFU | 124583 tok/s step 9141/19560 | loss 3.467694 (+0.53z)| norm 0.2858 (+0.12z)| lr 3.49e-04 | 4149.63 ms | 32.5% bf16 MFU | 124671 tok/s step 9142/19560 | loss 3.428347 (-0.42z)| norm 0.2672 (-0.86z)| lr 3.49e-04 | 4195.85 ms | 32.2% bf16 MFU | 124686 tok/s step 9143/19560 | loss 3.364287 (-1.91z)| norm 0.2559 (-1.45z)| lr 3.49e-04 | 4152.56 ms | 32.5% bf16 MFU | 124764 tok/s step 9144/19560 | loss 3.392349 (-1.23z)| norm 0.2896 (+0.37z)| lr 3.49e-04 | 4160.89 ms | 32.4% bf16 MFU | 124826 tok/s step 9145/19560 | loss 3.493544 (+1.13z)| norm 0.2775 (-0.28z)| lr 3.49e-04 | 4157.55 ms | 32.5% bf16 MFU | 124890 tok/s step 9146/19560 | loss 3.452562 (+0.19z)| norm 0.2931 (+0.56z)| lr 3.49e-04 | 4157.04 ms | 32.5% bf16 MFU | 124952 tok/s step 9147/19560 | loss 3.443748 (-0.02z)| norm 0.2965 (+0.76z)| lr 3.49e-04 | 4152.86 ms | 32.5% bf16 MFU | 125016 tok/s step 9148/19560 | loss 3.411311 (-0.78z)| norm 0.3134 (+1.64z)| lr 3.49e-04 | 4173.61 ms | 32.4% bf16 MFU | 125047 tok/s step 9149/19560 | loss 3.503429 (+1.46z)| norm 0.2962 (+0.72z)| lr 3.49e-04 | 4163.50 ms | 32.4% bf16 MFU | 125090 tok/s step 9150/19560 | loss 3.422425 (-0.51z)| norm 0.2992 (+0.88z)| lr 3.49e-04 | 4157.65 ms | 32.5% bf16 MFU | 125141 tok/s step 9151/19560 | loss 3.519266 (+1.81z)| norm 0.3068 (+1.27z)| lr 3.49e-04 | 4171.06 ms | 32.4% bf16 MFU | 125169 tok/s step 9152/19560 | loss 3.377942 (-1.58z)| norm 0.2798 (-0.16z)| lr 3.49e-04 | 4156.23 ms | 32.5% bf16 MFU | 125218 tok/s step 9153/19560 | loss 3.416219 
(-0.65z)| norm 0.2747 (-0.44z)| lr 3.49e-04 | 4158.63 ms | 32.5% bf16 MFU | 125260 tok/s step 9154/19560 | loss 3.443176 (+0.00z)| norm 0.2982 (+0.83z)| lr 3.49e-04 | 4147.53 ms | 32.6% bf16 MFU | 125318 tok/s step 9155/19560 | loss 3.438791 (-0.10z)| norm 0.2802 (-0.13z)| lr 3.49e-04 | 4161.87 ms | 32.4% bf16 MFU | 125351 tok/s step 9156/19560 | loss 3.360013 (-1.97z)| norm 0.2961 (+0.72z)| lr 3.49e-04 | 4172.59 ms | 32.4% bf16 MFU | 125366 tok/s step 9157/19560 | loss 3.393781 (-1.18z)| norm 0.2646 (-0.98z)| lr 3.48e-04 | 4162.42 ms | 32.4% bf16 MFU | 125395 tok/s step 9158/19560 | loss 3.470519 (+0.65z)| norm 0.2881 (+0.28z)| lr 3.48e-04 | 4159.90 ms | 32.5% bf16 MFU | 125427 tok/s step 9159/19560 | loss 3.412845 (-0.75z)| norm 0.2903 (+0.41z)| lr 3.48e-04 | 4169.86 ms | 32.4% bf16 MFU | 125442 tok/s step 9160/19560 | loss 3.511994 (+1.70z)| norm 0.2578 (-1.34z)| lr 3.48e-04 | 4166.09 ms | 32.4% bf16 MFU | 125463 tok/s step 9161/19560 | loss 3.450383 (+0.18z)| norm 0.2792 (-0.20z)| lr 3.48e-04 | 4244.80 ms | 31.8% bf16 MFU | 125365 tok/s step 9162/19560 | loss 3.407578 (-0.87z)| norm 0.2592 (-1.27z)| lr 3.48e-04 | 4171.24 ms | 32.4% bf16 MFU | 125381 tok/s step 9163/19560 | loss 3.417177 (-0.62z)| norm 0.2540 (-1.53z)| lr 3.48e-04 | 4153.08 ms | 32.5% bf16 MFU | 125424 tok/s step 9164/19560 | loss 3.455472 (+0.33z)| norm 0.2785 (-0.22z)| lr 3.48e-04 | 4160.86 ms | 32.4% bf16 MFU | 125453 tok/s step 9165/19560 | loss 3.496737 (+1.33z)| norm 0.2597 (-1.22z)| lr 3.48e-04 | 4158.51 ms | 32.5% bf16 MFU | 125485 tok/s step 9166/19560 | loss 3.428019 (-0.36z)| norm 0.2777 (-0.25z)| lr 3.48e-04 | 4171.96 ms | 32.4% bf16 MFU | 125494 tok/s step 9167/19560 | loss 3.444756 (+0.05z)| norm 0.2714 (-0.59z)| lr 3.48e-04 | 4156.12 ms | 32.5% bf16 MFU | 125527 tok/s step 9168/19560 | loss 3.456770 (+0.35z)| norm 0.2695 (-0.70z)| lr 3.48e-04 | 4200.23 ms | 32.1% bf16 MFU | 125491 tok/s step 9169/19560 | loss 3.390471 (-1.27z)| norm 0.2836 (+0.06z)| lr 3.48e-04 | 4161.97 ms | 
32.4% bf16 MFU | 125515 tok/s step 9170/19560 | loss 3.403867 (-0.93z)| norm 0.2835 (+0.04z)| lr 3.48e-04 | 4160.43 ms | 32.5% bf16 MFU | 125541 tok/s step 9171/19560 | loss 3.419669 (-0.55z)| norm 0.2853 (+0.14z)| lr 3.48e-04 | 4157.18 ms | 32.5% bf16 MFU | 125569 tok/s step 9172/19560 | loss 3.395207 (-1.15z)| norm 0.2731 (-0.54z)| lr 3.48e-04 | 4152.27 ms | 32.5% bf16 MFU | 125604 tok/s step 9173/19560 | loss 3.432848 (-0.23z)| norm 0.3224 (+2.13z)| lr 3.48e-04 | 4168.05 ms | 32.4% bf16 MFU | 125613 tok/s step 9174/19560 | loss 3.404053 (-0.93z)| norm 0.2880 (+0.25z)| lr 3.48e-04 | 4169.67 ms | 32.4% bf16 MFU | 125620 tok/s step 9175/19560 | loss 3.511045 (+1.65z)| norm 0.2854 (+0.11z)| lr 3.48e-04 | 4152.83 ms | 32.5% bf16 MFU | 125651 tok/s step 9176/19560 | loss 3.428046 (-0.35z)| norm 0.2796 (-0.22z)| lr 3.48e-04 | 4170.93 ms | 32.4% bf16 MFU | 125653 tok/s step 9177/19560 | loss 3.462262 (+0.46z)| norm 0.2634 (-1.10z)| lr 3.47e-04 | 4158.37 ms | 32.5% bf16 MFU | 125675 tok/s step 9178/19560 | loss 3.460711 (+0.42z)| norm 0.2942 (+0.58z)| lr 3.47e-04 | 4163.04 ms | 32.4% bf16 MFU | 125688 tok/s step 9179/19560 | loss 3.397357 (-1.12z)| norm 0.2468 (-2.00z)| lr 3.47e-04 | 4167.09 ms | 32.4% bf16 MFU | 125694 tok/s step 9180/19560 | loss 3.372303 (-1.69z)| norm 0.3035 (+1.07z)| lr 3.47e-04 | 4169.37 ms | 32.4% bf16 MFU | 125697 tok/s step 9181/19560 | loss 3.423676 (-0.46z)| norm 0.2738 (-0.55z)| lr 3.47e-04 | 4162.39 ms | 32.4% bf16 MFU | 125710 tok/s step 9182/19560 | loss 3.403322 (-0.95z)| norm 0.2722 (-0.63z)| lr 3.47e-04 | 4169.85 ms | 32.4% bf16 MFU | 125711 tok/s step 9183/19560 | loss 3.348179 (-2.21z)| norm 0.3026 (+1.02z)| lr 3.47e-04 | 4152.53 ms | 32.5% bf16 MFU | 125739 tok/s step 9184/19560 | loss 3.395479 (-1.08z)| norm 0.2461 (-2.02z)| lr 3.47e-04 | 4169.88 ms | 32.4% bf16 MFU | 125738 tok/s step 9185/19560 | loss 3.380940 (-1.40z)| norm 0.3045 (+1.11z)| lr 3.47e-04 | 4151.97 ms | 32.5% bf16 MFU | 125765 tok/s step 9186/19560 | loss 3.428833 
(-0.27z)| norm 0.2488 (-1.83z)| lr 3.47e-04 | 4161.65 ms | 32.4% bf16 MFU | 125776 tok/s step 9187/19560 | loss 3.434005 (-0.15z)| norm 0.3096 (+1.35z)| lr 3.47e-04 | 4172.56 ms | 32.4% bf16 MFU | 125770 tok/s step 9188/19560 | loss 3.427185 (-0.31z)| norm 0.2795 (-0.22z)| lr 3.47e-04 | 4180.90 ms | 32.3% bf16 MFU | 125751 tok/s step 9189/19560 | loss 3.456738 (+0.40z)| norm 0.2759 (-0.40z)| lr 3.47e-04 | 4154.32 ms | 32.5% bf16 MFU | 125774 tok/s step 9190/19560 | loss 3.406482 (-0.78z)| norm 0.3007 (+0.88z)| lr 3.47e-04 | 4161.36 ms | 32.4% bf16 MFU | 125785 tok/s step 9191/19560 | loss 3.416198 (-0.54z)| norm 0.2804 (-0.17z)| lr 3.47e-04 | 4159.94 ms | 32.5% bf16 MFU | 125797 tok/s step 9192/19560 | loss 3.443910 (+0.09z)| norm 0.2891 (+0.28z)| lr 3.47e-04 | 4230.60 ms | 31.9% bf16 MFU | 125703 tok/s step 9193/19560 | loss 3.382872 (-1.40z)| norm 0.2669 (-0.86z)| lr 3.47e-04 | 4174.01 ms | 32.3% bf16 MFU | 125699 tok/s step 9194/19560 | loss 3.373216 (-1.61z)| norm 0.2873 (+0.23z)| lr 3.47e-04 | 4192.80 ms | 32.2% bf16 MFU | 125666 tok/s step 9195/19560 | loss 3.438183 (-0.03z)| norm 0.2711 (-0.64z)| lr 3.47e-04 | 4167.71 ms | 32.4% bf16 MFU | 125673 tok/s step 9196/19560 | loss 3.516491 (+1.91z)| norm 0.2647 (-0.98z)| lr 3.47e-04 | 4153.83 ms | 32.5% bf16 MFU | 125700 tok/s step 9197/19560 | loss 3.345142 (-2.27z)| norm 0.3063 (+1.31z)| lr 3.46e-04 | 4167.97 ms | 32.4% bf16 MFU | 125704 tok/s step 9198/19560 | loss 3.406878 (-0.77z)| norm 0.2710 (-0.63z)| lr 3.46e-04 | 4166.32 ms | 32.4% bf16 MFU | 125711 tok/s step 9199/19560 | loss 3.487800 (+1.20z)| norm 0.2728 (-0.52z)| lr 3.46e-04 | 4160.07 ms | 32.5% bf16 MFU | 125727 tok/s step 9200/19560 | loss 3.434474 (-0.10z)| norm 0.2894 (+0.39z)| lr 3.46e-04 | 4154.28 ms | 32.5% bf16 MFU | 125751 tok/s step 9201/19560 | loss 3.473392 (+0.86z)| norm 0.2929 (+0.58z)| lr 3.46e-04 | 4168.56 ms | 32.4% bf16 MFU | 125752 tok/s step 9202/19560 | loss 3.446606 (+0.20z)| norm 0.2584 (-1.33z)| lr 3.46e-04 | 4151.51 ms | 
32.5% bf16 MFU | 125779 tok/s step 9203/19560 | loss 3.390015 (-1.16z)| norm 0.2761 (-0.35z)| lr 3.46e-04 | 4153.51 ms | 32.5% bf16 MFU | 125801 tok/s step 9204/19560 | loss 3.465762 (+0.67z)| norm 0.2938 (+0.62z)| lr 3.46e-04 | 4166.01 ms | 32.4% bf16 MFU | 125804 tok/s step 9205/19560 | loss 3.442640 (+0.11z)| norm 0.2601 (-1.23z)| lr 3.46e-04 | 4154.21 ms | 32.5% bf16 MFU | 125824 tok/s step 9206/19560 | loss 3.354481 (-1.99z)| norm 0.3028 (+1.13z)| lr 3.46e-04 | 4164.71 ms | 32.4% bf16 MFU | 125827 tok/s step 9207/19560 | loss 3.445045 (+0.18z)| norm 0.2893 (+0.38z)| lr 3.46e-04 | 4154.86 ms | 32.5% bf16 MFU | 125845 tok/s step 9208/19560 | loss 3.443943 (+0.18z)| norm 0.2838 (+0.09z)| lr 3.46e-04 | 4152.16 ms | 32.5% bf16 MFU | 125866 tok/s step 9209/19560 | loss 3.380775 (-1.36z)| norm 0.2742 (-0.46z)| lr 3.46e-04 | 4165.96 ms | 32.4% bf16 MFU | 125865 tok/s step 9210/19560 | loss 3.437721 (+0.04z)| norm 0.2873 (+0.27z)| lr 3.46e-04 | 4167.13 ms | 32.4% bf16 MFU | 125863 tok/s step 9211/19560 | loss 3.471118 (+0.86z)| norm 0.3078 (+1.43z)| lr 3.46e-04 | 4160.97 ms | 32.4% bf16 MFU | 125870 tok/s step 9212/19560 | loss 3.439560 (+0.09z)| norm 0.2881 (+0.31z)| lr 3.46e-04 | 4155.63 ms | 32.5% bf16 MFU | 125884 tok/s step 9213/19560 | loss 3.509294 (+1.77z)| norm 0.3214 (+2.16z)| lr 3.46e-04 | 4154.16 ms | 32.5% bf16 MFU | 125901 tok/s step 9214/19560 | loss 3.470266 (+0.82z)| norm 0.2873 (+0.24z)| lr 3.46e-04 | 4163.84 ms | 32.4% bf16 MFU | 125901 tok/s step 9215/19560 | loss 3.425133 (-0.27z)| norm 0.2980 (+0.91z)| lr 3.46e-04 | 4161.03 ms | 32.4% bf16 MFU | 125906 tok/s step 9216/19560 | loss 3.417758 (-0.44z)| norm 0.3210 (+2.21z)| lr 3.46e-04 | 4164.07 ms | 32.4% bf16 MFU | 125906 tok/s step 9217/19560 | loss 3.447997 (+0.30z)| norm 0.2744 (-0.49z)| lr 3.45e-04 | 4150.96 ms | 32.5% bf16 MFU | 125926 tok/s step 9218/19560 | loss 3.434663 (-0.02z)| norm 0.3005 (+1.02z)| lr 3.45e-04 | 4173.26 ms | 32.4% bf16 MFU | 125911 tok/s step 9219/19560 | loss 3.347644 
(-2.09z)| norm 0.2822 (-0.04z)| lr 3.45e-04 | 4157.17 ms | 32.5% bf16 MFU | 125922 tok/s step 9220/19560 | loss 3.436496 (+0.05z)| norm 0.2698 (-0.75z)| lr 3.45e-04 | 4166.89 ms | 32.4% bf16 MFU | 125917 tok/s step 9221/19560 | loss 3.486618 (+1.24z)| norm 0.3119 (+1.67z)| lr 3.45e-04 | 4172.54 ms | 32.4% bf16 MFU | 125903 tok/s step 9222/19560 | loss 3.404545 (-0.73z)| norm 0.2632 (-1.13z)| lr 3.45e-04 | 4157.90 ms | 32.5% bf16 MFU | 125913 tok/s step 9223/19560 | loss 3.435185 (+0.00z)| norm 0.2774 (-0.31z)| lr 3.45e-04 | 4156.36 ms | 32.5% bf16 MFU | 125924 tok/s step 9224/19560 | loss 3.348189 (-2.07z)| norm 0.2735 (-0.54z)| lr 3.45e-04 | 4161.17 ms | 32.4% bf16 MFU | 125928 tok/s step 9225/19560 | loss 3.341630 (-2.17z)| norm 0.2675 (-0.88z)| lr 3.45e-04 | 4175.64 ms | 32.3% bf16 MFU | 125910 tok/s step 9226/19560 | loss 3.504579 (+1.65z)| norm 0.2630 (-1.12z)| lr 3.45e-04 | 4164.28 ms | 32.4% bf16 MFU | 125909 tok/s step 9227/19560 | loss 3.421822 (-0.27z)| norm 0.2694 (-0.76z)| lr 3.45e-04 | 4171.20 ms | 32.4% bf16 MFU | 125898 tok/s step 9228/19560 | loss 3.364830 (-1.59z)| norm 0.2823 (-0.01z)| lr 3.45e-04 | 4172.83 ms | 32.4% bf16 MFU | 125886 tok/s step 9229/19560 | loss 3.404984 (-0.63z)| norm 0.3039 (+1.23z)| lr 3.45e-04 | 4159.96 ms | 32.5% bf16 MFU | 125893 tok/s step 9230/19560 | loss 3.378900 (-1.23z)| norm 0.3133 (+1.74z)| lr 3.45e-04 | 4161.23 ms | 32.4% bf16 MFU | 125898 tok/s step 9231/19560 | loss 3.434870 (+0.10z)| norm 0.2715 (-0.63z)| lr 3.45e-04 | 4172.84 ms | 32.4% bf16 MFU | 125885 tok/s step 9232/19560 | loss 3.502005 (+1.66z)| norm 0.3147 (+1.87z)| lr 3.45e-04 | 4159.27 ms | 32.5% bf16 MFU | 125894 tok/s step 9233/19560 | loss 3.442801 (+0.27z)| norm 0.2867 (+0.25z)| lr 3.45e-04 | 4169.56 ms | 32.4% bf16 MFU | 125886 tok/s step 9234/19560 | loss 3.410677 (-0.47z)| norm 0.2868 (+0.26z)| lr 3.45e-04 | 4174.00 ms | 32.3% bf16 MFU | 125872 tok/s step 9235/19560 | loss 3.381068 (-1.17z)| norm 0.2799 (-0.13z)| lr 3.45e-04 | 4167.56 ms | 
32.4% bf16 MFU | 125869 tok/s step 9236/19560 | loss 3.387423 (-1.00z)| norm 0.2731 (-0.52z)| lr 3.45e-04 | 4159.01 ms | 32.5% bf16 MFU | 125878 tok/s step 9237/19560 | loss 3.437240 (+0.19z)| norm 0.2929 (+0.65z)| lr 3.45e-04 | 4155.38 ms | 32.5% bf16 MFU | 125893 tok/s step 9238/19560 | loss 3.419963 (-0.22z)| norm 0.2704 (-0.67z)| lr 3.44e-04 | 4156.56 ms | 32.5% bf16 MFU | 125905 tok/s step 9239/19560 | loss 3.472541 (+1.04z)| norm 0.2649 (-0.99z)| lr 3.44e-04 | 4173.38 ms | 32.4% bf16 MFU | 125891 tok/s step 9240/19560 | loss 3.372048 (-1.36z)| norm 0.2498 (-1.86z)| lr 3.44e-04 | 4158.04 ms | 32.5% bf16 MFU | 125901 tok/s step 9241/19560 | loss 3.428731 (-0.01z)| norm 0.2674 (-0.80z)| lr 3.44e-04 | 4155.92 ms | 32.5% bf16 MFU | 125914 tok/s step 9242/19560 | loss 3.373942 (-1.30z)| norm 0.2553 (-1.49z)| lr 3.44e-04 | 4169.45 ms | 32.4% bf16 MFU | 125905 tok/s step 9243/19560 | loss 3.436102 (+0.19z)| norm 0.2811 (+0.02z)| lr 3.44e-04 | 4191.77 ms | 32.2% bf16 MFU | 125864 tok/s step 9244/19560 | loss 3.415158 (-0.31z)| norm 0.2778 (-0.17z)| lr 3.44e-04 | 4159.89 ms | 32.5% bf16 MFU | 125872 tok/s step 9245/19560 | loss 3.418029 (-0.24z)| norm 0.2760 (-0.28z)| lr 3.44e-04 | 4157.81 ms | 32.5% bf16 MFU | 125883 tok/s step 9246/19560 | loss 3.404819 (-0.55z)| norm 0.2907 (+0.58z)| lr 3.44e-04 | 4179.67 ms | 32.3% bf16 MFU | 125861 tok/s step 9247/19560 | loss 3.443994 (+0.40z)| norm 0.2670 (-0.82z)| lr 3.44e-04 | 4173.96 ms | 32.3% bf16 MFU | 125849 tok/s step 9248/19560 | loss 3.410277 (-0.42z)| norm 0.2758 (-0.31z)| lr 3.44e-04 | 4152.83 ms | 32.5% bf16 MFU | 125869 tok/s step 9249/19560 | loss 3.425599 (-0.05z)| norm 0.2775 (-0.20z)| lr 3.44e-04 | 4163.13 ms | 32.4% bf16 MFU | 125872 tok/s step 9250/19560 | loss 3.544339 (+2.75z)| norm 0.2921 (+0.65z)| lr 3.44e-04 | 4160.67 ms | 32.5% bf16 MFU | 125879 tok/s val loss 3.404068 evaluating HellaSwag: 0/1256 evaluating HellaSwag: 10/1256 evaluating HellaSwag: 20/1256 evaluating HellaSwag: 30/1256 evaluating 
HellaSwag: 40/1256 evaluating HellaSwag: 50/1256 evaluating HellaSwag: 60/1256 evaluating HellaSwag: 70/1256 evaluating HellaSwag: 80/1256 evaluating HellaSwag: 90/1256 evaluating HellaSwag: 100/1256 evaluating HellaSwag: 110/1256 evaluating HellaSwag: 120/1256 evaluating HellaSwag: 130/1256 evaluating HellaSwag: 140/1256 evaluating HellaSwag: 150/1256 evaluating HellaSwag: 160/1256 evaluating HellaSwag: 170/1256 evaluating HellaSwag: 180/1256 evaluating HellaSwag: 190/1256 evaluating HellaSwag: 200/1256 evaluating HellaSwag: 210/1256 evaluating HellaSwag: 220/1256 evaluating HellaSwag: 230/1256 evaluating HellaSwag: 240/1256 evaluating HellaSwag: 250/1256 evaluating HellaSwag: 260/1256 evaluating HellaSwag: 270/1256 evaluating HellaSwag: 280/1256 evaluating HellaSwag: 290/1256 evaluating HellaSwag: 300/1256 evaluating HellaSwag: 310/1256 evaluating HellaSwag: 320/1256 evaluating HellaSwag: 330/1256 evaluating HellaSwag: 340/1256 evaluating HellaSwag: 350/1256 evaluating HellaSwag: 360/1256 evaluating HellaSwag: 370/1256 evaluating HellaSwag: 380/1256 evaluating HellaSwag: 390/1256 evaluating HellaSwag: 400/1256 evaluating HellaSwag: 410/1256 evaluating HellaSwag: 420/1256 evaluating HellaSwag: 430/1256 evaluating HellaSwag: 440/1256 evaluating HellaSwag: 450/1256 evaluating HellaSwag: 460/1256 evaluating HellaSwag: 470/1256 evaluating HellaSwag: 480/1256 evaluating HellaSwag: 490/1256 evaluating HellaSwag: 500/1256 evaluating HellaSwag: 510/1256 evaluating HellaSwag: 520/1256 evaluating HellaSwag: 530/1256 evaluating HellaSwag: 540/1256 evaluating HellaSwag: 550/1256 evaluating HellaSwag: 560/1256 evaluating HellaSwag: 570/1256 evaluating HellaSwag: 580/1256 evaluating HellaSwag: 590/1256 evaluating HellaSwag: 600/1256 evaluating HellaSwag: 610/1256 evaluating HellaSwag: 620/1256 evaluating HellaSwag: 630/1256 evaluating HellaSwag: 640/1256 evaluating HellaSwag: 650/1256 evaluating HellaSwag: 660/1256 evaluating HellaSwag: 670/1256 evaluating HellaSwag: 680/1256 
evaluating HellaSwag: 690/1256 evaluating HellaSwag: 700/1256 evaluating HellaSwag: 710/1256 evaluating HellaSwag: 720/1256 evaluating HellaSwag: 730/1256 evaluating HellaSwag: 740/1256 evaluating HellaSwag: 750/1256 evaluating HellaSwag: 760/1256 evaluating HellaSwag: 770/1256 evaluating HellaSwag: 780/1256 evaluating HellaSwag: 790/1256 evaluating HellaSwag: 800/1256 evaluating HellaSwag: 810/1256 evaluating HellaSwag: 820/1256 evaluating HellaSwag: 830/1256 evaluating HellaSwag: 840/1256 evaluating HellaSwag: 850/1256 evaluating HellaSwag: 860/1256 evaluating HellaSwag: 870/1256 evaluating HellaSwag: 880/1256 evaluating HellaSwag: 890/1256 evaluating HellaSwag: 900/1256 evaluating HellaSwag: 910/1256 evaluating HellaSwag: 920/1256 evaluating HellaSwag: 930/1256 evaluating HellaSwag: 940/1256 evaluating HellaSwag: 950/1256 evaluating HellaSwag: 960/1256 evaluating HellaSwag: 970/1256 evaluating HellaSwag: 980/1256 evaluating HellaSwag: 990/1256 evaluating HellaSwag: 1000/1256 evaluating HellaSwag: 1010/1256 evaluating HellaSwag: 1020/1256 evaluating HellaSwag: 1030/1256 evaluating HellaSwag: 1040/1256 evaluating HellaSwag: 1050/1256 evaluating HellaSwag: 1060/1256 evaluating HellaSwag: 1070/1256 evaluating HellaSwag: 1080/1256 evaluating HellaSwag: 1090/1256 evaluating HellaSwag: 1100/1256 evaluating HellaSwag: 1110/1256 evaluating HellaSwag: 1120/1256 evaluating HellaSwag: 1130/1256 evaluating HellaSwag: 1140/1256 evaluating HellaSwag: 1150/1256 evaluating HellaSwag: 1160/1256 evaluating HellaSwag: 1170/1256 evaluating HellaSwag: 1180/1256 evaluating HellaSwag: 1190/1256 evaluating HellaSwag: 1200/1256 evaluating HellaSwag: 1210/1256 evaluating HellaSwag: 1220/1256 evaluating HellaSwag: 1230/1256 evaluating HellaSwag: 1240/1256 evaluating HellaSwag: 1250/1256 HellaSwag: 2885/10042 = 0.287293 step 9251/19560 | loss 3.444287 (+0.37z)| norm 0.2970 (+0.93z)| lr 3.44e-04 | 4241.72 ms | 31.8% bf16 MFU | 125765 tok/s step 9252/19560 | loss 3.415631 (-0.31z)| norm 
0.2907 (+0.54z)| lr 3.44e-04 | 4161.57 ms | 32.4% bf16 MFU | 125776 tok/s step 9253/19560 | loss 3.472366 (+1.04z)| norm 0.2994 (+1.04z)| lr 3.44e-04 | 4163.88 ms | 32.4% bf16 MFU | 125783 tok/s step 9254/19560 | loss 3.418306 (-0.25z)| norm 0.2650 (-0.99z)| lr 3.44e-04 | 4154.76 ms | 32.5% bf16 MFU | 125803 tok/s step 9255/19560 | loss 3.446099 (+0.40z)| norm 0.3138 (+1.86z)| lr 3.44e-04 | 4167.05 ms | 32.4% bf16 MFU | 125804 tok/s step 9256/19560 | loss 3.438131 (+0.22z)| norm 0.3113 (+1.68z)| lr 3.44e-04 | 4156.92 ms | 32.5% bf16 MFU | 125820 tok/s step 9257/19560 | loss 3.399928 (-0.71z)| norm 0.2919 (+0.54z)| lr 3.44e-04 | 4167.68 ms | 32.4% bf16 MFU | 125819 tok/s step 9258/19560 | loss 3.390877 (-0.92z)| norm 0.2980 (+0.93z)| lr 3.43e-04 | 4172.19 ms | 32.4% bf16 MFU | 125811 tok/s step 9259/19560 | loss 3.531147 (+2.41z)| norm 0.2714 (-0.65z)| lr 3.43e-04 | 4178.04 ms | 32.3% bf16 MFU | 125795 tok/s step 9260/19560 | loss 3.342021 (-2.02z)| norm 0.2950 (+0.75z)| lr 3.43e-04 | 4275.44 ms | 31.6% bf16 MFU | 125636 tok/s step 9261/19560 | loss 3.432881 (+0.09z)| norm 0.3066 (+1.42z)| lr 3.43e-04 | 4173.46 ms | 32.4% bf16 MFU | 125636 tok/s step 9262/19560 | loss 3.422594 (-0.15z)| norm 0.2894 (+0.39z)| lr 3.43e-04 | 4156.70 ms | 32.5% bf16 MFU | 125661 tok/s step 9263/19560 | loss 3.440135 (+0.26z)| norm 0.2899 (+0.41z)| lr 3.43e-04 | 4175.06 ms | 32.3% bf16 MFU | 125656 tok/s step 9264/19560 | loss 3.403235 (-0.60z)| norm 0.2685 (-0.84z)| lr 3.43e-04 | 4169.13 ms | 32.4% bf16 MFU | 125661 tok/s step 9265/19560 | loss 3.395924 (-0.77z)| norm 0.2574 (-1.47z)| lr 3.43e-04 | 4161.15 ms | 32.4% bf16 MFU | 125678 tok/s step 9266/19560 | loss 3.436297 (+0.18z)| norm 0.2826 (-0.01z)| lr 3.43e-04 | 4172.78 ms | 32.4% bf16 MFU | 125676 tok/s step 9267/19560 | loss 3.408094 (-0.47z)| norm 0.2707 (-0.71z)| lr 3.43e-04 | 4159.00 ms | 32.5% bf16 MFU | 125696 tok/s step 9268/19560 | loss 3.491304 (+1.50z)| norm 0.2791 (-0.22z)| lr 3.43e-04 | 4160.98 ms | 32.4% bf16 MFU | 
125711 tok/s step 9269/19560 | loss 3.434818 (+0.17z)| norm 0.4990 (+8.43z)| lr 3.43e-04 | 4162.72 ms | 32.4% bf16 MFU | 125723 tok/s step 9270/19560 | loss 3.454051 (+0.62z)| norm 0.3346 (+1.92z)| lr 3.43e-04 | 4182.33 ms | 32.3% bf16 MFU | 125704 tok/s step 9271/19560 | loss 3.478460 (+1.18z)| norm 0.2898 (+0.17z)| lr 3.43e-04 | 4173.42 ms | 32.4% bf16 MFU | 125701 tok/s step 9272/19560 | loss 3.444834 (+0.37z)| norm 0.2861 (+0.03z)| lr 3.43e-04 | 4171.43 ms | 32.4% bf16 MFU | 125700 tok/s step 9273/19560 | loss 3.430369 (+0.04z)| norm 0.3295 (+1.69z)| lr 3.43e-04 | 4164.67 ms | 32.4% bf16 MFU | 125709 tok/s step 9274/19560 | loss 3.438602 (+0.24z)| norm 0.3250 (+1.49z)| lr 3.43e-04 | 4163.74 ms | 32.4% bf16 MFU | 125720 tok/s step 9275/19560 | loss 3.430441 (+0.05z)| norm 0.2746 (-0.43z)| lr 3.43e-04 | 4164.44 ms | 32.4% bf16 MFU | 125728 tok/s step 9276/19560 | loss 3.376615 (-1.23z)| norm 0.3042 (+0.71z)| lr 3.43e-04 | 4164.35 ms | 32.4% bf16 MFU | 125737 tok/s step 9277/19560 | loss 3.396314 (-0.75z)| norm 0.2661 (-0.74z)| lr 3.43e-04 | 4162.50 ms | 32.4% bf16 MFU | 125748 tok/s step 9278/19560 | loss 3.402369 (-0.60z)| norm 0.2791 (-0.24z)| lr 3.42e-04 | 4180.01 ms | 32.3% bf16 MFU | 125732 tok/s step 9279/19560 | loss 3.407701 (-0.46z)| norm 0.2885 (+0.13z)| lr 3.42e-04 | 4163.98 ms | 32.4% bf16 MFU | 125741 tok/s step 9280/19560 | loss 3.519007 (+2.23z)| norm 0.2865 (+0.05z)| lr 3.42e-04 | 4161.59 ms | 32.4% bf16 MFU | 125753 tok/s step 9281/19560 | loss 3.420415 (-0.17z)| norm 0.2939 (+0.33z)| lr 3.42e-04 | 4161.25 ms | 32.4% bf16 MFU | 125765 tok/s step 9282/19560 | loss 3.428980 (+0.04z)| norm 0.2926 (+0.28z)| lr 3.42e-04 | 4171.68 ms | 32.4% bf16 MFU | 125761 tok/s step 9283/19560 | loss 3.446298 (+0.46z)| norm 0.2777 (-0.29z)| lr 3.42e-04 | 4171.06 ms | 32.4% bf16 MFU | 125757 tok/s step 9284/19560 | loss 3.477515 (+1.20z)| norm 0.2664 (-0.71z)| lr 3.42e-04 | 4167.32 ms | 32.4% bf16 MFU | 125760 tok/s step 9285/19560 | loss 3.407598 (-0.51z)| norm 
0.2739 (-0.43z)| lr 3.42e-04 | 4169.15 ms | 32.4% bf16 MFU | 125760 tok/s step 9286/19560 | loss 3.408569 (-0.48z)| norm 0.2600 (-0.95z)| lr 3.42e-04 | 4164.15 ms | 32.4% bf16 MFU | 125767 tok/s step 9287/19560 | loss 3.417243 (-0.26z)| norm 0.2716 (-0.51z)| lr 3.42e-04 | 4162.87 ms | 32.4% bf16 MFU | 125776 tok/s step 9288/19560 | loss 3.369723 (-1.42z)| norm 0.2520 (-1.25z)| lr 3.42e-04 | 4164.19 ms | 32.4% bf16 MFU | 125782 tok/s step 9289/19560 | loss 3.383094 (-1.07z)| norm 0.2944 (+0.36z)| lr 3.42e-04 | 4166.57 ms | 32.4% bf16 MFU | 125785 tok/s step 9290/19560 | loss 3.407119 (-0.47z)| norm 0.2894 (+0.17z)| lr 3.42e-04 | 4167.58 ms | 32.4% bf16 MFU | 125786 tok/s step 9291/19560 | loss 3.361506 (-1.57z)| norm 0.2960 (+0.41z)| lr 3.42e-04 | 4168.05 ms | 32.4% bf16 MFU | 125786 tok/s step 9292/19560 | loss 3.425715 (+0.00z)| norm 0.3125 (+1.03z)| lr 3.42e-04 | 4173.69 ms | 32.3% bf16 MFU | 125777 tok/s step 9293/19560 | loss 3.412119 (-0.32z)| norm 0.2899 (+0.16z)| lr 3.42e-04 | 4173.91 ms | 32.3% bf16 MFU | 125769 tok/s step 9294/19560 | loss 3.411693 (-0.33z)| norm 0.3103 (+0.93z)| lr 3.42e-04 | 4170.95 ms | 32.4% bf16 MFU | 125765 tok/s step 9295/19560 | loss 3.399662 (-0.62z)| norm 0.3112 (+0.95z)| lr 3.42e-04 | 4168.03 ms | 32.4% bf16 MFU | 125767 tok/s step 9296/19560 | loss 3.419865 (-0.11z)| norm 0.2849 (-0.07z)| lr 3.42e-04 | 4186.97 ms | 32.2% bf16 MFU | 125739 tok/s step 9297/19560 | loss 3.424365 (-0.00z)| norm 0.3241 (+1.41z)| lr 3.42e-04 | 4163.24 ms | 32.4% bf16 MFU | 125749 tok/s step 9298/19560 | loss 3.446087 (+0.53z)| norm 0.2549 (-1.20z)| lr 3.41e-04 | 4172.97 ms | 32.4% bf16 MFU | 125743 tok/s step 9299/19560 | loss 3.461488 (+0.90z)| norm 0.2983 (+0.43z)| lr 3.41e-04 | 4161.02 ms | 32.4% bf16 MFU | 125756 tok/s step 9300/19560 | loss 3.362884 (-1.53z)| norm 0.2889 (+0.08z)| lr 3.41e-04 | 4163.44 ms | 32.4% bf16 MFU | 125765 tok/s step 9301/19560 | loss 3.408081 (-0.41z)| norm 0.2574 (-1.10z)| lr 3.41e-04 | 4166.25 ms | 32.4% bf16 MFU | 
125769 tok/s step 9302/19560 | loss 3.391863 (-0.81z)| norm 0.2860 (-0.02z)| lr 3.41e-04 | 4161.60 ms | 32.4% bf16 MFU | 125779 tok/s step 9303/19560 | loss 3.411250 (-0.31z)| norm 0.2859 (-0.02z)| lr 3.41e-04 | 4169.80 ms | 32.4% bf16 MFU | 125777 tok/s step 9304/19560 | loss 3.481910 (+1.43z)| norm 0.2684 (-0.68z)| lr 3.41e-04 | 4162.30 ms | 32.4% bf16 MFU | 125786 tok/s step 9305/19560 | loss 3.377295 (-1.15z)| norm 0.2803 (-0.23z)| lr 3.41e-04 | 4169.14 ms | 32.4% bf16 MFU | 125785 tok/s step 9306/19560 | loss 3.461261 (+0.93z)| norm 0.2688 (-0.66z)| lr 3.41e-04 | 4166.31 ms | 32.4% bf16 MFU | 125787 tok/s step 9307/19560 | loss 3.365819 (-1.42z)| norm 0.3020 (+0.58z)| lr 3.41e-04 | 4166.58 ms | 32.4% bf16 MFU | 125790 tok/s step 9308/19560 | loss 3.425183 (+0.04z)| norm 0.2931 (+0.25z)| lr 3.41e-04 | 4176.29 ms | 32.3% bf16 MFU | 125777 tok/s step 9309/19560 | loss 3.440659 (+0.42z)| norm 0.2896 (+0.11z)| lr 3.41e-04 | 4164.11 ms | 32.4% bf16 MFU | 125784 tok/s step 9310/19560 | loss 3.383476 (-0.99z)| norm 0.2914 (+0.17z)| lr 3.41e-04 | 4171.26 ms | 32.4% bf16 MFU | 125779 tok/s step 9311/19560 | loss 3.385854 (-0.95z)| norm 0.2774 (-0.36z)| lr 3.41e-04 | 4166.92 ms | 32.4% bf16 MFU | 125781 tok/s step 9312/19560 | loss 3.471126 (+1.16z)| norm 0.2798 (-0.28z)| lr 3.41e-04 | 4192.55 ms | 32.2% bf16 MFU | 125745 tok/s step 9313/19560 | loss 3.525696 (+2.44z)| norm 0.2806 (-0.24z)| lr 3.41e-04 | 4154.09 ms | 32.5% bf16 MFU | 125768 tok/s step 9314/19560 | loss 3.473373 (+1.15z)| norm 0.3019 (+0.58z)| lr 3.41e-04 | 4177.02 ms | 32.3% bf16 MFU | 125755 tok/s step 9315/19560 | loss 3.399188 (-0.65z)| norm 0.2971 (+0.39z)| lr 3.41e-04 | 4175.19 ms | 32.3% bf16 MFU | 125746 tok/s step 9316/19560 | loss 3.474111 (+1.16z)| norm 0.2830 (-0.16z)| lr 3.41e-04 | 4175.48 ms | 32.3% bf16 MFU | 125737 tok/s step 9317/19560 | loss 3.497192 (+1.69z)| norm 0.2801 (-0.28z)| lr 3.41e-04 | 4153.63 ms | 32.5% bf16 MFU | 125761 tok/s step 9318/19560 | loss 3.443965 (+0.41z)| norm 
0.2707 (-0.63z)| lr 3.41e-04 | 4167.57 ms | 32.4% bf16 MFU | 125763 tok/s step 9319/19560 | loss 3.422476 (-0.10z)| norm 0.2659 (-0.82z)| lr 3.40e-04 | 4157.62 ms | 32.5% bf16 MFU | 125780 tok/s step 9320/19560 | loss 3.383608 (-1.02z)| norm 0.2892 (+0.09z)| lr 3.40e-04 | 4161.94 ms | 32.4% bf16 MFU | 125790 tok/s step 9321/19560 | loss 3.408188 (-0.44z)| norm 0.2833 (-0.14z)| lr 3.40e-04 | 4188.31 ms | 32.2% bf16 MFU | 125759 tok/s step 9322/19560 | loss 3.389307 (-0.90z)| norm 0.2862 (-0.03z)| lr 3.40e-04 | 4171.91 ms | 32.4% bf16 MFU | 125755 tok/s step 9323/19560 | loss 3.405424 (-0.50z)| norm 0.2855 (-0.06z)| lr 3.40e-04 | 4176.27 ms | 32.3% bf16 MFU | 125744 tok/s step 9324/19560 | loss 3.455203 (+0.71z)| norm 0.2811 (-0.24z)| lr 3.40e-04 | 4173.95 ms | 32.3% bf16 MFU | 125737 tok/s step 9325/19560 | loss 3.433107 (+0.16z)| norm 0.2704 (-0.65z)| lr 3.40e-04 | 4161.46 ms | 32.4% bf16 MFU | 125750 tok/s step 9326/19560 | loss 3.426016 (-0.02z)| norm 0.2863 (-0.03z)| lr 3.40e-04 | 4166.96 ms | 32.4% bf16 MFU | 125753 tok/s step 9327/19560 | loss 3.461012 (+0.86z)| norm 0.2486 (-1.50z)| lr 3.40e-04 | 4178.83 ms | 32.3% bf16 MFU | 125739 tok/s step 9328/19560 | loss 3.401275 (-0.62z)| norm 0.2821 (-0.18z)| lr 3.40e-04 | 4175.67 ms | 32.3% bf16 MFU | 125730 tok/s step 9329/19560 | loss 3.382802 (-1.07z)| norm 0.2688 (-0.69z)| lr 3.40e-04 | 4178.02 ms | 32.3% bf16 MFU | 125718 tok/s step 9330/19560 | loss 3.413896 (-0.29z)| norm 0.2855 (-0.05z)| lr 3.40e-04 | 4163.89 ms | 32.4% bf16 MFU | 125727 tok/s step 9331/19560 | loss 3.434261 (+0.21z)| norm 0.2555 (-1.22z)| lr 3.40e-04 | 4168.22 ms | 32.4% bf16 MFU | 125730 tok/s step 9332/19560 | loss 3.430735 (+0.13z)| norm 0.2887 (+0.08z)| lr 3.40e-04 | 4164.27 ms | 32.4% bf16 MFU | 125739 tok/s step 9333/19560 | loss 3.430514 (+0.13z)| norm 0.2471 (-1.53z)| lr 3.40e-04 | 5054.69 ms | 26.7% bf16 MFU | 124638 tok/s step 9334/19560 | loss 3.408407 (-0.44z)| norm 0.2689 (-0.67z)| lr 3.40e-04 | 4172.88 ms | 32.4% bf16 MFU | 
124688 tok/s step 9335/19560 | loss 3.404487 (-0.53z)| norm 0.2718 (-0.55z)| lr 3.40e-04 | 4167.30 ms | 32.4% bf16 MFU | 124744 tok/s step 9336/19560 | loss 3.356969 (-1.70z)| norm 0.2817 (-0.17z)| lr 3.40e-04 | 4167.09 ms | 32.4% bf16 MFU | 124798 tok/s step 9337/19560 | loss 3.446647 (+0.54z)| norm 0.2427 (-1.66z)| lr 3.40e-04 | 4165.20 ms | 32.4% bf16 MFU | 124852 tok/s step 9338/19560 | loss 3.446475 (+0.53z)| norm 0.2926 (+0.26z)| lr 3.40e-04 | 4158.66 ms | 32.5% bf16 MFU | 124913 tok/s step 9339/19560 | loss 3.396941 (-0.70z)| norm 0.2941 (+0.32z)| lr 3.39e-04 | 4160.72 ms | 32.5% bf16 MFU | 124967 tok/s step 9340/19560 | loss 3.504538 (+1.98z)| norm 0.2992 (+0.52z)| lr 3.39e-04 | 4169.25 ms | 32.4% bf16 MFU | 125007 tok/s step 9341/19560 | loss 3.489652 (+1.62z)| norm 0.2974 (+0.46z)| lr 3.39e-04 | 4165.44 ms | 32.4% bf16 MFU | 125050 tok/s step 9342/19560 | loss 3.396815 (-0.70z)| norm 0.2543 (-1.20z)| lr 3.39e-04 | 4548.97 ms | 29.7% bf16 MFU | 124560 tok/s step 9343/19560 | loss 3.457375 (+0.82z)| norm 0.3013 (+0.61z)| lr 3.39e-04 | 4186.70 ms | 32.2% bf16 MFU | 124593 tok/s step 9344/19560 | loss 3.458868 (+0.84z)| norm 0.2981 (+0.50z)| lr 3.39e-04 | 4178.77 ms | 32.3% bf16 MFU | 124637 tok/s step 9345/19560 | loss 3.362829 (-1.53z)| norm 0.2772 (-0.31z)| lr 3.39e-04 | 4280.64 ms | 31.5% bf16 MFU | 124529 tok/s step 9346/19560 | loss 3.428150 (+0.09z)| norm 0.2868 (+0.06z)| lr 3.39e-04 | 4167.13 ms | 32.4% bf16 MFU | 124593 tok/s step 9347/19560 | loss 3.519812 (+2.32z)| norm 0.2862 (+0.04z)| lr 3.39e-04 | 4177.16 ms | 32.3% bf16 MFU | 124639 tok/s step 9348/19560 | loss 3.435165 (+0.23z)| norm 0.3033 (+0.69z)| lr 3.39e-04 | 4181.67 ms | 32.3% bf16 MFU | 124676 tok/s step 9349/19560 | loss 3.383804 (-1.02z)| norm 0.2683 (-0.65z)| lr 3.39e-04 | 4170.90 ms | 32.4% bf16 MFU | 124727 tok/s step 9350/19560 | loss 3.441800 (+0.41z)| norm 0.2901 (+0.19z)| lr 3.39e-04 | 4159.18 ms | 32.5% bf16 MFU | 124794 tok/s step 9351/19560 | loss 3.456883 (+0.78z)| norm 
0.2804 (-0.19z)| lr 3.39e-04 | 4170.18 ms | 32.4% bf16 MFU | 124840 tok/s step 9352/19560 | loss 3.437245 (+0.28z)| norm 0.2933 (+0.30z)| lr 3.39e-04 | 4166.81 ms | 32.4% bf16 MFU | 124890 tok/s step 9353/19560 | loss 3.403036 (-0.60z)| norm 0.2746 (-0.43z)| lr 3.39e-04 | 4174.39 ms | 32.3% bf16 MFU | 124925 tok/s step 9354/19560 | loss 3.388149 (-0.97z)| norm 0.2690 (-0.65z)| lr 3.39e-04 | 4160.00 ms | 32.5% bf16 MFU | 124980 tok/s step 9355/19560 | loss 3.434853 (+0.23z)| norm 0.2608 (-0.96z)| lr 3.39e-04 | 4171.58 ms | 32.4% bf16 MFU | 125015 tok/s step 9356/19560 | loss 3.421599 (-0.12z)| norm 0.2691 (-0.64z)| lr 3.39e-04 | 4161.61 ms | 32.4% bf16 MFU | 125064 tok/s step 9357/19560 | loss 3.445947 (+0.51z)| norm 0.2535 (-1.22z)| lr 3.39e-04 | 4188.49 ms | 32.2% bf16 MFU | 125069 tok/s step 9358/19560 | loss 3.483770 (+1.47z)| norm 0.2646 (-0.78z)| lr 3.39e-04 | 4164.69 ms | 32.4% bf16 MFU | 125110 tok/s step 9359/19560 | loss 3.347675 (-2.02z)| norm 0.2851 (+0.02z)| lr 3.38e-04 | 4167.66 ms | 32.4% bf16 MFU | 125144 tok/s step 9360/19560 | loss 3.493728 (+1.72z)| norm 0.2669 (-0.68z)| lr 3.38e-04 | 4162.06 ms | 32.4% bf16 MFU | 125186 tok/s step 9361/19560 | loss 3.458187 (+0.81z)| norm 0.2672 (-0.67z)| lr 3.38e-04 | 4161.29 ms | 32.4% bf16 MFU | 125226 tok/s step 9362/19560 | loss 3.491375 (+1.63z)| norm 0.2724 (-0.46z)| lr 3.38e-04 | 4154.04 ms | 32.5% bf16 MFU | 125275 tok/s step 9363/19560 | loss 3.335145 (-2.30z)| norm 0.2617 (-0.87z)| lr 3.38e-04 | 4173.05 ms | 32.4% bf16 MFU | 125293 tok/s step 9364/19560 | loss 3.399195 (-0.70z)| norm 0.2977 (+0.53z)| lr 3.38e-04 | 4158.40 ms | 32.5% bf16 MFU | 125333 tok/s step 9365/19560 | loss 3.429841 (+0.07z)| norm 0.2426 (-1.58z)| lr 3.38e-04 | 4166.32 ms | 32.4% bf16 MFU | 125358 tok/s step 9366/19560 | loss 3.403720 (-0.58z)| norm 0.2876 (+0.14z)| lr 3.38e-04 | 4180.69 ms | 32.3% bf16 MFU | 125360 tok/s step 9367/19560 | loss 3.422943 (-0.09z)| norm 0.2799 (-0.16z)| lr 3.38e-04 | 4201.95 ms | 32.1% bf16 MFU | 
125331 tok/s step 9368/19560 | loss 3.401616 (-0.64z)| norm 0.2638 (-0.79z)| lr 3.38e-04 | 4158.55 ms | 32.5% bf16 MFU | 125368 tok/s step 9369/19560 | loss 3.326104 (-2.46z)| norm 0.2741 (-0.39z)| lr 3.38e-04 | 4165.34 ms | 32.4% bf16 MFU | 125393 tok/s step 9370/19560 | loss 3.419844 (-0.16z)| norm 0.2463 (-1.46z)| lr 3.38e-04 | 4174.96 ms | 32.3% bf16 MFU | 125403 tok/s step 9371/19560 | loss 3.497838 (+1.74z)| norm 0.2729 (-0.43z)| lr 3.38e-04 | 4161.73 ms | 32.4% bf16 MFU | 125431 tok/s step 9372/19560 | loss 3.510614 (+2.01z)| norm 0.2805 (-0.14z)| lr 3.38e-04 | 4160.56 ms | 32.5% bf16 MFU | 125460 tok/s step 9373/19560 | loss 3.471364 (+1.04z)| norm 0.2928 (+0.33z)| lr 3.38e-04 | 4167.71 ms | 32.4% bf16 MFU | 125477 tok/s step 9374/19560 | loss 3.413183 (-0.36z)| norm 0.2811 (-0.12z)| lr 3.38e-04 | 4153.46 ms | 32.5% bf16 MFU | 125515 tok/s step 9375/19560 | loss 3.453156 (+0.60z)| norm 0.3225 (+1.46z)| lr 3.38e-04 | 4171.00 ms | 32.4% bf16 MFU | 125524 tok/s step 9376/19560 | loss 3.402761 (-0.61z)| norm 0.3103 (+0.97z)| lr 3.38e-04 | 4164.62 ms | 32.4% bf16 MFU | 125542 tok/s step 9377/19560 | loss 3.415321 (-0.30z)| norm 0.2869 (+0.08z)| lr 3.38e-04 | 4188.16 ms | 32.2% bf16 MFU | 125524 tok/s step 9378/19560 | loss 3.410828 (-0.40z)| norm 0.3053 (+0.78z)| lr 3.38e-04 | 4158.96 ms | 32.5% bf16 MFU | 125551 tok/s step 9379/19560 | loss 3.367033 (-1.46z)| norm 0.2939 (+0.34z)| lr 3.37e-04 | 4183.77 ms | 32.3% bf16 MFU | 125539 tok/s step 9380/19560 | loss 3.449869 (+0.57z)| norm 0.2682 (-0.63z)| lr 3.37e-04 | 4161.95 ms | 32.4% bf16 MFU | 125561 tok/s step 9381/19560 | loss 3.403193 (-0.56z)| norm 0.2876 (+0.11z)| lr 3.37e-04 | 4168.84 ms | 32.4% bf16 MFU | 125571 tok/s step 9382/19560 | loss 3.456249 (+0.74z)| norm 0.2701 (-0.56z)| lr 3.37e-04 | 4160.39 ms | 32.5% bf16 MFU | 125594 tok/s step 9383/19560 | loss 3.393578 (-0.80z)| norm 0.2784 (-0.23z)| lr 3.37e-04 | 4166.34 ms | 32.4% bf16 MFU | 125606 tok/s step 9384/19560 | loss 3.437501 (+0.28z)| norm 
0.2932 (+0.34z)| lr 3.37e-04 | 4161.68 ms | 32.4% bf16 MFU | 125625 tok/s step 9385/19560 | loss 3.420661 (-0.13z)| norm 0.2805 (-0.14z)| lr 3.37e-04 | 4238.09 ms | 31.9% bf16 MFU | 125529 tok/s step 9386/19560 | loss 3.476346 (+1.22z)| norm 0.2752 (-0.34z)| lr 3.37e-04 | 4175.26 ms | 32.3% bf16 MFU | 125531 tok/s step 9387/19560 | loss 3.489089 (+1.56z)| norm 0.3256 (+1.58z)| lr 3.37e-04 | 4181.27 ms | 32.3% bf16 MFU | 125524 tok/s step 9388/19560 | loss 3.371213 (-1.40z)| norm 0.3368 (+1.96z)| lr 3.37e-04 | 4182.20 ms | 32.3% bf16 MFU | 125516 tok/s step 9389/19560 | loss 3.364408 (-1.54z)| norm 0.3049 (+0.76z)| lr 3.37e-04 | 4164.08 ms | 32.4% bf16 MFU | 125535 tok/s step 9390/19560 | loss 3.468544 (+1.05z)| norm 0.2975 (+0.48z)| lr 3.37e-04 | 4162.30 ms | 32.4% bf16 MFU | 125557 tok/s step 9391/19560 | loss 3.414762 (-0.29z)| norm 0.2880 (+0.12z)| lr 3.37e-04 | 4179.24 ms | 32.3% bf16 MFU | 125551 tok/s step 9392/19560 | loss 3.440331 (+0.34z)| norm 0.3297 (+1.66z)| lr 3.37e-04 | 4170.64 ms | 32.4% bf16 MFU | 125559 tok/s step 9393/19560 | loss 3.406675 (-0.50z)| norm 0.2958 (+0.38z)| lr 3.37e-04 | 4345.35 ms | 31.1% bf16 MFU | 125314 tok/s step 9394/19560 | loss 3.425726 (-0.02z)| norm 0.2919 (+0.23z)| lr 3.37e-04 | 4172.24 ms | 32.4% bf16 MFU | 125331 tok/s step 9395/19560 | loss 3.371776 (-1.35z)| norm 0.2756 (-0.38z)| lr 3.37e-04 | 4178.29 ms | 32.3% bf16 MFU | 125339 tok/s step 9396/19560 | loss 3.361242 (-1.59z)| norm 0.2873 (+0.06z)| lr 3.37e-04 | 4245.63 ms | 31.8% bf16 MFU | 125246 tok/s step 9397/19560 | loss 3.364436 (-1.48z)| norm 0.2994 (+0.81z)| lr 3.37e-04 | 4194.97 ms | 32.2% bf16 MFU | 125233 tok/s step 9398/19560 | loss 3.310149 (-2.71z)| norm 0.2687 (-0.82z)| lr 3.37e-04 | 4171.59 ms | 32.4% bf16 MFU | 125255 tok/s step 9399/19560 | loss 3.431439 (+0.20z)| norm 0.2793 (-0.23z)| lr 3.36e-04 | 4255.08 ms | 31.7% bf16 MFU | 125153 tok/s step 9400/19560 | loss 3.416900 (-0.15z)| norm 0.3010 (+0.94z)| lr 3.36e-04 | 4166.25 ms | 32.4% bf16 MFU | 
125188 tok/s step 9401/19560 | loss 3.397836 (-0.60z)| norm 0.3036 (+1.12z)| lr 3.36e-04 | 4206.41 ms | 32.1% bf16 MFU | 125160 tok/s step 9402/19560 | loss 3.525541 (+2.40z)| norm 0.3062 (+1.28z)| lr 3.36e-04 | 4194.11 ms | 32.2% bf16 MFU | 125153 tok/s step 9403/19560 | loss 3.416778 (-0.15z)| norm 0.3168 (+1.84z)| lr 3.36e-04 | 4168.79 ms | 32.4% bf16 MFU | 125183 tok/s step 9404/19560 | loss 3.382234 (-0.97z)| norm 0.2709 (-0.70z)| lr 3.36e-04 | 4172.58 ms | 32.4% bf16 MFU | 125207 tok/s step 9405/19560 | loss 3.431188 (+0.18z)| norm 0.2992 (+0.87z)| lr 3.36e-04 | 4157.30 ms | 32.5% bf16 MFU | 125252 tok/s step 9406/19560 | loss 3.436347 (+0.29z)| norm 0.2842 (+0.02z)| lr 3.36e-04 | 4157.84 ms | 32.5% bf16 MFU | 125294 tok/s step 9407/19560 | loss 3.417434 (-0.15z)| norm 0.2736 (-0.56z)| lr 3.36e-04 | 4174.04 ms | 32.3% bf16 MFU | 125310 tok/s step 9408/19560 | loss 3.417818 (-0.13z)| norm 0.2892 (+0.31z)| lr 3.36e-04 | 4163.62 ms | 32.4% bf16 MFU | 125340 tok/s step 9409/19560 | loss 3.393045 (-0.72z)| norm 0.2717 (-0.66z)| lr 3.36e-04 | 4170.37 ms | 32.4% bf16 MFU | 125359 tok/s step 9410/19560 | loss 3.496399 (+1.73z)| norm 0.2799 (-0.20z)| lr 3.36e-04 | 4175.33 ms | 32.3% bf16 MFU | 125370 tok/s step 9411/19560 | loss 3.477888 (+1.28z)| norm 0.2807 (-0.15z)| lr 3.36e-04 | 4165.90 ms | 32.4% bf16 MFU | 125394 tok/s step 9412/19560 | loss 3.411070 (-0.29z)| norm 0.2797 (-0.21z)| lr 3.36e-04 | 4166.45 ms | 32.4% bf16 MFU | 125416 tok/s step 9413/19560 | loss 3.385995 (-0.88z)| norm 0.2572 (-1.46z)| lr 3.36e-04 | 4154.72 ms | 32.5% bf16 MFU | 125455 tok/s step 9414/19560 | loss 3.353611 (-1.62z)| norm 0.2814 (-0.12z)| lr 3.36e-04 | 4165.80 ms | 32.4% bf16 MFU | 125475 tok/s step 9415/19560 | loss 3.507506 (+1.95z)| norm 0.6880 (+10.08z)| lr 3.36e-04 | 4161.46 ms | 32.4% bf16 MFU | 125500 tok/s step 9416/19560 | loss 3.428534 (+0.11z)| norm 0.4056 (+2.87z)| lr 3.36e-04 | 4182.31 ms | 32.3% bf16 MFU | 125493 tok/s step 9417/19560 | loss 3.527867 (+2.35z)| norm 
0.3114 (+0.57z)| lr 3.36e-04 | 4167.98 ms | 32.4% bf16 MFU | 125508 tok/s step 9418/19560 | loss 3.340312 (-1.89z)| norm 0.3996 (+2.62z)| lr 3.36e-04 | 4173.13 ms | 32.4% bf16 MFU | 125514 tok/s step 9419/19560 | loss 3.393354 (-0.71z)| norm 0.3144 (+0.60z)| lr 3.35e-04 | 4167.25 ms | 32.4% bf16 MFU | 125529 tok/s step 9420/19560 | loss 3.399601 (-0.56z)| norm 0.3535 (+1.50z)| lr 3.35e-04 | 4156.19 ms | 32.5% bf16 MFU | 125560 tok/s step 9421/19560 | loss 3.484567 (+1.34z)| norm 0.3068 (+0.40z)| lr 3.35e-04 | 4163.92 ms | 32.4% bf16 MFU | 125578 tok/s step 9422/19560 | loss 3.402743 (-0.50z)| norm 0.3045 (+0.35z)| lr 3.35e-04 | 4158.93 ms | 32.5% bf16 MFU | 125602 tok/s step 9423/19560 | loss 3.444582 (+0.43z)| norm 0.2990 (+0.22z)| lr 3.35e-04 | 5310.42 ms | 25.4% bf16 MFU | 124258 tok/s step 9424/19560 | loss 3.561875 (+2.94z)| norm 0.5759 (+5.74z)| lr 3.35e-04 | 5434.85 ms | 24.8% bf16 MFU | 122869 tok/s step 9425/19560 | loss 3.403008 (-0.50z)| norm 0.3693 (+1.55z)| lr 3.35e-04 | 4144.03 ms | 32.6% bf16 MFU | 123051 tok/s step 9426/19560 | loss 3.432784 (+0.14z)| norm 0.2859 (-0.13z)| lr 3.35e-04 | 4192.60 ms | 32.2% bf16 MFU | 123151 tok/s step 9427/19560 | loss 3.491541 (+1.41z)| norm 0.3380 (+0.91z)| lr 3.35e-04 | 4149.61 ms | 32.5% bf16 MFU | 123311 tok/s step 9428/19560 | loss 3.557758 (+2.74z)| norm 0.3137 (+0.42z)| lr 3.35e-04 | 4159.30 ms | 32.5% bf16 MFU | 123448 tok/s step 9429/19560 | loss 3.460642 (+0.68z)| norm 0.2884 (-0.09z)| lr 3.35e-04 | 4159.40 ms | 32.5% bf16 MFU | 123578 tok/s step 9430/19560 | loss 3.461905 (+0.70z)| norm 0.2741 (-0.38z)| lr 3.35e-04 | 4155.60 ms | 32.5% bf16 MFU | 123707 tok/s step 9431/19560 | loss 3.408648 (-0.43z)| norm 0.3022 (+0.18z)| lr 3.35e-04 | 4149.38 ms | 32.5% bf16 MFU | 123840 tok/s step 9432/19560 | loss 3.437428 (+0.19z)| norm 0.2867 (-0.13z)| lr 3.35e-04 | 4146.03 ms | 32.6% bf16 MFU | 123970 tok/s step 9433/19560 | loss 3.470376 (+0.87z)| norm 0.2809 (-0.25z)| lr 3.35e-04 | 4154.97 ms | 32.5% bf16 MFU | 
124081 tok/s step 9434/19560 | loss 3.454646 (+0.54z)| norm 0.2865 (-0.14z)| lr 3.35e-04 | 4167.62 ms | 32.4% bf16 MFU | 124167 tok/s step 9435/19560 | loss 3.491396 (+1.30z)| norm 0.2916 (-0.04z)| lr 3.35e-04 | 4152.78 ms | 32.5% bf16 MFU | 124271 tok/s step 9436/19560 | loss 3.482322 (+1.09z)| norm 0.2815 (-0.24z)| lr 3.35e-04 | 4154.21 ms | 32.5% bf16 MFU | 124368 tok/s step 9437/19560 | loss 3.446761 (+0.34z)| norm 0.2756 (-0.35z)| lr 3.35e-04 | 4155.84 ms | 32.5% bf16 MFU | 124457 tok/s step 9438/19560 | loss 3.424744 (-0.13z)| norm 0.2811 (-0.24z)| lr 3.35e-04 | 4159.26 ms | 32.5% bf16 MFU | 124537 tok/s step 9439/19560 | loss 3.411487 (-0.42z)| norm 0.2664 (-0.53z)| lr 3.35e-04 | 4154.28 ms | 32.5% bf16 MFU | 124620 tok/s step 9440/19560 | loss 3.455704 (+0.53z)| norm 0.2630 (-0.60z)| lr 3.34e-04 | 4162.97 ms | 32.4% bf16 MFU | 124686 tok/s step 9441/19560 | loss 3.407583 (-0.49z)| norm 0.2863 (-0.13z)| lr 3.34e-04 | 4170.71 ms | 32.4% bf16 MFU | 124738 tok/s step 9442/19560 | loss 3.461323 (+0.68z)| norm 0.2754 (-0.35z)| lr 3.34e-04 | 4160.55 ms | 32.5% bf16 MFU | 124801 tok/s step 9443/19560 | loss 3.358285 (-1.54z)| norm 0.2812 (-0.23z)| lr 3.34e-04 | 4154.71 ms | 32.5% bf16 MFU | 124871 tok/s step 9444/19560 | loss 3.406942 (-0.48z)| norm 0.2745 (-0.36z)| lr 3.34e-04 | 4156.56 ms | 32.5% bf16 MFU | 124934 tok/s step 9445/19560 | loss 3.377861 (-1.09z)| norm 0.2631 (-0.58z)| lr 3.34e-04 | 4162.86 ms | 32.4% bf16 MFU | 124985 tok/s step 9446/19560 | loss 3.447678 (+0.42z)| norm 0.3001 (+0.15z)| lr 3.34e-04 | 4161.13 ms | 32.4% bf16 MFU | 125035 tok/s step 9447/19560 | loss 3.410703 (-0.38z)| norm 0.2574 (-0.70z)| lr 3.34e-04 | 4159.38 ms | 32.5% bf16 MFU | 125086 tok/s step 9448/19560 | loss 3.431305 (+0.06z)| norm 0.2781 (-0.29z)| lr 3.34e-04 | 4159.97 ms | 32.5% bf16 MFU | 125133 tok/s step 9449/19560 | loss 3.385697 (-0.92z)| norm 0.2698 (-0.45z)| lr 3.34e-04 | 4165.75 ms | 32.4% bf16 MFU | 125169 tok/s step 9450/19560 | loss 3.374099 (-1.17z)| norm 
0.3015 (+0.18z)| lr 3.34e-04 | 4317.17 ms | 31.3% bf16 MFU | 124983 tok/s step 9451/19560 | loss 3.404260 (-0.52z)| norm 0.2607 (-0.63z)| lr 3.34e-04 | 4255.97 ms | 31.7% bf16 MFU | 124893 tok/s step 9452/19560 | loss 3.394428 (-0.72z)| norm 0.2807 (-0.23z)| lr 3.34e-04 | 4182.00 ms | 32.3% bf16 MFU | 124917 tok/s step 9453/19560 | loss 3.413908 (-0.30z)| norm 0.2855 (-0.14z)| lr 3.34e-04 | 4151.00 ms | 32.5% bf16 MFU | 124986 tok/s step 9454/19560 | loss 3.440980 (+0.29z)| norm 0.2567 (-0.71z)| lr 3.34e-04 | 4153.55 ms | 32.5% bf16 MFU | 125048 tok/s step 9455/19560 | loss 3.439285 (+0.25z)| norm 0.2849 (-0.15z)| lr 3.34e-04 | 4194.49 ms | 32.2% bf16 MFU | 125046 tok/s step 9456/19560 | loss 3.443097 (+0.33z)| norm 0.2517 (-0.81z)| lr 3.34e-04 | 4156.80 ms | 32.5% bf16 MFU | 125100 tok/s step 9457/19560 | loss 3.421263 (-0.15z)| norm 0.2879 (-0.09z)| lr 3.34e-04 | 4162.14 ms | 32.4% bf16 MFU | 125143 tok/s step 9458/19560 | loss 3.464421 (+0.78z)| norm 0.2958 (+0.07z)| lr 3.34e-04 | 4162.76 ms | 32.4% bf16 MFU | 125183 tok/s step 9459/19560 | loss 3.503579 (+1.60z)| norm 0.2665 (-0.52z)| lr 3.34e-04 | 4154.97 ms | 32.5% bf16 MFU | 125233 tok/s step 9460/19560 | loss 3.415072 (-0.30z)| norm 0.2904 (-0.04z)| lr 3.33e-04 | 4160.28 ms | 32.5% bf16 MFU | 125273 tok/s step 9461/19560 | loss 3.428624 (-0.01z)| norm 0.2937 (+0.02z)| lr 3.33e-04 | 4160.13 ms | 32.5% bf16 MFU | 125310 tok/s step 9462/19560 | loss 3.356880 (-1.52z)| norm 0.2788 (-0.28z)| lr 3.33e-04 | 4159.15 ms | 32.5% bf16 MFU | 125348 tok/s step 9463/19560 | loss 3.426153 (-0.06z)| norm 0.2657 (-0.54z)| lr 3.33e-04 | 4144.40 ms | 32.6% bf16 MFU | 125406 tok/s step 9464/19560 | loss 3.442068 (+0.27z)| norm 0.3068 (+0.27z)| lr 3.33e-04 | 4165.76 ms | 32.4% bf16 MFU | 125428 tok/s step 9465/19560 | loss 3.402290 (-0.57z)| norm 0.2413 (-1.04z)| lr 3.33e-04 | 4155.10 ms | 32.5% bf16 MFU | 125466 tok/s step 9466/19560 | loss 3.404375 (-0.52z)| norm 0.2771 (-0.32z)| lr 3.33e-04 | 4160.18 ms | 32.5% bf16 MFU | 
125494 tok/s step 9467/19560 | loss 3.336149 (-1.95z)| norm 0.2708 (-0.44z)| lr 3.33e-04 | 4157.44 ms | 32.5% bf16 MFU | 125524 tok/s step 9468/19560 | loss 3.430859 (+0.07z)| norm 0.2959 (+0.06z)| lr 3.33e-04 | 4161.45 ms | 32.4% bf16 MFU | 125548 tok/s step 9469/19560 | loss 3.407864 (-0.41z)| norm 0.3271 (+0.68z)| lr 3.33e-04 | 4168.89 ms | 32.4% bf16 MFU | 125558 tok/s step 9470/19560 | loss 3.369833 (-1.22z)| norm 0.3001 (+0.14z)| lr 3.33e-04 | 4190.11 ms | 32.2% bf16 MFU | 125537 tok/s step 9471/19560 | loss 3.389956 (-0.78z)| norm 0.2847 (-0.17z)| lr 3.33e-04 | 4155.26 ms | 32.5% bf16 MFU | 125568 tok/s step 9472/19560 | loss 3.412103 (-0.30z)| norm 0.3180 (+0.49z)| lr 3.33e-04 | 4159.49 ms | 32.5% bf16 MFU | 125592 tok/s step 9473/19560 | loss 3.484901 (+1.24z)| norm 0.2764 (-0.34z)| lr 3.33e-04 | 4163.21 ms | 32.4% bf16 MFU | 125609 tok/s step 9474/19560 | loss 3.533890 (+2.23z)| norm 0.2966 (+0.06z)| lr 3.33e-04 | 4154.31 ms | 32.5% bf16 MFU | 125639 tok/s step 9475/19560 | loss 3.415329 (-0.25z)| norm 0.2571 (-0.72z)| lr 3.33e-04 | 4151.26 ms | 32.5% bf16 MFU | 125672 tok/s step 9476/19560 | loss 3.425030 (-0.04z)| norm 0.2710 (-0.44z)| lr 3.33e-04 | 4170.82 ms | 32.4% bf16 MFU | 125674 tok/s step 9477/19560 | loss 3.468987 (+0.89z)| norm 0.3009 (+0.15z)| lr 3.33e-04 | 4155.68 ms | 32.5% bf16 MFU | 125698 tok/s step 9478/19560 | loss 3.501156 (+1.55z)| norm 0.2651 (-0.56z)| lr 3.33e-04 | 4158.80 ms | 32.5% bf16 MFU | 125716 tok/s step 9479/19560 | loss 3.499117 (+1.49z)| norm 0.2798 (-0.26z)| lr 3.33e-04 | 4151.85 ms | 32.5% bf16 MFU | 125745 tok/s step 9480/19560 | loss 3.416347 (-0.25z)| norm 0.2811 (-0.23z)| lr 3.32e-04 | 4161.52 ms | 32.4% bf16 MFU | 125757 tok/s step 9481/19560 | loss 3.344945 (-1.72z)| norm 0.2881 (-0.10z)| lr 3.32e-04 | 4166.32 ms | 32.4% bf16 MFU | 125761 tok/s step 9482/19560 | loss 3.418179 (-0.20z)| norm 0.2863 (-0.14z)| lr 3.32e-04 | 4159.35 ms | 32.5% bf16 MFU | 125775 tok/s step 9483/19560 | loss 3.379561 (-0.99z)| norm 
0.3123 (+0.38z)| lr 3.32e-04 | 4148.31 ms | 32.5% bf16 MFU | 125806 tok/s step 9484/19560 | loss 3.443957 (+0.34z)| norm 0.2707 (-0.46z)| lr 3.32e-04 | 4157.88 ms | 32.5% bf16 MFU | 125820 tok/s step 9485/19560 | loss 3.396055 (-0.65z)| norm 0.2996 (+0.11z)| lr 3.32e-04 | 4166.45 ms | 32.4% bf16 MFU | 125821 tok/s step 9486/19560 | loss 3.486410 (+1.22z)| norm 0.3055 (+0.23z)| lr 3.32e-04 | 4160.16 ms | 32.5% bf16 MFU | 125831 tok/s step 9487/19560 | loss 3.380865 (-0.97z)| norm 0.2822 (-0.24z)| lr 3.32e-04 | 4151.98 ms | 32.5% bf16 MFU | 125853 tok/s step 9488/19560 | loss 3.478463 (+1.07z)| norm 0.3245 (+0.60z)| lr 3.32e-04 | 4159.34 ms | 32.5% bf16 MFU | 125863 tok/s step 9489/19560 | loss 3.448314 (+0.44z)| norm 0.2713 (-0.47z)| lr 3.32e-04 | 4156.71 ms | 32.5% bf16 MFU | 125877 tok/s step 9490/19560 | loss 3.424268 (-0.05z)| norm 0.2859 (-0.18z)| lr 3.32e-04 | 4169.49 ms | 32.4% bf16 MFU | 125870 tok/s step 9491/19560 | loss 3.451174 (+0.50z)| norm 0.2775 (-0.35z)| lr 3.32e-04 | 4167.92 ms | 32.4% bf16 MFU | 125866 tok/s step 9492/19560 | loss 3.461288 (+0.71z)| norm 0.2637 (-0.62z)| lr 3.32e-04 | 4154.86 ms | 32.5% bf16 MFU | 125882 tok/s step 9493/19560 | loss 3.458668 (+0.64z)| norm 0.2780 (-0.34z)| lr 3.32e-04 | 4163.60 ms | 32.4% bf16 MFU | 125884 tok/s step 9494/19560 | loss 3.338865 (-1.88z)| norm 0.2642 (-0.61z)| lr 3.32e-04 | 4159.47 ms | 32.5% bf16 MFU | 125892 tok/s step 9495/19560 | loss 3.402552 (-0.53z)| norm 0.2618 (-0.66z)| lr 3.32e-04 | 4151.82 ms | 32.5% bf16 MFU | 125912 tok/s step 9496/19560 | loss 3.428602 (+0.01z)| norm 0.2749 (-0.40z)| lr 3.32e-04 | 4148.98 ms | 32.5% bf16 MFU | 125934 tok/s step 9497/19560 | loss 3.429883 (+0.02z)| norm 0.2586 (-0.72z)| lr 3.32e-04 | 4158.53 ms | 32.5% bf16 MFU | 125941 tok/s step 9498/19560 | loss 3.525921 (+2.03z)| norm 0.2863 (-0.17z)| lr 3.32e-04 | 4161.10 ms | 32.4% bf16 MFU | 125944 tok/s step 9499/19560 | loss 3.386744 (-0.89z)| norm 0.2711 (-0.48z)| lr 3.32e-04 | 4164.91 ms | 32.4% bf16 MFU | 
125941 tok/s step 9500/19560 | loss 3.413628 (-0.31z)| norm 0.2821 (-0.26z)| lr 3.31e-04 | 4150.44 ms | 32.5% bf16 MFU | 125960 tok/s val loss 3.398757 evaluating HellaSwag: 0/1256 evaluating HellaSwag: 10/1256 evaluating HellaSwag: 20/1256 evaluating HellaSwag: 30/1256 evaluating HellaSwag: 40/1256 evaluating HellaSwag: 50/1256 evaluating HellaSwag: 60/1256 evaluating HellaSwag: 70/1256 evaluating HellaSwag: 80/1256 evaluating HellaSwag: 90/1256 evaluating HellaSwag: 100/1256 evaluating HellaSwag: 110/1256 evaluating HellaSwag: 120/1256 evaluating HellaSwag: 130/1256 evaluating HellaSwag: 140/1256 evaluating HellaSwag: 150/1256 evaluating HellaSwag: 160/1256 evaluating HellaSwag: 170/1256 evaluating HellaSwag: 180/1256 evaluating HellaSwag: 190/1256 evaluating HellaSwag: 200/1256 evaluating HellaSwag: 210/1256 evaluating HellaSwag: 220/1256 evaluating HellaSwag: 230/1256 evaluating HellaSwag: 240/1256 evaluating HellaSwag: 250/1256 evaluating HellaSwag: 260/1256 evaluating HellaSwag: 270/1256 evaluating HellaSwag: 280/1256 evaluating HellaSwag: 290/1256 evaluating HellaSwag: 300/1256 evaluating HellaSwag: 310/1256 evaluating HellaSwag: 320/1256 evaluating HellaSwag: 330/1256 evaluating HellaSwag: 340/1256 evaluating HellaSwag: 350/1256 evaluating HellaSwag: 360/1256 evaluating HellaSwag: 370/1256 evaluating HellaSwag: 380/1256 evaluating HellaSwag: 390/1256 evaluating HellaSwag: 400/1256 evaluating HellaSwag: 410/1256 evaluating HellaSwag: 420/1256 evaluating HellaSwag: 430/1256 evaluating HellaSwag: 440/1256 evaluating HellaSwag: 450/1256 evaluating HellaSwag: 460/1256 evaluating HellaSwag: 470/1256 evaluating HellaSwag: 480/1256 evaluating HellaSwag: 490/1256 evaluating HellaSwag: 500/1256 evaluating HellaSwag: 510/1256 evaluating HellaSwag: 520/1256 evaluating HellaSwag: 530/1256 evaluating HellaSwag: 540/1256 evaluating HellaSwag: 550/1256 evaluating HellaSwag: 560/1256 evaluating HellaSwag: 570/1256 evaluating HellaSwag: 580/1256 evaluating HellaSwag: 
590/1256 evaluating HellaSwag: 600/1256 evaluating HellaSwag: 610/1256 evaluating HellaSwag: 620/1256 evaluating HellaSwag: 630/1256 evaluating HellaSwag: 640/1256 evaluating HellaSwag: 650/1256 evaluating HellaSwag: 660/1256 evaluating HellaSwag: 670/1256 evaluating HellaSwag: 680/1256 evaluating HellaSwag: 690/1256 evaluating HellaSwag: 700/1256 evaluating HellaSwag: 710/1256 evaluating HellaSwag: 720/1256 evaluating HellaSwag: 730/1256 evaluating HellaSwag: 740/1256 evaluating HellaSwag: 750/1256 evaluating HellaSwag: 760/1256 evaluating HellaSwag: 770/1256 evaluating HellaSwag: 780/1256 evaluating HellaSwag: 790/1256 evaluating HellaSwag: 800/1256 evaluating HellaSwag: 810/1256 evaluating HellaSwag: 820/1256 evaluating HellaSwag: 830/1256 evaluating HellaSwag: 840/1256 evaluating HellaSwag: 850/1256 evaluating HellaSwag: 860/1256 evaluating HellaSwag: 870/1256 evaluating HellaSwag: 880/1256 evaluating HellaSwag: 890/1256 evaluating HellaSwag: 900/1256 evaluating HellaSwag: 910/1256 evaluating HellaSwag: 920/1256 evaluating HellaSwag: 930/1256 evaluating HellaSwag: 940/1256 evaluating HellaSwag: 950/1256 evaluating HellaSwag: 960/1256 evaluating HellaSwag: 970/1256 evaluating HellaSwag: 980/1256 evaluating HellaSwag: 990/1256 evaluating HellaSwag: 1000/1256 evaluating HellaSwag: 1010/1256 evaluating HellaSwag: 1020/1256 evaluating HellaSwag: 1030/1256 evaluating HellaSwag: 1040/1256 evaluating HellaSwag: 1050/1256 evaluating HellaSwag: 1060/1256 evaluating HellaSwag: 1070/1256 evaluating HellaSwag: 1080/1256 evaluating HellaSwag: 1090/1256 evaluating HellaSwag: 1100/1256 evaluating HellaSwag: 1110/1256 evaluating HellaSwag: 1120/1256 evaluating HellaSwag: 1130/1256 evaluating HellaSwag: 1140/1256 evaluating HellaSwag: 1150/1256 evaluating HellaSwag: 1160/1256 evaluating HellaSwag: 1170/1256 evaluating HellaSwag: 1180/1256 evaluating HellaSwag: 1190/1256 evaluating HellaSwag: 1200/1256 evaluating HellaSwag: 1210/1256 evaluating HellaSwag: 1220/1256 evaluating 
HellaSwag: 1230/1256 evaluating HellaSwag: 1240/1256 evaluating HellaSwag: 1250/1256 HellaSwag: 2906/10042 = 0.289385 step 9501/19560 | loss 3.421389 (-0.13z)| norm 0.2874 (-0.15z)| lr 3.31e-04 | 4151.12 ms | 32.5% bf16 MFU | 125977 tok/s step 9502/19560 | loss 3.368796 (-1.25z)| norm 0.2705 (-0.49z)| lr 3.31e-04 | 4167.82 ms | 32.4% bf16 MFU | 125968 tok/s step 9503/19560 | loss 3.423349 (-0.08z)| norm 0.2809 (-0.27z)| lr 3.31e-04 | 4170.67 ms | 32.4% bf16 MFU | 125955 tok/s step 9504/19560 | loss 3.437146 (+0.21z)| norm 0.2695 (-0.50z)| lr 3.31e-04 | 4167.38 ms | 32.4% bf16 MFU | 125948 tok/s step 9505/19560 | loss 3.565250 (+2.84z)| norm 0.2814 (-0.25z)| lr 3.31e-04 | 4156.77 ms | 32.5% bf16 MFU | 125957 tok/s step 9506/19560 | loss 3.422531 (-0.13z)| norm 0.2905 (-0.07z)| lr 3.31e-04 | 4160.10 ms | 32.5% bf16 MFU | 125960 tok/s step 9507/19560 | loss 3.415581 (-0.28z)| norm 0.2905 (-0.07z)| lr 3.31e-04 | 4172.27 ms | 32.4% bf16 MFU | 125945 tok/s step 9508/19560 | loss 3.431568 (+0.06z)| norm 0.2794 (-0.30z)| lr 3.31e-04 | 4157.48 ms | 32.5% bf16 MFU | 125953 tok/s step 9509/19560 | loss 3.412857 (-0.34z)| norm 0.2890 (-0.10z)| lr 3.31e-04 | 4149.23 ms | 32.5% bf16 MFU | 125973 tok/s step 9510/19560 | loss 3.401215 (-0.57z)| norm 0.2793 (-0.30z)| lr 3.31e-04 | 4164.83 ms | 32.4% bf16 MFU | 125969 tok/s step 9511/19560 | loss 3.349770 (-1.63z)| norm 0.3029 (+0.17z)| lr 3.31e-04 | 4153.90 ms | 32.5% bf16 MFU | 125981 tok/s step 9512/19560 | loss 3.410600 (-0.36z)| norm 0.2556 (-0.77z)| lr 3.31e-04 | 4153.99 ms | 32.5% bf16 MFU | 125993 tok/s step 9513/19560 | loss 3.399044 (-0.59z)| norm 0.2749 (-0.38z)| lr 3.31e-04 | 4151.96 ms | 32.5% bf16 MFU | 126007 tok/s step 9514/19560 | loss 3.487436 (+1.23z)| norm 0.8211 (+7.69z)| lr 3.31e-04 | 4164.68 ms | 32.4% bf16 MFU | 126001 tok/s step 9515/19560 | loss 3.454029 (+0.55z)| norm 0.3880 (+1.31z)| lr 3.31e-04 | 4147.90 ms | 32.6% bf16 MFU | 126021 tok/s step 9516/19560 | loss 3.420783 (-0.15z)| norm 0.3229 (+0.36z)| lr 
3.31e-04 | 4164.67 ms | 32.4% bf16 MFU | 126014 tok/s step 9517/19560 | loss 3.395313 (-0.69z)| norm 0.3241 (+0.37z)| lr 3.31e-04 | 4183.61 ms | 32.3% bf16 MFU | 125980 tok/s step 9518/19560 | loss 3.502236 (+1.54z)| norm 0.2904 (-0.12z)| lr 3.31e-04 | 4153.51 ms | 32.5% bf16 MFU | 125992 tok/s step 9519/19560 | loss 3.489813 (+1.26z)| norm 0.3264 (+0.40z)| lr 3.31e-04 | 4153.01 ms | 32.5% bf16 MFU | 126005 tok/s step 9520/19560 | loss 3.470067 (+0.84z)| norm 0.2896 (-0.13z)| lr 3.30e-04 | 4158.89 ms | 32.5% bf16 MFU | 126008 tok/s step 9521/19560 | loss 3.421187 (-0.17z)| norm 0.2992 (+0.01z)| lr 3.30e-04 | 4158.41 ms | 32.5% bf16 MFU | 126011 tok/s step 9522/19560 | loss 3.367003 (-1.27z)| norm 0.2606 (-0.55z)| lr 3.30e-04 | 4164.13 ms | 32.4% bf16 MFU | 126006 tok/s step 9523/19560 | loss 3.470508 (+0.84z)| norm 0.2777 (-0.30z)| lr 3.30e-04 | 4161.53 ms | 32.4% bf16 MFU | 126005 tok/s step 9524/19560 | loss 3.453108 (+0.47z)| norm 0.2989 (+0.01z)| lr 3.30e-04 | 4151.67 ms | 32.5% bf16 MFU | 126019 tok/s step 9525/19560 | loss 3.411190 (-0.41z)| norm 0.2751 (-0.34z)| lr 3.30e-04 | 4163.06 ms | 32.4% bf16 MFU | 126015 tok/s step 9526/19560 | loss 3.417355 (-0.31z)| norm 0.2619 (-0.53z)| lr 3.30e-04 | 4154.08 ms | 32.5% bf16 MFU | 126025 tok/s step 9527/19560 | loss 3.441999 (+0.22z)| norm 0.2727 (-0.37z)| lr 3.30e-04 | 4163.02 ms | 32.4% bf16 MFU | 126020 tok/s step 9528/19560 | loss 3.437104 (+0.11z)| norm 0.2673 (-0.45z)| lr 3.30e-04 | 4155.10 ms | 32.5% bf16 MFU | 126028 tok/s step 9529/19560 | loss 3.369425 (-1.33z)| norm 0.2620 (-0.52z)| lr 3.30e-04 | 4154.65 ms | 32.5% bf16 MFU | 126036 tok/s step 9530/19560 | loss 3.596286 (+3.40z)| norm 0.2861 (-0.17z)| lr 3.30e-04 | 4157.87 ms | 32.5% bf16 MFU | 126039 tok/s step 9531/19560 | loss 3.444052 (+0.24z)| norm 0.2747 (-0.33z)| lr 3.30e-04 | 4164.62 ms | 32.4% bf16 MFU | 126032 tok/s step 9532/19560 | loss 3.410676 (-0.46z)| norm 0.2700 (-0.40z)| lr 3.30e-04 | 4167.56 ms | 32.4% bf16 MFU | 126020 tok/s step 
9533/19560 | loss 3.470120 (+0.77z)| norm 0.2930 (-0.06z)| lr 3.30e-04 | 4152.66 ms | 32.5% bf16 MFU | 126032 tok/s step 9534/19560 | loss 3.425667 (-0.15z)| norm 0.2826 (-0.21z)| lr 3.30e-04 | 4145.74 ms | 32.6% bf16 MFU | 126054 tok/s step 9535/19560 | loss 3.416016 (-0.35z)| norm 0.2760 (-0.31z)| lr 3.30e-04 | 4161.66 ms | 32.4% bf16 MFU | 126050 tok/s step 9536/19560 | loss 3.398978 (-0.70z)| norm 0.2604 (-0.53z)| lr 3.30e-04 | 4151.54 ms | 32.5% bf16 MFU | 126062 tok/s step 9537/19560 | loss 3.459580 (+0.55z)| norm 0.2518 (-0.65z)| lr 3.30e-04 | 4160.65 ms | 32.5% bf16 MFU | 126059 tok/s step 9538/19560 | loss 3.419952 (-0.26z)| norm 0.2535 (-0.63z)| lr 3.30e-04 | 4158.53 ms | 32.5% bf16 MFU | 126060 tok/s step 9539/19560 | loss 3.412978 (-0.40z)| norm 0.2424 (-0.78z)| lr 3.30e-04 | 4156.10 ms | 32.5% bf16 MFU | 126065 tok/s step 9540/19560 | loss 3.450881 (+0.39z)| norm 0.2527 (-0.63z)| lr 3.29e-04 | 4147.11 ms | 32.6% bf16 MFU | 126083 tok/s step 9541/19560 | loss 3.447264 (+0.30z)| norm 0.2472 (-0.70z)| lr 3.29e-04 | 4153.93 ms | 32.5% bf16 MFU | 126089 tok/s step 9542/19560 | loss 3.365417 (-1.42z)| norm 0.2726 (-0.34z)| lr 3.29e-04 | 4158.50 ms | 32.5% bf16 MFU | 126088 tok/s step 9543/19560 | loss 3.468505 (+0.76z)| norm 0.2647 (-0.47z)| lr 3.29e-04 | 4153.85 ms | 32.5% bf16 MFU | 126095 tok/s step 9544/19560 | loss 3.411113 (-0.45z)| norm 0.2697 (-0.37z)| lr 3.29e-04 | 4165.47 ms | 32.4% bf16 MFU | 126083 tok/s step 9545/19560 | loss 3.384103 (-1.01z)| norm 0.2739 (-0.29z)| lr 3.29e-04 | 4164.28 ms | 32.4% bf16 MFU | 126074 tok/s step 9546/19560 | loss 3.446104 (+0.30z)| norm 0.2754 (-0.25z)| lr 3.29e-04 | 4160.19 ms | 32.5% bf16 MFU | 126072 tok/s step 9547/19560 | loss 3.399972 (-0.70z)| norm 0.2598 (-0.52z)| lr 3.29e-04 | 4163.78 ms | 32.4% bf16 MFU | 126064 tok/s step 9548/19560 | loss 3.429706 (-0.06z)| norm 0.2820 (-0.13z)| lr 3.29e-04 | 4172.33 ms | 32.4% bf16 MFU | 126044 tok/s step 9549/19560 | loss 3.432098 (-0.00z)| norm 0.2664 (-0.39z)| lr 
3.29e-04 | 4155.16 ms | 32.5% bf16 MFU | 126050 tok/s step 9550/19560 | loss 3.420854 (-0.25z)| norm 0.3034 (+0.25z)| lr 3.29e-04 | 4154.66 ms | 32.5% bf16 MFU | 126058 tok/s step 9551/19560 | loss 3.460108 (+0.61z)| norm 0.3020 (+0.22z)| lr 3.29e-04 | 4161.44 ms | 32.4% bf16 MFU | 126054 tok/s step 9552/19560 | loss 3.504220 (+1.62z)| norm 0.2657 (-0.40z)| lr 3.29e-04 | 4156.31 ms | 32.5% bf16 MFU | 126059 tok/s step 9553/19560 | loss 3.396582 (-0.79z)| norm 0.3186 (+0.63z)| lr 3.29e-04 | 4156.23 ms | 32.5% bf16 MFU | 126063 tok/s step 9554/19560 | loss 3.387222 (-0.99z)| norm 0.2684 (-0.34z)| lr 3.29e-04 | 4166.29 ms | 32.4% bf16 MFU | 126052 tok/s step 9555/19560 | loss 3.396276 (-0.77z)| norm 0.2764 (-0.18z)| lr 3.29e-04 | 4154.67 ms | 32.5% bf16 MFU | 126059 tok/s step 9556/19560 | loss 3.345945 (-1.90z)| norm 0.2964 (+0.21z)| lr 3.29e-04 | 4155.77 ms | 32.5% bf16 MFU | 126064 tok/s step 9557/19560 | loss 3.386266 (-0.97z)| norm 0.2709 (-0.28z)| lr 3.29e-04 | 4158.20 ms | 32.5% bf16 MFU | 126065 tok/s step 9558/19560 | loss 3.447824 (+0.44z)| norm 0.2762 (-0.18z)| lr 3.29e-04 | 4158.53 ms | 32.5% bf16 MFU | 126065 tok/s step 9559/19560 | loss 3.444488 (+0.36z)| norm 0.2843 (-0.02z)| lr 3.29e-04 | 4149.19 ms | 32.5% bf16 MFU | 126080 tok/s step 9560/19560 | loss 3.443105 (+0.33z)| norm 0.2699 (-0.29z)| lr 3.28e-04 | 4157.00 ms | 32.5% bf16 MFU | 126082 tok/s step 9561/19560 | loss 3.435986 (+0.17z)| norm 0.2694 (-0.30z)| lr 3.28e-04 | 4157.38 ms | 32.5% bf16 MFU | 126084 tok/s step 9562/19560 | loss 3.385171 (-0.98z)| norm 0.2912 (+0.12z)| lr 3.28e-04 | 4153.56 ms | 32.5% bf16 MFU | 126091 tok/s step 9563/19560 | loss 3.561163 (+2.96z)| norm 0.2609 (-0.46z)| lr 3.28e-04 | 4147.69 ms | 32.6% bf16 MFU | 126106 tok/s step 9564/19560 | loss 3.361143 (-1.47z)| norm 0.2714 (-0.26z)| lr 3.28e-04 | 4159.64 ms | 32.5% bf16 MFU | 126103 tok/s step 9565/19560 | loss 3.483857 (+1.24z)| norm 0.2729 (-0.23z)| lr 3.28e-04 | 4179.84 ms | 32.3% bf16 MFU | 126070 tok/s step 
9566/19560 | loss 3.367952 (-1.30z)| norm 0.2689 (-0.30z)| lr 3.28e-04 | 4156.87 ms | 32.5% bf16 MFU | 126072 tok/s step 9567/19560 | loss 3.446209 (+0.41z)| norm 0.2884 (+0.07z)| lr 3.28e-04 | 4154.99 ms | 32.5% bf16 MFU | 126078 tok/s step 9568/19560 | loss 3.405146 (-0.49z)| norm 0.2671 (-0.34z)| lr 3.28e-04 | 4156.00 ms | 32.5% bf16 MFU | 126082 tok/s step 9569/19560 | loss 3.444242 (+0.37z)| norm 0.2980 (+0.25z)| lr 3.28e-04 | 4158.14 ms | 32.5% bf16 MFU | 126082 tok/s step 9570/19560 | loss 3.388202 (-0.85z)| norm 0.2593 (-0.49z)| lr 3.28e-04 | 4150.88 ms | 32.5% bf16 MFU | 126093 tok/s step 9571/19560 | loss 3.329785 (-2.11z)| norm 0.2857 (+0.02z)| lr 3.28e-04 | 4164.15 ms | 32.4% bf16 MFU | 126084 tok/s step 9572/19560 | loss 3.438396 (+0.25z)| norm 0.2663 (-0.36z)| lr 3.28e-04 | 4160.42 ms | 32.5% bf16 MFU | 126080 tok/s step 9573/19560 | loss 3.419014 (-0.18z)| norm 0.2629 (-0.42z)| lr 3.28e-04 | 4146.83 ms | 32.6% bf16 MFU | 126098 tok/s step 9574/19560 | loss 3.396976 (-0.65z)| norm 0.2628 (-0.42z)| lr 3.28e-04 | 4153.95 ms | 32.5% bf16 MFU | 126104 tok/s step 9575/19560 | loss 3.347836 (-1.70z)| norm 0.2595 (-0.48z)| lr 3.28e-04 | 4156.58 ms | 32.5% bf16 MFU | 126105 tok/s step 9576/19560 | loss 3.503449 (+1.63z)| norm 0.2837 (-0.01z)| lr 3.28e-04 | 4182.62 ms | 32.3% bf16 MFU | 126068 tok/s step 9577/19560 | loss 3.354551 (-1.53z)| norm 0.2507 (-0.65z)| lr 3.28e-04 | 4146.22 ms | 32.6% bf16 MFU | 126087 tok/s step 9578/19560 | loss 3.438528 (+0.24z)| norm 0.2803 (-0.07z)| lr 3.28e-04 | 4148.66 ms | 32.5% bf16 MFU | 126101 tok/s step 9579/19560 | loss 3.464967 (+0.79z)| norm 0.2856 (+0.02z)| lr 3.28e-04 | 4153.99 ms | 32.5% bf16 MFU | 126107 tok/s step 9580/19560 | loss 3.478120 (+1.06z)| norm 0.2666 (-0.34z)| lr 3.27e-04 | 4152.52 ms | 32.5% bf16 MFU | 126114 tok/s step 9581/19560 | loss 3.354202 (-1.55z)| norm 0.2894 (+0.10z)| lr 3.27e-04 | 4153.30 ms | 32.5% bf16 MFU | 126120 tok/s step 9582/19560 | loss 3.423197 (-0.10z)| norm 0.2608 (-0.45z)| lr 
3.27e-04 | 4153.52 ms | 32.5% bf16 MFU | 126126 tok/s step 9583/19560 | loss 3.428221 (+0.01z)| norm 0.2872 (+0.06z)| lr 3.27e-04 | 4158.64 ms | 32.5% bf16 MFU | 126123 tok/s step 9584/19560 | loss 3.399106 (-0.59z)| norm 0.2700 (-0.28z)| lr 3.27e-04 | 4158.94 ms | 32.5% bf16 MFU | 126120 tok/s step 9585/19560 | loss 3.483831 (+1.17z)| norm 0.2849 (+0.01z)| lr 3.27e-04 | 4161.35 ms | 32.4% bf16 MFU | 126113 tok/s step 9586/19560 | loss 3.373323 (-1.12z)| norm 0.2817 (-0.05z)| lr 3.27e-04 | 4158.46 ms | 32.5% bf16 MFU | 126112 tok/s step 9587/19560 | loss 3.402318 (-0.51z)| norm 0.2608 (-0.45z)| lr 3.27e-04 | 4155.89 ms | 32.5% bf16 MFU | 126114 tok/s step 9588/19560 | loss 3.418326 (-0.17z)| norm 0.2887 (+0.09z)| lr 3.27e-04 | 4150.03 ms | 32.5% bf16 MFU | 126125 tok/s step 9589/19560 | loss 3.485309 (+1.23z)| norm 0.2505 (-0.65z)| lr 3.27e-04 | 4162.04 ms | 32.4% bf16 MFU | 126117 tok/s step 9590/19560 | loss 3.442583 (+0.32z)| norm 0.3073 (+0.45z)| lr 3.27e-04 | 4152.48 ms | 32.5% bf16 MFU | 126124 tok/s step 9591/19560 | loss 3.380731 (-0.98z)| norm 0.2675 (-0.32z)| lr 3.27e-04 | 4153.56 ms | 32.5% bf16 MFU | 126129 tok/s step 9592/19560 | loss 3.417672 (-0.19z)| norm 0.2996 (+0.30z)| lr 3.27e-04 | 4161.08 ms | 32.4% bf16 MFU | 126123 tok/s step 9593/19560 | loss 3.393126 (-0.71z)| norm 0.2883 (+0.08z)| lr 3.27e-04 | 4151.87 ms | 32.5% bf16 MFU | 126130 tok/s step 9594/19560 | loss 3.376428 (-1.05z)| norm 0.2914 (+0.13z)| lr 3.27e-04 | 4160.50 ms | 32.5% bf16 MFU | 126125 tok/s step 9595/19560 | loss 3.370529 (-1.19z)| norm 0.2760 (-0.17z)| lr 3.27e-04 | 4154.69 ms | 32.5% bf16 MFU | 126128 tok/s step 9596/19560 | loss 3.472974 (+0.96z)| norm 0.2932 (+0.17z)| lr 3.27e-04 | 4162.99 ms | 32.4% bf16 MFU | 126119 tok/s step 9597/19560 | loss 3.377139 (-1.05z)| norm 0.2580 (-0.51z)| lr 3.27e-04 | 4157.07 ms | 32.5% bf16 MFU | 126119 tok/s step 9598/19560 | loss 3.395652 (-0.66z)| norm 0.3126 (+0.55z)| lr 3.27e-04 | 4154.88 ms | 32.5% bf16 MFU | 126122 tok/s step 
9599/19560 | loss 3.430874 (+0.07z)| norm 0.2903 (+0.12z)| lr 3.27e-04 | 4159.71 ms | 32.5% bf16 MFU | 126118 tok/s step 9600/19560 | loss 3.515753 (+1.82z)| norm 0.2777 (-0.12z)| lr 3.27e-04 | 4156.01 ms | 32.5% bf16 MFU | 126120 tok/s step 9601/19560 | loss 3.415073 (-0.27z)| norm 0.2917 (+0.15z)| lr 3.26e-04 | 4153.66 ms | 32.5% bf16 MFU | 126125 tok/s step 9602/19560 | loss 3.414679 (-0.26z)| norm 0.2774 (-0.13z)| lr 3.26e-04 | 4217.47 ms | 32.0% bf16 MFU | 126034 tok/s step 9603/19560 | loss 3.432503 (+0.12z)| norm 0.2945 (+0.20z)| lr 3.26e-04 | 4165.75 ms | 32.4% bf16 MFU | 126025 tok/s step 9604/19560 | loss 3.409105 (-0.38z)| norm 0.2913 (+0.14z)| lr 3.26e-04 | 4161.19 ms | 32.4% bf16 MFU | 126024 tok/s step 9605/19560 | loss 3.398353 (-0.60z)| norm 0.2740 (-0.20z)| lr 3.26e-04 | 4152.50 ms | 32.5% bf16 MFU | 126036 tok/s step 9606/19560 | loss 3.440291 (+0.31z)| norm 0.2886 (+0.08z)| lr 3.26e-04 | 4157.12 ms | 32.5% bf16 MFU | 126040 tok/s step 9607/19560 | loss 3.425479 (+0.00z)| norm 0.2895 (+0.10z)| lr 3.26e-04 | 4156.18 ms | 32.5% bf16 MFU | 126045 tok/s step 9608/19560 | loss 3.370776 (-1.18z)| norm 0.3277 (+0.83z)| lr 3.26e-04 | 4154.82 ms | 32.5% bf16 MFU | 126052 tok/s step 9609/19560 | loss 3.412798 (-0.28z)| norm 0.2653 (-0.37z)| lr 3.26e-04 | 4157.33 ms | 32.5% bf16 MFU | 126055 tok/s step 9610/19560 | loss 3.429979 (+0.10z)| norm 0.2696 (-0.29z)| lr 3.26e-04 | 4147.18 ms | 32.6% bf16 MFU | 126073 tok/s step 9611/19560 | loss 3.431695 (+0.13z)| norm 0.2668 (-0.33z)| lr 3.26e-04 | 4144.45 ms | 32.6% bf16 MFU | 126095 tok/s step 9612/19560 | loss 3.413655 (-0.27z)| norm 0.2921 (+0.15z)| lr 3.26e-04 | 4151.14 ms | 32.5% bf16 MFU | 126105 tok/s step 9613/19560 | loss 3.393162 (-0.72z)| norm 0.2903 (+0.12z)| lr 3.26e-04 | 4158.24 ms | 32.5% bf16 MFU | 126104 tok/s step 9614/19560 | loss 3.338565 (-1.88z)| norm 0.2545 (-0.57z)| lr 3.26e-04 | 4151.51 ms | 32.5% bf16 MFU | 126113 tok/s step 9615/19560 | loss 3.371640 (-1.15z)| norm 0.2779 (-0.11z)| lr 
3.26e-04 | 4153.04 ms | 32.5% bf16 MFU | 126120 tok/s step 9616/19560 | loss 3.393659 (-0.66z)| norm 0.2479 (-0.68z)| lr 3.26e-04 | 4149.82 ms | 32.5% bf16 MFU | 126131 tok/s step 9617/19560 | loss 3.449926 (+0.57z)| norm 0.2633 (-0.38z)| lr 3.26e-04 | 4147.64 ms | 32.6% bf16 MFU | 126145 tok/s step 9618/19560 | loss 3.430654 (+0.15z)| norm 0.2854 (+0.05z)| lr 3.26e-04 | 4153.59 ms | 32.5% bf16 MFU | 126149 tok/s step 9619/19560 | loss 3.424232 (+0.01z)| norm 0.2546 (-0.55z)| lr 3.26e-04 | 4151.48 ms | 32.5% bf16 MFU | 126156 tok/s step 9620/19560 | loss 3.411453 (-0.26z)| norm 0.2786 (-0.09z)| lr 3.26e-04 | 4151.62 ms | 32.5% bf16 MFU | 126162 tok/s step 9621/19560 | loss 3.472845 (+1.09z)| norm 0.2634 (-0.38z)| lr 3.25e-04 | 4156.31 ms | 32.5% bf16 MFU | 126161 tok/s step 9622/19560 | loss 3.494529 (+1.54z)| norm 0.3049 (+0.42z)| lr 3.25e-04 | 4154.44 ms | 32.5% bf16 MFU | 126163 tok/s step 9623/19560 | loss 3.454671 (+0.66z)| norm 0.2762 (-0.14z)| lr 3.25e-04 | 4157.57 ms | 32.5% bf16 MFU | 126160 tok/s step 9624/19560 | loss 3.402975 (-0.48z)| norm 0.2839 (+0.01z)| lr 3.25e-04 | 4172.29 ms | 32.4% bf16 MFU | 126135 tok/s step 9625/19560 | loss 3.352091 (-1.57z)| norm 0.2665 (-0.33z)| lr 3.25e-04 | 4153.23 ms | 32.5% bf16 MFU | 126140 tok/s step 9626/19560 | loss 3.462422 (+0.86z)| norm 0.2638 (-0.38z)| lr 3.25e-04 | 4172.38 ms | 32.4% bf16 MFU | 126116 tok/s step 9627/19560 | loss 3.364666 (-1.30z)| norm 0.3200 (+0.70z)| lr 3.25e-04 | 4159.00 ms | 32.5% bf16 MFU | 126113 tok/s step 9628/19560 | loss 3.418994 (-0.10z)| norm 0.2888 (+0.10z)| lr 3.25e-04 | 4148.23 ms | 32.5% bf16 MFU | 126127 tok/s step 9629/19560 | loss 3.448646 (+0.55z)| norm 0.2870 (+0.06z)| lr 3.25e-04 | 4159.91 ms | 32.5% bf16 MFU | 126122 tok/s step 9630/19560 | loss 3.420270 (-0.09z)| norm 0.3011 (+0.33z)| lr 3.25e-04 | 4164.23 ms | 32.4% bf16 MFU | 126111 tok/s step 9631/19560 | loss 3.515508 (+1.98z)| norm 0.2883 (+0.08z)| lr 3.25e-04 | 4145.83 ms | 32.6% bf16 MFU | 126129 tok/s step 
9632/19560 | loss 3.432428 (+0.17z)| norm 0.2929 (+0.17z)| lr 3.25e-04 | 4162.34 ms | 32.4% bf16 MFU | 126120 tok/s step 9633/19560 | loss 3.374869 (-1.10z)| norm 0.2737 (-0.20z)| lr 3.25e-04 | 4167.98 ms | 32.4% bf16 MFU | 126104 tok/s step 9634/19560 | loss 3.305673 (-2.57z)| norm 0.2918 (+0.15z)| lr 3.25e-04 | 4152.35 ms | 32.5% bf16 MFU | 126112 tok/s step 9635/19560 | loss 3.437147 (+0.32z)| norm 0.2867 (+0.05z)| lr 3.25e-04 | 4171.02 ms | 32.4% bf16 MFU | 126091 tok/s step 9636/19560 | loss 3.410888 (-0.25z)| norm 0.2652 (-0.36z)| lr 3.25e-04 | 4152.21 ms | 32.5% bf16 MFU | 126100 tok/s step 9637/19560 | loss 3.437114 (+0.32z)| norm 0.2734 (-0.20z)| lr 3.25e-04 | 4168.74 ms | 32.4% bf16 MFU | 126083 tok/s step 9638/19560 | loss 3.400266 (-0.49z)| norm 0.2806 (-0.06z)| lr 3.25e-04 | 4153.73 ms | 32.5% bf16 MFU | 126090 tok/s step 9639/19560 | loss 3.383060 (-0.88z)| norm 0.2611 (-0.43z)| lr 3.25e-04 | 4157.62 ms | 32.5% bf16 MFU | 126091 tok/s step 9640/19560 | loss 3.484290 (+1.34z)| norm 0.3136 (+0.57z)| lr 3.25e-04 | 4149.99 ms | 32.5% bf16 MFU | 126103 tok/s step 9641/19560 | loss 3.465967 (+0.92z)| norm 0.2939 (+0.19z)| lr 3.24e-04 | 4180.31 ms | 32.3% bf16 MFU | 126069 tok/s step 9642/19560 | loss 3.450996 (+0.60z)| norm 0.2864 (+0.32z)| lr 3.24e-04 | 4177.54 ms | 32.3% bf16 MFU | 126040 tok/s step 9643/19560 | loss 3.387094 (-0.80z)| norm 0.2753 (-0.21z)| lr 3.24e-04 | 4212.47 ms | 32.1% bf16 MFU | 125961 tok/s step 9644/19560 | loss 3.420539 (-0.06z)| norm 0.2994 (+1.19z)| lr 3.24e-04 | 4171.68 ms | 32.4% bf16 MFU | 125947 tok/s step 9645/19560 | loss 3.365740 (-1.26z)| norm 0.2846 (+0.36z)| lr 3.24e-04 | 4167.97 ms | 32.4% bf16 MFU | 125939 tok/s step 9646/19560 | loss 3.440666 (+0.40z)| norm 0.2763 (-0.13z)| lr 3.24e-04 | 4173.60 ms | 32.4% bf16 MFU | 125923 tok/s step 9647/19560 | loss 3.405382 (-0.37z)| norm 0.2914 (+0.81z)| lr 3.24e-04 | 4158.11 ms | 32.5% bf16 MFU | 125932 tok/s step 9648/19560 | loss 3.405483 (-0.36z)| norm 0.2619 (-0.99z)| lr 
3.24e-04 | 4157.66 ms | 32.5% bf16 MFU | 125940 tok/s step 9649/19560 | loss 3.442453 (+0.47z)| norm 0.2960 (+1.11z)| lr 3.24e-04 | 4168.43 ms | 32.4% bf16 MFU | 125932 tok/s step 9650/19560 | loss 3.379061 (-0.96z)| norm 0.2649 (-0.81z)| lr 3.24e-04 | 4171.57 ms | 32.4% bf16 MFU | 125919 tok/s step 9651/19560 | loss 3.390119 (-0.70z)| norm 0.2675 (-0.64z)| lr 3.24e-04 | 4153.59 ms | 32.5% bf16 MFU | 125935 tok/s step 9652/19560 | loss 3.384490 (-0.81z)| norm 0.2705 (-0.44z)| lr 3.24e-04 | 4155.29 ms | 32.5% bf16 MFU | 125947 tok/s step 9653/19560 | loss 3.399397 (-0.47z)| norm 0.2660 (-0.72z)| lr 3.24e-04 | 4172.07 ms | 32.4% bf16 MFU | 125933 tok/s step 9654/19560 | loss 3.414777 (-0.12z)| norm 0.2608 (-1.04z)| lr 3.24e-04 | 4152.77 ms | 32.5% bf16 MFU | 125948 tok/s step 9655/19560 | loss 3.448946 (+0.65z)| norm 0.2587 (-1.16z)| lr 3.24e-04 | 4157.15 ms | 32.5% bf16 MFU | 125957 tok/s step 9656/19560 | loss 3.374132 (-1.03z)| norm 0.2781 (+0.03z)| lr 3.24e-04 | 4148.23 ms | 32.5% bf16 MFU | 125978 tok/s step 9657/19560 | loss 3.338327 (-1.81z)| norm 0.2637 (-0.86z)| lr 3.24e-04 | 4170.76 ms | 32.4% bf16 MFU | 125965 tok/s step 9658/19560 | loss 3.413176 (-0.12z)| norm 0.2731 (-0.27z)| lr 3.24e-04 | 4144.62 ms | 32.6% bf16 MFU | 125992 tok/s step 9659/19560 | loss 3.394098 (-0.56z)| norm 0.2740 (-0.22z)| lr 3.24e-04 | 4161.41 ms | 32.4% bf16 MFU | 125991 tok/s step 9660/19560 | loss 3.431465 (+0.32z)| norm 0.2747 (-0.17z)| lr 3.24e-04 | 4151.15 ms | 32.5% bf16 MFU | 126007 tok/s step 9661/19560 | loss 3.401748 (-0.38z)| norm 0.2628 (-0.89z)| lr 3.23e-04 | 4155.28 ms | 32.5% bf16 MFU | 126015 tok/s step 9662/19560 | loss 3.358033 (-1.40z)| norm 0.2737 (-0.22z)| lr 3.23e-04 | 4160.38 ms | 32.5% bf16 MFU | 126015 tok/s step 9663/19560 | loss 3.392955 (-0.56z)| norm 0.2681 (-0.56z)| lr 3.23e-04 | 4163.75 ms | 32.4% bf16 MFU | 126010 tok/s step 9664/19560 | loss 3.450723 (+0.80z)| norm 0.2772 (-0.00z)| lr 3.23e-04 | 4179.85 ms | 32.3% bf16 MFU | 125981 tok/s step 
9665/19560 | loss 3.386651 (-0.71z)| norm 0.2714 (-0.38z)| lr 3.23e-04 | 4156.75 ms | 32.5% bf16 MFU | 125989 tok/s step 9666/19560 | loss 3.383982 (-0.76z)| norm 0.2932 (+0.97z)| lr 3.23e-04 | 4163.42 ms | 32.4% bf16 MFU | 125986 tok/s step 9667/19560 | loss 3.456336 (+0.94z)| norm 0.2928 (+0.94z)| lr 3.23e-04 | 4155.74 ms | 32.5% bf16 MFU | 125994 tok/s step 9668/19560 | loss 3.447991 (+0.74z)| norm 0.3319 (+3.28z)| lr 3.23e-04 | 4150.43 ms | 32.5% bf16 MFU | 126011 tok/s step 9669/19560 | loss 3.403300 (-0.31z)| norm 0.3055 (+1.63z)| lr 3.23e-04 | 4152.56 ms | 32.5% bf16 MFU | 126023 tok/s step 9670/19560 | loss 3.404469 (-0.29z)| norm 0.2823 (+0.19z)| lr 3.23e-04 | 4158.15 ms | 32.5% bf16 MFU | 126026 tok/s step 9671/19560 | loss 3.413486 (-0.06z)| norm 0.3331 (+3.18z)| lr 3.23e-04 | 4152.97 ms | 32.5% bf16 MFU | 126037 tok/s step 9672/19560 | loss 3.464861 (+1.15z)| norm 0.2553 (-1.45z)| lr 3.23e-04 | 4160.66 ms | 32.5% bf16 MFU | 126036 tok/s step 9673/19560 | loss 3.408125 (-0.20z)| norm 0.3070 (+1.58z)| lr 3.23e-04 | 4154.94 ms | 32.5% bf16 MFU | 126043 tok/s step 9674/19560 | loss 3.380193 (-0.86z)| norm 0.2644 (-0.91z)| lr 3.23e-04 | 4159.29 ms | 32.5% bf16 MFU | 126044 tok/s step 9675/19560 | loss 3.367212 (-1.16z)| norm 0.2686 (-0.67z)| lr 3.23e-04 | 4158.25 ms | 32.5% bf16 MFU | 126046 tok/s step 9676/19560 | loss 3.443408 (+0.65z)| norm 0.2947 (+0.86z)| lr 3.23e-04 | 4151.36 ms | 32.5% bf16 MFU | 126058 tok/s step 9677/19560 | loss 3.518725 (+2.36z)| norm 0.2820 (+0.10z)| lr 3.23e-04 | 4168.95 ms | 32.4% bf16 MFU | 126043 tok/s step 9678/19560 | loss 3.379544 (-0.85z)| norm 0.2623 (-1.04z)| lr 3.23e-04 | 4160.10 ms | 32.5% bf16 MFU | 126042 tok/s step 9679/19560 | loss 3.440930 (+0.57z)| norm 0.2783 (-0.09z)| lr 3.23e-04 | 4154.85 ms | 32.5% bf16 MFU | 126050 tok/s step 9680/19560 | loss 3.411837 (-0.09z)| norm 0.2557 (-1.41z)| lr 3.23e-04 | 4155.36 ms | 32.5% bf16 MFU | 126056 tok/s step 9681/19560 | loss 3.447604 (+0.75z)| norm 0.2858 (+0.39z)| lr 
3.22e-04 | 4151.66 ms | 32.5% bf16 MFU | 126067 tok/s step 9682/19560 | loss 3.418951 (+0.07z)| norm 0.2692 (-0.61z)| lr 3.22e-04 | 4161.14 ms | 32.4% bf16 MFU | 126064 tok/s step 9683/19560 | loss 3.415234 (-0.03z)| norm 0.2689 (-0.62z)| lr 3.22e-04 | 4153.90 ms | 32.5% bf16 MFU | 126071 tok/s step 9684/19560 | loss 3.387402 (-0.70z)| norm 0.2875 (+0.50z)| lr 3.22e-04 | 4163.68 ms | 32.4% bf16 MFU | 126064 tok/s step 9685/19560 | loss 3.511239 (+2.19z)| norm 0.2554 (-1.42z)| lr 3.22e-04 | 4157.61 ms | 32.5% bf16 MFU | 126066 tok/s step 9686/19560 | loss 3.397904 (-0.45z)| norm 0.2957 (+0.98z)| lr 3.22e-04 | 4164.61 ms | 32.4% bf16 MFU | 126057 tok/s step 9687/19560 | loss 3.423039 (+0.14z)| norm 0.2918 (+0.75z)| lr 3.22e-04 | 4160.43 ms | 32.5% bf16 MFU | 126055 tok/s step 9688/19560 | loss 3.375906 (-0.95z)| norm 0.2867 (+0.43z)| lr 3.22e-04 | 4161.00 ms | 32.4% bf16 MFU | 126052 tok/s step 9689/19560 | loss 3.399847 (-0.38z)| norm 0.2773 (-0.13z)| lr 3.22e-04 | 4160.68 ms | 32.5% bf16 MFU | 126050 tok/s step 9690/19560 | loss 3.374254 (-0.98z)| norm 0.2862 (+0.40z)| lr 3.22e-04 | 4152.99 ms | 32.5% bf16 MFU | 126060 tok/s step 9691/19560 | loss 3.389477 (-0.62z)| norm 0.2686 (-0.65z)| lr 3.22e-04 | 4156.48 ms | 32.5% bf16 MFU | 126064 tok/s step 9692/19560 | loss 3.452762 (+0.91z)| norm 0.2904 (+0.64z)| lr 3.22e-04 | 4163.28 ms | 32.4% bf16 MFU | 126057 tok/s step 9693/19560 | loss 3.420236 (+0.13z)| norm 0.2796 (-0.01z)| lr 3.22e-04 | 4159.62 ms | 32.5% bf16 MFU | 126056 tok/s step 9694/19560 | loss 3.483505 (+1.67z)| norm 0.2958 (+0.95z)| lr 3.22e-04 | 4156.40 ms | 32.5% bf16 MFU | 126060 tok/s step 9695/19560 | loss 3.392305 (-0.57z)| norm 0.2664 (-0.80z)| lr 3.22e-04 | 4158.25 ms | 32.5% bf16 MFU | 126062 tok/s step 9696/19560 | loss 3.479281 (+1.55z)| norm 0.2756 (-0.25z)| lr 3.22e-04 | 4165.79 ms | 32.4% bf16 MFU | 126051 tok/s step 9697/19560 | loss 3.418799 (+0.07z)| norm 0.2725 (-0.43z)| lr 3.22e-04 | 4164.57 ms | 32.4% bf16 MFU | 126043 tok/s step 
9698/19560 | loss 3.376710 (-0.96z)| norm 0.2777 (-0.13z)| lr 3.22e-04 | 4167.73 ms | 32.4% bf16 MFU | 126031 tok/s step 9699/19560 | loss 3.466204 (+1.22z)| norm 0.2979 (+1.08z)| lr 3.22e-04 | 4155.58 ms | 32.5% bf16 MFU | 126038 tok/s step 9700/19560 | loss 3.416468 (-0.01z)| norm 0.2827 (+0.16z)| lr 3.22e-04 | 4162.83 ms | 32.4% bf16 MFU | 126033 tok/s step 9701/19560 | loss 3.383792 (-0.81z)| norm 0.2966 (+0.99z)| lr 3.21e-04 | 4163.47 ms | 32.4% bf16 MFU | 126028 tok/s step 9702/19560 | loss 3.418230 (+0.04z)| norm 0.2953 (+0.89z)| lr 3.21e-04 | 4170.90 ms | 32.4% bf16 MFU | 126011 tok/s step 9703/19560 | loss 3.389269 (-0.69z)| norm 0.2756 (-0.30z)| lr 3.21e-04 | 4156.65 ms | 32.5% bf16 MFU | 126017 tok/s step 9704/19560 | loss 3.418777 (+0.06z)| norm 0.2886 (+0.48z)| lr 3.21e-04 | 4160.07 ms | 32.5% bf16 MFU | 126018 tok/s step 9705/19560 | loss 3.457538 (+1.04z)| norm 0.2770 (-0.24z)| lr 3.21e-04 | 4165.13 ms | 32.4% bf16 MFU | 126011 tok/s step 9706/19560 | loss 3.399358 (-0.45z)| norm 0.2904 (+0.58z)| lr 3.21e-04 | 4165.80 ms | 32.4% bf16 MFU | 126003 tok/s step 9707/19560 | loss 3.360231 (-1.43z)| norm 0.2690 (-0.72z)| lr 3.21e-04 | 4154.24 ms | 32.5% bf16 MFU | 126013 tok/s step 9708/19560 | loss 3.360366 (-1.40z)| norm 0.2736 (-0.45z)| lr 3.21e-04 | 4155.05 ms | 32.5% bf16 MFU | 126022 tok/s step 9709/19560 | loss 3.394582 (-0.54z)| norm 0.2819 (+0.07z)| lr 3.21e-04 | 4168.41 ms | 32.4% bf16 MFU | 126009 tok/s step 9710/19560 | loss 3.423648 (+0.21z)| norm 0.2888 (+0.48z)| lr 3.21e-04 | 4155.75 ms | 32.5% bf16 MFU | 126017 tok/s step 9711/19560 | loss 3.402932 (-0.32z)| norm 0.2721 (-0.55z)| lr 3.21e-04 | 4172.46 ms | 32.4% bf16 MFU | 125999 tok/s step 9712/19560 | loss 3.356620 (-1.49z)| norm 0.2869 (+0.36z)| lr 3.21e-04 | 4158.78 ms | 32.5% bf16 MFU | 126002 tok/s step 9713/19560 | loss 3.448620 (+0.88z)| norm 0.2987 (+1.08z)| lr 3.21e-04 | 4161.78 ms | 32.4% bf16 MFU | 126001 tok/s step 9714/19560 | loss 3.500855 (+2.18z)| norm 0.2923 (+0.68z)| lr 
3.21e-04 | 4153.32 ms | 32.5% bf16 MFU | 126013 tok/s step 9715/19560 | loss 3.458331 (+1.08z)| norm 0.2783 (-0.19z)| lr 3.21e-04 | 4150.33 ms | 32.5% bf16 MFU | 126028 tok/s step 9716/19560 | loss 3.342899 (-1.82z)| norm 0.2847 (+0.21z)| lr 3.21e-04 | 4158.02 ms | 32.5% bf16 MFU | 126031 tok/s step 9717/19560 | loss 3.426691 (+0.30z)| norm 0.2869 (+0.33z)| lr 3.21e-04 | 4155.76 ms | 32.5% bf16 MFU | 126038 tok/s step 9718/19560 | loss 3.367583 (-1.18z)| norm 0.2721 (-0.59z)| lr 3.21e-04 | 4183.71 ms | 32.3% bf16 MFU | 126002 tok/s step 9719/19560 | loss 3.386699 (-0.70z)| norm 0.2764 (-0.32z)| lr 3.21e-04 | 4160.58 ms | 32.5% bf16 MFU | 126002 tok/s step 9720/19560 | loss 3.451443 (+0.93z)| norm 0.2774 (-0.25z)| lr 3.21e-04 | 4161.74 ms | 32.4% bf16 MFU | 126001 tok/s step 9721/19560 | loss 3.471527 (+1.41z)| norm 0.3206 (+2.44z)| lr 3.20e-04 | 4156.68 ms | 32.5% bf16 MFU | 126008 tok/s step 9722/19560 | loss 3.471374 (+1.38z)| norm 0.2728 (-0.54z)| lr 3.20e-04 | 4143.79 ms | 32.6% bf16 MFU | 126033 tok/s step 9723/19560 | loss 3.446301 (+0.74z)| norm 0.2947 (+0.82z)| lr 3.20e-04 | 4166.41 ms | 32.4% bf16 MFU | 126023 tok/s step 9724/19560 | loss 3.470628 (+1.35z)| norm 0.3154 (+2.07z)| lr 3.20e-04 | 4159.84 ms | 32.5% bf16 MFU | 126024 tok/s step 9725/19560 | loss 3.399170 (-0.44z)| norm 0.2826 (+0.04z)| lr 3.20e-04 | 4157.36 ms | 32.5% bf16 MFU | 126028 tok/s step 9726/19560 | loss 3.393002 (-0.60z)| norm 0.2983 (+1.03z)| lr 3.20e-04 | 4161.09 ms | 32.4% bf16 MFU | 126027 tok/s step 9727/19560 | loss 3.399993 (-0.41z)| norm 0.2831 (+0.08z)| lr 3.20e-04 | 4171.95 ms | 32.4% bf16 MFU | 126009 tok/s step 9728/19560 | loss 3.366830 (-1.24z)| norm 0.2890 (+0.45z)| lr 3.20e-04 | 4167.23 ms | 32.4% bf16 MFU | 125999 tok/s step 9729/19560 | loss 3.424906 (+0.24z)| norm 0.2756 (-0.38z)| lr 3.20e-04 | 4159.95 ms | 32.5% bf16 MFU | 126001 tok/s step 9730/19560 | loss 3.364507 (-1.29z)| norm 0.2737 (-0.50z)| lr 3.20e-04 | 4161.32 ms | 32.4% bf16 MFU | 126000 tok/s step 
9731/19560 | loss 3.419535 (+0.12z)| norm 0.2895 (+0.50z)| lr 3.20e-04 | 4168.61 ms | 32.4% bf16 MFU | 125989 tok/s step 9732/19560 | loss 3.413707 (-0.03z)| norm 0.2679 (-0.85z)| lr 3.20e-04 | 4154.69 ms | 32.5% bf16 MFU | 125999 tok/s step 9733/19560 | loss 3.393312 (-0.55z)| norm 0.2891 (+0.47z)| lr 3.20e-04 | 4160.49 ms | 32.5% bf16 MFU | 126000 tok/s step 9734/19560 | loss 3.413978 (-0.02z)| norm 0.2758 (-0.36z)| lr 3.20e-04 | 4157.31 ms | 32.5% bf16 MFU | 126005 tok/s step 9735/19560 | loss 3.423670 (+0.23z)| norm 0.3037 (+1.38z)| lr 3.20e-04 | 4165.17 ms | 32.4% bf16 MFU | 125999 tok/s step 9736/19560 | loss 3.431186 (+0.41z)| norm 0.3000 (+1.19z)| lr 3.20e-04 | 4163.05 ms | 32.4% bf16 MFU | 125996 tok/s step 9737/19560 | loss 3.375813 (-1.00z)| norm 0.3027 (+1.34z)| lr 3.20e-04 | 4172.35 ms | 32.4% bf16 MFU | 125979 tok/s step 9738/19560 | loss 3.394138 (-0.52z)| norm 0.3067 (+1.57z)| lr 3.20e-04 | 4171.36 ms | 32.4% bf16 MFU | 125964 tok/s step 9739/19560 | loss 3.405941 (-0.22z)| norm 0.3012 (+1.20z)| lr 3.20e-04 | 4168.87 ms | 32.4% bf16 MFU | 125954 tok/s step 9740/19560 | loss 3.378797 (-0.90z)| norm 0.2931 (+0.69z)| lr 3.20e-04 | 4152.25 ms | 32.5% bf16 MFU | 125970 tok/s step 9741/19560 | loss 3.439833 (+0.64z)| norm 0.2591 (-1.43z)| lr 3.19e-04 | 4170.01 ms | 32.4% bf16 MFU | 125958 tok/s step 9742/19560 | loss 3.414271 (-0.02z)| norm 0.2854 (+0.20z)| lr 3.19e-04 | 4182.35 ms | 32.3% bf16 MFU | 125928 tok/s step 9743/19560 | loss 3.368648 (-1.20z)| norm 0.2652 (-1.07z)| lr 3.19e-04 | 4161.00 ms | 32.4% bf16 MFU | 125931 tok/s step 9744/19560 | loss 3.395967 (-0.49z)| norm 0.2650 (-1.10z)| lr 3.19e-04 | 4168.78 ms | 32.4% bf16 MFU | 125923 tok/s step 9745/19560 | loss 3.399563 (-0.39z)| norm 0.2798 (-0.16z)| lr 3.19e-04 | 4169.13 ms | 32.4% bf16 MFU | 125915 tok/s step 9746/19560 | loss 3.381444 (-0.85z)| norm 0.2700 (-0.79z)| lr 3.19e-04 | 4155.24 ms | 32.5% bf16 MFU | 125928 tok/s step 9747/19560 | loss 3.378513 (-0.91z)| norm 0.2638 (-1.20z)| lr 
3.19e-04 | 4171.96 ms | 32.4% bf16 MFU | 125915 tok/s step 9748/19560 | loss 3.387723 (-0.67z)| norm 0.3774 (+5.37z)| lr 3.19e-04 | 4640.43 ms | 29.1% bf16 MFU | 125268 tok/s step 9749/19560 | loss 3.468987 (+1.42z)| norm 0.2726 (-0.60z)| lr 3.19e-04 | 4159.31 ms | 32.5% bf16 MFU | 125307 tok/s step 9750/19560 | loss 3.341346 (-1.84z)| norm 0.2581 (-1.41z)| lr 3.19e-04 | 4220.92 ms | 32.0% bf16 MFU | 125253 tok/s val loss 3.390725
HellaSwag: 2908/10042 = 0.289584 step 9751/19560 | loss 3.385779 (-0.68z)| norm 0.2749 (-0.45z)| lr 3.19e-04 | 4162.44 ms | 32.4% bf16 MFU | 125288 tok/s step 9752/19560 | loss 3.350836 (-1.56z)| norm 0.2713 (-0.65z)| lr 3.19e-04 | 4254.27 ms | 31.7% bf16 MFU | 125185 tok/s step 9753/19560 | loss 3.409595 (-0.06z)| norm 0.2654 (-0.98z)| lr 3.19e-04 | 4244.44 ms | 31.8% bf16 MFU | 125102 tok/s step 9754/19560 | loss 3.404916 (-0.17z)| norm 0.2718 (-0.62z)| lr 3.19e-04 | 4151.02 ms | 32.5% bf16 MFU | 125162 tok/s step 9755/19560 | loss 3.434236 (+0.58z)| norm 0.2638 (-1.07z)| lr 3.19e-04 | 4293.39 ms | 31.4% bf16 MFU | 125010 tok/s step 9756/19560 | loss 3.392338 (-0.51z)| norm 0.2690 (-0.76z)| lr 3.19e-04 | 4153.93 ms | 32.5% bf16 MFU | 125070 tok/s step 9757/19560 | loss 3.372196 (-1.02z)| norm 0.2559 (-1.49z)| lr 3.19e-04 | 4153.63 ms | 32.5% bf16 MFU | 125128 tok/s step 9758/19560 | loss 3.446554 (+0.91z)| norm 0.2658 (-0.91z)| lr 3.19e-04 | 4158.24 ms | 32.5% bf16 MFU | 125176 tok/s step 9759/19560 | loss 3.428039 (+0.46z)| norm 0.2609 (-1.17z)| lr 3.19e-04 | 4165.32 ms | 32.4% bf16 MFU | 125210 tok/s step 9760/19560 | loss 3.478838 (+1.79z)| norm 0.2516 (-1.67z)| lr 3.19e-04 | 4162.10 ms | 32.4% bf16 MFU | 125248 tok/s step 9761/19560 | loss 3.331608 (-2.07z)| norm 0.2660 (-0.85z)| lr 3.18e-04 | 4164.80 ms | 32.4% bf16 MFU | 125280 tok/s step 9762/19560 | loss 3.378162 (-0.89z)| norm 0.2686 (-0.69z)| lr 3.18e-04 | 4151.63 ms | 32.5% bf16 MFU | 125330 tok/s step 9763/19560 | loss 3.367887 (-1.15z)| norm 0.2628 (-1.01z)| lr 3.18e-04 | 4163.67 ms | 32.4% bf16 MFU | 125360 tok/s step 9764/19560 | loss 3.375844
(-0.93z)| norm 0.2629 (-1.00z)| lr 3.18e-04 | 4157.99 ms | 32.5% bf16 MFU | 125396 tok/s step 9765/19560 | loss 3.379210 (-0.83z)| norm 0.2681 (-0.70z)| lr 3.18e-04 | 4162.84 ms | 32.4% bf16 MFU | 125424 tok/s step 9766/19560 | loss 3.440910 (+0.81z)| norm 0.2651 (-0.86z)| lr 3.18e-04 | 4173.57 ms | 32.4% bf16 MFU | 125434 tok/s step 9767/19560 | loss 3.443711 (+0.87z)| norm 0.2870 (+0.35z)| lr 3.18e-04 | 4161.60 ms | 32.4% bf16 MFU | 125461 tok/s step 9768/19560 | loss 3.391742 (-0.50z)| norm 0.2692 (-0.64z)| lr 3.18e-04 | 4164.39 ms | 32.4% bf16 MFU | 125483 tok/s step 9769/19560 | loss 3.414116 (+0.12z)| norm 0.2625 (-1.00z)| lr 3.18e-04 | 4157.26 ms | 32.5% bf16 MFU | 125514 tok/s step 9770/19560 | loss 3.404439 (-0.14z)| norm 0.2652 (-0.83z)| lr 3.18e-04 | 4158.91 ms | 32.5% bf16 MFU | 125542 tok/s step 9771/19560 | loss 3.405582 (-0.11z)| norm 0.2855 (+0.31z)| lr 3.18e-04 | 4164.48 ms | 32.4% bf16 MFU | 125560 tok/s step 9772/19560 | loss 3.415946 (+0.17z)| norm 0.2626 (-0.97z)| lr 3.18e-04 | 4655.05 ms | 29.0% bf16 MFU | 124913 tok/s step 9773/19560 | loss 3.389730 (-0.55z)| norm 0.2755 (-0.24z)| lr 3.18e-04 | 4148.21 ms | 32.5% bf16 MFU | 124987 tok/s step 9774/19560 | loss 3.448224 (+1.05z)| norm 0.2644 (-0.86z)| lr 3.18e-04 | 4156.01 ms | 32.5% bf16 MFU | 125045 tok/s step 9775/19560 | loss 3.390455 (-0.53z)| norm 0.2489 (-1.70z)| lr 3.18e-04 | 4148.56 ms | 32.5% bf16 MFU | 125112 tok/s step 9776/19560 | loss 3.415788 (+0.16z)| norm 0.2652 (-0.79z)| lr 3.18e-04 | 4152.98 ms | 32.5% bf16 MFU | 125168 tok/s step 9777/19560 | loss 3.375335 (-0.93z)| norm 0.2949 (+0.88z)| lr 3.18e-04 | 4168.20 ms | 32.4% bf16 MFU | 125199 tok/s step 9778/19560 | loss 3.443489 (+0.92z)| norm 0.2674 (-0.66z)| lr 3.18e-04 | 4155.15 ms | 32.5% bf16 MFU | 125248 tok/s step 9779/19560 | loss 3.396950 (-0.35z)| norm 0.2673 (-0.67z)| lr 3.18e-04 | 4169.44 ms | 32.4% bf16 MFU | 125273 tok/s step 9780/19560 | loss 3.411440 (+0.04z)| norm 0.2776 (-0.10z)| lr 3.18e-04 | 4163.29 ms | 
32.4% bf16 MFU | 125306 tok/s step 9781/19560 | loss 3.398958 (-0.30z)| norm 0.2748 (-0.26z)| lr 3.17e-04 | 4165.12 ms | 32.4% bf16 MFU | 125334 tok/s step 9782/19560 | loss 3.359653 (-1.36z)| norm 0.3148 (+1.95z)| lr 3.17e-04 | 4167.09 ms | 32.4% bf16 MFU | 125358 tok/s step 9783/19560 | loss 3.434664 (+0.69z)| norm 0.2746 (-0.30z)| lr 3.17e-04 | 4153.94 ms | 32.5% bf16 MFU | 125401 tok/s step 9784/19560 | loss 3.450681 (+1.11z)| norm 0.2976 (+0.97z)| lr 3.17e-04 | 4154.87 ms | 32.5% bf16 MFU | 125440 tok/s step 9785/19560 | loss 3.387785 (-0.63z)| norm 0.2774 (-0.16z)| lr 3.17e-04 | 4157.06 ms | 32.5% bf16 MFU | 125474 tok/s step 9786/19560 | loss 3.441283 (+0.85z)| norm 0.3154 (+1.92z)| lr 3.17e-04 | 4165.71 ms | 32.4% bf16 MFU | 125494 tok/s step 9787/19560 | loss 3.382269 (-0.78z)| norm 0.2845 (+0.21z)| lr 3.17e-04 | 4154.53 ms | 32.5% bf16 MFU | 125529 tok/s step 9788/19560 | loss 3.395477 (-0.41z)| norm 0.3230 (+2.27z)| lr 3.17e-04 | 4157.02 ms | 32.5% bf16 MFU | 125558 tok/s step 9789/19560 | loss 3.355731 (-1.48z)| norm 0.2886 (+0.40z)| lr 3.17e-04 | 4158.73 ms | 32.5% bf16 MFU | 125584 tok/s step 9790/19560 | loss 3.393404 (-0.46z)| norm 0.2860 (+0.26z)| lr 3.17e-04 | 4147.29 ms | 32.6% bf16 MFU | 125626 tok/s step 9791/19560 | loss 3.340816 (-1.88z)| norm 0.2867 (+0.29z)| lr 3.17e-04 | 4152.02 ms | 32.5% bf16 MFU | 125658 tok/s step 9792/19560 | loss 3.430643 (+0.57z)| norm 0.2797 (-0.10z)| lr 3.17e-04 | 4156.30 ms | 32.5% bf16 MFU | 125682 tok/s step 9793/19560 | loss 3.377370 (-0.88z)| norm 0.3008 (+1.04z)| lr 3.17e-04 | 4160.68 ms | 32.5% bf16 MFU | 125699 tok/s step 9794/19560 | loss 3.415948 (+0.17z)| norm 0.3102 (+1.53z)| lr 3.17e-04 | 4170.94 ms | 32.4% bf16 MFU | 125699 tok/s step 9795/19560 | loss 3.407353 (-0.06z)| norm 0.2689 (-0.68z)| lr 3.17e-04 | 4163.30 ms | 32.4% bf16 MFU | 125710 tok/s step 9796/19560 | loss 3.413854 (+0.13z)| norm 0.2800 (-0.07z)| lr 3.17e-04 | 4162.56 ms | 32.4% bf16 MFU | 125722 tok/s step 9797/19560 | loss 3.381080 
(-0.77z)| norm 0.2745 (-0.36z)| lr 3.17e-04 | 4171.09 ms | 32.4% bf16 MFU | 125721 tok/s step 9798/19560 | loss 3.388462 (-0.56z)| norm 0.2886 (+0.42z)| lr 3.17e-04 | 4164.82 ms | 32.4% bf16 MFU | 125729 tok/s step 9799/19560 | loss 3.455462 (+1.26z)| norm 0.2777 (-0.17z)| lr 3.17e-04 | 4167.41 ms | 32.4% bf16 MFU | 125733 tok/s step 9800/19560 | loss 3.373617 (-0.96z)| norm 0.2777 (-0.18z)| lr 3.17e-04 | 4163.14 ms | 32.4% bf16 MFU | 125743 tok/s step 9801/19560 | loss 3.384023 (-0.67z)| norm 0.2811 (+0.03z)| lr 3.16e-04 | 4157.48 ms | 32.5% bf16 MFU | 125761 tok/s step 9802/19560 | loss 3.437192 (+0.78z)| norm 0.2786 (-0.12z)| lr 3.16e-04 | 4164.14 ms | 32.4% bf16 MFU | 125769 tok/s step 9803/19560 | loss 3.447275 (+1.04z)| norm 0.2852 (+0.26z)| lr 3.16e-04 | 4154.36 ms | 32.5% bf16 MFU | 125790 tok/s step 9804/19560 | loss 3.387105 (-0.60z)| norm 0.2734 (-0.42z)| lr 3.16e-04 | 4170.81 ms | 32.4% bf16 MFU | 125786 tok/s step 9805/19560 | loss 3.370239 (-1.07z)| norm 0.2904 (+0.57z)| lr 3.16e-04 | 4158.93 ms | 32.5% bf16 MFU | 125800 tok/s step 9806/19560 | loss 3.437913 (+0.84z)| norm 0.2948 (+0.82z)| lr 3.16e-04 | 4161.08 ms | 32.4% bf16 MFU | 125810 tok/s step 9807/19560 | loss 3.431254 (+0.66z)| norm 0.2926 (+0.68z)| lr 3.16e-04 | 4168.94 ms | 32.4% bf16 MFU | 125807 tok/s step 9808/19560 | loss 3.407820 (-0.01z)| norm 0.2703 (-0.65z)| lr 3.16e-04 | 4170.16 ms | 32.4% bf16 MFU | 125803 tok/s step 9809/19560 | loss 3.371183 (-1.04z)| norm 0.2764 (-0.28z)| lr 3.16e-04 | 4165.82 ms | 32.4% bf16 MFU | 125806 tok/s step 9810/19560 | loss 3.473209 (+1.84z)| norm 0.2822 (+0.06z)| lr 3.16e-04 | 4181.16 ms | 32.3% bf16 MFU | 125785 tok/s step 9811/19560 | loss 3.400421 (-0.21z)| norm 0.2900 (+0.52z)| lr 3.16e-04 | 4178.42 ms | 32.3% bf16 MFU | 125770 tok/s step 9812/19560 | loss 3.399898 (-0.23z)| norm 0.2964 (+0.89z)| lr 3.16e-04 | 4165.83 ms | 32.4% bf16 MFU | 125774 tok/s step 9813/19560 | loss 3.436786 (+0.86z)| norm 0.2679 (-0.82z)| lr 3.16e-04 | 4170.49 ms | 
32.4% bf16 MFU | 125771 tok/s step 9814/19560 | loss 3.370124 (-1.07z)| norm 0.2817 (+0.02z)| lr 3.16e-04 | 4162.05 ms | 32.4% bf16 MFU | 125781 tok/s step 9815/19560 | loss 3.412556 (+0.16z)| norm 0.2997 (+1.09z)| lr 3.16e-04 | 4163.48 ms | 32.4% bf16 MFU | 125788 tok/s step 9816/19560 | loss 3.425490 (+0.52z)| norm 0.3040 (+1.33z)| lr 3.16e-04 | 4363.48 ms | 30.9% bf16 MFU | 125506 tok/s step 9817/19560 | loss 3.355776 (-1.48z)| norm 0.2898 (+0.48z)| lr 3.16e-04 | 4166.86 ms | 32.4% bf16 MFU | 125522 tok/s step 9818/19560 | loss 3.424399 (+0.49z)| norm 0.2900 (+0.49z)| lr 3.16e-04 | 4154.17 ms | 32.5% bf16 MFU | 125556 tok/s step 9819/19560 | loss 3.391249 (-0.47z)| norm 0.2887 (+0.40z)| lr 3.16e-04 | 4159.80 ms | 32.5% bf16 MFU | 125580 tok/s step 9820/19560 | loss 3.424277 (+0.49z)| norm 0.2901 (+0.49z)| lr 3.16e-04 | 4160.71 ms | 32.5% bf16 MFU | 125602 tok/s step 9821/19560 | loss 3.397941 (-0.27z)| norm 0.2818 (-0.01z)| lr 3.15e-04 | 4152.91 ms | 32.5% bf16 MFU | 125634 tok/s step 9822/19560 | loss 3.349642 (-1.65z)| norm 0.2750 (-0.40z)| lr 3.15e-04 | 4163.87 ms | 32.4% bf16 MFU | 125648 tok/s step 9823/19560 | loss 3.359143 (-1.36z)| norm 0.2780 (-0.23z)| lr 3.15e-04 | 4157.57 ms | 32.5% bf16 MFU | 125671 tok/s step 9824/19560 | loss 3.410992 (+0.17z)| norm 0.2744 (-0.44z)| lr 3.15e-04 | 4164.69 ms | 32.4% bf16 MFU | 125682 tok/s step 9825/19560 | loss 3.537888 (+3.69z)| norm 0.2970 (+0.89z)| lr 3.15e-04 | 4166.30 ms | 32.4% bf16 MFU | 125690 tok/s step 9826/19560 | loss 3.414092 (+0.21z)| norm 0.2893 (+0.43z)| lr 3.15e-04 | 4157.03 ms | 32.5% bf16 MFU | 125711 tok/s step 9827/19560 | loss 3.404215 (-0.05z)| norm 0.2724 (-0.57z)| lr 3.15e-04 | 4161.28 ms | 32.4% bf16 MFU | 125725 tok/s step 9828/19560 | loss 3.403255 (-0.08z)| norm 0.2687 (-0.78z)| lr 3.15e-04 | 4157.36 ms | 32.5% bf16 MFU | 125744 tok/s step 9829/19560 | loss 3.374888 (-0.88z)| norm 0.2703 (-0.67z)| lr 3.15e-04 | 4153.68 ms | 32.5% bf16 MFU | 125768 tok/s step 9830/19560 | loss 3.402248 
(-0.10z)| norm 0.2932 (+0.69z)| lr 3.15e-04 | 4165.82 ms | 32.4% bf16 MFU | 125773 tok/s step 9831/19560 | loss 3.398900 (-0.20z)| norm 0.2804 (-0.08z)| lr 3.15e-04 | 4577.67 ms | 29.5% bf16 MFU | 125211 tok/s step 9832/19560 | loss 3.400968 (-0.13z)| norm 0.2766 (-0.29z)| lr 3.15e-04 | 4181.13 ms | 32.3% bf16 MFU | 125220 tok/s step 9833/19560 | loss 3.387739 (-0.50z)| norm 0.2747 (-0.41z)| lr 3.15e-04 | 4185.54 ms | 32.3% bf16 MFU | 125222 tok/s step 9834/19560 | loss 3.374892 (-0.86z)| norm 0.2749 (-0.39z)| lr 3.15e-04 | 4201.81 ms | 32.1% bf16 MFU | 125200 tok/s step 9835/19560 | loss 3.448222 (+1.22z)| norm 0.2667 (-0.88z)| lr 3.15e-04 | 4161.95 ms | 32.4% bf16 MFU | 125238 tok/s step 9836/19560 | loss 3.388020 (-0.51z)| norm 0.2961 (+0.87z)| lr 3.15e-04 | 4155.06 ms | 32.5% bf16 MFU | 125285 tok/s step 9837/19560 | loss 3.433721 (+0.79z)| norm 0.3193 (+2.19z)| lr 3.15e-04 | 4164.57 ms | 32.4% bf16 MFU | 125316 tok/s step 9838/19560 | loss 3.394847 (-0.32z)| norm 0.2941 (+0.71z)| lr 3.15e-04 | 4176.98 ms | 32.3% bf16 MFU | 125326 tok/s step 9839/19560 | loss 3.392781 (-0.37z)| norm 0.2846 (+0.15z)| lr 3.15e-04 | 4160.79 ms | 32.4% bf16 MFU | 125360 tok/s step 9840/19560 | loss 3.398823 (-0.21z)| norm 0.3532 (+3.88z)| lr 3.15e-04 | 4162.43 ms | 32.4% bf16 MFU | 125390 tok/s step 9841/19560 | loss 3.361114 (-1.28z)| norm 0.3058 (+1.27z)| lr 3.14e-04 | 4168.48 ms | 32.4% bf16 MFU | 125409 tok/s step 9842/19560 | loss 3.396459 (-0.24z)| norm 0.3018 (+1.05z)| lr 3.14e-04 | 4157.05 ms | 32.5% bf16 MFU | 125445 tok/s step 9843/19560 | loss 3.423695 (+0.58z)| norm 0.3057 (+1.24z)| lr 3.14e-04 | 4163.48 ms | 32.4% bf16 MFU | 125469 tok/s step 9844/19560 | loss 3.343657 (-1.83z)| norm 0.3140 (+1.66z)| lr 3.14e-04 | 4159.65 ms | 32.5% bf16 MFU | 125497 tok/s step 9845/19560 | loss 3.408217 (+0.12z)| norm 0.2669 (-0.86z)| lr 3.14e-04 | 4171.98 ms | 32.4% bf16 MFU | 125506 tok/s step 9846/19560 | loss 3.393799 (-0.32z)| norm 0.2852 (+0.12z)| lr 3.14e-04 | 4160.82 ms | 
32.4% bf16 MFU | 125531 tok/s step 9847/19560 | loss 3.368428 (-1.08z)| norm 0.2918 (+0.46z)| lr 3.14e-04 | 4174.42 ms | 32.3% bf16 MFU | 125534 tok/s step 9848/19560 | loss 3.489053 (+2.50z)| norm 0.2838 (+0.03z)| lr 3.14e-04 | 4160.60 ms | 32.5% bf16 MFU | 125558 tok/s step 9849/19560 | loss 3.323039 (-2.37z)| norm 0.2657 (-0.93z)| lr 3.14e-04 | 4173.41 ms | 32.4% bf16 MFU | 125561 tok/s step 9850/19560 | loss 3.374948 (-0.83z)| norm 0.2517 (-1.66z)| lr 3.14e-04 | 4169.17 ms | 32.4% bf16 MFU | 125571 tok/s step 9851/19560 | loss 3.369092 (-0.99z)| norm 0.2752 (-0.39z)| lr 3.14e-04 | 4162.43 ms | 32.4% bf16 MFU | 125590 tok/s step 9852/19560 | loss 3.432902 (+0.94z)| norm 0.2515 (-1.64z)| lr 3.14e-04 | 4163.46 ms | 32.4% bf16 MFU | 125607 tok/s step 9853/19560 | loss 3.483886 (+2.41z)| norm 0.2578 (-1.28z)| lr 3.14e-04 | 4160.76 ms | 32.5% bf16 MFU | 125627 tok/s step 9854/19560 | loss 3.332458 (-2.02z)| norm 0.2614 (-1.07z)| lr 3.14e-04 | 4158.41 ms | 32.5% bf16 MFU | 125650 tok/s step 9855/19560 | loss 3.372300 (-0.86z)| norm 0.2825 (+0.06z)| lr 3.14e-04 | 4159.05 ms | 32.5% bf16 MFU | 125670 tok/s step 9856/19560 | loss 3.364049 (-1.09z)| norm 0.2613 (-1.06z)| lr 3.14e-04 | 4164.94 ms | 32.4% bf16 MFU | 125681 tok/s step 9857/19560 | loss 3.383100 (-0.53z)| norm 0.2757 (-0.30z)| lr 3.14e-04 | 4158.38 ms | 32.5% bf16 MFU | 125701 tok/s step 9858/19560 | loss 3.391042 (-0.31z)| norm 0.2476 (-1.76z)| lr 3.14e-04 | 4166.06 ms | 32.4% bf16 MFU | 125708 tok/s step 9859/19560 | loss 3.449863 (+1.39z)| norm 0.2971 (+0.84z)| lr 3.14e-04 | 4160.33 ms | 32.5% bf16 MFU | 125724 tok/s step 9860/19560 | loss 3.393045 (-0.25z)| norm 0.3027 (+1.12z)| lr 3.14e-04 | 4180.74 ms | 32.3% bf16 MFU | 125708 tok/s step 9861/19560 | loss 3.397077 (-0.13z)| norm 0.3008 (+1.01z)| lr 3.13e-04 | 4153.12 ms | 32.5% bf16 MFU | 125734 tok/s step 9862/19560 | loss 3.409013 (+0.21z)| norm 0.2756 (-0.31z)| lr 3.13e-04 | 4164.45 ms | 32.4% bf16 MFU | 125742 tok/s step 9863/19560 | loss 3.375043 
(-0.76z)| norm 0.3055 (+1.25z)| lr 3.13e-04 | 4156.89 ms | 32.5% bf16 MFU | 125762 tok/s step 9864/19560 | loss 3.388626 (-0.36z)| norm 0.2678 (-0.70z)| lr 3.13e-04 | 4150.46 ms | 32.5% bf16 MFU | 125790 tok/s step 9865/19560 | loss 3.381501 (-0.57z)| norm 0.2809 (-0.01z)| lr 3.13e-04 | 4154.95 ms | 32.5% bf16 MFU | 125809 tok/s step 9866/19560 | loss 3.372408 (-0.82z)| norm 0.2591 (-1.14z)| lr 3.13e-04 | 4160.11 ms | 32.5% bf16 MFU | 125820 tok/s step 9867/19560 | loss 3.479676 (+2.22z)| norm 0.2772 (-0.17z)| lr 3.13e-04 | 4157.27 ms | 32.5% bf16 MFU | 125835 tok/s step 9868/19560 | loss 3.443911 (+1.19z)| norm 0.2775 (-0.15z)| lr 3.13e-04 | 4151.89 ms | 32.5% bf16 MFU | 125857 tok/s step 9869/19560 | loss 3.382197 (-0.55z)| norm 0.2775 (-0.16z)| lr 3.13e-04 | 4152.46 ms | 32.5% bf16 MFU | 125877 tok/s step 9870/19560 | loss 3.381952 (-0.55z)| norm 0.2626 (-0.94z)| lr 3.13e-04 | 4908.97 ms | 27.5% bf16 MFU | 124923 tok/s step 9871/19560 | loss 3.448707 (+1.33z)| norm 0.3062 (+1.35z)| lr 3.13e-04 | 4165.22 ms | 32.4% bf16 MFU | 124971 tok/s step 9872/19560 | loss 3.416312 (+0.40z)| norm 0.3129 (+1.67z)| lr 3.13e-04 | 4160.18 ms | 32.5% bf16 MFU | 125023 tok/s step 9873/19560 | loss 3.350136 (-1.44z)| norm 0.3320 (+2.57z)| lr 3.13e-04 | 4167.14 ms | 32.4% bf16 MFU | 125063 tok/s step 9874/19560 | loss 3.405630 (+0.11z)| norm 0.3195 (+1.89z)| lr 3.13e-04 | 4155.94 ms | 32.5% bf16 MFU | 125118 tok/s step 9875/19560 | loss 3.416392 (+0.40z)| norm 0.2616 (-1.02z)| lr 3.13e-04 | 4153.08 ms | 32.5% bf16 MFU | 125174 tok/s step 9876/19560 | loss 3.439298 (+1.03z)| norm 0.3085 (+1.50z)| lr 3.13e-04 | 4171.20 ms | 32.4% bf16 MFU | 125200 tok/s step 9877/19560 | loss 3.402229 (+0.01z)| norm 0.2797 (-0.09z)| lr 3.13e-04 | 4162.42 ms | 32.4% bf16 MFU | 125238 tok/s step 9878/19560 | loss 3.350001 (-1.48z)| norm 0.2952 (+0.75z)| lr 3.13e-04 | 4156.68 ms | 32.5% bf16 MFU | 125282 tok/s step 9879/19560 | loss 3.374844 (-0.77z)| norm 0.2564 (-1.38z)| lr 3.13e-04 | 4153.97 ms | 
32.5% bf16 MFU | 125329 tok/s step 9880/19560 | loss 3.305420 (-2.68z)| norm 0.3076 (+1.41z)| lr 3.13e-04 | 4169.43 ms | 32.4% bf16 MFU | 125350 tok/s step 9881/19560 | loss 3.402570 (+0.03z)| norm 0.2670 (-0.81z)| lr 3.12e-04 | 4150.77 ms | 32.5% bf16 MFU | 125398 tok/s step 9882/19560 | loss 3.405560 (+0.11z)| norm 0.2822 (+0.02z)| lr 3.12e-04 | 4166.58 ms | 32.4% bf16 MFU | 125419 tok/s step 9883/19560 | loss 3.387223 (-0.39z)| norm 0.2970 (+0.82z)| lr 3.12e-04 | 4155.24 ms | 32.5% bf16 MFU | 125457 tok/s step 9884/19560 | loss 3.369556 (-0.88z)| norm 0.2950 (+0.69z)| lr 3.12e-04 | 4160.97 ms | 32.4% bf16 MFU | 125484 tok/s step 9885/19560 | loss 3.445555 (+1.22z)| norm 0.3387 (+2.97z)| lr 3.12e-04 | 4151.57 ms | 32.5% bf16 MFU | 125525 tok/s step 9886/19560 | loss 3.445219 (+1.21z)| norm 0.3036 (+1.08z)| lr 3.12e-04 | 4151.13 ms | 32.5% bf16 MFU | 125563 tok/s step 9887/19560 | loss 3.398032 (-0.09z)| norm 0.3172 (+1.77z)| lr 3.12e-04 | 4154.98 ms | 32.5% bf16 MFU | 125594 tok/s step 9888/19560 | loss 3.484557 (+2.31z)| norm 0.3217 (+1.97z)| lr 3.12e-04 | 4164.11 ms | 32.4% bf16 MFU | 125610 tok/s step 9889/19560 | loss 3.416857 (+0.42z)| norm 0.2931 (+0.45z)| lr 3.12e-04 | 4152.09 ms | 32.5% bf16 MFU | 125643 tok/s step 9890/19560 | loss 3.490875 (+2.42z)| norm 0.3083 (+1.23z)| lr 3.12e-04 | 4160.13 ms | 32.5% bf16 MFU | 125662 tok/s step 9891/19560 | loss 3.407633 (+0.12z)| norm 0.2895 (+0.24z)| lr 3.12e-04 | 4146.87 ms | 32.6% bf16 MFU | 125701 tok/s step 9892/19560 | loss 3.395755 (-0.21z)| norm 0.3318 (+2.40z)| lr 3.12e-04 | 4162.68 ms | 32.4% bf16 MFU | 125713 tok/s step 9893/19560 | loss 3.374494 (-0.80z)| norm 0.2879 (+0.12z)| lr 3.12e-04 | 4154.05 ms | 32.5% bf16 MFU | 125738 tok/s step 9894/19560 | loss 3.432853 (+0.82z)| norm 0.2879 (+0.11z)| lr 3.12e-04 | 4169.58 ms | 32.4% bf16 MFU | 125738 tok/s step 9895/19560 | loss 3.379204 (-0.66z)| norm 0.2708 (-0.78z)| lr 3.12e-04 | 4165.76 ms | 32.4% bf16 MFU | 125744 tok/s step 9896/19560 | loss 3.431875 
(+0.80z)| norm 0.2581 (-1.43z)| lr 3.12e-04 | 4155.96 ms | 32.5% bf16 MFU | 125764 tok/s step 9897/19560 | loss 3.403840 (+0.02z)| norm 0.2790 (-0.36z)| lr 3.12e-04 | 4152.63 ms | 32.5% bf16 MFU | 125789 tok/s step 9898/19560 | loss 3.352772 (-1.37z)| norm 0.2706 (-0.80z)| lr 3.12e-04 | 4153.70 ms | 32.5% bf16 MFU | 125811 tok/s step 9899/19560 | loss 3.486046 (+2.24z)| norm 0.2986 (+0.66z)| lr 3.12e-04 | 4153.96 ms | 32.5% bf16 MFU | 125831 tok/s step 9900/19560 | loss 3.418513 (+0.41z)| norm 0.2800 (-0.32z)| lr 3.12e-04 | 4149.48 ms | 32.5% bf16 MFU | 125857 tok/s step 9901/19560 | loss 3.356451 (-1.25z)| norm 0.2562 (-1.55z)| lr 3.11e-04 | 4160.17 ms | 32.5% bf16 MFU | 125865 tok/s step 9902/19560 | loss 3.443769 (+1.10z)| norm 0.3037 (+0.91z)| lr 3.11e-04 | 4160.17 ms | 32.5% bf16 MFU | 125873 tok/s step 9903/19560 | loss 3.310270 (-2.42z)| norm 0.2674 (-1.00z)| lr 3.11e-04 | 4153.60 ms | 32.5% bf16 MFU | 125891 tok/s step 9904/19560 | loss 3.337177 (-1.68z)| norm 0.2836 (-0.15z)| lr 3.11e-04 | 4162.65 ms | 32.4% bf16 MFU | 125894 tok/s step 9905/19560 | loss 3.369280 (-0.84z)| norm 0.2691 (-0.91z)| lr 3.11e-04 | 4168.73 ms | 32.4% bf16 MFU | 125887 tok/s step 9906/19560 | loss 3.433330 (+0.82z)| norm 0.2965 (+0.53z)| lr 3.11e-04 | 4162.82 ms | 32.4% bf16 MFU | 125890 tok/s step 9907/19560 | loss 3.428333 (+0.69z)| norm 0.3247 (+1.98z)| lr 3.11e-04 | 4172.69 ms | 32.4% bf16 MFU | 125878 tok/s step 9908/19560 | loss 3.398224 (-0.09z)| norm 0.2776 (-0.49z)| lr 3.11e-04 | 4164.21 ms | 32.4% bf16 MFU | 125879 tok/s step 9909/19560 | loss 3.339944 (-1.58z)| norm 0.3146 (+1.42z)| lr 3.11e-04 | 4159.00 ms | 32.5% bf16 MFU | 125889 tok/s step 9910/19560 | loss 3.418892 (+0.44z)| norm 0.2700 (-0.88z)| lr 3.11e-04 | 4167.49 ms | 32.4% bf16 MFU | 125884 tok/s step 9911/19560 | loss 3.380146 (-0.55z)| norm 0.3091 (+1.14z)| lr 3.11e-04 | 4168.63 ms | 32.4% bf16 MFU | 125879 tok/s step 9912/19560 | loss 3.336141 (-1.66z)| norm 0.2816 (-0.29z)| lr 3.11e-04 | 4158.26 ms | 
32.5% bf16 MFU | 125889 tok/s step 9913/19560 | loss 3.486269 (+2.15z)| norm 0.2928 (+0.29z)| lr 3.11e-04 | 4159.91 ms | 32.5% bf16 MFU | 125896 tok/s step 9914/19560 | loss 3.392662 (-0.21z)| norm 0.2905 (+0.18z)| lr 3.11e-04 | 4151.12 ms | 32.5% bf16 MFU | 125916 tok/s step 9915/19560 | loss 3.395994 (-0.13z)| norm 0.2806 (-0.33z)| lr 3.11e-04 | 4159.40 ms | 32.5% bf16 MFU | 125923 tok/s step 9916/19560 | loss 3.527533 (+3.07z)| norm 0.2928 (+0.32z)| lr 3.11e-04 | 4159.38 ms | 32.5% bf16 MFU | 125929 tok/s step 9917/19560 | loss 3.434669 (+0.79z)| norm 0.3171 (+1.59z)| lr 3.11e-04 | 4159.08 ms | 32.5% bf16 MFU | 125936 tok/s step 9918/19560 | loss 3.374711 (-0.68z)| norm 0.2660 (-1.10z)| lr 3.11e-04 | 4161.02 ms | 32.4% bf16 MFU | 125939 tok/s step 9919/19560 | loss 3.382106 (-0.51z)| norm 0.2889 (+0.11z)| lr 3.11e-04 | 4157.71 ms | 32.5% bf16 MFU | 125947 tok/s step 9920/19560 | loss 3.367517 (-0.86z)| norm 0.2766 (-0.54z)| lr 3.11e-04 | 4159.48 ms | 32.5% bf16 MFU | 125952 tok/s step 9921/19560 | loss 3.366839 (-0.87z)| norm 0.2895 (+0.15z)| lr 3.10e-04 | 4164.56 ms | 32.4% bf16 MFU | 125949 tok/s step 9922/19560 | loss 3.432039 (+0.73z)| norm 0.2892 (+0.14z)| lr 3.10e-04 | 4160.02 ms | 32.5% bf16 MFU | 125953 tok/s step 9923/19560 | loss 3.395459 (-0.17z)| norm 0.2856 (-0.06z)| lr 3.10e-04 | 4164.56 ms | 32.4% bf16 MFU | 125950 tok/s step 9924/19560 | loss 3.521018 (+2.81z)| norm 0.3169 (+1.58z)| lr 3.10e-04 | 4152.94 ms | 32.5% bf16 MFU | 125965 tok/s step 9925/19560 | loss 3.336072 (-1.57z)| norm 0.2936 (+0.34z)| lr 3.10e-04 | 4155.78 ms | 32.5% bf16 MFU | 125974 tok/s step 9926/19560 | loss 3.312579 (-2.08z)| norm 0.3198 (+1.69z)| lr 3.10e-04 | 4151.56 ms | 32.5% bf16 MFU | 125990 tok/s step 9927/19560 | loss 3.347140 (-1.26z)| norm 0.2779 (-0.49z)| lr 3.10e-04 | 4165.75 ms | 32.4% bf16 MFU | 125983 tok/s step 9928/19560 | loss 3.418285 (+0.39z)| norm 0.2954 (+0.41z)| lr 3.10e-04 | 4160.31 ms | 32.5% bf16 MFU | 125985 tok/s step 9929/19560 | loss 3.394294 
(-0.17z)| norm 0.3033 (+0.81z)| lr 3.10e-04 | 4160.29 ms | 32.5% bf16 MFU | 125987 tok/s step 9930/19560 | loss 3.443127 (+0.96z)| norm 0.2918 (+0.21z)| lr 3.10e-04 | 4155.91 ms | 32.5% bf16 MFU | 125996 tok/s step 9931/19560 | loss 3.319749 (-1.86z)| norm 0.3272 (+2.00z)| lr 3.10e-04 | 4155.71 ms | 32.5% bf16 MFU | 126004 tok/s step 9932/19560 | loss 3.358462 (-0.96z)| norm 0.2892 (+0.05z)| lr 3.10e-04 | 4157.64 ms | 32.5% bf16 MFU | 126009 tok/s step 9933/19560 | loss 3.378944 (-0.50z)| norm 0.2887 (+0.03z)| lr 3.10e-04 | 4158.55 ms | 32.5% bf16 MFU | 126012 tok/s step 9934/19560 | loss 3.361701 (-0.88z)| norm 0.2796 (-0.43z)| lr 3.10e-04 | 4157.93 ms | 32.5% bf16 MFU | 126016 tok/s step 9935/19560 | loss 3.403557 (+0.09z)| norm 0.2800 (-0.41z)| lr 3.10e-04 | 4157.94 ms | 32.5% bf16 MFU | 126020 tok/s step 9936/19560 | loss 3.338878 (-1.38z)| norm 0.2915 (+0.17z)| lr 3.10e-04 | 4163.25 ms | 32.4% bf16 MFU | 126016 tok/s step 9937/19560 | loss 3.572869 (+3.71z)| norm 0.3101 (+1.11z)| lr 3.10e-04 | 4154.04 ms | 32.5% bf16 MFU | 126025 tok/s step 9938/19560 | loss 3.344428 (-1.20z)| norm 0.3044 (+0.81z)| lr 3.10e-04 | 4147.30 ms | 32.6% bf16 MFU | 126045 tok/s step 9939/19560 | loss 3.351420 (-1.04z)| norm 0.2881 (-0.03z)| lr 3.10e-04 | 4159.59 ms | 32.5% bf16 MFU | 126045 tok/s step 9940/19560 | loss 3.387404 (-0.26z)| norm 0.2651 (-1.18z)| lr 3.10e-04 | 4151.66 ms | 32.5% bf16 MFU | 126057 tok/s step 9941/19560 | loss 3.387643 (-0.24z)| norm 0.2854 (-0.16z)| lr 3.09e-04 | 4154.29 ms | 32.5% bf16 MFU | 126064 tok/s step 9942/19560 | loss 3.368452 (-0.66z)| norm 0.2615 (-1.36z)| lr 3.09e-04 | 4168.26 ms | 32.4% bf16 MFU | 126050 tok/s step 9943/19560 | loss 3.420306 (+0.46z)| norm 0.2779 (-0.52z)| lr 3.09e-04 | 4160.01 ms | 32.5% bf16 MFU | 126049 tok/s step 9944/19560 | loss 3.438985 (+0.86z)| norm 0.2788 (-0.46z)| lr 3.09e-04 | 4163.41 ms | 32.4% bf16 MFU | 126043 tok/s step 9945/19560 | loss 3.379979 (-0.42z)| norm 0.2670 (-1.05z)| lr 3.09e-04 | 4180.06 ms | 
32.3% bf16 MFU | 126012 tok/s step 9946/19560 | loss 3.418176 (+0.41z)| norm 0.2631 (-1.23z)| lr 3.09e-04 | 4163.40 ms | 32.4% bf16 MFU | 126008 tok/s step 9947/19560 | loss 3.410137 (+0.23z)| norm 0.2836 (-0.20z)| lr 3.09e-04 | 4163.57 ms | 32.4% bf16 MFU | 126004 tok/s step 9948/19560 | loss 3.364868 (-0.74z)| norm 0.2752 (-0.62z)| lr 3.09e-04 | 4156.14 ms | 32.5% bf16 MFU | 126011 tok/s step 9949/19560 | loss 3.375516 (-0.50z)| norm 0.2545 (-1.63z)| lr 3.09e-04 | 4159.71 ms | 32.5% bf16 MFU | 126012 tok/s step 9950/19560 | loss 3.356877 (-0.91z)| norm 0.2828 (-0.22z)| lr 3.09e-04 | 4153.39 ms | 32.5% bf16 MFU | 126023 tok/s step 9951/19560 | loss 3.447911 (+1.04z)| norm 0.2652 (-1.09z)| lr 3.09e-04 | 4148.09 ms | 32.5% bf16 MFU | 126042 tok/s step 9952/19560 | loss 3.337161 (-1.32z)| norm 0.2735 (-0.68z)| lr 3.09e-04 | 4163.23 ms | 32.4% bf16 MFU | 126036 tok/s step 9953/19560 | loss 3.461666 (+1.40z)| norm 0.2625 (-1.20z)| lr 3.09e-04 | 4151.90 ms | 32.5% bf16 MFU | 126048 tok/s step 9954/19560 | loss 3.412142 (+0.30z)| norm 0.2692 (-0.86z)| lr 3.09e-04 | 4161.45 ms | 32.4% bf16 MFU | 126045 tok/s step 9955/19560 | loss 3.498349 (+2.15z)| norm 0.2934 (+0.32z)| lr 3.09e-04 | 4153.82 ms | 32.5% bf16 MFU | 126054 tok/s step 9956/19560 | loss 3.352133 (-1.00z)| norm 0.2573 (-1.45z)| lr 3.09e-04 | 4161.66 ms | 32.4% bf16 MFU | 126050 tok/s step 9957/19560 | loss 3.358377 (-0.86z)| norm 0.2890 (+0.10z)| lr 3.09e-04 | 4164.15 ms | 32.4% bf16 MFU | 126043 tok/s step 9958/19560 | loss 3.377465 (-0.45z)| norm 0.2509 (-1.74z)| lr 3.09e-04 | 4163.72 ms | 32.4% bf16 MFU | 126037 tok/s step 9959/19560 | loss 3.418669 (+0.43z)| norm 0.2878 (+0.06z)| lr 3.09e-04 | 4166.80 ms | 32.4% bf16 MFU | 126026 tok/s step 9960/19560 | loss 3.309873 (-1.86z)| norm 0.2843 (-0.12z)| lr 3.09e-04 | 4152.45 ms | 32.5% bf16 MFU | 126038 tok/s step 9961/19560 | loss 3.577606 (+3.59z)| norm 0.2819 (-0.24z)| lr 3.08e-04 | 4161.87 ms | 32.4% bf16 MFU | 126035 tok/s step 9962/19560 | loss 3.438706 
(+0.78z)| norm 0.2988 (+0.58z)| lr 3.08e-04 | 4155.61 ms | 32.5% bf16 MFU | 126041 tok/s step 9963/19560 | loss 3.361722 (-0.75z)| norm 0.2875 (+0.02z)| lr 3.08e-04 | 4158.74 ms | 32.5% bf16 MFU | 126042 tok/s step 9964/19560 | loss 3.444136 (+0.89z)| norm 0.2885 (+0.07z)| lr 3.08e-04 | 4168.07 ms | 32.4% bf16 MFU | 126030 tok/s step 9965/19560 | loss 3.369087 (-0.60z)| norm 0.2838 (-0.15z)| lr 3.08e-04 | 4159.12 ms | 32.5% bf16 MFU | 126031 tok/s step 9966/19560 | loss 3.381649 (-0.35z)| norm 0.2745 (-0.60z)| lr 3.08e-04 | 4164.47 ms | 32.4% bf16 MFU | 126024 tok/s step 9967/19560 | loss 3.412756 (+0.27z)| norm 0.2661 (-1.00z)| lr 3.08e-04 | 4167.57 ms | 32.4% bf16 MFU | 126013 tok/s step 9968/19560 | loss 3.467965 (+1.36z)| norm 0.2772 (-0.45z)| lr 3.08e-04 | 4151.33 ms | 32.5% bf16 MFU | 126027 tok/s step 9969/19560 | loss 3.379123 (-0.41z)| norm 0.2636 (-1.13z)| lr 3.08e-04 | 4154.68 ms | 32.5% bf16 MFU | 126035 tok/s step 9970/19560 | loss 3.374671 (-0.50z)| norm 0.2549 (-1.55z)| lr 3.08e-04 | 4158.13 ms | 32.5% bf16 MFU | 126038 tok/s step 9971/19560 | loss 3.397250 (-0.04z)| norm 0.2686 (-0.84z)| lr 3.08e-04 | 4158.47 ms | 32.5% bf16 MFU | 126040 tok/s step 9972/19560 | loss 3.428330 (+0.57z)| norm 0.2600 (-1.26z)| lr 3.08e-04 | 4156.23 ms | 32.5% bf16 MFU | 126045 tok/s step 9973/19560 | loss 3.375326 (-0.49z)| norm 0.2606 (-1.22z)| lr 3.08e-04 | 4153.60 ms | 32.5% bf16 MFU | 126054 tok/s step 9974/19560 | loss 3.449438 (+0.98z)| norm 0.2977 (+0.67z)| lr 3.08e-04 | 4153.77 ms | 32.5% bf16 MFU | 126063 tok/s step 9975/19560 | loss 3.385507 (-0.30z)| norm 0.3079 (+1.18z)| lr 3.08e-04 | 4155.19 ms | 32.5% bf16 MFU | 126068 tok/s step 9976/19560 | loss 3.444544 (+0.90z)| norm 0.3032 (+0.93z)| lr 3.08e-04 | 4172.54 ms | 32.4% bf16 MFU | 126047 tok/s step 9977/19560 | loss 3.419333 (+0.38z)| norm 0.2772 (-0.39z)| lr 3.08e-04 | 4159.01 ms | 32.5% bf16 MFU | 126048 tok/s step 9978/19560 | loss 3.405983 (+0.10z)| norm 0.2954 (+0.52z)| lr 3.08e-04 | 4165.74 ms | 
32.4% bf16 MFU | 126039 tok/s step 9979/19560 | loss 3.338494 (-1.26z)| norm 0.2753 (-0.51z)| lr 3.08e-04 | 4161.96 ms | 32.4% bf16 MFU | 126035 tok/s step 9980/19560 | loss 3.407959 (+0.15z)| norm 0.3005 (+0.77z)| lr 3.08e-04 | 4154.08 ms | 32.5% bf16 MFU | 126044 tok/s step 9981/19560 | loss 3.402723 (+0.06z)| norm 0.2845 (-0.07z)| lr 3.07e-04 | 4153.74 ms | 32.5% bf16 MFU | 126053 tok/s step 9982/19560 | loss 3.389955 (-0.21z)| norm 0.2790 (-0.37z)| lr 3.07e-04 | 4163.77 ms | 32.4% bf16 MFU | 126046 tok/s step 9983/19560 | loss 3.488369 (+1.78z)| norm 0.2903 (+0.22z)| lr 3.07e-04 | 4156.89 ms | 32.5% bf16 MFU | 126050 tok/s step 9984/19560 | loss 3.361256 (-0.82z)| norm 0.2954 (+0.48z)| lr 3.07e-04 | 4156.76 ms | 32.5% bf16 MFU | 126054 tok/s step 9985/19560 | loss 3.363315 (-0.77z)| norm 0.2838 (-0.13z)| lr 3.07e-04 | 4154.42 ms | 32.5% bf16 MFU | 126061 tok/s step 9986/19560 | loss 3.354053 (-0.95z)| norm 0.2865 (-0.01z)| lr 3.07e-04 | 4163.84 ms | 32.4% bf16 MFU | 126054 tok/s step 9987/19560 | loss 3.358720 (-0.84z)| norm 0.2909 (+0.23z)| lr 3.07e-04 | 4165.80 ms | 32.4% bf16 MFU | 126044 tok/s step 9988/19560 | loss 3.401223 (+0.02z)| norm 0.3009 (+0.77z)| lr 3.07e-04 | 4154.74 ms | 32.5% bf16 MFU | 126051 tok/s step 9989/19560 | loss 3.375134 (-0.50z)| norm 0.2735 (-0.69z)| lr 3.07e-04 | 4171.32 ms | 32.4% bf16 MFU | 126033 tok/s step 9990/19560 | loss 3.439079 (+0.79z)| norm 0.3305 (+2.31z)| lr 3.07e-04 | 4150.59 ms | 32.5% bf16 MFU | 126047 tok/s step 9991/19560 | loss 3.423347 (+0.46z)| norm 0.2825 (-0.22z)| lr 3.07e-04 | 4161.91 ms | 32.4% bf16 MFU | 126044 tok/s step 9992/19560 | loss 3.376010 (-0.50z)| norm 0.2903 (+0.19z)| lr 3.07e-04 | 4155.85 ms | 32.5% bf16 MFU | 126049 tok/s step 9993/19560 | loss 3.393794 (-0.14z)| norm 0.3234 (+1.90z)| lr 3.07e-04 | 4163.33 ms | 32.4% bf16 MFU | 126043 tok/s step 9994/19560 | loss 3.439639 (+0.78z)| norm 0.2862 (-0.06z)| lr 3.07e-04 | 4162.69 ms | 32.4% bf16 MFU | 126039 tok/s step 9995/19560 | loss 3.352041 
(-0.98z)| norm 0.3055 (+0.94z)| lr 3.07e-04 | 4153.25 ms | 32.5% bf16 MFU | 126048 tok/s step 9996/19560 | loss 3.459372 (+1.20z)| norm 0.2835 (-0.22z)| lr 3.07e-04 | 4170.54 ms | 32.4% bf16 MFU | 126032 tok/s step 9997/19560 | loss 3.389293 (-0.22z)| norm 0.2833 (-0.23z)| lr 3.07e-04 | 4168.06 ms | 32.4% bf16 MFU | 126019 tok/s step 9998/19560 | loss 3.303568 (-1.92z)| norm 0.3150 (+1.42z)| lr 3.07e-04 | 4176.86 ms | 32.3% bf16 MFU | 125995 tok/s step 9999/19560 | loss 3.398655 (-0.01z)| norm 0.2577 (-1.58z)| lr 3.07e-04 | 4149.41 ms | 32.5% bf16 MFU | 126012 tok/s step 10000/19560 | loss 3.351919 (-0.94z)| norm 0.2899 (+0.13z)| lr 3.07e-04 | 4156.37 ms | 32.5% bf16 MFU | 126019 tok/s val loss 3.387246
HellaSwag: 2942/10042 = 0.292970 Writing checkpoint at step 10000 Writing model to log124M/model_00010000.bin Writing state to log124M/state_00010000_00000.bin step 10001/19560 | loss 3.384484 (-0.29z)| norm 0.2740 (-0.71z)| lr 3.06e-04 | 4191.21 ms | 32.2% bf16 MFU | 125972 tok/s step 10002/19560 | loss 3.368197 (-0.61z)| norm 0.2715 (-0.83z)| lr 3.06e-04 | 4157.06 ms | 32.5% bf16 MFU | 125980 tok/s step 10003/19560 | loss 3.356989 (-0.83z)| norm 0.2682 (-1.01z)| lr 3.06e-04 | 4150.69 ms | 32.5% bf16 MFU | 125997 tok/s step 10004/19560 | loss 3.372653 (-0.50z)| norm 0.2720 (-0.79z)| lr 3.06e-04 | 4206.08 ms | 32.1% bf16 MFU | 125929 tok/s step 10005/19560 | loss 3.377589 (-0.40z)| norm 0.2600 (-1.43z)| lr 3.06e-04 | 4169.88 ms | 32.4% bf16 MFU | 125919 tok/s step 10006/19560 | loss 3.406518 (+0.17z)| norm 0.2847 (-0.08z)| lr 3.06e-04 | 4251.50 ms | 31.8% bf16 MFU | 125789 tok/s step 10007/19560 | loss 3.459251 (+1.21z)| norm 0.2854 (-0.06z)| lr 3.06e-04 | 4166.38 ms | 32.4% bf16 MFU | 125792 tok/s step 10008/19560 | loss 3.508372 (+2.15z)| norm 0.2692 (-0.93z)| lr 3.06e-04 | 4236.80 ms | 31.9% bf16 MFU | 125689 tok/s step 10009/19560 | loss 3.428143 (+0.55z)| norm 0.3000 (+0.75z)| lr 3.06e-04 | 4171.85 ms | 32.4% bf16 MFU | 125689 tok/s step 10010/19560 | loss 3.306807 (-1.83z)| norm 0.2603 (-1.42z)| lr 3.06e-04 | 4158.64 ms | 32.5%
bf16 MFU | 125708 tok/s step 10011/19560 | loss 3.356926 (-0.83z)| norm 0.3425 (+2.96z)| lr 3.06e-04 | 4151.31 ms | 32.5% bf16 MFU | 125737 tok/s step 10012/19560 | loss 3.337426 (-1.21z)| norm 0.2877 (+0.06z)| lr 3.06e-04 | 4190.01 ms | 32.2% bf16 MFU | 125707 tok/s step 10013/19560 | loss 3.426809 (+0.54z)| norm 0.2958 (+0.52z)| lr 3.06e-04 | 4196.65 ms | 32.2% bf16 MFU | 125668 tok/s step 10014/19560 | loss 3.433097 (+0.67z)| norm 0.3348 (+2.58z)| lr 3.06e-04 | 4169.31 ms | 32.4% bf16 MFU | 125672 tok/s step 10015/19560 | loss 3.389306 (-0.19z)| norm 0.2821 (-0.22z)| lr 3.06e-04 | 4152.22 ms | 32.5% bf16 MFU | 125702 tok/s step 10016/19560 | loss 3.364527 (-0.66z)| norm 0.3122 (+1.42z)| lr 3.06e-04 | 4285.03 ms | 31.5% bf16 MFU | 125534 tok/s step 10017/19560 | loss 3.442271 (+0.87z)| norm 0.2945 (+0.45z)| lr 3.06e-04 | 4163.25 ms | 32.4% bf16 MFU | 125554 tok/s step 10018/19560 | loss 3.335650 (-1.22z)| norm 0.2824 (-0.19z)| lr 3.06e-04 | 4155.18 ms | 32.5% bf16 MFU | 125585 tok/s step 10019/19560 | loss 3.388548 (-0.16z)| norm 0.2949 (+0.49z)| lr 3.06e-04 | 4209.57 ms | 32.1% bf16 MFU | 125533 tok/s step 10020/19560 | loss 3.383428 (-0.26z)| norm 0.2765 (-0.51z)| lr 3.06e-04 | 4187.52 ms | 32.2% bf16 MFU | 125517 tok/s step 10021/19560 | loss 3.471621 (+1.46z)| norm 0.2799 (-0.31z)| lr 3.05e-04 | 4180.87 ms | 32.3% bf16 MFU | 125511 tok/s step 10022/19560 | loss 3.413862 (+0.33z)| norm 0.3020 (+0.92z)| lr 3.05e-04 | 4354.52 ms | 31.0% bf16 MFU | 125256 tok/s step 10023/19560 | loss 3.479968 (+1.60z)| norm 0.2732 (-0.69z)| lr 3.05e-04 | 4299.98 ms | 31.4% bf16 MFU | 125089 tok/s step 10024/19560 | loss 3.378299 (-0.38z)| norm 0.2975 (+0.65z)| lr 3.05e-04 | 4204.80 ms | 32.1% bf16 MFU | 125069 tok/s step 10025/19560 | loss 3.451137 (+1.03z)| norm 0.2916 (+0.32z)| lr 3.05e-04 | 4249.47 ms | 31.8% bf16 MFU | 124985 tok/s step 10026/19560 | loss 3.425755 (+0.53z)| norm 0.3245 (+2.11z)| lr 3.05e-04 | 4261.65 ms | 31.7% bf16 MFU | 124887 tok/s step 10027/19560 | loss 
3.419126 (+0.41z)| norm 0.3634 (+3.97z)| lr 3.05e-04 | 4170.89 ms | 32.4% bf16 MFU | 124927 tok/s step 10028/19560 | loss 3.467642 (+1.36z)| norm 0.3261 (+1.98z)| lr 3.05e-04 | 4243.44 ms | 31.8% bf16 MFU | 124859 tok/s step 10029/19560 | loss 3.391971 (-0.13z)| norm 0.3065 (+0.97z)| lr 3.05e-04 | 4161.01 ms | 32.4% bf16 MFU | 124916 tok/s step 10030/19560 | loss 3.502308 (+2.00z)| norm 0.2853 (-0.12z)| lr 3.05e-04 | 4162.45 ms | 32.4% bf16 MFU | 124968 tok/s step 10031/19560 | loss 3.371851 (-0.55z)| norm 0.3082 (+1.05z)| lr 3.05e-04 | 4155.17 ms | 32.5% bf16 MFU | 125028 tok/s step 10032/19560 | loss 3.488362 (+1.71z)| norm 0.2834 (-0.23z)| lr 3.05e-04 | 4214.51 ms | 32.0% bf16 MFU | 124997 tok/s step 10033/19560 | loss 3.480383 (+1.52z)| norm 0.2883 (+0.02z)| lr 3.05e-04 | 4191.88 ms | 32.2% bf16 MFU | 125001 tok/s step 10034/19560 | loss 3.420989 (+0.37z)| norm 0.2988 (+0.56z)| lr 3.05e-04 | 4171.52 ms | 32.4% bf16 MFU | 125035 tok/s step 10035/19560 | loss 3.378144 (-0.45z)| norm 0.2742 (-0.70z)| lr 3.05e-04 | 4196.35 ms | 32.2% bf16 MFU | 125030 tok/s step 10036/19560 | loss 3.474482 (+1.40z)| norm 0.2953 (+0.39z)| lr 3.05e-04 | 4163.68 ms | 32.4% bf16 MFU | 125074 tok/s step 10037/19560 | loss 3.331842 (-1.34z)| norm 0.2729 (-0.77z)| lr 3.05e-04 | 4154.75 ms | 32.5% bf16 MFU | 125130 tok/s step 10038/19560 | loss 3.365737 (-0.68z)| norm 0.2768 (-0.56z)| lr 3.05e-04 | 4155.19 ms | 32.5% bf16 MFU | 125182 tok/s step 10039/19560 | loss 3.427288 (+0.49z)| norm 0.2631 (-1.27z)| lr 3.05e-04 | 4177.18 ms | 32.3% bf16 MFU | 125199 tok/s step 10040/19560 | loss 3.404275 (+0.04z)| norm 0.2768 (-0.54z)| lr 3.05e-04 | 4152.50 ms | 32.5% bf16 MFU | 125252 tok/s step 10041/19560 | loss 3.411663 (+0.19z)| norm 0.2847 (-0.12z)| lr 3.04e-04 | 4161.08 ms | 32.4% bf16 MFU | 125289 tok/s step 10042/19560 | loss 3.426962 (+0.49z)| norm 0.2701 (-0.88z)| lr 3.04e-04 | 4156.99 ms | 32.5% bf16 MFU | 125331 tok/s step 10043/19560 | loss 3.330422 (-1.37z)| norm 0.2990 (+0.63z)| lr 
3.04e-04 | 4148.30 ms | 32.5% bf16 MFU | 125384 tok/s step 10044/19560 | loss 3.427641 (+0.53z)| norm 0.2755 (-0.60z)| lr 3.04e-04 | 4165.64 ms | 32.4% bf16 MFU | 125407 tok/s step 10045/19560 | loss 3.369819 (-0.60z)| norm 0.2794 (-0.38z)| lr 3.04e-04 | 4157.52 ms | 32.5% bf16 MFU | 125442 tok/s step 10046/19560 | loss 3.418028 (+0.35z)| norm 0.2715 (-0.80z)| lr 3.04e-04 | 4165.03 ms | 32.4% bf16 MFU | 125464 tok/s step 10047/19560 | loss 3.426106 (+0.50z)| norm 0.2888 (+0.12z)| lr 3.04e-04 | 4160.22 ms | 32.5% bf16 MFU | 125492 tok/s step 10048/19560 | loss 3.396443 (-0.09z)| norm 0.2806 (-0.32z)| lr 3.04e-04 | 4163.19 ms | 32.4% bf16 MFU | 125514 tok/s step 10049/19560 | loss 3.425707 (+0.48z)| norm 0.2772 (-0.50z)| lr 3.04e-04 | 4157.02 ms | 32.5% bf16 MFU | 125545 tok/s step 10050/19560 | loss 3.414512 (+0.26z)| norm 0.2468 (-2.06z)| lr 3.04e-04 | 4168.52 ms | 32.4% bf16 MFU | 125556 tok/s step 10051/19560 | loss 3.373697 (-0.55z)| norm 0.2737 (-0.65z)| lr 3.04e-04 | 4161.60 ms | 32.4% bf16 MFU | 125577 tok/s step 10052/19560 | loss 3.377996 (-0.45z)| norm 0.2648 (-1.10z)| lr 3.04e-04 | 4165.30 ms | 32.4% bf16 MFU | 125592 tok/s step 10053/19560 | loss 3.364695 (-0.73z)| norm 0.2606 (-1.29z)| lr 3.04e-04 | 4155.65 ms | 32.5% bf16 MFU | 125621 tok/s step 10054/19560 | loss 3.398702 (-0.05z)| norm 0.2835 (-0.09z)| lr 3.04e-04 | 4164.31 ms | 32.4% bf16 MFU | 125635 tok/s step 10055/19560 | loss 3.384107 (-0.36z)| norm 0.2540 (-1.62z)| lr 3.04e-04 | 4154.61 ms | 32.5% bf16 MFU | 125663 tok/s step 10056/19560 | loss 3.499567 (+1.99z)| norm 0.2869 (+0.10z)| lr 3.04e-04 | 4155.96 ms | 32.5% bf16 MFU | 125687 tok/s step 10057/19560 | loss 3.453367 (+1.04z)| norm 0.2687 (-0.84z)| lr 3.04e-04 | 4164.30 ms | 32.4% bf16 MFU | 125698 tok/s step 10058/19560 | loss 3.442045 (+0.81z)| norm 0.2895 (+0.25z)| lr 3.04e-04 | 4161.45 ms | 32.4% bf16 MFU | 125712 tok/s step 10059/19560 | loss 3.374473 (-0.58z)| norm 0.2878 (+0.18z)| lr 3.04e-04 | 4158.52 ms | 32.5% bf16 MFU | 125730 
tok/s step 10060/19560 | loss 3.452102 (+1.00z)| norm 0.2610 (-1.23z)| lr 3.04e-04 | 4161.52 ms | 32.4% bf16 MFU | 125743 tok/s step 10061/19560 | loss 3.346894 (-1.15z)| norm 0.2926 (+0.45z)| lr 3.03e-04 | 4160.69 ms | 32.5% bf16 MFU | 125756 tok/s step 10062/19560 | loss 3.401810 (-0.04z)| norm 0.2781 (-0.32z)| lr 3.03e-04 | 4164.78 ms | 32.4% bf16 MFU | 125763 tok/s step 10063/19560 | loss 3.417536 (+0.28z)| norm 0.2898 (+0.30z)| lr 3.03e-04 | 4147.85 ms | 32.6% bf16 MFU | 125795 tok/s step 10064/19560 | loss 3.376074 (-0.58z)| norm 0.2814 (-0.15z)| lr 3.03e-04 | 4176.00 ms | 32.3% bf16 MFU | 125782 tok/s step 10065/19560 | loss 3.435066 (+0.69z)| norm 0.2853 (+0.07z)| lr 3.03e-04 | 4153.24 ms | 32.5% bf16 MFU | 125805 tok/s step 10066/19560 | loss 3.379731 (-0.51z)| norm 0.2673 (-0.88z)| lr 3.03e-04 | 4153.73 ms | 32.5% bf16 MFU | 125826 tok/s step 10067/19560 | loss 3.522988 (+2.52z)| norm 0.2860 (+0.12z)| lr 3.03e-04 | 4161.26 ms | 32.4% bf16 MFU | 125834 tok/s step 10068/19560 | loss 3.408876 (+0.09z)| norm 0.2749 (-0.47z)| lr 3.03e-04 | 4155.33 ms | 32.5% bf16 MFU | 125851 tok/s step 10069/19560 | loss 3.442302 (+0.79z)| norm 0.2815 (-0.12z)| lr 3.03e-04 | 4154.19 ms | 32.5% bf16 MFU | 125869 tok/s step 10070/19560 | loss 3.422327 (+0.36z)| norm 0.2700 (-0.74z)| lr 3.03e-04 | 4165.11 ms | 32.4% bf16 MFU | 125869 tok/s step 10071/19560 | loss 3.427922 (+0.47z)| norm 0.2717 (-0.65z)| lr 3.03e-04 | 4166.19 ms | 32.4% bf16 MFU | 125868 tok/s step 10072/19560 | loss 3.418424 (+0.28z)| norm 0.2813 (-0.13z)| lr 3.03e-04 | 4163.54 ms | 32.4% bf16 MFU | 125871 tok/s step 10073/19560 | loss 3.382198 (-0.50z)| norm 0.2956 (+0.63z)| lr 3.03e-04 | 4167.88 ms | 32.4% bf16 MFU | 125867 tok/s step 10074/19560 | loss 3.349754 (-1.17z)| norm 0.2677 (-0.88z)| lr 3.03e-04 | 4157.43 ms | 32.5% bf16 MFU | 125879 tok/s step 10075/19560 | loss 3.382951 (-0.46z)| norm 0.2762 (-0.42z)| lr 3.03e-04 | 4159.14 ms | 32.5% bf16 MFU | 125888 tok/s step 10076/19560 | loss 3.370887 
(-0.72z)| norm 0.2694 (-0.78z)| lr 3.03e-04 | 4158.20 ms | 32.5% bf16 MFU | 125898 tok/s step 10077/19560 | loss 3.425132 (+0.42z)| norm 0.2947 (+0.57z)| lr 3.03e-04 | 4172.89 ms | 32.4% bf16 MFU | 125885 tok/s step 10078/19560 | loss 3.436434 (+0.65z)| norm 0.3113 (+1.45z)| lr 3.03e-04 | 4156.16 ms | 32.5% bf16 MFU | 125898 tok/s step 10079/19560 | loss 3.409094 (+0.08z)| norm 0.2839 (-0.03z)| lr 3.03e-04 | 4152.71 ms | 32.5% bf16 MFU | 125916 tok/s step 10080/19560 | loss 3.337385 (-1.45z)| norm 0.3208 (+1.92z)| lr 3.03e-04 | 4163.68 ms | 32.4% bf16 MFU | 125916 tok/s step 10081/19560 | loss 3.395205 (-0.21z)| norm 0.2951 (+0.53z)| lr 3.02e-04 | 4966.56 ms | 27.2% bf16 MFU | 124898 tok/s step 10082/19560 | loss 3.416716 (+0.25z)| norm 0.2787 (-0.35z)| lr 3.02e-04 | 4162.02 ms | 32.4% bf16 MFU | 124952 tok/s step 10083/19560 | loss 3.432511 (+0.61z)| norm 0.3099 (+1.31z)| lr 3.02e-04 | 4155.64 ms | 32.5% bf16 MFU | 125012 tok/s step 10084/19560 | loss 3.373027 (-0.69z)| norm 0.2984 (+0.68z)| lr 3.02e-04 | 4166.11 ms | 32.4% bf16 MFU | 125054 tok/s step 10085/19560 | loss 3.505043 (+2.14z)| norm 0.2996 (+0.74z)| lr 3.02e-04 | 4334.27 ms | 31.2% bf16 MFU | 124850 tok/s step 10086/19560 | loss 3.393170 (-0.28z)| norm 0.2862 (+0.01z)| lr 3.02e-04 | 4159.94 ms | 32.5% bf16 MFU | 124909 tok/s step 10087/19560 | loss 3.418031 (+0.26z)| norm 0.2962 (+0.55z)| lr 3.02e-04 | 4165.35 ms | 32.4% bf16 MFU | 124957 tok/s step 10088/19560 | loss 3.407274 (+0.01z)| norm 0.2735 (-0.68z)| lr 3.02e-04 | 4158.89 ms | 32.5% bf16 MFU | 125012 tok/s step 10089/19560 | loss 3.405216 (-0.00z)| norm 0.2958 (+0.52z)| lr 3.02e-04 | 4164.51 ms | 32.4% bf16 MFU | 125056 tok/s step 10090/19560 | loss 3.419661 (+0.34z)| norm 0.2587 (-1.46z)| lr 3.02e-04 | 4163.69 ms | 32.4% bf16 MFU | 125099 tok/s step 10091/19560 | loss 3.422343 (+0.39z)| norm 0.3121 (+1.40z)| lr 3.02e-04 | 4160.96 ms | 32.4% bf16 MFU | 125144 tok/s step 10092/19560 | loss 3.401239 (-0.10z)| norm 0.2701 (-0.84z)| lr 3.02e-04 | 
4161.49 ms | 32.4% bf16 MFU | 125186 tok/s step 10093/19560 | loss 3.430813 (+0.59z)| norm 0.2797 (-0.33z)| lr 3.02e-04 | 4163.36 ms | 32.4% bf16 MFU | 125224 tok/s step 10094/19560 | loss 3.428956 (+0.54z)| norm 0.3152 (+1.53z)| lr 3.02e-04 | 4170.28 ms | 32.4% bf16 MFU | 125248 tok/s step 10095/19560 | loss 3.411842 (+0.13z)| norm 0.3198 (+1.75z)| lr 3.02e-04 | 4169.15 ms | 32.4% bf16 MFU | 125274 tok/s step 10096/19560 | loss 3.397579 (-0.19z)| norm 0.2904 (+0.19z)| lr 3.02e-04 | 4163.84 ms | 32.4% bf16 MFU | 125306 tok/s step 10097/19560 | loss 3.465607 (+1.40z)| norm 0.2869 (+0.00z)| lr 3.02e-04 | 4162.12 ms | 32.4% bf16 MFU | 125339 tok/s step 10098/19560 | loss 3.449619 (+1.00z)| norm 0.2737 (-0.71z)| lr 3.02e-04 | 4159.38 ms | 32.5% bf16 MFU | 125374 tok/s step 10099/19560 | loss 3.471158 (+1.48z)| norm 0.2804 (-0.36z)| lr 3.02e-04 | 4168.18 ms | 32.4% bf16 MFU | 125395 tok/s step 10100/19560 | loss 3.393045 (-0.33z)| norm 0.3320 (+2.34z)| lr 3.02e-04 | 4158.29 ms | 32.5% bf16 MFU | 125429 tok/s step 10101/19560 | loss 3.387634 (-0.46z)| norm 0.2914 (+0.19z)| lr 3.01e-04 | 4167.11 ms | 32.4% bf16 MFU | 125449 tok/s step 10102/19560 | loss 3.444002 (+0.86z)| norm 0.2798 (-0.43z)| lr 3.01e-04 | 4161.27 ms | 32.4% bf16 MFU | 125476 tok/s step 10103/19560 | loss 3.398285 (-0.21z)| norm 0.2996 (+0.63z)| lr 3.01e-04 | 4170.21 ms | 32.4% bf16 MFU | 125488 tok/s step 10104/19560 | loss 3.401404 (-0.13z)| norm 0.2647 (-1.21z)| lr 3.01e-04 | 4161.19 ms | 32.4% bf16 MFU | 125513 tok/s step 10105/19560 | loss 3.358298 (-1.13z)| norm 0.2992 (+0.62z)| lr 3.01e-04 | 4163.33 ms | 32.4% bf16 MFU | 125534 tok/s step 10106/19560 | loss 3.393490 (-0.30z)| norm 0.2669 (-1.09z)| lr 3.01e-04 | 4164.33 ms | 32.4% bf16 MFU | 125552 tok/s step 10107/19560 | loss 3.414685 (+0.18z)| norm 0.2811 (-0.34z)| lr 3.01e-04 | 4161.26 ms | 32.4% bf16 MFU | 125574 tok/s step 10108/19560 | loss 3.441756 (+0.81z)| norm 0.2690 (-0.96z)| lr 3.01e-04 | 4173.83 ms | 32.3% bf16 MFU | 125576 tok/s step 
10109/19560 | loss 3.443632 (+0.84z)| norm 0.2795 (-0.40z)| lr 3.01e-04 | 4159.78 ms | 32.5% bf16 MFU | 125599 tok/s step 10110/19560 | loss 3.385811 (-0.51z)| norm 0.2901 (+0.15z)| lr 3.01e-04 | 4155.21 ms | 32.5% bf16 MFU | 125628 tok/s step 10111/19560 | loss 3.461449 (+1.28z)| norm 0.3033 (+0.85z)| lr 3.01e-04 | 4169.00 ms | 32.4% bf16 MFU | 125635 tok/s step 10112/19560 | loss 3.364039 (-1.02z)| norm 0.2949 (+0.40z)| lr 3.01e-04 | 4169.30 ms | 32.4% bf16 MFU | 125641 tok/s step 10113/19560 | loss 3.436954 (+0.69z)| norm 0.2773 (-0.52z)| lr 3.01e-04 | 4168.12 ms | 32.4% bf16 MFU | 125648 tok/s step 10114/19560 | loss 3.397332 (-0.26z)| norm 0.2946 (+0.39z)| lr 3.01e-04 | 4164.67 ms | 32.4% bf16 MFU | 125660 tok/s step 10115/19560 | loss 3.552619 (+3.28z)| norm 0.3136 (+1.37z)| lr 3.01e-04 | 4158.12 ms | 32.5% bf16 MFU | 125681 tok/s step 10116/19560 | loss 3.419068 (+0.21z)| norm 0.2638 (-1.22z)| lr 3.01e-04 | 4167.56 ms | 32.4% bf16 MFU | 125687 tok/s step 10117/19560 | loss 3.378581 (-0.72z)| norm 0.2823 (-0.26z)| lr 3.01e-04 | 4163.95 ms | 32.4% bf16 MFU | 125698 tok/s step 10118/19560 | loss 3.401431 (-0.19z)| norm 0.2763 (-0.56z)| lr 3.01e-04 | 4172.63 ms | 32.4% bf16 MFU | 125696 tok/s step 10119/19560 | loss 3.390840 (-0.43z)| norm 0.2874 (+0.03z)| lr 3.01e-04 | 4166.40 ms | 32.4% bf16 MFU | 125703 tok/s step 10120/19560 | loss 3.445381 (+0.81z)| norm 0.2573 (-1.55z)| lr 3.01e-04 | 4159.40 ms | 32.5% bf16 MFU | 125720 tok/s step 10121/19560 | loss 3.399395 (-0.24z)| norm 0.2804 (-0.32z)| lr 3.00e-04 | 4161.32 ms | 32.4% bf16 MFU | 125734 tok/s step 10122/19560 | loss 3.416217 (+0.15z)| norm 0.2634 (-1.21z)| lr 3.00e-04 | 4158.41 ms | 32.5% bf16 MFU | 125751 tok/s step 10123/19560 | loss 3.388295 (-0.51z)| norm 0.2752 (-0.57z)| lr 3.00e-04 | 4168.50 ms | 32.4% bf16 MFU | 125752 tok/s step 10124/19560 | loss 3.493671 (+1.91z)| norm 0.2632 (-1.20z)| lr 3.00e-04 | 4168.77 ms | 32.4% bf16 MFU | 125753 tok/s step 10125/19560 | loss 3.392454 (-0.41z)| norm 
0.2663 (-1.02z)| lr 3.00e-04 | 4159.43 ms | 32.5% bf16 MFU | 125768 tok/s step 10126/19560 | loss 3.353602 (-1.34z)| norm 0.2588 (-1.40z)| lr 3.00e-04 | 4162.07 ms | 32.4% bf16 MFU | 125778 tok/s step 10127/19560 | loss 3.415696 (+0.11z)| norm 0.2794 (-0.31z)| lr 3.00e-04 | 4158.55 ms | 32.5% bf16 MFU | 125793 tok/s step 10128/19560 | loss 3.436640 (+0.59z)| norm 0.2777 (-0.40z)| lr 3.00e-04 | 4166.28 ms | 32.4% bf16 MFU | 125795 tok/s step 10129/19560 | loss 3.421015 (+0.21z)| norm 0.2477 (-1.97z)| lr 3.00e-04 | 4161.37 ms | 32.4% bf16 MFU | 125805 tok/s step 10130/19560 | loss 3.408604 (-0.08z)| norm 0.2889 (+0.20z)| lr 3.00e-04 | 4174.63 ms | 32.3% bf16 MFU | 125794 tok/s step 10131/19560 | loss 3.395266 (-0.41z)| norm 0.2817 (-0.19z)| lr 3.00e-04 | 4166.00 ms | 32.4% bf16 MFU | 125797 tok/s step 10132/19560 | loss 3.420293 (+0.18z)| norm 0.2748 (-0.56z)| lr 3.00e-04 | 4157.26 ms | 32.5% bf16 MFU | 125813 tok/s step 10133/19560 | loss 3.434931 (+0.52z)| norm 0.2865 (+0.05z)| lr 3.00e-04 | 4164.04 ms | 32.4% bf16 MFU | 125817 tok/s step 10134/19560 | loss 3.428565 (+0.36z)| norm 0.2798 (-0.30z)| lr 3.00e-04 | 4160.10 ms | 32.5% bf16 MFU | 125828 tok/s step 10135/19560 | loss 3.435479 (+0.53z)| norm 0.2821 (-0.18z)| lr 3.00e-04 | 4165.66 ms | 32.4% bf16 MFU | 125829 tok/s step 10136/19560 | loss 3.381073 (-0.76z)| norm 0.3010 (+0.82z)| lr 3.00e-04 | 4160.30 ms | 32.5% bf16 MFU | 125839 tok/s step 10137/19560 | loss 3.371013 (-0.99z)| norm 0.2682 (-0.92z)| lr 3.00e-04 | 4159.70 ms | 32.5% bf16 MFU | 125849 tok/s step 10138/19560 | loss 3.444345 (+0.78z)| norm 0.2985 (+0.68z)| lr 3.00e-04 | 4165.57 ms | 32.4% bf16 MFU | 125850 tok/s step 10139/19560 | loss 3.446919 (+0.83z)| norm 0.2532 (-1.75z)| lr 3.00e-04 | 4167.39 ms | 32.4% bf16 MFU | 125848 tok/s step 10140/19560 | loss 3.421893 (+0.19z)| norm 0.2711 (-0.76z)| lr 3.00e-04 | 4170.29 ms | 32.4% bf16 MFU | 125841 tok/s step 10141/19560 | loss 3.368203 (-1.15z)| norm 0.2939 (+0.50z)| lr 3.00e-04 | 4159.20 ms | 
32.5% bf16 MFU | 125852 tok/s step 10142/19560 | loss 3.399572 (-0.35z)| norm 0.2672 (-0.97z)| lr 2.99e-04 | 4164.41 ms | 32.4% bf16 MFU | 125854 tok/s step 10143/19560 | loss 3.407168 (-0.17z)| norm 0.2602 (-1.35z)| lr 2.99e-04 | 4164.52 ms | 32.4% bf16 MFU | 125856 tok/s step 10144/19560 | loss 3.416698 (+0.07z)| norm 0.2700 (-0.78z)| lr 2.99e-04 | 4165.78 ms | 32.4% bf16 MFU | 125856 tok/s step 10145/19560 | loss 3.443342 (+0.74z)| norm 0.2722 (-0.65z)| lr 2.99e-04 | 4167.30 ms | 32.4% bf16 MFU | 125854 tok/s step 10146/19560 | loss 3.375491 (-1.01z)| norm 0.2694 (-0.80z)| lr 2.99e-04 | 4161.16 ms | 32.4% bf16 MFU | 125861 tok/s step 10147/19560 | loss 3.354996 (-1.52z)| norm 0.2560 (-1.52z)| lr 2.99e-04 | 4164.35 ms | 32.4% bf16 MFU | 125863 tok/s step 10148/19560 | loss 3.394083 (-0.52z)| norm 0.2772 (-0.34z)| lr 2.99e-04 | 4167.77 ms | 32.4% bf16 MFU | 125859 tok/s step 10149/19560 | loss 3.411938 (-0.05z)| norm 0.2806 (-0.15z)| lr 2.99e-04 | 4150.39 ms | 32.5% bf16 MFU | 125883 tok/s step 10150/19560 | loss 3.391676 (-0.57z)| norm 0.2744 (-0.48z)| lr 2.99e-04 | 4161.83 ms | 32.4% bf16 MFU | 125887 tok/s step 10151/19560 | loss 3.450224 (+0.96z)| norm 0.2608 (-1.23z)| lr 2.99e-04 | 4166.45 ms | 32.4% bf16 MFU | 125885 tok/s step 10152/19560 | loss 3.409164 (-0.12z)| norm 0.2891 (+0.34z)| lr 2.99e-04 | 4155.88 ms | 32.5% bf16 MFU | 125898 tok/s step 10153/19560 | loss 3.406942 (-0.17z)| norm 0.2861 (+0.18z)| lr 2.99e-04 | 4156.94 ms | 32.5% bf16 MFU | 125910 tok/s step 10154/19560 | loss 3.443296 (+0.78z)| norm 0.3005 (+1.01z)| lr 2.99e-04 | 4154.91 ms | 32.5% bf16 MFU | 125923 tok/s step 10155/19560 | loss 3.468047 (+1.41z)| norm 0.2739 (-0.50z)| lr 2.99e-04 | 4163.01 ms | 32.4% bf16 MFU | 125924 tok/s step 10156/19560 | loss 3.396473 (-0.44z)| norm 0.2794 (-0.14z)| lr 2.99e-04 | 4160.85 ms | 32.4% bf16 MFU | 125928 tok/s step 10157/19560 | loss 3.373655 (-1.03z)| norm 0.2865 (+0.32z)| lr 2.99e-04 | 4164.12 ms | 32.4% bf16 MFU | 125927 tok/s step 10158/19560 
| loss 3.430518 (+0.48z)| norm 0.2640 (-1.11z)| lr 2.99e-04 | 4153.67 ms | 32.5% bf16 MFU | 125942 tok/s step 10159/19560 | loss 3.505606 (+2.40z)| norm 0.5353 (+9.30z)| lr 2.99e-04 | 4160.46 ms | 32.5% bf16 MFU | 125946 tok/s step 10160/19560 | loss 3.407309 (-0.15z)| norm 0.3176 (+1.25z)| lr 2.99e-04 | 4159.00 ms | 32.5% bf16 MFU | 125951 tok/s step 10161/19560 | loss 3.372900 (-1.05z)| norm 0.2936 (+0.37z)| lr 2.99e-04 | 4161.45 ms | 32.4% bf16 MFU | 125953 tok/s step 10162/19560 | loss 3.483423 (+1.87z)| norm 0.3058 (+0.82z)| lr 2.98e-04 | 4164.34 ms | 32.4% bf16 MFU | 125950 tok/s step 10163/19560 | loss 3.331460 (-2.10z)| norm 0.2964 (+0.47z)| lr 2.98e-04 | 4158.50 ms | 32.5% bf16 MFU | 125957 tok/s step 10164/19560 | loss 3.417437 (+0.15z)| norm 0.2922 (+0.31z)| lr 2.98e-04 | 4159.57 ms | 32.5% bf16 MFU | 125961 tok/s step 10165/19560 | loss 3.492098 (+2.08z)| norm 0.2663 (-0.63z)| lr 2.98e-04 | 4161.20 ms | 32.4% bf16 MFU | 125963 tok/s step 10166/19560 | loss 3.428511 (+0.40z)| norm 0.2976 (+0.51z)| lr 2.98e-04 | 4158.27 ms | 32.5% bf16 MFU | 125969 tok/s step 10167/19560 | loss 3.406241 (-0.19z)| norm 0.2786 (-0.19z)| lr 2.98e-04 | 4160.45 ms | 32.5% bf16 MFU | 125971 tok/s step 10168/19560 | loss 3.459733 (+1.21z)| norm 0.3164 (+1.17z)| lr 2.98e-04 | 4158.54 ms | 32.5% bf16 MFU | 125976 tok/s step 10169/19560 | loss 3.409190 (-0.12z)| norm 0.2975 (+0.48z)| lr 2.98e-04 | 4162.87 ms | 32.4% bf16 MFU | 125975 tok/s step 10170/19560 | loss 3.466128 (+1.36z)| norm 0.2876 (+0.12z)| lr 2.98e-04 | 4160.15 ms | 32.5% bf16 MFU | 125977 tok/s step 10171/19560 | loss 3.422915 (+0.22z)| norm 0.2987 (+0.52z)| lr 2.98e-04 | 4156.77 ms | 32.5% bf16 MFU | 125985 tok/s step 10172/19560 | loss 3.363283 (-1.35z)| norm 0.3021 (+0.64z)| lr 2.98e-04 | 4155.00 ms | 32.5% bf16 MFU | 125995 tok/s step 10173/19560 | loss 3.391432 (-0.61z)| norm 0.2828 (-0.06z)| lr 2.98e-04 | 4165.74 ms | 32.4% bf16 MFU | 125988 tok/s step 10174/19560 | loss 3.439390 (+0.66z)| norm 0.3153 (+1.10z)| 
lr 2.98e-04 | 4157.06 ms | 32.5% bf16 MFU | 125994 tok/s step 10175/19560 | loss 3.372732 (-1.10z)| norm 0.2886 (+0.13z)| lr 2.98e-04 | 4165.53 ms | 32.4% bf16 MFU | 125988 tok/s step 10176/19560 | loss 3.424508 (+0.27z)| norm 0.2902 (+0.19z)| lr 2.98e-04 | 4163.82 ms | 32.4% bf16 MFU | 125984 tok/s step 10177/19560 | loss 3.462055 (+1.24z)| norm 0.2680 (-0.61z)| lr 2.98e-04 | 4160.58 ms | 32.5% bf16 MFU | 125986 tok/s step 10178/19560 | loss 3.460637 (+1.19z)| norm 0.2895 (+0.15z)| lr 2.98e-04 | 4158.38 ms | 32.5% bf16 MFU | 125990 tok/s step 10179/19560 | loss 3.382674 (-0.85z)| norm 0.2679 (-0.63z)| lr 2.98e-04 | 4160.57 ms | 32.5% bf16 MFU | 125992 tok/s step 10180/19560 | loss 3.404140 (-0.29z)| norm 0.2843 (-0.04z)| lr 2.98e-04 | 4161.43 ms | 32.4% bf16 MFU | 125991 tok/s step 10181/19560 | loss 3.414956 (-0.02z)| norm 0.2700 (-0.56z)| lr 2.98e-04 | 4162.88 ms | 32.4% bf16 MFU | 125989 tok/s step 10182/19560 | loss 3.445792 (+0.78z)| norm 0.2813 (-0.15z)| lr 2.97e-04 | 4159.77 ms | 32.5% bf16 MFU | 125991 tok/s step 10183/19560 | loss 3.396400 (-0.52z)| norm 0.2934 (+0.28z)| lr 2.97e-04 | 4159.04 ms | 32.5% bf16 MFU | 125995 tok/s step 10184/19560 | loss 3.410231 (-0.14z)| norm 0.2668 (-0.69z)| lr 2.97e-04 | 4159.60 ms | 32.5% bf16 MFU | 125997 tok/s step 10185/19560 | loss 3.401190 (-0.38z)| norm 0.2635 (-0.81z)| lr 2.97e-04 | 4161.79 ms | 32.4% bf16 MFU | 125996 tok/s step 10186/19560 | loss 3.424955 (+0.27z)| norm 0.2867 (+0.04z)| lr 2.97e-04 | 4163.61 ms | 32.4% bf16 MFU | 125992 tok/s step 10187/19560 | loss 3.387182 (-0.76z)| norm 0.2729 (-0.46z)| lr 2.97e-04 | 4159.30 ms | 32.5% bf16 MFU | 125995 tok/s step 10188/19560 | loss 3.454577 (+1.07z)| norm 0.2967 (+0.40z)| lr 2.97e-04 | 4169.64 ms | 32.4% bf16 MFU | 125983 tok/s step 10189/19560 | loss 3.436554 (+0.57z)| norm 0.2959 (+0.37z)| lr 2.97e-04 | 4159.25 ms | 32.5% bf16 MFU | 125986 tok/s step 10190/19560 | loss 3.423451 (+0.21z)| norm 0.2850 (-0.03z)| lr 2.97e-04 | 4167.03 ms | 32.4% bf16 MFU | 
125978 tok/s step 10191/19560 | loss 3.454498 (+1.05z)| norm 0.2766 (-0.33z)| lr 2.97e-04 | 4159.77 ms | 32.5% bf16 MFU | 125981 tok/s step 10192/19560 | loss 3.341849 (-2.01z)| norm 0.2973 (+0.42z)| lr 2.97e-04 | 4155.37 ms | 32.5% bf16 MFU | 125990 tok/s step 10193/19560 | loss 3.393024 (-0.61z)| norm 0.2686 (-0.62z)| lr 2.97e-04 | 4164.33 ms | 32.4% bf16 MFU | 125986 tok/s step 10194/19560 | loss 3.438383 (+0.60z)| norm 0.2782 (-0.28z)| lr 2.97e-04 | 4150.93 ms | 32.5% bf16 MFU | 126002 tok/s step 10195/19560 | loss 3.444611 (+0.81z)| norm 0.2856 (-0.00z)| lr 2.97e-04 | 4157.86 ms | 32.5% bf16 MFU | 126006 tok/s step 10196/19560 | loss 3.432632 (+0.47z)| norm 0.2525 (-1.21z)| lr 2.97e-04 | 4160.88 ms | 32.4% bf16 MFU | 126006 tok/s step 10197/19560 | loss 3.412114 (-0.09z)| norm 0.2827 (-0.10z)| lr 2.97e-04 | 4159.02 ms | 32.5% bf16 MFU | 126009 tok/s step 10198/19560 | loss 3.436006 (+0.57z)| norm 0.2605 (-0.91z)| lr 2.97e-04 | 4160.27 ms | 32.5% bf16 MFU | 126010 tok/s step 10199/19560 | loss 3.363120 (-1.44z)| norm 0.2842 (-0.05z)| lr 2.97e-04 | 6277.93 ms | 21.5% bf16 MFU | 123885 tok/s step 10200/19560 | loss 3.410874 (-0.11z)| norm 0.2590 (-0.96z)| lr 2.97e-04 | 4159.55 ms | 32.5% bf16 MFU | 123993 tok/s step 10201/19560 | loss 3.504292 (+2.41z)| norm 0.2929 (+0.27z)| lr 2.97e-04 | 4146.50 ms | 32.6% bf16 MFU | 124115 tok/s step 10202/19560 | loss 3.460679 (+1.20z)| norm 0.2663 (-0.69z)| lr 2.96e-04 | 4250.52 ms | 31.8% bf16 MFU | 124077 tok/s step 10203/19560 | loss 3.442197 (+0.68z)| norm 0.3167 (+1.12z)| lr 2.96e-04 | 4159.41 ms | 32.5% bf16 MFU | 124175 tok/s step 10204/19560 | loss 3.429189 (+0.32z)| norm 0.2745 (-0.41z)| lr 2.96e-04 | 4160.89 ms | 32.4% bf16 MFU | 124267 tok/s step 10205/19560 | loss 3.403151 (-0.40z)| norm 0.2775 (-0.29z)| lr 2.96e-04 | 4166.91 ms | 32.4% bf16 MFU | 124345 tok/s step 10206/19560 | loss 3.515181 (+2.61z)| norm 0.2746 (-0.39z)| lr 2.96e-04 | 4149.60 ms | 32.5% bf16 MFU | 124445 tok/s step 10207/19560 | loss 3.393206 
(-0.67z)| norm 0.2610 (-0.87z)| lr 2.96e-04 | 4156.14 ms | 32.5% bf16 MFU | 124530 tok/s step 10208/19560 | loss 3.445531 (+0.73z)| norm 0.3077 (+0.82z)| lr 2.96e-04 | 4157.12 ms | 32.5% bf16 MFU | 124609 tok/s step 10209/19560 | loss 3.387980 (-0.84z)| norm 0.2767 (-0.30z)| lr 2.96e-04 | 4159.68 ms | 32.5% bf16 MFU | 124681 tok/s step 10210/19560 | loss 3.369232 (-1.33z)| norm 0.3179 (+1.18z)| lr 2.96e-04 | 4156.11 ms | 32.5% bf16 MFU | 124754 tok/s step 10211/19560 | loss 3.415315 (-0.08z)| norm 0.2870 (+0.07z)| lr 2.96e-04 | 4158.39 ms | 32.5% bf16 MFU | 124820 tok/s step 10212/19560 | loss 3.434000 (+0.41z)| norm 0.2717 (-0.47z)| lr 2.96e-04 | 4159.88 ms | 32.5% bf16 MFU | 124881 tok/s step 10213/19560 | loss 3.351631 (-1.82z)| norm 0.2747 (-0.36z)| lr 2.96e-04 | 4183.82 ms | 32.3% bf16 MFU | 124903 tok/s step 10214/19560 | loss 3.400308 (-0.48z)| norm 0.2718 (-0.46z)| lr 2.96e-04 | 4209.20 ms | 32.1% bf16 MFU | 124885 tok/s step 10215/19560 | loss 3.411160 (-0.18z)| norm 0.2734 (-0.40z)| lr 2.96e-04 | 4174.02 ms | 32.3% bf16 MFU | 124922 tok/s step 10216/19560 | loss 3.449684 (+0.87z)| norm 0.2895 (+0.18z)| lr 2.96e-04 | 4168.77 ms | 32.4% bf16 MFU | 124964 tok/s step 10217/19560 | loss 3.431271 (+0.36z)| norm 0.2824 (-0.07z)| lr 2.96e-04 | 4168.93 ms | 32.4% bf16 MFU | 125004 tok/s step 10218/19560 | loss 3.434650 (+0.45z)| norm 0.2905 (+0.21z)| lr 2.96e-04 | 4185.55 ms | 32.3% bf16 MFU | 125017 tok/s step 10219/19560 | loss 3.361799 (-1.52z)| norm 0.2718 (-0.45z)| lr 2.96e-04 | 4184.98 ms | 32.3% bf16 MFU | 125030 tok/s step 10220/19560 | loss 3.383705 (-0.92z)| norm 0.3011 (+0.60z)| lr 2.96e-04 | 4160.88 ms | 32.4% bf16 MFU | 125078 tok/s step 10221/19560 | loss 3.416639 (-0.03z)| norm 0.2834 (-0.04z)| lr 2.96e-04 | 4158.77 ms | 32.5% bf16 MFU | 125128 tok/s step 10222/19560 | loss 3.481229 (+1.70z)| norm 0.3682 (+2.95z)| lr 2.95e-04 | 4165.49 ms | 32.4% bf16 MFU | 125165 tok/s step 10223/19560 | loss 3.395431 (-0.60z)| norm 0.2917 (+0.25z)| lr 2.95e-04 | 
4174.99 ms | 32.3% bf16 MFU | 125185 tok/s step 10224/19560 | loss 3.471284 (+1.41z)| norm 0.3159 (+1.10z)| lr 2.95e-04 | 4162.97 ms | 32.4% bf16 MFU | 125223 tok/s step 10225/19560 | loss 3.373116 (-1.19z)| norm 0.2703 (-0.51z)| lr 2.95e-04 | 4164.99 ms | 32.4% bf16 MFU | 125256 tok/s step 10226/19560 | loss 3.364551 (-1.39z)| norm 0.2951 (+0.36z)| lr 2.95e-04 | 4161.61 ms | 32.4% bf16 MFU | 125292 tok/s step 10227/19560 | loss 3.429197 (+0.33z)| norm 0.3027 (+0.62z)| lr 2.95e-04 | 4169.92 ms | 32.4% bf16 MFU | 125314 tok/s step 10228/19560 | loss 3.365733 (-1.35z)| norm 0.2736 (-0.40z)| lr 2.95e-04 | 4174.32 ms | 32.3% bf16 MFU | 125328 tok/s step 10229/19560 | loss 3.444763 (+0.74z)| norm 0.3021 (+0.62z)| lr 2.95e-04 | 4171.17 ms | 32.4% bf16 MFU | 125347 tok/s step 10230/19560 | loss 3.436977 (+0.53z)| norm 0.2639 (-0.74z)| lr 2.95e-04 | 4153.81 ms | 32.5% bf16 MFU | 125390 tok/s step 10231/19560 | loss 3.361325 (-1.46z)| norm 0.3107 (+0.92z)| lr 2.95e-04 | 4157.27 ms | 32.5% bf16 MFU | 125426 tok/s step 10232/19560 | loss 3.403181 (-0.35z)| norm 0.2717 (-0.47z)| lr 2.95e-04 | 4159.57 ms | 32.5% bf16 MFU | 125457 tok/s step 10233/19560 | loss 3.387132 (-0.79z)| norm 0.3189 (+1.20z)| lr 2.95e-04 | 4148.95 ms | 32.5% bf16 MFU | 125503 tok/s step 10234/19560 | loss 3.503245 (+2.23z)| norm 0.2870 (+0.07z)| lr 2.95e-04 | 4162.10 ms | 32.4% bf16 MFU | 125526 tok/s step 10235/19560 | loss 3.386047 (-0.82z)| norm 0.2973 (+0.43z)| lr 2.95e-04 | 4159.16 ms | 32.5% bf16 MFU | 125552 tok/s step 10236/19560 | loss 3.406955 (-0.27z)| norm 0.3055 (+0.71z)| lr 2.95e-04 | 4169.32 ms | 32.4% bf16 MFU | 125562 tok/s step 10237/19560 | loss 3.336775 (-2.04z)| norm 0.2692 (-0.57z)| lr 2.95e-04 | 4159.99 ms | 32.5% bf16 MFU | 125586 tok/s step 10238/19560 | loss 3.431963 (+0.39z)| norm 0.3046 (+0.67z)| lr 2.95e-04 | 4164.23 ms | 32.4% bf16 MFU | 125602 tok/s step 10239/19560 | loss 3.421093 (+0.12z)| norm 0.2827 (-0.10z)| lr 2.95e-04 | 4160.05 ms | 32.5% bf16 MFU | 125623 tok/s step 
10240/19560 | loss 3.441719 (+0.64z)| norm 0.2919 (+0.23z)| lr 2.95e-04 | 4169.29 ms | 32.4% bf16 MFU | 125629 tok/s step 10241/19560 | loss 3.415106 (-0.05z)| norm 0.2933 (+0.28z)| lr 2.95e-04 | 4154.50 ms | 32.5% bf16 MFU | 125658 tok/s step 10242/19560 | loss 3.389820 (-0.70z)| norm 0.2809 (-0.16z)| lr 2.94e-04 | 4158.18 ms | 32.5% bf16 MFU | 125679 tok/s step 10243/19560 | loss 3.352374 (-1.70z)| norm 0.2774 (-0.28z)| lr 2.94e-04 | 4175.48 ms | 32.3% bf16 MFU | 125673 tok/s step 10244/19560 | loss 3.383700 (-0.84z)| norm 0.2719 (-0.47z)| lr 2.94e-04 | 4163.12 ms | 32.4% bf16 MFU | 125686 tok/s step 10245/19560 | loss 3.390620 (-0.66z)| norm 0.2779 (-0.26z)| lr 2.94e-04 | 4167.30 ms | 32.4% bf16 MFU | 125693 tok/s step 10246/19560 | loss 3.392334 (-0.61z)| norm 0.2654 (-0.70z)| lr 2.94e-04 | 4169.54 ms | 32.4% bf16 MFU | 125695 tok/s step 10247/19560 | loss 3.314458 (-2.63z)| norm 0.2700 (-0.53z)| lr 2.94e-04 | 4168.59 ms | 32.4% bf16 MFU | 125699 tok/s step 10248/19560 | loss 3.343428 (-1.83z)| norm 0.2481 (-1.30z)| lr 2.94e-04 | 4159.66 ms | 32.5% bf16 MFU | 125716 tok/s step 10249/19560 | loss 3.311267 (-2.58z)| norm 0.2742 (-0.38z)| lr 2.94e-04 | 4165.06 ms | 32.4% bf16 MFU | 125724 tok/s step 10250/19560 | loss 3.423334 (+0.26z)| norm 0.2469 (-1.33z)| lr 2.94e-04 | 4172.98 ms | 32.4% bf16 MFU | 125720 tok/s val loss 3.379095 evaluating HellaSwag: 0/1256 evaluating HellaSwag: 10/1256 evaluating HellaSwag: 20/1256 evaluating HellaSwag: 30/1256 evaluating HellaSwag: 40/1256 evaluating HellaSwag: 50/1256 evaluating HellaSwag: 60/1256 evaluating HellaSwag: 70/1256 evaluating HellaSwag: 80/1256 evaluating HellaSwag: 90/1256 evaluating HellaSwag: 100/1256 evaluating HellaSwag: 110/1256 evaluating HellaSwag: 120/1256 evaluating HellaSwag: 130/1256 evaluating HellaSwag: 140/1256 evaluating HellaSwag: 150/1256 evaluating HellaSwag: 160/1256 evaluating HellaSwag: 170/1256 evaluating HellaSwag: 180/1256 evaluating HellaSwag: 190/1256 evaluating HellaSwag: 200/1256 
evaluating HellaSwag: 210/1256 evaluating HellaSwag: 220/1256 evaluating HellaSwag: 230/1256 evaluating HellaSwag: 240/1256 evaluating HellaSwag: 250/1256 evaluating HellaSwag: 260/1256 evaluating HellaSwag: 270/1256 evaluating HellaSwag: 280/1256 evaluating HellaSwag: 290/1256 evaluating HellaSwag: 300/1256 evaluating HellaSwag: 310/1256 evaluating HellaSwag: 320/1256 evaluating HellaSwag: 330/1256 evaluating HellaSwag: 340/1256 evaluating HellaSwag: 350/1256 evaluating HellaSwag: 360/1256 evaluating HellaSwag: 370/1256 evaluating HellaSwag: 380/1256 evaluating HellaSwag: 390/1256 evaluating HellaSwag: 400/1256 evaluating HellaSwag: 410/1256 evaluating HellaSwag: 420/1256 evaluating HellaSwag: 430/1256 evaluating HellaSwag: 440/1256 evaluating HellaSwag: 450/1256 evaluating HellaSwag: 460/1256 evaluating HellaSwag: 470/1256 evaluating HellaSwag: 480/1256 evaluating HellaSwag: 490/1256 evaluating HellaSwag: 500/1256 evaluating HellaSwag: 510/1256 evaluating HellaSwag: 520/1256 evaluating HellaSwag: 530/1256 evaluating HellaSwag: 540/1256 evaluating HellaSwag: 550/1256 evaluating HellaSwag: 560/1256 evaluating HellaSwag: 570/1256 evaluating HellaSwag: 580/1256 evaluating HellaSwag: 590/1256 evaluating HellaSwag: 600/1256 evaluating HellaSwag: 610/1256 evaluating HellaSwag: 620/1256 evaluating HellaSwag: 630/1256 evaluating HellaSwag: 640/1256 evaluating HellaSwag: 650/1256 evaluating HellaSwag: 660/1256 evaluating HellaSwag: 670/1256 evaluating HellaSwag: 680/1256 evaluating HellaSwag: 690/1256 evaluating HellaSwag: 700/1256 evaluating HellaSwag: 710/1256 evaluating HellaSwag: 720/1256 evaluating HellaSwag: 730/1256 evaluating HellaSwag: 740/1256 evaluating HellaSwag: 750/1256 evaluating HellaSwag: 760/1256 evaluating HellaSwag: 770/1256 evaluating HellaSwag: 780/1256 evaluating HellaSwag: 790/1256 evaluating HellaSwag: 800/1256 evaluating HellaSwag: 810/1256 evaluating HellaSwag: 820/1256 evaluating HellaSwag: 830/1256 evaluating HellaSwag: 840/1256 evaluating 
HellaSwag: 850/1256 evaluating HellaSwag: 860/1256 evaluating HellaSwag: 870/1256 evaluating HellaSwag: 880/1256 evaluating HellaSwag: 890/1256 evaluating HellaSwag: 900/1256 evaluating HellaSwag: 910/1256 evaluating HellaSwag: 920/1256 evaluating HellaSwag: 930/1256 evaluating HellaSwag: 940/1256 evaluating HellaSwag: 950/1256 evaluating HellaSwag: 960/1256 evaluating HellaSwag: 970/1256 evaluating HellaSwag: 980/1256 evaluating HellaSwag: 990/1256 evaluating HellaSwag: 1000/1256 evaluating HellaSwag: 1010/1256 evaluating HellaSwag: 1020/1256 evaluating HellaSwag: 1030/1256 evaluating HellaSwag: 1040/1256 evaluating HellaSwag: 1050/1256 evaluating HellaSwag: 1060/1256 evaluating HellaSwag: 1070/1256 evaluating HellaSwag: 1080/1256 evaluating HellaSwag: 1090/1256 evaluating HellaSwag: 1100/1256 evaluating HellaSwag: 1110/1256 evaluating HellaSwag: 1120/1256 evaluating HellaSwag: 1130/1256 evaluating HellaSwag: 1140/1256 evaluating HellaSwag: 1150/1256 evaluating HellaSwag: 1160/1256 evaluating HellaSwag: 1170/1256 evaluating HellaSwag: 1180/1256 evaluating HellaSwag: 1190/1256 evaluating HellaSwag: 1200/1256 evaluating HellaSwag: 1210/1256 evaluating HellaSwag: 1220/1256 evaluating HellaSwag: 1230/1256 evaluating HellaSwag: 1240/1256 evaluating HellaSwag: 1250/1256 HellaSwag: 2944/10042 = 0.293169 step 10251/19560 | loss 3.443675 (+0.77z)| norm 0.2681 (-0.58z)| lr 2.94e-04 | 4175.86 ms | 32.3% bf16 MFU | 125711 tok/s step 10252/19560 | loss 3.346264 (-1.68z)| norm 0.2671 (-0.62z)| lr 2.94e-04 | 4177.46 ms | 32.3% bf16 MFU | 125701 tok/s step 10253/19560 | loss 3.339833 (-1.81z)| norm 0.2749 (-0.35z)| lr 2.94e-04 | 4169.13 ms | 32.4% bf16 MFU | 125704 tok/s step 10254/19560 | loss 3.428731 (+0.41z)| norm 0.2940 (+0.32z)| lr 2.94e-04 | 4158.05 ms | 32.5% bf16 MFU | 125723 tok/s step 10255/19560 | loss 3.422972 (+0.27z)| norm 0.2897 (+0.16z)| lr 2.94e-04 | 4176.97 ms | 32.3% bf16 MFU | 125713 tok/s step 10256/19560 | loss 3.406864 (-0.14z)| norm 0.2744 (-0.38z)| lr 
2.94e-04 | 4162.71 ms | 32.4% bf16 MFU | 125725 tok/s step 10257/19560 | loss 3.410190 (-0.05z)| norm 0.2795 (-0.21z)| lr 2.94e-04 | 4161.99 ms | 32.4% bf16 MFU | 125737 tok/s step 10258/19560 | loss 3.368221 (-1.10z)| norm 0.2732 (-0.43z)| lr 2.94e-04 | 4186.60 ms | 32.2% bf16 MFU | 125712 tok/s step 10259/19560 | loss 3.414556 (+0.07z)| norm 0.2987 (+0.48z)| lr 2.94e-04 | 4168.76 ms | 32.4% bf16 MFU | 125714 tok/s step 10260/19560 | loss 3.378042 (-0.85z)| norm 0.2884 (+0.11z)| lr 2.94e-04 | 4164.70 ms | 32.4% bf16 MFU | 125723 tok/s step 10261/19560 | loss 3.418424 (+0.17z)| norm 0.2787 (-0.24z)| lr 2.94e-04 | 4173.50 ms | 32.4% bf16 MFU | 125718 tok/s step 10262/19560 | loss 3.442809 (+0.79z)| norm 0.2812 (-0.15z)| lr 2.93e-04 | 4218.18 ms | 32.0% bf16 MFU | 125647 tok/s step 10263/19560 | loss 3.416338 (+0.12z)| norm 0.2919 (+0.23z)| lr 2.93e-04 | 4164.42 ms | 32.4% bf16 MFU | 125659 tok/s step 10264/19560 | loss 3.512716 (+2.47z)| norm 0.2797 (-0.20z)| lr 2.93e-04 | 4152.18 ms | 32.5% bf16 MFU | 125690 tok/s step 10265/19560 | loss 3.389326 (-0.58z)| norm 0.3035 (+0.64z)| lr 2.93e-04 | 4174.63 ms | 32.3% bf16 MFU | 125685 tok/s step 10266/19560 | loss 3.383743 (-0.70z)| norm 0.2951 (+0.34z)| lr 2.93e-04 | 4184.15 ms | 32.3% bf16 MFU | 125666 tok/s step 10267/19560 | loss 3.373636 (-0.94z)| norm 0.2427 (-1.51z)| lr 2.93e-04 | 4199.17 ms | 32.2% bf16 MFU | 125625 tok/s step 10268/19560 | loss 3.386148 (-0.62z)| norm 0.3003 (+0.52z)| lr 2.93e-04 | 4179.94 ms | 32.3% bf16 MFU | 125615 tok/s step 10269/19560 | loss 3.454926 (+1.06z)| norm 0.2528 (-1.15z)| lr 2.93e-04 | 4182.68 ms | 32.3% bf16 MFU | 125602 tok/s step 10270/19560 | loss 3.459977 (+1.17z)| norm 0.2816 (-0.14z)| lr 2.93e-04 | 4159.23 ms | 32.5% bf16 MFU | 125625 tok/s step 10271/19560 | loss 3.465135 (+1.28z)| norm 0.2799 (-0.20z)| lr 2.93e-04 | 4174.32 ms | 32.3% bf16 MFU | 125623 tok/s step 10272/19560 | loss 3.432018 (+0.46z)| norm 0.2645 (-0.74z)| lr 2.93e-04 | 4169.82 ms | 32.4% bf16 MFU | 125629 
tok/s step 10273/19560 | loss 3.430778 (+0.44z)| norm 0.2822 (-0.12z)| lr 2.93e-04 | 4165.35 ms | 32.4% bf16 MFU | 125641 tok/s step 10274/19560 | loss 3.433143 (+0.48z)| norm 0.2638 (-0.77z)| lr 2.93e-04 | 4177.53 ms | 32.3% bf16 MFU | 125634 tok/s step 10275/19560 | loss 3.389909 (-0.59z)| norm 0.2949 (+0.32z)| lr 2.93e-04 | 4180.64 ms | 32.3% bf16 MFU | 125623 tok/s step 10276/19560 | loss 3.369376 (-1.08z)| norm 0.2664 (-0.69z)| lr 2.93e-04 | 4168.18 ms | 32.4% bf16 MFU | 125631 tok/s step 10277/19560 | loss 3.412047 (-0.04z)| norm 0.2797 (-0.22z)| lr 2.93e-04 | 4159.35 ms | 32.5% bf16 MFU | 125652 tok/s step 10278/19560 | loss 3.432050 (+0.45z)| norm 0.2561 (-1.04z)| lr 2.93e-04 | 4170.15 ms | 32.4% bf16 MFU | 125655 tok/s step 10279/19560 | loss 3.386827 (-0.65z)| norm 0.2611 (-0.87z)| lr 2.93e-04 | 4170.66 ms | 32.4% bf16 MFU | 125658 tok/s step 10280/19560 | loss 3.410261 (-0.07z)| norm 0.2681 (-0.62z)| lr 2.93e-04 | 4171.37 ms | 32.4% bf16 MFU | 125659 tok/s step 10281/19560 | loss 3.380126 (-0.81z)| norm 0.2702 (-0.54z)| lr 2.93e-04 | 4164.56 ms | 32.4% bf16 MFU | 125671 tok/s step 10282/19560 | loss 3.363216 (-1.21z)| norm 0.2813 (-0.14z)| lr 2.92e-04 | 4167.54 ms | 32.4% bf16 MFU | 125678 tok/s step 10283/19560 | loss 3.382231 (-0.73z)| norm 0.2709 (-0.50z)| lr 2.92e-04 | 4176.45 ms | 32.3% bf16 MFU | 125670 tok/s step 10284/19560 | loss 3.400623 (-0.28z)| norm 0.2690 (-0.57z)| lr 2.92e-04 | 4168.66 ms | 32.4% bf16 MFU | 125675 tok/s step 10285/19560 | loss 3.411246 (-0.02z)| norm 0.2770 (-0.28z)| lr 2.92e-04 | 4162.24 ms | 32.4% bf16 MFU | 125690 tok/s step 10286/19560 | loss 3.346288 (-1.60z)| norm 0.2510 (-1.19z)| lr 2.92e-04 | 4170.24 ms | 32.4% bf16 MFU | 125691 tok/s step 10287/19560 | loss 3.408089 (-0.07z)| norm 0.2694 (-0.75z)| lr 2.92e-04 | 4184.07 ms | 32.3% bf16 MFU | 125672 tok/s step 10288/19560 | loss 3.382418 (-0.70z)| norm 0.2652 (-0.97z)| lr 2.92e-04 | 4162.55 ms | 32.4% bf16 MFU | 125686 tok/s step 10289/19560 | loss 3.391479 
(-0.48z)| norm 0.2682 (-0.80z)| lr 2.92e-04 | 4279.14 ms | 31.6% bf16 MFU | 125528 tok/s step 10290/19560 | loss 3.411519 (+0.04z)| norm 0.2630 (-1.07z)| lr 2.92e-04 | 4163.47 ms | 32.4% bf16 MFU | 125548 tok/s step 10291/19560 | loss 3.363418 (-1.20z)| norm 0.2741 (-0.44z)| lr 2.92e-04 | 4176.27 ms | 32.3% bf16 MFU | 125547 tok/s step 10292/19560 | loss 3.399367 (-0.28z)| norm 0.2725 (-0.51z)| lr 2.92e-04 | 4166.44 ms | 32.4% bf16 MFU | 125562 tok/s step 10293/19560 | loss 3.372923 (-0.94z)| norm 0.2732 (-0.48z)| lr 2.92e-04 | 4168.75 ms | 32.4% bf16 MFU | 125572 tok/s step 10294/19560 | loss 3.369866 (-1.01z)| norm 0.2703 (-0.63z)| lr 2.92e-04 | 4169.11 ms | 32.4% bf16 MFU | 125581 tok/s step 10295/19560 | loss 3.444566 (+0.92z)| norm 0.2778 (-0.21z)| lr 2.92e-04 | 4158.97 ms | 32.5% bf16 MFU | 125605 tok/s step 10296/19560 | loss 3.371834 (-0.95z)| norm 0.2703 (-0.62z)| lr 2.92e-04 | 4178.44 ms | 32.3% bf16 MFU | 125599 tok/s step 10297/19560 | loss 3.333257 (-1.90z)| norm 0.3162 (+1.99z)| lr 2.92e-04 | 4173.97 ms | 32.3% bf16 MFU | 125599 tok/s step 10298/19560 | loss 3.433705 (+0.67z)| norm 0.2700 (-0.63z)| lr 2.92e-04 | 4159.60 ms | 32.5% bf16 MFU | 125621 tok/s step 10299/19560 | loss 3.436696 (+0.74z)| norm 0.3139 (+1.84z)| lr 2.92e-04 | 4168.05 ms | 32.4% bf16 MFU | 125630 tok/s step 10300/19560 | loss 3.392384 (-0.40z)| norm 0.2700 (-0.62z)| lr 2.92e-04 | 4164.63 ms | 32.4% bf16 MFU | 125643 tok/s step 10301/19560 | loss 3.343852 (-1.63z)| norm 0.2801 (-0.04z)| lr 2.92e-04 | 4180.16 ms | 32.3% bf16 MFU | 125632 tok/s step 10302/19560 | loss 3.366745 (-1.03z)| norm 0.2897 (+0.52z)| lr 2.91e-04 | 4159.36 ms | 32.5% bf16 MFU | 125653 tok/s step 10303/19560 | loss 3.420397 (+0.33z)| norm 0.2776 (-0.17z)| lr 2.91e-04 | 4176.93 ms | 32.3% bf16 MFU | 125646 tok/s step 10304/19560 | loss 3.416386 (+0.23z)| norm 0.2775 (-0.17z)| lr 2.91e-04 | 4177.95 ms | 32.3% bf16 MFU | 125638 tok/s step 10305/19560 | loss 3.391331 (-0.40z)| norm 0.2828 (+0.12z)| lr 2.91e-04 | 
4162.95 ms | 32.4% bf16 MFU | 125653 tok/s step 10306/19560 | loss 3.376250 (-0.77z)| norm 0.2837 (+0.18z)| lr 2.91e-04 | 4177.50 ms | 32.3% bf16 MFU | 125646 tok/s step 10307/19560 | loss 3.416841 (+0.27z)| norm 0.2725 (-0.47z)| lr 2.91e-04 | 4176.11 ms | 32.3% bf16 MFU | 125641 tok/s step 10308/19560 | loss 3.362702 (-1.12z)| norm 0.2795 (-0.07z)| lr 2.91e-04 | 4161.09 ms | 32.4% bf16 MFU | 125659 tok/s step 10309/19560 | loss 3.354993 (-1.30z)| norm 0.2572 (-1.34z)| lr 2.91e-04 | 4157.40 ms | 32.5% bf16 MFU | 125681 tok/s step 10310/19560 | loss 3.383575 (-0.55z)| norm 0.2656 (-0.85z)| lr 2.91e-04 | 4174.69 ms | 32.3% bf16 MFU | 125676 tok/s step 10311/19560 | loss 3.448035 (+1.09z)| norm 0.2625 (-1.01z)| lr 2.91e-04 | 4212.51 ms | 32.1% bf16 MFU | 125616 tok/s step 10312/19560 | loss 3.402301 (-0.08z)| norm 0.3006 (+1.15z)| lr 2.91e-04 | 4162.53 ms | 32.4% bf16 MFU | 125633 tok/s step 10313/19560 | loss 3.351420 (-1.36z)| norm 0.2611 (-1.10z)| lr 2.91e-04 | 4192.52 ms | 32.2% bf16 MFU | 125604 tok/s step 10314/19560 | loss 3.426207 (+0.54z)| norm 0.2796 (-0.04z)| lr 2.91e-04 | 4174.13 ms | 32.3% bf16 MFU | 125604 tok/s step 10315/19560 | loss 3.343115 (-1.55z)| norm 0.2761 (-0.25z)| lr 2.91e-04 | 4175.03 ms | 32.3% bf16 MFU | 125602 tok/s step 10316/19560 | loss 3.408829 (+0.11z)| norm 0.2768 (-0.19z)| lr 2.91e-04 | 4182.58 ms | 32.3% bf16 MFU | 125590 tok/s step 10317/19560 | loss 3.386384 (-0.45z)| norm 0.3831 (+5.20z)| lr 2.91e-04 | 4163.62 ms | 32.4% bf16 MFU | 125606 tok/s step 10318/19560 | loss 3.425622 (+0.55z)| norm 0.2577 (-1.16z)| lr 2.91e-04 | 4190.43 ms | 32.2% bf16 MFU | 125582 tok/s step 10319/19560 | loss 3.418122 (+0.37z)| norm 0.2540 (-1.33z)| lr 2.91e-04 | 4153.18 ms | 32.5% bf16 MFU | 125615 tok/s step 10320/19560 | loss 3.472512 (+1.73z)| norm 0.2677 (-0.63z)| lr 2.91e-04 | 4164.86 ms | 32.4% bf16 MFU | 125628 tok/s step 10321/19560 | loss 3.456446 (+1.30z)| norm 0.2961 (+0.78z)| lr 2.91e-04 | 4180.30 ms | 32.3% bf16 MFU | 125618 tok/s step 
10322/19560 | loss 3.328454 (-1.90z)| norm 0.2624 (-0.90z)| lr 2.90e-04 | 4154.79 ms | 32.5% bf16 MFU | 125646 tok/s step 10323/19560 | loss 3.326711 (-1.90z)| norm 0.2573 (-1.14z)| lr 2.90e-04 | 4176.62 ms | 32.3% bf16 MFU | 125640 tok/s step 10324/19560 | loss 3.411437 (+0.20z)| norm 0.2789 (-0.07z)| lr 2.90e-04 | 4153.39 ms | 32.5% bf16 MFU | 125670 tok/s step 10325/19560 | loss 3.420959 (+0.44z)| norm 0.2737 (-0.33z)| lr 2.90e-04 | 4157.99 ms | 32.5% bf16 MFU | 125691 tok/s step 10326/19560 | loss 3.424920 (+0.54z)| norm 0.2751 (-0.27z)| lr 2.90e-04 | 4165.90 ms | 32.4% bf16 MFU | 125699 tok/s step 10327/19560 | loss 3.414829 (+0.28z)| norm 0.2795 (-0.04z)| lr 2.90e-04 | 4240.76 ms | 31.8% bf16 MFU | 125596 tok/s step 10328/19560 | loss 3.374286 (-0.72z)| norm 0.2703 (-0.51z)| lr 2.90e-04 | 4159.12 ms | 32.5% bf16 MFU | 125619 tok/s step 10329/19560 | loss 3.537787 (+3.28z)| norm 0.2810 (+0.03z)| lr 2.90e-04 | 4174.02 ms | 32.3% bf16 MFU | 125618 tok/s step 10330/19560 | loss 3.364686 (-0.94z)| norm 0.2723 (-0.41z)| lr 2.90e-04 | 4156.87 ms | 32.5% bf16 MFU | 125643 tok/s step 10331/19560 | loss 3.392827 (-0.24z)| norm 0.2742 (-0.30z)| lr 2.90e-04 | 4160.46 ms | 32.5% bf16 MFU | 125662 tok/s step 10332/19560 | loss 3.425931 (+0.58z)| norm 0.3020 (+1.11z)| lr 2.90e-04 | 5994.68 ms | 22.5% bf16 MFU | 123752 tok/s step 10333/19560 | loss 3.445202 (+1.04z)| norm 0.3100 (+1.50z)| lr 2.90e-04 | 4146.48 ms | 32.6% bf16 MFU | 123886 tok/s step 10334/19560 | loss 3.392970 (-0.22z)| norm 0.3146 (+1.69z)| lr 2.90e-04 | 4153.05 ms | 32.5% bf16 MFU | 124004 tok/s step 10335/19560 | loss 3.412172 (+0.26z)| norm 0.3059 (+1.24z)| lr 2.90e-04 | 4159.50 ms | 32.5% bf16 MFU | 124106 tok/s step 10336/19560 | loss 3.363175 (-0.96z)| norm 0.2932 (+0.61z)| lr 2.90e-04 | 4181.63 ms | 32.3% bf16 MFU | 124170 tok/s step 10337/19560 | loss 3.422479 (+0.53z)| norm 0.2747 (-0.32z)| lr 2.90e-04 | 4199.73 ms | 32.1% bf16 MFU | 124203 tok/s step 10338/19560 | loss 3.364386 (-0.94z)| norm 
0.2756 (-0.26z)| lr 2.90e-04 | 4153.93 ms | 32.5% bf16 MFU | 124304 tok/s step 10339/19560 | loss 3.386211 (-0.38z)| norm 0.2971 (+0.83z)| lr 2.90e-04 | 4163.55 ms | 32.4% bf16 MFU | 124385 tok/s step 10340/19560 | loss 3.436619 (+0.89z)| norm 0.2900 (+0.46z)| lr 2.90e-04 | 4170.14 ms | 32.4% bf16 MFU | 124452 tok/s step 10341/19560 | loss 3.385152 (-0.42z)| norm 0.2994 (+0.93z)| lr 2.90e-04 | 4174.29 ms | 32.3% bf16 MFU | 124509 tok/s step 10342/19560 | loss 3.413940 (+0.31z)| norm 0.2763 (-0.25z)| lr 2.89e-04 | 4161.73 ms | 32.4% bf16 MFU | 124583 tok/s step 10343/19560 | loss 3.389650 (-0.30z)| norm 0.2865 (+0.26z)| lr 2.89e-04 | 4156.01 ms | 32.5% bf16 MFU | 124661 tok/s step 10344/19560 | loss 3.384305 (-0.43z)| norm 0.2839 (+0.14z)| lr 2.89e-04 | 4162.57 ms | 32.4% bf16 MFU | 124726 tok/s step 10345/19560 | loss 3.475981 (+1.88z)| norm 0.3169 (+1.78z)| lr 2.89e-04 | 4166.95 ms | 32.4% bf16 MFU | 124780 tok/s step 10346/19560 | loss 3.454818 (+1.34z)| norm 0.2990 (+0.87z)| lr 2.89e-04 | 4176.72 ms | 32.3% bf16 MFU | 124818 tok/s step 10347/19560 | loss 3.399094 (-0.07z)| norm 0.2791 (-0.13z)| lr 2.89e-04 | 4174.66 ms | 32.3% bf16 MFU | 124856 tok/s step 10348/19560 | loss 3.398117 (-0.10z)| norm 0.3034 (+1.09z)| lr 2.89e-04 | 4168.06 ms | 32.4% bf16 MFU | 124903 tok/s step 10349/19560 | loss 3.378458 (-0.58z)| norm 0.2993 (+0.87z)| lr 2.89e-04 | 4163.21 ms | 32.4% bf16 MFU | 124954 tok/s step 10350/19560 | loss 3.395194 (-0.15z)| norm 0.2995 (+0.99z)| lr 2.89e-04 | 4167.73 ms | 32.4% bf16 MFU | 124996 tok/s step 10351/19560 | loss 3.396302 (-0.12z)| norm 0.2801 (-0.05z)| lr 2.89e-04 | 4178.26 ms | 32.3% bf16 MFU | 125021 tok/s step 10352/19560 | loss 3.373234 (-0.70z)| norm 0.2746 (-0.34z)| lr 2.89e-04 | 4164.84 ms | 32.4% bf16 MFU | 125064 tok/s step 10353/19560 | loss 3.420634 (+0.52z)| norm 0.2651 (-0.85z)| lr 2.89e-04 | 4158.26 ms | 32.5% bf16 MFU | 125115 tok/s step 10354/19560 | loss 3.339522 (-1.57z)| norm 0.2824 (+0.09z)| lr 2.89e-04 | 4177.80 ms | 
32.3% bf16 MFU | 125134 tok/s step 10355/19560 | loss 3.402542 (+0.06z)| norm 0.2614 (-1.04z)| lr 2.89e-04 | 4157.65 ms | 32.5% bf16 MFU | 125182 tok/s step 10356/19560 | loss 3.413070 (+0.32z)| norm 0.2866 (+0.34z)| lr 2.89e-04 | 4159.00 ms | 32.5% bf16 MFU | 125226 tok/s step 10357/19560 | loss 3.434423 (+0.88z)| norm 0.2605 (-1.08z)| lr 2.89e-04 | 4172.38 ms | 32.4% bf16 MFU | 125248 tok/s step 10358/19560 | loss 3.352648 (-1.22z)| norm 0.3123 (+1.73z)| lr 2.89e-04 | 4169.10 ms | 32.4% bf16 MFU | 125273 tok/s step 10359/19560 | loss 3.374688 (-0.65z)| norm 0.2726 (-0.42z)| lr 2.89e-04 | 4162.85 ms | 32.4% bf16 MFU | 125307 tok/s step 10360/19560 | loss 3.353747 (-1.18z)| norm 0.2886 (+0.45z)| lr 2.89e-04 | 4171.73 ms | 32.4% bf16 MFU | 125325 tok/s step 10361/19560 | loss 3.398074 (-0.04z)| norm 0.2803 (+0.01z)| lr 2.89e-04 | 4157.30 ms | 32.5% bf16 MFU | 125364 tok/s step 10362/19560 | loss 3.388470 (-0.27z)| norm 0.2573 (-1.25z)| lr 2.88e-04 | 4160.89 ms | 32.4% bf16 MFU | 125396 tok/s step 10363/19560 | loss 3.387172 (-0.31z)| norm 0.2877 (+0.44z)| lr 2.88e-04 | 4205.00 ms | 32.1% bf16 MFU | 125361 tok/s step 10364/19560 | loss 3.370694 (-0.74z)| norm 0.2603 (-1.07z)| lr 2.88e-04 | 4172.05 ms | 32.4% bf16 MFU | 125376 tok/s step 10365/19560 | loss 3.474744 (+1.98z)| norm 0.2970 (+0.97z)| lr 2.88e-04 | 4171.16 ms | 32.4% bf16 MFU | 125392 tok/s step 10366/19560 | loss 3.364646 (-0.90z)| norm 0.3065 (+1.50z)| lr 2.88e-04 | 4166.97 ms | 32.4% bf16 MFU | 125413 tok/s step 10367/19560 | loss 3.420542 (+0.57z)| norm 0.2917 (+0.67z)| lr 2.88e-04 | 4166.79 ms | 32.4% bf16 MFU | 125434 tok/s step 10368/19560 | loss 3.474263 (+1.96z)| norm 0.2961 (+0.91z)| lr 2.88e-04 | 4175.99 ms | 32.3% bf16 MFU | 125440 tok/s step 10369/19560 | loss 3.450221 (+1.31z)| norm 0.3078 (+1.55z)| lr 2.88e-04 | 4167.27 ms | 32.4% bf16 MFU | 125458 tok/s step 10370/19560 | loss 3.388002 (-0.30z)| norm 0.2933 (+0.74z)| lr 2.88e-04 | 4170.40 ms | 32.4% bf16 MFU | 125471 tok/s step 10371/19560 
| loss 3.411767 (+0.31z)| norm 0.2872 (+0.40z)| lr 2.88e-04 | 4248.46 ms | 31.8% bf16 MFU | 125368 tok/s step 10372/19560 | loss 3.445246 (+1.16z)| norm 0.3026 (+1.23z)| lr 2.88e-04 | 4177.03 ms | 32.3% bf16 MFU | 125375 tok/s step 10373/19560 | loss 3.387627 (-0.33z)| norm 0.2680 (-0.67z)| lr 2.88e-04 | 4188.20 ms | 32.2% bf16 MFU | 125366 tok/s step 10374/19560 | loss 3.401932 (+0.04z)| norm 0.3080 (+1.50z)| lr 2.88e-04 | 4154.27 ms | 32.5% bf16 MFU | 125408 tok/s step 10375/19560 | loss 3.438493 (+0.98z)| norm 0.2895 (+0.48z)| lr 2.88e-04 | 4174.98 ms | 32.3% bf16 MFU | 125416 tok/s step 10376/19560 | loss 3.327888 (-1.93z)| norm 0.3025 (+1.17z)| lr 2.88e-04 | 4172.18 ms | 32.4% bf16 MFU | 125429 tok/s step 10377/19560 | loss 3.374092 (-0.74z)| norm 0.3109 (+1.60z)| lr 2.88e-04 | 4171.66 ms | 32.4% bf16 MFU | 125441 tok/s step 10378/19560 | loss 3.418525 (+0.45z)| norm 0.3041 (+1.22z)| lr 2.88e-04 | 4162.83 ms | 32.4% bf16 MFU | 125466 tok/s step 10379/19560 | loss 3.448721 (+1.25z)| norm 0.3310 (+2.60z)| lr 2.88e-04 | 4153.84 ms | 32.5% bf16 MFU | 125504 tok/s step 10380/19560 | loss 3.371687 (-0.82z)| norm 0.3067 (+1.28z)| lr 2.88e-04 | 4166.98 ms | 32.4% bf16 MFU | 125520 tok/s step 10381/19560 | loss 3.414465 (+0.32z)| norm 0.3178 (+1.83z)| lr 2.88e-04 | 4166.80 ms | 32.4% bf16 MFU | 125535 tok/s step 10382/19560 | loss 3.398900 (-0.10z)| norm 0.2796 (-0.17z)| lr 2.87e-04 | 4179.51 ms | 32.3% bf16 MFU | 125530 tok/s step 10383/19560 | loss 3.353486 (-1.31z)| norm 0.2889 (+0.32z)| lr 2.87e-04 | 4159.53 ms | 32.5% bf16 MFU | 125556 tok/s step 10384/19560 | loss 3.376996 (-0.67z)| norm 0.2927 (+0.51z)| lr 2.87e-04 | 4173.81 ms | 32.3% bf16 MFU | 125559 tok/s step 10385/19560 | loss 3.450944 (+1.32z)| norm 0.2709 (-0.63z)| lr 2.87e-04 | 4153.58 ms | 32.5% bf16 MFU | 125592 tok/s step 10386/19560 | loss 3.429110 (+0.72z)| norm 0.2918 (+0.46z)| lr 2.87e-04 | 4161.66 ms | 32.4% bf16 MFU | 125612 tok/s step 10387/19560 | loss 3.418258 (+0.43z)| norm 0.2732 (-0.51z)| 
lr 2.87e-04 | 4169.25 ms | 32.4% bf16 MFU | 125619 tok/s step 10388/19560 | loss 3.406081 (+0.09z)| norm 0.2830 (+0.01z)| lr 2.87e-04 | 4160.01 ms | 32.5% bf16 MFU | 125639 tok/s step 10389/19560 | loss 3.363081 (-1.05z)| norm 0.2766 (-0.33z)| lr 2.87e-04 | 4156.45 ms | 32.5% bf16 MFU | 125664 tok/s step 10390/19560 | loss 3.378147 (-0.63z)| norm 0.3179 (+1.81z)| lr 2.87e-04 | 4163.22 ms | 32.4% bf16 MFU | 125678 tok/s step 10391/19560 | loss 3.449440 (+1.27z)| norm 0.2907 (+0.39z)| lr 2.87e-04 | 4158.39 ms | 32.5% bf16 MFU | 125698 tok/s step 10392/19560 | loss 3.389643 (-0.32z)| norm 0.3258 (+2.16z)| lr 2.87e-04 | 4166.16 ms | 32.4% bf16 MFU | 125705 tok/s step 10393/19560 | loss 3.387604 (-0.37z)| norm 0.2840 (+0.04z)| lr 2.87e-04 | 4175.75 ms | 32.3% bf16 MFU | 125698 tok/s step 10394/19560 | loss 3.459162 (+1.58z)| norm 0.2875 (+0.22z)| lr 2.87e-04 | 4170.07 ms | 32.4% bf16 MFU | 125699 tok/s step 10395/19560 | loss 3.388531 (-0.36z)| norm 0.2904 (+0.35z)| lr 2.87e-04 | 4197.37 ms | 32.2% bf16 MFU | 125659 tok/s step 10396/19560 | loss 3.394459 (-0.20z)| norm 0.2715 (-0.62z)| lr 2.87e-04 | 4360.50 ms | 31.0% bf16 MFU | 125388 tok/s step 10397/19560 | loss 3.390675 (-0.29z)| norm 0.2694 (-0.74z)| lr 2.87e-04 | 4168.72 ms | 32.4% bf16 MFU | 125407 tok/s step 10398/19560 | loss 3.389576 (-0.31z)| norm 0.2738 (-0.51z)| lr 2.87e-04 | 4165.39 ms | 32.4% bf16 MFU | 125430 tok/s step 10399/19560 | loss 3.356348 (-1.23z)| norm 0.2918 (+0.43z)| lr 2.87e-04 | 4166.69 ms | 32.4% bf16 MFU | 125450 tok/s step 10400/19560 | loss 3.374913 (-0.69z)| norm 0.2854 (+0.09z)| lr 2.87e-04 | 4188.70 ms | 32.2% bf16 MFU | 125436 tok/s step 10401/19560 | loss 3.405866 (+0.19z)| norm 0.2938 (+0.53z)| lr 2.87e-04 | 4162.69 ms | 32.4% bf16 MFU | 125462 tok/s step 10402/19560 | loss 3.405735 (+0.19z)| norm 0.2717 (-0.64z)| lr 2.86e-04 | 4161.63 ms | 32.4% bf16 MFU | 125488 tok/s step 10403/19560 | loss 3.354872 (-1.24z)| norm 0.2844 (+0.04z)| lr 2.86e-04 | 4227.83 ms | 31.9% bf16 MFU | 
125414 tok/s step 10404/19560 | loss 3.407435 (+0.24z)| norm 0.2788 (-0.27z)| lr 2.86e-04 | 4226.22 ms | 31.9% bf16 MFU | 125346 tok/s step 10405/19560 | loss 3.459141 (+1.68z)| norm 0.2839 (-0.00z)| lr 2.86e-04 | 4164.27 ms | 32.4% bf16 MFU | 125374 tok/s step 10406/19560 | loss 3.448554 (+1.37z)| norm 0.3067 (+1.19z)| lr 2.86e-04 | 4214.19 ms | 32.0% bf16 MFU | 125325 tok/s step 10407/19560 | loss 3.429045 (+0.81z)| norm 0.2843 (-0.01z)| lr 2.86e-04 | 4232.15 ms | 31.9% bf16 MFU | 125253 tok/s step 10408/19560 | loss 3.359635 (-1.11z)| norm 0.2839 (-0.04z)| lr 2.86e-04 | 4396.90 ms | 30.7% bf16 MFU | 124953 tok/s step 10409/19560 | loss 3.521326 (+3.22z)| norm 0.2886 (+0.21z)| lr 2.86e-04 | 4165.48 ms | 32.4% bf16 MFU | 124998 tok/s step 10410/19560 | loss 3.339372 (-1.62z)| norm 0.2745 (-0.55z)| lr 2.86e-04 | 4166.31 ms | 32.4% bf16 MFU | 125040 tok/s step 10411/19560 | loss 3.431028 (+0.80z)| norm 0.2762 (-0.46z)| lr 2.86e-04 | 4169.90 ms | 32.4% bf16 MFU | 125075 tok/s step 10412/19560 | loss 3.419946 (+0.50z)| norm 0.2800 (-0.26z)| lr 2.86e-04 | 4213.80 ms | 32.0% bf16 MFU | 125042 tok/s step 10413/19560 | loss 3.390020 (-0.28z)| norm 0.2625 (-1.19z)| lr 2.86e-04 | 4150.96 ms | 32.5% bf16 MFU | 125105 tok/s step 10414/19560 | loss 3.406019 (+0.13z)| norm 0.2794 (-0.30z)| lr 2.86e-04 | 4186.06 ms | 32.3% bf16 MFU | 125112 tok/s step 10415/19560 | loss 3.417844 (+0.44z)| norm 0.2671 (-0.97z)| lr 2.86e-04 | 4156.33 ms | 32.5% bf16 MFU | 125164 tok/s step 10416/19560 | loss 3.486279 (+2.20z)| norm 0.2778 (-0.39z)| lr 2.86e-04 | 4205.68 ms | 32.1% bf16 MFU | 125139 tok/s step 10417/19560 | loss 3.413520 (+0.29z)| norm 0.2733 (-0.64z)| lr 2.86e-04 | 4181.06 ms | 32.3% bf16 MFU | 125152 tok/s step 10418/19560 | loss 3.362876 (-1.02z)| norm 0.3311 (+2.44z)| lr 2.86e-04 | 4152.81 ms | 32.5% bf16 MFU | 125207 tok/s step 10419/19560 | loss 3.426234 (+0.62z)| norm 0.2825 (-0.17z)| lr 2.86e-04 | 4162.90 ms | 32.4% bf16 MFU | 125243 tok/s step 10420/19560 | loss 3.392916 
(-0.25z)| norm 0.3099 (+1.28z)| lr 2.86e-04 | 4172.79 ms | 32.4% bf16 MFU | 125263 tok/s step 10421/19560 | loss 3.451983 (+1.27z)| norm 0.2937 (+0.41z)| lr 2.86e-04 | 4158.37 ms | 32.5% bf16 MFU | 125304 tok/s step 10422/19560 | loss 3.453981 (+1.30z)| norm 0.3006 (+0.76z)| lr 2.85e-04 | 4166.10 ms | 32.4% bf16 MFU | 125331 tok/s step 10423/19560 | loss 3.429153 (+0.67z)| norm 0.2662 (-1.07z)| lr 2.85e-04 | 4159.16 ms | 32.5% bf16 MFU | 125368 tok/s step 10424/19560 | loss 3.420678 (+0.44z)| norm 0.2865 (+0.01z)| lr 2.85e-04 | 4172.49 ms | 32.4% bf16 MFU | 125382 tok/s step 10425/19560 | loss 3.485898 (+2.09z)| norm 0.2763 (-0.53z)| lr 2.85e-04 | 4157.78 ms | 32.5% bf16 MFU | 125418 tok/s step 10426/19560 | loss 3.435391 (+0.78z)| norm 0.2778 (-0.45z)| lr 2.85e-04 | 4163.88 ms | 32.4% bf16 MFU | 125442 tok/s step 10427/19560 | loss 3.460135 (+1.41z)| norm 0.2894 (+0.19z)| lr 2.85e-04 | 4179.77 ms | 32.3% bf16 MFU | 125442 tok/s step 10428/19560 | loss 3.443747 (+0.98z)| norm 0.2909 (+0.26z)| lr 2.85e-04 | 4160.41 ms | 32.5% bf16 MFU | 125471 tok/s step 10429/19560 | loss 3.364800 (-1.06z)| norm 0.2759 (-0.55z)| lr 2.85e-04 | 4160.45 ms | 32.5% bf16 MFU | 125498 tok/s step 10430/19560 | loss 3.338613 (-1.72z)| norm 0.2844 (-0.09z)| lr 2.85e-04 | 4156.68 ms | 32.5% bf16 MFU | 125530 tok/s step 10431/19560 | loss 3.535039 (+3.16z)| norm 0.3146 (+1.53z)| lr 2.85e-04 | 4158.32 ms | 32.5% bf16 MFU | 125557 tok/s step 10432/19560 | loss 3.424398 (+0.44z)| norm 0.2892 (+0.15z)| lr 2.85e-04 | 4154.29 ms | 32.5% bf16 MFU | 125590 tok/s step 10433/19560 | loss 3.449676 (+1.04z)| norm 0.2909 (+0.24z)| lr 2.85e-04 | 4165.17 ms | 32.4% bf16 MFU | 125604 tok/s step 10434/19560 | loss 3.361583 (-1.11z)| norm 0.2855 (-0.05z)| lr 2.85e-04 | 4170.01 ms | 32.4% bf16 MFU | 125610 tok/s step 10435/19560 | loss 3.431927 (+0.61z)| norm 0.3140 (+1.47z)| lr 2.85e-04 | 4157.01 ms | 32.5% bf16 MFU | 125636 tok/s step 10436/19560 | loss 3.474116 (+1.61z)| norm 0.2884 (+0.08z)| lr 2.85e-04 | 
4158.92 ms | 32.5% bf16 MFU | 125657 tok/s step 10437/19560 | loss 3.387335 (-0.51z)| norm 0.2705 (-0.89z)| lr 2.85e-04 | 4170.19 ms | 32.4% bf16 MFU | 125660 tok/s step 10438/19560 | loss 3.359118 (-1.19z)| norm 0.2920 (+0.26z)| lr 2.85e-04 | 4167.63 ms | 32.4% bf16 MFU | 125667 tok/s step 10439/19560 | loss 3.379332 (-0.68z)| norm 0.2767 (-0.58z)| lr 2.85e-04 | 4157.95 ms | 32.5% bf16 MFU | 125689 tok/s step 10440/19560 | loss 3.442617 (+0.85z)| norm 0.2991 (+0.65z)| lr 2.85e-04 | 4168.94 ms | 32.4% bf16 MFU | 125692 tok/s step 10441/19560 | loss 3.511955 (+2.46z)| norm 0.2662 (-1.16z)| lr 2.85e-04 | 4163.71 ms | 32.4% bf16 MFU | 125704 tok/s step 10442/19560 | loss 3.439171 (+0.72z)| norm 0.2913 (+0.21z)| lr 2.84e-04 | 4148.81 ms | 32.5% bf16 MFU | 125737 tok/s step 10443/19560 | loss 3.427372 (+0.42z)| norm 0.3028 (+0.84z)| lr 2.84e-04 | 4167.51 ms | 32.4% bf16 MFU | 125740 tok/s step 10444/19560 | loss 3.489408 (+1.87z)| norm 0.3065 (+1.02z)| lr 2.84e-04 | 4165.68 ms | 32.4% bf16 MFU | 125746 tok/s step 10445/19560 | loss 3.487329 (+1.79z)| norm 0.2985 (+0.70z)| lr 2.84e-04 | 4167.08 ms | 32.4% bf16 MFU | 125750 tok/s step 10446/19560 | loss 3.544880 (+3.00z)| norm 0.3230 (+2.17z)| lr 2.84e-04 | 4157.10 ms | 32.5% bf16 MFU | 125768 tok/s step 10447/19560 | loss 3.415873 (+0.09z)| norm 0.2822 (-0.36z)| lr 2.84e-04 | 4158.03 ms | 32.5% bf16 MFU | 125784 tok/s step 10448/19560 | loss 3.417130 (+0.12z)| norm 0.3113 (+1.44z)| lr 2.84e-04 | 4170.53 ms | 32.4% bf16 MFU | 125781 tok/s step 10449/19560 | loss 3.395934 (-0.35z)| norm 0.2632 (-1.53z)| lr 2.84e-04 | 4157.46 ms | 32.5% bf16 MFU | 125797 tok/s step 10450/19560 | loss 3.406176 (-0.13z)| norm 0.2793 (-0.55z)| lr 2.84e-04 | 4158.71 ms | 32.5% bf16 MFU | 125811 tok/s step 10451/19560 | loss 3.468402 (+1.30z)| norm 0.2719 (-1.03z)| lr 2.84e-04 | 4160.68 ms | 32.5% bf16 MFU | 125821 tok/s step 10452/19560 | loss 3.388549 (-0.56z)| norm 0.2801 (-0.51z)| lr 2.84e-04 | 4157.24 ms | 32.5% bf16 MFU | 125835 tok/s step 
10453/19560 | loss 3.408219 (-0.10z)| norm 0.2760 (-0.77z)| lr 2.84e-04 | 4156.13 ms | 32.5% bf16 MFU | 125851 tok/s step 10454/19560 | loss 3.457568 (+1.04z)| norm 0.3210 (+2.02z)| lr 2.84e-04 | 4157.38 ms | 32.5% bf16 MFU | 125864 tok/s step 10455/19560 | loss 3.514935 (+2.31z)| norm 0.3015 (+0.79z)| lr 2.84e-04 | 4156.64 ms | 32.5% bf16 MFU | 125877 tok/s step 10456/19560 | loss 3.446257 (+0.73z)| norm 0.2687 (-1.25z)| lr 2.84e-04 | 4172.19 ms | 32.4% bf16 MFU | 125867 tok/s step 10457/19560 | loss 3.519835 (+2.43z)| norm 0.3110 (+1.36z)| lr 2.84e-04 | 4162.31 ms | 32.4% bf16 MFU | 125871 tok/s step 10458/19560 | loss 3.382110 (-0.74z)| norm 0.2857 (-0.21z)| lr 2.84e-04 | 4159.18 ms | 32.5% bf16 MFU | 125881 tok/s step 10459/19560 | loss 3.428057 (+0.31z)| norm 0.3310 (+2.51z)| lr 2.84e-04 | 4164.04 ms | 32.4% bf16 MFU | 125882 tok/s step 10460/19560 | loss 3.434823 (+0.47z)| norm 0.3097 (+1.22z)| lr 2.84e-04 | 4163.67 ms | 32.4% bf16 MFU | 125884 tok/s step 10461/19560 | loss 3.458128 (+1.00z)| norm 0.3402 (+2.96z)| lr 2.84e-04 | 4161.69 ms | 32.4% bf16 MFU | 125889 tok/s step 10462/19560 | loss 3.483352 (+1.55z)| norm 0.2910 (+0.08z)| lr 2.83e-04 | 4166.48 ms | 32.4% bf16 MFU | 125886 tok/s step 10463/19560 | loss 3.479348 (+1.44z)| norm 0.2885 (-0.06z)| lr 2.83e-04 | 4156.32 ms | 32.5% bf16 MFU | 125899 tok/s step 10464/19560 | loss 3.419832 (+0.08z)| norm 0.2731 (-0.97z)| lr 2.83e-04 | 4155.76 ms | 32.5% bf16 MFU | 125912 tok/s step 10465/19560 | loss 3.404090 (-0.28z)| norm 0.3067 (+1.01z)| lr 2.83e-04 | 4163.14 ms | 32.4% bf16 MFU | 125913 tok/s step 10466/19560 | loss 3.368723 (-1.08z)| norm 0.2731 (-0.98z)| lr 2.83e-04 | 4163.65 ms | 32.4% bf16 MFU | 125913 tok/s step 10467/19560 | loss 3.432791 (+0.37z)| norm 0.2978 (+0.49z)| lr 2.83e-04 | 4160.88 ms | 32.4% bf16 MFU | 125918 tok/s step 10468/19560 | loss 3.399017 (-0.39z)| norm 0.2666 (-1.34z)| lr 2.83e-04 | 4166.38 ms | 32.4% bf16 MFU | 125914 tok/s step 10469/19560 | loss 3.385949 (-0.69z)| norm 
0.2997 (+0.60z)| lr 2.83e-04 | 4160.96 ms | 32.4% bf16 MFU | 125918 tok/s step 10470/19560 | loss 3.404786 (-0.26z)| norm 0.2780 (-0.67z)| lr 2.83e-04 | 4159.58 ms | 32.5% bf16 MFU | 125925 tok/s step 10471/19560 | loss 3.367151 (-1.11z)| norm 0.2771 (-0.72z)| lr 2.83e-04 | 4157.33 ms | 32.5% bf16 MFU | 125934 tok/s step 10472/19560 | loss 3.383111 (-0.75z)| norm 0.2874 (-0.12z)| lr 2.83e-04 | 4158.33 ms | 32.5% bf16 MFU | 125941 tok/s step 10473/19560 | loss 3.417792 (+0.05z)| norm 0.2670 (-1.30z)| lr 2.83e-04 | 4164.21 ms | 32.4% bf16 MFU | 125939 tok/s step 10474/19560 | loss 3.394689 (-0.47z)| norm 0.2973 (+0.49z)| lr 2.83e-04 | 4160.98 ms | 32.4% bf16 MFU | 125942 tok/s step 10475/19560 | loss 3.423797 (+0.19z)| norm 0.2786 (-0.61z)| lr 2.83e-04 | 4163.90 ms | 32.4% bf16 MFU | 125941 tok/s step 10476/19560 | loss 3.419029 (+0.08z)| norm 0.2748 (-0.82z)| lr 2.83e-04 | 4154.08 ms | 32.5% bf16 MFU | 125954 tok/s step 10477/19560 | loss 3.364634 (-1.16z)| norm 0.2939 (+0.30z)| lr 2.83e-04 | 4156.87 ms | 32.5% bf16 MFU | 125963 tok/s step 10478/19560 | loss 3.434846 (+0.44z)| norm 0.2604 (-1.64z)| lr 2.83e-04 | 4161.39 ms | 32.4% bf16 MFU | 125964 tok/s step 10479/19560 | loss 3.387180 (-0.65z)| norm 0.3016 (+0.76z)| lr 2.83e-04 | 4156.22 ms | 32.5% bf16 MFU | 125973 tok/s step 10480/19560 | loss 3.367342 (-1.10z)| norm 0.2693 (-1.12z)| lr 2.83e-04 | 4163.87 ms | 32.4% bf16 MFU | 125970 tok/s step 10481/19560 | loss 3.406655 (-0.20z)| norm 0.2801 (-0.50z)| lr 2.83e-04 | 4155.94 ms | 32.5% bf16 MFU | 125980 tok/s step 10482/19560 | loss 3.373161 (-0.98z)| norm 0.2657 (-1.33z)| lr 2.82e-04 | 4155.68 ms | 32.5% bf16 MFU | 125989 tok/s step 10483/19560 | loss 3.370594 (-1.03z)| norm 0.2763 (-0.73z)| lr 2.82e-04 | 4160.90 ms | 32.4% bf16 MFU | 125989 tok/s step 10484/19560 | loss 3.403466 (-0.27z)| norm 0.2609 (-1.60z)| lr 2.82e-04 | 4162.14 ms | 32.4% bf16 MFU | 125988 tok/s step 10485/19560 | loss 3.506092 (+2.04z)| norm 0.2618 (-1.55z)| lr 2.82e-04 | 4160.56 ms | 
32.5% bf16 MFU | 125989 tok/s step 10486/19560 | loss 3.416567 (+0.00z)| norm 0.2945 (+0.36z)| lr 2.82e-04 | 4167.69 ms | 32.4% bf16 MFU | 125980 tok/s step 10487/19560 | loss 3.418929 (+0.05z)| norm 0.2547 (-1.94z)| lr 2.82e-04 | 4164.17 ms | 32.4% bf16 MFU | 125976 tok/s step 10488/19560 | loss 3.360290 (-1.30z)| norm 0.2880 (-0.01z)| lr 2.82e-04 | 4217.91 ms | 32.0% bf16 MFU | 125892 tok/s step 10489/19560 | loss 3.412331 (-0.11z)| norm 0.2992 (+0.63z)| lr 2.82e-04 | 4163.12 ms | 32.4% bf16 MFU | 125895 tok/s step 10490/19560 | loss 3.534785 (+2.61z)| norm 0.2824 (-0.36z)| lr 2.82e-04 | 4167.04 ms | 32.4% bf16 MFU | 125891 tok/s step 10491/19560 | loss 3.371698 (-1.03z)| norm 0.2750 (-0.79z)| lr 2.82e-04 | 4162.22 ms | 32.4% bf16 MFU | 125894 tok/s step 10492/19560 | loss 3.421576 (+0.07z)| norm 0.2685 (-1.17z)| lr 2.82e-04 | 4159.65 ms | 32.5% bf16 MFU | 125902 tok/s step 10493/19560 | loss 3.382742 (-0.78z)| norm 0.2915 (+0.18z)| lr 2.82e-04 | 4153.77 ms | 32.5% bf16 MFU | 125918 tok/s step 10494/19560 | loss 3.372183 (-1.02z)| norm 0.2893 (+0.06z)| lr 2.82e-04 | 4154.39 ms | 32.5% bf16 MFU | 125932 tok/s step 10495/19560 | loss 3.466920 (+1.09z)| norm 0.3162 (+1.62z)| lr 2.82e-04 | 4163.14 ms | 32.4% bf16 MFU | 125932 tok/s step 10496/19560 | loss 3.534747 (+2.55z)| norm 0.2895 (+0.06z)| lr 2.82e-04 | 4162.02 ms | 32.4% bf16 MFU | 125934 tok/s step 10497/19560 | loss 3.387667 (-0.67z)| norm 0.2943 (+0.35z)| lr 2.82e-04 | 4156.55 ms | 32.5% bf16 MFU | 125944 tok/s step 10498/19560 | loss 3.398665 (-0.43z)| norm 0.2913 (+0.17z)| lr 2.82e-04 | 4161.64 ms | 32.4% bf16 MFU | 125946 tok/s step 10499/19560 | loss 3.396254 (-0.48z)| norm 0.2869 (-0.08z)| lr 2.82e-04 | 4163.53 ms | 32.4% bf16 MFU | 125945 tok/s step 10500/19560 | loss 3.417576 (-0.00z)| norm 0.2704 (-1.03z)| lr 2.82e-04 | 4155.22 ms | 32.5% bf16 MFU | 125956 tok/s val loss 3.375687 evaluating HellaSwag: 0/1256 evaluating HellaSwag: 10/1256 evaluating HellaSwag: 20/1256 evaluating HellaSwag: 30/1256 
evaluating HellaSwag: 40/1256 evaluating HellaSwag: 50/1256 evaluating HellaSwag: 60/1256 evaluating HellaSwag: 70/1256 evaluating HellaSwag: 80/1256 evaluating HellaSwag: 90/1256 evaluating HellaSwag: 100/1256 evaluating HellaSwag: 110/1256 evaluating HellaSwag: 120/1256 evaluating HellaSwag: 130/1256 evaluating HellaSwag: 140/1256 evaluating HellaSwag: 150/1256 evaluating HellaSwag: 160/1256 evaluating HellaSwag: 170/1256 evaluating HellaSwag: 180/1256 evaluating HellaSwag: 190/1256 evaluating HellaSwag: 200/1256 evaluating HellaSwag: 210/1256 evaluating HellaSwag: 220/1256 evaluating HellaSwag: 230/1256 evaluating HellaSwag: 240/1256 evaluating HellaSwag: 250/1256 evaluating HellaSwag: 260/1256 evaluating HellaSwag: 270/1256 evaluating HellaSwag: 280/1256 evaluating HellaSwag: 290/1256 evaluating HellaSwag: 300/1256 evaluating HellaSwag: 310/1256 evaluating HellaSwag: 320/1256 evaluating HellaSwag: 330/1256 evaluating HellaSwag: 340/1256 evaluating HellaSwag: 350/1256 evaluating HellaSwag: 360/1256 evaluating HellaSwag: 370/1256 evaluating HellaSwag: 380/1256 evaluating HellaSwag: 390/1256 evaluating HellaSwag: 400/1256 evaluating HellaSwag: 410/1256 evaluating HellaSwag: 420/1256 evaluating HellaSwag: 430/1256 evaluating HellaSwag: 440/1256 evaluating HellaSwag: 450/1256 evaluating HellaSwag: 460/1256 evaluating HellaSwag: 470/1256 evaluating HellaSwag: 480/1256 evaluating HellaSwag: 490/1256 evaluating HellaSwag: 500/1256 evaluating HellaSwag: 510/1256 evaluating HellaSwag: 520/1256 evaluating HellaSwag: 530/1256 evaluating HellaSwag: 540/1256 evaluating HellaSwag: 550/1256 evaluating HellaSwag: 560/1256 evaluating HellaSwag: 570/1256 evaluating HellaSwag: 580/1256 evaluating HellaSwag: 590/1256 evaluating HellaSwag: 600/1256 evaluating HellaSwag: 610/1256 evaluating HellaSwag: 620/1256 evaluating HellaSwag: 630/1256 evaluating HellaSwag: 640/1256 evaluating HellaSwag: 650/1256 evaluating HellaSwag: 660/1256 evaluating HellaSwag: 670/1256 evaluating HellaSwag: 
680/1256 evaluating HellaSwag: 690/1256 evaluating HellaSwag: 700/1256 evaluating HellaSwag: 710/1256 evaluating HellaSwag: 720/1256 evaluating HellaSwag: 730/1256 evaluating HellaSwag: 740/1256 evaluating HellaSwag: 750/1256 evaluating HellaSwag: 760/1256 evaluating HellaSwag: 770/1256 evaluating HellaSwag: 780/1256 evaluating HellaSwag: 790/1256 evaluating HellaSwag: 800/1256 evaluating HellaSwag: 810/1256 evaluating HellaSwag: 820/1256 evaluating HellaSwag: 830/1256 evaluating HellaSwag: 840/1256 evaluating HellaSwag: 850/1256 evaluating HellaSwag: 860/1256 evaluating HellaSwag: 870/1256 evaluating HellaSwag: 880/1256 evaluating HellaSwag: 890/1256 evaluating HellaSwag: 900/1256 evaluating HellaSwag: 910/1256 evaluating HellaSwag: 920/1256 evaluating HellaSwag: 930/1256 evaluating HellaSwag: 940/1256 evaluating HellaSwag: 950/1256 evaluating HellaSwag: 960/1256 evaluating HellaSwag: 970/1256 evaluating HellaSwag: 980/1256 evaluating HellaSwag: 990/1256 evaluating HellaSwag: 1000/1256 evaluating HellaSwag: 1010/1256 evaluating HellaSwag: 1020/1256 evaluating HellaSwag: 1030/1256 evaluating HellaSwag: 1040/1256 evaluating HellaSwag: 1050/1256 evaluating HellaSwag: 1060/1256 evaluating HellaSwag: 1070/1256 evaluating HellaSwag: 1080/1256 evaluating HellaSwag: 1090/1256 evaluating HellaSwag: 1100/1256 evaluating HellaSwag: 1110/1256 evaluating HellaSwag: 1120/1256 evaluating HellaSwag: 1130/1256 evaluating HellaSwag: 1140/1256 evaluating HellaSwag: 1150/1256 evaluating HellaSwag: 1160/1256 evaluating HellaSwag: 1170/1256 evaluating HellaSwag: 1180/1256 evaluating HellaSwag: 1190/1256 evaluating HellaSwag: 1200/1256 evaluating HellaSwag: 1210/1256 evaluating HellaSwag: 1220/1256 evaluating HellaSwag: 1230/1256 evaluating HellaSwag: 1240/1256 evaluating HellaSwag: 1250/1256 HellaSwag: 2914/10042 = 0.290181 step 10501/19560 | loss 3.312536 (-2.26z)| norm 0.3093 (+1.23z)| lr 2.82e-04 | 4165.53 ms | 32.4% bf16 MFU | 125952 tok/s step 10502/19560 | loss 3.393040 (-0.52z)| 
norm 0.2772 (-0.65z)| lr 2.81e-04 | 4161.25 ms | 32.4% bf16 MFU | 125954 tok/s step 10503/19560 | loss 3.345917 (-1.51z)| norm 0.2860 (-0.12z)| lr 2.81e-04 | 4155.69 ms | 32.5% bf16 MFU | 125964 tok/s step 10504/19560 | loss 3.447953 (+0.66z)| norm 0.2717 (-0.95z)| lr 2.81e-04 | 4158.54 ms | 32.5% bf16 MFU | 125970 tok/s step 10505/19560 | loss 3.439440 (+0.47z)| norm 0.2735 (-0.83z)| lr 2.81e-04 | 4194.11 ms | 32.2% bf16 MFU | 125921 tok/s step 10506/19560 | loss 3.388906 (-0.62z)| norm 0.2719 (-0.91z)| lr 2.81e-04 | 4160.03 ms | 32.5% bf16 MFU | 125927 tok/s step 10507/19560 | loss 3.318705 (-2.09z)| norm 0.2715 (-0.93z)| lr 2.81e-04 | 4165.21 ms | 32.4% bf16 MFU | 125924 tok/s step 10508/19560 | loss 3.391678 (-0.54z)| norm 0.2724 (-0.86z)| lr 2.81e-04 | 4162.89 ms | 32.4% bf16 MFU | 125925 tok/s step 10509/19560 | loss 3.443499 (+0.57z)| norm 0.2668 (-1.19z)| lr 2.81e-04 | 4230.74 ms | 31.9% bf16 MFU | 125825 tok/s step 10510/19560 | loss 3.383325 (-0.72z)| norm 0.2730 (-0.81z)| lr 2.81e-04 | 4163.69 ms | 32.4% bf16 MFU | 125830 tok/s step 10511/19560 | loss 3.388832 (-0.61z)| norm 0.2582 (-1.67z)| lr 2.81e-04 | 4156.79 ms | 32.5% bf16 MFU | 125845 tok/s step 10512/19560 | loss 3.448728 (+0.67z)| norm 0.2834 (-0.15z)| lr 2.81e-04 | 4154.90 ms | 32.5% bf16 MFU | 125862 tok/s step 10513/19560 | loss 3.458265 (+0.87z)| norm 0.2738 (-0.73z)| lr 2.81e-04 | 4164.30 ms | 32.4% bf16 MFU | 125864 tok/s step 10514/19560 | loss 3.383549 (-0.73z)| norm 0.2905 (+0.28z)| lr 2.81e-04 | 4162.47 ms | 32.4% bf16 MFU | 125868 tok/s step 10515/19560 | loss 3.399033 (-0.39z)| norm 0.2841 (-0.11z)| lr 2.81e-04 | 4151.86 ms | 32.5% bf16 MFU | 125889 tok/s step 10516/19560 | loss 3.376579 (-0.86z)| norm 0.2710 (-0.90z)| lr 2.81e-04 | 4151.91 ms | 32.5% bf16 MFU | 125908 tok/s step 10517/19560 | loss 3.398108 (-0.41z)| norm 0.2721 (-0.83z)| lr 2.81e-04 | 4175.05 ms | 32.3% bf16 MFU | 125892 tok/s step 10518/19560 | loss 3.460958 (+0.92z)| norm 0.2726 (-0.78z)| lr 2.81e-04 | 4169.45 ms 
| 32.4% bf16 MFU | 125884 tok/s step 10519/19560 | loss 3.391496 (-0.56z)| norm 0.2657 (-1.19z)| lr 2.81e-04 | 4163.59 ms | 32.4% bf16 MFU | 125886 tok/s step 10520/19560 | loss 3.373427 (-0.94z)| norm 0.2650 (-1.23z)| lr 2.81e-04 | 4164.20 ms | 32.4% bf16 MFU | 125887 tok/s step 10521/19560 | loss 3.480611 (+1.33z)| norm 0.2758 (-0.55z)| lr 2.81e-04 | 4163.17 ms | 32.4% bf16 MFU | 125889 tok/s step 10522/19560 | loss 3.395114 (-0.48z)| norm 0.2659 (-1.15z)| lr 2.80e-04 | 4150.02 ms | 32.5% bf16 MFU | 125912 tok/s step 10523/19560 | loss 3.418265 (+0.01z)| norm 0.2584 (-1.58z)| lr 2.80e-04 | 4174.77 ms | 32.3% bf16 MFU | 125895 tok/s step 10524/19560 | loss 3.514732 (+2.02z)| norm 0.2608 (-1.42z)| lr 2.80e-04 | 4204.24 ms | 32.1% bf16 MFU | 125836 tok/s step 10525/19560 | loss 3.551271 (+2.69z)| norm 0.3011 (+1.00z)| lr 2.80e-04 | 4156.90 ms | 32.5% bf16 MFU | 125850 tok/s step 10526/19560 | loss 3.414806 (-0.11z)| norm 0.2931 (+0.51z)| lr 2.80e-04 | 4164.90 ms | 32.4% bf16 MFU | 125852 tok/s step 10527/19560 | loss 3.553078 (+2.64z)| norm 0.2760 (-0.52z)| lr 2.80e-04 | 4159.04 ms | 32.5% bf16 MFU | 125862 tok/s step 10528/19560 | loss 3.354712 (-1.34z)| norm 0.2962 (+0.70z)| lr 2.80e-04 | 4164.27 ms | 32.4% bf16 MFU | 125864 tok/s step 10529/19560 | loss 3.382289 (-0.78z)| norm 0.2682 (-0.98z)| lr 2.80e-04 | 4161.57 ms | 32.4% bf16 MFU | 125870 tok/s step 10530/19560 | loss 3.425393 (+0.08z)| norm 0.2747 (-0.59z)| lr 2.80e-04 | 4167.70 ms | 32.4% bf16 MFU | 125866 tok/s step 10531/19560 | loss 3.431318 (+0.18z)| norm 0.2876 (+0.19z)| lr 2.80e-04 | 4157.62 ms | 32.5% bf16 MFU | 125878 tok/s step 10532/19560 | loss 3.407984 (-0.29z)| norm 0.2849 (+0.02z)| lr 2.80e-04 | 4161.79 ms | 32.4% bf16 MFU | 125883 tok/s step 10533/19560 | loss 3.459543 (+0.75z)| norm 0.2706 (-0.83z)| lr 2.80e-04 | 4165.46 ms | 32.4% bf16 MFU | 125882 tok/s step 10534/19560 | loss 3.372439 (-0.99z)| norm 0.3136 (+1.75z)| lr 2.80e-04 | 4164.91 ms | 32.4% bf16 MFU | 125882 tok/s step 
10535/19560 | loss 3.369925 (-1.02z)| norm 0.2792 (-0.31z)| lr 2.80e-04 | 4163.28 ms | 32.4% bf16 MFU | 125885 tok/s step 10536/19560 | loss 3.437021 (+0.31z)| norm 0.2974 (+0.77z)| lr 2.80e-04 | 4161.12 ms | 32.4% bf16 MFU | 125890 tok/s step 10537/19560 | loss 3.425044 (+0.08z)| norm 0.2884 (+0.23z)| lr 2.80e-04 | 4169.91 ms | 32.4% bf16 MFU | 125882 tok/s step 10538/19560 | loss 3.447646 (+0.53z)| norm 0.2903 (+0.34z)| lr 2.80e-04 | 4148.28 ms | 32.5% bf16 MFU | 125908 tok/s step 10539/19560 | loss 3.387915 (-0.69z)| norm 0.2759 (-0.52z)| lr 2.80e-04 | 4160.01 ms | 32.5% bf16 MFU | 125914 tok/s step 10540/19560 | loss 3.412768 (-0.18z)| norm 0.2893 (+0.28z)| lr 2.80e-04 | 4156.87 ms | 32.5% bf16 MFU | 125924 tok/s step 10541/19560 | loss 3.411518 (-0.21z)| norm 0.2690 (-0.95z)| lr 2.80e-04 | 4149.52 ms | 32.5% bf16 MFU | 125946 tok/s step 10542/19560 | loss 3.445863 (+0.49z)| norm 0.3043 (+1.15z)| lr 2.79e-04 | 4844.29 ms | 27.9% bf16 MFU | 125060 tok/s step 10543/19560 | loss 3.353919 (-1.38z)| norm 0.3108 (+1.52z)| lr 2.79e-04 | 4162.21 ms | 32.4% bf16 MFU | 125105 tok/s step 10544/19560 | loss 3.376503 (-0.90z)| norm 0.2775 (-0.47z)| lr 2.79e-04 | 4161.61 ms | 32.4% bf16 MFU | 125149 tok/s step 10545/19560 | loss 3.493008 (+1.46z)| norm 0.2765 (-0.52z)| lr 2.79e-04 | 4154.68 ms | 32.5% bf16 MFU | 125201 tok/s step 10546/19560 | loss 3.413906 (-0.16z)| norm 0.2908 (+0.35z)| lr 2.79e-04 | 4154.38 ms | 32.5% bf16 MFU | 125251 tok/s step 10547/19560 | loss 3.366912 (-1.10z)| norm 0.3035 (+1.12z)| lr 2.79e-04 | 4172.01 ms | 32.4% bf16 MFU | 125272 tok/s step 10548/19560 | loss 3.355719 (-1.32z)| norm 0.2838 (-0.07z)| lr 2.79e-04 | 4159.20 ms | 32.5% bf16 MFU | 125311 tok/s step 10549/19560 | loss 3.412127 (-0.17z)| norm 0.2673 (-1.07z)| lr 2.79e-04 | 4167.27 ms | 32.4% bf16 MFU | 125336 tok/s step 10550/19560 | loss 3.511752 (+1.82z)| norm 0.2818 (-0.17z)| lr 2.79e-04 | 4154.33 ms | 32.5% bf16 MFU | 125379 tok/s step 10551/19560 | loss 3.454302 (+0.66z)| norm 
0.2744 (-0.64z)| lr 2.79e-04 | 4153.18 ms | 32.5% bf16 MFU | 125422 tok/s step 10552/19560 | loss 3.362181 (-1.16z)| norm 0.2684 (-0.99z)| lr 2.79e-04 | 4176.81 ms | 32.3% bf16 MFU | 125427 tok/s step 10553/19560 | loss 3.415530 (-0.09z)| norm 0.2832 (-0.09z)| lr 2.79e-04 | 4155.35 ms | 32.5% bf16 MFU | 125465 tok/s step 10554/19560 | loss 3.483948 (+1.26z)| norm 0.2674 (-1.05z)| lr 2.79e-04 | 4156.13 ms | 32.5% bf16 MFU | 125499 tok/s step 10555/19560 | loss 3.465757 (+0.90z)| norm 0.3032 (+1.14z)| lr 2.79e-04 | 4158.69 ms | 32.5% bf16 MFU | 125527 tok/s step 10556/19560 | loss 3.466446 (+0.91z)| norm 0.2907 (+0.37z)| lr 2.79e-04 | 4162.96 ms | 32.4% bf16 MFU | 125548 tok/s step 10557/19560 | loss 3.320210 (-1.97z)| norm 0.2940 (+0.56z)| lr 2.79e-04 | 5324.99 ms | 25.4% bf16 MFU | 124194 tok/s step 10558/19560 | loss 3.376236 (-0.88z)| norm 0.2922 (+0.45z)| lr 2.79e-04 | 4147.14 ms | 32.6% bf16 MFU | 124305 tok/s step 10559/19560 | loss 3.462021 (+0.84z)| norm 0.2462 (-2.30z)| lr 2.79e-04 | 4152.44 ms | 32.5% bf16 MFU | 124403 tok/s step 10560/19560 | loss 3.440849 (+0.41z)| norm 0.2876 (+0.20z)| lr 2.79e-04 | 4155.73 ms | 32.5% bf16 MFU | 124491 tok/s step 10561/19560 | loss 3.401051 (-0.38z)| norm 0.2570 (-1.62z)| lr 2.79e-04 | 4161.72 ms | 32.4% bf16 MFU | 124565 tok/s step 10562/19560 | loss 3.359529 (-1.21z)| norm 0.2817 (-0.13z)| lr 2.78e-04 | 4160.04 ms | 32.5% bf16 MFU | 124638 tok/s step 10563/19560 | loss 3.376648 (-0.86z)| norm 0.2585 (-1.50z)| lr 2.78e-04 | 4172.23 ms | 32.4% bf16 MFU | 124689 tok/s step 10564/19560 | loss 3.407816 (-0.22z)| norm 0.2782 (-0.31z)| lr 2.78e-04 | 4156.85 ms | 32.5% bf16 MFU | 124761 tok/s step 10565/19560 | loss 3.490036 (+1.41z)| norm 0.2723 (-0.67z)| lr 2.78e-04 | 4157.83 ms | 32.5% bf16 MFU | 124828 tok/s step 10566/19560 | loss 3.363559 (-1.13z)| norm 0.3138 (+1.79z)| lr 2.78e-04 | 4160.86 ms | 32.4% bf16 MFU | 124887 tok/s step 10567/19560 | loss 3.428839 (+0.18z)| norm 0.2847 (+0.06z)| lr 2.78e-04 | 4157.35 ms | 
32.5% bf16 MFU | 124948 tok/s step 10568/19560 | loss 3.406242 (-0.27z)| norm 0.2722 (-0.67z)| lr 2.78e-04 | 4159.37 ms | 32.5% bf16 MFU | 125003 tok/s step 10569/19560 | loss 3.376357 (-0.86z)| norm 0.2869 (+0.19z)| lr 2.78e-04 | 4148.91 ms | 32.5% bf16 MFU | 125071 tok/s step 10570/19560 | loss 3.461531 (+0.86z)| norm 0.2909 (+0.44z)| lr 2.78e-04 | 4154.90 ms | 32.5% bf16 MFU | 125127 tok/s step 10571/19560 | loss 3.383981 (-0.70z)| norm 0.2737 (-0.59z)| lr 2.78e-04 | 4160.95 ms | 32.4% bf16 MFU | 125171 tok/s step 10572/19560 | loss 3.397662 (-0.41z)| norm 0.2976 (+0.86z)| lr 2.78e-04 | 4160.18 ms | 32.5% bf16 MFU | 125213 tok/s step 10573/19560 | loss 3.419494 (+0.04z)| norm 0.2786 (-0.28z)| lr 2.78e-04 | 4154.66 ms | 32.5% bf16 MFU | 125262 tok/s step 10574/19560 | loss 3.434102 (+0.37z)| norm 0.2956 (+0.78z)| lr 2.78e-04 | 4160.40 ms | 32.5% bf16 MFU | 125300 tok/s step 10575/19560 | loss 3.446970 (+0.64z)| norm 0.2768 (-0.38z)| lr 2.78e-04 | 4162.70 ms | 32.4% bf16 MFU | 125333 tok/s step 10576/19560 | loss 3.331102 (-1.77z)| norm 0.2911 (+0.52z)| lr 2.78e-04 | 4157.72 ms | 32.5% bf16 MFU | 125371 tok/s step 10577/19560 | loss 3.435432 (+0.40z)| norm 0.3113 (+1.75z)| lr 2.78e-04 | 4159.81 ms | 32.5% bf16 MFU | 125404 tok/s step 10578/19560 | loss 3.432094 (+0.32z)| norm 0.2852 (+0.12z)| lr 2.78e-04 | 4148.09 ms | 32.5% bf16 MFU | 125454 tok/s step 10579/19560 | loss 3.478479 (+1.29z)| norm 0.2945 (+0.69z)| lr 2.78e-04 | 4155.24 ms | 32.5% bf16 MFU | 125490 tok/s step 10580/19560 | loss 3.404347 (-0.26z)| norm 0.3073 (+1.46z)| lr 2.78e-04 | 4166.18 ms | 32.4% bf16 MFU | 125508 tok/s step 10581/19560 | loss 3.405116 (-0.24z)| norm 0.2699 (-0.84z)| lr 2.78e-04 | 4155.29 ms | 32.5% bf16 MFU | 125541 tok/s step 10582/19560 | loss 3.447900 (+0.65z)| norm 0.2751 (-0.51z)| lr 2.77e-04 | 4151.98 ms | 32.5% bf16 MFU | 125578 tok/s step 10583/19560 | loss 3.443188 (+0.57z)| norm 0.2719 (-0.70z)| lr 2.77e-04 | 4154.47 ms | 32.5% bf16 MFU | 125609 tok/s step 10584/19560 
| loss 3.380716 (-0.74z)| norm 0.2669 (-1.01z)| lr 2.77e-04 | 4161.63 ms | 32.4% bf16 MFU | 125627 tok/s step 10585/19560 | loss 3.429069 (+0.31z)| norm 0.2797 (-0.19z)| lr 2.77e-04 | 4159.85 ms | 32.5% bf16 MFU | 125648 tok/s step 10586/19560 | loss 3.402431 (-0.27z)| norm 0.2783 (-0.27z)| lr 2.77e-04 | 4163.04 ms | 32.4% bf16 MFU | 125662 tok/s step 10587/19560 | loss 3.337551 (-1.64z)| norm 0.2610 (-1.39z)| lr 2.77e-04 | 4161.56 ms | 32.4% bf16 MFU | 125678 tok/s step 10588/19560 | loss 3.406934 (-0.15z)| norm 0.2920 (+0.67z)| lr 2.77e-04 | 4162.80 ms | 32.4% bf16 MFU | 125692 tok/s step 10589/19560 | loss 3.417625 (+0.08z)| norm 0.2693 (-0.85z)| lr 2.77e-04 | 4164.75 ms | 32.4% bf16 MFU | 125701 tok/s step 10590/19560 | loss 3.401824 (-0.25z)| norm 0.2784 (-0.21z)| lr 2.77e-04 | 4159.01 ms | 32.5% bf16 MFU | 125719 tok/s step 10591/19560 | loss 3.360970 (-1.11z)| norm 0.2555 (-1.78z)| lr 2.77e-04 | 4149.96 ms | 32.5% bf16 MFU | 125750 tok/s step 10592/19560 | loss 3.388680 (-0.51z)| norm 0.2684 (-0.88z)| lr 2.77e-04 | 4165.99 ms | 32.4% bf16 MFU | 125755 tok/s step 10593/19560 | loss 3.399818 (-0.26z)| norm 0.2640 (-1.17z)| lr 2.77e-04 | 4176.08 ms | 32.3% bf16 MFU | 125745 tok/s step 10594/19560 | loss 3.362682 (-1.07z)| norm 0.3004 (+1.36z)| lr 2.77e-04 | 4299.47 ms | 31.4% bf16 MFU | 125555 tok/s step 10595/19560 | loss 3.402511 (-0.20z)| norm 0.2600 (-1.43z)| lr 2.77e-04 | 4363.64 ms | 30.9% bf16 MFU | 125284 tok/s step 10596/19560 | loss 3.351549 (-1.29z)| norm 0.2815 (+0.05z)| lr 2.77e-04 | 4167.50 ms | 32.4% bf16 MFU | 125310 tok/s step 10597/19560 | loss 3.333106 (-1.66z)| norm 0.2751 (-0.38z)| lr 2.77e-04 | 4202.13 ms | 32.1% bf16 MFU | 125283 tok/s step 10598/19560 | loss 3.358506 (-1.11z)| norm 0.3123 (+2.18z)| lr 2.77e-04 | 4166.52 ms | 32.4% bf16 MFU | 125311 tok/s step 10599/19560 | loss 3.381797 (-0.61z)| norm 0.2703 (-0.72z)| lr 2.77e-04 | 4173.55 ms | 32.4% bf16 MFU | 125326 tok/s step 10600/19560 | loss 3.376921 (-0.72z)| norm 0.2743 (-0.44z)| 
lr 2.77e-04 | 4158.01 ms | 32.5% bf16 MFU | 125364 tok/s step 10601/19560 | loss 3.410663 (+0.00z)| norm 0.2666 (-0.97z)| lr 2.77e-04 | 4162.44 ms | 32.4% bf16 MFU | 125394 tok/s step 10602/19560 | loss 3.327700 (-1.73z)| norm 0.2911 (+0.72z)| lr 2.76e-04 | 4164.19 ms | 32.4% bf16 MFU | 125420 tok/s step 10603/19560 | loss 3.332643 (-1.60z)| norm 0.2778 (-0.19z)| lr 2.76e-04 | 4157.94 ms | 32.5% bf16 MFU | 125453 tok/s step 10604/19560 | loss 3.407477 (-0.04z)| norm 0.2974 (+1.14z)| lr 2.76e-04 | 4166.11 ms | 32.4% bf16 MFU | 125473 tok/s step 10605/19560 | loss 3.379643 (-0.62z)| norm 0.2787 (-0.13z)| lr 2.76e-04 | 4169.20 ms | 32.4% bf16 MFU | 125487 tok/s step 10606/19560 | loss 3.408981 (-0.00z)| norm 0.2821 (+0.09z)| lr 2.76e-04 | 4175.58 ms | 32.3% bf16 MFU | 125491 tok/s step 10607/19560 | loss 3.410700 (+0.03z)| norm 0.3051 (+1.68z)| lr 2.76e-04 | 4162.71 ms | 32.4% bf16 MFU | 125513 tok/s step 10608/19560 | loss 3.404284 (-0.11z)| norm 0.3056 (+1.68z)| lr 2.76e-04 | 4160.85 ms | 32.4% bf16 MFU | 125538 tok/s step 10609/19560 | loss 3.382498 (-0.56z)| norm 0.2851 (+0.27z)| lr 2.76e-04 | 4153.70 ms | 32.5% bf16 MFU | 125572 tok/s step 10610/19560 | loss 3.351493 (-1.21z)| norm 0.2927 (+0.78z)| lr 2.76e-04 | 4167.26 ms | 32.4% bf16 MFU | 125584 tok/s step 10611/19560 | loss 3.349215 (-1.25z)| norm 0.3068 (+1.72z)| lr 2.76e-04 | 4165.33 ms | 32.4% bf16 MFU | 125598 tok/s step 10612/19560 | loss 3.395494 (-0.28z)| norm 0.2898 (+0.55z)| lr 2.76e-04 | 4157.40 ms | 32.5% bf16 MFU | 125624 tok/s step 10613/19560 | loss 3.318048 (-1.87z)| norm 0.3056 (+1.60z)| lr 2.76e-04 | 4171.18 ms | 32.4% bf16 MFU | 125627 tok/s step 10614/19560 | loss 3.370065 (-0.77z)| norm 0.3054 (+1.57z)| lr 2.76e-04 | 4165.05 ms | 32.4% bf16 MFU | 125640 tok/s step 10615/19560 | loss 3.309746 (-1.98z)| norm 0.2856 (+0.22z)| lr 2.76e-04 | 4153.73 ms | 32.5% bf16 MFU | 125669 tok/s step 10616/19560 | loss 3.371928 (-0.71z)| norm 0.3110 (+1.92z)| lr 2.76e-04 | 4169.29 ms | 32.4% bf16 MFU | 
125673 tok/s step 10617/19560 | loss 3.403881 (-0.05z)| norm 0.2987 (+1.09z)| lr 2.76e-04 | 4160.53 ms | 32.5% bf16 MFU | 125690 tok/s step 10618/19560 | loss 3.388832 (-0.35z)| norm 0.2910 (+0.56z)| lr 2.76e-04 | 4167.02 ms | 32.4% bf16 MFU | 125697 tok/s step 10619/19560 | loss 3.457093 (+1.08z)| norm 0.3160 (+2.19z)| lr 2.76e-04 | 4161.97 ms | 32.4% bf16 MFU | 125710 tok/s step 10620/19560 | loss 3.435025 (+0.61z)| norm 0.3196 (+2.36z)| lr 2.76e-04 | 4160.71 ms | 32.5% bf16 MFU | 125725 tok/s step 10621/19560 | loss 3.473695 (+1.40z)| norm 0.2986 (+0.98z)| lr 2.76e-04 | 4160.37 ms | 32.5% bf16 MFU | 125740 tok/s step 10622/19560 | loss 3.418118 (+0.23z)| norm 0.2717 (-0.76z)| lr 2.75e-04 | 4160.12 ms | 32.5% bf16 MFU | 125754 tok/s step 10623/19560 | loss 3.474248 (+1.40z)| norm 1.4460 (+11.15z)| lr 2.75e-04 | 4157.22 ms | 32.5% bf16 MFU | 125772 tok/s step 10624/19560 | loss 3.395401 (-0.23z)| norm 0.4087 (+1.11z)| lr 2.75e-04 | 4160.93 ms | 32.4% bf16 MFU | 125784 tok/s step 10625/19560 | loss 3.428123 (+0.47z)| norm 0.3297 (+0.35z)| lr 2.75e-04 | 4171.60 ms | 32.4% bf16 MFU | 125779 tok/s step 10626/19560 | loss 3.408659 (+0.05z)| norm 0.3441 (+0.48z)| lr 2.75e-04 | 4162.36 ms | 32.4% bf16 MFU | 125788 tok/s step 10627/19560 | loss 3.378790 (-0.59z)| norm 0.2868 (-0.07z)| lr 2.75e-04 | 4164.32 ms | 32.4% bf16 MFU | 125793 tok/s step 10628/19560 | loss 3.354319 (-1.10z)| norm 0.3109 (+0.16z)| lr 2.75e-04 | 4166.73 ms | 32.4% bf16 MFU | 125795 tok/s step 10629/19560 | loss 3.423883 (+0.37z)| norm 0.2848 (-0.09z)| lr 2.75e-04 | 4158.03 ms | 32.5% bf16 MFU | 125810 tok/s step 10630/19560 | loss 3.403641 (-0.07z)| norm 0.2916 (-0.02z)| lr 2.75e-04 | 4157.46 ms | 32.5% bf16 MFU | 125825 tok/s step 10631/19560 | loss 3.392163 (-0.33z)| norm 0.2718 (-0.21z)| lr 2.75e-04 | 4163.84 ms | 32.4% bf16 MFU | 125829 tok/s step 10632/19560 | loss 3.372560 (-0.74z)| norm 0.2742 (-0.19z)| lr 2.75e-04 | 4162.75 ms | 32.4% bf16 MFU | 125835 tok/s step 10633/19560 | loss 3.383033 
(-0.50z)| norm 0.3019 (+0.07z)| lr 2.75e-04 | 4157.69 ms | 32.5% bf16 MFU | 125848 tok/s step 10634/19560 | loss 3.355035 (-1.11z)| norm 0.2632 (-0.30z)| lr 2.75e-04 | 4166.67 ms | 32.4% bf16 MFU | 125847 tok/s step 10635/19560 | loss 3.414801 (+0.18z)| norm 0.2807 (-0.13z)| lr 2.75e-04 | 4162.32 ms | 32.4% bf16 MFU | 125853 tok/s step 10636/19560 | loss 3.361502 (-0.99z)| norm 0.2519 (-0.40z)| lr 2.75e-04 | 4164.91 ms | 32.4% bf16 MFU | 125854 tok/s step 10637/19560 | loss 3.453563 (+1.04z)| norm 0.2762 (-0.17z)| lr 2.75e-04 | 4161.34 ms | 32.4% bf16 MFU | 125861 tok/s step 10638/19560 | loss 3.389435 (-0.37z)| norm 0.2494 (-0.43z)| lr 2.75e-04 | 4163.93 ms | 32.4% bf16 MFU | 125864 tok/s step 10639/19560 | loss 3.458492 (+1.13z)| norm 0.2709 (-0.22z)| lr 2.75e-04 | 4156.89 ms | 32.5% bf16 MFU | 125877 tok/s step 10640/19560 | loss 3.394770 (-0.26z)| norm 0.2668 (-0.26z)| lr 2.75e-04 | 4164.60 ms | 32.4% bf16 MFU | 125878 tok/s step 10641/19560 | loss 3.357198 (-1.07z)| norm 0.2582 (-0.34z)| lr 2.75e-04 | 4167.94 ms | 32.4% bf16 MFU | 125873 tok/s step 10642/19560 | loss 3.461699 (+1.21z)| norm 0.2793 (-0.14z)| lr 2.74e-04 | 4167.05 ms | 32.4% bf16 MFU | 125870 tok/s step 10643/19560 | loss 3.398092 (-0.18z)| norm 0.2772 (-0.16z)| lr 2.74e-04 | 4164.37 ms | 32.4% bf16 MFU | 125872 tok/s step 10644/19560 | loss 3.353892 (-1.14z)| norm 0.2584 (-0.34z)| lr 2.74e-04 | 4164.27 ms | 32.4% bf16 MFU | 125873 tok/s step 10645/19560 | loss 3.332557 (-1.58z)| norm 0.2627 (-0.30z)| lr 2.74e-04 | 4163.19 ms | 32.4% bf16 MFU | 125876 tok/s step 10646/19560 | loss 3.400978 (-0.09z)| norm 0.2619 (-0.30z)| lr 2.74e-04 | 4163.42 ms | 32.4% bf16 MFU | 125879 tok/s step 10647/19560 | loss 3.425053 (+0.42z)| norm 0.2620 (-0.30z)| lr 2.74e-04 | 4176.02 ms | 32.3% bf16 MFU | 125862 tok/s step 10648/19560 | loss 3.420840 (+0.32z)| norm 0.3147 (+0.20z)| lr 2.74e-04 | 4159.85 ms | 32.5% bf16 MFU | 125871 tok/s step 10649/19560 | loss 3.454999 (+1.08z)| norm 0.3253 (+0.30z)| lr 2.74e-04 | 
4179.11 ms | 32.3% bf16 MFU | 125850 tok/s step 10650/19560 | loss 3.353958 (-1.12z)| norm 0.2769 (-0.17z)| lr 2.74e-04 | 4163.61 ms | 32.4% bf16 MFU | 125854 tok/s step 10651/19560 | loss 3.351211 (-1.16z)| norm 0.2597 (-0.33z)| lr 2.74e-04 | 4159.68 ms | 32.5% bf16 MFU | 125863 tok/s step 10652/19560 | loss 3.369881 (-0.75z)| norm 0.2749 (-0.19z)| lr 2.74e-04 | 4165.27 ms | 32.4% bf16 MFU | 125863 tok/s step 10653/19560 | loss 3.336168 (-1.51z)| norm 0.2568 (-0.36z)| lr 2.74e-04 | 4166.39 ms | 32.4% bf16 MFU | 125862 tok/s step 10654/19560 | loss 3.364793 (-0.84z)| norm 0.2682 (-0.25z)| lr 2.74e-04 | 4164.60 ms | 32.4% bf16 MFU | 125864 tok/s step 10655/19560 | loss 3.430609 (+0.72z)| norm 0.2662 (-0.26z)| lr 2.74e-04 | 4165.06 ms | 32.4% bf16 MFU | 125864 tok/s step 10656/19560 | loss 3.420525 (+0.46z)| norm 0.2684 (-0.24z)| lr 2.74e-04 | 4158.81 ms | 32.5% bf16 MFU | 125874 tok/s step 10657/19560 | loss 3.424413 (+0.55z)| norm 0.2801 (-0.13z)| lr 2.74e-04 | 4160.22 ms | 32.5% bf16 MFU | 125882 tok/s step 10658/19560 | loss 3.356454 (-1.07z)| norm 0.2677 (-0.25z)| lr 2.74e-04 | 4159.94 ms | 32.5% bf16 MFU | 125889 tok/s step 10659/19560 | loss 3.379617 (-0.50z)| norm 0.2895 (-0.04z)| lr 2.74e-04 | 4161.83 ms | 32.4% bf16 MFU | 125894 tok/s step 10660/19560 | loss 3.371548 (-0.69z)| norm 0.2849 (-0.08z)| lr 2.74e-04 | 4159.27 ms | 32.5% bf16 MFU | 125902 tok/s step 10661/19560 | loss 3.395633 (-0.10z)| norm 0.2883 (-0.05z)| lr 2.74e-04 | 4168.64 ms | 32.4% bf16 MFU | 125895 tok/s step 10662/19560 | loss 3.316058 (-1.98z)| norm 0.2857 (-0.07z)| lr 2.73e-04 | 4160.35 ms | 32.5% bf16 MFU | 125901 tok/s step 10663/19560 | loss 3.345801 (-1.26z)| norm 0.2940 (+0.00z)| lr 2.73e-04 | 4161.10 ms | 32.4% bf16 MFU | 125906 tok/s step 10664/19560 | loss 3.383224 (-0.37z)| norm 0.2701 (-0.22z)| lr 2.73e-04 | 4159.70 ms | 32.5% bf16 MFU | 125913 tok/s step 10665/19560 | loss 3.351859 (-1.10z)| norm 0.2697 (-0.23z)| lr 2.73e-04 | 4160.54 ms | 32.5% bf16 MFU | 125918 tok/s step 
10666/19560 | loss 3.322843 (-1.75z)| norm 0.2735 (-0.19z)| lr 2.73e-04 | 4158.75 ms | 32.5% bf16 MFU | 125925 tok/s step 10667/19560 | loss 3.382066 (-0.35z)| norm 0.2876 (-0.05z)| lr 2.73e-04 | 4157.04 ms | 32.5% bf16 MFU | 125935 tok/s step 10668/19560 | loss 3.349571 (-1.10z)| norm 0.2821 (-0.11z)| lr 2.73e-04 | 4166.88 ms | 32.4% bf16 MFU | 125930 tok/s step 10669/19560 | loss 3.413381 (+0.39z)| norm 0.2643 (-0.28z)| lr 2.73e-04 | 4167.68 ms | 32.4% bf16 MFU | 125923 tok/s step 10670/19560 | loss 3.438746 (+0.99z)| norm 0.2988 (+0.06z)| lr 2.73e-04 | 4164.05 ms | 32.4% bf16 MFU | 125922 tok/s step 10671/19560 | loss 3.463790 (+1.54z)| norm 0.2732 (-0.19z)| lr 2.73e-04 | 4165.41 ms | 32.4% bf16 MFU | 125920 tok/s step 10672/19560 | loss 3.334920 (-1.44z)| norm 0.2992 (+0.06z)| lr 2.73e-04 | 4164.46 ms | 32.4% bf16 MFU | 125918 tok/s step 10673/19560 | loss 3.396486 (+0.00z)| norm 0.3036 (+0.10z)| lr 2.73e-04 | 4164.86 ms | 32.4% bf16 MFU | 125917 tok/s step 10674/19560 | loss 3.409925 (+0.32z)| norm 0.2785 (-0.14z)| lr 2.73e-04 | 4160.51 ms | 32.5% bf16 MFU | 125922 tok/s step 10675/19560 | loss 3.378703 (-0.42z)| norm 0.2661 (-0.25z)| lr 2.73e-04 | 4161.40 ms | 32.4% bf16 MFU | 125925 tok/s step 10676/19560 | loss 3.415840 (+0.45z)| norm 0.2866 (-0.06z)| lr 2.73e-04 | 4180.89 ms | 32.3% bf16 MFU | 125899 tok/s step 10677/19560 | loss 3.354196 (-1.00z)| norm 0.2704 (-0.21z)| lr 2.73e-04 | 4169.37 ms | 32.4% bf16 MFU | 125891 tok/s step 10678/19560 | loss 3.387760 (-0.19z)| norm 0.2882 (-0.04z)| lr 2.73e-04 | 4167.90 ms | 32.4% bf16 MFU | 125886 tok/s step 10679/19560 | loss 3.523933 (+3.02z)| norm 0.2760 (-0.16z)| lr 2.73e-04 | 4158.57 ms | 32.5% bf16 MFU | 125896 tok/s step 10680/19560 | loss 3.429173 (+0.77z)| norm 0.3326 (+0.38z)| lr 2.73e-04 | 4152.22 ms | 32.5% bf16 MFU | 125914 tok/s step 10681/19560 | loss 3.353815 (-1.00z)| norm 0.2697 (-0.23z)| lr 2.73e-04 | 4156.42 ms | 32.5% bf16 MFU | 125925 tok/s step 10682/19560 | loss 3.426414 (+0.73z)| norm 
0.3013 (+0.07z)| lr 2.73e-04 | 4161.32 ms | 32.4% bf16 MFU | 125929 tok/s step 10683/19560 | loss 3.357682 (-0.89z)| norm 0.2929 (-0.00z)| lr 2.72e-04 | 4157.57 ms | 32.5% bf16 MFU | 125937 tok/s step 10684/19560 | loss 3.342484 (-1.24z)| norm 0.2621 (-0.30z)| lr 2.72e-04 | 4164.85 ms | 32.4% bf16 MFU | 125935 tok/s step 10685/19560 | loss 3.324657 (-1.68z)| norm 0.3097 (+0.16z)| lr 2.72e-04 | 4159.94 ms | 32.5% bf16 MFU | 125940 tok/s step 10686/19560 | loss 3.425099 (+0.75z)| norm 0.2814 (-0.11z)| lr 2.72e-04 | 4169.56 ms | 32.4% bf16 MFU | 125930 tok/s step 10687/19560 | loss 3.357658 (-0.87z)| norm 0.3064 (+0.12z)| lr 2.72e-04 | 4159.95 ms | 32.5% bf16 MFU | 125935 tok/s step 10688/19560 | loss 3.381172 (-0.29z)| norm 0.2863 (-0.07z)| lr 2.72e-04 | 4165.19 ms | 32.4% bf16 MFU | 125932 tok/s step 10689/19560 | loss 3.400422 (+0.18z)| norm 0.2586 (-0.34z)| lr 2.72e-04 | 4160.29 ms | 32.5% bf16 MFU | 125936 tok/s step 10690/19560 | loss 3.400720 (+0.18z)| norm 0.3291 (+0.34z)| lr 2.72e-04 | 4158.82 ms | 32.5% bf16 MFU | 125943 tok/s step 10691/19560 | loss 3.361293 (-0.78z)| norm 0.2938 (-0.01z)| lr 2.72e-04 | 4167.86 ms | 32.4% bf16 MFU | 125935 tok/s step 10692/19560 | loss 3.437518 (+1.08z)| norm 0.2969 (+0.02z)| lr 2.72e-04 | 4153.70 ms | 32.5% bf16 MFU | 125950 tok/s step 10693/19560 | loss 3.373654 (-0.47z)| norm 0.3099 (+0.14z)| lr 2.72e-04 | 4165.71 ms | 32.4% bf16 MFU | 125945 tok/s step 10694/19560 | loss 3.388132 (-0.11z)| norm 0.2711 (-0.22z)| lr 2.72e-04 | 4167.41 ms | 32.4% bf16 MFU | 125938 tok/s step 10695/19560 | loss 3.386649 (-0.14z)| norm 0.2816 (-0.12z)| lr 2.72e-04 | 4155.52 ms | 32.5% bf16 MFU | 125950 tok/s step 10696/19560 | loss 3.345494 (-1.16z)| norm 0.2630 (-0.30z)| lr 2.72e-04 | 4160.02 ms | 32.5% bf16 MFU | 125954 tok/s step 10697/19560 | loss 3.380663 (-0.28z)| norm 0.2807 (-0.13z)| lr 2.72e-04 | 4153.38 ms | 32.5% bf16 MFU | 125968 tok/s step 10698/19560 | loss 3.336687 (-1.36z)| norm 0.2711 (-0.22z)| lr 2.72e-04 | 4168.00 ms | 
32.4% bf16 MFU | 125959 tok/s step 10699/19560 | loss 3.407815 (+0.42z)| norm 0.2716 (-0.22z)| lr 2.72e-04 | 4154.78 ms | 32.5% bf16 MFU | 125970 tok/s step 10700/19560 | loss 3.347629 (-1.08z)| norm 0.2659 (-0.27z)| lr 2.72e-04 | 4161.08 ms | 32.4% bf16 MFU | 125972 tok/s step 10701/19560 | loss 3.404145 (+0.34z)| norm 0.2874 (-0.06z)| lr 2.72e-04 | 4162.39 ms | 32.4% bf16 MFU | 125971 tok/s step 10702/19560 | loss 3.408899 (+0.46z)| norm 0.2625 (-0.30z)| lr 2.72e-04 | 4163.80 ms | 32.4% bf16 MFU | 125968 tok/s step 10703/19560 | loss 3.407038 (+0.43z)| norm 0.3287 (+0.33z)| lr 2.71e-04 | 4162.18 ms | 32.4% bf16 MFU | 125968 tok/s step 10704/19560 | loss 3.383237 (-0.19z)| norm 0.2926 (-0.01z)| lr 2.71e-04 | 4159.08 ms | 32.5% bf16 MFU | 125972 tok/s step 10705/19560 | loss 3.422869 (+0.83z)| norm 0.2653 (-0.27z)| lr 2.71e-04 | 4161.57 ms | 32.4% bf16 MFU | 125973 tok/s step 10706/19560 | loss 3.415844 (+0.66z)| norm 0.2953 (+0.01z)| lr 2.71e-04 | 4164.46 ms | 32.4% bf16 MFU | 125969 tok/s step 10707/19560 | loss 3.315032 (-1.91z)| norm 0.3095 (+0.15z)| lr 2.71e-04 | 4166.36 ms | 32.4% bf16 MFU | 125963 tok/s step 10708/19560 | loss 3.427309 (+0.98z)| norm 0.2954 (+0.01z)| lr 2.71e-04 | 4166.02 ms | 32.4% bf16 MFU | 125957 tok/s step 10709/19560 | loss 3.429094 (+1.02z)| norm 0.2890 (-0.05z)| lr 2.71e-04 | 4164.12 ms | 32.4% bf16 MFU | 125954 tok/s step 10710/19560 | loss 3.375906 (-0.33z)| norm 0.2918 (-0.02z)| lr 2.71e-04 | 4160.17 ms | 32.5% bf16 MFU | 125958 tok/s step 10711/19560 | loss 3.420582 (+0.83z)| norm 0.3053 (+0.10z)| lr 2.71e-04 | 4163.18 ms | 32.4% bf16 MFU | 125957 tok/s step 10712/19560 | loss 3.374141 (-0.38z)| norm 0.2739 (-0.20z)| lr 2.71e-04 | 4158.57 ms | 32.5% bf16 MFU | 125963 tok/s step 10713/19560 | loss 3.363044 (-0.65z)| norm 0.2916 (-0.03z)| lr 2.71e-04 | 4165.67 ms | 32.4% bf16 MFU | 125957 tok/s step 10714/19560 | loss 3.329075 (-1.51z)| norm 0.2833 (-0.11z)| lr 2.71e-04 | 4237.47 ms | 31.9% bf16 MFU | 125846 tok/s step 10715/19560 
| loss 3.398203 (+0.27z)| norm 0.2784 (-0.16z)| lr 2.71e-04 | 4162.39 ms | 32.4% bf16 MFU | 125852 tok/s step 10716/19560 | loss 3.428435 (+1.04z)| norm 0.2811 (-0.13z)| lr 2.71e-04 | 4160.60 ms | 32.5% bf16 MFU | 125860 tok/s step 10717/19560 | loss 3.353600 (-0.88z)| norm 0.2950 (+0.00z)| lr 2.71e-04 | 4156.84 ms | 32.5% bf16 MFU | 125873 tok/s step 10718/19560 | loss 3.425932 (+0.98z)| norm 0.2716 (-0.22z)| lr 2.71e-04 | 4155.77 ms | 32.5% bf16 MFU | 125887 tok/s step 10719/19560 | loss 3.374055 (-0.36z)| norm 0.2717 (-0.22z)| lr 2.71e-04 | 4153.59 ms | 32.5% bf16 MFU | 125904 tok/s step 10720/19560 | loss 3.414958 (+0.69z)| norm 0.2621 (-0.31z)| lr 2.71e-04 | 4164.31 ms | 32.4% bf16 MFU | 125904 tok/s step 10721/19560 | loss 3.483365 (+2.39z)| norm 0.2838 (-0.11z)| lr 2.71e-04 | 4152.17 ms | 32.5% bf16 MFU | 125922 tok/s step 10722/19560 | loss 3.364999 (-0.60z)| norm 0.2759 (-0.18z)| lr 2.71e-04 | 4158.65 ms | 32.5% bf16 MFU | 125930 tok/s step 10723/19560 | loss 3.400671 (+0.30z)| norm 0.2714 (-0.23z)| lr 2.70e-04 | 4160.86 ms | 32.4% bf16 MFU | 125933 tok/s step 10724/19560 | loss 3.366744 (-0.56z)| norm 0.2571 (-0.36z)| lr 2.70e-04 | 4153.00 ms | 32.5% bf16 MFU | 125949 tok/s step 10725/19560 | loss 3.364327 (-0.63z)| norm 0.2705 (-0.23z)| lr 2.70e-04 | 4154.30 ms | 32.5% bf16 MFU | 125962 tok/s step 10726/19560 | loss 3.386412 (-0.08z)| norm 0.2738 (-0.20z)| lr 2.70e-04 | 4307.52 ms | 31.3% bf16 MFU | 125749 tok/s step 10727/19560 | loss 3.359019 (-0.77z)| norm 0.2939 (-0.01z)| lr 2.70e-04 | 4159.88 ms | 32.5% bf16 MFU | 125764 tok/s step 10728/19560 | loss 3.414927 (+0.65z)| norm 0.2693 (-0.24z)| lr 2.70e-04 | 4162.17 ms | 32.4% bf16 MFU | 125774 tok/s step 10729/19560 | loss 3.394481 (+0.13z)| norm 0.2727 (-0.21z)| lr 2.70e-04 | 4167.20 ms | 32.4% bf16 MFU | 125776 tok/s step 10730/19560 | loss 3.339193 (-1.28z)| norm 0.2841 (-0.10z)| lr 2.70e-04 | 4161.18 ms | 32.4% bf16 MFU | 125787 tok/s step 10731/19560 | loss 3.363372 (-0.68z)| norm 0.2739 (-0.20z)| 
lr 2.70e-04 | 4155.96 ms | 32.5% bf16 MFU | 125805 tok/s step 10732/19560 | loss 3.402445 (+0.33z)| norm 0.2915 (-0.03z)| lr 2.70e-04 | 4152.48 ms | 32.5% bf16 MFU | 125828 tok/s step 10733/19560 | loss 3.386373 (-0.09z)| norm 0.2802 (-0.14z)| lr 2.70e-04 | 4159.74 ms | 32.5% bf16 MFU | 125838 tok/s step 10734/19560 | loss 3.349436 (-1.02z)| norm 0.2810 (-0.13z)| lr 2.70e-04 | 4155.27 ms | 32.5% bf16 MFU | 125855 tok/s step 10735/19560 | loss 3.399373 (+0.26z)| norm 0.2761 (-0.17z)| lr 2.70e-04 | 4153.42 ms | 32.5% bf16 MFU | 125874 tok/s step 10736/19560 | loss 3.398523 (+0.24z)| norm 0.2889 (-0.05z)| lr 2.70e-04 | 4161.40 ms | 32.4% bf16 MFU | 125879 tok/s step 10737/19560 | loss 3.347375 (-1.06z)| norm 0.2680 (-0.25z)| lr 2.70e-04 | 4148.97 ms | 32.5% bf16 MFU | 125904 tok/s step 10738/19560 | loss 3.410506 (+0.54z)| norm 0.2566 (-0.36z)| lr 2.70e-04 | 4154.93 ms | 32.5% bf16 MFU | 125918 tok/s step 10739/19560 | loss 3.310046 (-2.00z)| norm 0.2941 (+0.00z)| lr 2.70e-04 | 4155.19 ms | 32.5% bf16 MFU | 125931 tok/s step 10740/19560 | loss 3.380475 (-0.21z)| norm 0.2527 (-0.39z)| lr 2.70e-04 | 4154.90 ms | 32.5% bf16 MFU | 125943 tok/s step 10741/19560 | loss 3.405889 (+0.42z)| norm 0.2922 (-0.01z)| lr 2.70e-04 | 4156.54 ms | 32.5% bf16 MFU | 125953 tok/s step 10742/19560 | loss 3.411286 (+0.55z)| norm 0.2772 (-0.15z)| lr 2.70e-04 | 4158.11 ms | 32.5% bf16 MFU | 125960 tok/s step 10743/19560 | loss 3.426516 (+0.93z)| norm 0.2631 (-0.28z)| lr 2.69e-04 | 4176.03 ms | 32.3% bf16 MFU | 125939 tok/s step 10744/19560 | loss 3.419522 (+0.74z)| norm 0.2876 (-0.05z)| lr 2.69e-04 | 4156.06 ms | 32.5% bf16 MFU | 125950 tok/s step 10745/19560 | loss 3.368855 (-0.57z)| norm 0.2727 (-0.19z)| lr 2.69e-04 | 4159.16 ms | 32.5% bf16 MFU | 125955 tok/s step 10746/19560 | loss 3.372845 (-0.46z)| norm 0.2797 (-0.12z)| lr 2.69e-04 | 4147.32 ms | 32.6% bf16 MFU | 125978 tok/s step 10747/19560 | loss 3.399230 (+0.23z)| norm 0.2670 (-0.24z)| lr 2.69e-04 | 4167.22 ms | 32.4% bf16 MFU | 
125970 tok/s step 10748/19560 | loss 3.438737 (+1.27z)| norm 0.2573 (-0.33z)| lr 2.69e-04 | 4167.42 ms | 32.4% bf16 MFU | 125962 tok/s step 10749/19560 | loss 3.380901 (-0.23z)| norm 0.2676 (-0.23z)| lr 2.69e-04 | 4152.43 ms | 32.5% bf16 MFU | 125977 tok/s step 10750/19560 | loss 3.418538 (+0.77z)| norm 0.2689 (-0.21z)| lr 2.69e-04 | 4161.30 ms | 32.4% bf16 MFU | 125977 tok/s val loss 3.370723 evaluating HellaSwag: 0/1256
evaluating HellaSwag: 1250/1256 HellaSwag: 2906/10042 = 0.289385 step 10751/19560 | loss 3.422766 (+0.91z)| norm 0.2780 (-0.19z)| lr 2.69e-04 | 4158.74 ms | 32.5% bf16 MFU | 125982 tok/s step 10752/19560 | loss 3.367863 (-0.57z)| norm 0.2619 (-1.06z)| lr 2.69e-04 | 4165.32 ms | 32.4% bf16 MFU | 125976 tok/s step 10753/19560 | loss 3.312798 (-2.02z)| norm 0.2800 (-0.03z)| lr 2.69e-04 | 4167.44 ms | 32.4% bf16 MFU | 125968 tok/s step 10754/19560 | loss 3.391973 (+0.11z)| norm 0.2780 (-0.13z)| lr 2.69e-04 | 4158.54 ms | 32.5% bf16 MFU | 125973 tok/s step 10755/19560 | loss 3.340957 (-1.25z)| norm 0.2809 (+0.06z)| lr 2.69e-04 | 4155.87 ms | 32.5% bf16 MFU | 125982 tok/s step 10756/19560 | loss 3.431000 (+1.14z)| norm 0.2986 (+1.15z)| lr 2.69e-04 | 4165.21 ms | 32.4% bf16 MFU | 125977 tok/s step 10757/19560 | loss 3.519216 (+3.32z)| norm 0.2738 (-0.37z)| lr 2.69e-04 | 4157.11 ms | 32.5% bf16 MFU | 125984 tok/s step 10758/19560 | loss 3.314683 (-1.85z)| norm 0.2706 (-0.55z)| lr 2.69e-04 | 4165.66 ms | 32.4% bf16 MFU | 125978 tok/s step 10759/19560 | loss 3.433181 (+1.12z)| norm 0.2991 (+1.17z)| lr 2.69e-04 | 4153.32 ms | 32.5% bf16 MFU | 125990 tok/s step 10760/19560 | loss 3.375533 (-0.33z)| norm 0.2832 (+0.20z)| lr 2.69e-04 | 4161.20 ms | 32.4% bf16 MFU | 125991 tok/s step 10761/19560 | loss 3.341922 (-1.16z)| norm 0.2891 (+0.57z)| lr 2.69e-04 | 4158.22 ms | 32.5% bf16 MFU | 125995 tok/s step 10762/19560 | loss 3.368284 (-0.50z)| norm 0.2824 (+0.15z)| lr 2.69e-04 | 4163.04 ms | 32.4% bf16 MFU | 125993 tok/s step 10763/19560 | loss 3.386876 (-0.03z)| norm 0.2650 (-0.91z)| lr 2.68e-04 | 4156.59 ms | 32.5% bf16 MFU | 126000 tok/s step 10764/19560 | loss 3.348967 (-0.98z)| norm 0.2704
(-0.59z)| lr 2.68e-04 | 4169.70 ms | 32.4% bf16 MFU | 125986 tok/s step 10765/19560 | loss 3.339057 (-1.21z)| norm 0.2574 (-1.38z)| lr 2.68e-04 | 4163.62 ms | 32.4% bf16 MFU | 125983 tok/s step 10766/19560 | loss 3.377308 (-0.25z)| norm 0.2877 (+0.47z)| lr 2.68e-04 | 4157.16 ms | 32.5% bf16 MFU | 125990 tok/s step 10767/19560 | loss 3.462802 (+1.89z)| norm 0.2893 (+0.56z)| lr 2.68e-04 | 4176.83 ms | 32.3% bf16 MFU | 125967 tok/s step 10768/19560 | loss 3.414165 (+0.67z)| norm 0.3032 (+1.40z)| lr 2.68e-04 | 4161.18 ms | 32.4% bf16 MFU | 125968 tok/s step 10769/19560 | loss 3.418497 (+0.77z)| norm 0.2665 (-0.88z)| lr 2.68e-04 | 4156.95 ms | 32.5% bf16 MFU | 125976 tok/s step 10770/19560 | loss 3.318266 (-1.71z)| norm 0.2696 (-0.68z)| lr 2.68e-04 | 4165.52 ms | 32.4% bf16 MFU | 125970 tok/s step 10771/19560 | loss 3.387526 (+0.02z)| norm 0.2866 (+0.37z)| lr 2.68e-04 | 4160.05 ms | 32.5% bf16 MFU | 125973 tok/s step 10772/19560 | loss 3.407495 (+0.51z)| norm 0.3116 (+1.89z)| lr 2.68e-04 | 4155.58 ms | 32.5% bf16 MFU | 125983 tok/s step 10773/19560 | loss 3.390719 (+0.08z)| norm 0.3001 (+1.16z)| lr 2.68e-04 | 4159.38 ms | 32.5% bf16 MFU | 125986 tok/s step 10774/19560 | loss 3.380105 (-0.18z)| norm 0.2738 (-0.48z)| lr 2.68e-04 | 4154.07 ms | 32.5% bf16 MFU | 125997 tok/s step 10775/19560 | loss 3.422935 (+0.90z)| norm 0.2955 (+0.86z)| lr 2.68e-04 | 4152.73 ms | 32.5% bf16 MFU | 126010 tok/s step 10776/19560 | loss 3.368902 (-0.46z)| norm 0.2550 (-1.64z)| lr 2.68e-04 | 4155.04 ms | 32.5% bf16 MFU | 126019 tok/s step 10777/19560 | loss 3.381569 (-0.12z)| norm 0.2725 (-0.54z)| lr 2.68e-04 | 4164.97 ms | 32.4% bf16 MFU | 126012 tok/s step 10778/19560 | loss 3.401912 (+0.39z)| norm 0.2826 (+0.11z)| lr 2.68e-04 | 4155.95 ms | 32.5% bf16 MFU | 126019 tok/s step 10779/19560 | loss 3.416836 (+0.76z)| norm 0.2605 (-1.31z)| lr 2.68e-04 | 4159.13 ms | 32.5% bf16 MFU | 126021 tok/s step 10780/19560 | loss 3.345455 (-1.07z)| norm 0.2927 (+0.75z)| lr 2.68e-04 | 4161.93 ms | 32.4% bf16 
MFU | 126018 tok/s step 10781/19560 | loss 3.428570 (+1.05z)| norm 0.2505 (-1.95z)| lr 2.68e-04 | 4164.75 ms | 32.4% bf16 MFU | 126012 tok/s step 10782/19560 | loss 3.370843 (-0.44z)| norm 0.3086 (+1.73z)| lr 2.68e-04 | 4158.47 ms | 32.5% bf16 MFU | 126015 tok/s step 10783/19560 | loss 3.409965 (+0.58z)| norm 0.2824 (+0.06z)| lr 2.67e-04 | 4155.21 ms | 32.5% bf16 MFU | 126023 tok/s step 10784/19560 | loss 3.400280 (+0.33z)| norm 0.2848 (+0.21z)| lr 2.67e-04 | 4153.41 ms | 32.5% bf16 MFU | 126033 tok/s step 10785/19560 | loss 3.465361 (+1.98z)| norm 0.2980 (+1.03z)| lr 2.67e-04 | 4355.22 ms | 31.0% bf16 MFU | 125751 tok/s step 10786/19560 | loss 3.341268 (-1.18z)| norm 0.2972 (+0.97z)| lr 2.67e-04 | 4241.36 ms | 31.8% bf16 MFU | 125644 tok/s step 10787/19560 | loss 3.375549 (-0.31z)| norm 0.2906 (+0.55z)| lr 2.67e-04 | 4185.97 ms | 32.3% bf16 MFU | 125624 tok/s step 10788/19560 | loss 3.355394 (-0.82z)| norm 0.2902 (+0.53z)| lr 2.67e-04 | 4157.14 ms | 32.5% bf16 MFU | 125649 tok/s step 10789/19560 | loss 3.307328 (-1.99z)| norm 0.2704 (-0.72z)| lr 2.67e-04 | 4156.76 ms | 32.5% bf16 MFU | 125673 tok/s step 10790/19560 | loss 3.356539 (-0.77z)| norm 0.2822 (+0.03z)| lr 2.67e-04 | 4158.15 ms | 32.5% bf16 MFU | 125694 tok/s step 10791/19560 | loss 3.346426 (-1.03z)| norm 0.2624 (-1.21z)| lr 2.67e-04 | 4200.87 ms | 32.1% bf16 MFU | 125649 tok/s step 10792/19560 | loss 3.364275 (-0.57z)| norm 0.2869 (+0.33z)| lr 2.67e-04 | 4181.65 ms | 32.3% bf16 MFU | 125636 tok/s step 10793/19560 | loss 3.361148 (-0.66z)| norm 0.2509 (-1.91z)| lr 2.67e-04 | 4155.23 ms | 32.5% bf16 MFU | 125663 tok/s step 10794/19560 | loss 3.362846 (-0.63z)| norm 0.2818 (+0.01z)| lr 2.67e-04 | 4156.82 ms | 32.5% bf16 MFU | 125686 tok/s step 10795/19560 | loss 3.383676 (-0.09z)| norm 0.2933 (+0.73z)| lr 2.67e-04 | 4157.86 ms | 32.5% bf16 MFU | 125706 tok/s step 10796/19560 | loss 3.413754 (+0.66z)| norm 0.2887 (+0.44z)| lr 2.67e-04 | 4157.08 ms | 32.5% bf16 MFU | 125727 tok/s step 10797/19560 | loss 
3.341033 (-1.18z)| norm 0.3798 (+5.36z)| lr 2.67e-04 | 4152.22 ms | 32.5% bf16 MFU | 125754 tok/s step 10798/19560 | loss 3.344430 (-1.08z)| norm 0.2910 (+0.47z)| lr 2.67e-04 | 4154.90 ms | 32.5% bf16 MFU | 125776 tok/s step 10799/19560 | loss 3.344933 (-1.05z)| norm 0.2863 (+0.20z)| lr 2.67e-04 | 4153.39 ms | 32.5% bf16 MFU | 125798 tok/s step 10800/19560 | loss 3.354260 (-0.82z)| norm 0.2807 (-0.10z)| lr 2.67e-04 | 4152.42 ms | 32.5% bf16 MFU | 125821 tok/s step 10801/19560 | loss 3.330616 (-1.41z)| norm 0.2894 (+0.39z)| lr 2.67e-04 | 4151.70 ms | 32.5% bf16 MFU | 125844 tok/s step 10802/19560 | loss 3.344401 (-1.04z)| norm 0.2736 (-0.48z)| lr 2.67e-04 | 4155.49 ms | 32.5% bf16 MFU | 125861 tok/s step 10803/19560 | loss 3.329094 (-1.41z)| norm 0.2857 (+0.18z)| lr 2.66e-04 | 4152.18 ms | 32.5% bf16 MFU | 125881 tok/s step 10804/19560 | loss 3.317226 (-1.68z)| norm 0.2592 (-1.28z)| lr 2.66e-04 | 4154.82 ms | 32.5% bf16 MFU | 125896 tok/s step 10805/19560 | loss 3.445926 (+1.54z)| norm 0.2887 (+0.35z)| lr 2.66e-04 | 4154.15 ms | 32.5% bf16 MFU | 125912 tok/s step 10806/19560 | loss 3.366344 (-0.45z)| norm 0.2869 (+0.25z)| lr 2.66e-04 | 4158.85 ms | 32.5% bf16 MFU | 125920 tok/s step 10807/19560 | loss 3.351091 (-0.83z)| norm 0.2634 (-1.05z)| lr 2.66e-04 | 4154.70 ms | 32.5% bf16 MFU | 125933 tok/s step 10808/19560 | loss 3.350939 (-0.82z)| norm 0.2825 (+0.03z)| lr 2.66e-04 | 4152.51 ms | 32.5% bf16 MFU | 125949 tok/s step 10809/19560 | loss 3.369552 (-0.34z)| norm 0.2634 (-1.05z)| lr 2.66e-04 | 4153.11 ms | 32.5% bf16 MFU | 125964 tok/s step 10810/19560 | loss 3.369947 (-0.32z)| norm 0.2806 (-0.06z)| lr 2.66e-04 | 4155.12 ms | 32.5% bf16 MFU | 125975 tok/s step 10811/19560 | loss 3.376662 (-0.14z)| norm 0.2774 (-0.24z)| lr 2.66e-04 | 4151.24 ms | 32.5% bf16 MFU | 125991 tok/s step 10812/19560 | loss 3.348190 (-0.90z)| norm 0.2746 (-0.41z)| lr 2.66e-04 | 4154.02 ms | 32.5% bf16 MFU | 126002 tok/s step 10813/19560 | loss 3.389271 (+0.18z)| norm 0.2628 (-1.07z)| lr 
2.66e-04 | 4153.64 ms | 32.5% bf16 MFU | 126013 tok/s step 10814/19560 | loss 3.361194 (-0.56z)| norm 0.2646 (-0.96z)| lr 2.66e-04 | 4153.67 ms | 32.5% bf16 MFU | 126023 tok/s step 10815/19560 | loss 3.381507 (-0.02z)| norm 0.2683 (-0.73z)| lr 2.66e-04 | 4159.85 ms | 32.5% bf16 MFU | 126024 tok/s step 10816/19560 | loss 3.307790 (-1.97z)| norm 0.2807 (-0.01z)| lr 2.66e-04 | 4153.88 ms | 32.5% bf16 MFU | 126034 tok/s step 10817/19560 | loss 3.383500 (+0.05z)| norm 0.2759 (-0.30z)| lr 2.66e-04 | 4149.80 ms | 32.5% bf16 MFU | 126049 tok/s step 10818/19560 | loss 3.332910 (-1.27z)| norm 0.2557 (-1.48z)| lr 2.66e-04 | 4151.92 ms | 32.5% bf16 MFU | 126060 tok/s step 10819/19560 | loss 3.532731 (+3.76z)| norm 0.2799 (-0.02z)| lr 2.66e-04 | 4151.35 ms | 32.5% bf16 MFU | 126072 tok/s step 10820/19560 | loss 3.390474 (+0.21z)| norm 1.6042 (+11.16z)| lr 2.66e-04 | 4153.15 ms | 32.5% bf16 MFU | 126080 tok/s step 10821/19560 | loss 3.422600 (+1.01z)| norm 0.3176 (+0.23z)| lr 2.66e-04 | 4151.99 ms | 32.5% bf16 MFU | 126090 tok/s step 10822/19560 | loss 3.514091 (+3.15z)| norm 3.4579 (+10.38z)| lr 2.66e-04 | 4149.28 ms | 32.5% bf16 MFU | 126103 tok/s step 10823/19560 | loss 3.360961 (-0.53z)| norm 0.8902 (+1.86z)| lr 2.65e-04 | 4150.94 ms | 32.5% bf16 MFU | 126113 tok/s step 10824/19560 | loss 3.390423 (+0.17z)| norm 0.3485 (+0.09z)| lr 2.65e-04 | 4151.32 ms | 32.5% bf16 MFU | 126123 tok/s step 10825/19560 | loss 3.428163 (+1.06z)| norm 0.3173 (-0.01z)| lr 2.65e-04 | 4152.84 ms | 32.5% bf16 MFU | 126129 tok/s step 10826/19560 | loss 3.366515 (-0.43z)| norm 0.3295 (+0.03z)| lr 2.65e-04 | 4150.37 ms | 32.5% bf16 MFU | 126139 tok/s step 10827/19560 | loss 3.354719 (-0.70z)| norm 0.3178 (-0.01z)| lr 2.65e-04 | 4154.88 ms | 32.5% bf16 MFU | 126141 tok/s step 10828/19560 | loss 3.353600 (-0.73z)| norm 0.3399 (+0.06z)| lr 2.65e-04 | 4151.91 ms | 32.5% bf16 MFU | 126148 tok/s step 10829/19560 | loss 3.353364 (-0.72z)| norm 0.2721 (-0.16z)| lr 2.65e-04 | 4152.25 ms | 32.5% bf16 MFU | 
126154 tok/s step 10830/19560 | loss 3.477588 (+2.22z)| norm 0.3113 (-0.04z)| lr 2.65e-04 | 4151.73 ms | 32.5% bf16 MFU | 126160 tok/s step 10831/19560 | loss 3.395391 (+0.27z)| norm 0.2855 (-0.12z)| lr 2.65e-04 | 4151.85 ms | 32.5% bf16 MFU | 126166 tok/s step 10832/19560 | loss 3.440377 (+1.32z)| norm 0.2759 (-0.15z)| lr 2.65e-04 | 4150.63 ms | 32.5% bf16 MFU | 126173 tok/s step 10833/19560 | loss 3.402479 (+0.43z)| norm 0.2957 (-0.09z)| lr 2.65e-04 | 4155.57 ms | 32.5% bf16 MFU | 126173 tok/s step 10834/19560 | loss 3.431128 (+1.10z)| norm 0.2690 (-0.17z)| lr 2.65e-04 | 4151.35 ms | 32.5% bf16 MFU | 126179 tok/s step 10835/19560 | loss 3.401160 (+0.39z)| norm 0.2818 (-0.13z)| lr 2.65e-04 | 4155.81 ms | 32.5% bf16 MFU | 126178 tok/s step 10836/19560 | loss 3.337853 (-1.10z)| norm 0.2763 (-0.15z)| lr 2.65e-04 | 4155.92 ms | 32.5% bf16 MFU | 126177 tok/s step 10837/19560 | loss 3.352045 (-0.75z)| norm 0.2683 (-0.17z)| lr 2.65e-04 | 4151.48 ms | 32.5% bf16 MFU | 126182 tok/s step 10838/19560 | loss 3.391491 (+0.18z)| norm 0.2735 (-0.16z)| lr 2.65e-04 | 4148.65 ms | 32.5% bf16 MFU | 126192 tok/s step 10839/19560 | loss 3.388000 (+0.11z)| norm 0.2820 (-0.13z)| lr 2.65e-04 | 4152.07 ms | 32.5% bf16 MFU | 126196 tok/s step 10840/19560 | loss 3.422070 (+0.91z)| norm 0.2718 (-0.16z)| lr 2.65e-04 | 4149.80 ms | 32.5% bf16 MFU | 126203 tok/s step 10841/19560 | loss 3.437990 (+1.27z)| norm 0.2766 (-0.15z)| lr 2.65e-04 | 4149.32 ms | 32.5% bf16 MFU | 126211 tok/s step 10842/19560 | loss 3.358205 (-0.63z)| norm 0.2716 (-0.16z)| lr 2.65e-04 | 4151.61 ms | 32.5% bf16 MFU | 126215 tok/s step 10843/19560 | loss 3.377974 (-0.16z)| norm 0.2587 (-0.20z)| lr 2.65e-04 | 4150.80 ms | 32.5% bf16 MFU | 126219 tok/s step 10844/19560 | loss 3.390200 (+0.14z)| norm 0.2770 (-0.14z)| lr 2.64e-04 | 4152.52 ms | 32.5% bf16 MFU | 126221 tok/s step 10845/19560 | loss 3.347373 (-0.88z)| norm 0.2628 (-0.19z)| lr 2.64e-04 | 4150.91 ms | 32.5% bf16 MFU | 126226 tok/s step 10846/19560 | loss 3.347363 
(-0.86z)| norm 0.2657 (-0.18z)| lr 2.64e-04 | 4151.83 ms | 32.5% bf16 MFU | 126228 tok/s step 10847/19560 | loss 3.323224 (-1.42z)| norm 0.2808 (-0.13z)| lr 2.64e-04 | 4147.62 ms | 32.6% bf16 MFU | 126237 tok/s step 10848/19560 | loss 3.347343 (-0.84z)| norm 0.2504 (-0.23z)| lr 2.64e-04 | 4152.03 ms | 32.5% bf16 MFU | 126239 tok/s step 10849/19560 | loss 3.403596 (+0.52z)| norm 0.2800 (-0.13z)| lr 2.64e-04 | 4151.31 ms | 32.5% bf16 MFU | 126242 tok/s step 10850/19560 | loss 3.505086 (+2.86z)| norm 0.2988 (-0.07z)| lr 2.64e-04 | 4199.37 ms | 32.2% bf16 MFU | 126172 tok/s step 10851/19560 | loss 3.434656 (+1.20z)| norm 0.4285 (+0.35z)| lr 2.64e-04 | 4149.09 ms | 32.5% bf16 MFU | 126182 tok/s step 10852/19560 | loss 3.417988 (+0.80z)| norm 0.2898 (-0.11z)| lr 2.64e-04 | 4191.39 ms | 32.2% bf16 MFU | 126127 tok/s step 10853/19560 | loss 3.327630 (-1.29z)| norm 0.2725 (-0.16z)| lr 2.64e-04 | 4191.89 ms | 32.2% bf16 MFU | 126074 tok/s step 10854/19560 | loss 3.330586 (-1.21z)| norm 0.2893 (-0.11z)| lr 2.64e-04 | 4155.24 ms | 32.5% bf16 MFU | 126079 tok/s step 10855/19560 | loss 3.449507 (+1.50z)| norm 0.2804 (-0.14z)| lr 2.64e-04 | 4165.50 ms | 32.4% bf16 MFU | 126068 tok/s step 10856/19560 | loss 3.379102 (-0.10z)| norm 0.2942 (-0.09z)| lr 2.64e-04 | 4143.47 ms | 32.6% bf16 MFU | 126092 tok/s step 10857/19560 | loss 3.391839 (+0.19z)| norm 0.2747 (-0.16z)| lr 2.64e-04 | 4146.02 ms | 32.6% bf16 MFU | 126110 tok/s step 10858/19560 | loss 3.387485 (+0.08z)| norm 0.2671 (-0.18z)| lr 2.64e-04 | 4146.71 ms | 32.6% bf16 MFU | 126126 tok/s step 10859/19560 | loss 3.355323 (-0.65z)| norm 0.2675 (-0.18z)| lr 2.64e-04 | 4152.97 ms | 32.5% bf16 MFU | 126132 tok/s step 10860/19560 | loss 3.343883 (-0.90z)| norm 0.2662 (-0.18z)| lr 2.64e-04 | 4149.56 ms | 32.5% bf16 MFU | 126143 tok/s step 10861/19560 | loss 3.361983 (-0.48z)| norm 0.2826 (-0.13z)| lr 2.64e-04 | 4147.83 ms | 32.6% bf16 MFU | 126156 tok/s step 10862/19560 | loss 3.347141 (-0.82z)| norm 0.2680 (-0.18z)| lr 2.64e-04 | 
4148.29 ms | 32.5% bf16 MFU | 126167 tok/s step 10863/19560 | loss 3.348407 (-0.78z)| norm 0.2885 (-0.11z)| lr 2.64e-04 | 4143.79 ms | 32.6% bf16 MFU | 126185 tok/s step 10864/19560 | loss 3.359758 (-0.52z)| norm 0.2833 (-0.13z)| lr 2.63e-04 | 4154.36 ms | 32.5% bf16 MFU | 126186 tok/s step 10865/19560 | loss 3.358628 (-0.54z)| norm 0.2976 (-0.08z)| lr 2.63e-04 | 4148.72 ms | 32.5% bf16 MFU | 126195 tok/s step 10866/19560 | loss 3.344757 (-0.85z)| norm 0.2924 (-0.10z)| lr 2.63e-04 | 4149.19 ms | 32.5% bf16 MFU | 126203 tok/s step 10867/19560 | loss 3.333144 (-1.12z)| norm 0.2712 (-0.17z)| lr 2.63e-04 | 4151.45 ms | 32.5% bf16 MFU | 126208 tok/s step 10868/19560 | loss 3.509108 (+2.80z)| norm 6.0692 (+9.64z)| lr 2.63e-04 | 4147.58 ms | 32.6% bf16 MFU | 126218 tok/s step 10869/19560 | loss 3.389549 (+0.15z)| norm 0.3403 (-0.05z)| lr 2.63e-04 | 4141.32 ms | 32.6% bf16 MFU | 126237 tok/s step 10870/19560 | loss 3.403665 (+0.46z)| norm 0.2846 (-0.14z)| lr 2.63e-04 | 4147.58 ms | 32.6% bf16 MFU | 126245 tok/s step 10871/19560 | loss 3.374907 (-0.17z)| norm 0.3586 (-0.02z)| lr 2.63e-04 | 4149.66 ms | 32.5% bf16 MFU | 126250 tok/s step 10872/19560 | loss 3.395554 (+0.30z)| norm 0.2834 (-0.14z)| lr 2.63e-04 | 4146.83 ms | 32.6% bf16 MFU | 126259 tok/s step 10873/19560 | loss 3.320145 (-1.38z)| norm 0.3101 (-0.10z)| lr 2.63e-04 | 4149.12 ms | 32.5% bf16 MFU | 126265 tok/s step 10874/19560 | loss 3.361557 (-0.45z)| norm 0.2971 (-0.12z)| lr 2.63e-04 | 4144.51 ms | 32.6% bf16 MFU | 126276 tok/s step 10875/19560 | loss 3.372050 (-0.21z)| norm 0.3036 (-0.11z)| lr 2.63e-04 | 4150.19 ms | 32.5% bf16 MFU | 126279 tok/s step 10876/19560 | loss 3.311810 (-1.53z)| norm 0.2780 (-0.16z)| lr 2.63e-04 | 4147.30 ms | 32.6% bf16 MFU | 126286 tok/s step 10877/19560 | loss 3.324527 (-1.23z)| norm 0.5711 (+0.34z)| lr 2.63e-04 | 4150.19 ms | 32.5% bf16 MFU | 126288 tok/s step 10878/19560 | loss 3.362525 (-0.38z)| norm 0.3043 (-0.12z)| lr 2.63e-04 | 4147.82 ms | 32.6% bf16 MFU | 126294 tok/s step 
10879/19560 | loss 3.411085 (+0.69z)| norm 0.2747 (-0.17z)| lr 2.63e-04 | 4145.64 ms | 32.6% bf16 MFU | 126302 tok/s step 10880/19560 | loss 3.351729 (-0.62z)| norm 0.3149 (-0.10z)| lr 2.63e-04 | 4145.09 ms | 32.6% bf16 MFU | 126311 tok/s step 10881/19560 | loss 3.375959 (-0.09z)| norm 0.2739 (-0.17z)| lr 2.63e-04 | 4145.38 ms | 32.6% bf16 MFU | 126320 tok/s step 10882/19560 | loss 3.419877 (+0.88z)| norm 0.2753 (-0.17z)| lr 2.63e-04 | 4150.44 ms | 32.5% bf16 MFU | 126320 tok/s step 10883/19560 | loss 3.416088 (+0.78z)| norm 0.2869 (-0.15z)| lr 2.63e-04 | 4147.61 ms | 32.6% bf16 MFU | 126324 tok/s step 10884/19560 | loss 3.356895 (-0.52z)| norm 0.2919 (-0.14z)| lr 2.62e-04 | 4151.26 ms | 32.5% bf16 MFU | 126323 tok/s step 10885/19560 | loss 3.348911 (-0.70z)| norm 0.2643 (-0.18z)| lr 2.62e-04 | 4150.63 ms | 32.5% bf16 MFU | 126322 tok/s step 10886/19560 | loss 3.379195 (-0.01z)| norm 0.2653 (-0.18z)| lr 2.62e-04 | 4149.49 ms | 32.5% bf16 MFU | 126324 tok/s step 10887/19560 | loss 3.369333 (-0.23z)| norm 0.2673 (-0.18z)| lr 2.62e-04 | 4147.63 ms | 32.6% bf16 MFU | 126328 tok/s step 10888/19560 | loss 3.323509 (-1.29z)| norm 0.2765 (-0.16z)| lr 2.62e-04 | 4149.82 ms | 32.5% bf16 MFU | 126328 tok/s step 10889/19560 | loss 3.379998 (+0.03z)| norm 1.0491 (+1.13z)| lr 2.62e-04 | 4145.47 ms | 32.6% bf16 MFU | 126336 tok/s step 10890/19560 | loss 3.390969 (+0.28z)| norm 0.2957 (-0.14z)| lr 2.62e-04 | 4149.01 ms | 32.5% bf16 MFU | 126337 tok/s step 10891/19560 | loss 3.390939 (+0.28z)| norm 0.2709 (-0.18z)| lr 2.62e-04 | 4149.02 ms | 32.5% bf16 MFU | 126338 tok/s step 10892/19560 | loss 3.378161 (-0.03z)| norm 0.2767 (-0.17z)| lr 2.62e-04 | 4148.60 ms | 32.5% bf16 MFU | 126340 tok/s step 10893/19560 | loss 3.421509 (+0.98z)| norm 0.2778 (-0.17z)| lr 2.62e-04 | 4146.32 ms | 32.6% bf16 MFU | 126346 tok/s step 10894/19560 | loss 3.401785 (+0.51z)| norm 0.3220 (-0.10z)| lr 2.62e-04 | 4153.21 ms | 32.5% bf16 MFU | 126340 tok/s step 10895/19560 | loss 3.378992 (-0.01z)| norm 
0.2889 (-0.15z)| lr 2.62e-04 | 4149.49 ms | 32.5% bf16 MFU | 126341 tok/s step 10896/19560 | loss 3.323829 (-1.30z)| norm 0.2783 (-0.17z)| lr 2.62e-04 | 4146.69 ms | 32.6% bf16 MFU | 126345 tok/s step 10897/19560 | loss 3.324456 (-1.27z)| norm 0.2596 (-0.20z)| lr 2.62e-04 | 4150.47 ms | 32.5% bf16 MFU | 126344 tok/s step 10898/19560 | loss 3.470276 (+2.13z)| norm 0.2938 (-0.14z)| lr 2.62e-04 | 4146.44 ms | 32.6% bf16 MFU | 126349 tok/s step 10899/19560 | loss 3.349465 (-0.69z)| norm 0.2860 (-0.16z)| lr 2.62e-04 | 4154.02 ms | 32.5% bf16 MFU | 126342 tok/s step 10900/19560 | loss 3.393551 (+0.34z)| norm 0.2962 (-0.14z)| lr 2.62e-04 | 4147.42 ms | 32.6% bf16 MFU | 126346 tok/s step 10901/19560 | loss 3.400565 (+0.51z)| norm 0.2711 (-0.18z)| lr 2.62e-04 | 4150.93 ms | 32.5% bf16 MFU | 126344 tok/s step 10902/19560 | loss 3.414669 (+0.83z)| norm 0.2819 (-0.16z)| lr 2.62e-04 | 4147.80 ms | 32.6% bf16 MFU | 126347 tok/s step 10903/19560 | loss 3.388110 (+0.22z)| norm 0.2901 (-0.15z)| lr 2.62e-04 | 4152.75 ms | 32.5% bf16 MFU | 126342 tok/s step 10904/19560 | loss 3.364934 (-0.33z)| norm 0.2752 (-0.17z)| lr 2.61e-04 | 4149.35 ms | 32.5% bf16 MFU | 126343 tok/s step 10905/19560 | loss 3.392464 (+0.32z)| norm 0.2667 (-0.19z)| lr 2.61e-04 | 4149.22 ms | 32.5% bf16 MFU | 126343 tok/s step 10906/19560 | loss 3.350114 (-0.67z)| norm 0.2640 (-0.19z)| lr 2.61e-04 | 4152.41 ms | 32.5% bf16 MFU | 126339 tok/s step 10907/19560 | loss 3.449107 (+1.63z)| norm 0.2580 (-0.20z)| lr 2.61e-04 | 4149.12 ms | 32.5% bf16 MFU | 126340 tok/s step 10908/19560 | loss 3.415227 (+0.83z)| norm 0.2643 (-0.19z)| lr 2.61e-04 | 4150.03 ms | 32.5% bf16 MFU | 126340 tok/s step 10909/19560 | loss 3.388721 (+0.23z)| norm 0.2633 (-0.19z)| lr 2.61e-04 | 4150.44 ms | 32.5% bf16 MFU | 126339 tok/s step 10910/19560 | loss 3.398039 (+0.44z)| norm 0.2694 (-0.18z)| lr 2.61e-04 | 4149.75 ms | 32.5% bf16 MFU | 126339 tok/s step 10911/19560 | loss 3.390435 (+0.26z)| norm 0.2596 (-0.20z)| lr 2.61e-04 | 4152.67 ms | 
32.5% bf16 MFU | 126335 tok/s step 10912/19560 | loss 3.420891 (+0.97z)| norm 0.2777 (-0.17z)| lr 2.61e-04 | 4149.49 ms | 32.5% bf16 MFU | 126336 tok/s step 10913/19560 | loss 3.407652 (+0.68z)| norm 0.2794 (-0.17z)| lr 2.61e-04 | 4152.84 ms | 32.5% bf16 MFU | 126331 tok/s step 10914/19560 | loss 3.334498 (-1.04z)| norm 0.2543 (-0.21z)| lr 2.61e-04 | 4152.15 ms | 32.5% bf16 MFU | 126328 tok/s step 10915/19560 | loss 3.372056 (-0.16z)| norm 0.2796 (-0.16z)| lr 2.61e-04 | 4149.00 ms | 32.5% bf16 MFU | 126330 tok/s step 10916/19560 | loss 3.345001 (-0.79z)| norm 0.2681 (-0.18z)| lr 2.61e-04 | 4149.11 ms | 32.5% bf16 MFU | 126332 tok/s step 10917/19560 | loss 3.387903 (+0.21z)| norm 0.2759 (-0.17z)| lr 2.61e-04 | 4149.75 ms | 32.5% bf16 MFU | 126332 tok/s step 10918/19560 | loss 3.386758 (+0.17z)| norm 0.2735 (-0.17z)| lr 2.61e-04 | 4147.90 ms | 32.6% bf16 MFU | 126335 tok/s step 10919/19560 | loss 3.346440 (-0.79z)| norm 0.2659 (-0.19z)| lr 2.61e-04 | 4148.84 ms | 32.5% bf16 MFU | 126337 tok/s step 10920/19560 | loss 3.363270 (-0.39z)| norm 0.2750 (-0.17z)| lr 2.61e-04 | 4151.23 ms | 32.5% bf16 MFU | 126335 tok/s step 10921/19560 | loss 3.383605 (+0.09z)| norm 0.2674 (-0.18z)| lr 2.61e-04 | 4148.58 ms | 32.5% bf16 MFU | 126337 tok/s step 10922/19560 | loss 3.336383 (-1.02z)| norm 0.2725 (-0.18z)| lr 2.61e-04 | 4150.96 ms | 32.5% bf16 MFU | 126336 tok/s step 10923/19560 | loss 3.373276 (-0.15z)| norm 0.2750 (-0.17z)| lr 2.61e-04 | 4150.77 ms | 32.5% bf16 MFU | 126334 tok/s step 10924/19560 | loss 3.335274 (-1.03z)| norm 0.2945 (-0.14z)| lr 2.60e-04 | 4150.31 ms | 32.5% bf16 MFU | 126334 tok/s step 10925/19560 | loss 3.343874 (-0.83z)| norm 0.2554 (-0.20z)| lr 2.60e-04 | 4150.93 ms | 32.5% bf16 MFU | 126333 tok/s step 10926/19560 | loss 3.422102 (+1.01z)| norm 0.2986 (-0.13z)| lr 2.60e-04 | 4752.47 ms | 28.4% bf16 MFU | 125532 tok/s step 10927/19560 | loss 3.373622 (-0.14z)| norm 0.2712 (-0.18z)| lr 2.60e-04 | 4201.96 ms | 32.1% bf16 MFU | 125494 tok/s step 10928/19560 
| loss 3.352186 (-0.65z)| norm 0.2872 (-0.15z)| lr 2.60e-04 | 4357.15 ms | 31.0% bf16 MFU | 125236 tok/s step 10929/19560 | loss 3.350960 (-0.69z)| norm 0.2744 (-0.17z)| lr 2.60e-04 | 4238.35 ms | 31.9% bf16 MFU | 125159 tok/s step 10930/19560 | loss 3.429908 (+1.18z)| norm 0.2766 (-0.17z)| lr 2.60e-04 | 4312.68 ms | 31.3% bf16 MFU | 124979 tok/s step 10931/19560 | loss 3.295398 (-2.00z)| norm 0.2984 (-0.13z)| lr 2.60e-04 | 4149.44 ms | 32.5% bf16 MFU | 125048 tok/s step 10932/19560 | loss 3.324705 (-1.31z)| norm 0.2694 (-0.18z)| lr 2.60e-04 | 4199.59 ms | 32.2% bf16 MFU | 125038 tok/s step 10933/19560 | loss 3.371690 (-0.19z)| norm 0.2794 (-0.16z)| lr 2.60e-04 | 4253.16 ms | 31.7% bf16 MFU | 124949 tok/s step 10934/19560 | loss 3.331827 (-1.13z)| norm 0.2951 (-0.14z)| lr 2.60e-04 | 4226.36 ms | 31.9% bf16 MFU | 124904 tok/s step 10935/19560 | loss 3.358653 (-0.49z)| norm 0.2765 (-0.17z)| lr 2.60e-04 | 4147.96 ms | 32.6% bf16 MFU | 124979 tok/s step 10936/19560 | loss 3.363022 (-0.39z)| norm 0.7386 (+0.60z)| lr 2.60e-04 | 4201.06 ms | 32.1% bf16 MFU | 124970 tok/s step 10937/19560 | loss 3.325240 (-1.27z)| norm 0.2870 (-0.16z)| lr 2.60e-04 | 4181.59 ms | 32.3% bf16 MFU | 124991 tok/s step 10938/19560 | loss 3.349362 (-0.70z)| norm 0.2981 (-0.14z)| lr 2.60e-04 | 4151.15 ms | 32.5% bf16 MFU | 125056 tok/s step 10939/19560 | loss 3.368892 (-0.24z)| norm 0.2906 (-0.15z)| lr 2.60e-04 | 4145.62 ms | 32.6% bf16 MFU | 125127 tok/s step 10940/19560 | loss 3.457404 (+1.81z)| norm 0.2791 (-0.17z)| lr 2.60e-04 | 4148.71 ms | 32.5% bf16 MFU | 125189 tok/s step 10941/19560 | loss 3.367115 (-0.29z)| norm 0.2911 (-0.15z)| lr 2.60e-04 | 4148.80 ms | 32.5% bf16 MFU | 125248 tok/s step 10942/19560 | loss 3.333905 (-1.06z)| norm 0.2699 (-0.19z)| lr 2.60e-04 | 4150.80 ms | 32.5% bf16 MFU | 125301 tok/s step 10943/19560 | loss 3.416241 (+0.84z)| norm 0.3027 (-0.13z)| lr 2.60e-04 | 4147.76 ms | 32.6% bf16 MFU | 125356 tok/s step 10944/19560 | loss 3.446033 (+1.51z)| norm 0.2920 (-0.15z)| 
lr 2.59e-04 | 4146.22 ms | 32.6% bf16 MFU | 125411 tok/s step 10945/19560 | loss 3.316556 (-1.47z)| norm 0.2935 (-0.15z)| lr 2.59e-04 | 4144.61 ms | 32.6% bf16 MFU | 125465 tok/s step 10946/19560 | loss 3.351480 (-0.67z)| norm 0.2752 (-0.18z)| lr 2.59e-04 | 4148.47 ms | 32.5% bf16 MFU | 125511 tok/s step 10947/19560 | loss 3.361335 (-0.43z)| norm 0.2676 (-0.19z)| lr 2.59e-04 | 4146.26 ms | 32.6% bf16 MFU | 125558 tok/s step 10948/19560 | loss 3.338934 (-0.96z)| norm 0.2737 (-0.17z)| lr 2.59e-04 | 4145.73 ms | 32.6% bf16 MFU | 125603 tok/s step 10949/19560 | loss 3.394393 (+0.39z)| norm 0.2736 (-0.17z)| lr 2.59e-04 | 4153.31 ms | 32.5% bf16 MFU | 125635 tok/s step 10950/19560 | loss 3.357499 (-0.50z)| norm 0.9831 (+1.21z)| lr 2.59e-04 | 4148.76 ms | 32.5% bf16 MFU | 125672 tok/s step 10951/19560 | loss 3.396910 (+0.49z)| norm 0.3144 (-0.06z)| lr 2.59e-04 | 4151.28 ms | 32.5% bf16 MFU | 125703 tok/s step 10952/19560 | loss 3.365792 (-0.29z)| norm 0.2741 (-0.14z)| lr 2.59e-04 | 4151.99 ms | 32.5% bf16 MFU | 125731 tok/s step 10953/19560 | loss 3.387973 (+0.28z)| norm 0.2999 (-0.09z)| lr 2.59e-04 | 4153.27 ms | 32.5% bf16 MFU | 125757 tok/s step 10954/19560 | loss 3.348309 (-0.73z)| norm 0.2783 (-0.13z)| lr 2.59e-04 | 4146.14 ms | 32.6% bf16 MFU | 125791 tok/s step 10955/19560 | loss 3.329832 (-1.19z)| norm 0.2785 (-0.13z)| lr 2.59e-04 | 4150.29 ms | 32.5% bf16 MFU | 125818 tok/s step 10956/19560 | loss 3.363317 (-0.34z)| norm 0.2680 (-0.15z)| lr 2.59e-04 | 4148.62 ms | 32.5% bf16 MFU | 125846 tok/s step 10957/19560 | loss 3.411061 (+0.86z)| norm 0.2638 (-0.16z)| lr 2.59e-04 | 4150.84 ms | 32.5% bf16 MFU | 125869 tok/s step 10958/19560 | loss 3.422241 (+1.18z)| norm 0.2934 (-0.10z)| lr 2.59e-04 | 4148.13 ms | 32.5% bf16 MFU | 125895 tok/s step 10959/19560 | loss 3.367387 (-0.24z)| norm 1.6440 (+2.43z)| lr 2.59e-04 | 4149.58 ms | 32.5% bf16 MFU | 125918 tok/s step 10960/19560 | loss 3.369994 (-0.16z)| norm 0.3225 (-0.06z)| lr 2.59e-04 | 4147.13 ms | 32.6% bf16 MFU | 
125943 tok/s step 10961/19560 | loss 3.389074 (+0.34z)| norm 0.2976 (-0.11z)| lr 2.59e-04 | 4148.60 ms | 32.5% bf16 MFU | 125965 tok/s step 10962/19560 | loss 3.375605 (+0.00z)| norm 0.3160 (-0.08z)| lr 2.59e-04 | 4150.68 ms | 32.5% bf16 MFU | 125982 tok/s step 10963/19560 | loss 3.393522 (+0.48z)| norm 0.2669 (-0.17z)| lr 2.59e-04 | 4149.73 ms | 32.5% bf16 MFU | 126000 tok/s step 10964/19560 | loss 3.385723 (+0.26z)| norm 0.2925 (-0.12z)| lr 2.59e-04 | 4153.59 ms | 32.5% bf16 MFU | 126011 tok/s step 10965/19560 | loss 3.345366 (-0.80z)| norm 0.3081 (-0.09z)| lr 2.58e-04 | 4147.71 ms | 32.6% bf16 MFU | 126031 tok/s step 10966/19560 | loss 3.390693 (+0.40z)| norm 0.2765 (-0.15z)| lr 2.58e-04 | 4151.79 ms | 32.5% bf16 MFU | 126044 tok/s step 10967/19560 | loss 3.375660 (+0.00z)| norm 0.2637 (-0.17z)| lr 2.58e-04 | 4146.63 ms | 32.6% bf16 MFU | 126063 tok/s step 10968/19560 | loss 3.331690 (-1.15z)| norm 0.2515 (-0.20z)| lr 2.58e-04 | 4149.48 ms | 32.5% bf16 MFU | 126078 tok/s step 10969/19560 | loss 3.461135 (+2.26z)| norm 0.2734 (-0.16z)| lr 2.58e-04 | 4148.68 ms | 32.5% bf16 MFU | 126092 tok/s step 10970/19560 | loss 3.425668 (+1.31z)| norm 0.2723 (-0.16z)| lr 2.58e-04 | 4150.06 ms | 32.5% bf16 MFU | 126104 tok/s step 10971/19560 | loss 3.377761 (+0.06z)| norm 0.2630 (-0.18z)| lr 2.58e-04 | 4153.38 ms | 32.5% bf16 MFU | 126111 tok/s step 10972/19560 | loss 3.424525 (+1.26z)| norm 0.2715 (-0.16z)| lr 2.58e-04 | 4153.24 ms | 32.5% bf16 MFU | 126117 tok/s step 10973/19560 | loss 3.325942 (-1.29z)| norm 0.2808 (-0.14z)| lr 2.58e-04 | 4148.22 ms | 32.5% bf16 MFU | 126131 tok/s step 10974/19560 | loss 3.435434 (+1.52z)| norm 0.2674 (-0.17z)| lr 2.58e-04 | 4150.17 ms | 32.5% bf16 MFU | 126141 tok/s step 10975/19560 | loss 3.392368 (+0.40z)| norm 0.2694 (-0.16z)| lr 2.58e-04 | 4198.59 ms | 32.2% bf16 MFU | 126077 tok/s step 10976/19560 | loss 3.365437 (-0.30z)| norm 0.2854 (-0.13z)| lr 2.58e-04 | 4354.35 ms | 31.0% bf16 MFU | 125794 tok/s step 10977/19560 | loss 3.392275 
(+0.40z)| norm 0.2901 (-0.12z)| lr 2.58e-04 | 4180.37 ms | 32.3% bf16 MFU | 125775 tok/s step 10978/19560 | loss 3.417521 (+1.11z)| norm 0.2682 (-0.17z)| lr 2.58e-04 | 4195.80 ms | 32.2% bf16 MFU | 125734 tok/s step 10979/19560 | loss 3.394385 (+0.50z)| norm 0.2983 (-0.11z)| lr 2.58e-04 | 4217.72 ms | 32.0% bf16 MFU | 125662 tok/s step 10980/19560 | loss 3.338421 (-1.01z)| norm 0.2750 (-0.15z)| lr 2.58e-04 | 4177.72 ms | 32.3% bf16 MFU | 125654 tok/s step 10981/19560 | loss 3.335646 (-1.09z)| norm 0.2999 (-0.10z)| lr 2.58e-04 | 4199.84 ms | 32.1% bf16 MFU | 125613 tok/s step 10982/19560 | loss 3.412735 (+1.00z)| norm 0.2685 (-0.16z)| lr 2.58e-04 | 4166.85 ms | 32.4% bf16 MFU | 125624 tok/s step 10983/19560 | loss 3.425552 (+1.37z)| norm 0.2800 (-0.14z)| lr 2.58e-04 | 4429.77 ms | 30.5% bf16 MFU | 125260 tok/s step 10984/19560 | loss 3.391107 (+0.42z)| norm 0.2801 (-0.14z)| lr 2.58e-04 | 4165.27 ms | 32.4% bf16 MFU | 125291 tok/s step 10985/19560 | loss 3.416108 (+1.10z)| norm 0.3031 (-0.10z)| lr 2.57e-04 | 4172.30 ms | 32.4% bf16 MFU | 125309 tok/s step 10986/19560 | loss 3.385437 (+0.25z)| norm 0.2756 (-0.15z)| lr 2.57e-04 | 4153.75 ms | 32.5% bf16 MFU | 125355 tok/s step 10987/19560 | loss 3.383442 (+0.19z)| norm 0.2935 (-0.12z)| lr 2.57e-04 | 4160.86 ms | 32.4% bf16 MFU | 125387 tok/s step 10988/19560 | loss 3.425790 (+1.34z)| norm 0.2874 (-0.13z)| lr 2.57e-04 | 4151.36 ms | 32.5% bf16 MFU | 125433 tok/s step 10989/19560 | loss 3.441027 (+1.72z)| norm 0.2738 (-0.15z)| lr 2.57e-04 | 4166.25 ms | 32.4% bf16 MFU | 125453 tok/s step 10990/19560 | loss 3.413982 (+0.97z)| norm 0.3033 (-0.10z)| lr 2.57e-04 | 4156.25 ms | 32.5% bf16 MFU | 125488 tok/s step 10991/19560 | loss 3.387038 (+0.23z)| norm 0.2740 (-0.15z)| lr 2.57e-04 | 4161.23 ms | 32.4% bf16 MFU | 125513 tok/s step 10992/19560 | loss 3.491857 (+2.95z)| norm 0.2913 (-0.12z)| lr 2.57e-04 | 4163.31 ms | 32.4% bf16 MFU | 125534 tok/s step 10993/19560 | loss 3.413514 (+0.88z)| norm 0.2649 (-0.17z)| lr 2.57e-04 | 
4163.75 ms | 32.4% bf16 MFU | 125553 tok/s step 10994/19560 | loss 3.536200 (+3.83z)| norm 0.2802 (-0.14z)| lr 2.57e-04 | 4159.51 ms | 32.5% bf16 MFU | 125578 tok/s step 10995/19560 | loss 3.475000 (+2.25z)| norm 0.2928 (-0.12z)| lr 2.57e-04 | 4154.72 ms | 32.5% bf16 MFU | 125608 tok/s step 10996/19560 | loss 3.399867 (+0.46z)| norm 0.2870 (-0.15z)| lr 2.57e-04 | 4154.18 ms | 32.5% bf16 MFU | 125638 tok/s step 10997/19560 | loss 3.383235 (+0.04z)| norm 0.2838 (-0.17z)| lr 2.57e-04 | 4158.74 ms | 32.5% bf16 MFU | 125660 tok/s step 10998/19560 | loss 3.455477 (+1.84z)| norm 0.2927 (-0.11z)| lr 2.57e-04 | 4163.23 ms | 32.4% bf16 MFU | 125673 tok/s step 10999/19560 | loss 3.406047 (+0.59z)| norm 0.3225 (+0.08z)| lr 2.57e-04 | 4158.09 ms | 32.5% bf16 MFU | 125694 tok/s step 11000/19560 | loss 3.367874 (-0.35z)| norm 0.2690 (-0.26z)| lr 2.57e-04 | 4162.98 ms | 32.4% bf16 MFU | 125706 tok/s val loss 3.369678 evaluating HellaSwag: 0/1256 evaluating HellaSwag: 10/1256 evaluating HellaSwag: 20/1256 evaluating HellaSwag: 30/1256 evaluating HellaSwag: 40/1256 evaluating HellaSwag: 50/1256 evaluating HellaSwag: 60/1256 evaluating HellaSwag: 70/1256 evaluating HellaSwag: 80/1256 evaluating HellaSwag: 90/1256 evaluating HellaSwag: 100/1256 evaluating HellaSwag: 110/1256 evaluating HellaSwag: 120/1256 evaluating HellaSwag: 130/1256 evaluating HellaSwag: 140/1256 evaluating HellaSwag: 150/1256 evaluating HellaSwag: 160/1256 evaluating HellaSwag: 170/1256 evaluating HellaSwag: 180/1256 evaluating HellaSwag: 190/1256 evaluating HellaSwag: 200/1256 evaluating HellaSwag: 210/1256 evaluating HellaSwag: 220/1256 evaluating HellaSwag: 230/1256 evaluating HellaSwag: 240/1256 evaluating HellaSwag: 250/1256 evaluating HellaSwag: 260/1256 evaluating HellaSwag: 270/1256 evaluating HellaSwag: 280/1256 evaluating HellaSwag: 290/1256 evaluating HellaSwag: 300/1256 evaluating HellaSwag: 310/1256 evaluating HellaSwag: 320/1256 evaluating HellaSwag: 330/1256 evaluating HellaSwag: 340/1256 evaluating 
HellaSwag: 350/1256 evaluating HellaSwag: 360/1256 evaluating HellaSwag: 370/1256 evaluating HellaSwag: 380/1256 evaluating HellaSwag: 390/1256 evaluating HellaSwag: 400/1256 evaluating HellaSwag: 410/1256 evaluating HellaSwag: 420/1256 evaluating HellaSwag: 430/1256 evaluating HellaSwag: 440/1256 evaluating HellaSwag: 450/1256 evaluating HellaSwag: 460/1256 evaluating HellaSwag: 470/1256 evaluating HellaSwag: 480/1256 evaluating HellaSwag: 490/1256 evaluating HellaSwag: 500/1256 evaluating HellaSwag: 510/1256 evaluating HellaSwag: 520/1256 evaluating HellaSwag: 530/1256 evaluating HellaSwag: 540/1256 evaluating HellaSwag: 550/1256 evaluating HellaSwag: 560/1256 evaluating HellaSwag: 570/1256 evaluating HellaSwag: 580/1256 evaluating HellaSwag: 590/1256 evaluating HellaSwag: 600/1256 evaluating HellaSwag: 610/1256 evaluating HellaSwag: 620/1256 evaluating HellaSwag: 630/1256 evaluating HellaSwag: 640/1256 evaluating HellaSwag: 650/1256 evaluating HellaSwag: 660/1256 evaluating HellaSwag: 670/1256 evaluating HellaSwag: 680/1256 evaluating HellaSwag: 690/1256 evaluating HellaSwag: 700/1256 evaluating HellaSwag: 710/1256 evaluating HellaSwag: 720/1256 evaluating HellaSwag: 730/1256 evaluating HellaSwag: 740/1256 evaluating HellaSwag: 750/1256 evaluating HellaSwag: 760/1256 evaluating HellaSwag: 770/1256 evaluating HellaSwag: 780/1256 evaluating HellaSwag: 790/1256 evaluating HellaSwag: 800/1256 evaluating HellaSwag: 810/1256 evaluating HellaSwag: 820/1256 evaluating HellaSwag: 830/1256 evaluating HellaSwag: 840/1256 evaluating HellaSwag: 850/1256 evaluating HellaSwag: 860/1256 evaluating HellaSwag: 870/1256 evaluating HellaSwag: 880/1256 evaluating HellaSwag: 890/1256 evaluating HellaSwag: 900/1256 evaluating HellaSwag: 910/1256 evaluating HellaSwag: 920/1256 evaluating HellaSwag: 930/1256 evaluating HellaSwag: 940/1256 evaluating HellaSwag: 950/1256 evaluating HellaSwag: 960/1256 evaluating HellaSwag: 970/1256 evaluating HellaSwag: 980/1256 evaluating HellaSwag: 
990/1256 evaluating HellaSwag: 1000/1256 evaluating HellaSwag: 1010/1256 evaluating HellaSwag: 1020/1256 evaluating HellaSwag: 1030/1256 evaluating HellaSwag: 1040/1256 evaluating HellaSwag: 1050/1256 evaluating HellaSwag: 1060/1256 evaluating HellaSwag: 1070/1256 evaluating HellaSwag: 1080/1256 evaluating HellaSwag: 1090/1256 evaluating HellaSwag: 1100/1256 evaluating HellaSwag: 1110/1256 evaluating HellaSwag: 1120/1256 evaluating HellaSwag: 1130/1256 evaluating HellaSwag: 1140/1256 evaluating HellaSwag: 1150/1256 evaluating HellaSwag: 1160/1256 evaluating HellaSwag: 1170/1256 evaluating HellaSwag: 1180/1256 evaluating HellaSwag: 1190/1256 evaluating HellaSwag: 1200/1256 evaluating HellaSwag: 1210/1256 evaluating HellaSwag: 1220/1256 evaluating HellaSwag: 1230/1256 evaluating HellaSwag: 1240/1256 evaluating HellaSwag: 1250/1256 HellaSwag: 2915/10042 = 0.290281 step 11001/19560 | loss 3.421032 (+0.96z)| norm 0.2999 (-0.06z)| lr 2.57e-04 | 4165.06 ms | 32.4% bf16 MFU | 125715 tok/s step 11002/19560 | loss 3.360772 (-0.55z)| norm 0.2713 (-0.24z)| lr 2.57e-04 | 4160.83 ms | 32.4% bf16 MFU | 125730 tok/s step 11003/19560 | loss 3.374583 (-0.21z)| norm 0.2783 (-0.20z)| lr 2.57e-04 | 4161.46 ms | 32.4% bf16 MFU | 125742 tok/s step 11004/19560 | loss 3.403238 (+0.50z)| norm 0.2739 (-0.22z)| lr 2.57e-04 | 4167.48 ms | 32.4% bf16 MFU | 125746 tok/s step 11005/19560 | loss 3.344236 (-1.01z)| norm 0.2632 (-0.28z)| lr 2.56e-04 | 4162.46 ms | 32.4% bf16 MFU | 125756 tok/s step 11006/19560 | loss 3.421210 (+0.94z)| norm 0.2843 (-0.14z)| lr 2.56e-04 | 4163.86 ms | 32.4% bf16 MFU | 125764 tok/s step 11007/19560 | loss 3.409510 (+0.65z)| norm 0.2609 (-0.29z)| lr 2.56e-04 | 4173.94 ms | 32.3% bf16 MFU | 125756 tok/s step 11008/19560 | loss 3.430413 (+1.16z)| norm 0.2842 (-0.14z)| lr 2.56e-04 | 4151.82 ms | 32.5% bf16 MFU | 125782 tok/s step 11009/19560 | loss 3.385244 (+0.01z)| norm 0.2653 (-0.26z)| lr 2.56e-04 | 4163.54 ms | 32.4% bf16 MFU | 125789 tok/s step 11010/19560 | loss 
3.381966 (-0.07z)| norm 0.2687 (-0.24z)| lr 2.56e-04 | 4164.88 ms | 32.4% bf16 MFU | 125794 tok/s step 11011/19560 | loss 3.537241 (+3.67z)| norm 0.2777 (-0.18z)| lr 2.56e-04 | 4155.32 ms | 32.5% bf16 MFU | 125813 tok/s step 11012/19560 | loss 3.376647 (-0.22z)| norm 0.2686 (-0.24z)| lr 2.56e-04 | 4161.85 ms | 32.4% bf16 MFU | 125821 tok/s step 11013/19560 | loss 3.347661 (-0.92z)| norm 0.2720 (-0.22z)| lr 2.56e-04 | 4155.52 ms | 32.5% bf16 MFU | 125838 tok/s step 11014/19560 | loss 3.414511 (+0.69z)| norm 0.2718 (-0.22z)| lr 2.56e-04 | 4168.63 ms | 32.4% bf16 MFU | 125835 tok/s step 11015/19560 | loss 3.435674 (+1.18z)| norm 0.2692 (-0.24z)| lr 2.56e-04 | 4158.85 ms | 32.5% bf16 MFU | 125847 tok/s step 11016/19560 | loss 3.377606 (-0.23z)| norm 0.2620 (-0.28z)| lr 2.56e-04 | 4164.18 ms | 32.4% bf16 MFU | 125849 tok/s step 11017/19560 | loss 3.407464 (+0.49z)| norm 0.2683 (-0.22z)| lr 2.56e-04 | 4165.03 ms | 32.4% bf16 MFU | 125851 tok/s step 11018/19560 | loss 3.379062 (-0.19z)| norm 0.2828 (-0.12z)| lr 2.56e-04 | 4314.54 ms | 31.3% bf16 MFU | 125634 tok/s step 11019/19560 | loss 3.351212 (-0.86z)| norm 0.2707 (-0.20z)| lr 2.56e-04 | 4172.84 ms | 32.4% bf16 MFU | 125635 tok/s step 11020/19560 | loss 3.390053 (+0.08z)| norm 0.2710 (-0.20z)| lr 2.56e-04 | 4148.23 ms | 32.5% bf16 MFU | 125672 tok/s step 11021/19560 | loss 3.397840 (+0.27z)| norm 0.2707 (-0.20z)| lr 2.56e-04 | 4159.43 ms | 32.5% bf16 MFU | 125691 tok/s step 11022/19560 | loss 3.438580 (+1.25z)| norm 0.2605 (-0.27z)| lr 2.56e-04 | 4176.60 ms | 32.3% bf16 MFU | 125683 tok/s step 11023/19560 | loss 3.477502 (+2.13z)| norm 0.2815 (-0.12z)| lr 2.56e-04 | 4893.27 ms | 27.6% bf16 MFU | 124756 tok/s step 11024/19560 | loss 3.381459 (-0.16z)| norm 0.2550 (-0.31z)| lr 2.56e-04 | 4280.54 ms | 31.5% bf16 MFU | 124642 tok/s step 11025/19560 | loss 3.436500 (+1.14z)| norm 0.2946 (-0.03z)| lr 2.55e-04 | 4157.04 ms | 32.5% bf16 MFU | 124716 tok/s step 11026/19560 | loss 3.407818 (+0.47z)| norm 0.2921 (-0.05z)| lr 
2.55e-04 | 4154.91 ms | 32.5% bf16 MFU | 124790 tok/s step 11027/19560 | loss 3.357430 (-0.76z)| norm 0.3188 (+0.14z)| lr 2.55e-04 | 4158.79 ms | 32.5% bf16 MFU | 124854 tok/s step 11028/19560 | loss 3.393135 (+0.11z)| norm 0.2629 (-0.26z)| lr 2.55e-04 | 4159.33 ms | 32.5% bf16 MFU | 124913 tok/s step 11029/19560 | loss 3.464391 (+1.81z)| norm 0.2873 (-0.08z)| lr 2.55e-04 | 4155.90 ms | 32.5% bf16 MFU | 124976 tok/s step 11030/19560 | loss 3.450271 (+1.46z)| norm 0.2625 (-0.26z)| lr 2.55e-04 | 4164.91 ms | 32.4% bf16 MFU | 125021 tok/s step 11031/19560 | loss 3.447131 (+1.36z)| norm 0.3090 (+0.07z)| lr 2.55e-04 | 4166.43 ms | 32.4% bf16 MFU | 125062 tok/s step 11032/19560 | loss 3.402004 (+0.28z)| norm 0.2672 (-0.23z)| lr 2.55e-04 | 4158.38 ms | 32.5% bf16 MFU | 125113 tok/s step 11033/19560 | loss 3.376832 (-0.31z)| norm 0.3077 (+0.06z)| lr 2.55e-04 | 4160.98 ms | 32.4% bf16 MFU | 125157 tok/s step 11034/19560 | loss 3.472618 (+1.92z)| norm 0.2893 (-0.07z)| lr 2.55e-04 | 4160.69 ms | 32.5% bf16 MFU | 125200 tok/s step 11035/19560 | loss 3.462401 (+1.67z)| norm 0.2932 (-0.05z)| lr 2.55e-04 | 4151.97 ms | 32.5% bf16 MFU | 125253 tok/s step 11036/19560 | loss 3.421900 (+0.72z)| norm 0.2962 (-0.03z)| lr 2.55e-04 | 4163.27 ms | 32.4% bf16 MFU | 125287 tok/s step 11037/19560 | loss 3.422194 (+0.72z)| norm 0.2684 (-0.22z)| lr 2.55e-04 | 4154.05 ms | 32.5% bf16 MFU | 125334 tok/s step 11038/19560 | loss 3.460391 (+1.58z)| norm 0.2847 (-0.11z)| lr 2.55e-04 | 4155.23 ms | 32.5% bf16 MFU | 125376 tok/s step 11039/19560 | loss 3.417490 (+0.59z)| norm 0.3312 (+0.22z)| lr 2.55e-04 | 4166.32 ms | 32.4% bf16 MFU | 125399 tok/s step 11040/19560 | loss 3.445016 (+1.21z)| norm 0.2613 (-0.28z)| lr 2.55e-04 | 4160.89 ms | 32.4% bf16 MFU | 125429 tok/s step 11041/19560 | loss 3.418549 (+0.60z)| norm 0.3173 (+0.12z)| lr 2.55e-04 | 4167.18 ms | 32.4% bf16 MFU | 125448 tok/s step 11042/19560 | loss 3.416369 (+0.54z)| norm 0.2788 (-0.16z)| lr 2.55e-04 | 4164.21 ms | 32.4% bf16 MFU | 125471 
tok/s step 11043/19560 | loss 3.358191 (-0.80z)| norm 0.2744 (-0.19z)| lr 2.55e-04 | 4151.85 ms | 32.5% bf16 MFU | 125511 tok/s step 11044/19560 | loss 3.367434 (-0.59z)| norm 0.2872 (-0.10z)| lr 2.55e-04 | 4153.58 ms | 32.5% bf16 MFU | 125547 tok/s step 11045/19560 | loss 3.451594 (+1.33z)| norm 0.2685 (-0.23z)| lr 2.55e-04 | 4160.04 ms | 32.5% bf16 MFU | 125571 tok/s step 11046/19560 | loss 3.377350 (-0.37z)| norm 0.2898 (-0.08z)| lr 2.54e-04 | 4152.46 ms | 32.5% bf16 MFU | 125606 tok/s step 11047/19560 | loss 3.349464 (-1.01z)| norm 0.2640 (-0.27z)| lr 2.54e-04 | 4156.83 ms | 32.5% bf16 MFU | 125632 tok/s step 11048/19560 | loss 3.366819 (-0.61z)| norm 0.2693 (-0.23z)| lr 2.54e-04 | 4155.94 ms | 32.5% bf16 MFU | 125658 tok/s step 11049/19560 | loss 3.408146 (+0.33z)| norm 0.2779 (-0.17z)| lr 2.54e-04 | 4156.19 ms | 32.5% bf16 MFU | 125682 tok/s step 11050/19560 | loss 3.367804 (-0.60z)| norm 0.3044 (+0.02z)| lr 2.54e-04 | 4158.60 ms | 32.5% bf16 MFU | 125702 tok/s step 11051/19560 | loss 3.423438 (+0.67z)| norm 0.2669 (-0.25z)| lr 2.54e-04 | 4155.68 ms | 32.5% bf16 MFU | 125725 tok/s step 11052/19560 | loss 3.483970 (+2.02z)| norm 0.3154 (+0.10z)| lr 2.54e-04 | 4163.48 ms | 32.4% bf16 MFU | 125735 tok/s step 11053/19560 | loss 3.434704 (+0.88z)| norm 0.2947 (-0.05z)| lr 2.54e-04 | 4166.96 ms | 32.4% bf16 MFU | 125739 tok/s step 11054/19560 | loss 3.470231 (+1.67z)| norm 0.3026 (+0.00z)| lr 2.54e-04 | 4165.06 ms | 32.4% bf16 MFU | 125746 tok/s step 11055/19560 | loss 3.406179 (+0.21z)| norm 0.3039 (+0.01z)| lr 2.54e-04 | 4162.26 ms | 32.4% bf16 MFU | 125757 tok/s step 11056/19560 | loss 3.394421 (-0.06z)| norm 0.2745 (-0.20z)| lr 2.54e-04 | 4153.39 ms | 32.5% bf16 MFU | 125781 tok/s step 11057/19560 | loss 3.419074 (+0.49z)| norm 0.3033 (+0.01z)| lr 2.54e-04 | 4148.58 ms | 32.5% bf16 MFU | 125810 tok/s step 11058/19560 | loss 3.466845 (+1.56z)| norm 0.2605 (-0.30z)| lr 2.54e-04 | 4162.43 ms | 32.4% bf16 MFU | 125818 tok/s step 11059/19560 | loss 3.399185 
(+0.01z)| norm 0.2910 (-0.08z)| lr 2.54e-04 | 4157.21 ms | 32.5% bf16 MFU | 125833 tok/s step 11060/19560 | loss 3.437963 (+0.90z)| norm 0.2944 (-0.06z)| lr 2.54e-04 | 4159.21 ms | 32.5% bf16 MFU | 125844 tok/s step 11061/19560 | loss 3.379896 (-0.46z)| norm 0.2750 (-0.19z)| lr 2.54e-04 | 4159.16 ms | 32.5% bf16 MFU | 125854 tok/s step 11062/19560 | loss 3.386374 (-0.33z)| norm 0.2918 (-0.08z)| lr 2.54e-04 | 4167.33 ms | 32.4% bf16 MFU | 125852 tok/s step 11063/19560 | loss 3.436860 (+0.85z)| norm 0.2740 (-0.20z)| lr 2.54e-04 | 4152.41 ms | 32.5% bf16 MFU | 125873 tok/s step 11064/19560 | loss 3.387440 (-0.32z)| norm 0.2813 (-0.13z)| lr 2.54e-04 | 4240.06 ms | 31.8% bf16 MFU | 125761 tok/s step 11065/19560 | loss 3.389649 (-0.28z)| norm 0.2800 (-0.14z)| lr 2.54e-04 | 4152.11 ms | 32.5% bf16 MFU | 125787 tok/s step 11066/19560 | loss 3.422538 (+0.50z)| norm 0.2850 (-0.10z)| lr 2.53e-04 | 4153.42 ms | 32.5% bf16 MFU | 125809 tok/s step 11067/19560 | loss 3.449029 (+1.12z)| norm 0.2926 (-0.04z)| lr 2.53e-04 | 4153.29 ms | 32.5% bf16 MFU | 125830 tok/s step 11068/19560 | loss 3.343097 (-1.41z)| norm 0.2970 (-0.01z)| lr 2.53e-04 | 4165.32 ms | 32.4% bf16 MFU | 125832 tok/s step 11069/19560 | loss 3.406361 (+0.10z)| norm 0.2745 (-0.18z)| lr 2.53e-04 | 4155.38 ms | 32.5% bf16 MFU | 125849 tok/s step 11070/19560 | loss 3.433808 (+0.75z)| norm 0.3018 (+0.02z)| lr 2.53e-04 | 4156.67 ms | 32.5% bf16 MFU | 125863 tok/s step 11071/19560 | loss 3.395993 (-0.16z)| norm 0.2764 (-0.16z)| lr 2.53e-04 | 4165.18 ms | 32.4% bf16 MFU | 125864 tok/s step 11072/19560 | loss 3.394044 (-0.20z)| norm 0.2902 (-0.06z)| lr 2.53e-04 | 4160.08 ms | 32.5% bf16 MFU | 125872 tok/s step 11073/19560 | loss 3.403024 (+0.00z)| norm 0.3014 (+0.02z)| lr 2.53e-04 | 4156.49 ms | 32.5% bf16 MFU | 125885 tok/s step 11074/19560 | loss 3.409298 (+0.15z)| norm 0.2829 (-0.12z)| lr 2.53e-04 | 4165.84 ms | 32.4% bf16 MFU | 125884 tok/s step 11075/19560 | loss 3.411874 (+0.20z)| norm 0.3023 (+0.02z)| lr 2.53e-04 | 
4159.76 ms | 32.5% bf16 MFU | 125892 tok/s step 11076/19560 | loss 3.393625 (-0.27z)| norm 0.2753 (-0.18z)| lr 2.53e-04 | 4151.54 ms | 32.5% bf16 MFU | 125911 tok/s step 11077/19560 | loss 3.457316 (+1.33z)| norm 0.3072 (+0.06z)| lr 2.53e-04 | 4155.47 ms | 32.5% bf16 MFU | 125924 tok/s step 11078/19560 | loss 3.396366 (-0.22z)| norm 0.2869 (-0.06z)| lr 2.53e-04 | 4161.09 ms | 32.4% bf16 MFU | 125928 tok/s step 11079/19560 | loss 3.435248 (+0.76z)| norm 0.2805 (-0.11z)| lr 2.53e-04 | 4158.13 ms | 32.5% bf16 MFU | 125936 tok/s step 11080/19560 | loss 3.415519 (+0.25z)| norm 0.2909 (-0.02z)| lr 2.53e-04 | 4168.83 ms | 32.4% bf16 MFU | 125927 tok/s step 11081/19560 | loss 3.360895 (-1.13z)| norm 0.2801 (-0.11z)| lr 2.53e-04 | 4152.52 ms | 32.5% bf16 MFU | 125944 tok/s step 11082/19560 | loss 3.451096 (+1.14z)| norm 0.2933 (-0.00z)| lr 2.53e-04 | 4152.60 ms | 32.5% bf16 MFU | 125959 tok/s step 11083/19560 | loss 3.397974 (-0.23z)| norm 0.2768 (-0.14z)| lr 2.53e-04 | 4158.47 ms | 32.5% bf16 MFU | 125965 tok/s step 11084/19560 | loss 3.493131 (+2.17z)| norm 0.2830 (-0.09z)| lr 2.53e-04 | 4155.46 ms | 32.5% bf16 MFU | 125975 tok/s step 11085/19560 | loss 3.378201 (-0.75z)| norm 0.3006 (+0.05z)| lr 2.53e-04 | 4169.46 ms | 32.4% bf16 MFU | 125964 tok/s step 11086/19560 | loss 3.410381 (+0.07z)| norm 0.2835 (-0.09z)| lr 2.52e-04 | 4156.12 ms | 32.5% bf16 MFU | 125973 tok/s step 11087/19560 | loss 3.385819 (-0.56z)| norm 0.2757 (-0.48z)| lr 2.52e-04 | 4158.66 ms | 32.5% bf16 MFU | 125978 tok/s step 11088/19560 | loss 3.428210 (+0.51z)| norm 0.2681 (-0.95z)| lr 2.52e-04 | 4158.46 ms | 32.5% bf16 MFU | 125983 tok/s step 11089/19560 | loss 3.421758 (+0.34z)| norm 0.3008 (+1.14z)| lr 2.52e-04 | 4165.66 ms | 32.4% bf16 MFU | 125977 tok/s step 11090/19560 | loss 3.396911 (-0.30z)| norm 0.2631 (-1.26z)| lr 2.52e-04 | 5085.89 ms | 26.5% bf16 MFU | 124832 tok/s step 11091/19560 | loss 3.404189 (-0.11z)| norm 0.3222 (+2.50z)| lr 2.52e-04 | 4161.83 ms | 32.4% bf16 MFU | 124889 tok/s step 
11092/19560 | loss 3.432680 (+0.61z)| norm 0.2908 (+0.50z)| lr 2.52e-04 | 4160.57 ms | 32.5% bf16 MFU | 124946 tok/s step 11093/19560 | loss 3.423794 (+0.37z)| norm 0.3034 (+1.31z)| lr 2.52e-04 | 4165.21 ms | 32.4% bf16 MFU | 124992 tok/s step 11094/19560 | loss 3.415364 (+0.14z)| norm 0.2659 (-1.08z)| lr 2.52e-04 | 4161.25 ms | 32.4% bf16 MFU | 125042 tok/s step 11095/19560 | loss 3.390489 (-0.50z)| norm 0.2867 (+0.23z)| lr 2.52e-04 | 4151.19 ms | 32.5% bf16 MFU | 125105 tok/s step 11096/19560 | loss 3.364513 (-1.20z)| norm 0.2672 (-1.03z)| lr 2.52e-04 | 4160.95 ms | 32.4% bf16 MFU | 125150 tok/s step 11097/19560 | loss 3.465570 (+1.45z)| norm 0.2766 (-0.43z)| lr 2.52e-04 | 4161.14 ms | 32.4% bf16 MFU | 125192 tok/s step 11098/19560 | loss 3.441202 (+0.81z)| norm 0.2865 (+0.21z)| lr 2.52e-04 | 4150.87 ms | 32.5% bf16 MFU | 125248 tok/s step 11099/19560 | loss 3.519347 (+2.75z)| norm 0.2842 (+0.05z)| lr 2.52e-04 | 4168.94 ms | 32.4% bf16 MFU | 125273 tok/s step 11100/19560 | loss 3.367091 (-1.11z)| norm 0.3065 (+1.48z)| lr 2.52e-04 | 4166.50 ms | 32.4% bf16 MFU | 125302 tok/s step 11101/19560 | loss 3.416729 (+0.13z)| norm 0.2972 (+0.86z)| lr 2.52e-04 | 4159.53 ms | 32.5% bf16 MFU | 125339 tok/s step 11102/19560 | loss 3.389633 (-0.56z)| norm 0.2819 (-0.14z)| lr 2.52e-04 | 4158.03 ms | 32.5% bf16 MFU | 125376 tok/s step 11103/19560 | loss 3.409980 (-0.04z)| norm 0.2765 (-0.49z)| lr 2.52e-04 | 4169.97 ms | 32.4% bf16 MFU | 125394 tok/s step 11104/19560 | loss 3.361762 (-1.28z)| norm 0.2679 (-1.04z)| lr 2.52e-04 | 4168.96 ms | 32.4% bf16 MFU | 125412 tok/s step 11105/19560 | loss 3.378147 (-0.85z)| norm 0.2689 (-0.96z)| lr 2.52e-04 | 4679.35 ms | 28.9% bf16 MFU | 124744 tok/s step 11106/19560 | loss 3.461483 (+1.28z)| norm 0.2810 (-0.18z)| lr 2.51e-04 | 4163.60 ms | 32.4% bf16 MFU | 124803 tok/s step 11107/19560 | loss 3.387388 (-0.62z)| norm 0.2859 (+0.14z)| lr 2.51e-04 | 4158.60 ms | 32.5% bf16 MFU | 124866 tok/s step 11108/19560 | loss 3.410277 (-0.05z)| norm 
0.2681 (-1.02z)| lr 2.51e-04 | 4158.18 ms | 32.5% bf16 MFU | 124927 tok/s step 11109/19560 | loss 3.378676 (-0.89z)| norm 0.2817 (-0.12z)| lr 2.51e-04 | 4173.32 ms | 32.4% bf16 MFU | 124962 tok/s step 11110/19560 | loss 3.440725 (+0.73z)| norm 0.2994 (+1.02z)| lr 2.51e-04 | 4169.13 ms | 32.4% bf16 MFU | 125002 tok/s step 11111/19560 | loss 3.454105 (+1.07z)| norm 0.2821 (-0.11z)| lr 2.51e-04 | 4162.83 ms | 32.4% bf16 MFU | 125049 tok/s step 11112/19560 | loss 3.411676 (-0.04z)| norm 0.2644 (-1.25z)| lr 2.51e-04 | 4166.97 ms | 32.4% bf16 MFU | 125088 tok/s step 11113/19560 | loss 3.428862 (+0.41z)| norm 0.2794 (-0.27z)| lr 2.51e-04 | 4160.91 ms | 32.4% bf16 MFU | 125133 tok/s step 11114/19560 | loss 3.409931 (-0.09z)| norm 0.2809 (-0.17z)| lr 2.51e-04 | 4157.64 ms | 32.5% bf16 MFU | 125182 tok/s step 11115/19560 | loss 3.407318 (-0.16z)| norm 0.2863 (+0.18z)| lr 2.51e-04 | 4162.17 ms | 32.4% bf16 MFU | 125221 tok/s step 11116/19560 | loss 3.430896 (+0.45z)| norm 0.2722 (-0.73z)| lr 2.51e-04 | 4156.29 ms | 32.5% bf16 MFU | 125267 tok/s step 11117/19560 | loss 3.390064 (-0.61z)| norm 0.2835 (+0.01z)| lr 2.51e-04 | 4159.12 ms | 32.5% bf16 MFU | 125307 tok/s step 11118/19560 | loss 3.461605 (+1.25z)| norm 0.2868 (+0.23z)| lr 2.51e-04 | 4164.06 ms | 32.4% bf16 MFU | 125337 tok/s step 11119/19560 | loss 3.429387 (+0.40z)| norm 0.2676 (-1.03z)| lr 2.51e-04 | 4170.16 ms | 32.4% bf16 MFU | 125356 tok/s step 11120/19560 | loss 3.415018 (+0.04z)| norm 0.2777 (-0.36z)| lr 2.51e-04 | 4160.04 ms | 32.5% bf16 MFU | 125390 tok/s step 11121/19560 | loss 3.434860 (+0.57z)| norm 0.2784 (-0.32z)| lr 2.51e-04 | 4159.08 ms | 32.5% bf16 MFU | 125423 tok/s step 11122/19560 | loss 3.378458 (-0.93z)| norm 0.2647 (-1.21z)| lr 2.51e-04 | 4160.89 ms | 32.4% bf16 MFU | 125452 tok/s step 11123/19560 | loss 3.399984 (-0.33z)| norm 0.3013 (+1.18z)| lr 2.51e-04 | 4150.15 ms | 32.5% bf16 MFU | 125496 tok/s step 11124/19560 | loss 3.540128 (+3.39z)| norm 0.2714 (-0.76z)| lr 2.51e-04 | 4160.32 ms | 
32.5% bf16 MFU | 125522 tok/s step 11125/19560 | loss 3.421270 (+0.22z)| norm 0.2716 (-0.74z)| lr 2.51e-04 | 4170.69 ms | 32.4% bf16 MFU | 125532 tok/s step 11126/19560 | loss 3.388688 (-0.64z)| norm 0.2917 (+0.57z)| lr 2.51e-04 | 4162.43 ms | 32.4% bf16 MFU | 125553 tok/s step 11127/19560 | loss 3.396636 (-0.42z)| norm 0.2672 (-1.02z)| lr 2.50e-04 | 4151.69 ms | 32.5% bf16 MFU | 125589 tok/s step 11128/19560 | loss 3.426027 (+0.35z)| norm 0.3009 (+1.20z)| lr 2.50e-04 | 4156.23 ms | 32.5% bf16 MFU | 125617 tok/s step 11129/19560 | loss 3.394597 (-0.49z)| norm 0.2662 (-1.08z)| lr 2.50e-04 | 4164.32 ms | 32.4% bf16 MFU | 125631 tok/s step 11130/19560 | loss 3.395316 (-0.48z)| norm 0.2781 (-0.30z)| lr 2.50e-04 | 4159.61 ms | 32.5% bf16 MFU | 125652 tok/s step 11131/19560 | loss 3.413845 (+0.01z)| norm 0.2893 (+0.44z)| lr 2.50e-04 | 4172.48 ms | 32.4% bf16 MFU | 125652 tok/s step 11132/19560 | loss 3.396370 (-0.46z)| norm 0.2800 (-0.18z)| lr 2.50e-04 | 4160.15 ms | 32.5% bf16 MFU | 125671 tok/s step 11133/19560 | loss 3.442552 (+0.78z)| norm 0.2900 (+0.48z)| lr 2.50e-04 | 4166.01 ms | 32.4% bf16 MFU | 125680 tok/s step 11134/19560 | loss 3.404236 (-0.27z)| norm 0.2828 (-0.01z)| lr 2.50e-04 | 4165.88 ms | 32.4% bf16 MFU | 125688 tok/s step 11135/19560 | loss 3.450191 (+0.99z)| norm 0.2660 (-1.14z)| lr 2.50e-04 | 4161.34 ms | 32.4% bf16 MFU | 125703 tok/s step 11136/19560 | loss 3.416310 (+0.06z)| norm 0.2575 (-1.68z)| lr 2.50e-04 | 4159.87 ms | 32.5% bf16 MFU | 125720 tok/s step 11137/19560 | loss 3.431154 (+0.46z)| norm 0.2692 (-0.91z)| lr 2.50e-04 | 4164.25 ms | 32.4% bf16 MFU | 125729 tok/s step 11138/19560 | loss 3.323321 (-2.44z)| norm 0.2643 (-1.23z)| lr 2.50e-04 | 4160.11 ms | 32.5% bf16 MFU | 125744 tok/s step 11139/19560 | loss 3.401852 (-0.31z)| norm 0.2614 (-1.40z)| lr 2.50e-04 | 4158.79 ms | 32.5% bf16 MFU | 125760 tok/s step 11140/19560 | loss 3.414476 (+0.04z)| norm 0.2819 (-0.05z)| lr 2.50e-04 | 4163.43 ms | 32.4% bf16 MFU | 125768 tok/s step 11141/19560 
| loss 3.392952 (-0.59z)| norm 0.3020 (+1.26z)| lr 2.50e-04 | 4170.97 ms | 32.4% bf16 MFU | 125765 tok/s step 11142/19560 | loss 3.406025 (-0.21z)| norm 0.2926 (+0.63z)| lr 2.50e-04 | 4166.44 ms | 32.4% bf16 MFU | 125769 tok/s step 11143/19560 | loss 3.400052 (-0.38z)| norm 0.2572 (-1.69z)| lr 2.50e-04 | 4178.32 ms | 32.3% bf16 MFU | 125754 tok/s step 11144/19560 | loss 3.396982 (-0.47z)| norm 0.2769 (-0.41z)| lr 2.50e-04 | 4153.71 ms | 32.5% bf16 MFU | 125777 tok/s step 11145/19560 | loss 3.340098 (-2.05z)| norm 0.2601 (-1.51z)| lr 2.50e-04 | 4153.15 ms | 32.5% bf16 MFU | 125800 tok/s step 11146/19560 | loss 3.408270 (-0.14z)| norm 0.2735 (-0.62z)| lr 2.50e-04 | 4164.03 ms | 32.4% bf16 MFU | 125806 tok/s step 11147/19560 | loss 3.425455 (+0.34z)| norm 0.2712 (-0.77z)| lr 2.49e-04 | 4160.51 ms | 32.5% bf16 MFU | 125816 tok/s step 11148/19560 | loss 3.424951 (+0.32z)| norm 0.2678 (-0.99z)| lr 2.49e-04 | 4158.69 ms | 32.5% bf16 MFU | 125829 tok/s step 11149/19560 | loss 3.378051 (-1.02z)| norm 0.2937 (+0.69z)| lr 2.49e-04 | 4161.67 ms | 32.4% bf16 MFU | 125837 tok/s step 11150/19560 | loss 3.397581 (-0.46z)| norm 0.2814 (-0.13z)| lr 2.49e-04 | 4153.56 ms | 32.5% bf16 MFU | 125856 tok/s step 11151/19560 | loss 3.462304 (+1.41z)| norm 0.2789 (-0.29z)| lr 2.49e-04 | 4156.77 ms | 32.5% bf16 MFU | 125870 tok/s step 11152/19560 | loss 3.387458 (-0.75z)| norm 0.2731 (-0.69z)| lr 2.49e-04 | 4159.56 ms | 32.5% bf16 MFU | 125878 tok/s step 11153/19560 | loss 3.391497 (-0.62z)| norm 0.2845 (+0.08z)| lr 2.49e-04 | 4165.58 ms | 32.4% bf16 MFU | 125878 tok/s step 11154/19560 | loss 3.341370 (-2.02z)| norm 0.2603 (-1.52z)| lr 2.49e-04 | 4150.70 ms | 32.5% bf16 MFU | 125899 tok/s step 11155/19560 | loss 3.409075 (-0.11z)| norm 0.2888 (+0.40z)| lr 2.49e-04 | 4161.57 ms | 32.4% bf16 MFU | 125904 tok/s step 11156/19560 | loss 3.450946 (+1.07z)| norm 0.2542 (-1.93z)| lr 2.49e-04 | 4164.52 ms | 32.4% bf16 MFU | 125903 tok/s step 11157/19560 | loss 3.398136 (-0.42z)| norm 0.2731 (-0.65z)| 
lr 2.49e-04 | 4151.22 ms | 32.5% bf16 MFU | 125923 tok/s step 11158/19560 | loss 3.383226 (-0.84z)| norm 0.2799 (-0.20z)| lr 2.49e-04 | 4168.23 ms | 32.4% bf16 MFU | 125916 tok/s step 11159/19560 | loss 3.419148 (+0.20z)| norm 0.2751 (-0.51z)| lr 2.49e-04 | 4168.05 ms | 32.4% bf16 MFU | 125909 tok/s step 11160/19560 | loss 3.393413 (-0.54z)| norm 0.2688 (-0.95z)| lr 2.49e-04 | 4149.80 ms | 32.5% bf16 MFU | 125931 tok/s step 11161/19560 | loss 3.418188 (+0.17z)| norm 0.2775 (-0.33z)| lr 2.49e-04 | 4163.09 ms | 32.4% bf16 MFU | 125931 tok/s step 11162/19560 | loss 3.374564 (-1.08z)| norm 0.2703 (-0.83z)| lr 2.49e-04 | 4155.95 ms | 32.5% bf16 MFU | 125942 tok/s step 11163/19560 | loss 3.399372 (-0.35z)| norm 0.2825 (+0.03z)| lr 2.49e-04 | 4157.74 ms | 32.5% bf16 MFU | 125950 tok/s step 11164/19560 | loss 3.441279 (+0.88z)| norm 0.2867 (+0.33z)| lr 2.49e-04 | 4164.74 ms | 32.4% bf16 MFU | 125947 tok/s step 11165/19560 | loss 3.386899 (-0.71z)| norm 0.2707 (-0.80z)| lr 2.49e-04 | 4157.11 ms | 32.5% bf16 MFU | 125956 tok/s step 11166/19560 | loss 3.456575 (+1.34z)| norm 0.2924 (+0.72z)| lr 2.49e-04 | 4426.51 ms | 30.5% bf16 MFU | 125580 tok/s step 11167/19560 | loss 3.413890 (+0.09z)| norm 0.2611 (-1.48z)| lr 2.48e-04 | 4219.38 ms | 32.0% bf16 MFU | 125514 tok/s step 11168/19560 | loss 3.367765 (-1.25z)| norm 0.2756 (-0.44z)| lr 2.48e-04 | 4161.86 ms | 32.4% bf16 MFU | 125537 tok/s step 11169/19560 | loss 3.440682 (+0.88z)| norm 0.2821 (+0.05z)| lr 2.48e-04 | 4393.56 ms | 30.7% bf16 MFU | 125227 tok/s step 11170/19560 | loss 3.372860 (-1.09z)| norm 0.3105 (+2.13z)| lr 2.48e-04 | 4221.63 ms | 32.0% bf16 MFU | 125175 tok/s step 11171/19560 | loss 3.450377 (+1.15z)| norm 0.2918 (+0.74z)| lr 2.48e-04 | 4157.70 ms | 32.5% bf16 MFU | 125221 tok/s step 11172/19560 | loss 3.376727 (-1.00z)| norm 0.3003 (+1.35z)| lr 2.48e-04 | 4155.62 ms | 32.5% bf16 MFU | 125268 tok/s step 11173/19560 | loss 3.403119 (-0.22z)| norm 0.2819 (-0.01z)| lr 2.48e-04 | 4152.92 ms | 32.5% bf16 MFU | 
125317 tok/s step 11174/19560 | loss 3.372145 (-1.13z)| norm 0.2805 (-0.11z)| lr 2.48e-04 | 4152.76 ms | 32.5% bf16 MFU | 125364 tok/s step 11175/19560 | loss 3.400223 (-0.32z)| norm 0.2652 (-1.23z)| lr 2.48e-04 | 4156.97 ms | 32.5% bf16 MFU | 125402 tok/s step 11176/19560 | loss 3.444027 (+0.97z)| norm 0.2902 (+0.60z)| lr 2.48e-04 | 4155.74 ms | 32.5% bf16 MFU | 125440 tok/s step 11177/19560 | loss 3.384115 (-0.81z)| norm 0.2675 (-1.06z)| lr 2.48e-04 | 4193.97 ms | 32.2% bf16 MFU | 125418 tok/s step 11178/19560 | loss 3.482315 (+2.07z)| norm 0.2781 (-0.28z)| lr 2.48e-04 | 4153.16 ms | 32.5% bf16 MFU | 125459 tok/s step 11179/19560 | loss 3.375554 (-1.07z)| norm 0.2806 (-0.10z)| lr 2.48e-04 | 4152.78 ms | 32.5% bf16 MFU | 125499 tok/s step 11180/19560 | loss 3.354210 (-1.68z)| norm 0.2718 (-0.74z)| lr 2.48e-04 | 4156.94 ms | 32.5% bf16 MFU | 125530 tok/s step 11181/19560 | loss 3.368382 (-1.24z)| norm 0.2978 (+1.24z)| lr 2.48e-04 | 4151.71 ms | 32.5% bf16 MFU | 125568 tok/s step 11182/19560 | loss 3.418437 (+0.25z)| norm 0.2775 (-0.30z)| lr 2.48e-04 | 4153.54 ms | 32.5% bf16 MFU | 125600 tok/s step 11183/19560 | loss 3.455965 (+1.35z)| norm 0.2971 (+1.22z)| lr 2.48e-04 | 4147.89 ms | 32.6% bf16 MFU | 125640 tok/s step 11184/19560 | loss 3.436091 (+0.75z)| norm 0.2864 (+0.38z)| lr 2.48e-04 | 4155.58 ms | 32.5% bf16 MFU | 125667 tok/s step 11185/19560 | loss 3.345193 (-1.89z)| norm 0.2784 (-0.22z)| lr 2.48e-04 | 4152.40 ms | 32.5% bf16 MFU | 125696 tok/s step 11186/19560 | loss 3.314777 (-2.69z)| norm 0.2776 (-0.30z)| lr 2.48e-04 | 4153.54 ms | 32.5% bf16 MFU | 125723 tok/s step 11187/19560 | loss 3.392996 (-0.45z)| norm 0.2892 (+0.62z)| lr 2.48e-04 | 4153.63 ms | 32.5% bf16 MFU | 125748 tok/s step 11188/19560 | loss 3.460666 (+1.47z)| norm 0.2807 (-0.04z)| lr 2.47e-04 | 4152.37 ms | 32.5% bf16 MFU | 125774 tok/s step 11189/19560 | loss 3.345946 (-1.77z)| norm 0.2981 (+1.32z)| lr 2.47e-04 | 4155.87 ms | 32.5% bf16 MFU | 125793 tok/s step 11190/19560 | loss 3.360792 
(-1.34z)| norm 0.2783 (-0.24z)| lr 2.47e-04 | 4155.44 ms | 32.5% bf16 MFU | 125812 tok/s step 11191/19560 | loss 3.363563 (-1.24z)| norm 0.2983 (+1.32z)| lr 2.47e-04 | 4150.86 ms | 32.5% bf16 MFU | 125836 tok/s step 11192/19560 | loss 3.362390 (-1.26z)| norm 0.2570 (-1.89z)| lr 2.47e-04 | 4154.20 ms | 32.5% bf16 MFU | 125855 tok/s step 11193/19560 | loss 3.317908 (-2.42z)| norm 0.2950 (+1.05z)| lr 2.47e-04 | 4153.77 ms | 32.5% bf16 MFU | 125873 tok/s step 11194/19560 | loss 3.403010 (-0.11z)| norm 0.2598 (-1.64z)| lr 2.47e-04 | 4154.30 ms | 32.5% bf16 MFU | 125890 tok/s step 11195/19560 | loss 3.504773 (+2.59z)| norm 0.2950 (+1.05z)| lr 2.47e-04 | 4154.09 ms | 32.5% bf16 MFU | 125906 tok/s step 11196/19560 | loss 3.379467 (-0.76z)| norm 0.3071 (+1.96z)| lr 2.47e-04 | 4153.93 ms | 32.5% bf16 MFU | 125921 tok/s step 11197/19560 | loss 3.340940 (-1.76z)| norm 0.2802 (-0.09z)| lr 2.47e-04 | 4154.77 ms | 32.5% bf16 MFU | 125935 tok/s step 11198/19560 | loss 3.422246 (+0.40z)| norm 0.3060 (+1.86z)| lr 2.47e-04 | 4152.91 ms | 32.5% bf16 MFU | 125950 tok/s step 11199/19560 | loss 3.452696 (+1.19z)| norm 0.3047 (+1.72z)| lr 2.47e-04 | 4150.87 ms | 32.5% bf16 MFU | 125968 tok/s step 11200/19560 | loss 3.390368 (-0.45z)| norm 0.2985 (+1.25z)| lr 2.47e-04 | 4148.54 ms | 32.5% bf16 MFU | 125989 tok/s step 11201/19560 | loss 3.436630 (+0.76z)| norm 0.3084 (+1.96z)| lr 2.47e-04 | 4150.05 ms | 32.5% bf16 MFU | 126006 tok/s step 11202/19560 | loss 3.416831 (+0.23z)| norm 0.2948 (+0.95z)| lr 2.47e-04 | 4152.68 ms | 32.5% bf16 MFU | 126018 tok/s step 11203/19560 | loss 3.358994 (-1.27z)| norm 0.2731 (-0.64z)| lr 2.47e-04 | 4154.14 ms | 32.5% bf16 MFU | 126028 tok/s step 11204/19560 | loss 3.352025 (-1.43z)| norm 0.2847 (+0.22z)| lr 2.47e-04 | 4150.31 ms | 32.5% bf16 MFU | 126043 tok/s step 11205/19560 | loss 3.397385 (-0.24z)| norm 0.2707 (-0.80z)| lr 2.47e-04 | 4150.67 ms | 32.5% bf16 MFU | 126056 tok/s step 11206/19560 | loss 3.423501 (+0.43z)| norm 0.2924 (+0.82z)| lr 2.47e-04 | 
4152.07 ms | 32.5% bf16 MFU | 126067 tok/s step 11207/19560 | loss 3.364849 (-1.08z)| norm 0.2742 (-0.54z)| lr 2.47e-04 | 4152.38 ms | 32.5% bf16 MFU | 126077 tok/s step 11208/19560 | loss 3.422438 (+0.42z)| norm 0.2786 (-0.20z)| lr 2.46e-04 | 4151.13 ms | 32.5% bf16 MFU | 126088 tok/s step 11209/19560 | loss 3.377439 (-0.76z)| norm 0.2760 (-0.40z)| lr 2.46e-04 | 4150.89 ms | 32.5% bf16 MFU | 126099 tok/s step 11210/19560 | loss 3.415016 (+0.23z)| norm 0.2721 (-0.67z)| lr 2.46e-04 | 4152.03 ms | 32.5% bf16 MFU | 126107 tok/s step 11211/19560 | loss 3.401316 (-0.13z)| norm 0.2808 (-0.03z)| lr 2.46e-04 | 4154.15 ms | 32.5% bf16 MFU | 126112 tok/s step 11212/19560 | loss 3.370773 (-0.92z)| norm 0.2781 (-0.23z)| lr 2.46e-04 | 4151.57 ms | 32.5% bf16 MFU | 126121 tok/s step 11213/19560 | loss 3.377777 (-0.74z)| norm 0.2749 (-0.46z)| lr 2.46e-04 | 4151.76 ms | 32.5% bf16 MFU | 126129 tok/s step 11214/19560 | loss 3.422613 (+0.46z)| norm 0.2694 (-0.86z)| lr 2.46e-04 | 4151.99 ms | 32.5% bf16 MFU | 126136 tok/s step 11215/19560 | loss 3.338471 (-1.76z)| norm 0.3108 (+2.20z)| lr 2.46e-04 | 4154.16 ms | 32.5% bf16 MFU | 126140 tok/s step 11216/19560 | loss 3.368397 (-0.95z)| norm 0.2940 (+0.94z)| lr 2.46e-04 | 4150.39 ms | 32.5% bf16 MFU | 126149 tok/s step 11217/19560 | loss 3.393192 (-0.30z)| norm 0.2962 (+1.11z)| lr 2.46e-04 | 4155.67 ms | 32.5% bf16 MFU | 126150 tok/s step 11218/19560 | loss 3.397040 (-0.19z)| norm 0.3056 (+1.78z)| lr 2.46e-04 | 4151.99 ms | 32.5% bf16 MFU | 126156 tok/s step 11219/19560 | loss 3.391121 (-0.35z)| norm 0.2751 (-0.47z)| lr 2.46e-04 | 4149.23 ms | 32.5% bf16 MFU | 126166 tok/s step 11220/19560 | loss 3.396231 (-0.21z)| norm 0.2874 (+0.48z)| lr 2.46e-04 | 4152.20 ms | 32.5% bf16 MFU | 126171 tok/s step 11221/19560 | loss 3.341020 (-1.63z)| norm 0.2857 (+0.36z)| lr 2.46e-04 | 4155.59 ms | 32.5% bf16 MFU | 126171 tok/s step 11222/19560 | loss 3.427689 (+0.63z)| norm 0.2899 (+0.67z)| lr 2.46e-04 | 4151.29 ms | 32.5% bf16 MFU | 126177 tok/s step 
11223/19560 | loss 3.374343 (-0.76z)| norm 0.2803 (-0.07z)| lr 2.46e-04 | 4153.31 ms | 32.5% bf16 MFU | 126180 tok/s step 11224/19560 | loss 3.415114 (+0.30z)| norm 0.2893 (+0.62z)| lr 2.46e-04 | 4147.20 ms | 32.6% bf16 MFU | 126192 tok/s step 11225/19560 | loss 3.427753 (+0.64z)| norm 0.3058 (+1.86z)| lr 2.46e-04 | 4159.15 ms | 32.5% bf16 MFU | 126185 tok/s step 11226/19560 | loss 3.357021 (-1.21z)| norm 0.2561 (-1.92z)| lr 2.46e-04 | 4150.51 ms | 32.5% bf16 MFU | 126192 tok/s step 11227/19560 | loss 3.365270 (-0.99z)| norm 0.3086 (+2.02z)| lr 2.46e-04 | 7141.94 ms | 18.9% bf16 MFU | 123553 tok/s step 11228/19560 | loss 3.426540 (+0.67z)| norm 0.2781 (-0.24z)| lr 2.45e-04 | 4139.66 ms | 32.6% bf16 MFU | 123708 tok/s step 11229/19560 | loss 3.607539 (+5.00z)| norm 0.3063 (+1.87z)| lr 2.45e-04 | 4144.02 ms | 32.6% bf16 MFU | 123848 tok/s step 11230/19560 | loss 3.372817 (-0.75z)| norm 0.2888 (+0.56z)| lr 2.45e-04 | 4140.52 ms | 32.6% bf16 MFU | 123987 tok/s step 11231/19560 | loss 3.525047 (+2.86z)| norm 0.3464 (+4.45z)| lr 2.45e-04 | 4144.89 ms | 32.6% bf16 MFU | 124112 tok/s step 11232/19560 | loss 3.409715 (+0.12z)| norm 0.3279 (+3.04z)| lr 2.45e-04 | 4146.35 ms | 32.6% bf16 MFU | 124229 tok/s step 11233/19560 | loss 3.376967 (-0.66z)| norm 0.2969 (+0.95z)| lr 2.45e-04 | 4143.85 ms | 32.6% bf16 MFU | 124343 tok/s step 11234/19560 | loss 3.352126 (-1.23z)| norm 0.3007 (+1.19z)| lr 2.45e-04 | 4152.44 ms | 32.5% bf16 MFU | 124439 tok/s step 11235/19560 | loss 3.321698 (-1.91z)| norm 0.2763 (-0.43z)| lr 2.45e-04 | 4147.70 ms | 32.6% bf16 MFU | 124537 tok/s step 11236/19560 | loss 3.401895 (-0.03z)| norm 0.3082 (+1.66z)| lr 2.45e-04 | 4148.14 ms | 32.5% bf16 MFU | 124630 tok/s step 11237/19560 | loss 3.319968 (-1.92z)| norm 0.2781 (-0.33z)| lr 2.45e-04 | 4146.23 ms | 32.6% bf16 MFU | 124721 tok/s step 11238/19560 | loss 3.358368 (-1.01z)| norm 0.2957 (+0.84z)| lr 2.45e-04 | 4143.70 ms | 32.6% bf16 MFU | 124811 tok/s step 11239/19560 | loss 3.318035 (-1.90z)| norm 
0.2745 (-0.55z)| lr 2.45e-04 | 4149.59 ms | 32.5% bf16 MFU | 124888 tok/s step 11240/19560 | loss 3.324187 (-1.73z)| norm 0.2778 (-0.35z)| lr 2.45e-04 | 4148.87 ms | 32.5% bf16 MFU | 124962 tok/s step 11241/19560 | loss 3.382379 (-0.40z)| norm 0.2730 (-0.66z)| lr 2.45e-04 | 4147.22 ms | 32.6% bf16 MFU | 125035 tok/s step 11242/19560 | loss 3.372714 (-0.61z)| norm 0.2596 (-1.52z)| lr 2.45e-04 | 4150.73 ms | 32.5% bf16 MFU | 125099 tok/s step 11243/19560 | loss 3.407031 (+0.17z)| norm 0.2645 (-1.18z)| lr 2.45e-04 | 4147.50 ms | 32.6% bf16 MFU | 125164 tok/s step 11244/19560 | loss 3.399972 (+0.01z)| norm 0.2738 (-0.58z)| lr 2.45e-04 | 4152.43 ms | 32.5% bf16 MFU | 125219 tok/s step 11245/19560 | loss 3.375785 (-0.53z)| norm 0.2643 (-1.18z)| lr 2.45e-04 | 4147.02 ms | 32.6% bf16 MFU | 125280 tok/s step 11246/19560 | loss 3.365128 (-0.76z)| norm 0.2559 (-1.69z)| lr 2.45e-04 | 4161.43 ms | 32.4% bf16 MFU | 125315 tok/s step 11247/19560 | loss 3.377926 (-0.46z)| norm 0.3075 (+1.58z)| lr 2.45e-04 | 4146.90 ms | 32.6% bf16 MFU | 125371 tok/s step 11248/19560 | loss 3.455606 (+1.30z)| norm 0.2791 (-0.22z)| lr 2.45e-04 | 4149.07 ms | 32.5% bf16 MFU | 125420 tok/s step 11249/19560 | loss 3.419101 (+0.47z)| norm 0.2782 (-0.28z)| lr 2.44e-04 | 4149.08 ms | 32.5% bf16 MFU | 125467 tok/s step 11250/19560 | loss 3.409542 (+0.25z)| norm 0.2798 (-0.19z)| lr 2.44e-04 | 4150.20 ms | 32.5% bf16 MFU | 125510 tok/s val loss 3.362043 evaluating HellaSwag: 0/1256 evaluating HellaSwag: 10/1256 evaluating HellaSwag: 20/1256 evaluating HellaSwag: 30/1256 evaluating HellaSwag: 40/1256 evaluating HellaSwag: 50/1256 evaluating HellaSwag: 60/1256 evaluating HellaSwag: 70/1256 evaluating HellaSwag: 80/1256 evaluating HellaSwag: 90/1256 evaluating HellaSwag: 100/1256 evaluating HellaSwag: 110/1256 evaluating HellaSwag: 120/1256 evaluating HellaSwag: 130/1256 evaluating HellaSwag: 140/1256 evaluating HellaSwag: 150/1256 evaluating HellaSwag: 160/1256 evaluating HellaSwag: 170/1256 evaluating 
HellaSwag: 810/1256 evaluating HellaSwag: 
820/1256 evaluating HellaSwag: 830/1256 evaluating HellaSwag: 840/1256 evaluating HellaSwag: 850/1256 evaluating HellaSwag: 860/1256 evaluating HellaSwag: 870/1256 evaluating HellaSwag: 880/1256 evaluating HellaSwag: 890/1256 evaluating HellaSwag: 900/1256 evaluating HellaSwag: 910/1256 evaluating HellaSwag: 920/1256 evaluating HellaSwag: 930/1256 evaluating HellaSwag: 940/1256 evaluating HellaSwag: 950/1256 evaluating HellaSwag: 960/1256 evaluating HellaSwag: 970/1256 evaluating HellaSwag: 980/1256 evaluating HellaSwag: 990/1256 evaluating HellaSwag: 1000/1256 evaluating HellaSwag: 1010/1256 evaluating HellaSwag: 1020/1256 evaluating HellaSwag: 1030/1256 evaluating HellaSwag: 1040/1256 evaluating HellaSwag: 1050/1256 evaluating HellaSwag: 1060/1256 evaluating HellaSwag: 1070/1256 evaluating HellaSwag: 1080/1256 evaluating HellaSwag: 1090/1256 evaluating HellaSwag: 1100/1256 evaluating HellaSwag: 1110/1256 evaluating HellaSwag: 1120/1256 evaluating HellaSwag: 1130/1256 evaluating HellaSwag: 1140/1256 evaluating HellaSwag: 1150/1256 evaluating HellaSwag: 1160/1256 evaluating HellaSwag: 1170/1256 evaluating HellaSwag: 1180/1256 evaluating HellaSwag: 1190/1256 evaluating HellaSwag: 1200/1256 evaluating HellaSwag: 1210/1256 evaluating HellaSwag: 1220/1256 evaluating HellaSwag: 1230/1256 evaluating HellaSwag: 1240/1256 evaluating HellaSwag: 1250/1256 HellaSwag: 2911/10042 = 0.289882 step 11251/19560 | loss 3.385735 (-0.29z)| norm 0.2726 (-0.64z)| lr 2.44e-04 | 4145.23 ms | 32.6% bf16 MFU | 125559 tok/s step 11252/19560 | loss 3.382756 (-0.34z)| norm 0.2956 (+0.83z)| lr 2.44e-04 | 4145.67 ms | 32.6% bf16 MFU | 125604 tok/s step 11253/19560 | loss 3.388485 (-0.20z)| norm 0.2610 (-1.38z)| lr 2.44e-04 | 4145.87 ms | 32.6% bf16 MFU | 125647 tok/s step 11254/19560 | loss 3.376880 (-0.48z)| norm 0.2699 (-0.80z)| lr 2.44e-04 | 4151.77 ms | 32.5% bf16 MFU | 125679 tok/s step 11255/19560 | loss 3.409136 (+0.29z)| norm 0.2811 (-0.09z)| lr 2.44e-04 | 4149.63 ms | 32.5% bf16 MFU | 
125712 tok/s step 11256/19560 | loss 3.358604 (-0.90z)| norm 0.2829 (+0.03z)| lr 2.44e-04 | 4148.08 ms | 32.5% bf16 MFU | 125746 tok/s step 11257/19560 | loss 3.410012 (+0.32z)| norm 0.2897 (+0.46z)| lr 2.44e-04 | 4151.47 ms | 32.5% bf16 MFU | 125773 tok/s step 11258/19560 | loss 3.514688 (+2.69z)| norm 0.2527 (-1.89z)| lr 2.44e-04 | 4149.69 ms | 32.5% bf16 MFU | 125802 tok/s step 11259/19560 | loss 3.339221 (-1.32z)| norm 0.2683 (-0.89z)| lr 2.44e-04 | 4147.76 ms | 32.6% bf16 MFU | 125832 tok/s step 11260/19560 | loss 3.337669 (-1.34z)| norm 0.2676 (-0.92z)| lr 2.44e-04 | 4151.58 ms | 32.5% bf16 MFU | 125855 tok/s step 11261/19560 | loss 3.355022 (-0.93z)| norm 0.2644 (-1.11z)| lr 2.44e-04 | 4149.11 ms | 32.5% bf16 MFU | 125880 tok/s step 11262/19560 | loss 3.373300 (-0.51z)| norm 0.2718 (-0.63z)| lr 2.44e-04 | 4148.37 ms | 32.5% bf16 MFU | 125905 tok/s step 11263/19560 | loss 3.500385 (+2.33z)| norm 0.2733 (-0.54z)| lr 2.44e-04 | 4147.87 ms | 32.6% bf16 MFU | 125930 tok/s step 11264/19560 | loss 3.390410 (-0.12z)| norm 0.2844 (+0.15z)| lr 2.44e-04 | 4148.87 ms | 32.5% bf16 MFU | 125952 tok/s step 11265/19560 | loss 3.418285 (+0.51z)| norm 0.2911 (+0.56z)| lr 2.44e-04 | 4151.72 ms | 32.5% bf16 MFU | 125968 tok/s step 11266/19560 | loss 3.353718 (-0.95z)| norm 0.2744 (-0.51z)| lr 2.44e-04 | 4147.20 ms | 32.6% bf16 MFU | 125991 tok/s step 11267/19560 | loss 3.384715 (-0.25z)| norm 0.2888 (+0.40z)| lr 2.44e-04 | 4147.33 ms | 32.6% bf16 MFU | 126012 tok/s step 11268/19560 | loss 3.475079 (+1.76z)| norm 0.2971 (+0.92z)| lr 2.44e-04 | 4150.29 ms | 32.5% bf16 MFU | 126028 tok/s step 11269/19560 | loss 3.398407 (+0.05z)| norm 0.2736 (-0.57z)| lr 2.43e-04 | 4151.20 ms | 32.5% bf16 MFU | 126041 tok/s step 11270/19560 | loss 3.346292 (-1.10z)| norm 0.2913 (+0.57z)| lr 2.43e-04 | 4150.31 ms | 32.5% bf16 MFU | 126056 tok/s step 11271/19560 | loss 3.338785 (-1.25z)| norm 0.2705 (-0.79z)| lr 2.43e-04 | 4146.04 ms | 32.6% bf16 MFU | 126075 tok/s step 11272/19560 | loss 3.383019 
(-0.27z)| norm 0.2620 (-1.33z)| lr 2.43e-04 | 4148.77 ms | 32.5% bf16 MFU | 126090 tok/s step 11273/19560 | loss 3.348041 (-1.04z)| norm 0.2636 (-1.23z)| lr 2.43e-04 | 4148.56 ms | 32.5% bf16 MFU | 126105 tok/s step 11274/19560 | loss 3.387048 (-0.18z)| norm 0.2938 (+0.73z)| lr 2.43e-04 | 4150.98 ms | 32.5% bf16 MFU | 126115 tok/s step 11275/19560 | loss 3.383046 (-0.26z)| norm 0.2751 (-0.49z)| lr 2.43e-04 | 4148.73 ms | 32.5% bf16 MFU | 126128 tok/s step 11276/19560 | loss 3.377463 (-0.38z)| norm 0.2881 (+0.35z)| lr 2.43e-04 | 4150.09 ms | 32.5% bf16 MFU | 126138 tok/s step 11277/19560 | loss 3.370581 (-0.53z)| norm 0.2636 (-1.23z)| lr 2.43e-04 | 4148.76 ms | 32.5% bf16 MFU | 126150 tok/s step 11278/19560 | loss 3.354048 (-0.88z)| norm 0.2834 (+0.05z)| lr 2.43e-04 | 4150.87 ms | 32.5% bf16 MFU | 126157 tok/s step 11279/19560 | loss 3.379936 (-0.30z)| norm 0.2894 (+0.44z)| lr 2.43e-04 | 4146.92 ms | 32.6% bf16 MFU | 126171 tok/s step 11280/19560 | loss 3.356022 (-0.82z)| norm 0.2617 (-1.35z)| lr 2.43e-04 | 4146.19 ms | 32.6% bf16 MFU | 126185 tok/s step 11281/19560 | loss 3.362020 (-0.68z)| norm 0.2616 (-1.34z)| lr 2.43e-04 | 4147.36 ms | 32.6% bf16 MFU | 126196 tok/s step 11282/19560 | loss 3.372870 (-0.45z)| norm 0.2835 (+0.06z)| lr 2.43e-04 | 4148.03 ms | 32.5% bf16 MFU | 126206 tok/s step 11283/19560 | loss 3.385581 (-0.17z)| norm 0.2946 (+0.77z)| lr 2.43e-04 | 4149.08 ms | 32.5% bf16 MFU | 126214 tok/s step 11284/19560 | loss 3.371121 (-0.48z)| norm 0.2635 (-1.25z)| lr 2.43e-04 | 4148.91 ms | 32.5% bf16 MFU | 126222 tok/s step 11285/19560 | loss 3.363079 (-0.65z)| norm 0.2486 (-2.17z)| lr 2.43e-04 | 4151.15 ms | 32.5% bf16 MFU | 126226 tok/s step 11286/19560 | loss 3.397832 (+0.13z)| norm 0.2677 (-0.94z)| lr 2.43e-04 | 4145.20 ms | 32.6% bf16 MFU | 126238 tok/s step 11287/19560 | loss 3.486283 (+2.06z)| norm 0.2513 (-1.94z)| lr 2.43e-04 | 4149.10 ms | 32.5% bf16 MFU | 126245 tok/s step 11288/19560 | loss 3.393533 (+0.02z)| norm 0.2501 (-1.98z)| lr 2.43e-04 | 
4148.61 ms | 32.5% bf16 MFU | 126251 tok/s step 11289/19560 | loss 3.455548 (+1.37z)| norm 0.2665 (-0.96z)| lr 2.42e-04 | 4149.32 ms | 32.5% bf16 MFU | 126256 tok/s step 11290/19560 | loss 3.355071 (-0.82z)| norm 0.2607 (-1.30z)| lr 2.42e-04 | 4151.17 ms | 32.5% bf16 MFU | 126259 tok/s step 11291/19560 | loss 3.345003 (-1.03z)| norm 0.2532 (-1.73z)| lr 2.42e-04 | 4147.81 ms | 32.6% bf16 MFU | 126266 tok/s step 11292/19560 | loss 3.388417 (-0.08z)| norm 0.2603 (-1.28z)| lr 2.42e-04 | 4151.97 ms | 32.5% bf16 MFU | 126266 tok/s step 11293/19560 | loss 3.314557 (-1.66z)| norm 0.2699 (-0.70z)| lr 2.42e-04 | 4147.26 ms | 32.6% bf16 MFU | 126274 tok/s step 11294/19560 | loss 3.445398 (+1.17z)| norm 0.2493 (-1.90z)| lr 2.42e-04 | 4150.84 ms | 32.5% bf16 MFU | 126275 tok/s step 11295/19560 | loss 3.428139 (+0.79z)| norm 0.2680 (-0.79z)| lr 2.42e-04 | 4149.35 ms | 32.5% bf16 MFU | 126279 tok/s step 11296/19560 | loss 3.376928 (-0.32z)| norm 0.2653 (-0.95z)| lr 2.42e-04 | 4150.16 ms | 32.5% bf16 MFU | 126282 tok/s step 11297/19560 | loss 3.387690 (-0.08z)| norm 0.2622 (-1.11z)| lr 2.42e-04 | 4157.67 ms | 32.5% bf16 MFU | 126273 tok/s step 11298/19560 | loss 3.396741 (+0.12z)| norm 0.2859 (+0.31z)| lr 2.42e-04 | 4243.05 ms | 31.8% bf16 MFU | 126137 tok/s step 11299/19560 | loss 3.311080 (-1.71z)| norm 0.2698 (-0.65z)| lr 2.42e-04 | 4218.86 ms | 32.0% bf16 MFU | 126044 tok/s step 11300/19560 | loss 3.400318 (+0.21z)| norm 0.2558 (-1.46z)| lr 2.42e-04 | 4151.39 ms | 32.5% bf16 MFU | 126057 tok/s step 11301/19560 | loss 3.336929 (-1.14z)| norm 0.2840 (+0.22z)| lr 2.42e-04 | 4186.83 ms | 32.2% bf16 MFU | 126015 tok/s step 11302/19560 | loss 3.389274 (-0.02z)| norm 0.2577 (-1.33z)| lr 2.42e-04 | 4153.07 ms | 32.5% bf16 MFU | 126026 tok/s step 11303/19560 | loss 3.353777 (-0.77z)| norm 0.2816 (+0.08z)| lr 2.42e-04 | 4174.31 ms | 32.3% bf16 MFU | 126005 tok/s step 11304/19560 | loss 3.321330 (-1.44z)| norm 0.2574 (-1.33z)| lr 2.42e-04 | 4155.02 ms | 32.5% bf16 MFU | 126014 tok/s step 
11305/19560 | loss 3.370566 (-0.39z)| norm 0.2801 (+0.00z)| lr 2.42e-04 | 4147.31 ms | 32.6% bf16 MFU | 126034 tok/s step 11306/19560 | loss 3.385280 (-0.06z)| norm 0.2809 (+0.05z)| lr 2.42e-04 | 4151.53 ms | 32.5% bf16 MFU | 126047 tok/s step 11307/19560 | loss 3.365493 (-0.48z)| norm 0.2846 (+0.26z)| lr 2.42e-04 | 4153.17 ms | 32.5% bf16 MFU | 126056 tok/s step 11308/19560 | loss 3.564625 (+3.61z)| norm 0.3281 (+2.74z)| lr 2.42e-04 | 4148.77 ms | 32.5% bf16 MFU | 126072 tok/s step 11309/19560 | loss 3.391944 (+0.05z)| norm 0.2817 (+0.07z)| lr 2.42e-04 | 4149.65 ms | 32.5% bf16 MFU | 126086 tok/s step 11310/19560 | loss 3.401432 (+0.25z)| norm 0.3053 (+1.41z)| lr 2.41e-04 | 4150.78 ms | 32.5% bf16 MFU | 126097 tok/s step 11311/19560 | loss 3.335602 (-1.10z)| norm 0.2880 (+0.43z)| lr 2.41e-04 | 4147.52 ms | 32.6% bf16 MFU | 126112 tok/s step 11312/19560 | loss 3.360615 (-0.57z)| norm 0.2666 (-0.79z)| lr 2.41e-04 | 4145.68 ms | 32.6% bf16 MFU | 126130 tok/s step 11313/19560 | loss 3.398611 (+0.21z)| norm 0.2796 (-0.05z)| lr 2.41e-04 | 4149.07 ms | 32.5% bf16 MFU | 126142 tok/s step 11314/19560 | loss 3.332862 (-1.16z)| norm 0.2746 (-0.33z)| lr 2.41e-04 | 4152.30 ms | 32.5% bf16 MFU | 126148 tok/s step 11315/19560 | loss 3.407251 (+0.39z)| norm 0.3022 (+1.24z)| lr 2.41e-04 | 4148.00 ms | 32.6% bf16 MFU | 126160 tok/s step 11316/19560 | loss 3.386686 (-0.03z)| norm 0.2569 (-1.33z)| lr 2.41e-04 | 4149.53 ms | 32.5% bf16 MFU | 126170 tok/s step 11317/19560 | loss 3.447207 (+1.23z)| norm 0.3039 (+1.33z)| lr 2.41e-04 | 4150.92 ms | 32.5% bf16 MFU | 126177 tok/s step 11318/19560 | loss 3.410699 (+0.45z)| norm 0.2840 (+0.20z)| lr 2.41e-04 | 4149.92 ms | 32.5% bf16 MFU | 126185 tok/s step 11319/19560 | loss 3.498822 (+2.24z)| norm 0.2936 (+0.75z)| lr 2.41e-04 | 4149.22 ms | 32.5% bf16 MFU | 126193 tok/s step 11320/19560 | loss 3.368523 (-0.45z)| norm 0.2709 (-0.54z)| lr 2.41e-04 | 4152.38 ms | 32.5% bf16 MFU | 126197 tok/s step 11321/19560 | loss 3.382636 (-0.17z)| norm 
0.3098 (+1.65z)| lr 2.41e-04 | 4153.19 ms | 32.5% bf16 MFU | 126199 tok/s step 11322/19560 | loss 3.416102 (+0.52z)| norm 0.2853 (+0.25z)| lr 2.41e-04 | 4144.51 ms | 32.6% bf16 MFU | 126214 tok/s step 11323/19560 | loss 3.409061 (+0.40z)| norm 0.2868 (+0.35z)| lr 2.41e-04 | 4144.38 ms | 32.6% bf16 MFU | 126228 tok/s step 11324/19560 | loss 3.363834 (-0.56z)| norm 0.2696 (-0.63z)| lr 2.41e-04 | 4146.17 ms | 32.6% bf16 MFU | 126240 tok/s step 11325/19560 | loss 3.451041 (+1.28z)| norm 0.2975 (+0.97z)| lr 2.41e-04 | 4150.86 ms | 32.5% bf16 MFU | 126243 tok/s step 11326/19560 | loss 3.384280 (-0.14z)| norm 0.2865 (+0.35z)| lr 2.41e-04 | 4146.68 ms | 32.6% bf16 MFU | 126253 tok/s step 11327/19560 | loss 3.408593 (+0.39z)| norm 0.2742 (-0.35z)| lr 2.41e-04 | 4149.10 ms | 32.5% bf16 MFU | 126258 tok/s step 11328/19560 | loss 3.333488 (-1.20z)| norm 0.2857 (+0.32z)| lr 2.41e-04 | 4147.87 ms | 32.6% bf16 MFU | 126265 tok/s step 11329/19560 | loss 3.415872 (+0.56z)| norm 0.2734 (-0.38z)| lr 2.41e-04 | 4145.22 ms | 32.6% bf16 MFU | 126276 tok/s step 11330/19560 | loss 3.404666 (+0.32z)| norm 0.2708 (-0.52z)| lr 2.40e-04 | 4148.12 ms | 32.5% bf16 MFU | 126282 tok/s step 11331/19560 | loss 3.375867 (-0.30z)| norm 0.2764 (-0.19z)| lr 2.40e-04 | 4151.54 ms | 32.5% bf16 MFU | 126282 tok/s step 11332/19560 | loss 3.373870 (-0.34z)| norm 0.2732 (-0.37z)| lr 2.40e-04 | 4150.26 ms | 32.5% bf16 MFU | 126284 tok/s step 11333/19560 | loss 3.363600 (-0.56z)| norm 0.2627 (-0.99z)| lr 2.40e-04 | 4149.47 ms | 32.5% bf16 MFU | 126288 tok/s step 11334/19560 | loss 3.385856 (-0.07z)| norm 0.2884 (+0.53z)| lr 2.40e-04 | 4146.57 ms | 32.6% bf16 MFU | 126295 tok/s step 11335/19560 | loss 3.417648 (+0.60z)| norm 0.2642 (-0.89z)| lr 2.40e-04 | 4145.79 ms | 32.6% bf16 MFU | 126303 tok/s step 11336/19560 | loss 3.352281 (-0.79z)| norm 0.2647 (-0.85z)| lr 2.40e-04 | 4152.98 ms | 32.5% bf16 MFU | 126300 tok/s step 11337/19560 | loss 3.340354 (-1.04z)| norm 0.2688 (-0.61z)| lr 2.40e-04 | 4151.28 ms | 
32.5% bf16 MFU | 126300 tok/s step 11338/19560 | loss 3.343724 (-0.95z)| norm 0.2563 (-1.33z)| lr 2.40e-04 | 4149.33 ms | 32.5% bf16 MFU | 126303 tok/s step 11339/19560 | loss 3.594586 (+4.07z)| norm 0.2776 (-0.08z)| lr 2.40e-04 | 4147.26 ms | 32.6% bf16 MFU | 126309 tok/s step 11340/19560 | loss 3.393958 (+0.08z)| norm 0.2822 (+0.18z)| lr 2.40e-04 | 4148.30 ms | 32.5% bf16 MFU | 126313 tok/s step 11341/19560 | loss 3.399064 (+0.18z)| norm 0.2679 (-0.65z)| lr 2.40e-04 | 4147.16 ms | 32.6% bf16 MFU | 126318 tok/s step 11342/19560 | loss 3.409978 (+0.40z)| norm 0.2845 (+0.31z)| lr 2.40e-04 | 4148.44 ms | 32.5% bf16 MFU | 126321 tok/s step 11343/19560 | loss 3.362655 (-0.55z)| norm 0.2782 (-0.04z)| lr 2.40e-04 | 4147.96 ms | 32.6% bf16 MFU | 126325 tok/s step 11344/19560 | loss 3.323456 (-1.32z)| norm 0.2573 (-1.25z)| lr 2.40e-04 | 4148.50 ms | 32.5% bf16 MFU | 126328 tok/s step 11345/19560 | loss 3.362796 (-0.53z)| norm 0.2894 (+0.64z)| lr 2.40e-04 | 4148.25 ms | 32.5% bf16 MFU | 126331 tok/s step 11346/19560 | loss 3.424866 (+0.69z)| norm 0.2652 (-0.77z)| lr 2.40e-04 | 4146.58 ms | 32.6% bf16 MFU | 126336 tok/s step 11347/19560 | loss 3.381340 (-0.17z)| norm 0.2819 (+0.21z)| lr 2.40e-04 | 4147.59 ms | 32.6% bf16 MFU | 126340 tok/s step 11348/19560 | loss 3.441040 (+1.00z)| norm 0.2642 (-0.82z)| lr 2.40e-04 | 4152.73 ms | 32.5% bf16 MFU | 126335 tok/s step 11349/19560 | loss 3.345750 (-0.88z)| norm 0.2738 (-0.25z)| lr 2.40e-04 | 4147.86 ms | 32.6% bf16 MFU | 126339 tok/s step 11350/19560 | loss 3.394693 (+0.09z)| norm 0.2968 (+1.11z)| lr 2.40e-04 | 4151.72 ms | 32.5% bf16 MFU | 126336 tok/s step 11351/19560 | loss 3.360128 (-0.59z)| norm 0.2781 (-0.00z)| lr 2.39e-04 | 4148.24 ms | 32.5% bf16 MFU | 126338 tok/s step 11352/19560 | loss 3.341517 (-0.94z)| norm 0.2916 (+0.80z)| lr 2.39e-04 | 4148.52 ms | 32.5% bf16 MFU | 126340 tok/s step 11353/19560 | loss 3.452397 (+1.23z)| norm 0.3221 (+2.55z)| lr 2.39e-04 | 4145.51 ms | 32.6% bf16 MFU | 126347 tok/s step 11354/19560 
| loss 3.431988 (+0.82z)| norm 0.2867 (+0.48z)| lr 2.39e-04 | 4147.29 ms | 32.6% bf16 MFU | 126350 tok/s step 11355/19560 | loss 3.354934 (-0.69z)| norm 0.2977 (+1.14z)| lr 2.39e-04 | 4150.19 ms | 32.5% bf16 MFU | 126349 tok/s step 11356/19560 | loss 3.290131 (-1.91z)| norm 0.2701 (-0.48z)| lr 2.39e-04 | 4146.98 ms | 32.6% bf16 MFU | 126353 tok/s step 11357/19560 | loss 3.371382 (-0.33z)| norm 0.2857 (+0.45z)| lr 2.39e-04 | 4188.50 ms | 32.2% bf16 MFU | 126294 tok/s step 11358/19560 | loss 3.485261 (+2.00z)| norm 0.2891 (+0.65z)| lr 2.39e-04 | 4184.66 ms | 32.3% bf16 MFU | 126244 tok/s step 11359/19560 | loss 3.427987 (+0.86z)| norm 0.2905 (+0.81z)| lr 2.39e-04 | 4163.72 ms | 32.4% bf16 MFU | 126228 tok/s step 11360/19560 | loss 3.368560 (-0.39z)| norm 0.2642 (-0.86z)| lr 2.39e-04 | 4169.09 ms | 32.4% bf16 MFU | 126204 tok/s step 11361/19560 | loss 3.437721 (+1.06z)| norm 0.2950 (+1.18z)| lr 2.39e-04 | 4162.94 ms | 32.4% bf16 MFU | 126191 tok/s step 11362/19560 | loss 3.374535 (-0.27z)| norm 0.3064 (+1.92z)| lr 2.39e-04 | 4157.87 ms | 32.5% bf16 MFU | 126186 tok/s step 11363/19560 | loss 3.351847 (-0.76z)| norm 0.3168 (+2.51z)| lr 2.39e-04 | 4168.33 ms | 32.4% bf16 MFU | 126166 tok/s step 11364/19560 | loss 3.378056 (-0.20z)| norm 0.2976 (+1.30z)| lr 2.39e-04 | 4162.10 ms | 32.4% bf16 MFU | 126156 tok/s step 11365/19560 | loss 3.410325 (+0.47z)| norm 0.2833 (+0.37z)| lr 2.39e-04 | 4157.87 ms | 32.5% bf16 MFU | 126153 tok/s step 11366/19560 | loss 3.391975 (+0.07z)| norm 0.2924 (+0.97z)| lr 2.39e-04 | 4158.27 ms | 32.5% bf16 MFU | 126149 tok/s step 11367/19560 | loss 3.378495 (-0.23z)| norm 0.3031 (+1.63z)| lr 2.39e-04 | 4170.42 ms | 32.4% bf16 MFU | 126128 tok/s step 11368/19560 | loss 3.358721 (-0.66z)| norm 0.2933 (+0.98z)| lr 2.39e-04 | 4162.07 ms | 32.4% bf16 MFU | 126120 tok/s step 11369/19560 | loss 3.370517 (-0.41z)| norm 0.2848 (+0.44z)| lr 2.39e-04 | 4159.03 ms | 32.5% bf16 MFU | 126117 tok/s step 11370/19560 | loss 3.389585 (+0.01z)| norm 0.3031 (+1.57z)| 
lr 2.39e-04 | 4159.14 ms | 32.5% bf16 MFU | 126114 tok/s step 11371/19560 | loss 3.376513 (-0.27z)| norm 0.2956 (+1.08z)| lr 2.38e-04 | 4162.20 ms | 32.4% bf16 MFU | 126106 tok/s step 11372/19560 | loss 3.464622 (+1.61z)| norm 0.2952 (+1.04z)| lr 2.38e-04 | 4159.13 ms | 32.5% bf16 MFU | 126104 tok/s step 11373/19560 | loss 3.371595 (-0.39z)| norm 0.3125 (+2.08z)| lr 2.38e-04 | 4171.07 ms | 32.4% bf16 MFU | 126083 tok/s step 11374/19560 | loss 3.393692 (+0.08z)| norm 0.2828 (+0.22z)| lr 2.38e-04 | 4169.34 ms | 32.4% bf16 MFU | 126067 tok/s step 11375/19560 | loss 3.351868 (-0.81z)| norm 0.3023 (+1.45z)| lr 2.38e-04 | 4153.99 ms | 32.5% bf16 MFU | 126074 tok/s step 11376/19560 | loss 3.365457 (-0.51z)| norm 0.2796 (+0.02z)| lr 2.38e-04 | 4155.65 ms | 32.5% bf16 MFU | 126078 tok/s step 11377/19560 | loss 3.537708 (+3.08z)| norm 0.3021 (+1.42z)| lr 2.38e-04 | 4162.88 ms | 32.4% bf16 MFU | 126072 tok/s step 11378/19560 | loss 3.443817 (+1.11z)| norm 0.2820 (+0.16z)| lr 2.38e-04 | 4153.21 ms | 32.5% bf16 MFU | 126080 tok/s step 11379/19560 | loss 3.361139 (-0.60z)| norm 0.2997 (+1.25z)| lr 2.38e-04 | 4162.39 ms | 32.4% bf16 MFU | 126074 tok/s step 11380/19560 | loss 3.382411 (-0.15z)| norm 0.2692 (-0.64z)| lr 2.38e-04 | 4152.10 ms | 32.5% bf16 MFU | 126084 tok/s step 11381/19560 | loss 3.342246 (-0.98z)| norm 0.3109 (+1.92z)| lr 2.38e-04 | 4154.19 ms | 32.5% bf16 MFU | 126090 tok/s step 11382/19560 | loss 3.498634 (+2.19z)| norm 0.2860 (+0.37z)| lr 2.38e-04 | 4154.58 ms | 32.5% bf16 MFU | 126095 tok/s step 11383/19560 | loss 3.390214 (-0.00z)| norm 0.3109 (+1.87z)| lr 2.38e-04 | 4174.63 ms | 32.3% bf16 MFU | 126070 tok/s step 11384/19560 | loss 3.366335 (-0.49z)| norm 0.2957 (+0.93z)| lr 2.38e-04 | 4152.41 ms | 32.5% bf16 MFU | 126079 tok/s step 11385/19560 | loss 3.500034 (+2.17z)| norm 0.2779 (-0.14z)| lr 2.38e-04 | 4154.21 ms | 32.5% bf16 MFU | 126086 tok/s step 11386/19560 | loss 3.397932 (+0.16z)| norm 0.2959 (+0.94z)| lr 2.38e-04 | 4164.96 ms | 32.4% bf16 MFU | 
126075 tok/s step 11387/19560 | loss 3.435813 (+0.92z)| norm 0.2862 (+0.34z)| lr 2.38e-04 | 4157.54 ms | 32.5% bf16 MFU | 126077 tok/s step 11388/19560 | loss 3.423795 (+0.66z)| norm 0.2945 (+0.84z)| lr 2.38e-04 | 4151.83 ms | 32.5% bf16 MFU | 126087 tok/s step 11389/19560 | loss 3.411999 (+0.41z)| norm 0.3080 (+1.63z)| lr 2.38e-04 | 4153.11 ms | 32.5% bf16 MFU | 126095 tok/s step 11390/19560 | loss 3.362195 (-0.61z)| norm 0.2722 (-0.55z)| lr 2.38e-04 | 4165.97 ms | 32.4% bf16 MFU | 126082 tok/s step 11391/19560 | loss 3.355690 (-0.73z)| norm 0.2932 (+0.72z)| lr 2.37e-04 | 4173.02 ms | 32.4% bf16 MFU | 126060 tok/s step 11392/19560 | loss 3.359657 (-0.64z)| norm 0.2760 (-0.32z)| lr 2.37e-04 | 4170.62 ms | 32.4% bf16 MFU | 126043 tok/s step 11393/19560 | loss 3.409656 (+0.40z)| norm 0.2855 (+0.25z)| lr 2.37e-04 | 4161.85 ms | 32.4% bf16 MFU | 126039 tok/s step 11394/19560 | loss 3.370614 (-0.42z)| norm 0.2775 (-0.23z)| lr 2.37e-04 | 4158.05 ms | 32.5% bf16 MFU | 126042 tok/s step 11395/19560 | loss 3.364940 (-0.53z)| norm 0.2859 (+0.28z)| lr 2.37e-04 | 4184.99 ms | 32.3% bf16 MFU | 126004 tok/s step 11396/19560 | loss 3.338161 (-1.08z)| norm 0.2668 (-0.87z)| lr 2.37e-04 | 4155.79 ms | 32.5% bf16 MFU | 126011 tok/s step 11397/19560 | loss 3.408229 (+0.39z)| norm 0.2612 (-1.20z)| lr 2.37e-04 | 4158.86 ms | 32.5% bf16 MFU | 126014 tok/s step 11398/19560 | loss 3.388179 (-0.03z)| norm 0.2948 (+0.84z)| lr 2.37e-04 | 4164.34 ms | 32.4% bf16 MFU | 126008 tok/s step 11399/19560 | loss 3.404919 (+0.31z)| norm 0.2735 (-0.46z)| lr 2.37e-04 | 4164.68 ms | 32.4% bf16 MFU | 126002 tok/s step 11400/19560 | loss 3.379132 (-0.24z)| norm 0.2801 (-0.07z)| lr 2.37e-04 | 4165.20 ms | 32.4% bf16 MFU | 125996 tok/s step 11401/19560 | loss 3.342410 (-1.01z)| norm 0.3017 (+1.23z)| lr 2.37e-04 | 4154.41 ms | 32.5% bf16 MFU | 126006 tok/s step 11402/19560 | loss 3.412791 (+0.47z)| norm 0.2656 (-0.95z)| lr 2.37e-04 | 4168.63 ms | 32.4% bf16 MFU | 125994 tok/s step 11403/19560 | loss 3.341650 
(-1.02z)| norm 0.2917 (+0.63z)| lr 2.37e-04 | 4158.78 ms | 32.5% bf16 MFU | 125998 tok/s step 11404/19560 | loss 3.312391 (-1.61z)| norm 0.2705 (-0.65z)| lr 2.37e-04 | 4174.11 ms | 32.3% bf16 MFU | 125978 tok/s step 11405/19560 | loss 3.370702 (-0.39z)| norm 0.2612 (-1.22z)| lr 2.37e-04 | 4156.49 ms | 32.5% bf16 MFU | 125986 tok/s step 11406/19560 | loss 3.386851 (-0.06z)| norm 0.2792 (-0.12z)| lr 2.37e-04 | 4185.56 ms | 32.3% bf16 MFU | 125950 tok/s step 11407/19560 | loss 3.403232 (+0.27z)| norm 0.2688 (-0.74z)| lr 2.37e-04 | 4166.90 ms | 32.4% bf16 MFU | 125944 tok/s step 11408/19560 | loss 3.294172 (-1.96z)| norm 0.3038 (+1.36z)| lr 2.37e-04 | 4161.61 ms | 32.4% bf16 MFU | 125946 tok/s step 11409/19560 | loss 3.335643 (-1.10z)| norm 0.2731 (-0.51z)| lr 2.37e-04 | 4170.11 ms | 32.4% bf16 MFU | 125935 tok/s step 11410/19560 | loss 3.404211 (+0.30z)| norm 0.3063 (+1.49z)| lr 2.37e-04 | 4153.44 ms | 32.5% bf16 MFU | 125949 tok/s step 11411/19560 | loss 3.384924 (-0.10z)| norm 0.2585 (-1.37z)| lr 2.37e-04 | 4165.54 ms | 32.4% bf16 MFU | 125945 tok/s step 11412/19560 | loss 3.374898 (-0.30z)| norm 0.2782 (-0.20z)| lr 2.36e-04 | 4176.01 ms | 32.3% bf16 MFU | 125925 tok/s step 11413/19560 | loss 3.413248 (+0.48z)| norm 0.2692 (-0.76z)| lr 2.36e-04 | 4159.02 ms | 32.5% bf16 MFU | 125932 tok/s step 11414/19560 | loss 3.411756 (+0.44z)| norm 0.2699 (-0.72z)| lr 2.36e-04 | 4154.89 ms | 32.5% bf16 MFU | 125945 tok/s step 11415/19560 | loss 3.410430 (+0.43z)| norm 0.2700 (-0.73z)| lr 2.36e-04 | 4158.46 ms | 32.5% bf16 MFU | 125951 tok/s step 11416/19560 | loss 3.468572 (+1.62z)| norm 0.2694 (-0.78z)| lr 2.36e-04 | 4159.17 ms | 32.5% bf16 MFU | 125956 tok/s step 11417/19560 | loss 3.334265 (-1.13z)| norm 0.2811 (-0.06z)| lr 2.36e-04 | 4166.50 ms | 32.4% bf16 MFU | 125950 tok/s step 11418/19560 | loss 3.365632 (-0.49z)| norm 0.2929 (+0.67z)| lr 2.36e-04 | 4167.64 ms | 32.4% bf16 MFU | 125943 tok/s step 11419/19560 | loss 3.388223 (-0.03z)| norm 0.2911 (+0.54z)| lr 2.36e-04 | 
4200.48 ms | 32.1% bf16 MFU | 125886 tok/s step 11420/19560 | loss 3.474455 (+1.72z)| norm 0.2916 (+0.56z)| lr 2.36e-04 | 4164.77 ms | 32.4% bf16 MFU | 125886 tok/s step 11421/19560 | loss 3.417341 (+0.54z)| norm 0.3018 (+1.20z)| lr 2.36e-04 | 4168.44 ms | 32.4% bf16 MFU | 125881 tok/s step 11422/19560 | loss 3.489144 (+2.00z)| norm 0.3259 (+2.68z)| lr 2.36e-04 | 4158.00 ms | 32.5% bf16 MFU | 125891 tok/s step 11423/19560 | loss 3.377702 (-0.27z)| norm 0.3074 (+1.48z)| lr 2.36e-04 | 4164.55 ms | 32.4% bf16 MFU | 125892 tok/s step 11424/19560 | loss 3.328447 (-1.27z)| norm 0.3000 (+1.00z)| lr 2.36e-04 | 4156.05 ms | 32.5% bf16 MFU | 125904 tok/s step 11425/19560 | loss 3.488742 (+1.95z)| norm 0.2932 (+0.56z)| lr 2.36e-04 | 4162.13 ms | 32.4% bf16 MFU | 125908 tok/s step 11426/19560 | loss 3.438047 (+0.92z)| norm 0.2956 (+0.70z)| lr 2.36e-04 | 4154.39 ms | 32.5% bf16 MFU | 125922 tok/s step 11427/19560 | loss 3.311557 (-1.60z)| norm 0.2686 (-1.01z)| lr 2.36e-04 | 4164.74 ms | 32.4% bf16 MFU | 125920 tok/s step 11428/19560 | loss 3.385875 (-0.12z)| norm 0.3162 (+1.98z)| lr 2.36e-04 | 4158.38 ms | 32.5% bf16 MFU | 125928 tok/s step 11429/19560 | loss 3.471592 (+1.57z)| norm 0.2912 (+0.39z)| lr 2.36e-04 | 4151.92 ms | 32.5% bf16 MFU | 125946 tok/s step 11430/19560 | loss 3.393464 (+0.01z)| norm 0.3103 (+1.58z)| lr 2.36e-04 | 4160.00 ms | 32.5% bf16 MFU | 125950 tok/s step 11431/19560 | loss 3.390923 (-0.04z)| norm 0.3031 (+1.10z)| lr 2.36e-04 | 4161.85 ms | 32.4% bf16 MFU | 125951 tok/s step 11432/19560 | loss 3.423669 (+0.60z)| norm 0.2946 (+0.55z)| lr 2.35e-04 | 4170.32 ms | 32.4% bf16 MFU | 125940 tok/s step 11433/19560 | loss 3.345012 (-0.97z)| norm 0.2860 (+0.00z)| lr 2.35e-04 | 4160.30 ms | 32.5% bf16 MFU | 125944 tok/s step 11434/19560 | loss 3.333055 (-1.20z)| norm 0.2664 (-1.24z)| lr 2.35e-04 | 4161.77 ms | 32.4% bf16 MFU | 125945 tok/s step 11435/19560 | loss 3.320477 (-1.43z)| norm 0.2902 (+0.27z)| lr 2.35e-04 | 4165.13 ms | 32.4% bf16 MFU | 125942 tok/s step 
11436/19560 | loss 3.343431 (-0.99z)| norm 0.2798 (-0.38z)| lr 2.35e-04 | 4163.28 ms | 32.4% bf16 MFU | 125941 tok/s step 11437/19560 | loss 3.504827 (+2.28z)| norm 0.2819 (-0.24z)| lr 2.35e-04 | 4152.43 ms | 32.5% bf16 MFU | 125957 tok/s step 11438/19560 | loss 3.330286 (-1.23z)| norm 0.2738 (-0.76z)| lr 2.35e-04 | 4167.35 ms | 32.4% bf16 MFU | 125950 tok/s step 11439/19560 | loss 3.367948 (-0.48z)| norm 0.2940 (+0.57z)| lr 2.35e-04 | 4164.75 ms | 32.4% bf16 MFU | 125947 tok/s step 11440/19560 | loss 3.400960 (+0.18z)| norm 0.2818 (-0.24z)| lr 2.35e-04 | 4162.71 ms | 32.4% bf16 MFU | 125947 tok/s step 11441/19560 | loss 3.395161 (+0.06z)| norm 0.3031 (+1.15z)| lr 2.35e-04 | 4168.82 ms | 32.4% bf16 MFU | 125938 tok/s step 11442/19560 | loss 3.362981 (-0.59z)| norm 0.2858 (+0.00z)| lr 2.35e-04 | 4157.22 ms | 32.5% bf16 MFU | 125947 tok/s step 11443/19560 | loss 3.389959 (-0.04z)| norm 0.3065 (+1.36z)| lr 2.35e-04 | 4168.86 ms | 32.4% bf16 MFU | 125937 tok/s step 11444/19560 | loss 3.415019 (+0.46z)| norm 0.3213 (+2.29z)| lr 2.35e-04 | 4157.02 ms | 32.5% bf16 MFU | 125947 tok/s step 11445/19560 | loss 3.346637 (-0.91z)| norm 0.2845 (-0.10z)| lr 2.35e-04 | 4161.94 ms | 32.4% bf16 MFU | 125948 tok/s step 11446/19560 | loss 3.400467 (+0.18z)| norm 0.2993 (+0.85z)| lr 2.35e-04 | 4157.34 ms | 32.5% bf16 MFU | 125956 tok/s step 11447/19560 | loss 3.369309 (-0.44z)| norm 0.2730 (-0.85z)| lr 2.35e-04 | 4158.43 ms | 32.5% bf16 MFU | 125962 tok/s step 11448/19560 | loss 3.401670 (+0.23z)| norm 0.3017 (+1.00z)| lr 2.35e-04 | 4157.75 ms | 32.5% bf16 MFU | 125969 tok/s step 11449/19560 | loss 3.438472 (+0.98z)| norm 0.2815 (-0.30z)| lr 2.35e-04 | 4163.36 ms | 32.4% bf16 MFU | 125967 tok/s step 11450/19560 | loss 3.338508 (-1.07z)| norm 0.2821 (-0.26z)| lr 2.35e-04 | 4150.80 ms | 32.5% bf16 MFU | 125984 tok/s step 11451/19560 | loss 3.374566 (-0.32z)| norm 0.2733 (-0.83z)| lr 2.35e-04 | 4159.15 ms | 32.5% bf16 MFU | 125988 tok/s step 11452/19560 | loss 3.357261 (-0.68z)| norm 
0.2796 (-0.42z)| lr 2.35e-04 | 4161.80 ms | 32.4% bf16 MFU | 125987 tok/s step 11453/19560 | loss 3.381198 (-0.18z)| norm 0.2935 (+0.49z)| lr 2.34e-04 | 4154.06 ms | 32.5% bf16 MFU | 125998 tok/s step 11454/19560 | loss 3.432940 (+0.88z)| norm 0.2637 (-1.45z)| lr 2.34e-04 | 4149.06 ms | 32.5% bf16 MFU | 126017 tok/s step 11455/19560 | loss 3.348216 (-0.85z)| norm 0.2783 (-0.49z)| lr 2.34e-04 | 4150.35 ms | 32.5% bf16 MFU | 126032 tok/s step 11456/19560 | loss 3.413824 (+0.49z)| norm 0.2638 (-1.42z)| lr 2.34e-04 | 4162.50 ms | 32.4% bf16 MFU | 126028 tok/s step 11457/19560 | loss 3.349391 (-0.83z)| norm 0.2718 (-0.91z)| lr 2.34e-04 | 4153.01 ms | 32.5% bf16 MFU | 126039 tok/s step 11458/19560 | loss 3.442389 (+1.08z)| norm 0.3026 (+1.08z)| lr 2.34e-04 | 4162.44 ms | 32.4% bf16 MFU | 126035 tok/s step 11459/19560 | loss 3.358399 (-0.65z)| norm 0.2589 (-1.73z)| lr 2.34e-04 | 4156.57 ms | 32.5% bf16 MFU | 126040 tok/s step 11460/19560 | loss 3.388990 (-0.02z)| norm 0.2688 (-1.09z)| lr 2.34e-04 | 4156.41 ms | 32.5% bf16 MFU | 126045 tok/s step 11461/19560 | loss 3.344737 (-0.92z)| norm 0.2605 (-1.62z)| lr 2.34e-04 | 4164.21 ms | 32.4% bf16 MFU | 126038 tok/s step 11462/19560 | loss 3.324750 (-1.31z)| norm 0.2809 (-0.31z)| lr 2.34e-04 | 4152.56 ms | 32.5% bf16 MFU | 126049 tok/s step 11463/19560 | loss 3.394806 (+0.11z)| norm 0.2623 (-1.49z)| lr 2.34e-04 | 4162.91 ms | 32.4% bf16 MFU | 126043 tok/s step 11464/19560 | loss 3.347671 (-0.84z)| norm 0.2793 (-0.42z)| lr 2.34e-04 | 4159.26 ms | 32.5% bf16 MFU | 126044 tok/s step 11465/19560 | loss 3.303091 (-1.73z)| norm 0.2544 (-1.99z)| lr 2.34e-04 | 4163.65 ms | 32.4% bf16 MFU | 126038 tok/s step 11466/19560 | loss 3.364858 (-0.49z)| norm 0.2887 (+0.18z)| lr 2.34e-04 | 4153.14 ms | 32.5% bf16 MFU | 126048 tok/s step 11467/19560 | loss 3.355048 (-0.70z)| norm 0.2705 (-0.99z)| lr 2.34e-04 | 4158.25 ms | 32.5% bf16 MFU | 126050 tok/s step 11468/19560 | loss 3.531654 (+3.00z)| norm 0.3009 (+0.96z)| lr 2.34e-04 | 4156.70 ms | 
32.5% bf16 MFU | 126054 tok/s step 11469/19560 | loss 3.333426 (-1.13z)| norm 0.2658 (-1.29z)| lr 2.34e-04 | 4164.79 ms | 32.4% bf16 MFU | 126045 tok/s step 11470/19560 | loss 3.404366 (+0.35z)| norm 0.2744 (-0.74z)| lr 2.34e-04 | 4170.64 ms | 32.4% bf16 MFU | 126028 tok/s step 11471/19560 | loss 3.403218 (+0.32z)| norm 0.2631 (-1.44z)| lr 2.34e-04 | 4153.92 ms | 32.5% bf16 MFU | 126038 tok/s step 11472/19560 | loss 3.358126 (-0.63z)| norm 0.2718 (-0.90z)| lr 2.34e-04 | 4159.30 ms | 32.5% bf16 MFU | 126038 tok/s step 11473/19560 | loss 3.362048 (-0.55z)| norm 0.2691 (-1.06z)| lr 2.33e-04 | 4156.25 ms | 32.5% bf16 MFU | 126044 tok/s step 11474/19560 | loss 3.392819 (+0.10z)| norm 0.2687 (-1.10z)| lr 2.33e-04 | 4160.97 ms | 32.4% bf16 MFU | 126042 tok/s step 11475/19560 | loss 3.391245 (+0.07z)| norm 0.2517 (-2.12z)| lr 2.33e-04 | 4144.38 ms | 32.6% bf16 MFU | 126065 tok/s step 11476/19560 | loss 3.372474 (-0.32z)| norm 0.2789 (-0.43z)| lr 2.33e-04 | 4153.18 ms | 32.5% bf16 MFU | 126073 tok/s step 11477/19560 | loss 3.340509 (-0.99z)| norm 0.2721 (-0.86z)| lr 2.33e-04 | 4163.39 ms | 32.4% bf16 MFU | 126066 tok/s step 11478/19560 | loss 3.388758 (+0.03z)| norm 0.2692 (-1.02z)| lr 2.33e-04 | 4163.46 ms | 32.4% bf16 MFU | 126059 tok/s step 11479/19560 | loss 3.410364 (+0.48z)| norm 0.2876 (+0.13z)| lr 2.33e-04 | 4163.18 ms | 32.4% bf16 MFU | 126053 tok/s step 11480/19560 | loss 3.333293 (-1.15z)| norm 0.2941 (+0.54z)| lr 2.33e-04 | 4160.53 ms | 32.5% bf16 MFU | 126051 tok/s step 11481/19560 | loss 3.353801 (-0.70z)| norm 0.2953 (+0.64z)| lr 2.33e-04 | 4168.87 ms | 32.4% bf16 MFU | 126037 tok/s step 11482/19560 | loss 3.408542 (+0.46z)| norm 0.2916 (+0.40z)| lr 2.33e-04 | 4154.58 ms | 32.5% bf16 MFU | 126045 tok/s step 11483/19560 | loss 3.429203 (+0.89z)| norm 0.2874 (+0.13z)| lr 2.33e-04 | 4150.30 ms | 32.5% bf16 MFU | 126059 tok/s step 11484/19560 | loss 3.409708 (+0.46z)| norm 0.2938 (+0.54z)| lr 2.33e-04 | 4161.20 ms | 32.4% bf16 MFU | 126055 tok/s step 11485/19560 
| loss 3.392702 (+0.09z)| norm 0.2742 (-0.72z)| lr 2.33e-04 | 4154.15 ms | 32.5% bf16 MFU | 126063 tok/s step 11486/19560 | loss 3.368427 (-0.42z)| norm 0.3149 (+1.87z)| lr 2.33e-04 | 4150.04 ms | 32.5% bf16 MFU | 126077 tok/s step 11487/19560 | loss 3.308976 (-1.69z)| norm 0.2696 (-1.00z)| lr 2.33e-04 | 4157.07 ms | 32.5% bf16 MFU | 126079 tok/s step 11488/19560 | loss 3.374778 (-0.26z)| norm 0.2914 (+0.37z)| lr 2.33e-04 | 4157.59 ms | 32.5% bf16 MFU | 126080 tok/s step 11489/19560 | loss 3.376966 (-0.20z)| norm 0.2689 (-1.05z)| lr 2.33e-04 | 4158.14 ms | 32.5% bf16 MFU | 126080 tok/s step 11490/19560 | loss 3.381275 (-0.11z)| norm 0.2846 (-0.04z)| lr 2.33e-04 | 4155.07 ms | 32.5% bf16 MFU | 126085 tok/s step 11491/19560 | loss 3.312628 (-1.59z)| norm 0.2986 (+0.87z)| lr 2.33e-04 | 4151.08 ms | 32.5% bf16 MFU | 126096 tok/s step 11492/19560 | loss 3.374385 (-0.25z)| norm 0.2822 (-0.18z)| lr 2.33e-04 | 4152.94 ms | 32.5% bf16 MFU | 126104 tok/s step 11493/19560 | loss 3.383246 (-0.05z)| norm 0.2675 (-1.12z)| lr 2.33e-04 | 4192.54 ms | 32.2% bf16 MFU | 126051 tok/s step 11494/19560 | loss 3.392361 (+0.14z)| norm 0.2912 (+0.41z)| lr 2.32e-04 | 4148.56 ms | 32.5% bf16 MFU | 126067 tok/s step 11495/19560 | loss 3.391046 (+0.11z)| norm 0.2810 (-0.24z)| lr 2.32e-04 | 4157.92 ms | 32.5% bf16 MFU | 126069 tok/s step 11496/19560 | loss 3.393220 (+0.15z)| norm 0.2542 (-1.94z)| lr 2.32e-04 | 4151.59 ms | 32.5% bf16 MFU | 126080 tok/s step 11497/19560 | loss 3.371105 (-0.33z)| norm 0.2885 (+0.26z)| lr 2.32e-04 | 4158.91 ms | 32.5% bf16 MFU | 126079 tok/s step 11498/19560 | loss 3.316892 (-1.48z)| norm 0.2581 (-1.65z)| lr 2.32e-04 | 4158.80 ms | 32.5% bf16 MFU | 126078 tok/s step 11499/19560 | loss 3.391583 (+0.13z)| norm 0.2786 (-0.34z)| lr 2.32e-04 | 4149.22 ms | 32.5% bf16 MFU | 126092 tok/s step 11500/19560 | loss 3.437909 (+1.14z)| norm 0.2593 (-1.54z)| lr 2.32e-04 | 4148.68 ms | 32.5% bf16 MFU | 126106 tok/s val loss 3.356038 evaluating HellaSwag: 0/1256 evaluating 
HellaSwag: 650/1256
evaluating HellaSwag: 1250/1256 HellaSwag: 2912/10042 = 0.289982 step 11501/19560 | loss 3.385377 (-0.00z)| norm 0.2722 (-0.71z)| lr 2.32e-04 |
4157.84 ms | 32.5% bf16 MFU | 126106 tok/s step 11502/19560 | loss 3.314154 (-1.53z)| norm 0.2760 (-0.47z)| lr 2.32e-04 | 4161.03 ms | 32.4% bf16 MFU | 126100 tok/s step 11503/19560 | loss 3.346195 (-0.83z)| norm 0.2666 (-1.05z)| lr 2.32e-04 | 4158.13 ms | 32.5% bf16 MFU | 126100 tok/s step 11504/19560 | loss 3.378174 (-0.15z)| norm 0.2613 (-1.37z)| lr 2.32e-04 | 4163.75 ms | 32.4% bf16 MFU | 126091 tok/s step 11505/19560 | loss 3.394816 (+0.25z)| norm 0.2500 (-2.05z)| lr 2.32e-04 | 4161.24 ms | 32.4% bf16 MFU | 126086 tok/s step 11506/19560 | loss 3.343375 (-0.90z)| norm 0.2747 (-0.48z)| lr 2.32e-04 | 4160.10 ms | 32.5% bf16 MFU | 126083 tok/s step 11507/19560 | loss 3.377599 (-0.13z)| norm 0.2527 (-1.83z)| lr 2.32e-04 | 4159.47 ms | 32.5% bf16 MFU | 126081 tok/s step 11508/19560 | loss 3.324211 (-1.32z)| norm 0.2624 (-1.22z)| lr 2.32e-04 | 4149.95 ms | 32.5% bf16 MFU | 126094 tok/s step 11509/19560 | loss 3.390578 (+0.17z)| norm 0.2612 (-1.28z)| lr 2.32e-04 | 4152.27 ms | 32.5% bf16 MFU | 126102 tok/s step 11510/19560 | loss 3.345925 (-0.83z)| norm 0.2972 (+0.97z)| lr 2.32e-04 | 4160.10 ms | 32.5% bf16 MFU | 126099 tok/s step 11511/19560 | loss 3.412683 (+0.70z)| norm 0.2478 (-2.08z)| lr 2.32e-04 | 4156.66 ms | 32.5% bf16 MFU | 126100 tok/s step 11512/19560 | loss 3.339479 (-0.98z)| norm 0.2597 (-1.31z)| lr 2.32e-04 | 4156.36 ms | 32.5% bf16 MFU | 126102 tok/s step 11513/19560 | loss 3.374183 (-0.16z)| norm 0.2715 (-0.58z)| lr 2.32e-04 | 4152.91 ms | 32.5% bf16 MFU | 126110 tok/s step 11514/19560 | loss 3.399868 (+0.45z)| norm 0.2607 (-1.23z)| lr 2.31e-04 | 4155.01 ms | 32.5% bf16 MFU | 126113 tok/s step 11515/19560 | loss 3.402196 (+0.51z)| norm 0.2665 (-0.86z)| lr 2.31e-04 | 4158.11 ms | 32.5% bf16 MFU | 126112 tok/s step 11516/19560 | loss 3.387575 (+0.17z)| norm 0.2668 (-0.82z)| lr 2.31e-04 | 4159.62 ms | 32.5% bf16 MFU | 126108 tok/s step 11517/19560 | loss 3.395481 (+0.36z)| norm 0.2922 (+0.76z)| lr 2.31e-04 | 4161.54 ms | 32.4% bf16 MFU | 126102 tok/s step 
11518/19560 | loss 3.405898 (+0.60z)| norm 0.2620 (-1.11z)| lr 2.31e-04 | 4164.20 ms | 32.4% bf16 MFU | 126092 tok/s step 11519/19560 | loss 3.357767 (-0.55z)| norm 0.2749 (-0.30z)| lr 2.31e-04 | 4153.68 ms | 32.5% bf16 MFU | 126099 tok/s step 11520/19560 | loss 3.346875 (-0.80z)| norm 0.2654 (-0.89z)| lr 2.31e-04 | 4147.12 ms | 32.6% bf16 MFU | 126115 tok/s step 11521/19560 | loss 3.403127 (+0.54z)| norm 0.2734 (-0.39z)| lr 2.31e-04 | 4150.98 ms | 32.5% bf16 MFU | 126124 tok/s step 11522/19560 | loss 3.338921 (-0.99z)| norm 0.2759 (-0.23z)| lr 2.31e-04 | 4158.69 ms | 32.5% bf16 MFU | 126122 tok/s step 11523/19560 | loss 3.388236 (+0.19z)| norm 0.2628 (-1.03z)| lr 2.31e-04 | 4160.55 ms | 32.5% bf16 MFU | 126116 tok/s step 11524/19560 | loss 3.347070 (-0.80z)| norm 0.2841 (+0.28z)| lr 2.31e-04 | 4153.73 ms | 32.5% bf16 MFU | 126122 tok/s step 11525/19560 | loss 3.310604 (-1.64z)| norm 0.2712 (-0.53z)| lr 2.31e-04 | 4164.29 ms | 32.4% bf16 MFU | 126111 tok/s step 11526/19560 | loss 3.376841 (-0.07z)| norm 0.2932 (+0.84z)| lr 2.31e-04 | 4149.40 ms | 32.5% bf16 MFU | 126123 tok/s step 11527/19560 | loss 3.398758 (+0.45z)| norm 0.2809 (+0.07z)| lr 2.31e-04 | 4164.04 ms | 32.4% bf16 MFU | 126112 tok/s step 11528/19560 | loss 3.394402 (+0.35z)| norm 0.2926 (+0.79z)| lr 2.31e-04 | 4155.13 ms | 32.5% bf16 MFU | 126115 tok/s step 11529/19560 | loss 3.392984 (+0.31z)| norm 0.2695 (-0.63z)| lr 2.31e-04 | 4152.91 ms | 32.5% bf16 MFU | 126122 tok/s step 11530/19560 | loss 3.363874 (-0.38z)| norm 0.2812 (+0.09z)| lr 2.31e-04 | 4157.55 ms | 32.5% bf16 MFU | 126121 tok/s step 11531/19560 | loss 3.386501 (+0.15z)| norm 0.2825 (+0.18z)| lr 2.31e-04 | 4160.85 ms | 32.4% bf16 MFU | 126115 tok/s step 11532/19560 | loss 3.368554 (-0.29z)| norm 0.2794 (-0.02z)| lr 2.31e-04 | 4147.68 ms | 32.6% bf16 MFU | 126130 tok/s step 11533/19560 | loss 3.353304 (-0.65z)| norm 0.2808 (+0.06z)| lr 2.31e-04 | 4164.10 ms | 32.4% bf16 MFU | 126119 tok/s step 11534/19560 | loss 3.406559 (+0.62z)| norm 
0.2725 (-0.46z)| lr 2.31e-04 | 4156.30 ms | 32.5% bf16 MFU | 126120 tok/s step 11535/19560 | loss 3.345538 (-0.83z)| norm 0.2595 (-1.27z)| lr 2.30e-04 | 4159.73 ms | 32.5% bf16 MFU | 126116 tok/s step 11536/19560 | loss 3.393867 (+0.32z)| norm 0.2685 (-0.69z)| lr 2.30e-04 | 4156.47 ms | 32.5% bf16 MFU | 126117 tok/s step 11537/19560 | loss 3.349696 (-0.76z)| norm 0.2844 (+0.31z)| lr 2.30e-04 | 4153.87 ms | 32.5% bf16 MFU | 126122 tok/s step 11538/19560 | loss 3.378413 (-0.06z)| norm 0.2629 (-1.04z)| lr 2.30e-04 | 4162.45 ms | 32.4% bf16 MFU | 126114 tok/s step 11539/19560 | loss 3.332524 (-1.17z)| norm 0.2813 (+0.12z)| lr 2.30e-04 | 4161.95 ms | 32.4% bf16 MFU | 126106 tok/s step 11540/19560 | loss 3.434882 (+1.31z)| norm 0.2789 (-0.03z)| lr 2.30e-04 | 4164.24 ms | 32.4% bf16 MFU | 126096 tok/s step 11541/19560 | loss 3.370042 (-0.25z)| norm 0.2758 (-0.23z)| lr 2.30e-04 | 4150.87 ms | 32.5% bf16 MFU | 126107 tok/s step 11542/19560 | loss 3.344331 (-0.86z)| norm 0.2921 (+0.80z)| lr 2.30e-04 | 4155.43 ms | 32.5% bf16 MFU | 126110 tok/s step 11543/19560 | loss 3.407717 (+0.67z)| norm 0.2809 (+0.08z)| lr 2.30e-04 | 4168.94 ms | 32.4% bf16 MFU | 126092 tok/s step 11544/19560 | loss 3.361970 (-0.42z)| norm 0.2834 (+0.23z)| lr 2.30e-04 | 4148.96 ms | 32.5% bf16 MFU | 126106 tok/s step 11545/19560 | loss 3.434280 (+1.34z)| norm 0.2835 (+0.24z)| lr 2.30e-04 | 4149.50 ms | 32.5% bf16 MFU | 126118 tok/s step 11546/19560 | loss 3.344085 (-0.88z)| norm 0.2922 (+0.79z)| lr 2.30e-04 | 4151.19 ms | 32.5% bf16 MFU | 126127 tok/s step 11547/19560 | loss 3.378814 (-0.02z)| norm 0.2783 (-0.10z)| lr 2.30e-04 | 4165.67 ms | 32.4% bf16 MFU | 126114 tok/s step 11548/19560 | loss 3.357434 (-0.53z)| norm 0.3040 (+1.55z)| lr 2.30e-04 | 4196.73 ms | 32.2% bf16 MFU | 126055 tok/s step 11549/19560 | loss 3.370536 (-0.20z)| norm 0.2769 (-0.18z)| lr 2.30e-04 | 4179.40 ms | 32.3% bf16 MFU | 126024 tok/s step 11550/19560 | loss 3.396891 (+0.50z)| norm 0.3041 (+1.63z)| lr 2.30e-04 | 4180.42 ms | 
32.3% bf16 MFU | 125994 tok/s step 11551/19560 | loss 3.322620 (-1.41z)| norm 0.2706 (-0.58z)| lr 2.30e-04 | 4166.49 ms | 32.4% bf16 MFU | 125986 tok/s step 11552/19560 | loss 3.323267 (-1.39z)| norm 0.2953 (+1.09z)| lr 2.30e-04 | 4174.32 ms | 32.3% bf16 MFU | 125966 tok/s step 11553/19560 | loss 3.319266 (-1.49z)| norm 0.2811 (+0.14z)| lr 2.30e-04 | 4160.05 ms | 32.5% bf16 MFU | 125970 tok/s step 11554/19560 | loss 3.366019 (-0.25z)| norm 0.2764 (-0.17z)| lr 2.30e-04 | 4161.81 ms | 32.4% bf16 MFU | 125970 tok/s step 11555/19560 | loss 3.384021 (+0.22z)| norm 0.2890 (+0.67z)| lr 2.30e-04 | 4163.23 ms | 32.4% bf16 MFU | 125968 tok/s step 11556/19560 | loss 3.382355 (+0.18z)| norm 0.2721 (-0.46z)| lr 2.29e-04 | 4174.18 ms | 32.3% bf16 MFU | 125950 tok/s step 11557/19560 | loss 3.369612 (-0.15z)| norm 0.3062 (+1.88z)| lr 2.29e-04 | 4169.74 ms | 32.4% bf16 MFU | 125939 tok/s step 11558/19560 | loss 3.396260 (+0.59z)| norm 0.2678 (-0.75z)| lr 2.29e-04 | 4149.69 ms | 32.5% bf16 MFU | 125959 tok/s step 11559/19560 | loss 3.358024 (-0.46z)| norm 0.2749 (-0.24z)| lr 2.29e-04 | 4166.26 ms | 32.4% bf16 MFU | 125953 tok/s step 11560/19560 | loss 3.296602 (-2.11z)| norm 0.2768 (-0.09z)| lr 2.29e-04 | 4171.11 ms | 32.4% bf16 MFU | 125940 tok/s step 11561/19560 | loss 3.357138 (-0.46z)| norm 0.2786 (+0.04z)| lr 2.29e-04 | 4172.49 ms | 32.4% bf16 MFU | 125926 tok/s step 11562/19560 | loss 3.364567 (-0.26z)| norm 0.2866 (+0.59z)| lr 2.29e-04 | 4162.27 ms | 32.4% bf16 MFU | 125928 tok/s step 11563/19560 | loss 3.364329 (-0.28z)| norm 0.2626 (-1.10z)| lr 2.29e-04 | 4163.53 ms | 32.4% bf16 MFU | 125928 tok/s step 11564/19560 | loss 3.393453 (+0.52z)| norm 0.2726 (-0.38z)| lr 2.29e-04 | 4152.27 ms | 32.5% bf16 MFU | 125945 tok/s step 11565/19560 | loss 3.318457 (-1.60z)| norm 0.2605 (-1.22z)| lr 2.29e-04 | 4196.02 ms | 32.2% bf16 MFU | 125895 tok/s step 11566/19560 | loss 3.374004 (+0.01z)| norm 0.2628 (-1.05z)| lr 2.29e-04 | 4181.86 ms | 32.3% bf16 MFU | 125869 tok/s step 11567/19560 
| loss 3.346943 (-0.78z)| norm 0.2643 (-0.93z)| lr 2.29e-04 | 4168.61 ms | 32.4% bf16 MFU | 125864 tok/s step 11568/19560 | loss 3.360064 (-0.39z)| norm 0.2602 (-1.20z)| lr 2.29e-04 | 4196.65 ms | 32.2% bf16 MFU | 125817 tok/s step 11569/19560 | loss 3.383379 (+0.30z)| norm 0.2827 (+0.39z)| lr 2.29e-04 | 4166.44 ms | 32.4% bf16 MFU | 125818 tok/s step 11570/19560 | loss 3.344994 (-0.82z)| norm 0.2636 (-0.95z)| lr 2.29e-04 | 4159.04 ms | 32.5% bf16 MFU | 125830 tok/s step 11571/19560 | loss 3.387509 (+0.42z)| norm 0.2796 (+0.21z)| lr 2.29e-04 | 4166.25 ms | 32.4% bf16 MFU | 125831 tok/s step 11572/19560 | loss 3.427761 (+1.59z)| norm 0.2949 (+1.37z)| lr 2.29e-04 | 4176.33 ms | 32.3% bf16 MFU | 125816 tok/s step 11573/19560 | loss 3.328318 (-1.29z)| norm 0.2890 (+0.93z)| lr 2.29e-04 | 4156.16 ms | 32.5% bf16 MFU | 125833 tok/s step 11574/19560 | loss 3.352390 (-0.59z)| norm 0.2810 (+0.34z)| lr 2.29e-04 | 4182.70 ms | 32.3% bf16 MFU | 125808 tok/s step 11575/19560 | loss 3.417916 (+1.29z)| norm 0.3447 (+4.65z)| lr 2.29e-04 | 4184.33 ms | 32.3% bf16 MFU | 125783 tok/s step 11576/19560 | loss 3.383531 (+0.31z)| norm 0.3210 (+2.94z)| lr 2.28e-04 | 4159.15 ms | 32.5% bf16 MFU | 125796 tok/s step 11577/19560 | loss 3.407285 (+1.01z)| norm 0.2860 (+0.59z)| lr 2.28e-04 | 4172.82 ms | 32.4% bf16 MFU | 125789 tok/s step 11578/19560 | loss 3.390443 (+0.51z)| norm 0.3077 (+2.00z)| lr 2.28e-04 | 4163.41 ms | 32.4% bf16 MFU | 125796 tok/s step 11579/19560 | loss 3.331697 (-1.19z)| norm 0.2776 (+0.01z)| lr 2.28e-04 | 4157.73 ms | 32.5% bf16 MFU | 125811 tok/s step 11580/19560 | loss 3.360174 (-0.36z)| norm 0.2760 (-0.10z)| lr 2.28e-04 | 4149.79 ms | 32.5% bf16 MFU | 125837 tok/s step 11581/19560 | loss 3.350128 (-0.65z)| norm 0.2981 (+1.36z)| lr 2.28e-04 | 4175.22 ms | 32.3% bf16 MFU | 125824 tok/s step 11582/19560 | loss 3.344136 (-0.81z)| norm 0.2841 (+0.43z)| lr 2.28e-04 | 4164.29 ms | 32.4% bf16 MFU | 125828 tok/s step 11583/19560 | loss 3.350808 (-0.62z)| norm 0.2996 (+1.43z)| 
lr 2.28e-04 | 4161.37 ms | 32.4% bf16 MFU | 125836 tok/s step 11584/19560 | loss 3.381256 (+0.29z)| norm 0.2905 (+0.82z)| lr 2.28e-04 | 4173.55 ms | 32.4% bf16 MFU | 125825 tok/s step 11585/19560 | loss 3.328228 (-1.27z)| norm 0.2808 (+0.18z)| lr 2.28e-04 | 4149.24 ms | 32.5% bf16 MFU | 125852 tok/s step 11586/19560 | loss 3.347054 (-0.70z)| norm 0.2908 (+0.85z)| lr 2.28e-04 | 4167.60 ms | 32.4% bf16 MFU | 125849 tok/s step 11587/19560 | loss 3.417067 (+1.36z)| norm 0.2877 (+0.63z)| lr 2.28e-04 | 4170.54 ms | 32.4% bf16 MFU | 125843 tok/s step 11588/19560 | loss 3.348017 (-0.67z)| norm 0.2847 (+0.43z)| lr 2.28e-04 | 4181.35 ms | 32.3% bf16 MFU | 125820 tok/s step 11589/19560 | loss 3.357281 (-0.40z)| norm 0.3011 (+1.49z)| lr 2.28e-04 | 4171.92 ms | 32.4% bf16 MFU | 125812 tok/s step 11590/19560 | loss 3.498803 (+3.59z)| norm 0.3267 (+3.04z)| lr 2.28e-04 | 4170.10 ms | 32.4% bf16 MFU | 125808 tok/s step 11591/19560 | loss 3.310541 (-1.71z)| norm 0.3125 (+2.08z)| lr 2.28e-04 | 4186.50 ms | 32.3% bf16 MFU | 125779 tok/s step 11592/19560 | loss 3.365394 (-0.18z)| norm 0.2931 (+0.85z)| lr 2.28e-04 | 4164.08 ms | 32.4% bf16 MFU | 125786 tok/s step 11593/19560 | loss 3.322132 (-1.41z)| norm 0.2883 (+0.54z)| lr 2.28e-04 | 4163.23 ms | 32.4% bf16 MFU | 125793 tok/s step 11594/19560 | loss 3.365019 (-0.19z)| norm 0.2983 (+1.16z)| lr 2.28e-04 | 4163.20 ms | 32.4% bf16 MFU | 125800 tok/s step 11595/19560 | loss 3.349103 (-0.64z)| norm 0.2768 (-0.19z)| lr 2.28e-04 | 4192.93 ms | 32.2% bf16 MFU | 125762 tok/s step 11596/19560 | loss 3.360735 (-0.30z)| norm 0.2960 (+1.02z)| lr 2.28e-04 | 4171.94 ms | 32.4% bf16 MFU | 125757 tok/s step 11597/19560 | loss 3.401831 (+0.95z)| norm 0.2668 (-0.82z)| lr 2.27e-04 | 4171.74 ms | 32.4% bf16 MFU | 125753 tok/s step 11598/19560 | loss 3.357567 (-0.41z)| norm 0.2922 (+0.77z)| lr 2.27e-04 | 4165.89 ms | 32.4% bf16 MFU | 125758 tok/s step 11599/19560 | loss 3.309583 (-1.85z)| norm 0.2929 (+0.80z)| lr 2.27e-04 | 4176.81 ms | 32.3% bf16 MFU | 
125747 tok/s step 11600/19560 | loss 3.411720 (+1.26z)| norm 0.2971 (+1.05z)| lr 2.27e-04 | 4168.60 ms | 32.4% bf16 MFU | 125748 tok/s step 11601/19560 | loss 3.308734 (-1.84z)| norm 0.2844 (+0.25z)| lr 2.27e-04 | 4163.48 ms | 32.4% bf16 MFU | 125757 tok/s step 11602/19560 | loss 3.380579 (+0.32z)| norm 0.3063 (+1.60z)| lr 2.27e-04 | 4165.96 ms | 32.4% bf16 MFU | 125761 tok/s step 11603/19560 | loss 3.319467 (-1.49z)| norm 0.2635 (-1.10z)| lr 2.27e-04 | 4164.52 ms | 32.4% bf16 MFU | 125768 tok/s step 11604/19560 | loss 3.361640 (-0.23z)| norm 0.3002 (+1.20z)| lr 2.27e-04 | 4199.44 ms | 32.2% bf16 MFU | 125722 tok/s step 11605/19560 | loss 3.370260 (+0.03z)| norm 0.2674 (-0.86z)| lr 2.27e-04 | 4183.71 ms | 32.3% bf16 MFU | 125702 tok/s step 11606/19560 | loss 3.353007 (-0.48z)| norm 0.3123 (+1.92z)| lr 2.27e-04 | 4185.29 ms | 32.3% bf16 MFU | 125680 tok/s step 11607/19560 | loss 3.387029 (+0.55z)| norm 0.2723 (-0.55z)| lr 2.27e-04 | 4158.89 ms | 32.5% bf16 MFU | 125699 tok/s step 11608/19560 | loss 3.358879 (-0.31z)| norm 0.2766 (-0.28z)| lr 2.27e-04 | 4186.21 ms | 32.3% bf16 MFU | 125676 tok/s step 11609/19560 | loss 3.383152 (+0.42z)| norm 0.2686 (-0.76z)| lr 2.27e-04 | 4178.63 ms | 32.3% bf16 MFU | 125666 tok/s step 11610/19560 | loss 3.380377 (+0.34z)| norm 0.2859 (+0.31z)| lr 2.27e-04 | 4184.70 ms | 32.3% bf16 MFU | 125647 tok/s step 11611/19560 | loss 3.389981 (+0.65z)| norm 0.2785 (-0.15z)| lr 2.27e-04 | 4176.60 ms | 32.3% bf16 MFU | 125641 tok/s step 11612/19560 | loss 3.371429 (+0.09z)| norm 0.3116 (+1.89z)| lr 2.27e-04 | 4183.19 ms | 32.3% bf16 MFU | 125626 tok/s step 11613/19560 | loss 3.376651 (+0.26z)| norm 0.2672 (-0.84z)| lr 2.27e-04 | 4159.16 ms | 32.5% bf16 MFU | 125647 tok/s step 11614/19560 | loss 3.313013 (-1.69z)| norm 0.2651 (-0.96z)| lr 2.27e-04 | 4159.21 ms | 32.5% bf16 MFU | 125668 tok/s step 11615/19560 | loss 3.369442 (+0.03z)| norm 0.2771 (-0.21z)| lr 2.27e-04 | 4190.38 ms | 32.2% bf16 MFU | 125640 tok/s step 11616/19560 | loss 3.335146 
(-1.02z)| norm 0.2715 (-0.55z)| lr 2.27e-04 | 4174.19 ms | 32.3% bf16 MFU | 125638 tok/s step 11617/19560 | loss 3.385139 (+0.52z)| norm 0.2724 (-0.50z)| lr 2.26e-04 | 4151.54 ms | 32.5% bf16 MFU | 125671 tok/s step 11618/19560 | loss 3.438370 (+2.12z)| norm 0.2628 (-1.08z)| lr 2.26e-04 | 4174.28 ms | 32.3% bf16 MFU | 125667 tok/s step 11619/19560 | loss 3.373699 (+0.14z)| norm 0.2625 (-1.09z)| lr 2.26e-04 | 4167.86 ms | 32.4% bf16 MFU | 125673 tok/s step 11620/19560 | loss 3.348741 (-0.62z)| norm 0.2628 (-1.06z)| lr 2.26e-04 | 4165.25 ms | 32.4% bf16 MFU | 125683 tok/s step 11621/19560 | loss 3.368562 (-0.01z)| norm 0.2527 (-1.66z)| lr 2.26e-04 | 4165.35 ms | 32.4% bf16 MFU | 125693 tok/s step 11622/19560 | loss 3.316722 (-1.57z)| norm 0.2560 (-1.43z)| lr 2.26e-04 | 4151.25 ms | 32.5% bf16 MFU | 125723 tok/s step 11623/19560 | loss 3.414225 (+1.39z)| norm 0.2839 (+0.27z)| lr 2.26e-04 | 4210.66 ms | 32.1% bf16 MFU | 125662 tok/s step 11624/19560 | loss 3.323210 (-1.35z)| norm 0.2698 (-0.60z)| lr 2.26e-04 | 4159.07 ms | 32.5% bf16 MFU | 125682 tok/s step 11625/19560 | loss 3.344018 (-0.71z)| norm 0.2682 (-0.69z)| lr 2.26e-04 | 4163.86 ms | 32.4% bf16 MFU | 125694 tok/s step 11626/19560 | loss 3.414953 (+1.41z)| norm 0.2907 (+0.68z)| lr 2.26e-04 | 4163.98 ms | 32.4% bf16 MFU | 125705 tok/s step 11627/19560 | loss 3.329993 (-1.14z)| norm 0.2826 (+0.18z)| lr 2.26e-04 | 4172.79 ms | 32.4% bf16 MFU | 125702 tok/s step 11628/19560 | loss 3.394947 (+0.84z)| norm 0.2736 (-0.39z)| lr 2.26e-04 | 4170.82 ms | 32.4% bf16 MFU | 125702 tok/s step 11629/19560 | loss 3.430317 (+1.88z)| norm 0.2804 (+0.03z)| lr 2.26e-04 | 4169.44 ms | 32.4% bf16 MFU | 125704 tok/s step 11630/19560 | loss 3.375264 (+0.21z)| norm 0.2920 (+0.75z)| lr 2.26e-04 | 4162.68 ms | 32.4% bf16 MFU | 125716 tok/s step 11631/19560 | loss 3.418240 (+1.49z)| norm 0.2819 (+0.11z)| lr 2.26e-04 | 4164.39 ms | 32.4% bf16 MFU | 125725 tok/s step 11632/19560 | loss 3.429123 (+1.79z)| norm 0.3112 (+1.89z)| lr 2.26e-04 | 
4169.47 ms | 32.4% bf16 MFU | 125726 tok/s step 11633/19560 | loss 3.360234 (-0.27z)| norm 0.2765 (-0.27z)| lr 2.26e-04 | 4156.21 ms | 32.5% bf16 MFU | 125747 tok/s step 11634/19560 | loss 3.350087 (-0.57z)| norm 0.2812 (+0.03z)| lr 2.26e-04 | 4166.89 ms | 32.4% bf16 MFU | 125751 tok/s step 11635/19560 | loss 3.350929 (-0.54z)| norm 0.2940 (+0.82z)| lr 2.26e-04 | 4163.93 ms | 32.4% bf16 MFU | 125759 tok/s step 11636/19560 | loss 3.352105 (-0.52z)| norm 0.2874 (+0.38z)| lr 2.26e-04 | 4152.71 ms | 32.5% bf16 MFU | 125784 tok/s step 11637/19560 | loss 3.337398 (-0.94z)| norm 0.3043 (+1.44z)| lr 2.26e-04 | 4171.47 ms | 32.4% bf16 MFU | 125779 tok/s step 11638/19560 | loss 3.360651 (-0.25z)| norm 0.2594 (-1.39z)| lr 2.25e-04 | 4160.68 ms | 32.5% bf16 MFU | 125790 tok/s step 11639/19560 | loss 3.316326 (-1.56z)| norm 0.3090 (+1.72z)| lr 2.25e-04 | 4174.89 ms | 32.3% bf16 MFU | 125780 tok/s step 11640/19560 | loss 3.313791 (-1.62z)| norm 0.2611 (-1.32z)| lr 2.25e-04 | 4170.92 ms | 32.4% bf16 MFU | 125776 tok/s step 11641/19560 | loss 3.399206 (+0.92z)| norm 0.2772 (-0.30z)| lr 2.25e-04 | 4182.23 ms | 32.3% bf16 MFU | 125755 tok/s step 11642/19560 | loss 3.337182 (-0.91z)| norm 0.2800 (-0.13z)| lr 2.25e-04 | 4173.27 ms | 32.4% bf16 MFU | 125749 tok/s step 11643/19560 | loss 3.330608 (-1.08z)| norm 0.2722 (-0.63z)| lr 2.25e-04 | 4159.24 ms | 32.5% bf16 MFU | 125764 tok/s step 11644/19560 | loss 3.402136 (+1.04z)| norm 0.2623 (-1.26z)| lr 2.25e-04 | 4330.50 ms | 31.2% bf16 MFU | 125529 tok/s step 11645/19560 | loss 3.299808 (-1.95z)| norm 0.2605 (-1.36z)| lr 2.25e-04 | 4181.24 ms | 32.3% bf16 MFU | 125522 tok/s step 11646/19560 | loss 3.321486 (-1.30z)| norm 0.2654 (-1.05z)| lr 2.25e-04 | 4175.71 ms | 32.3% bf16 MFU | 125524 tok/s step 11647/19560 | loss 3.378123 (+0.36z)| norm 0.2822 (+0.03z)| lr 2.25e-04 | 4166.73 ms | 32.4% bf16 MFU | 125539 tok/s step 11648/19560 | loss 3.311607 (-1.57z)| norm 0.2747 (-0.46z)| lr 2.25e-04 | 4172.42 ms | 32.4% bf16 MFU | 125545 tok/s step 
11649/19560 | loss 3.356155 (-0.27z)| norm 0.2585 (-1.49z)| lr 2.25e-04 | 4176.76 ms | 32.3% bf16 MFU | 125544 tok/s step 11650/19560 | loss 3.350106 (-0.45z)| norm 0.2549 (-1.69z)| lr 2.25e-04 | 4164.83 ms | 32.4% bf16 MFU | 125561 tok/s step 11651/19560 | loss 3.320876 (-1.28z)| norm 0.2732 (-0.54z)| lr 2.25e-04 | 4177.91 ms | 32.3% bf16 MFU | 125558 tok/s step 11652/19560 | loss 3.411458 (+1.33z)| norm 0.2596 (-1.37z)| lr 2.25e-04 | 4160.27 ms | 32.5% bf16 MFU | 125581 tok/s step 11653/19560 | loss 3.378975 (+0.38z)| norm 0.2871 (+0.34z)| lr 2.25e-04 | 4148.97 ms | 32.5% bf16 MFU | 125620 tok/s step 11654/19560 | loss 3.343348 (-0.65z)| norm 0.2590 (-1.40z)| lr 2.25e-04 | 4160.46 ms | 32.5% bf16 MFU | 125640 tok/s step 11655/19560 | loss 3.341575 (-0.69z)| norm 0.2924 (+0.68z)| lr 2.25e-04 | 4160.15 ms | 32.5% bf16 MFU | 125659 tok/s step 11656/19560 | loss 3.394714 (+0.86z)| norm 0.2842 (+0.17z)| lr 2.25e-04 | 4169.12 ms | 32.4% bf16 MFU | 125664 tok/s step 11657/19560 | loss 3.405710 (+1.17z)| norm 0.2663 (-0.94z)| lr 2.25e-04 | 4166.42 ms | 32.4% bf16 MFU | 125673 tok/s step 11658/19560 | loss 3.302266 (-1.79z)| norm 0.2757 (-0.36z)| lr 2.25e-04 | 4160.85 ms | 32.4% bf16 MFU | 125689 tok/s step 11659/19560 | loss 3.391395 (+0.76z)| norm 0.2567 (-1.51z)| lr 2.24e-04 | 4157.67 ms | 32.5% bf16 MFU | 125710 tok/s step 11660/19560 | loss 3.365238 (+0.01z)| norm 0.2817 (+0.03z)| lr 2.24e-04 | 4160.72 ms | 32.5% bf16 MFU | 125725 tok/s step 11661/19560 | loss 3.358426 (-0.18z)| norm 0.3004 (+1.17z)| lr 2.24e-04 | 4176.44 ms | 32.3% bf16 MFU | 125715 tok/s step 11662/19560 | loss 3.327353 (-1.06z)| norm 0.3096 (+1.70z)| lr 2.24e-04 | 4169.93 ms | 32.4% bf16 MFU | 125716 tok/s step 11663/19560 | loss 3.345382 (-0.54z)| norm 0.2883 (+0.39z)| lr 2.24e-04 | 4181.58 ms | 32.3% bf16 MFU | 125699 tok/s step 11664/19560 | loss 3.350382 (-0.39z)| norm 0.2914 (+0.58z)| lr 2.24e-04 | 4154.14 ms | 32.5% bf16 MFU | 125725 tok/s step 11665/19560 | loss 3.377312 (+0.38z)| norm 
0.2871 (+0.31z)| lr 2.24e-04 | 4160.13 ms | 32.5% bf16 MFU | 125740 tok/s step 11666/19560 | loss 3.340986 (-0.66z)| norm 0.2712 (-0.67z)| lr 2.24e-04 | 4175.53 ms | 32.3% bf16 MFU | 125731 tok/s step 11667/19560 | loss 3.333286 (-0.88z)| norm 0.2959 (+0.84z)| lr 2.24e-04 | 4154.97 ms | 32.5% bf16 MFU | 125754 tok/s step 11668/19560 | loss 3.420720 (+1.65z)| norm 0.2744 (-0.48z)| lr 2.24e-04 | 4167.42 ms | 32.4% bf16 MFU | 125756 tok/s step 11669/19560 | loss 3.379313 (+0.45z)| norm 0.2689 (-0.81z)| lr 2.24e-04 | 4162.09 ms | 32.4% bf16 MFU | 125767 tok/s step 11670/19560 | loss 3.337736 (-0.75z)| norm 0.3187 (+2.19z)| lr 2.24e-04 | 4180.73 ms | 32.3% bf16 MFU | 125749 tok/s step 11671/19560 | loss 3.391273 (+0.80z)| norm 0.2861 (+0.22z)| lr 2.24e-04 | 4167.28 ms | 32.4% bf16 MFU | 125752 tok/s step 11672/19560 | loss 3.339028 (-0.71z)| norm 0.2859 (+0.21z)| lr 2.24e-04 | 4164.79 ms | 32.4% bf16 MFU | 125759 tok/s step 11673/19560 | loss 3.353928 (-0.26z)| norm 0.2821 (-0.02z)| lr 2.24e-04 | 4179.78 ms | 32.3% bf16 MFU | 125742 tok/s step 11674/19560 | loss 3.366821 (+0.11z)| norm 0.2834 (+0.07z)| lr 2.24e-04 | 4174.75 ms | 32.3% bf16 MFU | 125734 tok/s step 11675/19560 | loss 3.381622 (+0.55z)| norm 0.2998 (+1.04z)| lr 2.24e-04 | 4164.19 ms | 32.4% bf16 MFU | 125743 tok/s step 11676/19560 | loss 3.351077 (-0.35z)| norm 0.2785 (-0.23z)| lr 2.24e-04 | 4158.91 ms | 32.5% bf16 MFU | 125759 tok/s step 11677/19560 | loss 3.431666 (+1.98z)| norm 0.3045 (+1.32z)| lr 2.24e-04 | 4153.33 ms | 32.5% bf16 MFU | 125783 tok/s step 11678/19560 | loss 3.331558 (-0.91z)| norm 0.2666 (-0.94z)| lr 2.24e-04 | 4186.46 ms | 32.3% bf16 MFU | 125755 tok/s step 11679/19560 | loss 3.343108 (-0.58z)| norm 0.2870 (+0.28z)| lr 2.23e-04 | 4162.00 ms | 32.4% bf16 MFU | 125766 tok/s step 11680/19560 | loss 3.377713 (+0.42z)| norm 0.2847 (+0.15z)| lr 2.23e-04 | 4163.71 ms | 32.4% bf16 MFU | 125774 tok/s step 11681/19560 | loss 3.360201 (-0.11z)| norm 0.2697 (-0.75z)| lr 2.23e-04 | 4166.23 ms | 
32.4% bf16 MFU | 125777 tok/s step 11682/19560 | loss 3.333395 (-0.89z)| norm 0.3010 (+1.12z)| lr 2.23e-04 | 4163.92 ms | 32.4% bf16 MFU | 125784 tok/s step 11683/19560 | loss 3.347867 (-0.45z)| norm 0.2697 (-0.75z)| lr 2.23e-04 | 4258.40 ms | 31.7% bf16 MFU | 125651 tok/s step 11684/19560 | loss 3.335966 (-0.79z)| norm 0.2779 (-0.26z)| lr 2.23e-04 | 4163.25 ms | 32.4% bf16 MFU | 125665 tok/s step 11685/19560 | loss 3.336703 (-0.76z)| norm 0.2861 (+0.24z)| lr 2.23e-04 | 4197.79 ms | 32.2% bf16 MFU | 125626 tok/s step 11686/19560 | loss 3.321264 (-1.19z)| norm 0.2689 (-0.80z)| lr 2.23e-04 | 4200.73 ms | 32.1% bf16 MFU | 125585 tok/s step 11687/19560 | loss 3.440159 (+2.22z)| norm 0.2815 (-0.04z)| lr 2.23e-04 | 4162.05 ms | 32.4% bf16 MFU | 125605 tok/s step 11688/19560 | loss 3.340172 (-0.66z)| norm 0.2762 (-0.36z)| lr 2.23e-04 | 4168.50 ms | 32.4% bf16 MFU | 125613 tok/s step 11689/19560 | loss 3.375120 (+0.35z)| norm 0.2854 (+0.19z)| lr 2.23e-04 | 4150.23 ms | 32.5% bf16 MFU | 125649 tok/s step 11690/19560 | loss 3.371765 (+0.25z)| norm 0.2835 (+0.08z)| lr 2.23e-04 | 4159.81 ms | 32.5% bf16 MFU | 125668 tok/s step 11691/19560 | loss 3.421620 (+1.66z)| norm 0.2780 (-0.26z)| lr 2.23e-04 | 4169.46 ms | 32.4% bf16 MFU | 125672 tok/s step 11692/19560 | loss 3.369638 (+0.18z)| norm 0.2891 (+0.41z)| lr 2.23e-04 | 4162.55 ms | 32.4% bf16 MFU | 125686 tok/s step 11693/19560 | loss 3.402047 (+1.09z)| norm 0.2659 (-1.01z)| lr 2.23e-04 | 4170.39 ms | 32.4% bf16 MFU | 125688 tok/s step 11694/19560 | loss 3.351070 (-0.37z)| norm 0.2764 (-0.38z)| lr 2.23e-04 | 4163.58 ms | 32.4% bf16 MFU | 125699 tok/s step 11695/19560 | loss 3.353119 (-0.32z)| norm 0.2744 (-0.51z)| lr 2.23e-04 | 4175.52 ms | 32.3% bf16 MFU | 125692 tok/s step 11696/19560 | loss 3.365595 (+0.04z)| norm 0.2810 (-0.11z)| lr 2.23e-04 | 4148.89 ms | 32.5% bf16 MFU | 125726 tok/s step 11697/19560 | loss 3.457136 (+2.60z)| norm 0.2620 (-1.28z)| lr 2.23e-04 | 4170.17 ms | 32.4% bf16 MFU | 125726 tok/s step 11698/19560 
| loss 3.374376 (+0.27z)| norm 0.3224 (+2.38z)| lr 2.23e-04 | 4178.86 ms | 32.3% bf16 MFU | 125713 tok/s step 11699/19560 | loss 3.382859 (+0.51z)| norm 0.2681 (-0.90z)| lr 2.23e-04 | 4151.65 ms | 32.5% bf16 MFU | 125741 tok/s step 11700/19560 | loss 3.372491 (+0.23z)| norm 0.3114 (+1.69z)| lr 2.22e-04 | 4161.67 ms | 32.4% bf16 MFU | 125753 tok/s step 11701/19560 | loss 3.463444 (+2.72z)| norm 0.3552 (+4.00z)| lr 2.22e-04 | 4166.74 ms | 32.4% bf16 MFU | 125757 tok/s step 11702/19560 | loss 3.398889 (+0.91z)| norm 0.2816 (-0.12z)| lr 2.22e-04 | 4204.81 ms | 32.1% bf16 MFU | 125704 tok/s step 11703/19560 | loss 3.341964 (-0.65z)| norm 0.3037 (+1.19z)| lr 2.22e-04 | 4164.33 ms | 32.4% bf16 MFU | 125713 tok/s step 11704/19560 | loss 3.351100 (-0.39z)| norm 0.2863 (+0.19z)| lr 2.22e-04 | 4151.11 ms | 32.5% bf16 MFU | 125743 tok/s step 11705/19560 | loss 3.303954 (-1.67z)| norm 0.2721 (-0.65z)| lr 2.22e-04 | 4167.07 ms | 32.4% bf16 MFU | 125746 tok/s step 11706/19560 | loss 3.396519 (+0.89z)| norm 0.2912 (+0.50z)| lr 2.22e-04 | 4167.08 ms | 32.4% bf16 MFU | 125750 tok/s step 11707/19560 | loss 3.378902 (+0.40z)| norm 0.2810 (-0.11z)| lr 2.22e-04 | 4152.73 ms | 32.5% bf16 MFU | 125775 tok/s step 11708/19560 | loss 3.326617 (-1.04z)| norm 0.2854 (+0.15z)| lr 2.22e-04 | 4202.12 ms | 32.1% bf16 MFU | 125725 tok/s step 11709/19560 | loss 3.344242 (-0.55z)| norm 0.2738 (-0.54z)| lr 2.22e-04 | 4161.03 ms | 32.4% bf16 MFU | 125738 tok/s step 11710/19560 | loss 3.425178 (+1.65z)| norm 0.2661 (-0.99z)| lr 2.22e-04 | 4149.19 ms | 32.5% bf16 MFU | 125769 tok/s step 11711/19560 | loss 3.382358 (+0.47z)| norm 0.2840 (+0.09z)| lr 2.22e-04 | 4158.76 ms | 32.5% bf16 MFU | 125784 tok/s step 11712/19560 | loss 3.356047 (-0.24z)| norm 0.2814 (-0.06z)| lr 2.22e-04 | 4174.47 ms | 32.3% bf16 MFU | 125775 tok/s step 11713/19560 | loss 3.336298 (-0.79z)| norm 0.2995 (+1.01z)| lr 2.22e-04 | 4168.20 ms | 32.4% bf16 MFU | 125775 tok/s step 11714/19560 | loss 3.356361 (-0.24z)| norm 0.2975 (+0.89z)| 
lr 2.22e-04 | 4165.47 ms | 32.4% bf16 MFU | 125780 tok/s step 11715/19560 | loss 3.311223 (-1.45z)| norm 0.3125 (+1.75z)| lr 2.22e-04 | 4234.67 ms | 31.9% bf16 MFU | 125681 tok/s step 11716/19560 | loss 3.316684 (-1.29z)| norm 0.2863 (+0.20z)| lr 2.22e-04 | 4161.09 ms | 32.4% bf16 MFU | 125697 tok/s step 11717/19560 | loss 3.359693 (-0.12z)| norm 0.2860 (+0.20z)| lr 2.22e-04 | 4166.00 ms | 32.4% bf16 MFU | 125705 tok/s step 11718/19560 | loss 3.385185 (+0.63z)| norm 0.2984 (+0.97z)| lr 2.22e-04 | 4147.04 ms | 32.6% bf16 MFU | 125741 tok/s step 11719/19560 | loss 3.365599 (+0.06z)| norm 0.2801 (-0.13z)| lr 2.22e-04 | 4199.67 ms | 32.1% bf16 MFU | 125696 tok/s step 11720/19560 | loss 3.417591 (+1.54z)| norm 0.2863 (+0.25z)| lr 2.22e-04 | 4165.47 ms | 32.4% bf16 MFU | 125704 tok/s step 11721/19560 | loss 3.462017 (+2.72z)| norm 0.2754 (-0.41z)| lr 2.21e-04 | 4158.73 ms | 32.5% bf16 MFU | 125722 tok/s step 11722/19560 | loss 3.358428 (-0.19z)| norm 0.2769 (-0.31z)| lr 2.21e-04 | 4178.88 ms | 32.3% bf16 MFU | 125709 tok/s step 11723/19560 | loss 3.382623 (+0.49z)| norm 0.2797 (-0.14z)| lr 2.21e-04 | 4169.00 ms | 32.4% bf16 MFU | 125712 tok/s step 11724/19560 | loss 3.346949 (-0.51z)| norm 0.2682 (-0.84z)| lr 2.21e-04 | 4162.11 ms | 32.4% bf16 MFU | 125725 tok/s step 11725/19560 | loss 3.354992 (-0.28z)| norm 0.2758 (-0.37z)| lr 2.21e-04 | 4163.08 ms | 32.4% bf16 MFU | 125735 tok/s step 11726/19560 | loss 3.352064 (-0.36z)| norm 0.4573 (+7.81z)| lr 2.21e-04 | 4164.59 ms | 32.4% bf16 MFU | 125743 tok/s step 11727/19560 | loss 3.373881 (+0.24z)| norm 0.3043 (+0.94z)| lr 2.21e-04 | 4181.49 ms | 32.3% bf16 MFU | 125725 tok/s step 11728/19560 | loss 3.350459 (-0.41z)| norm 0.3400 (+2.47z)| lr 2.21e-04 | 4190.07 ms | 32.2% bf16 MFU | 125695 tok/s step 11729/19560 | loss 3.336929 (-0.81z)| norm 0.3034 (+0.86z)| lr 2.21e-04 | 4149.93 ms | 32.5% bf16 MFU | 125727 tok/s step 11730/19560 | loss 3.376689 (+0.34z)| norm 0.3109 (+1.19z)| lr 2.21e-04 | 4155.40 ms | 32.5% bf16 MFU | 
125749 tok/s step 11731/19560 | loss 3.376568 (+0.32z)| norm 0.2958 (+0.52z)| lr 2.21e-04 | 4176.03 ms | 32.3% bf16 MFU | 125739 tok/s step 11732/19560 | loss 3.348474 (-0.49z)| norm 0.2843 (+0.02z)| lr 2.21e-04 | 4192.53 ms | 32.2% bf16 MFU | 125705 tok/s step 11733/19560 | loss 3.331751 (-0.96z)| norm 0.2707 (-0.57z)| lr 2.21e-04 | 4164.70 ms | 32.4% bf16 MFU | 125714 tok/s step 11734/19560 | loss 3.325290 (-1.14z)| norm 0.2814 (-0.09z)| lr 2.21e-04 | 4175.90 ms | 32.3% bf16 MFU | 125706 tok/s step 11735/19560 | loss 3.366103 (+0.04z)| norm 0.2874 (+0.16z)| lr 2.21e-04 | 4159.16 ms | 32.5% bf16 MFU | 125723 tok/s step 11736/19560 | loss 3.412671 (+1.36z)| norm 0.2784 (-0.23z)| lr 2.21e-04 | 4159.07 ms | 32.5% bf16 MFU | 125740 tok/s step 11737/19560 | loss 3.346100 (-0.53z)| norm 0.3113 (+1.19z)| lr 2.21e-04 | 4179.84 ms | 32.3% bf16 MFU | 125725 tok/s step 11738/19560 | loss 3.399683 (+0.99z)| norm 0.2849 (+0.04z)| lr 2.21e-04 | 4218.13 ms | 32.0% bf16 MFU | 125653 tok/s step 11739/19560 | loss 3.410377 (+1.28z)| norm 0.2713 (-0.56z)| lr 2.21e-04 | 4396.03 ms | 30.7% bf16 MFU | 125334 tok/s step 11740/19560 | loss 3.341596 (-0.66z)| norm 0.2946 (+0.47z)| lr 2.21e-04 | 4192.45 ms | 32.2% bf16 MFU | 125320 tok/s step 11741/19560 | loss 3.321931 (-1.20z)| norm 0.2590 (-1.09z)| lr 2.21e-04 | 4251.94 ms | 31.8% bf16 MFU | 125219 tok/s step 11742/19560 | loss 3.319138 (-1.28z)| norm 0.2744 (-0.42z)| lr 2.20e-04 | 4241.28 ms | 31.8% bf16 MFU | 125139 tok/s step 11743/19560 | loss 3.357346 (-0.20z)| norm 0.2577 (-1.14z)| lr 2.20e-04 | 4171.93 ms | 32.4% bf16 MFU | 125166 tok/s step 11744/19560 | loss 3.375509 (+0.30z)| norm 0.2835 (-0.02z)| lr 2.20e-04 | 4165.78 ms | 32.4% bf16 MFU | 125200 tok/s step 11745/19560 | loss 3.384396 (+0.56z)| norm 0.2567 (-1.18z)| lr 2.20e-04 | 4151.08 ms | 32.5% bf16 MFU | 125255 tok/s step 11746/19560 | loss 3.431159 (+1.89z)| norm 0.2903 (+0.28z)| lr 2.20e-04 | 4160.08 ms | 32.5% bf16 MFU | 125294 tok/s step 11747/19560 | loss 3.393343 
(+0.81z)| norm 0.2884 (+0.19z)| lr 2.20e-04 | 4166.30 ms | 32.4% bf16 MFU | 125321 tok/s step 11748/19560 | loss 3.426354 (+1.71z)| norm 0.2939 (+0.42z)| lr 2.20e-04 | 4162.63 ms | 32.4% bf16 MFU | 125353 tok/s step 11749/19560 | loss 3.431941 (+1.82z)| norm 0.3003 (+0.69z)| lr 2.20e-04 | 4258.36 ms | 31.7% bf16 MFU | 125241 tok/s step 11750/19560 | loss 3.290863 (-2.05z)| norm 0.2873 (+0.11z)| lr 2.20e-04 | 4240.26 ms | 31.8% bf16 MFU | 125161 tok/s val loss 3.351433 
HellaSwag: 2923/10042 = 0.291077 step 11751/19560 | loss 3.353042 (-0.34z)| norm 0.3118 (+1.18z)| lr 2.20e-04 | 4731.26 ms | 28.5% bf16 MFU | 124444 tok/s step 11752/19560 | loss 3.368096 (+0.07z)| norm 0.2873 (+0.09z)| lr 2.20e-04 | 4308.40 ms | 31.3% bf16 MFU | 124306 tok/s step 11753/19560 | loss 3.371902 (+0.17z)| norm 0.3093 (+1.05z)| lr 2.20e-04 | 4153.74 ms | 32.5% bf16 MFU | 124402 tok/s step 11754/19560 | loss 3.415850 (+1.39z)| norm 0.2927 (+0.31z)| lr 2.20e-04 | 4178.77 ms | 32.3% bf16 MFU | 124455 tok/s step 11755/19560 | loss 3.365503 (-0.02z)| norm 0.3335 (+2.07z)| lr 2.20e-04 | 4167.70 ms | 32.4% bf16 MFU | 124522 tok/s step 11756/19560 | loss 3.381044 (+0.42z)| norm 0.2844 (-0.07z)| lr 2.20e-04 | 4160.79 ms | 32.4% bf16 MFU | 124596 tok/s step 11757/19560 | loss 3.380567 (+0.42z)| norm 0.2799 (-0.27z)| lr 2.20e-04 | 4151.46 ms | 32.5% bf16 MFU | 124681 tok/s step 11758/19560 | loss 3.354551 (-0.31z)| norm 0.2773 (-0.38z)| lr 2.20e-04 | 4178.28 ms | 32.3% bf16 MFU | 124721 tok/s step 11759/19560 | loss 3.333192 (-0.90z)| norm 0.2816 (-0.19z)| lr 2.20e-04 | 4155.74 ms | 32.5% bf16 MFU | 124793 tok/s step 11760/19560 | loss 3.416264 (+1.47z)| norm 0.2923 (+0.28z)| lr 2.20e-04 | 4161.75 ms | 32.4% bf16 MFU | 124852 tok/s step 11761/19560 | loss 3.443995 (+2.20z)| norm 0.2740 (-0.52z)| lr 2.20e-04 | 4161.75 ms | 32.4% bf16 MFU | 124908 tok/s step 11762/19560 | loss 3.380131 (+0.41z)| norm 0.2967 (+0.47z)| lr 2.19e-04 | 4162.66 ms | 32.4% bf16 MFU | 124961 tok/s step 11763/19560 | loss 3.360306 (-0.15z)| norm 0.2934 (+0.33z)| lr 2.19e-04 | 4162.79 ms | 32.4% 
bf16 MFU | 125010 tok/s step 11764/19560 | loss 3.333673 (-0.89z)| norm 0.2780 (-0.35z)| lr 2.19e-04 | 4157.17 ms | 32.5% bf16 MFU | 125065 tok/s step 11765/19560 | loss 3.383709 (+0.50z)| norm 0.2576 (-1.22z)| lr 2.19e-04 | 4165.68 ms | 32.4% bf16 MFU | 125105 tok/s step 11766/19560 | loss 3.407949 (+1.16z)| norm 0.2994 (+0.59z)| lr 2.19e-04 | 4176.40 ms | 32.3% bf16 MFU | 125126 tok/s step 11767/19560 | loss 3.376747 (+0.28z)| norm 0.2851 (-0.02z)| lr 2.19e-04 | 4163.84 ms | 32.4% bf16 MFU | 125166 tok/s step 11768/19560 | loss 3.333022 (-0.95z)| norm 0.2658 (-0.87z)| lr 2.19e-04 | 4164.41 ms | 32.4% bf16 MFU | 125202 tok/s step 11769/19560 | loss 3.386885 (+0.57z)| norm 0.2959 (+0.44z)| lr 2.19e-04 | 4175.56 ms | 32.3% bf16 MFU | 125220 tok/s step 11770/19560 | loss 3.376880 (+0.28z)| norm 0.2905 (+0.20z)| lr 2.19e-04 | 4153.03 ms | 32.5% bf16 MFU | 125271 tok/s step 11771/19560 | loss 3.331006 (-1.02z)| norm 0.2416 (-1.91z)| lr 2.19e-04 | 4173.60 ms | 32.4% bf16 MFU | 125289 tok/s step 11772/19560 | loss 3.333059 (-0.95z)| norm 0.2684 (-0.76z)| lr 2.19e-04 | 4168.91 ms | 32.4% bf16 MFU | 125312 tok/s step 11773/19560 | loss 3.344087 (-0.65z)| norm 0.2567 (-1.26z)| lr 2.19e-04 | 4157.29 ms | 32.5% bf16 MFU | 125352 tok/s step 11774/19560 | loss 3.369415 (+0.06z)| norm 0.2650 (-0.90z)| lr 2.19e-04 | 4158.46 ms | 32.5% bf16 MFU | 125389 tok/s step 11775/19560 | loss 3.378447 (+0.32z)| norm 0.2746 (-0.48z)| lr 2.19e-04 | 4168.01 ms | 32.4% bf16 MFU | 125409 tok/s step 11776/19560 | loss 3.421867 (+1.55z)| norm 0.2958 (+0.43z)| lr 2.19e-04 | 4163.27 ms | 32.4% bf16 MFU | 125435 tok/s step 11777/19560 | loss 3.329530 (-1.10z)| norm 0.2707 (-0.66z)| lr 2.19e-04 | 4182.74 ms | 32.3% bf16 MFU | 125430 tok/s step 11778/19560 | loss 3.371025 (+0.09z)| norm 0.2772 (-0.39z)| lr 2.19e-04 | 4161.37 ms | 32.4% bf16 MFU | 125458 tok/s step 11779/19560 | loss 3.382474 (+0.40z)| norm 0.2774 (-0.38z)| lr 2.19e-04 | 4161.07 ms | 32.4% bf16 MFU | 125485 tok/s step 11780/19560 | loss 
3.370673 (+0.07z)| norm 0.2538 (-1.41z)| lr 2.19e-04 | 4175.96 ms | 32.3% bf16 MFU | 125489 tok/s step 11781/19560 | loss 3.435021 (+1.91z)| norm 0.2708 (-0.66z)| lr 2.19e-04 | 4162.70 ms | 32.4% bf16 MFU | 125512 tok/s step 11782/19560 | loss 3.374596 (+0.16z)| norm 0.2823 (-0.17z)| lr 2.19e-04 | 4164.42 ms | 32.4% bf16 MFU | 125531 tok/s step 11783/19560 | loss 3.371810 (+0.08z)| norm 0.2894 (+0.14z)| lr 2.18e-04 | 4157.74 ms | 32.5% bf16 MFU | 125559 tok/s step 11784/19560 | loss 3.352685 (-0.47z)| norm 0.2928 (+0.29z)| lr 2.18e-04 | 4203.11 ms | 32.1% bf16 MFU | 125518 tok/s step 11785/19560 | loss 3.358981 (-0.27z)| norm 0.2465 (-1.72z)| lr 2.18e-04 | 4169.41 ms | 32.4% bf16 MFU | 125530 tok/s step 11786/19560 | loss 3.466851 (+2.77z)| norm 0.3045 (+0.79z)| lr 2.18e-04 | 4157.11 ms | 32.5% bf16 MFU | 125559 tok/s step 11787/19560 | loss 3.442296 (+2.03z)| norm 0.2648 (-0.94z)| lr 2.18e-04 | 4167.45 ms | 32.4% bf16 MFU | 125571 tok/s step 11788/19560 | loss 3.388675 (+0.52z)| norm 0.2937 (+0.32z)| lr 2.18e-04 | 4171.57 ms | 32.4% bf16 MFU | 125577 tok/s step 11789/19560 | loss 3.331039 (-1.09z)| norm 0.2850 (-0.06z)| lr 2.18e-04 | 4160.67 ms | 32.5% bf16 MFU | 125599 tok/s step 11790/19560 | loss 3.310237 (-1.66z)| norm 0.2852 (-0.04z)| lr 2.18e-04 | 4165.17 ms | 32.4% bf16 MFU | 125612 tok/s step 11791/19560 | loss 3.367170 (-0.08z)| norm 0.2981 (+0.52z)| lr 2.18e-04 | 4161.30 ms | 32.4% bf16 MFU | 125631 tok/s step 11792/19560 | loss 3.438256 (+1.86z)| norm 0.2658 (-0.88z)| lr 2.18e-04 | 4165.72 ms | 32.4% bf16 MFU | 125643 tok/s step 11793/19560 | loss 3.350347 (-0.56z)| norm 0.2953 (+0.40z)| lr 2.18e-04 | 4169.72 ms | 32.4% bf16 MFU | 125647 tok/s step 11794/19560 | loss 3.384936 (+0.39z)| norm 0.3087 (+0.97z)| lr 2.18e-04 | 4171.82 ms | 32.4% bf16 MFU | 125649 tok/s step 11795/19560 | loss 3.346587 (-0.68z)| norm 0.2768 (-0.41z)| lr 2.18e-04 | 4155.38 ms | 32.5% bf16 MFU | 125675 tok/s step 11796/19560 | loss 3.406523 (+0.99z)| norm 0.2840 (-0.10z)| lr 
2.18e-04 | 4158.45 ms | 32.5% bf16 MFU | 125695 tok/s step 11797/19560 | loss 3.373238 (+0.06z)| norm 0.3164 (+1.29z)| lr 2.18e-04 | 4162.82 ms | 32.4% bf16 MFU | 125707 tok/s step 11798/19560 | loss 3.397203 (+0.72z)| norm 0.2754 (-0.48z)| lr 2.18e-04 | 4159.11 ms | 32.5% bf16 MFU | 125725 tok/s step 11799/19560 | loss 3.429586 (+1.59z)| norm 0.2753 (-0.47z)| lr 2.18e-04 | 4158.27 ms | 32.5% bf16 MFU | 125743 tok/s step 11800/19560 | loss 3.352743 (-0.52z)| norm 0.2752 (-0.47z)| lr 2.18e-04 | 4165.29 ms | 32.4% bf16 MFU | 125749 tok/s step 11801/19560 | loss 3.375954 (+0.11z)| norm 0.2845 (-0.07z)| lr 2.18e-04 | 4166.51 ms | 32.4% bf16 MFU | 125753 tok/s step 11802/19560 | loss 3.349732 (-0.61z)| norm 0.2762 (-0.43z)| lr 2.18e-04 | 4152.23 ms | 32.5% bf16 MFU | 125779 tok/s step 11803/19560 | loss 3.372550 (+0.02z)| norm 0.2656 (-0.88z)| lr 2.18e-04 | 4183.73 ms | 32.3% bf16 MFU | 125756 tok/s step 11804/19560 | loss 3.365691 (-0.17z)| norm 0.3162 (+1.30z)| lr 2.17e-04 | 4179.97 ms | 32.3% bf16 MFU | 125740 tok/s step 11805/19560 | loss 3.407508 (+1.00z)| norm 0.3063 (+0.87z)| lr 2.17e-04 | 4159.53 ms | 32.5% bf16 MFU | 125755 tok/s step 11806/19560 | loss 3.317110 (-1.51z)| norm 0.2887 (+0.10z)| lr 2.17e-04 | 4170.19 ms | 32.4% bf16 MFU | 125753 tok/s step 11807/19560 | loss 3.374958 (+0.09z)| norm 0.3018 (+0.67z)| lr 2.17e-04 | 4160.74 ms | 32.5% bf16 MFU | 125766 tok/s step 11808/19560 | loss 3.373317 (+0.04z)| norm 0.2804 (-0.26z)| lr 2.17e-04 | 4177.58 ms | 32.3% bf16 MFU | 125753 tok/s step 11809/19560 | loss 3.307960 (-1.74z)| norm 0.2936 (+0.31z)| lr 2.17e-04 | 4164.73 ms | 32.4% bf16 MFU | 125759 tok/s step 11810/19560 | loss 3.437146 (+1.77z)| norm 0.2803 (-0.26z)| lr 2.17e-04 | 4163.24 ms | 32.4% bf16 MFU | 125768 tok/s step 11811/19560 | loss 3.392465 (+0.54z)| norm 0.3055 (+0.82z)| lr 2.17e-04 | 4171.30 ms | 32.4% bf16 MFU | 125764 tok/s step 11812/19560 | loss 3.377983 (+0.14z)| norm 0.4014 (+4.52z)| lr 2.17e-04 | 4165.73 ms | 32.4% bf16 MFU | 125769 
tok/s step 11813/19560 | loss 3.331675 (-1.12z)| norm 0.2889 (+0.05z)| lr 2.17e-04 | 4183.21 ms | 32.3% bf16 MFU | 125747 tok/s step 11814/19560 | loss 3.425432 (+1.41z)| norm 0.2654 (-0.88z)| lr 2.17e-04 | 4159.66 ms | 32.5% bf16 MFU | 125762 tok/s step 11815/19560 | loss 3.318887 (-1.47z)| norm 0.2783 (-0.37z)| lr 2.17e-04 | 4165.69 ms | 32.4% bf16 MFU | 125766 tok/s step 11816/19560 | loss 3.409941 (+1.01z)| norm 0.2720 (-0.62z)| lr 2.17e-04 | 4163.48 ms | 32.4% bf16 MFU | 125774 tok/s step 11817/19560 | loss 3.371559 (-0.04z)| norm 0.3192 (+1.24z)| lr 2.17e-04 | 4160.68 ms | 32.5% bf16 MFU | 125786 tok/s step 11818/19560 | loss 3.353211 (-0.54z)| norm 0.2823 (-0.22z)| lr 2.17e-04 | 4167.44 ms | 32.4% bf16 MFU | 125787 tok/s step 11819/19560 | loss 3.385438 (+0.35z)| norm 0.2644 (-0.91z)| lr 2.17e-04 | 4190.22 ms | 32.2% bf16 MFU | 125754 tok/s step 11820/19560 | loss 3.360388 (-0.34z)| norm 0.2777 (-0.39z)| lr 2.17e-04 | 4165.86 ms | 32.4% bf16 MFU | 125759 tok/s step 11821/19560 | loss 3.332371 (-1.09z)| norm 0.2737 (-0.55z)| lr 2.17e-04 | 4160.82 ms | 32.4% bf16 MFU | 125771 tok/s step 11822/19560 | loss 3.307139 (-1.76z)| norm 0.2570 (-1.20z)| lr 2.17e-04 | 4171.57 ms | 32.4% bf16 MFU | 125767 tok/s step 11823/19560 | loss 3.383178 (+0.30z)| norm 0.2659 (-0.84z)| lr 2.17e-04 | 4149.68 ms | 32.5% bf16 MFU | 125796 tok/s step 11824/19560 | loss 3.355494 (-0.45z)| norm 0.2791 (-0.33z)| lr 2.17e-04 | 4172.41 ms | 32.4% bf16 MFU | 125789 tok/s step 11825/19560 | loss 3.374534 (+0.09z)| norm 0.2525 (-1.36z)| lr 2.16e-04 | 4151.34 ms | 32.5% bf16 MFU | 125814 tok/s step 11826/19560 | loss 3.421829 (+1.38z)| norm 0.2553 (-1.23z)| lr 2.16e-04 | 4173.34 ms | 32.4% bf16 MFU | 125805 tok/s step 11827/19560 | loss 3.434960 (+1.71z)| norm 0.2803 (-0.26z)| lr 2.16e-04 | 4164.36 ms | 32.4% bf16 MFU | 125809 tok/s step 11828/19560 | loss 3.324748 (-1.27z)| norm 0.2668 (-0.77z)| lr 2.16e-04 | 4168.57 ms | 32.4% bf16 MFU | 125807 tok/s step 11829/19560 | loss 3.477087 
(+2.83z)| norm 0.2671 (-0.75z)| lr 2.16e-04 | 4177.33 ms | 32.3% bf16 MFU | 125792 tok/s step 11830/19560 | loss 3.412974 (+1.10z)| norm 0.2676 (-0.73z)| lr 2.16e-04 | 4154.82 ms | 32.5% bf16 MFU | 125812 tok/s step 11831/19560 | loss 3.308617 (-1.68z)| norm 0.2562 (-1.17z)| lr 2.16e-04 | 4168.68 ms | 32.4% bf16 MFU | 125810 tok/s step 11832/19560 | loss 3.312439 (-1.55z)| norm 0.2668 (-0.74z)| lr 2.16e-04 | 4158.62 ms | 32.5% bf16 MFU | 125823 tok/s step 11833/19560 | loss 3.335438 (-0.96z)| norm 0.2787 (-0.27z)| lr 2.16e-04 | 4171.66 ms | 32.4% bf16 MFU | 125816 tok/s step 11834/19560 | loss 3.328088 (-1.14z)| norm 0.2639 (-0.85z)| lr 2.16e-04 | 4166.97 ms | 32.4% bf16 MFU | 125816 tok/s step 11835/19560 | loss 3.429671 (+1.53z)| norm 0.3006 (+0.61z)| lr 2.16e-04 | 4155.62 ms | 32.5% bf16 MFU | 125833 tok/s step 11836/19560 | loss 3.292833 (-2.03z)| norm 0.2727 (-0.49z)| lr 2.16e-04 | 4154.63 ms | 32.5% bf16 MFU | 125852 tok/s step 11837/19560 | loss 3.347456 (-0.62z)| norm 0.2754 (-0.39z)| lr 2.16e-04 | 4162.34 ms | 32.4% bf16 MFU | 125857 tok/s step 11838/19560 | loss 3.349214 (-0.56z)| norm 0.2894 (+0.16z)| lr 2.16e-04 | 4152.58 ms | 32.5% bf16 MFU | 125877 tok/s step 11839/19560 | loss 3.353659 (-0.44z)| norm 0.2949 (+0.38z)| lr 2.16e-04 | 4157.31 ms | 32.5% bf16 MFU | 125889 tok/s step 11840/19560 | loss 3.353952 (-0.43z)| norm 0.2845 (-0.04z)| lr 2.16e-04 | 4176.51 ms | 32.3% bf16 MFU | 125871 tok/s step 11841/19560 | loss 3.465798 (+2.42z)| norm 0.2846 (-0.03z)| lr 2.16e-04 | 4161.96 ms | 32.4% bf16 MFU | 125876 tok/s step 11842/19560 | loss 3.331202 (-1.02z)| norm 0.2841 (-0.05z)| lr 2.16e-04 | 4154.96 ms | 32.5% bf16 MFU | 125891 tok/s step 11843/19560 | loss 3.399251 (+0.70z)| norm 0.2577 (-1.08z)| lr 2.16e-04 | 4159.31 ms | 32.5% bf16 MFU | 125899 tok/s step 11844/19560 | loss 3.399840 (+0.71z)| norm 0.2696 (-0.60z)| lr 2.16e-04 | 4161.18 ms | 32.4% bf16 MFU | 125904 tok/s step 11845/19560 | loss 3.356311 (-0.42z)| norm 0.2621 (-0.89z)| lr 2.16e-04 | 
4194.30 ms | 32.2% bf16 MFU | 125859 tok/s step 11846/19560 | loss 3.378180 (+0.15z)| norm 0.2795 (-0.19z)| lr 2.15e-04 | 4155.29 ms | 32.5% bf16 MFU | 125875 tok/s step 11847/19560 | loss 3.407965 (+0.91z)| norm 0.2763 (-0.32z)| lr 2.15e-04 | 4167.66 ms | 32.4% bf16 MFU | 125871 tok/s step 11848/19560 | loss 3.353008 (-0.50z)| norm 0.3103 (+1.02z)| lr 2.15e-04 | 4167.83 ms | 32.4% bf16 MFU | 125867 tok/s step 11849/19560 | loss 3.314313 (-1.49z)| norm 0.2592 (-0.99z)| lr 2.15e-04 | 4167.38 ms | 32.4% bf16 MFU | 125864 tok/s step 11850/19560 | loss 3.336446 (-0.90z)| norm 0.2969 (+0.49z)| lr 2.15e-04 | 4176.27 ms | 32.3% bf16 MFU | 125848 tok/s step 11851/19560 | loss 3.376248 (+0.14z)| norm 0.2654 (-0.75z)| lr 2.15e-04 | 4164.54 ms | 32.4% bf16 MFU | 125850 tok/s step 11852/19560 | loss 3.378665 (+0.20z)| norm 0.2898 (+0.20z)| lr 2.15e-04 | 4169.43 ms | 32.4% bf16 MFU | 125845 tok/s step 11853/19560 | loss 3.504102 (+3.30z)| norm 0.3071 (+0.88z)| lr 2.15e-04 | 4156.78 ms | 32.5% bf16 MFU | 125859 tok/s step 11854/19560 | loss 3.329990 (-1.05z)| norm 0.3001 (+0.81z)| lr 2.15e-04 | 4151.37 ms | 32.5% bf16 MFU | 125881 tok/s step 11855/19560 | loss 3.427547 (+1.36z)| norm 0.2721 (-0.55z)| lr 2.15e-04 | 4161.93 ms | 32.4% bf16 MFU | 125885 tok/s step 11856/19560 | loss 3.358907 (-0.34z)| norm 0.2995 (+0.83z)| lr 2.15e-04 | 4181.37 ms | 32.3% bf16 MFU | 125860 tok/s step 11857/19560 | loss 3.340089 (-0.81z)| norm 0.2830 (+0.00z)| lr 2.15e-04 | 4159.23 ms | 32.5% bf16 MFU | 125870 tok/s step 11858/19560 | loss 3.363130 (-0.23z)| norm 0.2770 (-0.29z)| lr 2.15e-04 | 4172.48 ms | 32.4% bf16 MFU | 125859 tok/s step 11859/19560 | loss 3.351089 (-0.53z)| norm 0.2730 (-0.49z)| lr 2.15e-04 | 4177.75 ms | 32.3% bf16 MFU | 125841 tok/s step 11860/19560 | loss 3.361292 (-0.28z)| norm 0.2661 (-0.83z)| lr 2.15e-04 | 4154.62 ms | 32.5% bf16 MFU | 125859 tok/s step 11861/19560 | loss 3.352569 (-0.50z)| norm 0.2942 (+0.60z)| lr 2.15e-04 | 4155.14 ms | 32.5% bf16 MFU | 125875 tok/s step 
11862/19560 | loss 3.450510 (+1.90z)| norm 0.2771 (-0.27z)| lr 2.15e-04 | 4161.19 ms | 32.4% bf16 MFU | 125881 tok/s step 11863/19560 | loss 3.410932 (+0.91z)| norm 0.2800 (-0.12z)| lr 2.15e-04 | 4161.38 ms | 32.4% bf16 MFU | 125886 tok/s step 11864/19560 | loss 3.326932 (-1.14z)| norm 0.2907 (+0.42z)| lr 2.15e-04 | 4159.10 ms | 32.5% bf16 MFU | 125895 tok/s step 11865/19560 | loss 3.331359 (-1.02z)| norm 0.2838 (+0.08z)| lr 2.15e-04 | 4173.37 ms | 32.4% bf16 MFU | 125881 tok/s step 11866/19560 | loss 3.368294 (-0.11z)| norm 0.2767 (-0.28z)| lr 2.14e-04 | 4169.69 ms | 32.4% bf16 MFU | 125874 tok/s step 11867/19560 | loss 3.399687 (+0.66z)| norm 0.3041 (+1.11z)| lr 2.14e-04 | 4182.22 ms | 32.3% bf16 MFU | 125848 tok/s step 11868/19560 | loss 3.388830 (+0.38z)| norm 0.2605 (-1.11z)| lr 2.14e-04 | 4155.88 ms | 32.5% bf16 MFU | 125864 tok/s step 11869/19560 | loss 3.311838 (-1.51z)| norm 0.2772 (-0.26z)| lr 2.14e-04 | 4155.01 ms | 32.5% bf16 MFU | 125880 tok/s step 11870/19560 | loss 3.359365 (-0.35z)| norm 0.2841 (+0.09z)| lr 2.14e-04 | 4161.74 ms | 32.4% bf16 MFU | 125885 tok/s step 11871/19560 | loss 3.402854 (+0.72z)| norm 0.2920 (+0.48z)| lr 2.14e-04 | 4154.30 ms | 32.5% bf16 MFU | 125901 tok/s step 11872/19560 | loss 3.355420 (-0.45z)| norm 0.2750 (-0.40z)| lr 2.14e-04 | 4155.77 ms | 32.5% bf16 MFU | 125914 tok/s step 11873/19560 | loss 3.334510 (-0.95z)| norm 0.2736 (-0.48z)| lr 2.14e-04 | 4166.68 ms | 32.4% bf16 MFU | 125909 tok/s step 11874/19560 | loss 3.336699 (-0.89z)| norm 0.2685 (-0.73z)| lr 2.14e-04 | 4152.01 ms | 32.5% bf16 MFU | 125927 tok/s step 11875/19560 | loss 3.327702 (-1.09z)| norm 0.2735 (-0.46z)| lr 2.14e-04 | 4169.73 ms | 32.4% bf16 MFU | 125918 tok/s step 11876/19560 | loss 3.383775 (+0.30z)| norm 0.2630 (-1.00z)| lr 2.14e-04 | 4159.60 ms | 32.5% bf16 MFU | 125924 tok/s step 11877/19560 | loss 3.299516 (-1.76z)| norm 0.2693 (-0.66z)| lr 2.14e-04 | 4166.90 ms | 32.4% bf16 MFU | 125919 tok/s step 11878/19560 | loss 3.380352 (+0.23z)| norm 
0.2757 (-0.32z)| lr 2.14e-04 | 4156.83 ms | 32.5% bf16 MFU | 125929 tok/s step 11879/19560 | loss 3.382202 (+0.27z)| norm 0.3048 (+1.19z)| lr 2.14e-04 | 4167.35 ms | 32.4% bf16 MFU | 125923 tok/s step 11880/19560 | loss 3.394317 (+0.56z)| norm 0.3656 (+4.04z)| lr 2.14e-04 | 4168.77 ms | 32.4% bf16 MFU | 125915 tok/s step 11881/19560 | loss 3.350909 (-0.52z)| norm 0.3348 (+2.49z)| lr 2.14e-04 | 4159.02 ms | 32.5% bf16 MFU | 125923 tok/s step 11882/19560 | loss 3.374805 (+0.09z)| norm 0.2737 (-0.42z)| lr 2.14e-04 | 4164.57 ms | 32.4% bf16 MFU | 125921 tok/s step 11883/19560 | loss 3.345139 (-0.65z)| norm 0.2918 (+0.47z)| lr 2.14e-04 | 4170.88 ms | 32.4% bf16 MFU | 125910 tok/s step 11884/19560 | loss 3.324743 (-1.15z)| norm 0.2977 (+0.75z)| lr 2.14e-04 | 4160.31 ms | 32.5% bf16 MFU | 125916 tok/s step 11885/19560 | loss 3.414609 (+1.09z)| norm 0.2838 (+0.07z)| lr 2.14e-04 | 4164.20 ms | 32.4% bf16 MFU | 125915 tok/s step 11886/19560 | loss 3.399621 (+0.70z)| norm 0.2959 (+0.66z)| lr 2.14e-04 | 4172.85 ms | 32.4% bf16 MFU | 125902 tok/s step 11887/19560 | loss 3.395348 (+0.59z)| norm 0.2746 (-0.38z)| lr 2.13e-04 | 4166.12 ms | 32.4% bf16 MFU | 125899 tok/s step 11888/19560 | loss 3.371027 (-0.01z)| norm 0.2883 (+0.29z)| lr 2.13e-04 | 4157.23 ms | 32.5% bf16 MFU | 125910 tok/s step 11889/19560 | loss 3.456746 (+2.12z)| norm 0.2788 (-0.18z)| lr 2.13e-04 | 4162.59 ms | 32.4% bf16 MFU | 125912 tok/s step 11890/19560 | loss 3.370358 (-0.03z)| norm 0.2835 (+0.06z)| lr 2.13e-04 | 4164.81 ms | 32.4% bf16 MFU | 125910 tok/s step 11891/19560 | loss 3.386184 (+0.36z)| norm 0.2944 (+0.59z)| lr 2.13e-04 | 4167.49 ms | 32.4% bf16 MFU | 125905 tok/s step 11892/19560 | loss 3.376434 (+0.11z)| norm 0.2895 (+0.34z)| lr 2.13e-04 | 4154.50 ms | 32.5% bf16 MFU | 125920 tok/s step 11893/19560 | loss 3.370392 (-0.04z)| norm 0.3056 (+1.12z)| lr 2.13e-04 | 4153.32 ms | 32.5% bf16 MFU | 125935 tok/s step 11894/19560 | loss 3.329094 (-1.05z)| norm 0.2763 (-0.31z)| lr 2.13e-04 | 4159.41 ms | 
32.5% bf16 MFU | 125941 tok/s step 11895/19560 | loss 3.411532 (+0.99z)| norm 0.3043 (+1.05z)| lr 2.13e-04 | 4161.59 ms | 32.4% bf16 MFU | 125943 tok/s step 11896/19560 | loss 3.361068 (-0.27z)| norm 0.3012 (+0.89z)| lr 2.13e-04 | 4158.30 ms | 32.5% bf16 MFU | 125950 tok/s step 11897/19560 | loss 3.249901 (-2.91z)| norm 0.2850 (+0.10z)| lr 2.13e-04 | 4161.80 ms | 32.4% bf16 MFU | 125951 tok/s step 11898/19560 | loss 3.337676 (-0.79z)| norm 0.3111 (+1.36z)| lr 2.13e-04 | 4166.07 ms | 32.4% bf16 MFU | 125946 tok/s step 11899/19560 | loss 3.342253 (-0.68z)| norm 0.2675 (-0.78z)| lr 2.13e-04 | 4156.86 ms | 32.5% bf16 MFU | 125955 tok/s step 11900/19560 | loss 3.297469 (-1.74z)| norm 0.2887 (+0.26z)| lr 2.13e-04 | 4167.66 ms | 32.4% bf16 MFU | 125947 tok/s step 11901/19560 | loss 3.360238 (-0.24z)| norm 0.2821 (-0.08z)| lr 2.13e-04 | 4155.22 ms | 32.5% bf16 MFU | 125959 tok/s step 11902/19560 | loss 3.377662 (+0.17z)| norm 0.3119 (+1.38z)| lr 2.13e-04 | 4155.76 ms | 32.5% bf16 MFU | 125969 tok/s step 11903/19560 | loss 3.370623 (+0.01z)| norm 0.2809 (-0.16z)| lr 2.13e-04 | 4163.52 ms | 32.4% bf16 MFU | 125967 tok/s step 11904/19560 | loss 3.380235 (+0.25z)| norm 0.2914 (+0.37z)| lr 2.13e-04 | 4157.24 ms | 32.5% bf16 MFU | 125974 tok/s step 11905/19560 | loss 3.324069 (-1.10z)| norm 0.2915 (+0.36z)| lr 2.13e-04 | 4156.86 ms | 32.5% bf16 MFU | 125982 tok/s step 11906/19560 | loss 3.319999 (-1.18z)| norm 0.2746 (-0.47z)| lr 2.13e-04 | 4164.44 ms | 32.4% bf16 MFU | 125977 tok/s step 11907/19560 | loss 3.347096 (-0.53z)| norm 0.2835 (-0.04z)| lr 2.13e-04 | 4163.49 ms | 32.4% bf16 MFU | 125975 tok/s step 11908/19560 | loss 3.365757 (-0.08z)| norm 0.2863 (+0.09z)| lr 2.12e-04 | 4170.29 ms | 32.4% bf16 MFU | 125962 tok/s step 11909/19560 | loss 3.337194 (-0.75z)| norm 0.3023 (+0.88z)| lr 2.12e-04 | 4163.47 ms | 32.4% bf16 MFU | 125960 tok/s step 11910/19560 | loss 3.326139 (-1.01z)| norm 0.2828 (-0.10z)| lr 2.12e-04 | 4156.32 ms | 32.5% bf16 MFU | 125969 tok/s step 11911/19560 
| loss 3.329189 (-0.92z)| norm 0.2988 (+0.70z)| lr 2.12e-04 | 4150.98 ms | 32.5% bf16 MFU | 125986 tok/s step 11912/19560 | loss 3.410012 (+0.99z)| norm 0.2938 (+0.45z)| lr 2.12e-04 | 4157.76 ms | 32.5% bf16 MFU | 125992 tok/s step 11913/19560 | loss 3.303931 (-1.51z)| norm 0.2973 (+0.61z)| lr 2.12e-04 | 4156.66 ms | 32.5% bf16 MFU | 125999 tok/s step 11914/19560 | loss 3.380093 (+0.31z)| norm 0.2977 (+0.64z)| lr 2.12e-04 | 4164.20 ms | 32.4% bf16 MFU | 125994 tok/s step 11915/19560 | loss 3.313183 (-1.28z)| norm 0.2892 (+0.19z)| lr 2.12e-04 | 4158.08 ms | 32.5% bf16 MFU | 125999 tok/s step 11916/19560 | loss 3.332263 (-0.81z)| norm 0.2785 (-0.34z)| lr 2.12e-04 | 4155.01 ms | 32.5% bf16 MFU | 126008 tok/s step 11917/19560 | loss 3.335865 (-0.72z)| norm 0.2807 (-0.23z)| lr 2.12e-04 | 4187.59 ms | 32.2% bf16 MFU | 125967 tok/s step 11918/19560 | loss 3.426692 (+1.46z)| norm 0.2815 (-0.19z)| lr 2.12e-04 | 4162.93 ms | 32.4% bf16 MFU | 125966 tok/s step 11919/19560 | loss 3.374345 (+0.19z)| norm 0.2773 (-0.39z)| lr 2.12e-04 | 4168.00 ms | 32.4% bf16 MFU | 125957 tok/s step 11920/19560 | loss 3.390241 (+0.59z)| norm 0.2809 (-0.22z)| lr 2.12e-04 | 4155.65 ms | 32.5% bf16 MFU | 125968 tok/s step 11921/19560 | loss 3.343009 (-0.57z)| norm 0.2792 (-0.30z)| lr 2.12e-04 | 4163.27 ms | 32.4% bf16 MFU | 125966 tok/s step 11922/19560 | loss 3.401589 (+0.86z)| norm 0.2795 (-0.27z)| lr 2.12e-04 | 4170.85 ms | 32.4% bf16 MFU | 125953 tok/s step 11923/19560 | loss 3.405138 (+0.93z)| norm 0.2574 (-1.39z)| lr 2.12e-04 | 4203.29 ms | 32.1% bf16 MFU | 125892 tok/s step 11924/19560 | loss 3.313126 (-1.29z)| norm 0.2775 (-0.36z)| lr 2.12e-04 | 4158.85 ms | 32.5% bf16 MFU | 125900 tok/s step 11925/19560 | loss 3.322153 (-1.05z)| norm 0.2660 (-0.93z)| lr 2.12e-04 | 4243.48 ms | 31.8% bf16 MFU | 125783 tok/s step 11926/19560 | loss 3.318941 (-1.11z)| norm 0.2883 (+0.21z)| lr 2.12e-04 | 4152.52 ms | 32.5% bf16 MFU | 125807 tok/s step 11927/19560 | loss 3.337514 (-0.65z)| norm 0.2581 (-1.33z)| 
lr 2.12e-04 | 4172.92 ms | 32.4% bf16 MFU | 125798 tok/s step 11928/19560 | loss 3.361351 (-0.08z)| norm 0.2688 (-0.78z)| lr 2.12e-04 | 4169.59 ms | 32.4% bf16 MFU | 125795 tok/s step 11929/19560 | loss 3.345123 (-0.46z)| norm 0.2638 (-1.02z)| lr 2.11e-04 | 4188.28 ms | 32.2% bf16 MFU | 125765 tok/s step 11930/19560 | loss 3.320298 (-1.06z)| norm 0.2772 (-0.34z)| lr 2.11e-04 | 4184.84 ms | 32.3% bf16 MFU | 125741 tok/s step 11931/19560 | loss 3.332872 (-0.74z)| norm 0.2745 (-0.49z)| lr 2.11e-04 | 4169.22 ms | 32.4% bf16 MFU | 125741 tok/s step 11932/19560 | loss 3.360638 (-0.07z)| norm 0.2939 (+0.52z)| lr 2.11e-04 | 4166.21 ms | 32.4% bf16 MFU | 125746 tok/s step 11933/19560 | loss 3.370281 (+0.17z)| norm 0.2775 (-0.32z)| lr 2.11e-04 | 4156.33 ms | 32.5% bf16 MFU | 125766 tok/s step 11934/19560 | loss 3.353639 (-0.24z)| norm 0.2951 (+0.59z)| lr 2.11e-04 | 4162.80 ms | 32.4% bf16 MFU | 125775 tok/s step 11935/19560 | loss 3.375539 (+0.29z)| norm 0.2949 (+0.58z)| lr 2.11e-04 | 4166.77 ms | 32.4% bf16 MFU | 125778 tok/s step 11936/19560 | loss 3.357622 (-0.14z)| norm 0.3071 (+1.20z)| lr 2.11e-04 | 4181.96 ms | 32.3% bf16 MFU | 125757 tok/s step 11937/19560 | loss 3.346752 (-0.42z)| norm 0.2815 (-0.12z)| lr 2.11e-04 | 4178.69 ms | 32.3% bf16 MFU | 125743 tok/s step 11938/19560 | loss 3.359665 (-0.09z)| norm 0.3095 (+1.31z)| lr 2.11e-04 | 4159.83 ms | 32.5% bf16 MFU | 125757 tok/s step 11939/19560 | loss 3.372092 (+0.23z)| norm 0.3005 (+0.85z)| lr 2.11e-04 | 4166.15 ms | 32.4% bf16 MFU | 125762 tok/s step 11940/19560 | loss 3.411451 (+1.19z)| norm 0.3062 (+1.38z)| lr 2.11e-04 | 4162.69 ms | 32.4% bf16 MFU | 125771 tok/s step 11941/19560 | loss 3.356552 (-0.17z)| norm 0.2997 (+0.99z)| lr 2.11e-04 | 4166.89 ms | 32.4% bf16 MFU | 125774 tok/s step 11942/19560 | loss 3.400006 (+0.92z)| norm 0.3145 (+1.83z)| lr 2.11e-04 | 6065.69 ms | 22.3% bf16 MFU | 123807 tok/s step 11943/19560 | loss 3.343091 (-0.51z)| norm 0.2726 (-0.65z)| lr 2.11e-04 | 4157.65 ms | 32.5% bf16 MFU | 
123921 tok/s step 11944/19560 | loss 3.400224 (+0.93z)| norm 0.2973 (+0.80z)| lr 2.11e-04 | 4153.09 ms | 32.5% bf16 MFU | 124037 tok/s step 11945/19560 | loss 3.349224 (-0.35z)| norm 0.2769 (-0.40z)| lr 2.11e-04 | 4152.29 ms | 32.5% bf16 MFU | 124149 tok/s step 11946/19560 | loss 3.319622 (-1.08z)| norm 0.3037 (+1.20z)| lr 2.11e-04 | 4161.94 ms | 32.4% bf16 MFU | 124240 tok/s step 11947/19560 | loss 3.314336 (-1.20z)| norm 0.3020 (+1.09z)| lr 2.11e-04 | 4162.30 ms | 32.4% bf16 MFU | 124326 tok/s step 11948/19560 | loss 3.399359 (+0.91z)| norm 0.3027 (+1.11z)| lr 2.11e-04 | 4169.49 ms | 32.4% bf16 MFU | 124397 tok/s step 11949/19560 | loss 3.392241 (+0.72z)| norm 0.2827 (-0.09z)| lr 2.11e-04 | 4170.39 ms | 32.4% bf16 MFU | 124463 tok/s step 11950/19560 | loss 3.412259 (+1.20z)| norm 0.2724 (-0.72z)| lr 2.10e-04 | 4176.51 ms | 32.3% bf16 MFU | 124516 tok/s step 11951/19560 | loss 3.372726 (+0.22z)| norm 0.2834 (-0.06z)| lr 2.10e-04 | 4162.13 ms | 32.4% bf16 MFU | 124589 tok/s step 11952/19560 | loss 3.340617 (-0.58z)| norm 0.2847 (+0.01z)| lr 2.10e-04 | 4163.78 ms | 32.4% bf16 MFU | 124655 tok/s step 11953/19560 | loss 3.358532 (-0.13z)| norm 0.2985 (+0.84z)| lr 2.10e-04 | 4170.85 ms | 32.4% bf16 MFU | 124708 tok/s step 11954/19560 | loss 3.365664 (+0.06z)| norm 0.2859 (+0.05z)| lr 2.10e-04 | 4169.42 ms | 32.4% bf16 MFU | 124760 tok/s step 11955/19560 | loss 3.306119 (-1.41z)| norm 0.2893 (+0.26z)| lr 2.10e-04 | 4174.72 ms | 32.3% bf16 MFU | 124801 tok/s step 11956/19560 | loss 3.393723 (+0.78z)| norm 0.2800 (-0.33z)| lr 2.10e-04 | 4159.70 ms | 32.5% bf16 MFU | 124863 tok/s step 11957/19560 | loss 3.354262 (-0.20z)| norm 0.2775 (-0.49z)| lr 2.10e-04 | 4165.97 ms | 32.4% bf16 MFU | 124912 tok/s step 11958/19560 | loss 3.373403 (+0.31z)| norm 0.2699 (-0.97z)| lr 2.10e-04 | 4166.99 ms | 32.4% bf16 MFU | 124958 tok/s step 11959/19560 | loss 3.329642 (-0.85z)| norm 0.2915 (+0.37z)| lr 2.10e-04 | 4165.56 ms | 32.4% bf16 MFU | 125003 tok/s step 11960/19560 | loss 3.396349 
(+0.90z)| norm 0.2826 (-0.20z)| lr 2.10e-04 | 4162.52 ms | 32.4% bf16 MFU | 125050 tok/s step 11961/19560 | loss 3.362067 (-0.01z)| norm 0.2745 (-0.72z)| lr 2.10e-04 | 4174.78 ms | 32.3% bf16 MFU | 125077 tok/s step 11962/19560 | loss 3.366424 (+0.10z)| norm 0.2810 (-0.31z)| lr 2.10e-04 | 4181.89 ms | 32.3% bf16 MFU | 125092 tok/s step 11963/19560 | loss 3.343404 (-0.51z)| norm 0.2580 (-1.75z)| lr 2.10e-04 | 4160.20 ms | 32.5% bf16 MFU | 125138 tok/s step 11964/19560 | loss 3.336158 (-0.72z)| norm 0.2892 (+0.23z)| lr 2.10e-04 | 4191.03 ms | 32.2% bf16 MFU | 125136 tok/s step 11965/19560 | loss 3.384661 (+0.60z)| norm 0.2814 (-0.27z)| lr 2.10e-04 | 4180.91 ms | 32.3% bf16 MFU | 125150 tok/s step 11966/19560 | loss 3.357203 (-0.15z)| norm 0.2925 (+0.44z)| lr 2.10e-04 | 4165.20 ms | 32.4% bf16 MFU | 125186 tok/s step 11967/19560 | loss 3.309944 (-1.42z)| norm 0.3089 (+1.46z)| lr 2.10e-04 | 4164.73 ms | 32.4% bf16 MFU | 125221 tok/s step 11968/19560 | loss 3.390285 (+0.74z)| norm 0.2801 (-0.36z)| lr 2.10e-04 | 4183.94 ms | 32.3% bf16 MFU | 125225 tok/s step 11969/19560 | loss 3.386162 (+0.67z)| norm 0.3045 (+1.17z)| lr 2.10e-04 | 4174.48 ms | 32.3% bf16 MFU | 125244 tok/s step 11970/19560 | loss 3.344244 (-0.50z)| norm 0.2881 (+0.14z)| lr 2.10e-04 | 4156.02 ms | 32.5% bf16 MFU | 125289 tok/s step 11971/19560 | loss 3.306115 (-1.54z)| norm 0.2916 (+0.34z)| lr 2.09e-04 | 4167.58 ms | 32.4% bf16 MFU | 125315 tok/s step 11972/19560 | loss 3.369254 (+0.22z)| norm 0.2922 (+0.37z)| lr 2.09e-04 | 4160.51 ms | 32.5% bf16 MFU | 125350 tok/s step 11973/19560 | loss 3.347460 (-0.38z)| norm 0.2949 (+0.53z)| lr 2.09e-04 | 4171.55 ms | 32.4% bf16 MFU | 125366 tok/s step 11974/19560 | loss 3.585803 (+5.44z)| norm 0.2916 (+0.31z)| lr 2.09e-04 | 4164.07 ms | 32.4% bf16 MFU | 125393 tok/s step 11975/19560 | loss 3.348878 (-0.33z)| norm 0.2758 (-0.71z)| lr 2.09e-04 | 4164.24 ms | 32.4% bf16 MFU | 125419 tok/s step 11976/19560 | loss 3.369674 (+0.17z)| norm 0.2704 (-1.04z)| lr 2.09e-04 | 
4161.23 ms | 32.4% bf16 MFU | 125448 tok/s step 11977/19560 | loss 3.385984 (+0.56z)| norm 0.2727 (-0.91z)| lr 2.09e-04 | 4168.29 ms | 32.4% bf16 MFU | 125464 tok/s step 11978/19560 | loss 3.337641 (-0.63z)| norm 0.2880 (+0.10z)| lr 2.09e-04 | 4164.92 ms | 32.4% bf16 MFU | 125485 tok/s step 11979/19560 | loss 3.369098 (+0.15z)| norm 0.2698 (-1.11z)| lr 2.09e-04 | 4165.82 ms | 32.4% bf16 MFU | 125504 tok/s step 11980/19560 | loss 3.393737 (+0.75z)| norm 0.2908 (+0.28z)| lr 2.09e-04 | 4172.16 ms | 32.4% bf16 MFU | 125512 tok/s step 11981/19560 | loss 3.297477 (-1.64z)| norm 0.3743 (+5.15z)| lr 2.09e-04 | 4168.81 ms | 32.4% bf16 MFU | 125524 tok/s step 11982/19560 | loss 3.393181 (+0.80z)| norm 0.2617 (-1.47z)| lr 2.09e-04 | 4171.93 ms | 32.4% bf16 MFU | 125532 tok/s step 11983/19560 | loss 3.363440 (+0.05z)| norm 0.2825 (-0.26z)| lr 2.09e-04 | 4171.05 ms | 32.4% bf16 MFU | 125540 tok/s step 11984/19560 | loss 3.343267 (-0.47z)| norm 0.2558 (-1.79z)| lr 2.09e-04 | 4156.85 ms | 32.5% bf16 MFU | 125569 tok/s step 11985/19560 | loss 3.451136 (+2.25z)| norm 0.2784 (-0.47z)| lr 2.09e-04 | 4177.92 ms | 32.3% bf16 MFU | 125565 tok/s step 11986/19560 | loss 3.359914 (-0.06z)| norm 0.2644 (-1.27z)| lr 2.09e-04 | 4179.72 ms | 32.3% bf16 MFU | 125559 tok/s step 11987/19560 | loss 3.355988 (-0.16z)| norm 0.2850 (-0.08z)| lr 2.09e-04 | 4177.32 ms | 32.3% bf16 MFU | 125556 tok/s step 11988/19560 | loss 3.382927 (+0.52z)| norm 0.2647 (-1.26z)| lr 2.09e-04 | 4160.52 ms | 32.5% bf16 MFU | 125579 tok/s step 11989/19560 | loss 3.416258 (+1.34z)| norm 0.2858 (-0.03z)| lr 2.09e-04 | 4161.33 ms | 32.4% bf16 MFU | 125600 tok/s step 11990/19560 | loss 3.327257 (-0.89z)| norm 0.3044 (+1.03z)| lr 2.09e-04 | 4170.69 ms | 32.4% bf16 MFU | 125605 tok/s step 11991/19560 | loss 3.403988 (+1.08z)| norm 0.2974 (+0.62z)| lr 2.09e-04 | 4171.02 ms | 32.4% bf16 MFU | 125610 tok/s step 11992/19560 | loss 3.372917 (+0.27z)| norm 0.2781 (-0.49z)| lr 2.08e-04 | 4165.91 ms | 32.4% bf16 MFU | 125622 tok/s step 
11993/19560 | loss 3.340016 (-0.57z)| norm 0.2966 (+0.57z)| lr 2.08e-04 | 4162.81 ms | 32.4% bf16 MFU | 125638 tok/s step 11994/19560 | loss 3.304956 (-1.45z)| norm 0.2692 (-1.01z)| lr 2.08e-04 | 4155.30 ms | 32.5% bf16 MFU | 125665 tok/s step 11995/19560 | loss 3.311621 (-1.26z)| norm 0.2888 (+0.13z)| lr 2.08e-04 | 4170.95 ms | 32.4% bf16 MFU | 125667 tok/s step 11996/19560 | loss 3.319416 (-1.05z)| norm 0.2744 (-0.72z)| lr 2.08e-04 | 4166.18 ms | 32.4% bf16 MFU | 125675 tok/s step 11997/19560 | loss 3.390854 (+0.75z)| norm 0.2831 (-0.21z)| lr 2.08e-04 | 4176.77 ms | 32.3% bf16 MFU | 125668 tok/s step 11998/19560 | loss 3.348904 (-0.31z)| norm 0.2787 (-0.46z)| lr 2.08e-04 | 4186.61 ms | 32.2% bf16 MFU | 125646 tok/s step 11999/19560 | loss 3.348784 (-0.31z)| norm 0.2923 (+0.33z)| lr 2.08e-04 | 4167.28 ms | 32.4% bf16 MFU | 125654 tok/s step 12000/19560 | loss 3.306360 (-1.37z)| norm 0.3127 (+1.49z)| lr 2.08e-04 | 4171.16 ms | 32.4% bf16 MFU | 125656 tok/s val loss 3.345189 evaluating HellaSwag: 0/1256 evaluating HellaSwag: 10/1256 evaluating HellaSwag: 20/1256 evaluating HellaSwag: 30/1256 evaluating HellaSwag: 40/1256 evaluating HellaSwag: 50/1256 evaluating HellaSwag: 60/1256 evaluating HellaSwag: 70/1256 evaluating HellaSwag: 80/1256 evaluating HellaSwag: 90/1256 evaluating HellaSwag: 100/1256 evaluating HellaSwag: 110/1256 evaluating HellaSwag: 120/1256 evaluating HellaSwag: 130/1256 evaluating HellaSwag: 140/1256 evaluating HellaSwag: 150/1256 evaluating HellaSwag: 160/1256 evaluating HellaSwag: 170/1256 evaluating HellaSwag: 180/1256 evaluating HellaSwag: 190/1256 evaluating HellaSwag: 200/1256 evaluating HellaSwag: 210/1256 evaluating HellaSwag: 220/1256 evaluating HellaSwag: 230/1256 evaluating HellaSwag: 240/1256 evaluating HellaSwag: 250/1256 evaluating HellaSwag: 260/1256 evaluating HellaSwag: 270/1256 evaluating HellaSwag: 280/1256 evaluating HellaSwag: 290/1256 evaluating HellaSwag: 300/1256 evaluating HellaSwag: 310/1256 evaluating HellaSwag: 
320/1256 evaluating HellaSwag: 330/1256 evaluating HellaSwag: 340/1256 evaluating HellaSwag: 350/1256 evaluating HellaSwag: 360/1256 evaluating HellaSwag: 370/1256 evaluating HellaSwag: 380/1256 evaluating HellaSwag: 390/1256 evaluating HellaSwag: 400/1256 evaluating HellaSwag: 410/1256 evaluating HellaSwag: 420/1256 evaluating HellaSwag: 430/1256 evaluating HellaSwag: 440/1256 evaluating HellaSwag: 450/1256 evaluating HellaSwag: 460/1256 evaluating HellaSwag: 470/1256 evaluating HellaSwag: 480/1256 evaluating HellaSwag: 490/1256 evaluating HellaSwag: 500/1256 evaluating HellaSwag: 510/1256 evaluating HellaSwag: 520/1256 evaluating HellaSwag: 530/1256 evaluating HellaSwag: 540/1256 evaluating HellaSwag: 550/1256 evaluating HellaSwag: 560/1256 evaluating HellaSwag: 570/1256 evaluating HellaSwag: 580/1256 evaluating HellaSwag: 590/1256 evaluating HellaSwag: 600/1256 evaluating HellaSwag: 610/1256 evaluating HellaSwag: 620/1256 evaluating HellaSwag: 630/1256 evaluating HellaSwag: 640/1256 evaluating HellaSwag: 650/1256 evaluating HellaSwag: 660/1256 evaluating HellaSwag: 670/1256 evaluating HellaSwag: 680/1256 evaluating HellaSwag: 690/1256 evaluating HellaSwag: 700/1256 evaluating HellaSwag: 710/1256 evaluating HellaSwag: 720/1256 evaluating HellaSwag: 730/1256 evaluating HellaSwag: 740/1256 evaluating HellaSwag: 750/1256 evaluating HellaSwag: 760/1256 evaluating HellaSwag: 770/1256 evaluating HellaSwag: 780/1256 evaluating HellaSwag: 790/1256 evaluating HellaSwag: 800/1256 evaluating HellaSwag: 810/1256 evaluating HellaSwag: 820/1256 evaluating HellaSwag: 830/1256 evaluating HellaSwag: 840/1256 evaluating HellaSwag: 850/1256 evaluating HellaSwag: 860/1256 evaluating HellaSwag: 870/1256 evaluating HellaSwag: 880/1256 evaluating HellaSwag: 890/1256 evaluating HellaSwag: 900/1256 evaluating HellaSwag: 910/1256 evaluating HellaSwag: 920/1256 evaluating HellaSwag: 930/1256 evaluating HellaSwag: 940/1256 evaluating HellaSwag: 950/1256 evaluating HellaSwag: 960/1256 
evaluating HellaSwag: 970/1256 evaluating HellaSwag: 980/1256 evaluating HellaSwag: 990/1256 evaluating HellaSwag: 1000/1256 evaluating HellaSwag: 1010/1256 evaluating HellaSwag: 1020/1256 evaluating HellaSwag: 1030/1256 evaluating HellaSwag: 1040/1256 evaluating HellaSwag: 1050/1256 evaluating HellaSwag: 1060/1256 evaluating HellaSwag: 1070/1256 evaluating HellaSwag: 1080/1256 evaluating HellaSwag: 1090/1256 evaluating HellaSwag: 1100/1256 evaluating HellaSwag: 1110/1256 evaluating HellaSwag: 1120/1256 evaluating HellaSwag: 1130/1256 evaluating HellaSwag: 1140/1256 evaluating HellaSwag: 1150/1256 evaluating HellaSwag: 1160/1256 evaluating HellaSwag: 1170/1256 evaluating HellaSwag: 1180/1256 evaluating HellaSwag: 1190/1256 evaluating HellaSwag: 1200/1256 evaluating HellaSwag: 1210/1256 evaluating HellaSwag: 1220/1256 evaluating HellaSwag: 1230/1256 evaluating HellaSwag: 1240/1256 evaluating HellaSwag: 1250/1256 HellaSwag: 2933/10042 = 0.292073 step 12001/19560 | loss 3.360310 (-0.01z)| norm 0.2886 (+0.09z)| lr 2.08e-04 | 4179.14 ms | 32.3% bf16 MFU | 125646 tok/s step 12002/19560 | loss 3.400517 (+1.00z)| norm 0.3049 (+1.02z)| lr 2.08e-04 | 4175.27 ms | 32.3% bf16 MFU | 125642 tok/s step 12003/19560 | loss 3.344067 (-0.44z)| norm 0.3093 (+1.25z)| lr 2.08e-04 | 4171.92 ms | 32.4% bf16 MFU | 125644 tok/s step 12004/19560 | loss 3.374866 (+0.35z)| norm 0.2938 (+0.34z)| lr 2.08e-04 | 4156.65 ms | 32.5% bf16 MFU | 125668 tok/s step 12005/19560 | loss 3.424560 (+1.59z)| norm 0.2950 (+0.40z)| lr 2.08e-04 | 4176.82 ms | 32.3% bf16 MFU | 125661 tok/s step 12006/19560 | loss 3.322487 (-0.99z)| norm 0.2835 (-0.27z)| lr 2.08e-04 | 4165.83 ms | 32.4% bf16 MFU | 125670 tok/s step 12007/19560 | loss 3.399364 (+0.95z)| norm 0.2739 (-0.82z)| lr 2.08e-04 | 4160.40 ms | 32.5% bf16 MFU | 125688 tok/s step 12008/19560 | loss 3.339303 (-0.56z)| norm 0.2696 (-1.12z)| lr 2.08e-04 | 4171.99 ms | 32.4% bf16 MFU | 125687 tok/s step 12009/19560 | loss 3.442014 (+1.99z)| norm 0.3014 (+0.96z)| 
lr 2.08e-04 | 4170.07 ms | 32.4% bf16 MFU | 125689 tok/s step 12010/19560 | loss 3.311006 (-1.25z)| norm 0.2841 (-0.19z)| lr 2.08e-04 | 4158.64 ms | 32.5% bf16 MFU | 125708 tok/s step 12011/19560 | loss 3.361876 (+0.00z)| norm 0.2847 (-0.15z)| lr 2.08e-04 | 4158.80 ms | 32.5% bf16 MFU | 125726 tok/s step 12012/19560 | loss 3.368766 (+0.17z)| norm 0.2817 (-0.34z)| lr 2.08e-04 | 4159.50 ms | 32.5% bf16 MFU | 125742 tok/s step 12013/19560 | loss 3.328672 (-0.82z)| norm 0.2735 (-0.87z)| lr 2.07e-04 | 4161.94 ms | 32.4% bf16 MFU | 125753 tok/s step 12014/19560 | loss 3.342438 (-0.46z)| norm 0.2584 (-1.83z)| lr 2.07e-04 | 4187.57 ms | 32.2% bf16 MFU | 125726 tok/s step 12015/19560 | loss 3.315187 (-1.13z)| norm 0.2669 (-1.27z)| lr 2.07e-04 | 4160.32 ms | 32.5% bf16 MFU | 125741 tok/s step 12016/19560 | loss 3.340090 (-0.50z)| norm 0.2843 (-0.13z)| lr 2.07e-04 | 4177.48 ms | 32.3% bf16 MFU | 125729 tok/s step 12017/19560 | loss 3.365938 (+0.17z)| norm 0.2617 (-1.58z)| lr 2.07e-04 | 4166.84 ms | 32.4% bf16 MFU | 125734 tok/s step 12018/19560 | loss 3.346510 (-0.32z)| norm 0.3246 (+2.40z)| lr 2.07e-04 | 4171.07 ms | 32.4% bf16 MFU | 125732 tok/s step 12019/19560 | loss 3.365561 (+0.17z)| norm 0.2725 (-0.87z)| lr 2.07e-04 | 4159.99 ms | 32.5% bf16 MFU | 125747 tok/s step 12020/19560 | loss 3.401296 (+1.07z)| norm 0.2717 (-0.91z)| lr 2.07e-04 | 4155.81 ms | 32.5% bf16 MFU | 125767 tok/s step 12021/19560 | loss 3.381573 (+0.57z)| norm 0.2763 (-0.61z)| lr 2.07e-04 | 4168.88 ms | 32.4% bf16 MFU | 125767 tok/s step 12022/19560 | loss 3.373327 (+0.35z)| norm 0.2716 (-0.90z)| lr 2.07e-04 | 4172.16 ms | 32.4% bf16 MFU | 125762 tok/s step 12023/19560 | loss 3.334485 (-0.63z)| norm 0.2769 (-0.56z)| lr 2.07e-04 | 4158.12 ms | 32.5% bf16 MFU | 125778 tok/s step 12024/19560 | loss 3.453894 (+2.36z)| norm 0.2811 (-0.28z)| lr 2.07e-04 | 4163.09 ms | 32.4% bf16 MFU | 125786 tok/s step 12025/19560 | loss 3.376281 (+0.40z)| norm 0.2853 (-0.02z)| lr 2.07e-04 | 4149.11 ms | 32.5% bf16 MFU | 
125815 tok/s step 12026/19560 | loss 3.398525 (+0.96z)| norm 0.2951 (+0.61z)| lr 2.07e-04 | 4183.05 ms | 32.3% bf16 MFU | 125791 tok/s step 12027/19560 | loss 3.295939 (-1.66z)| norm 0.2569 (-1.80z)| lr 2.07e-04 | 4158.55 ms | 32.5% bf16 MFU | 125805 tok/s step 12028/19560 | loss 3.401421 (+1.02z)| norm 0.2863 (+0.06z)| lr 2.07e-04 | 4167.11 ms | 32.4% bf16 MFU | 125806 tok/s step 12029/19560 | loss 3.328559 (-0.84z)| norm 0.2652 (-1.26z)| lr 2.07e-04 | 4170.13 ms | 32.4% bf16 MFU | 125802 tok/s step 12030/19560 | loss 3.333276 (-0.71z)| norm 0.2646 (-1.28z)| lr 2.07e-04 | 4176.63 ms | 32.3% bf16 MFU | 125788 tok/s step 12031/19560 | loss 3.422344 (+1.54z)| norm 0.2814 (-0.22z)| lr 2.07e-04 | 4164.00 ms | 32.4% bf16 MFU | 125794 tok/s step 12032/19560 | loss 3.349351 (-0.30z)| norm 0.2821 (-0.17z)| lr 2.07e-04 | 4173.12 ms | 32.4% bf16 MFU | 125786 tok/s step 12033/19560 | loss 3.286854 (-1.86z)| norm 0.2923 (+0.48z)| lr 2.07e-04 | 4166.21 ms | 32.4% bf16 MFU | 125789 tok/s step 12034/19560 | loss 3.469270 (+2.63z)| norm 0.2904 (+0.35z)| lr 2.06e-04 | 4170.49 ms | 32.4% bf16 MFU | 125785 tok/s step 12035/19560 | loss 3.417184 (+1.33z)| norm 0.2670 (-1.12z)| lr 2.06e-04 | 4172.31 ms | 32.4% bf16 MFU | 125779 tok/s step 12036/19560 | loss 3.374609 (+0.29z)| norm 0.2919 (+0.45z)| lr 2.06e-04 | 4175.73 ms | 32.3% bf16 MFU | 125768 tok/s step 12037/19560 | loss 3.447934 (+2.02z)| norm 0.2858 (+0.07z)| lr 2.06e-04 | 4171.94 ms | 32.4% bf16 MFU | 125763 tok/s step 12038/19560 | loss 3.311728 (-1.24z)| norm 0.3007 (+1.00z)| lr 2.06e-04 | 4190.05 ms | 32.2% bf16 MFU | 125731 tok/s step 12039/19560 | loss 3.324596 (-0.93z)| norm 0.2966 (+0.74z)| lr 2.06e-04 | 4188.77 ms | 32.2% bf16 MFU | 125703 tok/s step 12040/19560 | loss 3.383121 (+0.48z)| norm 0.3052 (+1.27z)| lr 2.06e-04 | 4169.72 ms | 32.4% bf16 MFU | 125704 tok/s step 12041/19560 | loss 3.432332 (+1.63z)| norm 0.2854 (+0.04z)| lr 2.06e-04 | 4173.24 ms | 32.4% bf16 MFU | 125701 tok/s step 12042/19560 | loss 3.403223 
(+0.93z)| norm 0.2953 (+0.66z)| lr 2.06e-04 | 4160.43 ms | 32.5% bf16 MFU | 125717 tok/s step 12043/19560 | loss 3.330291 (-0.82z)| norm 0.2728 (-0.74z)| lr 2.06e-04 | 4158.85 ms | 32.5% bf16 MFU | 125734 tok/s step 12044/19560 | loss 3.356891 (-0.19z)| norm 0.2818 (-0.18z)| lr 2.06e-04 | 4170.63 ms | 32.4% bf16 MFU | 125733 tok/s step 12045/19560 | loss 3.309713 (-1.31z)| norm 0.3023 (+1.09z)| lr 2.06e-04 | 4161.99 ms | 32.4% bf16 MFU | 125745 tok/s step 12046/19560 | loss 3.345443 (-0.45z)| norm 0.3045 (+1.21z)| lr 2.06e-04 | 4167.80 ms | 32.4% bf16 MFU | 125747 tok/s step 12047/19560 | loss 3.347731 (-0.38z)| norm 0.2873 (+0.14z)| lr 2.06e-04 | 4176.22 ms | 32.3% bf16 MFU | 125737 tok/s step 12048/19560 | loss 3.353578 (-0.24z)| norm 0.3012 (+0.99z)| lr 2.06e-04 | 4171.40 ms | 32.4% bf16 MFU | 125734 tok/s step 12049/19560 | loss 3.374129 (+0.25z)| norm 0.2827 (-0.16z)| lr 2.06e-04 | 4168.88 ms | 32.4% bf16 MFU | 125736 tok/s step 12050/19560 | loss 3.338887 (-0.59z)| norm 0.2814 (-0.25z)| lr 2.06e-04 | 4160.91 ms | 32.4% bf16 MFU | 125749 tok/s step 12051/19560 | loss 3.356326 (-0.16z)| norm 0.2820 (-0.22z)| lr 2.06e-04 | 4155.04 ms | 32.5% bf16 MFU | 125771 tok/s step 12052/19560 | loss 3.361264 (-0.05z)| norm 0.2788 (-0.42z)| lr 2.06e-04 | 4159.35 ms | 32.5% bf16 MFU | 125785 tok/s step 12053/19560 | loss 3.367806 (+0.10z)| norm 0.2928 (+0.45z)| lr 2.06e-04 | 4168.26 ms | 32.4% bf16 MFU | 125785 tok/s step 12054/19560 | loss 3.364220 (+0.01z)| norm 0.2989 (+0.83z)| lr 2.06e-04 | 4167.73 ms | 32.4% bf16 MFU | 125785 tok/s step 12055/19560 | loss 3.343139 (-0.51z)| norm 0.2713 (-0.93z)| lr 2.05e-04 | 4166.07 ms | 32.4% bf16 MFU | 125788 tok/s step 12056/19560 | loss 3.383808 (+0.49z)| norm 0.3053 (+1.21z)| lr 2.05e-04 | 4162.87 ms | 32.4% bf16 MFU | 125796 tok/s step 12057/19560 | loss 3.329955 (-0.84z)| norm 0.2565 (-1.88z)| lr 2.05e-04 | 4151.23 ms | 32.5% bf16 MFU | 125821 tok/s step 12058/19560 | loss 3.387471 (+0.57z)| norm 0.2890 (+0.17z)| lr 2.05e-04 | 
4155.49 ms | 32.5% bf16 MFU | 125838 tok/s step 12059/19560 | loss 3.448482 (+2.02z)| norm 0.2875 (+0.07z)| lr 2.05e-04 | 4155.20 ms | 32.5% bf16 MFU | 125855 tok/s step 12060/19560 | loss 3.475616 (+2.59z)| norm 0.2816 (-0.30z)| lr 2.05e-04 | 4159.01 ms | 32.5% bf16 MFU | 125866 tok/s step 12061/19560 | loss 3.390637 (+0.57z)| norm 0.2820 (-0.27z)| lr 2.05e-04 | 4161.68 ms | 32.4% bf16 MFU | 125871 tok/s step 12062/19560 | loss 3.364475 (-0.05z)| norm 0.2691 (-1.08z)| lr 2.05e-04 | 4174.11 ms | 32.3% bf16 MFU | 125858 tok/s step 12063/19560 | loss 3.360347 (-0.14z)| norm 0.2762 (-0.61z)| lr 2.05e-04 | 4171.26 ms | 32.4% bf16 MFU | 125850 tok/s step 12064/19560 | loss 3.358819 (-0.18z)| norm 0.2840 (-0.11z)| lr 2.05e-04 | 4154.69 ms | 32.5% bf16 MFU | 125867 tok/s step 12065/19560 | loss 3.338608 (-0.66z)| norm 0.2826 (-0.20z)| lr 2.05e-04 | 4155.70 ms | 32.5% bf16 MFU | 125881 tok/s step 12066/19560 | loss 3.336521 (-0.70z)| norm 0.2693 (-1.03z)| lr 2.05e-04 | 4162.16 ms | 32.4% bf16 MFU | 125886 tok/s step 12067/19560 | loss 3.333298 (-0.77z)| norm 0.2834 (-0.12z)| lr 2.05e-04 | 4165.50 ms | 32.4% bf16 MFU | 125885 tok/s step 12068/19560 | loss 3.374034 (+0.20z)| norm 0.2602 (-1.59z)| lr 2.05e-04 | 4157.21 ms | 32.5% bf16 MFU | 125896 tok/s step 12069/19560 | loss 3.289525 (-1.76z)| norm 0.2749 (-0.63z)| lr 2.05e-04 | 4176.40 ms | 32.3% bf16 MFU | 125878 tok/s step 12070/19560 | loss 3.344995 (-0.46z)| norm 0.2842 (-0.02z)| lr 2.05e-04 | 4162.70 ms | 32.4% bf16 MFU | 125882 tok/s step 12071/19560 | loss 3.397032 (+0.75z)| norm 0.2623 (-1.44z)| lr 2.05e-04 | 4167.37 ms | 32.4% bf16 MFU | 125878 tok/s step 12072/19560 | loss 3.295384 (-1.60z)| norm 0.2985 (+0.91z)| lr 2.05e-04 | 4169.10 ms | 32.4% bf16 MFU | 125872 tok/s step 12073/19560 | loss 3.408203 (+1.01z)| norm 0.2833 (-0.08z)| lr 2.05e-04 | 4164.58 ms | 32.4% bf16 MFU | 125873 tok/s step 12074/19560 | loss 3.369943 (+0.11z)| norm 0.2785 (-0.38z)| lr 2.05e-04 | 4157.89 ms | 32.5% bf16 MFU | 125884 tok/s step 
12075/19560 | loss 3.336868 (-0.66z)| norm 0.2717 (-0.81z)| lr 2.05e-04 | 4168.62 ms | 32.4% bf16 MFU | 125878 tok/s step 12076/19560 | loss 3.331808 (-0.77z)| norm 0.2811 (-0.18z)| lr 2.04e-04 | 4159.07 ms | 32.5% bf16 MFU | 125887 tok/s step 12077/19560 | loss 3.357679 (-0.16z)| norm 0.2582 (-1.66z)| lr 2.04e-04 | 4174.68 ms | 32.3% bf16 MFU | 125872 tok/s step 12078/19560 | loss 3.347973 (-0.38z)| norm 0.2956 (+0.76z)| lr 2.04e-04 | 4172.59 ms | 32.4% bf16 MFU | 125861 tok/s step 12079/19560 | loss 3.378863 (+0.35z)| norm 0.2873 (+0.22z)| lr 2.04e-04 | 4175.29 ms | 32.3% bf16 MFU | 125847 tok/s step 12080/19560 | loss 3.381136 (+0.39z)| norm 0.2771 (-0.44z)| lr 2.04e-04 | 4157.00 ms | 32.5% bf16 MFU | 125860 tok/s step 12081/19560 | loss 3.383490 (+0.44z)| norm 0.2881 (+0.28z)| lr 2.04e-04 | 4164.02 ms | 32.4% bf16 MFU | 125863 tok/s step 12082/19560 | loss 3.344488 (-0.47z)| norm 0.2666 (-1.11z)| lr 2.04e-04 | 4247.63 ms | 31.8% bf16 MFU | 125741 tok/s step 12083/19560 | loss 3.358830 (-0.14z)| norm 0.3113 (+1.76z)| lr 2.04e-04 | 4163.39 ms | 32.4% bf16 MFU | 125750 tok/s step 12084/19560 | loss 3.392835 (+0.66z)| norm 0.2769 (-0.44z)| lr 2.04e-04 | 4169.04 ms | 32.4% bf16 MFU | 125751 tok/s step 12085/19560 | loss 3.383223 (+0.43z)| norm 0.2594 (-1.54z)| lr 2.04e-04 | 4148.63 ms | 32.5% bf16 MFU | 125782 tok/s step 12086/19560 | loss 3.310594 (-1.26z)| norm 0.2721 (-0.73z)| lr 2.04e-04 | 4156.67 ms | 32.5% bf16 MFU | 125800 tok/s step 12087/19560 | loss 3.361583 (-0.07z)| norm 0.3336 (+3.04z)| lr 2.04e-04 | 4160.53 ms | 32.5% bf16 MFU | 125810 tok/s step 12088/19560 | loss 3.351429 (-0.31z)| norm 0.2898 (+0.35z)| lr 2.04e-04 | 4161.79 ms | 32.4% bf16 MFU | 125819 tok/s step 12089/19560 | loss 3.367969 (+0.08z)| norm 0.2685 (-0.94z)| lr 2.04e-04 | 4167.73 ms | 32.4% bf16 MFU | 125818 tok/s step 12090/19560 | loss 3.367523 (+0.07z)| norm 0.2748 (-0.56z)| lr 2.04e-04 | 4166.27 ms | 32.4% bf16 MFU | 125819 tok/s step 12091/19560 | loss 3.387905 (+0.54z)| norm 
0.2711 (-0.80z)| lr 2.04e-04 | 4160.62 ms | 32.5% bf16 MFU | 125828 tok/s step 12092/19560 | loss 3.381778 (+0.39z)| norm 0.2948 (+0.66z)| lr 2.04e-04 | 4177.33 ms | 32.3% bf16 MFU | 125812 tok/s step 12093/19560 | loss 3.356629 (-0.20z)| norm 0.2573 (-1.61z)| lr 2.04e-04 | 4165.02 ms | 32.4% bf16 MFU | 125816 tok/s step 12094/19560 | loss 3.369626 (+0.11z)| norm 0.2795 (-0.26z)| lr 2.04e-04 | 4174.59 ms | 32.3% bf16 MFU | 125804 tok/s step 12095/19560 | loss 3.305798 (-1.39z)| norm 0.2715 (-0.74z)| lr 2.04e-04 | 4164.27 ms | 32.4% bf16 MFU | 125809 tok/s step 12096/19560 | loss 3.329720 (-0.82z)| norm 0.2654 (-1.10z)| lr 2.04e-04 | 4160.16 ms | 32.5% bf16 MFU | 125820 tok/s step 12097/19560 | loss 3.343981 (-0.48z)| norm 0.2589 (-1.47z)| lr 2.04e-04 | 4163.05 ms | 32.4% bf16 MFU | 125826 tok/s step 12098/19560 | loss 3.340359 (-0.56z)| norm 0.2630 (-1.20z)| lr 2.03e-04 | 4197.17 ms | 32.2% bf16 MFU | 125780 tok/s step 12099/19560 | loss 3.346771 (-0.42z)| norm 0.2800 (-0.17z)| lr 2.03e-04 | 4180.81 ms | 32.3% bf16 MFU | 125762 tok/s step 12100/19560 | loss 3.370238 (+0.14z)| norm 0.2798 (-0.17z)| lr 2.03e-04 | 4162.09 ms | 32.4% bf16 MFU | 125772 tok/s step 12101/19560 | loss 3.344065 (-0.48z)| norm 0.2842 (+0.10z)| lr 2.03e-04 | 4159.12 ms | 32.5% bf16 MFU | 125786 tok/s step 12102/19560 | loss 3.404004 (+1.09z)| norm 0.2716 (-0.66z)| lr 2.03e-04 | 4171.33 ms | 32.4% bf16 MFU | 125781 tok/s step 12103/19560 | loss 3.359950 (-0.09z)| norm 0.3135 (+1.85z)| lr 2.03e-04 | 4159.59 ms | 32.5% bf16 MFU | 125794 tok/s step 12104/19560 | loss 3.319919 (-1.13z)| norm 0.2704 (-0.74z)| lr 2.03e-04 | 7347.31 ms | 18.4% bf16 MFU | 123073 tok/s step 12105/19560 | loss 3.312574 (-1.31z)| norm 0.2704 (-0.74z)| lr 2.03e-04 | 4153.77 ms | 32.5% bf16 MFU | 123230 tok/s step 12106/19560 | loss 3.343267 (-0.50z)| norm 0.2847 (+0.13z)| lr 2.03e-04 | 4171.16 ms | 32.4% bf16 MFU | 123353 tok/s step 12107/19560 | loss 3.421969 (+1.55z)| norm 0.2501 (-1.92z)| lr 2.03e-04 | 4170.32 ms | 
32.4% bf16 MFU | 123471 tok/s step 12108/19560 | loss 3.343193 (-0.50z)| norm 0.2821 (-0.02z)| lr 2.03e-04 | 4149.69 ms | 32.5% bf16 MFU | 123615 tok/s step 12109/19560 | loss 3.386814 (+0.63z)| norm 0.2816 (-0.01z)| lr 2.03e-04 | 4152.29 ms | 32.5% bf16 MFU | 123747 tok/s step 12110/19560 | loss 3.392753 (+0.79z)| norm 0.3031 (+1.43z)| lr 2.03e-04 | 4152.99 ms | 32.5% bf16 MFU | 123872 tok/s step 12111/19560 | loss 3.369581 (+0.17z)| norm 0.2829 (+0.06z)| lr 2.03e-04 | 4152.33 ms | 32.5% bf16 MFU | 123992 tok/s step 12112/19560 | loss 3.434702 (+1.85z)| norm 0.2991 (+1.14z)| lr 2.03e-04 | 4175.85 ms | 32.3% bf16 MFU | 124070 tok/s step 12113/19560 | loss 3.343351 (-0.52z)| norm 0.3069 (+1.64z)| lr 2.03e-04 | 4170.16 ms | 32.4% bf16 MFU | 124153 tok/s step 12114/19560 | loss 3.354327 (-0.23z)| norm 0.2866 (+0.26z)| lr 2.03e-04 | 4172.85 ms | 32.4% bf16 MFU | 124227 tok/s step 12115/19560 | loss 3.361734 (-0.03z)| norm 0.3323 (+3.20z)| lr 2.03e-04 | 4167.39 ms | 32.4% bf16 MFU | 124306 tok/s step 12116/19560 | loss 3.360451 (-0.06z)| norm 0.2774 (-0.39z)| lr 2.03e-04 | 4166.21 ms | 32.4% bf16 MFU | 124383 tok/s step 12117/19560 | loss 3.412149 (+1.32z)| norm 0.3079 (+1.58z)| lr 2.03e-04 | 4164.95 ms | 32.4% bf16 MFU | 124458 tok/s step 12118/19560 | loss 3.346516 (-0.44z)| norm 0.2951 (+0.77z)| lr 2.03e-04 | 4166.46 ms | 32.4% bf16 MFU | 124527 tok/s step 12119/19560 | loss 3.341649 (-0.55z)| norm 0.2875 (+0.28z)| lr 2.02e-04 | 4173.74 ms | 32.3% bf16 MFU | 124581 tok/s step 12120/19560 | loss 3.319238 (-1.14z)| norm 0.2772 (-0.40z)| lr 2.02e-04 | 4254.33 ms | 31.7% bf16 MFU | 124514 tok/s step 12121/19560 | loss 3.386188 (+0.64z)| norm 0.3055 (+1.44z)| lr 2.02e-04 | 4166.92 ms | 32.4% bf16 MFU | 124579 tok/s step 12122/19560 | loss 3.393083 (+0.81z)| norm 0.3329 (+3.07z)| lr 2.02e-04 | 4170.68 ms | 32.4% bf16 MFU | 124636 tok/s step 12123/19560 | loss 3.376408 (+0.35z)| norm 0.3158 (+1.96z)| lr 2.02e-04 | 4182.75 ms | 32.3% bf16 MFU | 124671 tok/s step 12124/19560 
| loss 3.333511 (-0.82z)| norm 0.2914 (+0.45z)| lr 2.02e-04 | 4166.02 ms | 32.4% bf16 MFU | 124730 tok/s step 12125/19560 | loss 3.324039 (-1.06z)| norm 0.2958 (+0.71z)| lr 2.02e-04 | 4168.72 ms | 32.4% bf16 MFU | 124782 tok/s step 12126/19560 | loss 3.367301 (+0.11z)| norm 0.2834 (-0.06z)| lr 2.02e-04 | 4178.26 ms | 32.3% bf16 MFU | 124817 tok/s step 12127/19560 | loss 3.321157 (-1.13z)| norm 0.2826 (-0.10z)| lr 2.02e-04 | 4171.20 ms | 32.4% bf16 MFU | 124861 tok/s step 12128/19560 | loss 3.425657 (+1.66z)| norm 0.2863 (+0.14z)| lr 2.02e-04 | 4171.04 ms | 32.4% bf16 MFU | 124902 tok/s step 12129/19560 | loss 3.368474 (+0.12z)| norm 0.2737 (-0.64z)| lr 2.02e-04 | 4169.13 ms | 32.4% bf16 MFU | 124945 tok/s step 12130/19560 | loss 3.310889 (-1.40z)| norm 0.3033 (+1.21z)| lr 2.02e-04 | 4170.01 ms | 32.4% bf16 MFU | 124984 tok/s step 12131/19560 | loss 3.337905 (-0.68z)| norm 0.2642 (-1.21z)| lr 2.02e-04 | 4185.15 ms | 32.3% bf16 MFU | 124999 tok/s step 12132/19560 | loss 3.342302 (-0.55z)| norm 0.3056 (+1.37z)| lr 2.02e-04 | 4168.48 ms | 32.4% bf16 MFU | 125037 tok/s step 12133/19560 | loss 3.345508 (-0.46z)| norm 0.2833 (-0.02z)| lr 2.02e-04 | 4177.42 ms | 32.3% bf16 MFU | 125061 tok/s step 12134/19560 | loss 3.281261 (-2.15z)| norm 0.2776 (-0.37z)| lr 2.02e-04 | 4163.67 ms | 32.4% bf16 MFU | 125104 tok/s step 12135/19560 | loss 3.355114 (-0.18z)| norm 0.2753 (-0.52z)| lr 2.02e-04 | 4161.37 ms | 32.4% bf16 MFU | 125148 tok/s step 12136/19560 | loss 3.370515 (+0.23z)| norm 0.2793 (-0.27z)| lr 2.02e-04 | 4171.66 ms | 32.4% bf16 MFU | 125175 tok/s step 12137/19560 | loss 3.299802 (-1.65z)| norm 0.2595 (-1.48z)| lr 2.02e-04 | 4158.71 ms | 32.5% bf16 MFU | 125219 tok/s step 12138/19560 | loss 3.367048 (+0.16z)| norm 0.2710 (-0.76z)| lr 2.02e-04 | 4161.96 ms | 32.4% bf16 MFU | 125257 tok/s step 12139/19560 | loss 3.307290 (-1.44z)| norm 0.2574 (-1.58z)| lr 2.02e-04 | 4176.62 ms | 32.3% bf16 MFU | 125271 tok/s step 12140/19560 | loss 3.360300 (-0.01z)| norm 0.3002 (+1.05z)| 
lr 2.01e-04 | 4165.29 ms | 32.4% bf16 MFU | 125301 tok/s step 12141/19560 | loss 3.334933 (-0.70z)| norm 0.2730 (-0.62z)| lr 2.01e-04 | 4172.17 ms | 32.4% bf16 MFU | 125319 tok/s step 12142/19560 | loss 3.376825 (+0.43z)| norm 0.2785 (-0.29z)| lr 2.01e-04 | 4173.34 ms | 32.4% bf16 MFU | 125334 tok/s step 12143/19560 | loss 3.276842 (-2.24z)| norm 0.2743 (-0.56z)| lr 2.01e-04 | 4173.40 ms | 32.4% bf16 MFU | 125349 tok/s step 12144/19560 | loss 3.319011 (-1.10z)| norm 0.2745 (-0.54z)| lr 2.01e-04 | 4171.50 ms | 32.4% bf16 MFU | 125365 tok/s step 12145/19560 | loss 3.339258 (-0.56z)| norm 0.2970 (+0.84z)| lr 2.01e-04 | 4175.68 ms | 32.3% bf16 MFU | 125375 tok/s step 12146/19560 | loss 3.357105 (-0.09z)| norm 0.2821 (-0.07z)| lr 2.01e-04 | 4170.35 ms | 32.4% bf16 MFU | 125392 tok/s step 12147/19560 | loss 3.365498 (+0.13z)| norm 0.2993 (+1.02z)| lr 2.01e-04 | 4172.87 ms | 32.4% bf16 MFU | 125405 tok/s step 12148/19560 | loss 3.348688 (-0.30z)| norm 0.2817 (-0.11z)| lr 2.01e-04 | 4172.03 ms | 32.4% bf16 MFU | 125418 tok/s step 12149/19560 | loss 3.325632 (-0.90z)| norm 0.2800 (-0.22z)| lr 2.01e-04 | 4176.43 ms | 32.3% bf16 MFU | 125424 tok/s step 12150/19560 | loss 3.410234 (+1.33z)| norm 0.2766 (-0.44z)| lr 2.01e-04 | 4169.74 ms | 32.4% bf16 MFU | 125439 tok/s step 12151/19560 | loss 3.282680 (-2.00z)| norm 0.3023 (+1.18z)| lr 2.01e-04 | 4159.34 ms | 32.5% bf16 MFU | 125470 tok/s step 12152/19560 | loss 3.360191 (+0.04z)| norm 0.2803 (-0.22z)| lr 2.01e-04 | 4164.42 ms | 32.4% bf16 MFU | 125491 tok/s step 12153/19560 | loss 3.347679 (-0.29z)| norm 0.3104 (+1.67z)| lr 2.01e-04 | 4172.47 ms | 32.4% bf16 MFU | 125499 tok/s step 12154/19560 | loss 3.339887 (-0.49z)| norm 0.2944 (+0.66z)| lr 2.01e-04 | 4171.77 ms | 32.4% bf16 MFU | 125508 tok/s step 12155/19560 | loss 3.350461 (-0.22z)| norm 0.3006 (+1.04z)| lr 2.01e-04 | 4174.83 ms | 32.3% bf16 MFU | 125512 tok/s step 12156/19560 | loss 3.277731 (-2.14z)| norm 0.2817 (-0.16z)| lr 2.01e-04 | 4170.29 ms | 32.4% bf16 MFU | 
125522 tok/s step 12157/19560 | loss 3.341844 (-0.42z)| norm 0.2885 (+0.26z)| lr 2.01e-04 | 4176.73 ms | 32.3% bf16 MFU | 125522 tok/s step 12158/19560 | loss 3.363663 (+0.15z)| norm 0.3030 (+1.17z)| lr 2.01e-04 | 4164.21 ms | 32.4% bf16 MFU | 125542 tok/s step 12159/19560 | loss 3.291123 (-1.76z)| norm 0.2893 (+0.29z)| lr 2.01e-04 | 4173.60 ms | 32.4% bf16 MFU | 125545 tok/s step 12160/19560 | loss 3.396861 (+1.06z)| norm 0.2809 (-0.25z)| lr 2.01e-04 | 4169.08 ms | 32.4% bf16 MFU | 125556 tok/s step 12161/19560 | loss 3.354386 (-0.09z)| norm 0.2843 (-0.02z)| lr 2.00e-04 | 4174.13 ms | 32.3% bf16 MFU | 125558 tok/s step 12162/19560 | loss 3.319716 (-1.03z)| norm 0.2837 (-0.06z)| lr 2.00e-04 | 4173.24 ms | 32.4% bf16 MFU | 125562 tok/s step 12163/19560 | loss 3.345908 (-0.29z)| norm 0.2783 (-0.41z)| lr 2.00e-04 | 4179.68 ms | 32.3% bf16 MFU | 125556 tok/s step 12164/19560 | loss 3.392589 (+1.03z)| norm 0.2902 (+0.35z)| lr 2.00e-04 | 4167.28 ms | 32.4% bf16 MFU | 125569 tok/s step 12165/19560 | loss 3.280812 (-2.11z)| norm 0.2787 (-0.38z)| lr 2.00e-04 | 4170.17 ms | 32.4% bf16 MFU | 125576 tok/s step 12166/19560 | loss 3.362190 (+0.20z)| norm 0.2840 (-0.03z)| lr 2.00e-04 | 4163.76 ms | 32.4% bf16 MFU | 125593 tok/s step 12167/19560 | loss 3.343775 (-0.33z)| norm 0.2766 (-0.50z)| lr 2.00e-04 | 4180.44 ms | 32.3% bf16 MFU | 125584 tok/s step 12168/19560 | loss 3.327078 (-0.80z)| norm 0.2737 (-0.67z)| lr 2.00e-04 | 4230.56 ms | 31.9% bf16 MFU | 125502 tok/s step 12169/19560 | loss 3.389972 (+1.03z)| norm 0.2862 (+0.14z)| lr 2.00e-04 | 4204.10 ms | 32.1% bf16 MFU | 125462 tok/s step 12170/19560 | loss 3.421580 (+1.93z)| norm 0.2710 (-0.84z)| lr 2.00e-04 | 4153.26 ms | 32.5% bf16 MFU | 125501 tok/s step 12171/19560 | loss 3.263840 (-2.56z)| norm 0.2681 (-1.03z)| lr 2.00e-04 | 4163.99 ms | 32.4% bf16 MFU | 125521 tok/s step 12172/19560 | loss 3.396553 (+1.18z)| norm 0.2834 (-0.03z)| lr 2.00e-04 | 4167.20 ms | 32.4% bf16 MFU | 125536 tok/s step 12173/19560 | loss 3.354018 
(-0.03z)| norm 0.2790 (-0.31z)| lr 2.00e-04 | 4178.61 ms | 32.3% bf16 MFU | 125532 tok/s step 12174/19560 | loss 3.406511 (+1.43z)| norm 0.3086 (+1.62z)| lr 2.00e-04 | 4162.85 ms | 32.4% bf16 MFU | 125553 tok/s step 12175/19560 | loss 3.369483 (+0.39z)| norm 0.2633 (-1.31z)| lr 2.00e-04 | 4175.51 ms | 32.3% bf16 MFU | 125553 tok/s step 12176/19560 | loss 3.362454 (+0.19z)| norm 0.2716 (-0.76z)| lr 2.00e-04 | 4166.01 ms | 32.4% bf16 MFU | 125568 tok/s step 12177/19560 | loss 3.326495 (-0.81z)| norm 0.2807 (-0.17z)| lr 2.00e-04 | 4162.67 ms | 32.4% bf16 MFU | 125587 tok/s step 12178/19560 | loss 3.361577 (+0.17z)| norm 0.2866 (+0.21z)| lr 2.00e-04 | 4158.61 ms | 32.5% bf16 MFU | 125612 tok/s step 12179/19560 | loss 3.346695 (-0.24z)| norm 0.2910 (+0.49z)| lr 2.00e-04 | 4179.92 ms | 32.3% bf16 MFU | 125603 tok/s step 12180/19560 | loss 3.344738 (-0.29z)| norm 0.2774 (-0.39z)| lr 2.00e-04 | 4173.37 ms | 32.4% bf16 MFU | 125604 tok/s step 12181/19560 | loss 3.358441 (+0.09z)| norm 0.2861 (+0.18z)| lr 2.00e-04 | 4179.77 ms | 32.3% bf16 MFU | 125595 tok/s step 12182/19560 | loss 3.312711 (-1.17z)| norm 0.2773 (-0.38z)| lr 1.99e-04 | 4186.41 ms | 32.3% bf16 MFU | 125577 tok/s step 12183/19560 | loss 3.380065 (+0.70z)| norm 0.2837 (+0.03z)| lr 1.99e-04 | 4173.84 ms | 32.3% bf16 MFU | 125579 tok/s step 12184/19560 | loss 3.361442 (+0.18z)| norm 0.2634 (-1.27z)| lr 1.99e-04 | 4179.22 ms | 32.3% bf16 MFU | 125573 tok/s step 12185/19560 | loss 3.348695 (-0.18z)| norm 0.2946 (+0.75z)| lr 1.99e-04 | 4159.90 ms | 32.5% bf16 MFU | 125596 tok/s step 12186/19560 | loss 3.322971 (-0.88z)| norm 0.2693 (-0.91z)| lr 1.99e-04 | 4183.66 ms | 32.3% bf16 MFU | 125582 tok/s step 12187/19560 | loss 3.390701 (+1.05z)| norm 0.2824 (-0.04z)| lr 1.99e-04 | 4171.54 ms | 32.4% bf16 MFU | 125587 tok/s step 12188/19560 | loss 3.328416 (-0.74z)| norm 0.2595 (-1.52z)| lr 1.99e-04 | 4169.40 ms | 32.4% bf16 MFU | 125595 tok/s step 12189/19560 | loss 3.418122 (+1.93z)| norm 0.2612 (-1.39z)| lr 1.99e-04 | 
4172.51 ms | 32.4% bf16 MFU | 125598 tok/s step 12190/19560 | loss 3.294732 (-1.70z)| norm 0.2812 (-0.11z)| lr 1.99e-04 | 4170.15 ms | 32.4% bf16 MFU | 125604 tok/s step 12191/19560 | loss 3.421303 (+1.98z)| norm 0.2902 (+0.47z)| lr 1.99e-04 | 4163.67 ms | 32.4% bf16 MFU | 125620 tok/s step 12192/19560 | loss 3.361994 (+0.26z)| norm 0.2729 (-0.65z)| lr 1.99e-04 | 4166.68 ms | 32.4% bf16 MFU | 125630 tok/s step 12193/19560 | loss 3.329724 (-0.67z)| norm 0.2690 (-0.89z)| lr 1.99e-04 | 4165.06 ms | 32.4% bf16 MFU | 125643 tok/s step 12194/19560 | loss 3.484742 (+3.58z)| norm 0.2848 (+0.13z)| lr 1.99e-04 | 4178.83 ms | 32.3% bf16 MFU | 125634 tok/s step 12195/19560 | loss 3.415662 (+1.65z)| norm 0.2759 (-0.44z)| lr 1.99e-04 | 4171.88 ms | 32.4% bf16 MFU | 125636 tok/s step 12196/19560 | loss 3.362025 (+0.20z)| norm 0.2655 (-1.12z)| lr 1.99e-04 | 4173.36 ms | 32.4% bf16 MFU | 125635 tok/s step 12197/19560 | loss 3.332631 (-0.62z)| norm 0.2936 (+0.69z)| lr 1.99e-04 | 4165.49 ms | 32.4% bf16 MFU | 125647 tok/s step 12198/19560 | loss 3.336237 (-0.52z)| norm 0.2670 (-1.03z)| lr 1.99e-04 | 4164.30 ms | 32.4% bf16 MFU | 125659 tok/s step 12199/19560 | loss 3.330233 (-0.67z)| norm 0.2947 (+0.75z)| lr 1.99e-04 | 4162.63 ms | 32.4% bf16 MFU | 125674 tok/s step 12200/19560 | loss 3.364765 (+0.27z)| norm 0.2655 (-1.12z)| lr 1.99e-04 | 4170.63 ms | 32.4% bf16 MFU | 125676 tok/s step 12201/19560 | loss 3.296165 (-1.61z)| norm 0.2877 (+0.31z)| lr 1.99e-04 | 4170.22 ms | 32.4% bf16 MFU | 125678 tok/s step 12202/19560 | loss 3.306319 (-1.31z)| norm 0.2709 (-0.77z)| lr 1.99e-04 | 4163.52 ms | 32.4% bf16 MFU | 125690 tok/s step 12203/19560 | loss 3.317096 (-1.00z)| norm 0.2657 (-1.10z)| lr 1.99e-04 | 4171.01 ms | 32.4% bf16 MFU | 125691 tok/s step 12204/19560 | loss 3.356421 (+0.07z)| norm 0.2741 (-0.56z)| lr 1.98e-04 | 4169.48 ms | 32.4% bf16 MFU | 125693 tok/s step 12205/19560 | loss 3.326653 (-0.74z)| norm 0.2614 (-1.38z)| lr 1.98e-04 | 4171.29 ms | 32.4% bf16 MFU | 125693 tok/s step 
12206/19560 | loss 3.357865 (+0.12z)| norm 0.2694 (-0.85z)| lr 1.98e-04 | 4175.48 ms | 32.3% bf16 MFU | 125687 tok/s step 12207/19560 | loss 3.284815 (-1.85z)| norm 0.2741 (-0.54z)| lr 1.98e-04 | 4183.16 ms | 32.3% bf16 MFU | 125669 tok/s step 12208/19560 | loss 3.339690 (-0.35z)| norm 0.2724 (-0.64z)| lr 1.98e-04 | 4169.32 ms | 32.4% bf16 MFU | 125673 tok/s step 12209/19560 | loss 3.325821 (-0.72z)| norm 0.3026 (+1.28z)| lr 1.98e-04 | 4180.17 ms | 32.3% bf16 MFU | 125661 tok/s step 12210/19560 | loss 3.418918 (+1.79z)| norm 0.2911 (+0.53z)| lr 1.98e-04 | 4175.86 ms | 32.3% bf16 MFU | 125655 tok/s step 12211/19560 | loss 3.404778 (+1.39z)| norm 0.2799 (-0.17z)| lr 1.98e-04 | 4168.11 ms | 32.4% bf16 MFU | 125662 tok/s step 12212/19560 | loss 3.275728 (-2.02z)| norm 0.2733 (-0.60z)| lr 1.98e-04 | 4164.27 ms | 32.4% bf16 MFU | 125674 tok/s step 12213/19560 | loss 3.380139 (+0.74z)| norm 0.2623 (-1.31z)| lr 1.98e-04 | 4161.96 ms | 32.4% bf16 MFU | 125688 tok/s step 12214/19560 | loss 3.371981 (+0.52z)| norm 0.2670 (-1.00z)| lr 1.98e-04 | 4159.92 ms | 32.5% bf16 MFU | 125706 tok/s step 12215/19560 | loss 3.314391 (-1.00z)| norm 0.2686 (-0.90z)| lr 1.98e-04 | 4162.62 ms | 32.4% bf16 MFU | 125718 tok/s step 12216/19560 | loss 3.336552 (-0.41z)| norm 0.2765 (-0.36z)| lr 1.98e-04 | 4163.85 ms | 32.4% bf16 MFU | 125728 tok/s step 12217/19560 | loss 3.413666 (+1.61z)| norm 0.2758 (-0.41z)| lr 1.98e-04 | 4174.58 ms | 32.3% bf16 MFU | 125721 tok/s step 12218/19560 | loss 3.359458 (+0.19z)| norm 0.2945 (+0.84z)| lr 1.98e-04 | 4162.97 ms | 32.4% bf16 MFU | 125732 tok/s step 12219/19560 | loss 3.336971 (-0.39z)| norm 0.2825 (+0.02z)| lr 1.98e-04 | 4165.63 ms | 32.4% bf16 MFU | 125738 tok/s step 12220/19560 | loss 3.392189 (+1.05z)| norm 0.2912 (+0.62z)| lr 1.98e-04 | 4170.54 ms | 32.4% bf16 MFU | 125737 tok/s step 12221/19560 | loss 3.303693 (-1.25z)| norm 0.2771 (-0.36z)| lr 1.98e-04 | 4167.58 ms | 32.4% bf16 MFU | 125740 tok/s step 12222/19560 | loss 3.346116 (-0.14z)| norm 
0.2685 (-0.94z)| lr 1.98e-04 | 4160.18 ms | 32.5% bf16 MFU | 125755 tok/s step 12223/19560 | loss 3.371965 (+0.53z)| norm 0.2781 (-0.28z)| lr 1.98e-04 | 4175.11 ms | 32.3% bf16 MFU | 125746 tok/s step 12224/19560 | loss 3.361852 (+0.25z)| norm 0.2871 (+0.32z)| lr 1.98e-04 | 4173.89 ms | 32.3% bf16 MFU | 125739 tok/s step 12225/19560 | loss 3.305157 (-1.22z)| norm 0.2631 (-1.34z)| lr 1.97e-04 | 4163.91 ms | 32.4% bf16 MFU | 125748 tok/s step 12226/19560 | loss 3.365966 (+0.36z)| norm 0.3088 (+1.79z)| lr 1.97e-04 | 4159.42 ms | 32.5% bf16 MFU | 125763 tok/s step 12227/19560 | loss 3.378589 (+0.68z)| norm 0.2657 (-1.16z)| lr 1.97e-04 | 4238.24 ms | 31.9% bf16 MFU | 125660 tok/s step 12228/19560 | loss 3.307540 (-1.15z)| norm 0.2977 (+1.02z)| lr 1.97e-04 | 4166.64 ms | 32.4% bf16 MFU | 125668 tok/s step 12229/19560 | loss 3.347615 (-0.11z)| norm 0.2606 (-1.49z)| lr 1.97e-04 | 4162.33 ms | 32.4% bf16 MFU | 125683 tok/s step 12230/19560 | loss 3.365553 (+0.37z)| norm 0.2624 (-1.35z)| lr 1.97e-04 | 4163.27 ms | 32.4% bf16 MFU | 125695 tok/s step 12231/19560 | loss 3.296375 (-1.42z)| norm 0.3055 (+1.56z)| lr 1.97e-04 | 4175.12 ms | 32.3% bf16 MFU | 125689 tok/s step 12232/19560 | loss 3.330641 (-0.53z)| norm 0.2875 (+0.33z)| lr 1.97e-04 | 4161.14 ms | 32.4% bf16 MFU | 125705 tok/s step 12233/19560 | loss 3.309977 (-1.07z)| norm 0.2945 (+0.79z)| lr 1.97e-04 | 4171.71 ms | 32.4% bf16 MFU | 125703 tok/s step 12234/19560 | loss 3.324355 (-0.69z)| norm 0.2753 (-0.50z)| lr 1.97e-04 | 4169.36 ms | 32.4% bf16 MFU | 125705 tok/s step 12235/19560 | loss 3.352205 (+0.05z)| norm 0.2913 (+0.57z)| lr 1.97e-04 | 4183.89 ms | 32.3% bf16 MFU | 125686 tok/s step 12236/19560 | loss 3.335207 (-0.40z)| norm 0.2925 (+0.65z)| lr 1.97e-04 | 4170.87 ms | 32.4% bf16 MFU | 125686 tok/s step 12237/19560 | loss 3.362303 (+0.32z)| norm 0.2790 (-0.28z)| lr 1.97e-04 | 4164.07 ms | 32.4% bf16 MFU | 125698 tok/s step 12238/19560 | loss 3.284405 (-1.70z)| norm 0.2853 (+0.16z)| lr 1.97e-04 | 4169.76 ms | 
32.4% bf16 MFU | 125699 tok/s step 12239/19560 | loss 3.390084 (+1.06z)| norm 0.2789 (-0.28z)| lr 1.97e-04 | 4158.31 ms | 32.5% bf16 MFU | 125719 tok/s step 12240/19560 | loss 3.379469 (+0.81z)| norm 0.2708 (-0.83z)| lr 1.97e-04 | 4151.72 ms | 32.5% bf16 MFU | 125747 tok/s step 12241/19560 | loss 3.300496 (-1.27z)| norm 0.2802 (-0.16z)| lr 1.97e-04 | 4176.50 ms | 32.3% bf16 MFU | 125736 tok/s step 12242/19560 | loss 3.315491 (-0.86z)| norm 0.2887 (+0.44z)| lr 1.97e-04 | 4159.51 ms | 32.5% bf16 MFU | 125752 tok/s step 12243/19560 | loss 3.322233 (-0.68z)| norm 0.2791 (-0.22z)| lr 1.97e-04 | 4170.08 ms | 32.4% bf16 MFU | 125750 tok/s step 12244/19560 | loss 3.333779 (-0.37z)| norm 0.2916 (+0.70z)| lr 1.97e-04 | 4167.54 ms | 32.4% bf16 MFU | 125753 tok/s step 12245/19560 | loss 3.299243 (-1.26z)| norm 0.2688 (-0.98z)| lr 1.97e-04 | 4175.56 ms | 32.3% bf16 MFU | 125743 tok/s step 12246/19560 | loss 3.283646 (-1.64z)| norm 0.3057 (+1.76z)| lr 1.96e-04 | 4193.03 ms | 32.2% bf16 MFU | 125708 tok/s step 12247/19560 | loss 3.339196 (-0.19z)| norm 0.2816 (-0.03z)| lr 1.96e-04 | 4170.36 ms | 32.4% bf16 MFU | 125709 tok/s step 12248/19560 | loss 3.341097 (-0.15z)| norm 0.3020 (+1.46z)| lr 1.96e-04 | 4157.04 ms | 32.5% bf16 MFU | 125729 tok/s step 12249/19560 | loss 3.314155 (-0.84z)| norm 0.3086 (+1.94z)| lr 1.96e-04 | 4187.31 ms | 32.2% bf16 MFU | 125703 tok/s step 12250/19560 | loss 3.398015 (+1.35z)| norm 0.2908 (+0.69z)| lr 1.96e-04 | 4220.01 ms | 32.0% bf16 MFU | 125630 tok/s val loss 3.339708 evaluating HellaSwag: 0/1256 evaluating HellaSwag: 10/1256 evaluating HellaSwag: 20/1256 evaluating HellaSwag: 30/1256 evaluating HellaSwag: 40/1256 evaluating HellaSwag: 50/1256 evaluating HellaSwag: 60/1256 evaluating HellaSwag: 70/1256 evaluating HellaSwag: 80/1256 evaluating HellaSwag: 90/1256 evaluating HellaSwag: 100/1256 evaluating HellaSwag: 110/1256 evaluating HellaSwag: 120/1256 evaluating HellaSwag: 130/1256 evaluating HellaSwag: 140/1256 evaluating HellaSwag: 150/1256 
evaluating HellaSwag: 160/1256 evaluating HellaSwag: 170/1256 evaluating HellaSwag: 180/1256 evaluating HellaSwag: 190/1256 evaluating HellaSwag: 200/1256 evaluating HellaSwag: 210/1256 evaluating HellaSwag: 220/1256 evaluating HellaSwag: 230/1256 evaluating HellaSwag: 240/1256 evaluating HellaSwag: 250/1256 evaluating HellaSwag: 260/1256 evaluating HellaSwag: 270/1256 evaluating HellaSwag: 280/1256 evaluating HellaSwag: 290/1256 evaluating HellaSwag: 300/1256 evaluating HellaSwag: 310/1256 evaluating HellaSwag: 320/1256 evaluating HellaSwag: 330/1256 evaluating HellaSwag: 340/1256 evaluating HellaSwag: 350/1256 evaluating HellaSwag: 360/1256 evaluating HellaSwag: 370/1256 evaluating HellaSwag: 380/1256 evaluating HellaSwag: 390/1256 evaluating HellaSwag: 400/1256 evaluating HellaSwag: 410/1256 evaluating HellaSwag: 420/1256 evaluating HellaSwag: 430/1256 evaluating HellaSwag: 440/1256 evaluating HellaSwag: 450/1256 evaluating HellaSwag: 460/1256 evaluating HellaSwag: 470/1256 evaluating HellaSwag: 480/1256 evaluating HellaSwag: 490/1256 evaluating HellaSwag: 500/1256 evaluating HellaSwag: 510/1256 evaluating HellaSwag: 520/1256 evaluating HellaSwag: 530/1256 evaluating HellaSwag: 540/1256 evaluating HellaSwag: 550/1256 evaluating HellaSwag: 560/1256 evaluating HellaSwag: 570/1256 evaluating HellaSwag: 580/1256 evaluating HellaSwag: 590/1256 evaluating HellaSwag: 600/1256 evaluating HellaSwag: 610/1256 evaluating HellaSwag: 620/1256 evaluating HellaSwag: 630/1256 evaluating HellaSwag: 640/1256 evaluating HellaSwag: 650/1256 evaluating HellaSwag: 660/1256 evaluating HellaSwag: 670/1256 evaluating HellaSwag: 680/1256 evaluating HellaSwag: 690/1256 evaluating HellaSwag: 700/1256 evaluating HellaSwag: 710/1256 evaluating HellaSwag: 720/1256 evaluating HellaSwag: 730/1256 evaluating HellaSwag: 740/1256 evaluating HellaSwag: 750/1256 evaluating HellaSwag: 760/1256 evaluating HellaSwag: 770/1256 evaluating HellaSwag: 780/1256 evaluating HellaSwag: 790/1256 evaluating 
HellaSwag: 800/1256 evaluating HellaSwag: 810/1256 evaluating HellaSwag: 820/1256 evaluating HellaSwag: 830/1256 evaluating HellaSwag: 840/1256 evaluating HellaSwag: 850/1256 evaluating HellaSwag: 860/1256 evaluating HellaSwag: 870/1256 evaluating HellaSwag: 880/1256 evaluating HellaSwag: 890/1256 evaluating HellaSwag: 900/1256 evaluating HellaSwag: 910/1256 evaluating HellaSwag: 920/1256 evaluating HellaSwag: 930/1256 evaluating HellaSwag: 940/1256 evaluating HellaSwag: 950/1256 evaluating HellaSwag: 960/1256 evaluating HellaSwag: 970/1256 evaluating HellaSwag: 980/1256 evaluating HellaSwag: 990/1256 evaluating HellaSwag: 1000/1256 evaluating HellaSwag: 1010/1256 evaluating HellaSwag: 1020/1256 evaluating HellaSwag: 1030/1256 evaluating HellaSwag: 1040/1256 evaluating HellaSwag: 1050/1256 evaluating HellaSwag: 1060/1256 evaluating HellaSwag: 1070/1256 evaluating HellaSwag: 1080/1256 evaluating HellaSwag: 1090/1256 evaluating HellaSwag: 1100/1256 evaluating HellaSwag: 1110/1256 evaluating HellaSwag: 1120/1256 evaluating HellaSwag: 1130/1256 evaluating HellaSwag: 1140/1256 evaluating HellaSwag: 1150/1256 evaluating HellaSwag: 1160/1256 evaluating HellaSwag: 1170/1256 evaluating HellaSwag: 1180/1256 evaluating HellaSwag: 1190/1256 evaluating HellaSwag: 1200/1256 evaluating HellaSwag: 1210/1256 evaluating HellaSwag: 1220/1256 evaluating HellaSwag: 1230/1256 evaluating HellaSwag: 1240/1256 evaluating HellaSwag: 1250/1256 HellaSwag: 2975/10042 = 0.296256 step 12251/19560 | loss 3.325054 (-0.54z)| norm 0.2889 (+0.58z)| lr 1.96e-04 | 4185.25 ms | 32.3% bf16 MFU | 125612 tok/s step 12252/19560 | loss 3.320665 (-0.65z)| norm 0.2897 (+0.64z)| lr 1.96e-04 | 4165.45 ms | 32.4% bf16 MFU | 125625 tok/s step 12253/19560 | loss 3.344609 (-0.03z)| norm 0.2750 (-0.52z)| lr 1.96e-04 | 4161.84 ms | 32.4% bf16 MFU | 125642 tok/s step 12254/19560 | loss 3.332121 (-0.35z)| norm 0.2948 (+1.06z)| lr 1.96e-04 | 4174.47 ms | 32.3% bf16 MFU | 125640 tok/s step 12255/19560 | loss 3.334784 
(-0.28z)| norm 0.2741 (-0.59z)| lr 1.96e-04 | 4153.41 ms | 32.5% bf16 MFU | 125669 tok/s step 12256/19560 | loss 3.380656 (+0.94z)| norm 0.2810 (-0.03z)| lr 1.96e-04 | 4416.62 ms | 30.6% bf16 MFU | 125321 tok/s step 12257/19560 | loss 3.377026 (+0.84z)| norm 0.2860 (+0.36z)| lr 1.96e-04 | 4162.92 ms | 32.4% bf16 MFU | 125352 tok/s step 12258/19560 | loss 3.327972 (-0.47z)| norm 0.2789 (-0.20z)| lr 1.96e-04 | 4209.98 ms | 32.1% bf16 MFU | 125311 tok/s step 12259/19560 | loss 3.368150 (+0.60z)| norm 0.2612 (-1.62z)| lr 1.96e-04 | 4172.17 ms | 32.4% bf16 MFU | 125329 tok/s step 12260/19560 | loss 3.370608 (+0.65z)| norm 0.2814 (+0.02z)| lr 1.96e-04 | 4345.37 ms | 31.1% bf16 MFU | 125095 tok/s step 12261/19560 | loss 3.390166 (+1.16z)| norm 0.2649 (-1.30z)| lr 1.96e-04 | 4165.60 ms | 32.4% bf16 MFU | 125134 tok/s step 12262/19560 | loss 3.348165 (+0.04z)| norm 0.2834 (+0.19z)| lr 1.96e-04 | 4170.55 ms | 32.4% bf16 MFU | 125162 tok/s step 12263/19560 | loss 3.325431 (-0.56z)| norm 0.2572 (-1.90z)| lr 1.96e-04 | 4169.08 ms | 32.4% bf16 MFU | 125192 tok/s step 12264/19560 | loss 3.319284 (-0.72z)| norm 0.2692 (-0.93z)| lr 1.96e-04 | 6045.84 ms | 22.3% bf16 MFU | 123268 tok/s step 12265/19560 | loss 3.286367 (-1.59z)| norm 0.2692 (-0.94z)| lr 1.96e-04 | 4157.04 ms | 32.5% bf16 MFU | 123411 tok/s step 12266/19560 | loss 3.352678 (+0.18z)| norm 0.2551 (-2.04z)| lr 1.96e-04 | 4158.16 ms | 32.5% bf16 MFU | 123545 tok/s step 12267/19560 | loss 3.411757 (+1.71z)| norm 0.2543 (-2.10z)| lr 1.95e-04 | 4147.49 ms | 32.6% bf16 MFU | 123688 tok/s step 12268/19560 | loss 3.346412 (-0.01z)| norm 0.2649 (-1.24z)| lr 1.95e-04 | 4151.74 ms | 32.5% bf16 MFU | 123818 tok/s step 12269/19560 | loss 3.336724 (-0.26z)| norm 0.2503 (-2.34z)| lr 1.95e-04 | 4160.87 ms | 32.4% bf16 MFU | 123927 tok/s step 12270/19560 | loss 3.351231 (+0.13z)| norm 0.2499 (-2.30z)| lr 1.95e-04 | 4168.45 ms | 32.4% bf16 MFU | 124020 tok/s step 12271/19560 | loss 3.344744 (-0.06z)| norm 0.2605 (-1.47z)| lr 1.95e-04 | 
4154.73 ms | 32.5% bf16 MFU | 124128 tok/s step 12272/19560 | loss 3.298115 (-1.30z)| norm 0.2708 (-0.69z)| lr 1.95e-04 | 4162.09 ms | 32.4% bf16 MFU | 124220 tok/s step 12273/19560 | loss 3.385449 (+1.02z)| norm 0.2748 (-0.37z)| lr 1.95e-04 | 4163.00 ms | 32.4% bf16 MFU | 124306 tok/s step 12274/19560 | loss 3.289753 (-1.50z)| norm 0.2692 (-0.79z)| lr 1.95e-04 | 4174.60 ms | 32.3% bf16 MFU | 124370 tok/s step 12275/19560 | loss 3.310922 (-0.93z)| norm 0.2775 (-0.15z)| lr 1.95e-04 | 4180.21 ms | 32.3% bf16 MFU | 124423 tok/s step 12276/19560 | loss 3.298955 (-1.23z)| norm 0.2851 (+0.43z)| lr 1.95e-04 | 4179.33 ms | 32.3% bf16 MFU | 124474 tok/s step 12277/19560 | loss 3.364667 (+0.48z)| norm 0.2816 (+0.16z)| lr 1.95e-04 | 4167.22 ms | 32.4% bf16 MFU | 124541 tok/s step 12278/19560 | loss 3.353691 (+0.21z)| norm 0.2765 (-0.23z)| lr 1.95e-04 | 4155.76 ms | 32.5% bf16 MFU | 124622 tok/s step 12279/19560 | loss 3.357339 (+0.29z)| norm 0.2932 (+1.06z)| lr 1.95e-04 | 4165.99 ms | 32.4% bf16 MFU | 124683 tok/s step 12280/19560 | loss 3.324492 (-0.58z)| norm 0.2951 (+1.19z)| lr 1.95e-04 | 4154.39 ms | 32.5% bf16 MFU | 124759 tok/s step 12281/19560 | loss 3.335318 (-0.28z)| norm 0.2713 (-0.62z)| lr 1.95e-04 | 4149.16 ms | 32.5% bf16 MFU | 124839 tok/s step 12282/19560 | loss 3.323316 (-0.60z)| norm 0.2945 (+1.19z)| lr 1.95e-04 | 4163.06 ms | 32.4% bf16 MFU | 124894 tok/s step 12283/19560 | loss 3.356664 (+0.29z)| norm 0.2787 (-0.03z)| lr 1.95e-04 | 4166.95 ms | 32.4% bf16 MFU | 124940 tok/s step 12284/19560 | loss 3.369411 (+0.62z)| norm 0.2761 (-0.24z)| lr 1.95e-04 | 4159.13 ms | 32.5% bf16 MFU | 124996 tok/s step 12285/19560 | loss 3.329634 (-0.45z)| norm 0.2828 (+0.30z)| lr 1.95e-04 | 4165.67 ms | 32.4% bf16 MFU | 125039 tok/s step 12286/19560 | loss 3.286720 (-1.58z)| norm 0.2984 (+1.55z)| lr 1.95e-04 | 4168.77 ms | 32.4% bf16 MFU | 125076 tok/s step 12287/19560 | loss 3.357189 (+0.29z)| norm 0.2867 (+0.61z)| lr 1.95e-04 | 4163.13 ms | 32.4% bf16 MFU | 125119 tok/s step 
12288/19560 | loss 3.371025 (+0.67z)| norm 0.2926 (+1.08z)| lr 1.95e-04 | 4155.88 ms | 32.5% bf16 MFU | 125171 tok/s step 12289/19560 | loss 3.306333 (-1.07z)| norm 0.2514 (-2.14z)| lr 1.94e-04 | 4156.31 ms | 32.5% bf16 MFU | 125219 tok/s step 12290/19560 | loss 3.313933 (-0.86z)| norm 0.3099 (+2.36z)| lr 1.94e-04 | 4162.95 ms | 32.4% bf16 MFU | 125255 tok/s step 12291/19560 | loss 3.336939 (-0.24z)| norm 0.2714 (-0.58z)| lr 1.94e-04 | 4157.57 ms | 32.5% bf16 MFU | 125298 tok/s step 12292/19560 | loss 3.343555 (-0.05z)| norm 0.2724 (-0.48z)| lr 1.94e-04 | 4159.56 ms | 32.5% bf16 MFU | 125335 tok/s step 12293/19560 | loss 3.371315 (+0.69z)| norm 0.2800 (+0.09z)| lr 1.94e-04 | 4161.72 ms | 32.4% bf16 MFU | 125367 tok/s step 12294/19560 | loss 3.383432 (+1.02z)| norm 0.2838 (+0.38z)| lr 1.94e-04 | 4155.57 ms | 32.5% bf16 MFU | 125407 tok/s step 12295/19560 | loss 3.371298 (+0.68z)| norm 0.2862 (+0.56z)| lr 1.94e-04 | 4176.90 ms | 32.3% bf16 MFU | 125413 tok/s step 12296/19560 | loss 3.299856 (-1.26z)| norm 0.2631 (-1.19z)| lr 1.94e-04 | 4157.60 ms | 32.5% bf16 MFU | 125447 tok/s step 12297/19560 | loss 3.400805 (+1.47z)| norm 0.2935 (+1.11z)| lr 1.94e-04 | 4163.46 ms | 32.4% bf16 MFU | 125471 tok/s step 12298/19560 | loss 3.316282 (-0.80z)| norm 0.2617 (-1.28z)| lr 1.94e-04 | 4176.02 ms | 32.3% bf16 MFU | 125475 tok/s step 12299/19560 | loss 3.359441 (+0.37z)| norm 0.2795 (+0.05z)| lr 1.94e-04 | 4158.20 ms | 32.5% bf16 MFU | 125506 tok/s step 12300/19560 | loss 3.394856 (+1.36z)| norm 0.2853 (+0.48z)| lr 1.94e-04 | 4153.92 ms | 32.5% bf16 MFU | 125541 tok/s step 12301/19560 | loss 3.358681 (+0.35z)| norm 0.2778 (-0.08z)| lr 1.94e-04 | 4163.18 ms | 32.4% bf16 MFU | 125561 tok/s step 12302/19560 | loss 3.303106 (-1.19z)| norm 0.2778 (-0.07z)| lr 1.94e-04 | 4162.84 ms | 32.4% bf16 MFU | 125580 tok/s step 12303/19560 | loss 3.284511 (-1.68z)| norm 0.2649 (-1.06z)| lr 1.94e-04 | 4157.47 ms | 32.5% bf16 MFU | 125606 tok/s step 12304/19560 | loss 3.333378 (-0.31z)| norm 
0.2665 (-0.93z)| lr 1.94e-04 | 4164.67 ms | 32.4% bf16 MFU | 125620 tok/s step 12305/19560 | loss 3.338216 (-0.18z)| norm 0.2788 (+0.02z)| lr 1.94e-04 | 4162.64 ms | 32.4% bf16 MFU | 125637 tok/s step 12306/19560 | loss 3.267036 (-2.11z)| norm 0.2741 (-0.34z)| lr 1.94e-04 | 4165.62 ms | 32.4% bf16 MFU | 125648 tok/s step 12307/19560 | loss 3.369772 (+0.70z)| norm 0.2859 (+0.57z)| lr 1.94e-04 | 4168.19 ms | 32.4% bf16 MFU | 125655 tok/s step 12308/19560 | loss 3.352325 (+0.22z)| norm 0.2884 (+0.76z)| lr 1.94e-04 | 4160.76 ms | 32.5% bf16 MFU | 125673 tok/s step 12309/19560 | loss 3.311036 (-0.89z)| norm 0.2777 (-0.06z)| lr 1.94e-04 | 4156.46 ms | 32.5% bf16 MFU | 125696 tok/s step 12310/19560 | loss 3.294406 (-1.34z)| norm 0.2682 (-0.78z)| lr 1.93e-04 | 4189.65 ms | 32.2% bf16 MFU | 125668 tok/s step 12311/19560 | loss 3.320288 (-0.62z)| norm 0.2747 (-0.28z)| lr 1.93e-04 | 4169.34 ms | 32.4% bf16 MFU | 125672 tok/s step 12312/19560 | loss 3.396469 (+1.43z)| norm 0.2717 (-0.51z)| lr 1.93e-04 | 4270.35 ms | 31.6% bf16 MFU | 125527 tok/s step 12313/19560 | loss 3.343765 (+0.01z)| norm 0.2725 (-0.45z)| lr 1.93e-04 | 4287.38 ms | 31.5% bf16 MFU | 125365 tok/s step 12314/19560 | loss 3.384270 (+1.09z)| norm 0.2768 (-0.11z)| lr 1.93e-04 | 4305.88 ms | 31.4% bf16 MFU | 125185 tok/s step 12315/19560 | loss 3.358684 (+0.41z)| norm 0.2941 (+1.21z)| lr 1.93e-04 | 4246.33 ms | 31.8% bf16 MFU | 125099 tok/s step 12316/19560 | loss 3.326084 (-0.47z)| norm 0.2787 (+0.02z)| lr 1.93e-04 | 4170.26 ms | 32.4% bf16 MFU | 125130 tok/s step 12317/19560 | loss 3.369720 (+0.73z)| norm 0.3281 (+3.65z)| lr 1.93e-04 | 4306.67 ms | 31.4% bf16 MFU | 124961 tok/s step 12318/19560 | loss 3.342108 (-0.04z)| norm 0.2784 (-0.05z)| lr 1.93e-04 | 4224.04 ms | 32.0% bf16 MFU | 124918 tok/s step 12319/19560 | loss 3.340464 (-0.07z)| norm 0.3137 (+2.50z)| lr 1.93e-04 | 4190.04 ms | 32.2% bf16 MFU | 124929 tok/s step 12320/19560 | loss 3.407850 (+1.80z)| norm 0.2877 (+0.61z)| lr 1.93e-04 | 4166.49 ms | 
32.4% bf16 MFU | 124974 tok/s step 12321/19560 | loss 3.397194 (+1.48z)| norm 0.2817 (+0.17z)| lr 1.93e-04 | 4185.18 ms | 32.3% bf16 MFU | 124989 tok/s step 12322/19560 | loss 3.334762 (-0.23z)| norm 0.3070 (+1.96z)| lr 1.93e-04 | 4172.54 ms | 32.4% bf16 MFU | 125022 tok/s step 12323/19560 | loss 3.367560 (+0.76z)| norm 0.2758 (-0.27z)| lr 1.93e-04 | 4158.07 ms | 32.5% bf16 MFU | 125076 tok/s step 12324/19560 | loss 3.337416 (-0.14z)| norm 0.2973 (+1.24z)| lr 1.93e-04 | 4171.26 ms | 32.4% bf16 MFU | 125106 tok/s step 12325/19560 | loss 3.374917 (+0.97z)| norm 0.2967 (+1.20z)| lr 1.93e-04 | 4162.38 ms | 32.4% bf16 MFU | 125149 tok/s step 12326/19560 | loss 3.415842 (+2.14z)| norm 0.3018 (+1.54z)| lr 1.93e-04 | 4181.15 ms | 32.3% bf16 MFU | 125161 tok/s step 12327/19560 | loss 3.303197 (-1.16z)| norm 0.2891 (+0.64z)| lr 1.93e-04 | 4168.62 ms | 32.4% bf16 MFU | 125192 tok/s step 12328/19560 | loss 3.363319 (+0.60z)| norm 0.3021 (+1.54z)| lr 1.93e-04 | 4175.56 ms | 32.3% bf16 MFU | 125210 tok/s step 12329/19560 | loss 3.400238 (+1.65z)| norm 0.3120 (+2.18z)| lr 1.93e-04 | 4175.05 ms | 32.3% bf16 MFU | 125228 tok/s step 12330/19560 | loss 3.411892 (+1.94z)| norm 0.2990 (+1.26z)| lr 1.93e-04 | 4198.81 ms | 32.2% bf16 MFU | 125210 tok/s step 12331/19560 | loss 3.312157 (-0.93z)| norm 0.3142 (+2.25z)| lr 1.93e-04 | 4164.78 ms | 32.4% bf16 MFU | 125244 tok/s step 12332/19560 | loss 3.447631 (+2.86z)| norm 0.2940 (+0.86z)| lr 1.92e-04 | 4177.88 ms | 32.3% bf16 MFU | 125256 tok/s step 12333/19560 | loss 3.355823 (+0.29z)| norm 0.2945 (+0.88z)| lr 1.92e-04 | 4162.63 ms | 32.4% bf16 MFU | 125291 tok/s step 12334/19560 | loss 3.361704 (+0.45z)| norm 0.2941 (+0.84z)| lr 1.92e-04 | 4162.39 ms | 32.4% bf16 MFU | 125325 tok/s step 12335/19560 | loss 3.352149 (+0.18z)| norm 0.3084 (+1.78z)| lr 1.92e-04 | 4168.15 ms | 32.4% bf16 MFU | 125348 tok/s step 12336/19560 | loss 3.371494 (+0.71z)| norm 0.2853 (+0.21z)| lr 1.92e-04 | 4162.81 ms | 32.4% bf16 MFU | 125377 tok/s step 12337/19560 
| loss 3.373056 (+0.74z)| norm 0.2625 (-1.30z)| lr 1.92e-04 | 4165.66 ms | 32.4% bf16 MFU | 125402 tok/s step 12338/19560 | loss 3.348861 (+0.08z)| norm 0.2755 (-0.42z)| lr 1.92e-04 | 4175.19 ms | 32.3% bf16 MFU | 125410 tok/s step 12339/19560 | loss 3.385277 (+1.13z)| norm 0.2788 (-0.20z)| lr 1.92e-04 | 4170.46 ms | 32.4% bf16 MFU | 125425 tok/s step 12340/19560 | loss 3.352718 (+0.18z)| norm 0.2768 (-0.33z)| lr 1.92e-04 | 4169.85 ms | 32.4% bf16 MFU | 125441 tok/s step 12341/19560 | loss 3.320968 (-0.73z)| norm 0.2803 (-0.10z)| lr 1.92e-04 | 4159.12 ms | 32.5% bf16 MFU | 125472 tok/s step 12342/19560 | loss 3.331297 (-0.42z)| norm 0.2843 (+0.16z)| lr 1.92e-04 | 4166.31 ms | 32.4% bf16 MFU | 125490 tok/s step 12343/19560 | loss 3.480607 (+3.71z)| norm 0.2796 (-0.17z)| lr 1.92e-04 | 4178.06 ms | 32.3% bf16 MFU | 125490 tok/s step 12344/19560 | loss 3.338948 (-0.22z)| norm 0.2817 (-0.03z)| lr 1.92e-04 | 4160.99 ms | 32.4% bf16 MFU | 125515 tok/s step 12345/19560 | loss 3.352159 (+0.16z)| norm 0.2728 (-0.64z)| lr 1.92e-04 | 4163.87 ms | 32.4% bf16 MFU | 125535 tok/s step 12346/19560 | loss 3.328034 (-0.51z)| norm 0.2773 (-0.32z)| lr 1.92e-04 | 4170.97 ms | 32.4% bf16 MFU | 125543 tok/s step 12347/19560 | loss 3.320369 (-0.72z)| norm 0.2670 (-1.02z)| lr 1.92e-04 | 4178.48 ms | 32.3% bf16 MFU | 125540 tok/s step 12348/19560 | loss 3.502114 (+4.09z)| norm 0.2992 (+1.18z)| lr 1.92e-04 | 4151.52 ms | 32.5% bf16 MFU | 125577 tok/s step 12349/19560 | loss 3.316819 (-0.80z)| norm 0.2724 (-0.65z)| lr 1.92e-04 | 4173.81 ms | 32.3% bf16 MFU | 125579 tok/s step 12350/19560 | loss 3.370752 (+0.62z)| norm 0.2685 (-0.91z)| lr 1.92e-04 | 4182.55 ms | 32.3% bf16 MFU | 125568 tok/s step 12351/19560 | loss 3.415918 (+1.78z)| norm 0.2768 (-0.34z)| lr 1.92e-04 | 4182.99 ms | 32.3% bf16 MFU | 125556 tok/s step 12352/19560 | loss 3.421365 (+1.89z)| norm 0.2879 (+0.41z)| lr 1.92e-04 | 4169.27 ms | 32.4% bf16 MFU | 125566 tok/s step 12353/19560 | loss 3.338345 (-0.26z)| norm 0.2835 (+0.10z)| 
lr 1.91e-04 | 4162.64 ms | 32.4% bf16 MFU | 125585 tok/s step 12354/19560 | loss 3.473025 (+3.08z)| norm 0.2853 (+0.24z)| lr 1.91e-04 | 4179.80 ms | 32.3% bf16 MFU | 125578 tok/s step 12355/19560 | loss 3.360038 (+0.27z)| norm 0.2890 (+0.48z)| lr 1.91e-04 | 4174.34 ms | 32.3% bf16 MFU | 125579 tok/s step 12356/19560 | loss 3.343826 (-0.14z)| norm 0.2888 (+0.48z)| lr 1.91e-04 | 4182.22 ms | 32.3% bf16 MFU | 125568 tok/s step 12357/19560 | loss 3.436759 (+2.13z)| norm 0.2802 (-0.14z)| lr 1.91e-04 | 4179.57 ms | 32.3% bf16 MFU | 125561 tok/s step 12358/19560 | loss 3.382094 (+0.78z)| norm 0.2903 (+0.57z)| lr 1.91e-04 | 4172.56 ms | 32.4% bf16 MFU | 125566 tok/s step 12359/19560 | loss 3.356835 (+0.15z)| norm 0.2765 (-0.40z)| lr 1.91e-04 | 4165.25 ms | 32.4% bf16 MFU | 125581 tok/s step 12360/19560 | loss 3.550769 (+4.50z)| norm 1.4789 (+11.17z)| lr 1.91e-04 | 4175.89 ms | 32.3% bf16 MFU | 125580 tok/s step 12361/19560 | loss 3.388850 (+0.81z)| norm 0.4537 (+1.50z)| lr 1.91e-04 | 4174.19 ms | 32.3% bf16 MFU | 125581 tok/s step 12362/19560 | loss 3.291403 (-1.38z)| norm 0.3757 (+0.77z)| lr 1.91e-04 | 4175.47 ms | 32.3% bf16 MFU | 125580 tok/s step 12363/19560 | loss 3.346374 (-0.14z)| norm 0.3340 (+0.37z)| lr 1.91e-04 | 4166.20 ms | 32.4% bf16 MFU | 125593 tok/s step 12364/19560 | loss 3.409948 (+1.27z)| norm 0.3330 (+0.36z)| lr 1.91e-04 | 4171.01 ms | 32.4% bf16 MFU | 125598 tok/s step 12365/19560 | loss 3.345073 (-0.18z)| norm 0.3219 (+0.25z)| lr 1.91e-04 | 4169.19 ms | 32.4% bf16 MFU | 125606 tok/s step 12366/19560 | loss 3.343983 (-0.22z)| norm 0.3011 (+0.06z)| lr 1.91e-04 | 4173.08 ms | 32.4% bf16 MFU | 125608 tok/s step 12367/19560 | loss 3.341673 (-0.26z)| norm 0.3144 (+0.18z)| lr 1.91e-04 | 4166.64 ms | 32.4% bf16 MFU | 125619 tok/s step 12368/19560 | loss 3.356476 (+0.08z)| norm 0.3034 (+0.08z)| lr 1.91e-04 | 4168.06 ms | 32.4% bf16 MFU | 125627 tok/s step 12369/19560 | loss 3.401981 (+1.10z)| norm 0.2811 (-0.13z)| lr 1.91e-04 | 4155.04 ms | 32.5% bf16 MFU | 
125655 tok/s step 12370/19560 | loss 3.343702 (-0.24z)| norm 0.2789 (-0.15z)| lr 1.91e-04 | 4162.27 ms | 32.4% bf16 MFU | 125670 tok/s step 12371/19560 | loss 3.398270 (+0.99z)| norm 0.3008 (+0.05z)| lr 1.91e-04 | 4205.72 ms | 32.1% bf16 MFU | 125620 tok/s step 12372/19560 | loss 3.348218 (-0.15z)| norm 0.2746 (-0.19z)| lr 1.91e-04 | 4167.70 ms | 32.4% bf16 MFU | 125629 tok/s step 12373/19560 | loss 3.389114 (+0.77z)| norm 0.2749 (-0.19z)| lr 1.91e-04 | 4162.59 ms | 32.4% bf16 MFU | 125645 tok/s step 12374/19560 | loss 3.356387 (+0.01z)| norm 0.2847 (-0.09z)| lr 1.91e-04 | 4182.81 ms | 32.3% bf16 MFU | 125630 tok/s step 12375/19560 | loss 3.349163 (-0.16z)| norm 0.2758 (-0.18z)| lr 1.90e-04 | 4175.14 ms | 32.3% bf16 MFU | 125627 tok/s step 12376/19560 | loss 3.337058 (-0.44z)| norm 0.2658 (-0.27z)| lr 1.90e-04 | 4166.42 ms | 32.4% bf16 MFU | 125637 tok/s step 12377/19560 | loss 3.370567 (+0.33z)| norm 0.2684 (-0.24z)| lr 1.90e-04 | 4184.21 ms | 32.3% bf16 MFU | 125621 tok/s step 12378/19560 | loss 3.338347 (-0.41z)| norm 0.2665 (-0.26z)| lr 1.90e-04 | 4156.19 ms | 32.5% bf16 MFU | 125647 tok/s step 12379/19560 | loss 3.358040 (+0.04z)| norm 0.2717 (-0.21z)| lr 1.90e-04 | 4161.10 ms | 32.4% bf16 MFU | 125664 tok/s step 12380/19560 | loss 3.418892 (+1.43z)| norm 0.2573 (-0.34z)| lr 1.90e-04 | 4167.87 ms | 32.4% bf16 MFU | 125671 tok/s step 12381/19560 | loss 3.339381 (-0.41z)| norm 0.2871 (-0.06z)| lr 1.90e-04 | 4196.48 ms | 32.2% bf16 MFU | 125634 tok/s step 12382/19560 | loss 3.299759 (-1.31z)| norm 0.2692 (-0.23z)| lr 1.90e-04 | 4169.77 ms | 32.4% bf16 MFU | 125639 tok/s step 12383/19560 | loss 3.346354 (-0.24z)| norm 0.2802 (-0.12z)| lr 1.90e-04 | 4176.07 ms | 32.3% bf16 MFU | 125634 tok/s step 12384/19560 | loss 3.435838 (+1.79z)| norm 0.2899 (-0.04z)| lr 1.90e-04 | 4171.54 ms | 32.4% bf16 MFU | 125637 tok/s step 12385/19560 | loss 3.383918 (+0.61z)| norm 0.3115 (+0.16z)| lr 1.90e-04 | 4168.54 ms | 32.4% bf16 MFU | 125644 tok/s step 12386/19560 | loss 3.354028 
(-0.08z)| norm 0.2888 (-0.05z)| lr 1.90e-04 | 4198.42 ms | 32.2% bf16 MFU | 125605 tok/s step 12387/19560 | loss 3.412349 (+1.24z)| norm 0.2867 (-0.07z)| lr 1.90e-04 | 4162.00 ms | 32.4% bf16 MFU | 125624 tok/s step 12388/19560 | loss 3.359213 (+0.03z)| norm 0.2888 (-0.05z)| lr 1.90e-04 | 4214.55 ms | 32.0% bf16 MFU | 125562 tok/s step 12389/19560 | loss 3.415273 (+1.30z)| norm 0.2865 (-0.07z)| lr 1.90e-04 | 4169.25 ms | 32.4% bf16 MFU | 125572 tok/s step 12390/19560 | loss 3.386351 (+0.63z)| norm 0.2865 (-0.07z)| lr 1.90e-04 | 4164.44 ms | 32.4% bf16 MFU | 125588 tok/s step 12391/19560 | loss 3.373956 (+0.35z)| norm 0.2680 (-0.25z)| lr 1.90e-04 | 4166.78 ms | 32.4% bf16 MFU | 125600 tok/s step 12392/19560 | loss 3.368338 (+0.21z)| norm 0.2786 (-0.15z)| lr 1.90e-04 | 4181.12 ms | 32.3% bf16 MFU | 125590 tok/s step 12393/19560 | loss 3.377239 (+0.40z)| norm 0.2799 (-0.14z)| lr 1.90e-04 | 4165.23 ms | 32.4% bf16 MFU | 125604 tok/s step 12394/19560 | loss 3.361465 (+0.04z)| norm 0.2633 (-0.29z)| lr 1.90e-04 | 4167.48 ms | 32.4% bf16 MFU | 125614 tok/s step 12395/19560 | loss 3.415674 (+1.28z)| norm 0.2684 (-0.25z)| lr 1.90e-04 | 4175.42 ms | 32.3% bf16 MFU | 125611 tok/s step 12396/19560 | loss 3.415857 (+1.26z)| norm 0.2889 (-0.06z)| lr 1.89e-04 | 4178.51 ms | 32.3% bf16 MFU | 125604 tok/s step 12397/19560 | loss 3.390387 (+0.67z)| norm 0.2868 (-0.08z)| lr 1.89e-04 | 4170.92 ms | 32.4% bf16 MFU | 125609 tok/s step 12398/19560 | loss 3.379712 (+0.42z)| norm 0.2664 (-0.27z)| lr 1.89e-04 | 4180.18 ms | 32.3% bf16 MFU | 125600 tok/s step 12399/19560 | loss 3.351812 (-0.21z)| norm 0.2956 (-0.00z)| lr 1.89e-04 | 4177.80 ms | 32.3% bf16 MFU | 125595 tok/s step 12400/19560 | loss 3.323738 (-0.86z)| norm 0.2712 (-0.23z)| lr 1.89e-04 | 4169.87 ms | 32.4% bf16 MFU | 125601 tok/s step 12401/19560 | loss 3.373990 (+0.29z)| norm 0.3026 (+0.06z)| lr 1.89e-04 | 4185.63 ms | 32.3% bf16 MFU | 125584 tok/s step 12402/19560 | loss 3.341176 (-0.47z)| norm 0.2792 (-0.16z)| lr 1.89e-04 | 
4175.67 ms | 32.3% bf16 MFU | 125583 tok/s step 12403/19560 | loss 3.414595 (+1.21z)| norm 0.2950 (-0.01z)| lr 1.89e-04 | 4172.44 ms | 32.4% bf16 MFU | 125587 tok/s step 12404/19560 | loss 3.377594 (+0.34z)| norm 0.2868 (-0.09z)| lr 1.89e-04 | 4195.08 ms | 32.2% bf16 MFU | 125556 tok/s step 12405/19560 | loss 3.383766 (+0.48z)| norm 0.3343 (+0.35z)| lr 1.89e-04 | 4151.66 ms | 32.5% bf16 MFU | 125593 tok/s step 12406/19560 | loss 3.377790 (+0.34z)| norm 0.2847 (-0.11z)| lr 1.89e-04 | 4162.66 ms | 32.4% bf16 MFU | 125610 tok/s step 12407/19560 | loss 3.362937 (-0.01z)| norm 0.3331 (+0.34z)| lr 1.89e-04 | 4183.15 ms | 32.3% bf16 MFU | 125597 tok/s step 12408/19560 | loss 3.373640 (+0.23z)| norm 0.2573 (-0.37z)| lr 1.89e-04 | 4167.30 ms | 32.4% bf16 MFU | 125607 tok/s step 12409/19560 | loss 3.360618 (-0.08z)| norm 0.3041 (+0.07z)| lr 1.89e-04 | 4176.21 ms | 32.3% bf16 MFU | 125604 tok/s step 12410/19560 | loss 3.373728 (+0.22z)| norm 0.2714 (-0.24z)| lr 1.89e-04 | 4169.38 ms | 32.4% bf16 MFU | 125611 tok/s step 12411/19560 | loss 3.457628 (+2.13z)| norm 0.2829 (-0.13z)| lr 1.89e-04 | 4161.04 ms | 32.4% bf16 MFU | 125631 tok/s step 12412/19560 | loss 3.306064 (-1.34z)| norm 0.2734 (-0.22z)| lr 1.89e-04 | 4155.27 ms | 32.5% bf16 MFU | 125658 tok/s step 12413/19560 | loss 3.339345 (-0.58z)| norm 0.2838 (-0.12z)| lr 1.89e-04 | 4257.69 ms | 31.7% bf16 MFU | 125532 tok/s step 12414/19560 | loss 3.359305 (-0.14z)| norm 0.2590 (-0.35z)| lr 1.89e-04 | 4161.71 ms | 32.4% bf16 MFU | 125554 tok/s step 12415/19560 | loss 3.340669 (-0.57z)| norm 0.2834 (-0.12z)| lr 1.89e-04 | 4239.71 ms | 31.8% bf16 MFU | 125459 tok/s step 12416/19560 | loss 3.316985 (-1.10z)| norm 0.2632 (-0.31z)| lr 1.89e-04 | 4162.37 ms | 32.4% bf16 MFU | 125484 tok/s step 12417/19560 | loss 3.348958 (-0.38z)| norm 0.2742 (-0.21z)| lr 1.89e-04 | 4155.92 ms | 32.5% bf16 MFU | 125518 tok/s step 12418/19560 | loss 3.350419 (-0.35z)| norm 0.2625 (-0.31z)| lr 1.88e-04 | 4156.40 ms | 32.5% bf16 MFU | 125549 tok/s step 
12419/19560 | loss 3.346854 (-0.44z)| norm 0.2597 (-0.34z)| lr 1.88e-04 | 4221.59 ms | 32.0% bf16 MFU | 125481 tok/s step 12420/19560 | loss 3.432464 (+1.54z)| norm 0.2698 (-0.24z)| lr 1.88e-04 | 4322.60 ms | 31.2% bf16 MFU | 125272 tok/s step 12421/19560 | loss 3.352687 (-0.31z)| norm 0.2510 (-0.42z)| lr 1.88e-04 | 4157.43 ms | 32.5% bf16 MFU | 125314 tok/s step 12422/19560 | loss 3.367423 (+0.04z)| norm 0.2701 (-0.24z)| lr 1.88e-04 | 4169.84 ms | 32.4% bf16 MFU | 125334 tok/s step 12423/19560 | loss 3.318510 (-1.08z)| norm 0.2572 (-0.35z)| lr 1.88e-04 | 4158.61 ms | 32.5% bf16 MFU | 125371 tok/s step 12424/19560 | loss 3.382091 (+0.37z)| norm 0.2825 (-0.12z)| lr 1.88e-04 | 4280.38 ms | 31.5% bf16 MFU | 125227 tok/s step 12425/19560 | loss 3.404304 (+0.89z)| norm 0.2444 (-0.47z)| lr 1.88e-04 | 4155.75 ms | 32.5% bf16 MFU | 125274 tok/s step 12426/19560 | loss 3.389827 (+0.54z)| norm 0.2725 (-0.21z)| lr 1.88e-04 | 4192.00 ms | 32.2% bf16 MFU | 125264 tok/s step 12427/19560 | loss 3.363808 (-0.07z)| norm 0.2528 (-0.39z)| lr 1.88e-04 | 4229.10 ms | 31.9% bf16 MFU | 125199 tok/s step 12428/19560 | loss 3.347858 (-0.43z)| norm 0.2723 (-0.21z)| lr 1.88e-04 | 4161.69 ms | 32.4% bf16 MFU | 125238 tok/s step 12429/19560 | loss 3.342042 (-0.57z)| norm 0.2600 (-0.32z)| lr 1.88e-04 | 4179.99 ms | 32.3% bf16 MFU | 125247 tok/s step 12430/19560 | loss 3.337111 (-0.69z)| norm 0.2701 (-0.23z)| lr 1.88e-04 | 4156.55 ms | 32.5% bf16 MFU | 125292 tok/s step 12431/19560 | loss 3.341077 (-0.62z)| norm 0.2784 (-0.15z)| lr 1.88e-04 | 4189.04 ms | 32.2% bf16 MFU | 125285 tok/s step 12432/19560 | loss 3.409438 (+1.00z)| norm 0.2590 (-0.33z)| lr 1.88e-04 | 4204.10 ms | 32.1% bf16 MFU | 125256 tok/s step 12433/19560 | loss 3.322972 (-1.06z)| norm 0.2782 (-0.15z)| lr 1.88e-04 | 4158.23 ms | 32.5% bf16 MFU | 125298 tok/s step 12434/19560 | loss 3.368690 (+0.01z)| norm 0.2747 (-0.19z)| lr 1.88e-04 | 4164.90 ms | 32.4% bf16 MFU | 125327 tok/s step 12435/19560 | loss 3.434580 (+1.58z)| norm 
0.2850 (-0.09z)| lr 1.88e-04 | 4157.05 ms | 32.5% bf16 MFU | 125367 tok/s step 12436/19560 | loss 3.373444 (+0.11z)| norm 0.2713 (-0.22z)| lr 1.88e-04 | 4206.80 ms | 32.1% bf16 MFU | 125330 tok/s step 12437/19560 | loss 3.345808 (-0.57z)| norm 0.2585 (-0.33z)| lr 1.88e-04 | 4196.82 ms | 32.2% bf16 MFU | 125309 tok/s step 12438/19560 | loss 3.408664 (+0.94z)| norm 0.2730 (-0.20z)| lr 1.88e-04 | 4209.23 ms | 32.1% bf16 MFU | 125272 tok/s step 12439/19560 | loss 3.341444 (-0.71z)| norm 0.2636 (-0.29z)| lr 1.87e-04 | 4173.44 ms | 32.4% bf16 MFU | 125289 tok/s step 12440/19560 | loss 3.385881 (+0.39z)| norm 0.2738 (-0.19z)| lr 1.87e-04 | 4165.90 ms | 32.4% bf16 MFU | 125318 tok/s step 12441/19560 | loss 3.345655 (-0.60z)| norm 0.2528 (-0.38z)| lr 1.87e-04 | 4160.56 ms | 32.5% bf16 MFU | 125352 tok/s step 12442/19560 | loss 3.392449 (+0.54z)| norm 0.3010 (+0.06z)| lr 1.87e-04 | 4200.17 ms | 32.1% bf16 MFU | 125326 tok/s step 12443/19560 | loss 3.415505 (+1.10z)| norm 0.2629 (-0.29z)| lr 1.87e-04 | 4159.73 ms | 32.5% bf16 MFU | 125362 tok/s step 12444/19560 | loss 3.356834 (-0.35z)| norm 0.2927 (-0.01z)| lr 1.87e-04 | 4289.77 ms | 31.5% bf16 MFU | 125205 tok/s step 12445/19560 | loss 3.302668 (-1.64z)| norm 0.2474 (-0.43z)| lr 1.87e-04 | 4166.67 ms | 32.4% bf16 MFU | 125236 tok/s step 12446/19560 | loss 3.380105 (+0.23z)| norm 0.2829 (-0.10z)| lr 1.87e-04 | 4205.91 ms | 32.1% bf16 MFU | 125207 tok/s step 12447/19560 | loss 3.350729 (-0.49z)| norm 0.2396 (-0.50z)| lr 1.87e-04 | 4222.19 ms | 32.0% bf16 MFU | 125155 tok/s step 12448/19560 | loss 3.453932 (+1.99z)| norm 0.2773 (-0.15z)| lr 1.87e-04 | 4199.24 ms | 32.2% bf16 MFU | 125140 tok/s step 12449/19560 | loss 3.353279 (-0.42z)| norm 0.2752 (-0.17z)| lr 1.87e-04 | 4251.88 ms | 31.8% bf16 MFU | 125048 tok/s step 12450/19560 | loss 3.354743 (-0.39z)| norm 0.2721 (-0.19z)| lr 1.87e-04 | 4161.04 ms | 32.4% bf16 MFU | 125096 tok/s step 12451/19560 | loss 3.386438 (+0.37z)| norm 0.2689 (-0.22z)| lr 1.87e-04 | 4164.20 ms | 
32.4% bf16 MFU | 125136 tok/s step 12452/19560 | loss 3.361833 (-0.23z)| norm 0.2865 (-0.06z)| lr 1.87e-04 | 4270.81 ms | 31.6% bf16 MFU | 125018 tok/s step 12453/19560 | loss 3.371365 (+0.00z)| norm 0.2886 (-0.04z)| lr 1.87e-04 | 4165.98 ms | 32.4% bf16 MFU | 125059 tok/s step 12454/19560 | loss 3.311279 (-1.43z)| norm 0.2784 (-0.13z)| lr 1.87e-04 | 4173.90 ms | 32.3% bf16 MFU | 125087 tok/s step 12455/19560 | loss 3.489382 (+2.77z)| norm 0.2755 (-0.16z)| lr 1.87e-04 | 4164.38 ms | 32.4% bf16 MFU | 125127 tok/s step 12456/19560 | loss 3.358179 (-0.32z)| norm 0.2968 (+0.04z)| lr 1.87e-04 | 4169.15 ms | 32.4% bf16 MFU | 125159 tok/s step 12457/19560 | loss 3.337654 (-0.79z)| norm 0.2618 (-0.28z)| lr 1.87e-04 | 4234.19 ms | 31.9% bf16 MFU | 125092 tok/s step 12458/19560 | loss 3.344864 (-0.61z)| norm 0.3061 (+0.13z)| lr 1.87e-04 | 4176.13 ms | 32.3% bf16 MFU | 125114 tok/s step 12459/19560 | loss 3.346543 (-0.58z)| norm 0.2617 (-0.27z)| lr 1.87e-04 | 4166.26 ms | 32.4% bf16 MFU | 125151 tok/s step 12460/19560 | loss 3.394993 (+0.58z)| norm 0.2923 (+0.01z)| lr 1.87e-04 | 4156.28 ms | 32.5% bf16 MFU | 125200 tok/s step 12461/19560 | loss 3.330771 (-0.95z)| norm 0.2797 (-0.11z)| lr 1.86e-04 | 4168.02 ms | 32.4% bf16 MFU | 125230 tok/s step 12462/19560 | loss 3.394344 (+0.56z)| norm 0.2720 (-0.18z)| lr 1.86e-04 | 4165.80 ms | 32.4% bf16 MFU | 125261 tok/s step 12463/19560 | loss 3.336465 (-0.82z)| norm 0.2832 (-0.07z)| lr 1.86e-04 | 4221.18 ms | 32.0% bf16 MFU | 125208 tok/s step 12464/19560 | loss 3.359018 (-0.28z)| norm 0.3045 (+0.12z)| lr 1.86e-04 | 4163.71 ms | 32.4% bf16 MFU | 125244 tok/s step 12465/19560 | loss 3.384280 (+0.33z)| norm 0.3045 (+0.12z)| lr 1.86e-04 | 4153.58 ms | 32.5% bf16 MFU | 125293 tok/s step 12466/19560 | loss 3.346471 (-0.58z)| norm 0.2934 (+0.02z)| lr 1.86e-04 | 4190.39 ms | 32.2% bf16 MFU | 125284 tok/s step 12467/19560 | loss 3.341116 (-0.70z)| norm 0.2879 (-0.03z)| lr 1.86e-04 | 4276.49 ms | 31.6% bf16 MFU | 125150 tok/s step 12468/19560 
| loss 3.377576 (+0.17z)| norm 0.2798 (-0.11z)| lr 1.86e-04 | 4181.00 ms | 32.3% bf16 MFU | 125162 tok/s step 12469/19560 | loss 3.393898 (+0.55z)| norm 0.2940 (+0.02z)| lr 1.86e-04 | 4228.37 ms | 31.9% bf16 MFU | 125104 tok/s step 12470/19560 | loss 3.363230 (-0.19z)| norm 0.3014 (+0.09z)| lr 1.86e-04 | 4200.12 ms | 32.1% bf16 MFU | 125090 tok/s step 12471/19560 | loss 3.328458 (-1.03z)| norm 0.2764 (-0.14z)| lr 1.86e-04 | 4171.30 ms | 32.4% bf16 MFU | 125120 tok/s step 12472/19560 | loss 3.332466 (-0.92z)| norm 0.3240 (+0.29z)| lr 1.86e-04 | 4166.29 ms | 32.4% bf16 MFU | 125156 tok/s step 12473/19560 | loss 3.350224 (-0.49z)| norm 0.2990 (+0.06z)| lr 1.86e-04 | 4169.74 ms | 32.4% bf16 MFU | 125185 tok/s step 12474/19560 | loss 3.287652 (-1.99z)| norm 0.3213 (+0.26z)| lr 1.86e-04 | 4162.53 ms | 32.4% bf16 MFU | 125223 tok/s step 12475/19560 | loss 3.372832 (+0.07z)| norm 0.2788 (-0.13z)| lr 1.86e-04 | 4207.22 ms | 32.1% bf16 MFU | 125193 tok/s step 12476/19560 | loss 3.342827 (-0.66z)| norm 0.2904 (-0.02z)| lr 1.86e-04 | 4158.50 ms | 32.5% bf16 MFU | 125237 tok/s step 12477/19560 | loss 3.337960 (-0.80z)| norm 0.2977 (+0.04z)| lr 1.86e-04 | 4159.78 ms | 32.5% bf16 MFU | 125277 tok/s step 12478/19560 | loss 3.366357 (-0.07z)| norm 0.2758 (-0.16z)| lr 1.86e-04 | 4186.73 ms | 32.2% bf16 MFU | 125275 tok/s step 12479/19560 | loss 3.348211 (-0.52z)| norm 0.2872 (-0.05z)| lr 1.86e-04 | 4158.09 ms | 32.5% bf16 MFU | 125315 tok/s step 12480/19560 | loss 3.358249 (-0.25z)| norm 0.2833 (-0.09z)| lr 1.86e-04 | 4191.78 ms | 32.2% bf16 MFU | 125303 tok/s step 12481/19560 | loss 3.316154 (-1.33z)| norm 0.2740 (-0.18z)| lr 1.86e-04 | 4167.29 ms | 32.4% bf16 MFU | 125329 tok/s step 12482/19560 | loss 3.354331 (-0.33z)| norm 0.2749 (-0.17z)| lr 1.85e-04 | 4213.01 ms | 32.0% bf16 MFU | 125284 tok/s step 12483/19560 | loss 3.345857 (-0.55z)| norm 0.2862 (-0.06z)| lr 1.85e-04 | 4165.39 ms | 32.4% bf16 MFU | 125314 tok/s step 12484/19560 | loss 3.362850 (-0.11z)| norm 0.2654 (-0.25z)| 
lr 1.85e-04 | 4190.03 ms | 32.2% bf16 MFU | 125304 tok/s step 12485/19560 | loss 3.330498 (-0.95z)| norm 0.2663 (-0.24z)| lr 1.85e-04 | 4160.85 ms | 32.4% bf16 MFU | 125339 tok/s step 12486/19560 | loss 3.384217 (+0.49z)| norm 0.2706 (-0.20z)| lr 1.85e-04 | 4292.90 ms | 31.5% bf16 MFU | 125179 tok/s step 12487/19560 | loss 3.349647 (-0.44z)| norm 0.2672 (-0.23z)| lr 1.85e-04 | 4177.11 ms | 32.3% bf16 MFU | 125196 tok/s step 12488/19560 | loss 3.436974 (+2.10z)| norm 0.2791 (-0.16z)| lr 1.85e-04 | 4166.96 ms | 32.4% bf16 MFU | 125227 tok/s step 12489/19560 | loss 3.391541 (+0.77z)| norm 0.2740 (-0.38z)| lr 1.85e-04 | 4166.90 ms | 32.4% bf16 MFU | 125257 tok/s step 12490/19560 | loss 3.390680 (+0.74z)| norm 0.2651 (-0.86z)| lr 1.85e-04 | 4171.52 ms | 32.4% bf16 MFU | 125278 tok/s step 12491/19560 | loss 3.392557 (+0.78z)| norm 0.2759 (-0.25z)| lr 1.85e-04 | 4174.28 ms | 32.3% bf16 MFU | 125294 tok/s step 12492/19560 | loss 3.313139 (-1.55z)| norm 0.2742 (-0.33z)| lr 1.85e-04 | 4165.13 ms | 32.4% bf16 MFU | 125323 tok/s step 12493/19560 | loss 3.328017 (-1.11z)| norm 0.2685 (-0.65z)| lr 1.85e-04 | 4163.24 ms | 32.4% bf16 MFU | 125354 tok/s step 12494/19560 | loss 3.353604 (-0.35z)| norm 0.2653 (-0.83z)| lr 1.85e-04 | 4149.52 ms | 32.5% bf16 MFU | 125403 tok/s step 12495/19560 | loss 3.364232 (-0.04z)| norm 0.2605 (-1.11z)| lr 1.85e-04 | 4170.73 ms | 32.4% bf16 MFU | 125418 tok/s step 12496/19560 | loss 3.395541 (+0.87z)| norm 0.2873 (+0.53z)| lr 1.85e-04 | 4175.08 ms | 32.3% bf16 MFU | 125426 tok/s step 12497/19560 | loss 3.347871 (-0.53z)| norm 0.2917 (+0.80z)| lr 1.85e-04 | 4182.74 ms | 32.3% bf16 MFU | 125422 tok/s step 12498/19560 | loss 3.352314 (-0.40z)| norm 0.2606 (-1.09z)| lr 1.85e-04 | 4160.03 ms | 32.5% bf16 MFU | 125453 tok/s step 12499/19560 | loss 3.325187 (-1.18z)| norm 0.2676 (-0.66z)| lr 1.85e-04 | 4154.11 ms | 32.5% bf16 MFU | 125490 tok/s step 12500/19560 | loss 3.525054 (+4.34z)| norm 0.2664 (-0.72z)| lr 1.85e-04 | 4158.28 ms | 32.5% bf16 MFU | 
125520 tok/s val loss 3.334815 evaluating HellaSwag: 0/1256 evaluating HellaSwag: 1250/1256 HellaSwag: 2975/10042 = 0.296256 step 
12501/19560 | loss 3.299241 (-1.80z)| norm 0.2850 (+0.41z)| lr 1.85e-04 | 4188.15 ms | 32.2% bf16 MFU | 125503 tok/s step 12502/19560 | loss 3.348647 (-0.46z)| norm 0.2532 (-1.51z)| lr 1.85e-04 | 4172.78 ms | 32.4% bf16 MFU | 125510 tok/s step 12503/19560 | loss 3.385367 (+0.52z)| norm 0.2804 (+0.14z)| lr 1.85e-04 | 4168.53 ms | 32.4% bf16 MFU | 125523 tok/s step 12504/19560 | loss 3.315666 (-1.35z)| norm 0.2757 (-0.15z)| lr 1.84e-04 | 4171.86 ms | 32.4% bf16 MFU | 125531 tok/s step 12505/19560 | loss 3.331249 (-0.92z)| norm 0.2996 (+1.28z)| lr 1.84e-04 | 4220.79 ms | 32.0% bf16 MFU | 125465 tok/s step 12506/19560 | loss 3.306502 (-1.56z)| norm 0.2870 (+0.51z)| lr 1.84e-04 | 4159.51 ms | 32.5% bf16 MFU | 125494 tok/s step 12507/19560 | loss 3.361932 (-0.09z)| norm 0.2821 (+0.21z)| lr 1.84e-04 | 4157.23 ms | 32.5% bf16 MFU | 125525 tok/s step 12508/19560 | loss 3.443942 (+2.07z)| norm 0.2813 (+0.15z)| lr 1.84e-04 | 4154.47 ms | 32.5% bf16 MFU | 125559 tok/s step 12509/19560 | loss 3.354064 (-0.30z)| norm 0.2871 (+0.50z)| lr 1.84e-04 | 4168.15 ms | 32.4% bf16 MFU | 125570 tok/s step 12510/19560 | loss 3.364830 (-0.03z)| norm 0.2927 (+0.83z)| lr 1.84e-04 | 4159.70 ms | 32.5% bf16 MFU | 125594 tok/s step 12511/19560 | loss 3.371504 (+0.14z)| norm 0.2750 (-0.24z)| lr 1.84e-04 | 4162.11 ms | 32.4% bf16 MFU | 125612 tok/s step 12512/19560 | loss 3.363343 (-0.06z)| norm 0.2876 (+0.53z)| lr 1.84e-04 | 4169.86 ms | 32.4% bf16 MFU | 125618 tok/s step 12513/19560 | loss 3.358279 (-0.20z)| norm 0.2846 (+0.36z)| lr 1.84e-04 | 4151.30 ms | 32.5% bf16 MFU | 125652 tok/s step 12514/19560 | loss 3.328412 (-1.00z)| norm 0.2742 (-0.28z)| lr 1.84e-04 | 4159.46 ms | 32.5% bf16 MFU | 125672 tok/s step 12515/19560 | loss 3.464812 (+2.63z)| norm 0.2602 (-1.12z)| lr 1.84e-04 | 4165.66 ms | 32.4% bf16 MFU | 125681 tok/s step 12516/19560 | loss 3.360832 (-0.13z)| norm 0.2839 (+0.34z)| lr 1.84e-04 | 4168.89 ms | 32.4% bf16 MFU | 125685 tok/s step 12517/19560 | loss 3.336023 (-0.77z)| norm 
0.2601 (-1.11z)| lr 1.84e-04 | 4161.46 ms | 32.4% bf16 MFU | 125700 tok/s step 12518/19560 | loss 3.342014 (-0.61z)| norm 0.2755 (-0.16z)| lr 1.84e-04 | 4160.58 ms | 32.5% bf16 MFU | 125716 tok/s step 12519/19560 | loss 3.335118 (-0.78z)| norm 0.2709 (-0.44z)| lr 1.84e-04 | 4172.44 ms | 32.4% bf16 MFU | 125713 tok/s step 12520/19560 | loss 3.373403 (+0.24z)| norm 0.2841 (+0.37z)| lr 1.84e-04 | 4152.48 ms | 32.5% bf16 MFU | 125740 tok/s step 12521/19560 | loss 3.364613 (+0.01z)| norm 0.2787 (+0.03z)| lr 1.84e-04 | 4151.70 ms | 32.5% bf16 MFU | 125767 tok/s step 12522/19560 | loss 3.394391 (+0.79z)| norm 0.2872 (+0.55z)| lr 1.84e-04 | 4153.91 ms | 32.5% bf16 MFU | 125790 tok/s step 12523/19560 | loss 3.360760 (-0.09z)| norm 0.2833 (+0.30z)| lr 1.84e-04 | 4163.17 ms | 32.4% bf16 MFU | 125797 tok/s step 12524/19560 | loss 3.368853 (+0.13z)| norm 0.2975 (+1.17z)| lr 1.84e-04 | 4165.48 ms | 32.4% bf16 MFU | 125800 tok/s step 12525/19560 | loss 3.367234 (+0.10z)| norm 0.2784 (-0.00z)| lr 1.84e-04 | 4167.13 ms | 32.4% bf16 MFU | 125801 tok/s step 12526/19560 | loss 3.307442 (-1.49z)| norm 0.2723 (-0.38z)| lr 1.83e-04 | 4149.04 ms | 32.5% bf16 MFU | 125829 tok/s step 12527/19560 | loss 3.351579 (-0.31z)| norm 0.2768 (-0.10z)| lr 1.83e-04 | 4154.96 ms | 32.5% bf16 MFU | 125847 tok/s step 12528/19560 | loss 3.389569 (+0.69z)| norm 0.2594 (-1.16z)| lr 1.83e-04 | 4161.47 ms | 32.4% bf16 MFU | 125854 tok/s step 12529/19560 | loss 3.345104 (-0.49z)| norm 0.2624 (-0.96z)| lr 1.83e-04 | 4164.73 ms | 32.4% bf16 MFU | 125856 tok/s step 12530/19560 | loss 3.345893 (-0.47z)| norm 0.2713 (-0.41z)| lr 1.83e-04 | 4155.91 ms | 32.5% bf16 MFU | 125871 tok/s step 12531/19560 | loss 3.336995 (-0.70z)| norm 0.2739 (-0.24z)| lr 1.83e-04 | 4180.39 ms | 32.3% bf16 MFU | 125848 tok/s step 12532/19560 | loss 3.329490 (-0.89z)| norm 0.2631 (-0.89z)| lr 1.83e-04 | 4158.52 ms | 32.5% bf16 MFU | 125859 tok/s step 12533/19560 | loss 3.353144 (-0.24z)| norm 0.2599 (-1.11z)| lr 1.83e-04 | 4165.69 ms | 
32.4% bf16 MFU | 125859 tok/s step 12534/19560 | loss 3.353441 (-0.23z)| norm 0.2660 (-0.71z)| lr 1.83e-04 | 4171.72 ms | 32.4% bf16 MFU | 125850 tok/s step 12535/19560 | loss 3.275948 (-2.25z)| norm 0.2796 (+0.22z)| lr 1.83e-04 | 4156.31 ms | 32.5% bf16 MFU | 125865 tok/s step 12536/19560 | loss 3.342520 (-0.49z)| norm 0.3055 (+1.96z)| lr 1.83e-04 | 4165.46 ms | 32.4% bf16 MFU | 125865 tok/s step 12537/19560 | loss 3.358581 (-0.07z)| norm 0.3034 (+1.81z)| lr 1.83e-04 | 4151.93 ms | 32.5% bf16 MFU | 125885 tok/s step 12538/19560 | loss 3.387407 (+0.69z)| norm 0.2874 (+0.72z)| lr 1.83e-04 | 4149.68 ms | 32.5% bf16 MFU | 125908 tok/s step 12539/19560 | loss 3.375510 (+0.40z)| norm 0.2791 (+0.15z)| lr 1.83e-04 | 4159.86 ms | 32.5% bf16 MFU | 125915 tok/s step 12540/19560 | loss 3.400700 (+1.07z)| norm 0.2874 (+0.71z)| lr 1.83e-04 | 4156.26 ms | 32.5% bf16 MFU | 125926 tok/s step 12541/19560 | loss 3.437072 (+2.00z)| norm 0.2739 (-0.20z)| lr 1.83e-04 | 4162.80 ms | 32.4% bf16 MFU | 125927 tok/s step 12542/19560 | loss 3.361509 (-0.02z)| norm 0.3009 (+1.60z)| lr 1.83e-04 | 4161.41 ms | 32.4% bf16 MFU | 125930 tok/s step 12543/19560 | loss 3.440717 (+2.05z)| norm 0.2677 (-0.64z)| lr 1.83e-04 | 4165.77 ms | 32.4% bf16 MFU | 125926 tok/s step 12544/19560 | loss 3.368631 (+0.14z)| norm 0.2909 (+0.92z)| lr 1.83e-04 | 4151.09 ms | 32.5% bf16 MFU | 125945 tok/s step 12545/19560 | loss 3.317098 (-1.21z)| norm 0.2622 (-1.01z)| lr 1.83e-04 | 4152.69 ms | 32.5% bf16 MFU | 125961 tok/s step 12546/19560 | loss 3.367072 (+0.10z)| norm 0.2727 (-0.31z)| lr 1.83e-04 | 4160.32 ms | 32.5% bf16 MFU | 125964 tok/s step 12547/19560 | loss 3.345090 (-0.48z)| norm 0.2648 (-0.85z)| lr 1.82e-04 | 4168.53 ms | 32.4% bf16 MFU | 125954 tok/s step 12548/19560 | loss 3.343194 (-0.51z)| norm 0.2534 (-1.60z)| lr 1.82e-04 | 4166.21 ms | 32.4% bf16 MFU | 125948 tok/s step 12549/19560 | loss 3.370565 (+0.21z)| norm 0.3026 (+1.68z)| lr 1.82e-04 | 4156.07 ms | 32.5% bf16 MFU | 125959 tok/s step 12550/19560 
| loss 3.360988 (-0.04z)| norm 0.2735 (-0.28z)| lr 1.82e-04 | 4165.32 ms | 32.4% bf16 MFU | 125954 tok/s step 12551/19560 | loss 3.481163 (+3.03z)| norm 0.2594 (-1.23z)| lr 1.82e-04 | 4169.14 ms | 32.4% bf16 MFU | 125944 tok/s step 12552/19560 | loss 3.280248 (-2.10z)| norm 0.2914 (+0.92z)| lr 1.82e-04 | 4167.19 ms | 32.4% bf16 MFU | 125938 tok/s step 12553/19560 | loss 3.378963 (+0.41z)| norm 0.2606 (-1.18z)| lr 1.82e-04 | 4162.42 ms | 32.4% bf16 MFU | 125939 tok/s step 12554/19560 | loss 3.392415 (+0.75z)| norm 0.2875 (+0.65z)| lr 1.82e-04 | 4155.55 ms | 32.5% bf16 MFU | 125950 tok/s step 12555/19560 | loss 3.400058 (+0.94z)| norm 0.2795 (+0.09z)| lr 1.82e-04 | 4163.14 ms | 32.4% bf16 MFU | 125949 tok/s step 12556/19560 | loss 3.348352 (-0.37z)| norm 0.2763 (-0.13z)| lr 1.82e-04 | 4157.03 ms | 32.5% bf16 MFU | 125958 tok/s step 12557/19560 | loss 3.365193 (+0.05z)| norm 0.2591 (-1.32z)| lr 1.82e-04 | 4157.08 ms | 32.5% bf16 MFU | 125966 tok/s step 12558/19560 | loss 3.343666 (-0.50z)| norm 0.2711 (-0.49z)| lr 1.82e-04 | 4166.53 ms | 32.4% bf16 MFU | 125959 tok/s step 12559/19560 | loss 3.341546 (-0.55z)| norm 0.2698 (-0.58z)| lr 1.82e-04 | 4156.10 ms | 32.5% bf16 MFU | 125969 tok/s step 12560/19560 | loss 3.405260 (+1.07z)| norm 0.2687 (-0.66z)| lr 1.82e-04 | 4163.14 ms | 32.4% bf16 MFU | 125967 tok/s step 12561/19560 | loss 3.398325 (+0.88z)| norm 0.2704 (-0.54z)| lr 1.82e-04 | 4160.38 ms | 32.5% bf16 MFU | 125970 tok/s step 12562/19560 | loss 3.358763 (-0.13z)| norm 0.2683 (-0.68z)| lr 1.82e-04 | 4155.03 ms | 32.5% bf16 MFU | 125980 tok/s step 12563/19560 | loss 3.366569 (+0.08z)| norm 0.2764 (-0.11z)| lr 1.82e-04 | 4157.65 ms | 32.5% bf16 MFU | 125986 tok/s step 12564/19560 | loss 3.367082 (+0.10z)| norm 0.2658 (-0.84z)| lr 1.82e-04 | 4167.75 ms | 32.4% bf16 MFU | 125977 tok/s step 12565/19560 | loss 3.498943 (+3.32z)| norm 0.3021 (+1.64z)| lr 1.82e-04 | 4163.15 ms | 32.4% bf16 MFU | 125975 tok/s step 12566/19560 | loss 3.423379 (+1.45z)| norm 0.2628 (-1.06z)| 
lr 1.82e-04 | 4157.88 ms | 32.5% bf16 MFU | 125981 tok/s step 12567/19560 | loss 3.417891 (+1.29z)| norm 0.2670 (-0.78z)| lr 1.82e-04 | 4157.42 ms | 32.5% bf16 MFU | 125987 tok/s step 12568/19560 | loss 3.377661 (+0.31z)| norm 0.2722 (-0.42z)| lr 1.82e-04 | 4158.57 ms | 32.5% bf16 MFU | 125992 tok/s step 12569/19560 | loss 3.350415 (-0.36z)| norm 0.2944 (+1.09z)| lr 1.81e-04 | 4161.43 ms | 32.4% bf16 MFU | 125991 tok/s step 12570/19560 | loss 3.349085 (-0.39z)| norm 0.2772 (-0.09z)| lr 1.81e-04 | 4149.95 ms | 32.5% bf16 MFU | 126009 tok/s step 12571/19560 | loss 3.389084 (+0.60z)| norm 0.2938 (+1.06z)| lr 1.81e-04 | 4155.42 ms | 32.5% bf16 MFU | 126017 tok/s step 12572/19560 | loss 3.419874 (+1.34z)| norm 0.2870 (+0.58z)| lr 1.81e-04 | 4159.47 ms | 32.5% bf16 MFU | 126018 tok/s step 12573/19560 | loss 3.370409 (+0.12z)| norm 0.2829 (+0.29z)| lr 1.81e-04 | 4153.07 ms | 32.5% bf16 MFU | 126029 tok/s step 12574/19560 | loss 3.348820 (-0.41z)| norm 0.2774 (-0.10z)| lr 1.81e-04 | 4168.05 ms | 32.4% bf16 MFU | 126017 tok/s step 12575/19560 | loss 3.287502 (-1.88z)| norm 0.2954 (+1.18z)| lr 1.81e-04 | 4163.71 ms | 32.4% bf16 MFU | 126012 tok/s step 12576/19560 | loss 3.326756 (-0.92z)| norm 0.2888 (+0.69z)| lr 1.81e-04 | 4160.85 ms | 32.4% bf16 MFU | 126012 tok/s step 12577/19560 | loss 3.330264 (-0.83z)| norm 0.2898 (+0.76z)| lr 1.81e-04 | 4159.14 ms | 32.5% bf16 MFU | 126014 tok/s step 12578/19560 | loss 3.372188 (+0.21z)| norm 0.2808 (+0.09z)| lr 1.81e-04 | 4156.82 ms | 32.5% bf16 MFU | 126020 tok/s step 12579/19560 | loss 3.398036 (+0.84z)| norm 0.2691 (-0.76z)| lr 1.81e-04 | 4157.51 ms | 32.5% bf16 MFU | 126024 tok/s step 12580/19560 | loss 3.396586 (+0.80z)| norm 0.2870 (+0.55z)| lr 1.81e-04 | 4159.08 ms | 32.5% bf16 MFU | 126026 tok/s step 12581/19560 | loss 3.393933 (+0.72z)| norm 0.2837 (+0.30z)| lr 1.81e-04 | 4159.82 ms | 32.5% bf16 MFU | 126026 tok/s step 12582/19560 | loss 3.363501 (-0.03z)| norm 0.2840 (+0.32z)| lr 1.81e-04 | 4170.73 ms | 32.4% bf16 MFU | 
126010 tok/s step 12583/19560 | loss 3.403057 (+1.00z)| norm 0.2862 (+0.48z)| lr 1.81e-04 | 4160.97 ms | 32.4% bf16 MFU | 126010 tok/s step 12584/19560 | loss 3.365078 (+0.02z)| norm 0.2670 (-0.91z)| lr 1.81e-04 | 4167.91 ms | 32.4% bf16 MFU | 125999 tok/s step 12585/19560 | loss 3.387163 (+0.58z)| norm 0.2856 (+0.44z)| lr 1.81e-04 | 4160.31 ms | 32.5% bf16 MFU | 126000 tok/s step 12586/19560 | loss 3.404670 (+1.01z)| norm 0.2691 (-0.76z)| lr 1.81e-04 | 4166.58 ms | 32.4% bf16 MFU | 125992 tok/s step 12587/19560 | loss 3.411461 (+1.17z)| norm 0.2682 (-0.83z)| lr 1.81e-04 | 4161.37 ms | 32.4% bf16 MFU | 125992 tok/s step 12588/19560 | loss 3.403569 (+0.96z)| norm 0.2695 (-0.73z)| lr 1.81e-04 | 4993.34 ms | 27.0% bf16 MFU | 124942 tok/s step 12589/19560 | loss 3.313053 (-1.32z)| norm 0.2968 (+1.30z)| lr 1.81e-04 | 4162.70 ms | 32.4% bf16 MFU | 124992 tok/s step 12590/19560 | loss 3.396633 (+0.79z)| norm 0.2731 (-0.46z)| lr 1.81e-04 | 4153.04 ms | 32.5% bf16 MFU | 125055 tok/s step 12591/19560 | loss 3.358698 (-0.18z)| norm 0.2842 (+0.36z)| lr 1.80e-04 | 4162.75 ms | 32.4% bf16 MFU | 125099 tok/s step 12592/19560 | loss 3.329427 (-0.91z)| norm 0.2829 (+0.28z)| lr 1.80e-04 | 4161.86 ms | 32.4% bf16 MFU | 125143 tok/s step 12593/19560 | loss 3.346680 (-0.47z)| norm 0.2667 (-0.93z)| lr 1.80e-04 | 4163.82 ms | 32.4% bf16 MFU | 125182 tok/s step 12594/19560 | loss 3.387644 (+0.56z)| norm 0.2913 (+0.95z)| lr 1.80e-04 | 4154.36 ms | 32.5% bf16 MFU | 125233 tok/s step 12595/19560 | loss 3.351870 (-0.34z)| norm 0.2766 (-0.17z)| lr 1.80e-04 | 4158.96 ms | 32.5% bf16 MFU | 125274 tok/s step 12596/19560 | loss 3.314997 (-1.26z)| norm 0.2783 (-0.03z)| lr 1.80e-04 | 4157.18 ms | 32.5% bf16 MFU | 125316 tok/s step 12597/19560 | loss 3.383534 (+0.47z)| norm 0.2516 (-2.04z)| lr 1.80e-04 | 4195.46 ms | 32.2% bf16 MFU | 125299 tok/s step 12598/19560 | loss 3.321240 (-1.08z)| norm 0.2792 (+0.07z)| lr 1.80e-04 | 4214.62 ms | 32.0% bf16 MFU | 125254 tok/s step 12599/19560 | loss 3.375017 
(+0.25z)| norm 0.2622 (-1.22z)| lr 1.80e-04 | 4156.73 ms | 32.5% bf16 MFU | 125297 tok/s step 12600/19560 | loss 3.390881 (+0.64z)| norm 0.2612 (-1.31z)| lr 1.80e-04 | 4160.48 ms | 32.5% bf16 MFU | 125333 tok/s step 12601/19560 | loss 3.351374 (-0.35z)| norm 0.2801 (+0.21z)| lr 1.80e-04 | 4171.70 ms | 32.4% bf16 MFU | 125351 tok/s step 12602/19560 | loss 3.401645 (+0.90z)| norm 0.2677 (-0.80z)| lr 1.80e-04 | 4161.28 ms | 32.4% bf16 MFU | 125383 tok/s step 12603/19560 | loss 3.365573 (-0.02z)| norm 0.2989 (+1.81z)| lr 1.80e-04 | 4154.31 ms | 32.5% bf16 MFU | 125424 tok/s step 12604/19560 | loss 3.373671 (+0.18z)| norm 0.2665 (-0.88z)| lr 1.80e-04 | 4167.74 ms | 32.4% bf16 MFU | 125442 tok/s step 12605/19560 | loss 3.344706 (-0.56z)| norm 0.2996 (+1.88z)| lr 1.80e-04 | 4156.98 ms | 32.5% bf16 MFU | 125476 tok/s step 12606/19560 | loss 3.337945 (-0.72z)| norm 0.2972 (+1.65z)| lr 1.80e-04 | 4165.27 ms | 32.4% bf16 MFU | 125496 tok/s step 12607/19560 | loss 3.369453 (+0.07z)| norm 0.2721 (-0.42z)| lr 1.80e-04 | 4156.14 ms | 32.5% bf16 MFU | 125529 tok/s step 12608/19560 | loss 3.364676 (-0.05z)| norm 0.2830 (+0.48z)| lr 1.80e-04 | 4160.84 ms | 32.4% bf16 MFU | 125553 tok/s step 12609/19560 | loss 3.435954 (+1.73z)| norm 0.2965 (+1.58z)| lr 1.80e-04 | 4156.47 ms | 32.5% bf16 MFU | 125582 tok/s step 12610/19560 | loss 3.335659 (-0.80z)| norm 0.2939 (+1.34z)| lr 1.80e-04 | 4169.35 ms | 32.4% bf16 MFU | 125590 tok/s step 12611/19560 | loss 3.333115 (-0.86z)| norm 0.4058 (+7.64z)| lr 1.80e-04 | 4156.26 ms | 32.5% bf16 MFU | 125618 tok/s step 12612/19560 | loss 3.335514 (-0.79z)| norm 0.2805 (+0.12z)| lr 1.80e-04 | 4185.52 ms | 32.3% bf16 MFU | 125600 tok/s step 12613/19560 | loss 3.353027 (-0.36z)| norm 0.2832 (+0.28z)| lr 1.79e-04 | 4164.07 ms | 32.4% bf16 MFU | 125615 tok/s step 12614/19560 | loss 3.342760 (-0.61z)| norm 0.2828 (+0.25z)| lr 1.79e-04 | 4166.44 ms | 32.4% bf16 MFU | 125626 tok/s step 12615/19560 | loss 3.386444 (+0.48z)| norm 0.2866 (+0.47z)| lr 1.79e-04 | 
4164.84 ms | 32.4% bf16 MFU | 125639 tok/s step 12616/19560 | loss 3.353938 (-0.32z)| norm 0.2728 (-0.36z)| lr 1.79e-04 | 4153.87 ms | 32.5% bf16 MFU | 125668 tok/s step 12617/19560 | loss 3.414657 (+1.22z)| norm 0.2825 (+0.22z)| lr 1.79e-04 | 4162.25 ms | 32.4% bf16 MFU | 125683 tok/s step 12618/19560 | loss 3.384207 (+0.45z)| norm 0.2718 (-0.43z)| lr 1.79e-04 | 4164.66 ms | 32.4% bf16 MFU | 125693 tok/s step 12619/19560 | loss 3.401735 (+0.89z)| norm 0.2783 (-0.04z)| lr 1.79e-04 | 4155.10 ms | 32.5% bf16 MFU | 125718 tok/s step 12620/19560 | loss 3.448259 (+2.02z)| norm 0.2812 (+0.14z)| lr 1.79e-04 | 4164.13 ms | 32.4% bf16 MFU | 125727 tok/s step 12621/19560 | loss 3.369328 (+0.03z)| norm 0.2796 (+0.03z)| lr 1.79e-04 | 4173.98 ms | 32.3% bf16 MFU | 125721 tok/s step 12622/19560 | loss 3.282722 (-2.10z)| norm 0.2705 (-0.52z)| lr 1.79e-04 | 4154.53 ms | 32.5% bf16 MFU | 125745 tok/s step 12623/19560 | loss 3.371975 (+0.11z)| norm 0.2710 (-0.50z)| lr 1.79e-04 | 4151.73 ms | 32.5% bf16 MFU | 125772 tok/s step 12624/19560 | loss 3.436963 (+1.70z)| norm 0.2743 (-0.29z)| lr 1.79e-04 | 4156.09 ms | 32.5% bf16 MFU | 125791 tok/s step 12625/19560 | loss 3.314095 (-1.31z)| norm 0.2900 (+0.67z)| lr 1.79e-04 | 4159.08 ms | 32.5% bf16 MFU | 125804 tok/s step 12626/19560 | loss 3.361803 (-0.15z)| norm 0.2822 (+0.18z)| lr 1.79e-04 | 4161.78 ms | 32.4% bf16 MFU | 125813 tok/s step 12627/19560 | loss 3.343845 (-0.59z)| norm 0.2791 (-0.02z)| lr 1.79e-04 | 5338.65 ms | 25.3% bf16 MFU | 124432 tok/s step 12628/19560 | loss 3.355673 (-0.28z)| norm 0.2631 (-1.00z)| lr 1.79e-04 | 4160.52 ms | 32.5% bf16 MFU | 124511 tok/s step 12629/19560 | loss 3.339293 (-0.73z)| norm 0.2830 (+0.23z)| lr 1.79e-04 | 4147.96 ms | 32.6% bf16 MFU | 124606 tok/s step 12630/19560 | loss 3.386600 (+0.51z)| norm 0.2814 (+0.12z)| lr 1.79e-04 | 4159.05 ms | 32.5% bf16 MFU | 124678 tok/s step 12631/19560 | loss 3.475526 (+2.75z)| norm 0.2856 (+0.37z)| lr 1.79e-04 | 4158.98 ms | 32.5% bf16 MFU | 124748 tok/s step 
12632/19560 | loss 3.388816 (+0.52z)| norm 0.2693 (-0.63z)| lr 1.79e-04 | 4162.01 ms | 32.4% bf16 MFU | 124809 tok/s step 12633/19560 | loss 3.347447 (-0.55z)| norm 0.2780 (-0.09z)| lr 1.79e-04 | 4163.70 ms | 32.4% bf16 MFU | 124864 tok/s step 12634/19560 | loss 3.369139 (+0.00z)| norm 0.2830 (+0.23z)| lr 1.79e-04 | 4158.79 ms | 32.5% bf16 MFU | 124924 tok/s step 12635/19560 | loss 3.324406 (-1.15z)| norm 0.2647 (-0.90z)| lr 1.78e-04 | 4153.63 ms | 32.5% bf16 MFU | 124989 tok/s step 12636/19560 | loss 3.351264 (-0.44z)| norm 0.3044 (+1.54z)| lr 1.78e-04 | 4157.53 ms | 32.5% bf16 MFU | 125045 tok/s step 12637/19560 | loss 3.368476 (+0.01z)| norm 0.2674 (-0.73z)| lr 1.78e-04 | 4162.75 ms | 32.4% bf16 MFU | 125090 tok/s step 12638/19560 | loss 3.412010 (+1.14z)| norm 0.2853 (+0.38z)| lr 1.78e-04 | 4159.96 ms | 32.5% bf16 MFU | 125137 tok/s step 12639/19560 | loss 3.418268 (+1.28z)| norm 0.2760 (-0.20z)| lr 1.78e-04 | 4180.90 ms | 32.3% bf16 MFU | 125150 tok/s step 12640/19560 | loss 3.378326 (+0.24z)| norm 0.2905 (+0.69z)| lr 1.78e-04 | 4157.98 ms | 32.5% bf16 MFU | 125198 tok/s step 12641/19560 | loss 3.382469 (+0.34z)| norm 0.2824 (+0.20z)| lr 1.78e-04 | 4155.60 ms | 32.5% bf16 MFU | 125246 tok/s step 12642/19560 | loss 3.384029 (+0.37z)| norm 0.2776 (-0.10z)| lr 1.78e-04 | 4166.90 ms | 32.4% bf16 MFU | 125275 tok/s step 12643/19560 | loss 3.339812 (-0.77z)| norm 0.2789 (-0.03z)| lr 1.78e-04 | 4148.63 ms | 32.5% bf16 MFU | 125330 tok/s step 12644/19560 | loss 3.346134 (-0.60z)| norm 0.2756 (-0.23z)| lr 1.78e-04 | 4153.70 ms | 32.5% bf16 MFU | 125374 tok/s step 12645/19560 | loss 3.349967 (-0.50z)| norm 0.2821 (+0.16z)| lr 1.78e-04 | 4146.37 ms | 32.6% bf16 MFU | 125428 tok/s step 12646/19560 | loss 3.377427 (+0.23z)| norm 0.2772 (-0.14z)| lr 1.78e-04 | 4174.42 ms | 32.3% bf16 MFU | 125436 tok/s step 12647/19560 | loss 3.332670 (-0.97z)| norm 0.2884 (+0.55z)| lr 1.78e-04 | 4160.31 ms | 32.5% bf16 MFU | 125466 tok/s step 12648/19560 | loss 3.365290 (-0.10z)| norm 
0.2667 (-0.79z)| lr 1.78e-04 | 4163.66 ms | 32.4% bf16 MFU | 125488 tok/s step 12649/19560 | loss 3.349234 (-0.52z)| norm 0.3010 (+1.32z)| lr 1.78e-04 | 4164.38 ms | 32.4% bf16 MFU | 125509 tok/s step 12650/19560 | loss 3.385087 (+0.44z)| norm 0.2703 (-0.57z)| lr 1.78e-04 | 4169.33 ms | 32.4% bf16 MFU | 125521 tok/s step 12651/19560 | loss 3.371077 (+0.06z)| norm 0.2786 (-0.05z)| lr 1.78e-04 | 4157.54 ms | 32.5% bf16 MFU | 125550 tok/s step 12652/19560 | loss 3.377804 (+0.24z)| norm 0.2569 (-1.37z)| lr 1.78e-04 | 4164.77 ms | 32.4% bf16 MFU | 125567 tok/s step 12653/19560 | loss 3.366139 (-0.07z)| norm 0.2812 (+0.12z)| lr 1.78e-04 | 4159.75 ms | 32.5% bf16 MFU | 125590 tok/s step 12654/19560 | loss 3.387422 (+0.49z)| norm 0.2603 (-1.15z)| lr 1.78e-04 | 4159.21 ms | 32.5% bf16 MFU | 125614 tok/s step 12655/19560 | loss 3.361059 (-0.23z)| norm 0.2805 (+0.09z)| lr 1.78e-04 | 4148.05 ms | 32.5% bf16 MFU | 125653 tok/s step 12656/19560 | loss 3.358415 (-0.29z)| norm 0.2426 (-2.20z)| lr 1.78e-04 | 4163.08 ms | 32.4% bf16 MFU | 125667 tok/s step 12657/19560 | loss 3.361115 (-0.22z)| norm 0.2677 (-0.69z)| lr 1.77e-04 | 4170.20 ms | 32.4% bf16 MFU | 125670 tok/s step 12658/19560 | loss 3.289788 (-2.11z)| norm 0.2536 (-1.52z)| lr 1.77e-04 | 4163.49 ms | 32.4% bf16 MFU | 125682 tok/s step 12659/19560 | loss 3.358537 (-0.28z)| norm 0.3015 (+1.33z)| lr 1.77e-04 | 4160.09 ms | 32.5% bf16 MFU | 125700 tok/s step 12660/19560 | loss 3.376766 (+0.19z)| norm 0.2715 (-0.46z)| lr 1.77e-04 | 4164.84 ms | 32.4% bf16 MFU | 125709 tok/s step 12661/19560 | loss 3.357007 (-0.34z)| norm 0.2582 (-1.25z)| lr 1.77e-04 | 4164.57 ms | 32.4% bf16 MFU | 125718 tok/s step 12662/19560 | loss 3.346038 (-0.63z)| norm 0.2789 (-0.02z)| lr 1.77e-04 | 4166.26 ms | 32.4% bf16 MFU | 125724 tok/s step 12663/19560 | loss 3.371319 (+0.03z)| norm 0.2819 (+0.16z)| lr 1.77e-04 | 4159.05 ms | 32.5% bf16 MFU | 125741 tok/s step 12664/19560 | loss 3.347704 (-0.62z)| norm 0.2787 (-0.02z)| lr 1.77e-04 | 4160.22 ms | 
32.5% bf16 MFU | 125755 tok/s step 12665/19560 | loss 3.438721 (+1.84z)| norm 0.2921 (+0.80z)| lr 1.77e-04 | 4160.70 ms | 32.5% bf16 MFU | 125768 tok/s step 12666/19560 | loss 3.365192 (-0.15z)| norm 0.2715 (-0.45z)| lr 1.77e-04 | 4162.38 ms | 32.4% bf16 MFU | 125777 tok/s step 12667/19560 | loss 3.366150 (-0.12z)| norm 0.2924 (+0.82z)| lr 1.77e-04 | 4152.59 ms | 32.5% bf16 MFU | 125801 tok/s step 12668/19560 | loss 3.344729 (-0.69z)| norm 0.2760 (-0.17z)| lr 1.77e-04 | 4159.24 ms | 32.5% bf16 MFU | 125814 tok/s step 12669/19560 | loss 3.367059 (-0.07z)| norm 0.2940 (+0.91z)| lr 1.77e-04 | 4158.83 ms | 32.5% bf16 MFU | 125827 tok/s step 12670/19560 | loss 3.404872 (+0.96z)| norm 0.2645 (-0.87z)| lr 1.77e-04 | 4157.27 ms | 32.5% bf16 MFU | 125841 tok/s step 12671/19560 | loss 3.357858 (-0.32z)| norm 0.2744 (-0.27z)| lr 1.77e-04 | 4160.13 ms | 32.5% bf16 MFU | 125850 tok/s step 12672/19560 | loss 3.380593 (+0.31z)| norm 0.2770 (-0.11z)| lr 1.77e-04 | 4166.19 ms | 32.4% bf16 MFU | 125850 tok/s step 12673/19560 | loss 3.387774 (+0.50z)| norm 0.2688 (-0.61z)| lr 1.77e-04 | 4161.92 ms | 32.4% bf16 MFU | 125856 tok/s step 12674/19560 | loss 3.413330 (+1.20z)| norm 0.2830 (+0.25z)| lr 1.77e-04 | 4157.28 ms | 32.5% bf16 MFU | 125869 tok/s step 12675/19560 | loss 3.398619 (+0.78z)| norm 0.2908 (+0.72z)| lr 1.77e-04 | 4153.46 ms | 32.5% bf16 MFU | 125887 tok/s step 12676/19560 | loss 3.328646 (-1.17z)| norm 0.2668 (-0.76z)| lr 1.77e-04 | 4164.57 ms | 32.4% bf16 MFU | 125887 tok/s step 12677/19560 | loss 3.327692 (-1.18z)| norm 0.3037 (+1.51z)| lr 1.77e-04 | 4163.64 ms | 32.4% bf16 MFU | 125889 tok/s step 12678/19560 | loss 3.369603 (-0.02z)| norm 0.2647 (-0.88z)| lr 1.77e-04 | 4160.25 ms | 32.5% bf16 MFU | 125896 tok/s step 12679/19560 | loss 3.356395 (-0.37z)| norm 0.2803 (+0.06z)| lr 1.76e-04 | 4155.43 ms | 32.5% bf16 MFU | 125909 tok/s step 12680/19560 | loss 3.378752 (+0.25z)| norm 0.2862 (+0.43z)| lr 1.76e-04 | 4169.43 ms | 32.4% bf16 MFU | 125901 tok/s step 12681/19560 
| loss 3.414469 (+1.29z)| norm 0.2893 (+0.61z)| lr 1.76e-04 | 4167.42 ms | 32.4% bf16 MFU | 125896 tok/s step 12682/19560 | loss 3.353665 (-0.48z)| norm 0.2889 (+0.59z)| lr 1.76e-04 | 4155.47 ms | 32.5% bf16 MFU | 125910 tok/s step 12683/19560 | loss 3.380759 (+0.32z)| norm 0.2547 (-1.51z)| lr 1.76e-04 | 4160.63 ms | 32.5% bf16 MFU | 125915 tok/s step 12684/19560 | loss 3.383166 (+0.38z)| norm 0.2666 (-0.77z)| lr 1.76e-04 | 4159.36 ms | 32.5% bf16 MFU | 125922 tok/s step 12685/19560 | loss 3.400576 (+0.88z)| norm 0.2695 (-0.60z)| lr 1.76e-04 | 4156.22 ms | 32.5% bf16 MFU | 125933 tok/s step 12686/19560 | loss 3.365772 (-0.15z)| norm 0.2700 (-0.57z)| lr 1.76e-04 | 4161.70 ms | 32.4% bf16 MFU | 125935 tok/s step 12687/19560 | loss 3.331142 (-1.16z)| norm 0.2515 (-1.68z)| lr 1.76e-04 | 4156.36 ms | 32.5% bf16 MFU | 125946 tok/s step 12688/19560 | loss 3.410666 (+1.17z)| norm 0.2810 (+0.11z)| lr 1.76e-04 | 4170.27 ms | 32.4% bf16 MFU | 125934 tok/s step 12689/19560 | loss 3.402300 (+0.93z)| norm 0.2983 (+1.15z)| lr 1.76e-04 | 4160.47 ms | 32.5% bf16 MFU | 125938 tok/s step 12690/19560 | loss 3.370120 (-0.02z)| norm 0.2617 (-1.07z)| lr 1.76e-04 | 4162.21 ms | 32.4% bf16 MFU | 125940 tok/s step 12691/19560 | loss 3.517477 (+3.99z)| norm 0.2867 (+0.44z)| lr 1.76e-04 | 4191.25 ms | 32.2% bf16 MFU | 125897 tok/s step 12692/19560 | loss 3.313996 (-1.56z)| norm 0.2617 (-1.07z)| lr 1.76e-04 | 4184.72 ms | 32.3% bf16 MFU | 125867 tok/s step 12693/19560 | loss 3.404391 (+0.95z)| norm 0.2919 (+0.76z)| lr 1.76e-04 | 4184.59 ms | 32.3% bf16 MFU | 125838 tok/s step 12694/19560 | loss 3.336587 (-0.96z)| norm 0.2398 (-2.35z)| lr 1.76e-04 | 4171.79 ms | 32.4% bf16 MFU | 125830 tok/s step 12695/19560 | loss 3.413161 (+1.23z)| norm 0.2932 (+0.83z)| lr 1.76e-04 | 4168.54 ms | 32.4% bf16 MFU | 125827 tok/s step 12696/19560 | loss 3.320174 (-1.41z)| norm 0.2674 (-0.71z)| lr 1.76e-04 | 4161.73 ms | 32.4% bf16 MFU | 125834 tok/s step 12697/19560 | loss 3.326017 (-1.23z)| norm 0.2780 (-0.07z)| 
lr 1.76e-04 | 4161.71 ms | 32.4% bf16 MFU | 125842 tok/s step 12698/19560 | loss 3.354413 (-0.43z)| norm 0.2573 (-1.29z)| lr 1.76e-04 | 4161.97 ms | 32.4% bf16 MFU | 125848 tok/s step 12699/19560 | loss 3.356668 (-0.36z)| norm 0.2655 (-0.79z)| lr 1.76e-04 | 4160.10 ms | 32.5% bf16 MFU | 125857 tok/s step 12700/19560 | loss 3.330911 (-1.07z)| norm 0.2697 (-0.53z)| lr 1.76e-04 | 4267.73 ms | 31.6% bf16 MFU | 125707 tok/s step 12701/19560 | loss 3.349738 (-0.53z)| norm 0.2630 (-0.92z)| lr 1.75e-04 | 4161.37 ms | 32.4% bf16 MFU | 125721 tok/s step 12702/19560 | loss 3.343594 (-0.70z)| norm 0.2833 (+0.28z)| lr 1.75e-04 | 4173.17 ms | 32.4% bf16 MFU | 125716 tok/s step 12703/19560 | loss 3.373916 (+0.14z)| norm 0.2626 (-0.93z)| lr 1.75e-04 | 4161.54 ms | 32.4% bf16 MFU | 125730 tok/s step 12704/19560 | loss 3.361728 (-0.22z)| norm 0.2869 (+0.51z)| lr 1.75e-04 | 4159.81 ms | 32.5% bf16 MFU | 125745 tok/s step 12705/19560 | loss 3.319749 (-1.44z)| norm 0.2673 (-0.64z)| lr 1.75e-04 | 4157.54 ms | 32.5% bf16 MFU | 125763 tok/s step 12706/19560 | loss 3.377331 (+0.23z)| norm 0.2719 (-0.36z)| lr 1.75e-04 | 4163.18 ms | 32.4% bf16 MFU | 125772 tok/s step 12707/19560 | loss 3.374561 (+0.16z)| norm 0.2709 (-0.43z)| lr 1.75e-04 | 4156.36 ms | 32.5% bf16 MFU | 125790 tok/s step 12708/19560 | loss 3.361503 (-0.21z)| norm 0.2705 (-0.44z)| lr 1.75e-04 | 4169.24 ms | 32.4% bf16 MFU | 125788 tok/s step 12709/19560 | loss 3.391788 (+0.67z)| norm 0.2779 (+0.00z)| lr 1.75e-04 | 4160.78 ms | 32.4% bf16 MFU | 125799 tok/s step 12710/19560 | loss 3.307240 (-1.76z)| norm 0.2809 (+0.18z)| lr 1.75e-04 | 4164.35 ms | 32.4% bf16 MFU | 125804 tok/s step 12711/19560 | loss 3.331161 (-1.06z)| norm 0.2720 (-0.34z)| lr 1.75e-04 | 4168.24 ms | 32.4% bf16 MFU | 125803 tok/s step 12712/19560 | loss 3.333742 (-0.97z)| norm 0.2678 (-0.59z)| lr 1.75e-04 | 4158.00 ms | 32.5% bf16 MFU | 125817 tok/s step 12713/19560 | loss 3.384097 (+0.48z)| norm 0.2915 (+0.81z)| lr 1.75e-04 | 4167.25 ms | 32.4% bf16 MFU | 
125817 tok/s step 12714/19560 | loss 3.376012 (+0.25z)| norm 0.2728 (-0.30z)| lr 1.75e-04 | 4167.91 ms | 32.4% bf16 MFU | 125816 tok/s step 12715/19560 | loss 3.305368 (-1.75z)| norm 0.2831 (+0.31z)| lr 1.75e-04 | 4164.31 ms | 32.4% bf16 MFU | 125820 tok/s step 12716/19560 | loss 3.277068 (-2.48z)| norm 0.2841 (+0.36z)| lr 1.75e-04 | 4166.25 ms | 32.4% bf16 MFU | 125821 tok/s step 12717/19560 | loss 3.379828 (+0.39z)| norm 0.2924 (+0.86z)| lr 1.75e-04 | 4162.56 ms | 32.4% bf16 MFU | 125828 tok/s step 12718/19560 | loss 3.339970 (-0.73z)| norm 0.2952 (+1.01z)| lr 1.75e-04 | 4164.78 ms | 32.4% bf16 MFU | 125831 tok/s step 12719/19560 | loss 3.410271 (+1.25z)| norm 0.3003 (+1.30z)| lr 1.75e-04 | 4166.70 ms | 32.4% bf16 MFU | 125831 tok/s step 12720/19560 | loss 3.359171 (-0.20z)| norm 0.2893 (+0.65z)| lr 1.75e-04 | 4198.01 ms | 32.2% bf16 MFU | 125784 tok/s step 12721/19560 | loss 3.330056 (-1.02z)| norm 0.3187 (+2.31z)| lr 1.75e-04 | 4164.22 ms | 32.4% bf16 MFU | 125790 tok/s step 12722/19560 | loss 3.324198 (-1.16z)| norm 0.2890 (+0.60z)| lr 1.75e-04 | 4160.63 ms | 32.5% bf16 MFU | 125801 tok/s step 12723/19560 | loss 3.341372 (-0.68z)| norm 0.2838 (+0.29z)| lr 1.74e-04 | 4151.74 ms | 32.5% bf16 MFU | 125825 tok/s step 12724/19560 | loss 3.414963 (+1.37z)| norm 0.2953 (+0.94z)| lr 1.74e-04 | 4157.70 ms | 32.5% bf16 MFU | 125838 tok/s step 12725/19560 | loss 3.386002 (+0.55z)| norm 0.3072 (+1.60z)| lr 1.74e-04 | 4161.58 ms | 32.4% bf16 MFU | 125846 tok/s step 12726/19560 | loss 3.324522 (-1.17z)| norm 0.2605 (-1.08z)| lr 1.74e-04 | 4163.82 ms | 32.4% bf16 MFU | 125849 tok/s step 12727/19560 | loss 3.385755 (+0.54z)| norm 0.2909 (+0.66z)| lr 1.74e-04 | 4157.71 ms | 32.5% bf16 MFU | 125862 tok/s step 12728/19560 | loss 3.336986 (-0.81z)| norm 0.2805 (+0.05z)| lr 1.74e-04 | 4164.25 ms | 32.4% bf16 MFU | 125864 tok/s step 12729/19560 | loss 3.385397 (+0.54z)| norm 0.2754 (-0.24z)| lr 1.74e-04 | 4161.82 ms | 32.4% bf16 MFU | 125869 tok/s step 12730/19560 | loss 3.303289 
(-1.73z)| norm 0.2869 (+0.41z)| lr 1.74e-04 | 4161.56 ms | 32.4% bf16 MFU | 125875 tok/s step 12731/19560 | loss 3.349080 (-0.45z)| norm 0.3136 (+1.93z)| lr 1.74e-04 | 4168.05 ms | 32.4% bf16 MFU | 125871 tok/s step 12732/19560 | loss 3.343437 (-0.60z)| norm 0.2827 (+0.15z)| lr 1.74e-04 | 4161.33 ms | 32.4% bf16 MFU | 125877 tok/s step 12733/19560 | loss 3.328206 (-1.02z)| norm 0.3065 (+1.51z)| lr 1.74e-04 | 4169.50 ms | 32.4% bf16 MFU | 125870 tok/s step 12734/19560 | loss 3.393060 (+0.76z)| norm 0.2846 (+0.27z)| lr 1.74e-04 | 4164.29 ms | 32.4% bf16 MFU | 125872 tok/s step 12735/19560 | loss 3.504382 (+3.61z)| norm 0.2986 (+1.05z)| lr 1.74e-04 | 4163.69 ms | 32.4% bf16 MFU | 125874 tok/s step 12736/19560 | loss 3.325688 (-1.06z)| norm 0.3101 (+1.68z)| lr 1.74e-04 | 4156.87 ms | 32.5% bf16 MFU | 125886 tok/s step 12737/19560 | loss 3.350440 (-0.40z)| norm 0.3265 (+2.54z)| lr 1.74e-04 | 4157.26 ms | 32.5% bf16 MFU | 125898 tok/s step 12738/19560 | loss 3.342750 (-0.60z)| norm 0.2798 (-0.04z)| lr 1.74e-04 | 4162.60 ms | 32.4% bf16 MFU | 125901 tok/s step 12739/19560 | loss 3.350403 (-0.41z)| norm 0.2852 (+0.40z)| lr 1.74e-04 | 4165.05 ms | 32.4% bf16 MFU | 125899 tok/s step 12740/19560 | loss 3.417560 (+1.35z)| norm 0.2842 (+0.33z)| lr 1.74e-04 | 4153.34 ms | 32.5% bf16 MFU | 125916 tok/s step 12741/19560 | loss 3.349343 (-0.45z)| norm 0.3100 (+2.09z)| lr 1.74e-04 | 4165.37 ms | 32.4% bf16 MFU | 125914 tok/s step 12742/19560 | loss 3.378316 (+0.31z)| norm 0.2978 (+1.23z)| lr 1.74e-04 | 4153.95 ms | 32.5% bf16 MFU | 125929 tok/s step 12743/19560 | loss 3.382022 (+0.41z)| norm 0.2713 (-0.58z)| lr 1.74e-04 | 4161.92 ms | 32.4% bf16 MFU | 125931 tok/s step 12744/19560 | loss 3.469513 (+2.62z)| norm 0.3317 (+3.37z)| lr 1.74e-04 | 4167.92 ms | 32.4% bf16 MFU | 125924 tok/s step 12745/19560 | loss 3.355089 (-0.31z)| norm 0.2808 (+0.04z)| lr 1.73e-04 | 4174.42 ms | 32.3% bf16 MFU | 125908 tok/s step 12746/19560 | loss 3.313605 (-1.36z)| norm 0.2790 (-0.08z)| lr 1.73e-04 | 
4157.03 ms | 32.5% bf16 MFU | 125918 tok/s step 12747/19560 | loss 3.382778 (+0.42z)| norm 0.3035 (+1.50z)| lr 1.73e-04 | 4162.92 ms | 32.4% bf16 MFU | 125919 tok/s step 12748/19560 | loss 3.318838 (-1.21z)| norm 0.2799 (-0.03z)| lr 1.73e-04 | 4166.93 ms | 32.4% bf16 MFU | 125914 tok/s step 12749/19560 | loss 3.279331 (-2.18z)| norm 0.2785 (-0.13z)| lr 1.73e-04 | 4156.86 ms | 32.5% bf16 MFU | 125925 tok/s step 12750/19560 | loss 3.338861 (-0.68z)| norm 0.2654 (-0.97z)| lr 1.73e-04 | 4161.32 ms | 32.4% bf16 MFU | 125928 tok/s val loss 3.330235 evaluating HellaSwag: 0/1256 evaluating HellaSwag: 10/1256 evaluating HellaSwag: 20/1256 evaluating HellaSwag: 30/1256 evaluating HellaSwag: 40/1256 evaluating HellaSwag: 50/1256 evaluating HellaSwag: 60/1256 evaluating HellaSwag: 70/1256 evaluating HellaSwag: 80/1256 evaluating HellaSwag: 90/1256 evaluating HellaSwag: 100/1256 evaluating HellaSwag: 110/1256 evaluating HellaSwag: 120/1256 evaluating HellaSwag: 130/1256 evaluating HellaSwag: 140/1256 evaluating HellaSwag: 150/1256 evaluating HellaSwag: 160/1256 evaluating HellaSwag: 170/1256 evaluating HellaSwag: 180/1256 evaluating HellaSwag: 190/1256 evaluating HellaSwag: 200/1256 evaluating HellaSwag: 210/1256 evaluating HellaSwag: 220/1256 evaluating HellaSwag: 230/1256 evaluating HellaSwag: 240/1256 evaluating HellaSwag: 250/1256 evaluating HellaSwag: 260/1256 evaluating HellaSwag: 270/1256 evaluating HellaSwag: 280/1256 evaluating HellaSwag: 290/1256 evaluating HellaSwag: 300/1256 evaluating HellaSwag: 310/1256 evaluating HellaSwag: 320/1256 evaluating HellaSwag: 330/1256 evaluating HellaSwag: 340/1256 evaluating HellaSwag: 350/1256 evaluating HellaSwag: 360/1256 evaluating HellaSwag: 370/1256 evaluating HellaSwag: 380/1256 evaluating HellaSwag: 390/1256 evaluating HellaSwag: 400/1256 evaluating HellaSwag: 410/1256 evaluating HellaSwag: 420/1256 evaluating HellaSwag: 430/1256 evaluating HellaSwag: 440/1256 evaluating HellaSwag: 450/1256 evaluating HellaSwag: 460/1256 
evaluating HellaSwag: 470/1256 evaluating HellaSwag: 480/1256 evaluating HellaSwag: 490/1256 evaluating HellaSwag: 500/1256 evaluating HellaSwag: 510/1256 evaluating HellaSwag: 520/1256 evaluating HellaSwag: 530/1256 evaluating HellaSwag: 540/1256 evaluating HellaSwag: 550/1256 evaluating HellaSwag: 560/1256 evaluating HellaSwag: 570/1256 evaluating HellaSwag: 580/1256 evaluating HellaSwag: 590/1256 evaluating HellaSwag: 600/1256 evaluating HellaSwag: 610/1256 evaluating HellaSwag: 620/1256 evaluating HellaSwag: 630/1256 evaluating HellaSwag: 640/1256 evaluating HellaSwag: 650/1256 evaluating HellaSwag: 660/1256 evaluating HellaSwag: 670/1256 evaluating HellaSwag: 680/1256 evaluating HellaSwag: 690/1256 evaluating HellaSwag: 700/1256 evaluating HellaSwag: 710/1256 evaluating HellaSwag: 720/1256 evaluating HellaSwag: 730/1256 evaluating HellaSwag: 740/1256 evaluating HellaSwag: 750/1256 evaluating HellaSwag: 760/1256 evaluating HellaSwag: 770/1256 evaluating HellaSwag: 780/1256 evaluating HellaSwag: 790/1256 evaluating HellaSwag: 800/1256 evaluating HellaSwag: 810/1256 evaluating HellaSwag: 820/1256 evaluating HellaSwag: 830/1256 evaluating HellaSwag: 840/1256 evaluating HellaSwag: 850/1256 evaluating HellaSwag: 860/1256 evaluating HellaSwag: 870/1256 evaluating HellaSwag: 880/1256 evaluating HellaSwag: 890/1256 evaluating HellaSwag: 900/1256 evaluating HellaSwag: 910/1256 evaluating HellaSwag: 920/1256 evaluating HellaSwag: 930/1256 evaluating HellaSwag: 940/1256 evaluating HellaSwag: 950/1256 evaluating HellaSwag: 960/1256 evaluating HellaSwag: 970/1256 evaluating HellaSwag: 980/1256 evaluating HellaSwag: 990/1256 evaluating HellaSwag: 1000/1256 evaluating HellaSwag: 1010/1256 evaluating HellaSwag: 1020/1256 evaluating HellaSwag: 1030/1256 evaluating HellaSwag: 1040/1256 evaluating HellaSwag: 1050/1256 evaluating HellaSwag: 1060/1256 evaluating HellaSwag: 1070/1256 evaluating HellaSwag: 1080/1256 evaluating HellaSwag: 1090/1256 evaluating HellaSwag: 1100/1256 
evaluating HellaSwag: 1110/1256 evaluating HellaSwag: 1120/1256 evaluating HellaSwag: 1130/1256 evaluating HellaSwag: 1140/1256 evaluating HellaSwag: 1150/1256 evaluating HellaSwag: 1160/1256 evaluating HellaSwag: 1170/1256 evaluating HellaSwag: 1180/1256 evaluating HellaSwag: 1190/1256 evaluating HellaSwag: 1200/1256 evaluating HellaSwag: 1210/1256 evaluating HellaSwag: 1220/1256 evaluating HellaSwag: 1230/1256 evaluating HellaSwag: 1240/1256 evaluating HellaSwag: 1250/1256 HellaSwag: 2975/10042 = 0.296256 step 12751/19560 | loss 3.335767 (-0.75z)| norm 0.2780 (-0.16z)| lr 1.73e-04 | 4161.36 ms | 32.4% bf16 MFU | 125931 tok/s step 12752/19560 | loss 3.395787 (+0.82z)| norm 0.2723 (-0.53z)| lr 1.73e-04 | 4172.59 ms | 32.4% bf16 MFU | 125917 tok/s step 12753/19560 | loss 3.312900 (-1.35z)| norm 0.2685 (-0.76z)| lr 1.73e-04 | 4162.80 ms | 32.4% bf16 MFU | 125919 tok/s step 12754/19560 | loss 3.322968 (-1.07z)| norm 0.2692 (-0.71z)| lr 1.73e-04 | 4170.33 ms | 32.4% bf16 MFU | 125909 tok/s step 12755/19560 | loss 3.375508 (+0.29z)| norm 0.2929 (+0.81z)| lr 1.73e-04 | 4176.96 ms | 32.3% bf16 MFU | 125889 tok/s step 12756/19560 | loss 3.383630 (+0.50z)| norm 0.2829 (+0.16z)| lr 1.73e-04 | 4177.37 ms | 32.3% bf16 MFU | 125870 tok/s step 12757/19560 | loss 3.359149 (-0.15z)| norm 0.2840 (+0.23z)| lr 1.73e-04 | 4168.83 ms | 32.4% bf16 MFU | 125865 tok/s step 12758/19560 | loss 3.375593 (+0.29z)| norm 0.2562 (-1.54z)| lr 1.73e-04 | 4158.84 ms | 32.5% bf16 MFU | 125875 tok/s step 12759/19560 | loss 3.308619 (-1.47z)| norm 0.2818 (+0.10z)| lr 1.73e-04 | 4163.36 ms | 32.4% bf16 MFU | 125878 tok/s step 12760/19560 | loss 3.308178 (-1.45z)| norm 0.2646 (-1.00z)| lr 1.73e-04 | 4184.17 ms | 32.3% bf16 MFU | 125849 tok/s step 12761/19560 | loss 3.323982 (-1.02z)| norm 0.2679 (-0.78z)| lr 1.73e-04 | 4171.38 ms | 32.4% bf16 MFU | 125841 tok/s step 12762/19560 | loss 3.334974 (-0.72z)| norm 0.2726 (-0.48z)| lr 1.73e-04 | 4167.44 ms | 32.4% bf16 MFU | 125839 tok/s step 12763/19560 | 
loss 3.374382 (+0.31z)| norm 0.2684 (-0.75z)| lr 1.73e-04 | 4155.87 ms | 32.5% bf16 MFU | 125855 tok/s step 12764/19560 | loss 3.364004 (+0.03z)| norm 0.2842 (+0.28z)| lr 1.73e-04 | 4166.87 ms | 32.4% bf16 MFU | 125853 tok/s step 12765/19560 | loss 3.279922 (-2.15z)| norm 0.2520 (-1.78z)| lr 1.73e-04 | 4166.91 ms | 32.4% bf16 MFU | 125852 tok/s step 12766/19560 | loss 3.335123 (-0.69z)| norm 0.2637 (-1.01z)| lr 1.73e-04 | 4159.26 ms | 32.5% bf16 MFU | 125862 tok/s step 12767/19560 | loss 3.319492 (-1.09z)| norm 0.2724 (-0.45z)| lr 1.72e-04 | 4164.63 ms | 32.4% bf16 MFU | 125863 tok/s step 12768/19560 | loss 3.302490 (-1.51z)| norm 0.2824 (+0.18z)| lr 1.72e-04 | 4161.26 ms | 32.4% bf16 MFU | 125870 tok/s step 12769/19560 | loss 3.325138 (-0.90z)| norm 0.2720 (-0.47z)| lr 1.72e-04 | 4164.44 ms | 32.4% bf16 MFU | 125871 tok/s step 12770/19560 | loss 3.340209 (-0.50z)| norm 0.2742 (-0.33z)| lr 1.72e-04 | 4170.72 ms | 32.4% bf16 MFU | 125863 tok/s step 12771/19560 | loss 3.342693 (-0.43z)| norm 0.2671 (-0.77z)| lr 1.72e-04 | 4173.49 ms | 32.4% bf16 MFU | 125851 tok/s step 12772/19560 | loss 3.367516 (+0.21z)| norm 0.2575 (-1.37z)| lr 1.72e-04 | 4160.31 ms | 32.5% bf16 MFU | 125859 tok/s step 12773/19560 | loss 3.413875 (+1.40z)| norm 0.2968 (+1.10z)| lr 1.72e-04 | 4179.03 ms | 32.3% bf16 MFU | 125839 tok/s step 12774/19560 | loss 3.333597 (-0.67z)| norm 0.2781 (-0.07z)| lr 1.72e-04 | 4154.95 ms | 32.5% bf16 MFU | 125856 tok/s step 12775/19560 | loss 3.378037 (+0.47z)| norm 0.2654 (-0.86z)| lr 1.72e-04 | 4175.97 ms | 32.3% bf16 MFU | 125841 tok/s step 12776/19560 | loss 3.337033 (-0.59z)| norm 0.2746 (-0.28z)| lr 1.72e-04 | 4181.44 ms | 32.3% bf16 MFU | 125818 tok/s step 12777/19560 | loss 3.335318 (-0.63z)| norm 0.2841 (+0.32z)| lr 1.72e-04 | 4169.32 ms | 32.4% bf16 MFU | 125815 tok/s step 12778/19560 | loss 3.475857 (+2.89z)| norm 0.2662 (-0.81z)| lr 1.72e-04 | 4158.46 ms | 32.5% bf16 MFU | 125828 tok/s step 12779/19560 | loss 3.331849 (-0.71z)| norm 0.2745 (-0.28z)| 
lr 1.72e-04 | 4163.75 ms | 32.4% bf16 MFU | 125832 tok/s step 12780/19560 | loss 3.375115 (+0.38z)| norm 0.2546 (-1.54z)| lr 1.72e-04 | 4162.93 ms | 32.4% bf16 MFU | 125838 tok/s step 12781/19560 | loss 3.361863 (+0.05z)| norm 0.2820 (+0.19z)| lr 1.72e-04 | 4170.37 ms | 32.4% bf16 MFU | 125832 tok/s step 12782/19560 | loss 3.324290 (-0.88z)| norm 0.2563 (-1.42z)| lr 1.72e-04 | 4161.01 ms | 32.4% bf16 MFU | 125840 tok/s step 12783/19560 | loss 3.332477 (-0.67z)| norm 0.2734 (-0.34z)| lr 1.72e-04 | 4163.47 ms | 32.4% bf16 MFU | 125845 tok/s step 12784/19560 | loss 3.325953 (-0.82z)| norm 0.2732 (-0.38z)| lr 1.72e-04 | 4165.39 ms | 32.4% bf16 MFU | 125846 tok/s step 12785/19560 | loss 3.297275 (-1.51z)| norm 0.2785 (-0.05z)| lr 1.72e-04 | 4157.90 ms | 32.5% bf16 MFU | 125858 tok/s step 12786/19560 | loss 3.412086 (+1.30z)| norm 0.2762 (-0.21z)| lr 1.72e-04 | 4159.17 ms | 32.5% bf16 MFU | 125868 tok/s step 12787/19560 | loss 3.307302 (-1.27z)| norm 0.2776 (-0.10z)| lr 1.72e-04 | 4186.32 ms | 32.3% bf16 MFU | 125837 tok/s step 12788/19560 | loss 3.402904 (+1.07z)| norm 0.2889 (+0.62z)| lr 1.72e-04 | 4162.84 ms | 32.4% bf16 MFU | 125842 tok/s step 12789/19560 | loss 3.274364 (-2.03z)| norm 0.2694 (-0.66z)| lr 1.71e-04 | 4167.08 ms | 32.4% bf16 MFU | 125841 tok/s step 12790/19560 | loss 3.348111 (-0.25z)| norm 0.2871 (+0.51z)| lr 1.71e-04 | 4205.78 ms | 32.1% bf16 MFU | 125782 tok/s step 12791/19560 | loss 3.371982 (+0.32z)| norm 0.2727 (-0.44z)| lr 1.71e-04 | 4169.07 ms | 32.4% bf16 MFU | 125780 tok/s step 12792/19560 | loss 3.327986 (-0.74z)| norm 0.2916 (+0.79z)| lr 1.71e-04 | 4168.74 ms | 32.4% bf16 MFU | 125780 tok/s step 12793/19560 | loss 3.432511 (+1.79z)| norm 0.2895 (+0.66z)| lr 1.71e-04 | 4167.32 ms | 32.4% bf16 MFU | 125781 tok/s step 12794/19560 | loss 3.373160 (+0.35z)| norm 0.2950 (+1.00z)| lr 1.71e-04 | 4167.85 ms | 32.4% bf16 MFU | 125782 tok/s step 12795/19560 | loss 3.335687 (-0.54z)| norm 0.2975 (+1.16z)| lr 1.71e-04 | 4162.92 ms | 32.4% bf16 MFU | 
125790 tok/s step 12796/19560 | loss 3.345076 (-0.32z)| norm 0.2902 (+0.68z)| lr 1.71e-04 | 4167.91 ms | 32.4% bf16 MFU | 125790 tok/s step 12797/19560 | loss 3.342060 (-0.39z)| norm 0.2947 (+0.97z)| lr 1.71e-04 | 4157.55 ms | 32.5% bf16 MFU | 125806 tok/s step 12798/19560 | loss 3.276093 (-1.93z)| norm 0.2794 (-0.04z)| lr 1.71e-04 | 4168.71 ms | 32.4% bf16 MFU | 125804 tok/s step 12799/19560 | loss 3.299999 (-1.34z)| norm 0.2664 (-0.88z)| lr 1.71e-04 | 4164.14 ms | 32.4% bf16 MFU | 125809 tok/s step 12800/19560 | loss 3.324621 (-0.75z)| norm 0.2842 (+0.28z)| lr 1.71e-04 | 4217.71 ms | 32.0% bf16 MFU | 125734 tok/s step 12801/19560 | loss 3.350363 (-0.13z)| norm 0.2605 (-1.26z)| lr 1.71e-04 | 4163.28 ms | 32.4% bf16 MFU | 125744 tok/s step 12802/19560 | loss 3.290935 (-1.52z)| norm 0.2629 (-1.09z)| lr 1.71e-04 | 4177.58 ms | 32.3% bf16 MFU | 125731 tok/s step 12803/19560 | loss 3.318672 (-0.85z)| norm 0.2652 (-0.92z)| lr 1.71e-04 | 4163.31 ms | 32.4% bf16 MFU | 125741 tok/s step 12804/19560 | loss 3.262469 (-2.13z)| norm 0.2546 (-1.60z)| lr 1.71e-04 | 4161.02 ms | 32.4% bf16 MFU | 125754 tok/s step 12805/19560 | loss 3.320107 (-0.78z)| norm 0.2667 (-0.80z)| lr 1.71e-04 | 4168.76 ms | 32.4% bf16 MFU | 125755 tok/s step 12806/19560 | loss 3.386031 (+0.75z)| norm 0.2729 (-0.40z)| lr 1.71e-04 | 4160.89 ms | 32.4% bf16 MFU | 125767 tok/s step 12807/19560 | loss 3.314901 (-0.90z)| norm 0.2758 (-0.22z)| lr 1.71e-04 | 4159.21 ms | 32.5% bf16 MFU | 125782 tok/s step 12808/19560 | loss 3.335437 (-0.41z)| norm 0.2599 (-1.23z)| lr 1.71e-04 | 4154.99 ms | 32.5% bf16 MFU | 125802 tok/s step 12809/19560 | loss 3.406647 (+1.24z)| norm 0.2660 (-0.82z)| lr 1.71e-04 | 4165.61 ms | 32.4% bf16 MFU | 125805 tok/s step 12810/19560 | loss 3.308332 (-1.03z)| norm 0.2901 (+0.73z)| lr 1.71e-04 | 4168.00 ms | 32.4% bf16 MFU | 125804 tok/s step 12811/19560 | loss 3.346280 (-0.15z)| norm 0.2704 (-0.55z)| lr 1.70e-04 | 4163.57 ms | 32.4% bf16 MFU | 125810 tok/s step 12812/19560 | loss 3.300095 
(-1.20z)| norm 0.2790 (-0.00z)| lr 1.70e-04 | 4167.13 ms | 32.4% bf16 MFU | 125810 tok/s step 12813/19560 | loss 3.355385 (+0.09z)| norm 0.2775 (-0.10z)| lr 1.70e-04 | 4165.95 ms | 32.4% bf16 MFU | 125812 tok/s step 12814/19560 | loss 3.332558 (-0.43z)| norm 0.2584 (-1.33z)| lr 1.70e-04 | 4167.69 ms | 32.4% bf16 MFU | 125811 tok/s step 12815/19560 | loss 3.309415 (-0.96z)| norm 0.2652 (-0.90z)| lr 1.70e-04 | 4163.65 ms | 32.4% bf16 MFU | 125817 tok/s step 12816/19560 | loss 3.361046 (+0.24z)| norm 0.2834 (+0.28z)| lr 1.70e-04 | 4187.68 ms | 32.2% bf16 MFU | 125786 tok/s step 12817/19560 | loss 3.311129 (-0.91z)| norm 0.2682 (-0.70z)| lr 1.70e-04 | 4170.43 ms | 32.4% bf16 MFU | 125782 tok/s step 12818/19560 | loss 3.395825 (+1.06z)| norm 0.2832 (+0.28z)| lr 1.70e-04 | 4160.40 ms | 32.5% bf16 MFU | 125794 tok/s step 12819/19560 | loss 3.342585 (-0.16z)| norm 0.2942 (+0.99z)| lr 1.70e-04 | 4154.28 ms | 32.5% bf16 MFU | 125815 tok/s step 12820/19560 | loss 3.340871 (-0.20z)| norm 0.2683 (-0.71z)| lr 1.70e-04 | 4176.13 ms | 32.3% bf16 MFU | 125801 tok/s step 12821/19560 | loss 3.315637 (-0.82z)| norm 0.3175 (+2.47z)| lr 1.70e-04 | 4157.98 ms | 32.5% bf16 MFU | 125816 tok/s step 12822/19560 | loss 3.321552 (-0.67z)| norm 0.2637 (-1.04z)| lr 1.70e-04 | 4154.85 ms | 32.5% bf16 MFU | 125834 tok/s step 12823/19560 | loss 3.352521 (+0.12z)| norm 0.2989 (+1.27z)| lr 1.70e-04 | 4155.96 ms | 32.5% bf16 MFU | 125850 tok/s step 12824/19560 | loss 3.340773 (-0.18z)| norm 0.2814 (+0.11z)| lr 1.70e-04 | 4164.60 ms | 32.4% bf16 MFU | 125852 tok/s step 12825/19560 | loss 3.322579 (-0.64z)| norm 0.3059 (+1.70z)| lr 1.70e-04 | 4162.99 ms | 32.4% bf16 MFU | 125857 tok/s step 12826/19560 | loss 3.357890 (+0.25z)| norm 0.3027 (+1.47z)| lr 1.70e-04 | 4160.55 ms | 32.5% bf16 MFU | 125865 tok/s step 12827/19560 | loss 3.334075 (-0.35z)| norm 0.2695 (-0.70z)| lr 1.70e-04 | 4163.50 ms | 32.4% bf16 MFU | 125868 tok/s step 12828/19560 | loss 3.358333 (+0.26z)| norm 0.2942 (+0.90z)| lr 1.70e-04 | 
4183.11 ms | 32.3% bf16 MFU | 125841 tok/s step 12829/19560 | loss 3.356754 (+0.22z)| norm 0.2889 (+0.54z)| lr 1.70e-04 | 4170.46 ms | 32.4% bf16 MFU | 125835 tok/s step 12830/19560 | loss 3.337835 (-0.26z)| norm 0.3000 (+1.25z)| lr 1.70e-04 | 4168.91 ms | 32.4% bf16 MFU | 125831 tok/s step 12831/19560 | loss 3.298494 (-1.23z)| norm 0.2740 (-0.45z)| lr 1.70e-04 | 4168.68 ms | 32.4% bf16 MFU | 125828 tok/s step 12832/19560 | loss 3.320308 (-0.67z)| norm 0.2840 (+0.20z)| lr 1.70e-04 | 4161.99 ms | 32.4% bf16 MFU | 125835 tok/s step 12833/19560 | loss 3.392560 (+1.12z)| norm 0.3064 (+1.64z)| lr 1.69e-04 | 4160.33 ms | 32.5% bf16 MFU | 125844 tok/s step 12834/19560 | loss 3.363421 (+0.40z)| norm 0.2900 (+0.57z)| lr 1.69e-04 | 4155.31 ms | 32.5% bf16 MFU | 125861 tok/s step 12835/19560 | loss 3.377282 (+0.74z)| norm 0.2854 (+0.26z)| lr 1.69e-04 | 4160.51 ms | 32.5% bf16 MFU | 125868 tok/s step 12836/19560 | loss 3.346264 (-0.03z)| norm 0.2951 (+0.88z)| lr 1.69e-04 | 4160.21 ms | 32.5% bf16 MFU | 125876 tok/s step 12837/19560 | loss 3.314489 (-0.81z)| norm 0.2792 (-0.16z)| lr 1.69e-04 | 4162.76 ms | 32.4% bf16 MFU | 125880 tok/s step 12838/19560 | loss 3.346019 (-0.03z)| norm 0.3023 (+1.32z)| lr 1.69e-04 | 4166.06 ms | 32.4% bf16 MFU | 125878 tok/s step 12839/19560 | loss 3.338225 (-0.23z)| norm 0.2687 (-0.84z)| lr 1.69e-04 | 4164.14 ms | 32.4% bf16 MFU | 125880 tok/s step 12840/19560 | loss 3.342149 (-0.13z)| norm 0.2980 (+1.03z)| lr 1.69e-04 | 4165.60 ms | 32.4% bf16 MFU | 125879 tok/s step 12841/19560 | loss 3.327764 (-0.48z)| norm 0.2761 (-0.37z)| lr 1.69e-04 | 4151.92 ms | 32.5% bf16 MFU | 125898 tok/s step 12842/19560 | loss 3.337386 (-0.23z)| norm 0.2768 (-0.33z)| lr 1.69e-04 | 4159.32 ms | 32.5% bf16 MFU | 125906 tok/s step 12843/19560 | loss 3.320081 (-0.67z)| norm 0.2771 (-0.31z)| lr 1.69e-04 | 4160.47 ms | 32.5% bf16 MFU | 125912 tok/s step 12844/19560 | loss 3.311007 (-0.92z)| norm 0.2795 (-0.15z)| lr 1.69e-04 | 4156.83 ms | 32.5% bf16 MFU | 125922 tok/s step 
12845/19560 | loss 3.274502 (-1.82z)| norm 0.2750 (-0.43z)| lr 1.69e-04 | 4162.75 ms | 32.4% bf16 MFU | 125924 tok/s step 12846/19560 | loss 3.339157 (-0.18z)| norm 0.2951 (+0.87z)| lr 1.69e-04 | 4155.21 ms | 32.5% bf16 MFU | 125936 tok/s step 12847/19560 | loss 3.282760 (-1.58z)| norm 0.2789 (-0.17z)| lr 1.69e-04 | 4159.25 ms | 32.5% bf16 MFU | 125942 tok/s step 12848/19560 | loss 3.342113 (-0.07z)| norm 0.2839 (+0.15z)| lr 1.69e-04 | 4160.46 ms | 32.5% bf16 MFU | 125946 tok/s step 12849/19560 | loss 3.308886 (-0.91z)| norm 0.2561 (-1.64z)| lr 1.69e-04 | 4154.92 ms | 32.5% bf16 MFU | 125958 tok/s step 12850/19560 | loss 3.330317 (-0.37z)| norm 0.2917 (+0.70z)| lr 1.69e-04 | 4169.63 ms | 32.4% bf16 MFU | 125947 tok/s step 12851/19560 | loss 3.371659 (+0.67z)| norm 0.2797 (-0.09z)| lr 1.69e-04 | 4147.84 ms | 32.6% bf16 MFU | 125970 tok/s step 12852/19560 | loss 3.309902 (-0.88z)| norm 0.2662 (-0.96z)| lr 1.69e-04 | 4156.88 ms | 32.5% bf16 MFU | 125977 tok/s step 12853/19560 | loss 3.290289 (-1.35z)| norm 0.2966 (+1.05z)| lr 1.69e-04 | 4205.77 ms | 32.1% bf16 MFU | 125911 tok/s step 12854/19560 | loss 3.354022 (+0.26z)| norm 0.2908 (+0.66z)| lr 1.69e-04 | 4167.08 ms | 32.4% bf16 MFU | 125907 tok/s step 12855/19560 | loss 3.380324 (+0.94z)| norm 0.2858 (+0.33z)| lr 1.68e-04 | 4155.62 ms | 32.5% bf16 MFU | 125920 tok/s step 12856/19560 | loss 3.358131 (+0.36z)| norm 0.3163 (+2.30z)| lr 1.68e-04 | 4161.89 ms | 32.4% bf16 MFU | 125922 tok/s step 12857/19560 | loss 3.331261 (-0.31z)| norm 0.2710 (-0.66z)| lr 1.68e-04 | 4149.04 ms | 32.5% bf16 MFU | 125944 tok/s step 12858/19560 | loss 3.340790 (-0.07z)| norm 0.3040 (+1.47z)| lr 1.68e-04 | 4149.58 ms | 32.5% bf16 MFU | 125964 tok/s step 12859/19560 | loss 3.348536 (+0.12z)| norm 0.2715 (-0.62z)| lr 1.68e-04 | 4148.66 ms | 32.5% bf16 MFU | 125985 tok/s step 12860/19560 | loss 3.350268 (+0.17z)| norm 0.2773 (-0.24z)| lr 1.68e-04 | 4158.85 ms | 32.5% bf16 MFU | 125989 tok/s step 12861/19560 | loss 3.311735 (-0.82z)| norm 
0.2774 (-0.22z)| lr 1.68e-04 | 4148.63 ms | 32.5% bf16 MFU | 126008 tok/s step 12862/19560 | loss 3.340652 (-0.07z)| norm 0.2872 (+0.44z)| lr 1.68e-04 | 4160.05 ms | 32.5% bf16 MFU | 126009 tok/s step 12863/19560 | loss 3.389434 (+1.30z)| norm 0.3039 (+1.54z)| lr 1.68e-04 | 4156.53 ms | 32.5% bf16 MFU | 126016 tok/s step 12864/19560 | loss 3.354848 (+0.34z)| norm 0.2913 (+0.72z)| lr 1.68e-04 | 4163.23 ms | 32.4% bf16 MFU | 126012 tok/s step 12865/19560 | loss 3.318772 (-0.65z)| norm 0.2663 (-0.96z)| lr 1.68e-04 | 4158.01 ms | 32.5% bf16 MFU | 126016 tok/s step 12866/19560 | loss 3.340537 (-0.05z)| norm 0.2734 (-0.46z)| lr 1.68e-04 | 4157.48 ms | 32.5% bf16 MFU | 126020 tok/s step 12867/19560 | loss 3.304797 (-1.02z)| norm 0.2768 (-0.22z)| lr 1.68e-04 | 4189.25 ms | 32.2% bf16 MFU | 125977 tok/s step 12868/19560 | loss 3.295354 (-1.27z)| norm 0.2692 (-0.74z)| lr 1.68e-04 | 4157.84 ms | 32.5% bf16 MFU | 125983 tok/s step 12869/19560 | loss 3.375641 (+0.95z)| norm 0.2763 (-0.23z)| lr 1.68e-04 | 4155.95 ms | 32.5% bf16 MFU | 125991 tok/s step 12870/19560 | loss 3.363797 (+0.63z)| norm 0.2761 (-0.24z)| lr 1.68e-04 | 4154.15 ms | 32.5% bf16 MFU | 126002 tok/s step 12871/19560 | loss 3.303945 (-1.01z)| norm 0.2745 (-0.35z)| lr 1.68e-04 | 4153.67 ms | 32.5% bf16 MFU | 126013 tok/s step 12872/19560 | loss 3.327402 (-0.35z)| norm 0.2934 (+1.07z)| lr 1.68e-04 | 4158.23 ms | 32.5% bf16 MFU | 126017 tok/s step 12873/19560 | loss 3.298842 (-1.17z)| norm 0.2807 (+0.12z)| lr 1.68e-04 | 4152.19 ms | 32.5% bf16 MFU | 126029 tok/s step 12874/19560 | loss 3.341708 (+0.08z)| norm 0.2921 (+0.96z)| lr 1.68e-04 | 4173.54 ms | 32.4% bf16 MFU | 126009 tok/s step 12875/19560 | loss 3.309372 (-0.86z)| norm 0.2827 (+0.27z)| lr 1.68e-04 | 4153.89 ms | 32.5% bf16 MFU | 126019 tok/s step 12876/19560 | loss 3.323245 (-0.45z)| norm 0.2875 (+0.63z)| lr 1.68e-04 | 4148.54 ms | 32.5% bf16 MFU | 126037 tok/s step 12877/19560 | loss 3.341410 (+0.07z)| norm 0.2816 (+0.18z)| lr 1.68e-04 | 4159.71 ms | 
32.5% bf16 MFU | 126037 tok/s step 12878/19560 | loss 3.354543 (+0.46z)| norm 0.3089 (+2.19z)| lr 1.67e-04 | 4158.03 ms | 32.5% bf16 MFU | 126040 tok/s step 12879/19560 | loss 3.310125 (-0.85z)| norm 0.2950 (+1.13z)| lr 1.67e-04 | 4158.74 ms | 32.5% bf16 MFU | 126041 tok/s step 12880/19560 | loss 3.300378 (-1.13z)| norm 0.3537 (+4.90z)| lr 1.67e-04 | 4159.76 ms | 32.5% bf16 MFU | 126041 tok/s step 12881/19560 | loss 3.304017 (-1.01z)| norm 0.2914 (+0.73z)| lr 1.67e-04 | 4160.98 ms | 32.4% bf16 MFU | 126039 tok/s step 12882/19560 | loss 3.322677 (-0.46z)| norm 0.3114 (+2.02z)| lr 1.67e-04 | 4242.25 ms | 31.8% bf16 MFU | 125917 tok/s step 12883/19560 | loss 3.366079 (+0.83z)| norm 0.2924 (+0.77z)| lr 1.67e-04 | 4242.27 ms | 31.8% bf16 MFU | 125800 tok/s step 12884/19560 | loss 3.335814 (-0.06z)| norm 0.3045 (+1.54z)| lr 1.67e-04 | 4206.92 ms | 32.1% bf16 MFU | 125741 tok/s step 12885/19560 | loss 3.335947 (-0.05z)| norm 0.2992 (+1.18z)| lr 1.67e-04 | 4251.60 ms | 31.8% bf16 MFU | 125620 tok/s step 12886/19560 | loss 3.363730 (+0.79z)| norm 0.2824 (+0.07z)| lr 1.67e-04 | 4158.43 ms | 32.5% bf16 MFU | 125643 tok/s step 12887/19560 | loss 3.371799 (+1.02z)| norm 0.2950 (+0.89z)| lr 1.67e-04 | 4152.97 ms | 32.5% bf16 MFU | 125673 tok/s step 12888/19560 | loss 3.332724 (-0.16z)| norm 0.2879 (+0.41z)| lr 1.67e-04 | 4151.36 ms | 32.5% bf16 MFU | 125704 tok/s step 12889/19560 | loss 3.329276 (-0.27z)| norm 0.2787 (-0.19z)| lr 1.67e-04 | 4156.10 ms | 32.5% bf16 MFU | 125726 tok/s step 12890/19560 | loss 3.369988 (+0.95z)| norm 0.2693 (-0.81z)| lr 1.67e-04 | 4148.07 ms | 32.5% bf16 MFU | 125760 tok/s step 12891/19560 | loss 3.380051 (+1.25z)| norm 0.2759 (-0.38z)| lr 1.67e-04 | 4153.92 ms | 32.5% bf16 MFU | 125782 tok/s step 12892/19560 | loss 3.362702 (+0.73z)| norm 0.3007 (+1.23z)| lr 1.67e-04 | 4149.31 ms | 32.5% bf16 MFU | 125811 tok/s step 12893/19560 | loss 3.340583 (+0.05z)| norm 0.2649 (-1.13z)| lr 1.67e-04 | 4147.08 ms | 32.6% bf16 MFU | 125842 tok/s step 12894/19560 
| loss 3.325316 (-0.41z)| norm 0.3054 (+1.53z)| lr 1.67e-04 | 4150.62 ms | 32.5% bf16 MFU | 125865 tok/s step 12895/19560 | loss 3.352652 (+0.41z)| norm 0.2784 (-0.26z)| lr 1.67e-04 | 4146.92 ms | 32.6% bf16 MFU | 125894 tok/s step 12896/19560 | loss 3.348325 (+0.27z)| norm 0.2818 (-0.04z)| lr 1.67e-04 | 4145.47 ms | 32.6% bf16 MFU | 125922 tok/s step 12897/19560 | loss 3.371915 (+0.98z)| norm 0.2729 (-0.62z)| lr 1.67e-04 | 4148.16 ms | 32.5% bf16 MFU | 125946 tok/s step 12898/19560 | loss 3.260874 (-2.34z)| norm 0.2942 (+0.77z)| lr 1.67e-04 | 4149.61 ms | 32.5% bf16 MFU | 125966 tok/s step 12899/19560 | loss 3.349158 (+0.29z)| norm 0.2687 (-0.91z)| lr 1.67e-04 | 4150.36 ms | 32.5% bf16 MFU | 125984 tok/s step 12900/19560 | loss 3.286033 (-1.56z)| norm 0.2774 (-0.35z)| lr 1.66e-04 | 4151.49 ms | 32.5% bf16 MFU | 125999 tok/s step 12901/19560 | loss 3.394039 (+1.66z)| norm 0.2641 (-1.22z)| lr 1.66e-04 | 4151.31 ms | 32.5% bf16 MFU | 126014 tok/s step 12902/19560 | loss 3.325760 (-0.38z)| norm 0.2781 (-0.29z)| lr 1.66e-04 | 4149.56 ms | 32.5% bf16 MFU | 126030 tok/s step 12903/19560 | loss 3.334988 (-0.09z)| norm 0.2874 (+0.32z)| lr 1.66e-04 | 4151.31 ms | 32.5% bf16 MFU | 126044 tok/s step 12904/19560 | loss 3.300642 (-1.11z)| norm 0.2574 (-1.66z)| lr 1.66e-04 | 4146.58 ms | 32.6% bf16 MFU | 126063 tok/s step 12905/19560 | loss 3.345115 (+0.22z)| norm 0.2652 (-1.13z)| lr 1.66e-04 | 4147.10 ms | 32.6% bf16 MFU | 126081 tok/s step 12906/19560 | loss 3.381591 (+1.41z)| norm 0.2623 (-1.31z)| lr 1.66e-04 | 4154.11 ms | 32.5% bf16 MFU | 126088 tok/s step 12907/19560 | loss 3.324047 (-0.42z)| norm 0.2827 (+0.03z)| lr 1.66e-04 | 4148.59 ms | 32.5% bf16 MFU | 126102 tok/s step 12908/19560 | loss 3.353033 (+0.51z)| norm 0.2858 (+0.22z)| lr 1.66e-04 | 4144.25 ms | 32.6% bf16 MFU | 126123 tok/s step 12909/19560 | loss 3.358397 (+0.69z)| norm 0.2634 (-1.26z)| lr 1.66e-04 | 4152.18 ms | 32.5% bf16 MFU | 126130 tok/s step 12910/19560 | loss 3.387882 (+1.60z)| norm 0.2919 (+0.62z)| 
lr 1.66e-04 | 4150.19 ms | 32.5% bf16 MFU | 126140 tok/s step 12911/19560 | loss 3.347466 (+0.32z)| norm 0.2732 (-0.63z)| lr 1.66e-04 | 4151.35 ms | 32.5% bf16 MFU | 126148 tok/s step 12912/19560 | loss 3.306782 (-0.97z)| norm 0.2683 (-0.96z)| lr 1.66e-04 | 4150.15 ms | 32.5% bf16 MFU | 126157 tok/s step 12913/19560 | loss 3.294635 (-1.35z)| norm 0.2667 (-1.05z)| lr 1.66e-04 | 4147.94 ms | 32.6% bf16 MFU | 126169 tok/s step 12914/19560 | loss 3.331800 (-0.16z)| norm 0.2606 (-1.44z)| lr 1.66e-04 | 4147.96 ms | 32.6% bf16 MFU | 126180 tok/s step 12915/19560 | loss 3.327400 (-0.31z)| norm 0.2623 (-1.31z)| lr 1.66e-04 | 4149.18 ms | 32.5% bf16 MFU | 126189 tok/s step 12916/19560 | loss 3.221131 (-3.57z)| norm 0.2676 (-0.95z)| lr 1.66e-04 | 4149.85 ms | 32.5% bf16 MFU | 126197 tok/s step 12917/19560 | loss 3.353714 (+0.56z)| norm 0.2775 (-0.31z)| lr 1.66e-04 | 4152.04 ms | 32.5% bf16 MFU | 126200 tok/s step 12918/19560 | loss 3.327161 (-0.28z)| norm 0.2653 (-1.10z)| lr 1.66e-04 | 4151.41 ms | 32.5% bf16 MFU | 126205 tok/s step 12919/19560 | loss 3.319073 (-0.52z)| norm 0.2729 (-0.60z)| lr 1.66e-04 | 4150.04 ms | 32.5% bf16 MFU | 126211 tok/s step 12920/19560 | loss 3.388935 (+1.67z)| norm 0.2667 (-0.99z)| lr 1.66e-04 | 4148.91 ms | 32.5% bf16 MFU | 126219 tok/s step 12921/19560 | loss 3.303034 (-1.04z)| norm 0.2534 (-1.81z)| lr 1.66e-04 | 4147.52 ms | 32.6% bf16 MFU | 126229 tok/s step 12922/19560 | loss 3.322630 (-0.39z)| norm 0.2737 (-0.50z)| lr 1.65e-04 | 4145.64 ms | 32.6% bf16 MFU | 126241 tok/s step 12923/19560 | loss 3.238509 (-3.01z)| norm 0.2719 (-0.60z)| lr 1.65e-04 | 4148.42 ms | 32.5% bf16 MFU | 126248 tok/s step 12924/19560 | loss 3.319198 (-0.45z)| norm 0.2537 (-1.74z)| lr 1.65e-04 | 4146.87 ms | 32.6% bf16 MFU | 126257 tok/s step 12925/19560 | loss 3.340112 (+0.21z)| norm 0.2622 (-1.18z)| lr 1.65e-04 | 4150.90 ms | 32.5% bf16 MFU | 126259 tok/s step 12926/19560 | loss 3.322959 (-0.35z)| norm 0.2819 (+0.08z)| lr 1.65e-04 | 4145.54 ms | 32.6% bf16 MFU | 
126270 tok/s step 12927/19560 | loss 3.338191 (+0.13z)| norm 0.2931 (+0.78z)| lr 1.65e-04 | 4147.21 ms | 32.6% bf16 MFU | 126277 tok/s step 12928/19560 | loss 3.360867 (+0.84z)| norm 0.2796 (-0.08z)| lr 1.65e-04 | 4151.81 ms | 32.5% bf16 MFU | 126277 tok/s step 12929/19560 | loss 3.328155 (-0.20z)| norm 0.2815 (+0.03z)| lr 1.65e-04 | 4150.72 ms | 32.5% bf16 MFU | 126279 tok/s step 12930/19560 | loss 3.339978 (+0.17z)| norm 0.2835 (+0.15z)| lr 1.65e-04 | 4151.67 ms | 32.5% bf16 MFU | 126279 tok/s step 12931/19560 | loss 3.334206 (-0.02z)| norm 0.2719 (-0.61z)| lr 1.65e-04 | 4149.13 ms | 32.5% bf16 MFU | 126283 tok/s step 12932/19560 | loss 3.354055 (+0.61z)| norm 0.2882 (+0.44z)| lr 1.65e-04 | 4149.00 ms | 32.5% bf16 MFU | 126288 tok/s step 12933/19560 | loss 3.362185 (+0.86z)| norm 0.2705 (-0.72z)| lr 1.65e-04 | 4151.09 ms | 32.5% bf16 MFU | 126288 tok/s step 12934/19560 | loss 3.328275 (-0.24z)| norm 0.2814 (-0.01z)| lr 1.65e-04 | 4149.20 ms | 32.5% bf16 MFU | 126292 tok/s step 12935/19560 | loss 3.367733 (+1.06z)| norm 0.2649 (-1.09z)| lr 1.65e-04 | 4150.52 ms | 32.5% bf16 MFU | 126293 tok/s step 12936/19560 | loss 3.366482 (+1.00z)| norm 0.2637 (-1.17z)| lr 1.65e-04 | 4150.76 ms | 32.5% bf16 MFU | 126294 tok/s step 12937/19560 | loss 3.400470 (+2.14z)| norm 0.2810 (-0.04z)| lr 1.65e-04 | 4150.02 ms | 32.5% bf16 MFU | 126296 tok/s step 12938/19560 | loss 3.303161 (-1.09z)| norm 0.2634 (-1.18z)| lr 1.65e-04 | 4148.88 ms | 32.5% bf16 MFU | 126300 tok/s step 12939/19560 | loss 3.408081 (+2.32z)| norm 0.2824 (+0.05z)| lr 1.65e-04 | 4148.55 ms | 32.5% bf16 MFU | 126304 tok/s step 12940/19560 | loss 3.344616 (+0.25z)| norm 0.2663 (-0.99z)| lr 1.65e-04 | 4147.28 ms | 32.6% bf16 MFU | 126309 tok/s step 12941/19560 | loss 3.284322 (-1.68z)| norm 0.2643 (-1.11z)| lr 1.65e-04 | 4148.13 ms | 32.5% bf16 MFU | 126313 tok/s step 12942/19560 | loss 3.358488 (+0.71z)| norm 0.2802 (-0.09z)| lr 1.65e-04 | 4149.18 ms | 32.5% bf16 MFU | 126316 tok/s step 12943/19560 | loss 3.342621 
(+0.19z)| norm 0.2588 (-1.48z)| lr 1.65e-04 | 4145.90 ms | 32.6% bf16 MFU | 126323 tok/s step 12944/19560 | loss 3.304268 (-1.03z)| norm 0.2745 (-0.45z)| lr 1.65e-04 | 4150.31 ms | 32.5% bf16 MFU | 126323 tok/s step 12945/19560 | loss 3.365320 (+0.92z)| norm 0.2725 (-0.59z)| lr 1.64e-04 | 4146.16 ms | 32.6% bf16 MFU | 126329 tok/s step 12946/19560 | loss 3.229877 (-3.30z)| norm 0.2783 (-0.20z)| lr 1.64e-04 | 4151.29 ms | 32.5% bf16 MFU | 126328 tok/s step 12947/19560 | loss 3.320882 (-0.45z)| norm 0.2819 (+0.04z)| lr 1.64e-04 | 4151.51 ms | 32.5% bf16 MFU | 126326 tok/s step 12948/19560 | loss 3.305526 (-0.92z)| norm 0.2755 (-0.38z)| lr 1.64e-04 | 4147.37 ms | 32.6% bf16 MFU | 126330 tok/s step 12949/19560 | loss 3.358036 (+0.71z)| norm 0.2691 (-0.80z)| lr 1.64e-04 | 4148.09 ms | 32.5% bf16 MFU | 126333 tok/s step 12950/19560 | loss 3.395791 (+1.84z)| norm 0.2658 (-1.02z)| lr 1.64e-04 | 4148.26 ms | 32.5% bf16 MFU | 126336 tok/s step 12951/19560 | loss 3.410651 (+2.24z)| norm 0.2933 (+0.83z)| lr 1.64e-04 | 4150.10 ms | 32.5% bf16 MFU | 126336 tok/s step 12952/19560 | loss 3.325263 (-0.33z)| norm 0.2656 (-1.03z)| lr 1.64e-04 | 4151.96 ms | 32.5% bf16 MFU | 126333 tok/s step 12953/19560 | loss 3.350588 (+0.43z)| norm 0.2773 (-0.22z)| lr 1.64e-04 | 4147.58 ms | 32.6% bf16 MFU | 126336 tok/s step 12954/19560 | loss 3.317041 (-0.58z)| norm 0.2853 (+0.33z)| lr 1.64e-04 | 4144.59 ms | 32.6% bf16 MFU | 126345 tok/s step 12955/19560 | loss 3.333477 (-0.08z)| norm 0.2548 (-1.74z)| lr 1.64e-04 | 4151.55 ms | 32.5% bf16 MFU | 126342 tok/s step 12956/19560 | loss 3.311677 (-0.73z)| norm 0.2805 (+0.02z)| lr 1.64e-04 | 4148.12 ms | 32.5% bf16 MFU | 126344 tok/s step 12957/19560 | loss 3.353621 (+0.54z)| norm 0.2780 (-0.14z)| lr 1.64e-04 | 4149.19 ms | 32.5% bf16 MFU | 126345 tok/s step 12958/19560 | loss 3.351226 (+0.46z)| norm 0.2745 (-0.37z)| lr 1.64e-04 | 4149.14 ms | 32.5% bf16 MFU | 126346 tok/s step 12959/19560 | loss 3.299263 (-1.10z)| norm 0.2726 (-0.50z)| lr 1.64e-04 | 
4150.15 ms | 32.5% bf16 MFU | 126345 tok/s step 12960/19560 | loss 3.337193 (+0.04z)| norm 0.2711 (-0.60z)| lr 1.64e-04 | 4147.48 ms | 32.6% bf16 MFU | 126348 tok/s step 12961/19560 | loss 3.361838 (+0.79z)| norm 0.2671 (-0.86z)| lr 1.64e-04 | 4151.22 ms | 32.5% bf16 MFU | 126346 tok/s step 12962/19560 | loss 3.312814 (-0.69z)| norm 0.2809 (+0.10z)| lr 1.64e-04 | 4148.80 ms | 32.5% bf16 MFU | 126347 tok/s step 12963/19560 | loss 3.434494 (+2.92z)| norm 0.2950 (+1.07z)| lr 1.64e-04 | 4146.40 ms | 32.6% bf16 MFU | 126352 tok/s step 12964/19560 | loss 3.333617 (-0.06z)| norm 0.2651 (-0.98z)| lr 1.64e-04 | 4146.29 ms | 32.6% bf16 MFU | 126357 tok/s step 12965/19560 | loss 3.332954 (-0.09z)| norm 0.2776 (-0.11z)| lr 1.64e-04 | 4148.51 ms | 32.5% bf16 MFU | 126358 tok/s step 12966/19560 | loss 3.312068 (-0.70z)| norm 0.2661 (-0.90z)| lr 1.64e-04 | 4150.03 ms | 32.5% bf16 MFU | 126357 tok/s step 12967/19560 | loss 3.391223 (+1.62z)| norm 0.2800 (+0.06z)| lr 1.63e-04 | 4148.52 ms | 32.5% bf16 MFU | 126358 tok/s step 12968/19560 | loss 3.332797 (-0.09z)| norm 0.2751 (-0.27z)| lr 1.63e-04 | 4146.85 ms | 32.6% bf16 MFU | 126361 tok/s step 12969/19560 | loss 3.408999 (+2.09z)| norm 0.2565 (-1.55z)| lr 1.63e-04 | 4148.06 ms | 32.5% bf16 MFU | 126363 tok/s step 12970/19560 | loss 3.287766 (-1.38z)| norm 0.2926 (+0.95z)| lr 1.63e-04 | 4145.15 ms | 32.6% bf16 MFU | 126369 tok/s step 12971/19560 | loss 3.278342 (-1.63z)| norm 0.2680 (-0.75z)| lr 1.63e-04 | 4147.24 ms | 32.6% bf16 MFU | 126371 tok/s step 12972/19560 | loss 3.325576 (-0.29z)| norm 0.2631 (-1.08z)| lr 1.63e-04 | 4149.24 ms | 32.5% bf16 MFU | 126371 tok/s step 12973/19560 | loss 3.324331 (-0.35z)| norm 0.2871 (+0.58z)| lr 1.63e-04 | 4149.52 ms | 32.5% bf16 MFU | 126370 tok/s step 12974/19560 | loss 3.230524 (-2.91z)| norm 0.2679 (-0.74z)| lr 1.63e-04 | 4150.22 ms | 32.5% bf16 MFU | 126368 tok/s step 12975/19560 | loss 3.387393 (+1.42z)| norm 0.2829 (+0.30z)| lr 1.63e-04 | 4148.35 ms | 32.5% bf16 MFU | 126368 tok/s step 
12976/19560 | loss 3.386540 (+1.37z)| norm 0.2726 (-0.40z)| lr 1.63e-04 | 4148.52 ms | 32.5% bf16 MFU | 126369 tok/s step 12977/19560 | loss 3.391804 (+1.49z)| norm 0.2843 (+0.39z)| lr 1.63e-04 | 4146.24 ms | 32.6% bf16 MFU | 126373 tok/s step 12978/19560 | loss 3.323946 (-0.36z)| norm 0.3009 (+1.53z)| lr 1.63e-04 | 4147.39 ms | 32.6% bf16 MFU | 126375 tok/s step 12979/19560 | loss 3.317418 (-0.53z)| norm 0.2852 (+0.44z)| lr 1.63e-04 | 4162.50 ms | 32.4% bf16 MFU | 126354 tok/s step 12980/19560 | loss 3.394464 (+1.55z)| norm 0.3258 (+3.11z)| lr 1.63e-04 | 4147.55 ms | 32.6% bf16 MFU | 126357 tok/s step 12981/19560 | loss 3.345891 (+0.22z)| norm 0.2740 (-0.34z)| lr 1.63e-04 | 4147.51 ms | 32.6% bf16 MFU | 126359 tok/s step 12982/19560 | loss 3.345646 (+0.21z)| norm 0.3007 (+1.44z)| lr 1.63e-04 | 4145.64 ms | 32.6% bf16 MFU | 126365 tok/s step 12983/19560 | loss 3.317624 (-0.54z)| norm 0.2718 (-0.48z)| lr 1.63e-04 | 4147.61 ms | 32.6% bf16 MFU | 126367 tok/s step 12984/19560 | loss 3.350235 (+0.36z)| norm 0.3111 (+2.15z)| lr 1.63e-04 | 4152.21 ms | 32.5% bf16 MFU | 126362 tok/s step 12985/19560 | loss 3.348669 (+0.31z)| norm 0.3201 (+2.66z)| lr 1.63e-04 | 4149.24 ms | 32.5% bf16 MFU | 126362 tok/s step 12986/19560 | loss 3.352179 (+0.40z)| norm 0.3040 (+1.61z)| lr 1.63e-04 | 4145.81 ms | 32.6% bf16 MFU | 126367 tok/s step 12987/19560 | loss 3.364365 (+0.73z)| norm 0.2903 (+0.70z)| lr 1.63e-04 | 4146.18 ms | 32.6% bf16 MFU | 126371 tok/s step 12988/19560 | loss 3.406357 (+1.85z)| norm 0.2878 (+0.53z)| lr 1.63e-04 | 4149.23 ms | 32.5% bf16 MFU | 126370 tok/s step 12989/19560 | loss 3.297444 (-1.10z)| norm 0.2760 (-0.24z)| lr 1.63e-04 | 4150.95 ms | 32.5% bf16 MFU | 126367 tok/s step 12990/19560 | loss 3.369750 (+0.85z)| norm 0.2740 (-0.36z)| lr 1.62e-04 | 4156.65 ms | 32.5% bf16 MFU | 126355 tok/s step 12991/19560 | loss 3.371434 (+0.90z)| norm 0.2671 (-0.80z)| lr 1.62e-04 | 4151.21 ms | 32.5% bf16 MFU | 126352 tok/s step 12992/19560 | loss 3.357598 (+0.53z)| norm 
0.2679 (-0.74z)| lr 1.62e-04 | 4151.69 ms | 32.5% bf16 MFU | 126349 tok/s step 12993/19560 | loss 3.323795 (-0.39z)| norm 0.2901 (+0.71z)| lr 1.62e-04 | 4151.44 ms | 32.5% bf16 MFU | 126346 tok/s step 12994/19560 | loss 3.313179 (-0.67z)| norm 0.2538 (-1.65z)| lr 1.62e-04 | 4152.63 ms | 32.5% bf16 MFU | 126341 tok/s step 12995/19560 | loss 3.329148 (-0.24z)| norm 0.2918 (+0.82z)| lr 1.62e-04 | 4148.26 ms | 32.5% bf16 MFU | 126344 tok/s step 12996/19560 | loss 3.314237 (-0.65z)| norm 0.2758 (-0.23z)| lr 1.62e-04 | 4152.47 ms | 32.5% bf16 MFU | 126340 tok/s step 12997/19560 | loss 3.325402 (-0.34z)| norm 0.2596 (-1.26z)| lr 1.62e-04 | 4151.21 ms | 32.5% bf16 MFU | 126337 tok/s step 12998/19560 | loss 3.266726 (-1.90z)| norm 0.2811 (+0.12z)| lr 1.62e-04 | 4143.80 ms | 32.6% bf16 MFU | 126347 tok/s step 12999/19560 | loss 3.375739 (+1.03z)| norm 0.2671 (-0.78z)| lr 1.62e-04 | 4148.78 ms | 32.5% bf16 MFU | 126348 tok/s step 13000/19560 | loss 3.313624 (-0.64z)| norm 0.2785 (-0.03z)| lr 1.62e-04 | 4147.30 ms | 32.6% bf16 MFU | 126351 tok/s val loss 3.324930 evaluating HellaSwag: 0/1256 evaluating HellaSwag: 10/1256 evaluating HellaSwag: 20/1256 evaluating HellaSwag: 30/1256 evaluating HellaSwag: 40/1256 evaluating HellaSwag: 50/1256 evaluating HellaSwag: 60/1256 evaluating HellaSwag: 70/1256 evaluating HellaSwag: 80/1256 evaluating HellaSwag: 90/1256 evaluating HellaSwag: 100/1256 evaluating HellaSwag: 110/1256 evaluating HellaSwag: 120/1256 evaluating HellaSwag: 130/1256 evaluating HellaSwag: 140/1256 evaluating HellaSwag: 150/1256 evaluating HellaSwag: 160/1256 evaluating HellaSwag: 170/1256 evaluating HellaSwag: 180/1256 evaluating HellaSwag: 190/1256 evaluating HellaSwag: 200/1256 evaluating HellaSwag: 210/1256 evaluating HellaSwag: 220/1256 evaluating HellaSwag: 230/1256 evaluating HellaSwag: 240/1256 evaluating HellaSwag: 250/1256 evaluating HellaSwag: 260/1256 evaluating HellaSwag: 270/1256 evaluating HellaSwag: 280/1256 evaluating HellaSwag: 290/1256 evaluating 
HellaSwag: 300/1256 evaluating HellaSwag: 310/1256 evaluating HellaSwag: 320/1256 evaluating HellaSwag: 330/1256 evaluating HellaSwag: 340/1256 evaluating HellaSwag: 350/1256 evaluating HellaSwag: 360/1256 evaluating HellaSwag: 370/1256 evaluating HellaSwag: 380/1256 evaluating HellaSwag: 390/1256 evaluating HellaSwag: 400/1256 evaluating HellaSwag: 410/1256 evaluating HellaSwag: 420/1256 evaluating HellaSwag: 430/1256 evaluating HellaSwag: 440/1256 evaluating HellaSwag: 450/1256 evaluating HellaSwag: 460/1256 evaluating HellaSwag: 470/1256 evaluating HellaSwag: 480/1256 evaluating HellaSwag: 490/1256 evaluating HellaSwag: 500/1256 evaluating HellaSwag: 510/1256 evaluating HellaSwag: 520/1256 evaluating HellaSwag: 530/1256 evaluating HellaSwag: 540/1256 evaluating HellaSwag: 550/1256 evaluating HellaSwag: 560/1256 evaluating HellaSwag: 570/1256 evaluating HellaSwag: 580/1256 evaluating HellaSwag: 590/1256 evaluating HellaSwag: 600/1256 evaluating HellaSwag: 610/1256 evaluating HellaSwag: 620/1256 evaluating HellaSwag: 630/1256 evaluating HellaSwag: 640/1256 evaluating HellaSwag: 650/1256 evaluating HellaSwag: 660/1256 evaluating HellaSwag: 670/1256 evaluating HellaSwag: 680/1256 evaluating HellaSwag: 690/1256 evaluating HellaSwag: 700/1256 evaluating HellaSwag: 710/1256 evaluating HellaSwag: 720/1256 evaluating HellaSwag: 730/1256 evaluating HellaSwag: 740/1256 evaluating HellaSwag: 750/1256 evaluating HellaSwag: 760/1256 evaluating HellaSwag: 770/1256 evaluating HellaSwag: 780/1256 evaluating HellaSwag: 790/1256 evaluating HellaSwag: 800/1256 evaluating HellaSwag: 810/1256 evaluating HellaSwag: 820/1256 evaluating HellaSwag: 830/1256 evaluating HellaSwag: 840/1256 evaluating HellaSwag: 850/1256 evaluating HellaSwag: 860/1256 evaluating HellaSwag: 870/1256 evaluating HellaSwag: 880/1256 evaluating HellaSwag: 890/1256 evaluating HellaSwag: 900/1256 evaluating HellaSwag: 910/1256 evaluating HellaSwag: 920/1256 evaluating HellaSwag: 930/1256 evaluating HellaSwag: 
940/1256 evaluating HellaSwag: 950/1256 evaluating HellaSwag: 960/1256 evaluating HellaSwag: 970/1256 evaluating HellaSwag: 980/1256 evaluating HellaSwag: 990/1256 evaluating HellaSwag: 1000/1256 evaluating HellaSwag: 1010/1256 evaluating HellaSwag: 1020/1256 evaluating HellaSwag: 1030/1256 evaluating HellaSwag: 1040/1256 evaluating HellaSwag: 1050/1256 evaluating HellaSwag: 1060/1256 evaluating HellaSwag: 1070/1256 evaluating HellaSwag: 1080/1256 evaluating HellaSwag: 1090/1256 evaluating HellaSwag: 1100/1256 evaluating HellaSwag: 1110/1256 evaluating HellaSwag: 1120/1256 evaluating HellaSwag: 1130/1256 evaluating HellaSwag: 1140/1256 evaluating HellaSwag: 1150/1256 evaluating HellaSwag: 1160/1256 evaluating HellaSwag: 1170/1256 evaluating HellaSwag: 1180/1256 evaluating HellaSwag: 1190/1256 evaluating HellaSwag: 1200/1256 evaluating HellaSwag: 1210/1256 evaluating HellaSwag: 1220/1256 evaluating HellaSwag: 1230/1256 evaluating HellaSwag: 1240/1256 evaluating HellaSwag: 1250/1256 HellaSwag: 2950/10042 = 0.293766 step 13001/19560 | loss 3.310660 (-0.73z)| norm 0.2713 (-0.49z)| lr 1.62e-04 | 4149.96 ms | 32.5% bf16 MFU | 126351 tok/s step 13002/19560 | loss 3.334171 (-0.09z)| norm 0.2812 (+0.15z)| lr 1.62e-04 | 4146.00 ms | 32.6% bf16 MFU | 126356 tok/s step 13003/19560 | loss 3.319199 (-0.50z)| norm 0.2893 (+0.67z)| lr 1.62e-04 | 4148.03 ms | 32.5% bf16 MFU | 126358 tok/s step 13004/19560 | loss 3.342956 (+0.14z)| norm 0.2768 (-0.13z)| lr 1.62e-04 | 4145.79 ms | 32.6% bf16 MFU | 126363 tok/s step 13005/19560 | loss 3.413122 (+1.99z)| norm 0.2967 (+1.14z)| lr 1.62e-04 | 4147.55 ms | 32.6% bf16 MFU | 126365 tok/s step 13006/19560 | loss 3.385828 (+1.25z)| norm 0.2693 (-0.61z)| lr 1.62e-04 | 4148.51 ms | 32.5% bf16 MFU | 126366 tok/s step 13007/19560 | loss 3.333621 (-0.14z)| norm 0.2787 (+0.01z)| lr 1.62e-04 | 4148.94 ms | 32.5% bf16 MFU | 126366 tok/s step 13008/19560 | loss 3.295526 (-1.15z)| norm 0.2836 (+0.42z)| lr 1.62e-04 | 4148.13 ms | 32.5% bf16 MFU | 126367 
tok/s step 13009/19560 | loss 3.338483 (-0.02z)| norm 0.2765 (-0.10z)| lr 1.62e-04 | 4149.90 ms | 32.5% bf16 MFU | 126366 tok/s step 13010/19560 | loss 3.316833 (-0.59z)| norm 0.2761 (-0.11z)| lr 1.62e-04 | 4146.56 ms | 32.6% bf16 MFU | 126370 tok/s step 13011/19560 | loss 3.308597 (-0.80z)| norm 0.2786 (+0.09z)| lr 1.62e-04 | 4150.77 ms | 32.5% bf16 MFU | 126367 tok/s step 13012/19560 | loss 3.296926 (-1.09z)| norm 0.2654 (-0.90z)| lr 1.61e-04 | 4148.64 ms | 32.5% bf16 MFU | 126367 tok/s step 13013/19560 | loss 3.277536 (-1.58z)| norm 0.2966 (+1.50z)| lr 1.61e-04 | 4147.83 ms | 32.6% bf16 MFU | 126369 tok/s step 13014/19560 | loss 3.321923 (-0.41z)| norm 0.2826 (+0.42z)| lr 1.61e-04 | 4148.56 ms | 32.5% bf16 MFU | 126369 tok/s step 13015/19560 | loss 3.328181 (-0.24z)| norm 0.2847 (+0.59z)| lr 1.61e-04 | 4147.51 ms | 32.6% bf16 MFU | 126371 tok/s step 13016/19560 | loss 3.328013 (-0.24z)| norm 0.2893 (+0.94z)| lr 1.61e-04 | 4150.81 ms | 32.5% bf16 MFU | 126368 tok/s step 13017/19560 | loss 3.315332 (-0.57z)| norm 0.2806 (+0.27z)| lr 1.61e-04 | 4150.09 ms | 32.5% bf16 MFU | 126366 tok/s step 13018/19560 | loss 3.347426 (+0.28z)| norm 0.2971 (+1.52z)| lr 1.61e-04 | 4149.26 ms | 32.5% bf16 MFU | 126366 tok/s step 13019/19560 | loss 3.343082 (+0.17z)| norm 0.2980 (+1.56z)| lr 1.61e-04 | 4147.71 ms | 32.6% bf16 MFU | 126368 tok/s step 13020/19560 | loss 3.348336 (+0.32z)| norm 0.2805 (+0.24z)| lr 1.61e-04 | 4147.97 ms | 32.6% bf16 MFU | 126369 tok/s step 13021/19560 | loss 3.398083 (+1.60z)| norm 0.2796 (+0.17z)| lr 1.61e-04 | 4144.61 ms | 32.6% bf16 MFU | 126376 tok/s step 13022/19560 | loss 3.334561 (-0.06z)| norm 0.2953 (+1.40z)| lr 1.61e-04 | 4147.90 ms | 32.6% bf16 MFU | 126377 tok/s step 13023/19560 | loss 3.338750 (+0.05z)| norm 0.2764 (-0.08z)| lr 1.61e-04 | 4147.00 ms | 32.6% bf16 MFU | 126379 tok/s step 13024/19560 | loss 3.297995 (-1.00z)| norm 0.2658 (-0.88z)| lr 1.61e-04 | 4149.22 ms | 32.5% bf16 MFU | 126378 tok/s step 13025/19560 | loss 3.346794 
(+0.28z)| norm 0.2998 (+1.72z)| lr 1.61e-04 | 4164.26 ms | 32.4% bf16 MFU | 126354 tok/s step 13026/19560 | loss 3.377486 (+1.07z)| norm 0.2687 (-0.66z)| lr 1.61e-04 | 4144.61 ms | 32.6% bf16 MFU | 126362 tok/s step 13027/19560 | loss 3.293849 (-1.13z)| norm 0.2968 (+1.48z)| lr 1.61e-04 | 4148.88 ms | 32.5% bf16 MFU | 126362 tok/s step 13028/19560 | loss 3.352504 (+0.41z)| norm 0.2906 (+1.00z)| lr 1.61e-04 | 4148.38 ms | 32.5% bf16 MFU | 126363 tok/s step 13029/19560 | loss 3.275991 (-1.60z)| norm 0.2644 (-1.00z)| lr 1.61e-04 | 4147.57 ms | 32.6% bf16 MFU | 126365 tok/s step 13030/19560 | loss 3.324581 (-0.31z)| norm 0.2840 (+0.49z)| lr 1.61e-04 | 4149.22 ms | 32.5% bf16 MFU | 126365 tok/s step 13031/19560 | loss 3.375338 (+1.02z)| norm 0.2556 (-1.64z)| lr 1.61e-04 | 4147.54 ms | 32.6% bf16 MFU | 126367 tok/s step 13032/19560 | loss 3.323459 (-0.35z)| norm 0.2694 (-0.61z)| lr 1.61e-04 | 4148.28 ms | 32.5% bf16 MFU | 126368 tok/s step 13033/19560 | loss 3.386765 (+1.31z)| norm 0.2742 (-0.25z)| lr 1.61e-04 | 4146.09 ms | 32.6% bf16 MFU | 126372 tok/s step 13034/19560 | loss 3.282015 (-1.43z)| norm 0.2473 (-2.26z)| lr 1.61e-04 | 4149.98 ms | 32.5% bf16 MFU | 126371 tok/s step 13035/19560 | loss 3.297260 (-1.02z)| norm 0.2621 (-1.13z)| lr 1.60e-04 | 4147.48 ms | 32.6% bf16 MFU | 126373 tok/s step 13036/19560 | loss 3.335789 (-0.01z)| norm 0.2595 (-1.31z)| lr 1.60e-04 | 4147.16 ms | 32.6% bf16 MFU | 126375 tok/s step 13037/19560 | loss 3.315657 (-0.52z)| norm 0.2765 (-0.05z)| lr 1.60e-04 | 4147.68 ms | 32.6% bf16 MFU | 126377 tok/s step 13038/19560 | loss 3.362099 (+0.70z)| norm 0.2774 (+0.03z)| lr 1.60e-04 | 4146.27 ms | 32.6% bf16 MFU | 126380 tok/s step 13039/19560 | loss 3.324610 (-0.28z)| norm 0.2807 (+0.27z)| lr 1.60e-04 | 4147.12 ms | 32.6% bf16 MFU | 126382 tok/s step 13040/19560 | loss 3.357174 (+0.57z)| norm 0.2518 (-1.87z)| lr 1.60e-04 | 4146.01 ms | 32.6% bf16 MFU | 126386 tok/s step 13041/19560 | loss 3.297404 (-1.01z)| norm 0.2500 (-1.97z)| lr 1.60e-04 | 
4147.14 ms | 32.6% bf16 MFU | 126388 tok/s step 13042/19560 | loss 3.313098 (-0.59z)| norm 0.2668 (-0.74z)| lr 1.60e-04 | 4152.16 ms | 32.5% bf16 MFU | 126382 tok/s step 13043/19560 | loss 3.296929 (-1.01z)| norm 0.2484 (-2.06z)| lr 1.60e-04 | 4146.87 ms | 32.6% bf16 MFU | 126384 tok/s step 13044/19560 | loss 3.288840 (-1.27z)| norm 0.2717 (-0.37z)| lr 1.60e-04 | 4146.16 ms | 32.6% bf16 MFU | 126387 tok/s step 13045/19560 | loss 3.344219 (+0.23z)| norm 0.2724 (-0.32z)| lr 1.60e-04 | 4147.31 ms | 32.6% bf16 MFU | 126389 tok/s step 13046/19560 | loss 3.309763 (-0.70z)| norm 0.2737 (-0.22z)| lr 1.60e-04 | 4145.07 ms | 32.6% bf16 MFU | 126394 tok/s step 13047/19560 | loss 3.431168 (+2.50z)| norm 0.2813 (+0.32z)| lr 1.60e-04 | 4146.99 ms | 32.6% bf16 MFU | 126395 tok/s step 13048/19560 | loss 3.285927 (-1.31z)| norm 0.2800 (+0.22z)| lr 1.60e-04 | 4147.96 ms | 32.6% bf16 MFU | 126395 tok/s step 13049/19560 | loss 3.316037 (-0.52z)| norm 0.2693 (-0.57z)| lr 1.60e-04 | 4150.96 ms | 32.5% bf16 MFU | 126391 tok/s step 13050/19560 | loss 3.337839 (+0.05z)| norm 0.2823 (+0.38z)| lr 1.60e-04 | 4147.35 ms | 32.6% bf16 MFU | 126392 tok/s step 13051/19560 | loss 3.286737 (-1.34z)| norm 0.2782 (+0.07z)| lr 1.60e-04 | 4145.23 ms | 32.6% bf16 MFU | 126396 tok/s step 13052/19560 | loss 3.322581 (-0.37z)| norm 0.2716 (-0.43z)| lr 1.60e-04 | 4146.36 ms | 32.6% bf16 MFU | 126399 tok/s step 13053/19560 | loss 3.295550 (-1.09z)| norm 0.2933 (+1.17z)| lr 1.60e-04 | 4145.56 ms | 32.6% bf16 MFU | 126402 tok/s step 13054/19560 | loss 3.310364 (-0.68z)| norm 0.2497 (-2.03z)| lr 1.60e-04 | 4149.14 ms | 32.5% bf16 MFU | 126400 tok/s step 13055/19560 | loss 3.289945 (-1.21z)| norm 0.2748 (-0.18z)| lr 1.60e-04 | 4145.75 ms | 32.6% bf16 MFU | 126404 tok/s step 13056/19560 | loss 3.326838 (-0.22z)| norm 0.2634 (-1.00z)| lr 1.60e-04 | 4145.13 ms | 32.6% bf16 MFU | 126407 tok/s step 13057/19560 | loss 3.300778 (-0.91z)| norm 0.2691 (-0.57z)| lr 1.60e-04 | 4146.81 ms | 32.6% bf16 MFU | 126409 tok/s step 
13058/19560 | loss 3.361504 (+0.70z)| norm 0.2757 (-0.09z)| lr 1.59e-04 | 4147.22 ms | 32.6% bf16 MFU | 126409 tok/s step 13059/19560 | loss 3.372384 (+0.98z)| norm 0.2642 (-0.93z)| lr 1.59e-04 | 4149.91 ms | 32.5% bf16 MFU | 126406 tok/s step 13060/19560 | loss 3.327842 (-0.20z)| norm 0.2561 (-1.49z)| lr 1.59e-04 | 4146.35 ms | 32.6% bf16 MFU | 126408 tok/s step 13061/19560 | loss 3.330719 (-0.11z)| norm 0.2725 (-0.30z)| lr 1.59e-04 | 4149.61 ms | 32.5% bf16 MFU | 126405 tok/s step 13062/19560 | loss 3.400034 (+1.69z)| norm 0.2726 (-0.29z)| lr 1.59e-04 | 4145.00 ms | 32.6% bf16 MFU | 126409 tok/s step 13063/19560 | loss 3.392146 (+1.47z)| norm 0.2795 (+0.20z)| lr 1.59e-04 | 4145.59 ms | 32.6% bf16 MFU | 126412 tok/s step 13064/19560 | loss 3.351244 (+0.41z)| norm 0.2919 (+1.09z)| lr 1.59e-04 | 4142.83 ms | 32.6% bf16 MFU | 126419 tok/s step 13065/19560 | loss 3.379509 (+1.16z)| norm 0.2702 (-0.48z)| lr 1.59e-04 | 4146.83 ms | 32.6% bf16 MFU | 126419 tok/s step 13066/19560 | loss 3.383779 (+1.25z)| norm 0.2961 (+1.38z)| lr 1.59e-04 | 4144.53 ms | 32.6% bf16 MFU | 126423 tok/s step 13067/19560 | loss 3.276721 (-1.54z)| norm 0.2630 (-1.01z)| lr 1.59e-04 | 4143.00 ms | 32.6% bf16 MFU | 126430 tok/s step 13068/19560 | loss 3.367238 (+0.84z)| norm 0.2868 (+0.71z)| lr 1.59e-04 | 4146.38 ms | 32.6% bf16 MFU | 126430 tok/s step 13069/19560 | loss 3.336000 (+0.01z)| norm 0.2913 (+1.01z)| lr 1.59e-04 | 4148.37 ms | 32.5% bf16 MFU | 126428 tok/s step 13070/19560 | loss 3.339178 (+0.10z)| norm 0.2799 (+0.19z)| lr 1.59e-04 | 4147.25 ms | 32.6% bf16 MFU | 126428 tok/s step 13071/19560 | loss 3.322217 (-0.35z)| norm 0.3044 (+1.92z)| lr 1.59e-04 | 4142.12 ms | 32.6% bf16 MFU | 126435 tok/s step 13072/19560 | loss 3.325626 (-0.26z)| norm 0.2731 (-0.32z)| lr 1.59e-04 | 4146.05 ms | 32.6% bf16 MFU | 126436 tok/s step 13073/19560 | loss 3.376610 (+1.09z)| norm 0.2883 (+0.76z)| lr 1.59e-04 | 4492.14 ms | 30.1% bf16 MFU | 125950 tok/s step 13074/19560 | loss 3.391330 (+1.48z)| norm 
0.2802 (+0.17z)| lr 1.59e-04 | 4244.56 ms | 31.8% bf16 MFU | 125828 tok/s step 13075/19560 | loss 3.310315 (-0.72z)| norm 0.2625 (-1.08z)| lr 1.59e-04 | 4174.18 ms | 32.3% bf16 MFU | 125817 tok/s step 13076/19560 | loss 3.400820 (+1.70z)| norm 0.2897 (+0.85z)| lr 1.59e-04 | 4165.58 ms | 32.4% bf16 MFU | 125819 tok/s step 13077/19560 | loss 3.373993 (+0.97z)| norm 0.2650 (-0.90z)| lr 1.59e-04 | 4164.40 ms | 32.4% bf16 MFU | 125823 tok/s step 13078/19560 | loss 3.300272 (-0.99z)| norm 0.2761 (-0.12z)| lr 1.59e-04 | 4215.49 ms | 32.0% bf16 MFU | 125751 tok/s step 13079/19560 | loss 3.370431 (+0.92z)| norm 0.2891 (+0.81z)| lr 1.59e-04 | 4209.84 ms | 32.1% bf16 MFU | 125690 tok/s step 13080/19560 | loss 3.314486 (-0.60z)| norm 0.2527 (-1.76z)| lr 1.58e-04 | 4160.17 ms | 32.5% bf16 MFU | 125707 tok/s step 13081/19560 | loss 3.332455 (-0.11z)| norm 0.3045 (+1.86z)| lr 1.58e-04 | 4228.15 ms | 31.9% bf16 MFU | 125621 tok/s step 13082/19560 | loss 3.331443 (-0.14z)| norm 0.2682 (-0.66z)| lr 1.58e-04 | 4175.09 ms | 32.3% bf16 MFU | 125619 tok/s step 13083/19560 | loss 3.359380 (+0.62z)| norm 0.2841 (+0.43z)| lr 1.58e-04 | 4159.68 ms | 32.5% bf16 MFU | 125640 tok/s step 13084/19560 | loss 3.291121 (-1.23z)| norm 0.2676 (-0.72z)| lr 1.58e-04 | 4166.98 ms | 32.4% bf16 MFU | 125649 tok/s step 13085/19560 | loss 3.299860 (-0.98z)| norm 0.2843 (+0.45z)| lr 1.58e-04 | 4221.16 ms | 32.0% bf16 MFU | 125577 tok/s step 13086/19560 | loss 3.401322 (+1.73z)| norm 0.2917 (+0.96z)| lr 1.58e-04 | 4157.05 ms | 32.5% bf16 MFU | 125604 tok/s step 13087/19560 | loss 3.407644 (+1.86z)| norm 0.2670 (-0.76z)| lr 1.58e-04 | 4159.38 ms | 32.5% bf16 MFU | 125626 tok/s step 13088/19560 | loss 3.323478 (-0.36z)| norm 0.2945 (+1.14z)| lr 1.58e-04 | 4156.54 ms | 32.5% bf16 MFU | 125652 tok/s step 13089/19560 | loss 3.351151 (+0.37z)| norm 0.2699 (-0.58z)| lr 1.58e-04 | 4155.88 ms | 32.5% bf16 MFU | 125677 tok/s step 13090/19560 | loss 3.327826 (-0.25z)| norm 0.2662 (-0.82z)| lr 1.58e-04 | 4164.28 ms | 
32.4% bf16 MFU | 125688 tok/s step 13091/19560 | loss 3.318465 (-0.49z)| norm 0.2731 (-0.34z)| lr 1.58e-04 | 4198.25 ms | 32.2% bf16 MFU | 125648 tok/s step 13092/19560 | loss 3.398438 (+1.66z)| norm 0.2721 (-0.41z)| lr 1.58e-04 | 4156.53 ms | 32.5% bf16 MFU | 125672 tok/s step 13093/19560 | loss 3.315240 (-0.58z)| norm 0.2770 (-0.06z)| lr 1.58e-04 | 4151.56 ms | 32.5% bf16 MFU | 125703 tok/s step 13094/19560 | loss 3.398514 (+1.63z)| norm 0.2790 (+0.07z)| lr 1.58e-04 | 4158.83 ms | 32.5% bf16 MFU | 125721 tok/s step 13095/19560 | loss 3.360646 (+0.63z)| norm 0.2763 (-0.12z)| lr 1.58e-04 | 4150.26 ms | 32.5% bf16 MFU | 125752 tok/s step 13096/19560 | loss 3.355686 (+0.49z)| norm 0.2758 (-0.16z)| lr 1.58e-04 | 4163.36 ms | 32.4% bf16 MFU | 125760 tok/s step 13097/19560 | loss 3.307112 (-0.80z)| norm 0.2614 (-1.17z)| lr 1.58e-04 | 4157.60 ms | 32.5% bf16 MFU | 125778 tok/s step 13098/19560 | loss 3.253476 (-2.22z)| norm 0.2800 (+0.15z)| lr 1.58e-04 | 4157.56 ms | 32.5% bf16 MFU | 125794 tok/s step 13099/19560 | loss 3.408190 (+1.89z)| norm 0.2774 (-0.05z)| lr 1.58e-04 | 4158.40 ms | 32.5% bf16 MFU | 125808 tok/s step 13100/19560 | loss 3.309628 (-0.73z)| norm 0.2612 (-1.19z)| lr 1.58e-04 | 4151.92 ms | 32.5% bf16 MFU | 125832 tok/s step 13101/19560 | loss 3.395743 (+1.53z)| norm 0.2609 (-1.19z)| lr 1.58e-04 | 4159.22 ms | 32.5% bf16 MFU | 125843 tok/s step 13102/19560 | loss 3.275832 (-1.68z)| norm 0.2541 (-1.65z)| lr 1.58e-04 | 4175.82 ms | 32.3% bf16 MFU | 125828 tok/s step 13103/19560 | loss 3.343180 (+0.15z)| norm 0.2522 (-1.74z)| lr 1.57e-04 | 4161.45 ms | 32.4% bf16 MFU | 125836 tok/s step 13104/19560 | loss 3.306626 (-0.83z)| norm 0.2603 (-1.17z)| lr 1.57e-04 | 4154.67 ms | 32.5% bf16 MFU | 125854 tok/s step 13105/19560 | loss 3.365302 (+0.78z)| norm 0.2532 (-1.63z)| lr 1.57e-04 | 4157.28 ms | 32.5% bf16 MFU | 125867 tok/s step 13106/19560 | loss 3.320781 (-0.44z)| norm 0.2654 (-0.78z)| lr 1.57e-04 | 4167.74 ms | 32.4% bf16 MFU | 125863 tok/s step 13107/19560 
| loss 3.376252 (+1.06z)| norm 0.2564 (-1.38z)| lr 1.57e-04 | 4167.31 ms | 32.4% bf16 MFU | 125861 tok/s step 13108/19560 | loss 3.342391 (+0.15z)| norm 0.2698 (-0.46z)| lr 1.57e-04 | 4153.56 ms | 32.5% bf16 MFU | 125879 tok/s step 13109/19560 | loss 3.389834 (+1.44z)| norm 0.2719 (-0.31z)| lr 1.57e-04 | 4158.97 ms | 32.5% bf16 MFU | 125888 tok/s step 13110/19560 | loss 3.352068 (+0.40z)| norm 0.2492 (-1.89z)| lr 1.57e-04 | 4163.13 ms | 32.4% bf16 MFU | 125891 tok/s step 13111/19560 | loss 3.369454 (+0.87z)| norm 0.2789 (+0.22z)| lr 1.57e-04 | 4157.53 ms | 32.5% bf16 MFU | 125901 tok/s step 13112/19560 | loss 3.323038 (-0.40z)| norm 0.2628 (-0.92z)| lr 1.57e-04 | 4155.31 ms | 32.5% bf16 MFU | 125915 tok/s step 13113/19560 | loss 3.355139 (+0.48z)| norm 0.2678 (-0.55z)| lr 1.57e-04 | 4157.03 ms | 32.5% bf16 MFU | 125925 tok/s step 13114/19560 | loss 3.367228 (+0.80z)| norm 0.2598 (-1.15z)| lr 1.57e-04 | 4153.25 ms | 32.5% bf16 MFU | 125941 tok/s step 13115/19560 | loss 3.328904 (-0.23z)| norm 0.2687 (-0.45z)| lr 1.57e-04 | 4155.99 ms | 32.5% bf16 MFU | 125951 tok/s step 13116/19560 | loss 3.360877 (+0.66z)| norm 0.2857 (+0.87z)| lr 1.57e-04 | 4152.22 ms | 32.5% bf16 MFU | 125967 tok/s step 13117/19560 | loss 3.340994 (+0.10z)| norm 0.2591 (-1.18z)| lr 1.57e-04 | 4157.96 ms | 32.5% bf16 MFU | 125973 tok/s step 13118/19560 | loss 3.342434 (+0.15z)| norm 0.2674 (-0.54z)| lr 1.57e-04 | 4161.05 ms | 32.4% bf16 MFU | 125975 tok/s step 13119/19560 | loss 3.331666 (-0.15z)| norm 0.2751 (+0.05z)| lr 1.57e-04 | 4170.98 ms | 32.4% bf16 MFU | 125961 tok/s step 13120/19560 | loss 3.367034 (+0.84z)| norm 0.2625 (-0.91z)| lr 1.57e-04 | 4159.58 ms | 32.5% bf16 MFU | 125965 tok/s step 13121/19560 | loss 3.314529 (-0.62z)| norm 0.2626 (-0.89z)| lr 1.57e-04 | 4156.66 ms | 32.5% bf16 MFU | 125973 tok/s step 13122/19560 | loss 3.338788 (+0.05z)| norm 0.2681 (-0.48z)| lr 1.57e-04 | 4155.29 ms | 32.5% bf16 MFU | 125983 tok/s step 13123/19560 | loss 3.350214 (+0.36z)| norm 0.2542 (-1.53z)| 
lr 1.57e-04 | 4162.48 ms | 32.4% bf16 MFU | 125982 tok/s step 13124/19560 | loss 3.376516 (+1.08z)| norm 0.2757 (+0.13z)| lr 1.57e-04 | 4165.14 ms | 32.4% bf16 MFU | 125977 tok/s step 13125/19560 | loss 3.308928 (-0.80z)| norm 0.2638 (-0.80z)| lr 1.57e-04 | 4175.25 ms | 32.3% bf16 MFU | 125956 tok/s step 13126/19560 | loss 3.329727 (-0.24z)| norm 0.2735 (-0.04z)| lr 1.56e-04 | 4154.66 ms | 32.5% bf16 MFU | 125968 tok/s step 13127/19560 | loss 3.310673 (-0.76z)| norm 0.2676 (-0.49z)| lr 1.56e-04 | 4166.96 ms | 32.4% bf16 MFU | 125961 tok/s step 13128/19560 | loss 3.329567 (-0.23z)| norm 0.2883 (+1.11z)| lr 1.56e-04 | 4160.21 ms | 32.5% bf16 MFU | 125964 tok/s step 13129/19560 | loss 3.337047 (-0.03z)| norm 0.2694 (-0.35z)| lr 1.56e-04 | 4165.04 ms | 32.4% bf16 MFU | 125960 tok/s step 13130/19560 | loss 3.392483 (+1.52z)| norm 0.2791 (+0.40z)| lr 1.56e-04 | 4155.09 ms | 32.5% bf16 MFU | 125971 tok/s step 13131/19560 | loss 3.328215 (-0.29z)| norm 0.3023 (+2.16z)| lr 1.56e-04 | 4151.16 ms | 32.5% bf16 MFU | 125987 tok/s step 13132/19560 | loss 3.396062 (+1.59z)| norm 0.2983 (+1.82z)| lr 1.56e-04 | 4158.04 ms | 32.5% bf16 MFU | 125992 tok/s step 13133/19560 | loss 3.301094 (-1.04z)| norm 0.2814 (+0.56z)| lr 1.56e-04 | 4161.50 ms | 32.4% bf16 MFU | 125992 tok/s step 13134/19560 | loss 3.334733 (-0.08z)| norm 0.3048 (+2.28z)| lr 1.56e-04 | 4162.01 ms | 32.4% bf16 MFU | 125991 tok/s step 13135/19560 | loss 3.316142 (-0.61z)| norm 0.2718 (-0.19z)| lr 1.56e-04 | 4150.43 ms | 32.5% bf16 MFU | 126007 tok/s step 13136/19560 | loss 3.381743 (+1.24z)| norm 0.2786 (+0.32z)| lr 1.56e-04 | 4153.81 ms | 32.5% bf16 MFU | 126018 tok/s step 13137/19560 | loss 3.379159 (+1.15z)| norm 0.2917 (+1.29z)| lr 1.56e-04 | 4154.00 ms | 32.5% bf16 MFU | 126028 tok/s step 13138/19560 | loss 3.302569 (-1.01z)| norm 0.2920 (+1.29z)| lr 1.56e-04 | 4149.67 ms | 32.5% bf16 MFU | 126043 tok/s step 13139/19560 | loss 3.317407 (-0.59z)| norm 0.2789 (+0.32z)| lr 1.56e-04 | 4160.79 ms | 32.4% bf16 MFU | 
126042 tok/s step 13140/19560 | loss 3.283829 (-1.53z)| norm 0.2880 (+0.98z)| lr 1.56e-04 | 4154.53 ms | 32.5% bf16 MFU | 126049 tok/s step 13141/19560 | loss 3.396000 (+1.60z)| norm 0.2806 (+0.45z)| lr 1.56e-04 | 4161.74 ms | 32.4% bf16 MFU | 126046 tok/s step 13142/19560 | loss 3.298409 (-1.14z)| norm 0.2876 (+0.97z)| lr 1.56e-04 | 4164.09 ms | 32.4% bf16 MFU | 126039 tok/s step 13143/19560 | loss 3.355629 (+0.46z)| norm 0.2721 (-0.19z)| lr 1.56e-04 | 4156.47 ms | 32.5% bf16 MFU | 126044 tok/s step 13144/19560 | loss 3.329453 (-0.28z)| norm 0.2888 (+1.06z)| lr 1.56e-04 | 4149.91 ms | 32.5% bf16 MFU | 126058 tok/s step 13145/19560 | loss 3.344751 (+0.15z)| norm 0.2580 (-1.22z)| lr 1.56e-04 | 4151.11 ms | 32.5% bf16 MFU | 126071 tok/s step 13146/19560 | loss 3.280283 (-1.63z)| norm 0.2856 (+0.85z)| lr 1.56e-04 | 4155.70 ms | 32.5% bf16 MFU | 126075 tok/s step 13147/19560 | loss 3.367849 (+0.79z)| norm 0.2691 (-0.37z)| lr 1.56e-04 | 4155.97 ms | 32.5% bf16 MFU | 126079 tok/s step 13148/19560 | loss 3.306098 (-0.90z)| norm 0.2998 (+1.92z)| lr 1.56e-04 | 4164.86 ms | 32.4% bf16 MFU | 126069 tok/s step 13149/19560 | loss 3.376526 (+1.05z)| norm 0.2733 (-0.06z)| lr 1.55e-04 | 4153.57 ms | 32.5% bf16 MFU | 126077 tok/s step 13150/19560 | loss 3.372146 (+0.92z)| norm 0.2896 (+1.16z)| lr 1.55e-04 | 4163.67 ms | 32.4% bf16 MFU | 126069 tok/s step 13151/19560 | loss 3.325183 (-0.38z)| norm 0.2730 (-0.08z)| lr 1.55e-04 | 4176.83 ms | 32.3% bf16 MFU | 126042 tok/s step 13152/19560 | loss 3.336402 (-0.08z)| norm 0.2977 (+1.74z)| lr 1.55e-04 | 4165.29 ms | 32.4% bf16 MFU | 126033 tok/s step 13153/19560 | loss 3.370317 (+0.86z)| norm 0.2878 (+1.02z)| lr 1.55e-04 | 4152.47 ms | 32.5% bf16 MFU | 126045 tok/s step 13154/19560 | loss 3.366225 (+0.75z)| norm 0.2761 (+0.14z)| lr 1.55e-04 | 4161.04 ms | 32.4% bf16 MFU | 126042 tok/s step 13155/19560 | loss 3.368016 (+0.79z)| norm 0.2787 (+0.35z)| lr 1.55e-04 | 4161.59 ms | 32.4% bf16 MFU | 126039 tok/s step 13156/19560 | loss 3.351267 
(+0.32z)| norm 0.2682 (-0.44z)| lr 1.55e-04 | 4150.20 ms | 32.5% bf16 MFU | 126054 tok/s step 13157/19560 | loss 3.314648 (-0.72z)| norm 0.2789 (+0.37z)| lr 1.55e-04 | 4161.93 ms | 32.4% bf16 MFU | 126050 tok/s step 13158/19560 | loss 3.333793 (-0.18z)| norm 0.2769 (+0.22z)| lr 1.55e-04 | 4157.73 ms | 32.5% bf16 MFU | 126052 tok/s step 13159/19560 | loss 3.304962 (-0.98z)| norm 0.2659 (-0.64z)| lr 1.55e-04 | 4157.45 ms | 32.5% bf16 MFU | 126055 tok/s step 13160/19560 | loss 3.348870 (+0.25z)| norm 0.2899 (+1.20z)| lr 1.55e-04 | 4155.13 ms | 32.5% bf16 MFU | 126061 tok/s step 13161/19560 | loss 3.320987 (-0.52z)| norm 0.2673 (-0.53z)| lr 1.55e-04 | 4153.82 ms | 32.5% bf16 MFU | 126069 tok/s step 13162/19560 | loss 3.367700 (+0.79z)| norm 0.2968 (+1.71z)| lr 1.55e-04 | 4151.58 ms | 32.5% bf16 MFU | 126080 tok/s step 13163/19560 | loss 3.401277 (+1.72z)| norm 0.2713 (-0.26z)| lr 1.55e-04 | 4154.19 ms | 32.5% bf16 MFU | 126086 tok/s step 13164/19560 | loss 3.354545 (+0.39z)| norm 0.2640 (-0.83z)| lr 1.55e-04 | 4157.62 ms | 32.5% bf16 MFU | 126087 tok/s step 13165/19560 | loss 3.317737 (-0.66z)| norm 0.2920 (+1.32z)| lr 1.55e-04 | 4154.29 ms | 32.5% bf16 MFU | 126093 tok/s step 13166/19560 | loss 3.323909 (-0.48z)| norm 0.2788 (+0.30z)| lr 1.55e-04 | 4156.21 ms | 32.5% bf16 MFU | 126096 tok/s step 13167/19560 | loss 3.365366 (+0.69z)| norm 0.3177 (+3.15z)| lr 1.55e-04 | 4152.93 ms | 32.5% bf16 MFU | 126103 tok/s step 13168/19560 | loss 3.371879 (+0.87z)| norm 0.2745 (-0.06z)| lr 1.55e-04 | 4154.97 ms | 32.5% bf16 MFU | 126107 tok/s step 13169/19560 | loss 3.374161 (+0.92z)| norm 0.3096 (+2.51z)| lr 1.55e-04 | 4151.21 ms | 32.5% bf16 MFU | 126117 tok/s step 13170/19560 | loss 3.377789 (+1.01z)| norm 0.2599 (-1.17z)| lr 1.55e-04 | 4159.26 ms | 32.5% bf16 MFU | 126113 tok/s step 13171/19560 | loss 3.296350 (-1.31z)| norm 0.2940 (+1.34z)| lr 1.54e-04 | 4158.45 ms | 32.5% bf16 MFU | 126112 tok/s step 13172/19560 | loss 3.301659 (-1.16z)| norm 0.2722 (-0.29z)| lr 1.54e-04 | 
4171.66 ms | 32.4% bf16 MFU | 126090 tok/s step 13173/19560 | loss 3.338047 (-0.12z)| norm 0.2748 (-0.10z)| lr 1.54e-04 | 4149.27 ms | 32.5% bf16 MFU | 126103 tok/s step 13174/19560 | loss 3.355376 (+0.36z)| norm 0.2971 (+1.54z)| lr 1.54e-04 | 4158.04 ms | 32.5% bf16 MFU | 126103 tok/s step 13175/19560 | loss 3.349623 (+0.22z)| norm 0.2673 (-0.66z)| lr 1.54e-04 | 4157.63 ms | 32.5% bf16 MFU | 126103 tok/s step 13176/19560 | loss 3.348535 (+0.18z)| norm 0.2984 (+1.61z)| lr 1.54e-04 | 4251.74 ms | 31.8% bf16 MFU | 125963 tok/s step 13177/19560 | loss 3.353892 (+0.33z)| norm 0.2766 (+0.02z)| lr 1.54e-04 | 4163.99 ms | 32.4% bf16 MFU | 125960 tok/s step 13178/19560 | loss 3.335478 (-0.22z)| norm 0.2959 (+1.41z)| lr 1.54e-04 | 4166.30 ms | 32.4% bf16 MFU | 125954 tok/s step 13179/19560 | loss 3.306320 (-1.10z)| norm 0.2904 (+1.00z)| lr 1.54e-04 | 4155.50 ms | 32.5% bf16 MFU | 125965 tok/s step 13180/19560 | loss 3.319176 (-0.71z)| norm 0.2666 (-0.72z)| lr 1.54e-04 | 4148.09 ms | 32.5% bf16 MFU | 125986 tok/s step 13181/19560 | loss 3.400444 (+1.69z)| norm 0.2877 (+0.81z)| lr 1.54e-04 | 4154.67 ms | 32.5% bf16 MFU | 125997 tok/s step 13182/19560 | loss 3.365673 (+0.64z)| norm 0.2592 (-1.27z)| lr 1.54e-04 | 4156.73 ms | 32.5% bf16 MFU | 126003 tok/s step 13183/19560 | loss 3.354023 (+0.28z)| norm 0.2891 (+0.90z)| lr 1.54e-04 | 4158.62 ms | 32.5% bf16 MFU | 126007 tok/s step 13184/19560 | loss 3.376702 (+0.95z)| norm 0.2860 (+0.67z)| lr 1.54e-04 | 4155.13 ms | 32.5% bf16 MFU | 126015 tok/s step 13185/19560 | loss 3.251045 (-2.75z)| norm 0.2836 (+0.49z)| lr 1.54e-04 | 4159.14 ms | 32.5% bf16 MFU | 126017 tok/s step 13186/19560 | loss 3.312886 (-0.92z)| norm 0.2741 (-0.21z)| lr 1.54e-04 | 4150.33 ms | 32.5% bf16 MFU | 126033 tok/s step 13187/19560 | loss 3.327453 (-0.48z)| norm 0.2797 (+0.19z)| lr 1.54e-04 | 4157.15 ms | 32.5% bf16 MFU | 126037 tok/s step 13188/19560 | loss 3.427663 (+2.39z)| norm 0.2850 (+0.57z)| lr 1.54e-04 | 4166.71 ms | 32.4% bf16 MFU | 126027 tok/s step 
13189/19560 | loss 3.408073 (+1.78z)| norm 0.2723 (-0.37z)| lr 1.54e-04 | 4158.59 ms | 32.5% bf16 MFU | 126029 tok/s step 13190/19560 | loss 3.355744 (+0.31z)| norm 0.2744 (-0.21z)| lr 1.54e-04 | 4159.27 ms | 32.5% bf16 MFU | 126030 tok/s step 13191/19560 | loss 3.283482 (-1.73z)| norm 0.2632 (-1.03z)| lr 1.54e-04 | 4154.71 ms | 32.5% bf16 MFU | 126038 tok/s step 13192/19560 | loss 3.360244 (+0.46z)| norm 0.2739 (-0.23z)| lr 1.54e-04 | 4160.64 ms | 32.5% bf16 MFU | 126037 tok/s step 13193/19560 | loss 3.369678 (+0.73z)| norm 0.2665 (-0.78z)| lr 1.54e-04 | 4154.28 ms | 32.5% bf16 MFU | 126045 tok/s step 13194/19560 | loss 3.311599 (-0.91z)| norm 0.2752 (-0.12z)| lr 1.53e-04 | 4163.33 ms | 32.4% bf16 MFU | 126039 tok/s step 13195/19560 | loss 3.433722 (+2.52z)| norm 0.2753 (-0.12z)| lr 1.53e-04 | 4148.15 ms | 32.5% bf16 MFU | 126057 tok/s step 13196/19560 | loss 3.364042 (+0.55z)| norm 0.3263 (+3.49z)| lr 1.53e-04 | 4158.72 ms | 32.5% bf16 MFU | 126058 tok/s step 13197/19560 | loss 3.318315 (-0.74z)| norm 0.2940 (+1.19z)| lr 1.53e-04 | 4153.73 ms | 32.5% bf16 MFU | 126066 tok/s step 13198/19560 | loss 3.275053 (-1.92z)| norm 0.2915 (+1.00z)| lr 1.53e-04 | 4157.10 ms | 32.5% bf16 MFU | 126068 tok/s step 13199/19560 | loss 3.331853 (-0.34z)| norm 0.2966 (+1.37z)| lr 1.53e-04 | 4151.36 ms | 32.5% bf16 MFU | 126080 tok/s step 13200/19560 | loss 3.337118 (-0.20z)| norm 0.2905 (+0.93z)| lr 1.53e-04 | 4151.32 ms | 32.5% bf16 MFU | 126090 tok/s step 13201/19560 | loss 3.336233 (-0.22z)| norm 0.2702 (-0.51z)| lr 1.53e-04 | 4151.65 ms | 32.5% bf16 MFU | 126100 tok/s step 13202/19560 | loss 3.314552 (-0.81z)| norm 0.2972 (+1.39z)| lr 1.53e-04 | 4165.92 ms | 32.4% bf16 MFU | 126088 tok/s step 13203/19560 | loss 3.370311 (+0.75z)| norm 0.2686 (-0.63z)| lr 1.53e-04 | 4150.39 ms | 32.5% bf16 MFU | 126099 tok/s step 13204/19560 | loss 3.331299 (-0.34z)| norm 0.2904 (+0.91z)| lr 1.53e-04 | 4153.90 ms | 32.5% bf16 MFU | 126105 tok/s step 13205/19560 | loss 3.297081 (-1.29z)| norm 
0.2903 (+0.89z)| lr 1.53e-04 | 4160.55 ms | 32.5% bf16 MFU | 126101 tok/s step 13206/19560 | loss 3.335967 (-0.20z)| norm 0.2755 (-0.16z)| lr 1.53e-04 | 4157.39 ms | 32.5% bf16 MFU | 126101 tok/s step 13207/19560 | loss 3.333364 (-0.27z)| norm 0.2835 (+0.42z)| lr 1.53e-04 | 4158.91 ms | 32.5% bf16 MFU | 126099 tok/s step 13208/19560 | loss 3.433332 (+2.50z)| norm 0.2885 (+0.76z)| lr 1.53e-04 | 4153.65 ms | 32.5% bf16 MFU | 126105 tok/s step 13209/19560 | loss 3.309449 (-0.94z)| norm 0.2884 (+0.77z)| lr 1.53e-04 | 4161.32 ms | 32.4% bf16 MFU | 126100 tok/s step 13210/19560 | loss 3.307282 (-1.00z)| norm 0.2751 (-0.20z)| lr 1.53e-04 | 4163.65 ms | 32.4% bf16 MFU | 126091 tok/s step 13211/19560 | loss 3.333601 (-0.26z)| norm 0.2718 (-0.43z)| lr 1.53e-04 | 4161.33 ms | 32.4% bf16 MFU | 126086 tok/s step 13212/19560 | loss 3.285918 (-1.58z)| norm 0.2677 (-0.73z)| lr 1.53e-04 | 4159.22 ms | 32.5% bf16 MFU | 126084 tok/s step 13213/19560 | loss 3.331422 (-0.33z)| norm 0.2702 (-0.54z)| lr 1.53e-04 | 4158.03 ms | 32.5% bf16 MFU | 126084 tok/s step 13214/19560 | loss 3.392220 (+1.37z)| norm 0.2782 (+0.05z)| lr 1.53e-04 | 4152.37 ms | 32.5% bf16 MFU | 126093 tok/s step 13215/19560 | loss 3.372013 (+0.82z)| norm 0.2752 (-0.18z)| lr 1.53e-04 | 4160.54 ms | 32.5% bf16 MFU | 126089 tok/s step 13216/19560 | loss 3.405792 (+1.74z)| norm 0.2771 (-0.03z)| lr 1.53e-04 | 4160.23 ms | 32.5% bf16 MFU | 126086 tok/s step 13217/19560 | loss 3.317278 (-0.72z)| norm 0.2936 (+1.16z)| lr 1.52e-04 | 4151.84 ms | 32.5% bf16 MFU | 126096 tok/s step 13218/19560 | loss 3.346712 (+0.09z)| norm 0.2838 (+0.44z)| lr 1.52e-04 | 4149.24 ms | 32.5% bf16 MFU | 126109 tok/s step 13219/19560 | loss 3.362182 (+0.51z)| norm 0.2712 (-0.48z)| lr 1.52e-04 | 4158.10 ms | 32.5% bf16 MFU | 126108 tok/s step 13220/19560 | loss 3.418704 (+2.07z)| norm 0.2834 (+0.41z)| lr 1.52e-04 | 4159.79 ms | 32.5% bf16 MFU | 126104 tok/s step 13221/19560 | loss 3.308192 (-0.99z)| norm 0.2718 (-0.44z)| lr 1.52e-04 | 4151.62 ms | 
32.5% bf16 MFU | 126113 tok/s step 13222/19560 | loss 3.343187 (-0.01z)| norm 0.2704 (-0.54z)| lr 1.52e-04 | 4165.10 ms | 32.4% bf16 MFU | 126101 tok/s step 13223/19560 | loss 3.322377 (-0.58z)| norm 0.2944 (+1.20z)| lr 1.52e-04 | 4160.76 ms | 32.5% bf16 MFU | 126097 tok/s step 13224/19560 | loss 3.367638 (+0.68z)| norm 0.2757 (-0.16z)| lr 1.52e-04 | 4156.43 ms | 32.5% bf16 MFU | 126099 tok/s step 13225/19560 | loss 3.292356 (-1.41z)| norm 0.2645 (-0.98z)| lr 1.52e-04 | 4158.02 ms | 32.5% bf16 MFU | 126098 tok/s step 13226/19560 | loss 3.340393 (-0.10z)| norm 0.2681 (-0.71z)| lr 1.52e-04 | 4160.15 ms | 32.5% bf16 MFU | 126095 tok/s step 13227/19560 | loss 3.321464 (-0.62z)| norm 0.2723 (-0.40z)| lr 1.52e-04 | 4153.29 ms | 32.5% bf16 MFU | 126102 tok/s step 13228/19560 | loss 3.302557 (-1.16z)| norm 0.2572 (-1.49z)| lr 1.52e-04 | 4155.51 ms | 32.5% bf16 MFU | 126105 tok/s step 13229/19560 | loss 3.387075 (+1.27z)| norm 0.2747 (-0.23z)| lr 1.52e-04 | 4158.49 ms | 32.5% bf16 MFU | 126104 tok/s step 13230/19560 | loss 3.427744 (+2.39z)| norm 0.2779 (-0.01z)| lr 1.52e-04 | 4163.47 ms | 32.4% bf16 MFU | 126095 tok/s step 13231/19560 | loss 3.333512 (-0.30z)| norm 0.2699 (-0.62z)| lr 1.52e-04 | 4158.91 ms | 32.5% bf16 MFU | 126093 tok/s step 13232/19560 | loss 3.416458 (+2.02z)| norm 0.3062 (+2.04z)| lr 1.52e-04 | 4156.35 ms | 32.5% bf16 MFU | 126096 tok/s step 13233/19560 | loss 3.398604 (+1.50z)| norm 0.2655 (-0.98z)| lr 1.52e-04 | 4149.52 ms | 32.5% bf16 MFU | 126108 tok/s step 13234/19560 | loss 3.379927 (+0.96z)| norm 0.2703 (-0.63z)| lr 1.52e-04 | 4161.98 ms | 32.4% bf16 MFU | 126101 tok/s step 13235/19560 | loss 3.322939 (-0.62z)| norm 0.2821 (+0.24z)| lr 1.52e-04 | 4149.33 ms | 32.5% bf16 MFU | 126114 tok/s step 13236/19560 | loss 3.428498 (+2.27z)| norm 0.2836 (+0.35z)| lr 1.52e-04 | 4151.60 ms | 32.5% bf16 MFU | 126123 tok/s step 13237/19560 | loss 3.323866 (-0.59z)| norm 0.2702 (-0.66z)| lr 1.52e-04 | 4165.50 ms | 32.4% bf16 MFU | 126110 tok/s step 13238/19560 
| loss 3.335415 (-0.27z)| norm 0.2687 (-0.80z)| lr 1.52e-04 | 4156.42 ms | 32.5% bf16 MFU | 126111 tok/s step 13239/19560 | loss 3.335646 (-0.26z)| norm 0.2891 (+0.76z)| lr 1.52e-04 | 4152.85 ms | 32.5% bf16 MFU | 126118 tok/s step 13240/19560 | loss 3.290172 (-1.49z)| norm 0.2627 (-1.27z)| lr 1.51e-04 | 4158.77 ms | 32.5% bf16 MFU | 126116 tok/s step 13241/19560 | loss 3.316782 (-0.76z)| norm 0.2765 (-0.22z)| lr 1.51e-04 | 4155.13 ms | 32.5% bf16 MFU | 126119 tok/s step 13242/19560 | loss 3.289486 (-1.47z)| norm 0.2948 (+1.18z)| lr 1.51e-04 | 4148.67 ms | 32.5% bf16 MFU | 126131 tok/s step 13243/19560 | loss 3.323410 (-0.55z)| norm 0.2783 (-0.11z)| lr 1.51e-04 | 4155.71 ms | 32.5% bf16 MFU | 126133 tok/s step 13244/19560 | loss 3.317739 (-0.70z)| norm 0.2949 (+1.17z)| lr 1.51e-04 | 4149.18 ms | 32.5% bf16 MFU | 126144 tok/s step 13245/19560 | loss 3.337615 (-0.16z)| norm 0.2791 (-0.06z)| lr 1.51e-04 | 4154.18 ms | 32.5% bf16 MFU | 126147 tok/s step 13246/19560 | loss 3.453868 (+2.87z)| norm 0.3001 (+1.55z)| lr 1.51e-04 | 4160.32 ms | 32.5% bf16 MFU | 126141 tok/s step 13247/19560 | loss 3.314620 (-0.77z)| norm 0.2840 (+0.30z)| lr 1.51e-04 | 4156.98 ms | 32.5% bf16 MFU | 126140 tok/s step 13248/19560 | loss 3.315376 (-0.74z)| norm 0.2809 (+0.05z)| lr 1.51e-04 | 4146.58 ms | 32.6% bf16 MFU | 126155 tok/s step 13249/19560 | loss 3.384316 (+1.04z)| norm 0.2728 (-0.60z)| lr 1.51e-04 | 4158.16 ms | 32.5% bf16 MFU | 126152 tok/s step 13250/19560 | loss 3.346230 (+0.05z)| norm 0.3367 (+4.11z)| lr 1.51e-04 | 4158.30 ms | 32.5% bf16 MFU | 126148 tok/s val loss 3.322341 evaluating HellaSwag: 0/1256 evaluating HellaSwag: 10/1256 evaluating HellaSwag: 20/1256 evaluating HellaSwag: 30/1256 evaluating HellaSwag: 40/1256 evaluating HellaSwag: 50/1256 evaluating HellaSwag: 60/1256 evaluating HellaSwag: 70/1256 evaluating HellaSwag: 80/1256 evaluating HellaSwag: 90/1256 evaluating HellaSwag: 100/1256 evaluating HellaSwag: 110/1256 evaluating HellaSwag: 120/1256 evaluating 
HellaSwag: 130/1256 evaluating HellaSwag: 140/1256 evaluating HellaSwag: 150/1256 evaluating HellaSwag: 160/1256 evaluating HellaSwag: 170/1256 evaluating HellaSwag: 180/1256 evaluating HellaSwag: 190/1256 evaluating HellaSwag: 200/1256 evaluating HellaSwag: 210/1256 evaluating HellaSwag: 220/1256 evaluating HellaSwag: 230/1256 evaluating HellaSwag: 240/1256 evaluating HellaSwag: 250/1256 evaluating HellaSwag: 260/1256 evaluating HellaSwag: 270/1256 evaluating HellaSwag: 280/1256 evaluating HellaSwag: 290/1256 evaluating HellaSwag: 300/1256 evaluating HellaSwag: 310/1256 evaluating HellaSwag: 320/1256 evaluating HellaSwag: 330/1256 evaluating HellaSwag: 340/1256 evaluating HellaSwag: 350/1256 evaluating HellaSwag: 360/1256 evaluating HellaSwag: 370/1256 evaluating HellaSwag: 380/1256 evaluating HellaSwag: 390/1256 evaluating HellaSwag: 400/1256 evaluating HellaSwag: 410/1256 evaluating HellaSwag: 420/1256 evaluating HellaSwag: 430/1256 evaluating HellaSwag: 440/1256 evaluating HellaSwag: 450/1256 evaluating HellaSwag: 460/1256 evaluating HellaSwag: 470/1256 evaluating HellaSwag: 480/1256 evaluating HellaSwag: 490/1256 evaluating HellaSwag: 500/1256 evaluating HellaSwag: 510/1256 evaluating HellaSwag: 520/1256 evaluating HellaSwag: 530/1256 evaluating HellaSwag: 540/1256 evaluating HellaSwag: 550/1256 evaluating HellaSwag: 560/1256 evaluating HellaSwag: 570/1256 evaluating HellaSwag: 580/1256 evaluating HellaSwag: 590/1256 evaluating HellaSwag: 600/1256 evaluating HellaSwag: 610/1256 evaluating HellaSwag: 620/1256 evaluating HellaSwag: 630/1256 evaluating HellaSwag: 640/1256 evaluating HellaSwag: 650/1256 evaluating HellaSwag: 660/1256 evaluating HellaSwag: 670/1256 evaluating HellaSwag: 680/1256 evaluating HellaSwag: 690/1256 evaluating HellaSwag: 700/1256 evaluating HellaSwag: 710/1256 evaluating HellaSwag: 720/1256 evaluating HellaSwag: 730/1256 evaluating HellaSwag: 740/1256 evaluating HellaSwag: 750/1256 evaluating HellaSwag: 760/1256 evaluating HellaSwag: 
770/1256 evaluating HellaSwag: 780/1256 evaluating HellaSwag: 790/1256 evaluating HellaSwag: 800/1256 evaluating HellaSwag: 810/1256 evaluating HellaSwag: 820/1256 evaluating HellaSwag: 830/1256 evaluating HellaSwag: 840/1256 evaluating HellaSwag: 850/1256 evaluating HellaSwag: 860/1256 evaluating HellaSwag: 870/1256 evaluating HellaSwag: 880/1256 evaluating HellaSwag: 890/1256 evaluating HellaSwag: 900/1256 evaluating HellaSwag: 910/1256 evaluating HellaSwag: 920/1256 evaluating HellaSwag: 930/1256 evaluating HellaSwag: 940/1256 evaluating HellaSwag: 950/1256 evaluating HellaSwag: 960/1256 evaluating HellaSwag: 970/1256 evaluating HellaSwag: 980/1256 evaluating HellaSwag: 990/1256 evaluating HellaSwag: 1000/1256 evaluating HellaSwag: 1010/1256 evaluating HellaSwag: 1020/1256 evaluating HellaSwag: 1030/1256 evaluating HellaSwag: 1040/1256 evaluating HellaSwag: 1050/1256 evaluating HellaSwag: 1060/1256 evaluating HellaSwag: 1070/1256 evaluating HellaSwag: 1080/1256 evaluating HellaSwag: 1090/1256 evaluating HellaSwag: 1100/1256 evaluating HellaSwag: 1110/1256 evaluating HellaSwag: 1120/1256 evaluating HellaSwag: 1130/1256 evaluating HellaSwag: 1140/1256 evaluating HellaSwag: 1150/1256 evaluating HellaSwag: 1160/1256 evaluating HellaSwag: 1170/1256 evaluating HellaSwag: 1180/1256 evaluating HellaSwag: 1190/1256 evaluating HellaSwag: 1200/1256 evaluating HellaSwag: 1210/1256 evaluating HellaSwag: 1220/1256 evaluating HellaSwag: 1230/1256 evaluating HellaSwag: 1240/1256 evaluating HellaSwag: 1250/1256 HellaSwag: 2980/10042 = 0.296754 step 13251/19560 | loss 3.346783 (+0.06z)| norm 0.2649 (-1.20z)| lr 1.51e-04 | 4161.42 ms | 32.4% bf16 MFU | 126140 tok/s step 13252/19560 | loss 3.303550 (-1.05z)| norm 0.3118 (+2.23z)| lr 1.51e-04 | 4158.99 ms | 32.5% bf16 MFU | 126136 tok/s step 13253/19560 | loss 3.376851 (+0.85z)| norm 0.2932 (+0.85z)| lr 1.51e-04 | 4163.38 ms | 32.4% bf16 MFU | 126126 tok/s step 13254/19560 | loss 3.407681 (+1.62z)| norm 0.2925 (+0.79z)| lr 1.51e-04 
| 4154.10 ms | 32.5% bf16 MFU | 126130 tok/s step 13255/19560 | loss 3.374781 (+0.76z)| norm 0.2967 (+1.08z)| lr 1.51e-04 | 4165.81 ms | 32.4% bf16 MFU | 126116 tok/s step 13256/19560 | loss 3.296809 (-1.24z)| norm 0.2910 (+0.66z)| lr 1.51e-04 | 4150.79 ms | 32.5% bf16 MFU | 126126 tok/s step 13257/19560 | loss 3.323472 (-0.55z)| norm 0.2909 (+0.64z)| lr 1.51e-04 | 4159.82 ms | 32.5% bf16 MFU | 126122 tok/s step 13258/19560 | loss 3.324223 (-0.52z)| norm 0.2763 (-0.43z)| lr 1.51e-04 | 4158.45 ms | 32.5% bf16 MFU | 126119 tok/s step 13259/19560 | loss 3.370926 (+0.67z)| norm 0.2791 (-0.20z)| lr 1.51e-04 | 4148.03 ms | 32.5% bf16 MFU | 126133 tok/s step 13260/19560 | loss 3.351827 (+0.19z)| norm 0.2942 (+0.91z)| lr 1.51e-04 | 4154.25 ms | 32.5% bf16 MFU | 126137 tok/s step 13261/19560 | loss 3.414182 (+1.77z)| norm 0.2867 (+0.36z)| lr 1.51e-04 | 4151.23 ms | 32.5% bf16 MFU | 126145 tok/s step 13262/19560 | loss 3.313716 (-0.81z)| norm 0.2762 (-0.41z)| lr 1.51e-04 | 4147.52 ms | 32.6% bf16 MFU | 126158 tok/s step 13263/19560 | loss 3.335249 (-0.26z)| norm 0.2802 (-0.11z)| lr 1.50e-04 | 4156.40 ms | 32.5% bf16 MFU | 126157 tok/s step 13264/19560 | loss 3.375033 (+0.76z)| norm 0.2711 (-0.79z)| lr 1.50e-04 | 4447.87 ms | 30.4% bf16 MFU | 125743 tok/s step 13265/19560 | loss 3.275942 (-1.75z)| norm 0.3003 (+1.38z)| lr 1.50e-04 | 4434.05 ms | 30.5% bf16 MFU | 125368 tok/s step 13266/19560 | loss 3.319936 (-0.63z)| norm 0.2718 (-0.73z)| lr 1.50e-04 | 4199.25 ms | 32.2% bf16 MFU | 125342 tok/s step 13267/19560 | loss 3.270659 (-1.86z)| norm 0.2719 (-0.72z)| lr 1.50e-04 | 4240.66 ms | 31.8% bf16 MFU | 125257 tok/s step 13268/19560 | loss 3.339564 (-0.13z)| norm 0.2882 (+0.49z)| lr 1.50e-04 | 4164.19 ms | 32.4% bf16 MFU | 125289 tok/s step 13269/19560 | loss 3.343445 (-0.02z)| norm 0.2643 (-1.26z)| lr 1.50e-04 | 4214.01 ms | 32.0% bf16 MFU | 125245 tok/s step 13270/19560 | loss 3.359749 (+0.39z)| norm 0.2841 (+0.20z)| lr 1.50e-04 | 4168.91 ms | 32.4% bf16 MFU | 125271 tok/s 
step 13271/19560 | loss 3.366003 (+0.54z)| norm 0.2675 (-1.02z)| lr 1.50e-04 | 4161.02 ms | 32.4% bf16 MFU | 125308 tok/s step 13272/19560 | loss 3.296910 (-1.22z)| norm 0.2697 (-0.85z)| lr 1.50e-04 | 4158.75 ms | 32.5% bf16 MFU | 125346 tok/s step 13273/19560 | loss 3.308845 (-0.91z)| norm 0.2600 (-1.56z)| lr 1.50e-04 | 4203.14 ms | 32.1% bf16 MFU | 125315 tok/s step 13274/19560 | loss 3.315984 (-0.74z)| norm 0.2766 (-0.34z)| lr 1.50e-04 | 4156.94 ms | 32.5% bf16 MFU | 125356 tok/s step 13275/19560 | loss 3.361910 (+0.45z)| norm 0.2658 (-1.13z)| lr 1.50e-04 | 4153.88 ms | 32.5% bf16 MFU | 125399 tok/s step 13276/19560 | loss 3.301119 (-1.12z)| norm 0.2914 (+0.76z)| lr 1.50e-04 | 4168.10 ms | 32.4% bf16 MFU | 125418 tok/s step 13277/19560 | loss 3.341876 (-0.06z)| norm 0.2622 (-1.38z)| lr 1.50e-04 | 4154.08 ms | 32.5% bf16 MFU | 125458 tok/s step 13278/19560 | loss 3.385638 (+1.06z)| norm 0.2918 (+0.79z)| lr 1.50e-04 | 4149.75 ms | 32.5% bf16 MFU | 125502 tok/s step 13279/19560 | loss 3.393498 (+1.25z)| norm 0.2724 (-0.63z)| lr 1.50e-04 | 4160.68 ms | 32.5% bf16 MFU | 125527 tok/s step 13280/19560 | loss 3.354639 (+0.25z)| norm 0.2883 (+0.54z)| lr 1.50e-04 | 4159.21 ms | 32.5% bf16 MFU | 125554 tok/s step 13281/19560 | loss 3.315141 (-0.76z)| norm 0.2661 (-1.07z)| lr 1.50e-04 | 4155.30 ms | 32.5% bf16 MFU | 125585 tok/s step 13282/19560 | loss 3.309883 (-0.88z)| norm 0.2823 (+0.11z)| lr 1.50e-04 | 4157.78 ms | 32.5% bf16 MFU | 125610 tok/s step 13283/19560 | loss 3.275964 (-1.71z)| norm 0.2751 (-0.42z)| lr 1.50e-04 | 4152.26 ms | 32.5% bf16 MFU | 125643 tok/s step 13284/19560 | loss 3.310077 (-0.84z)| norm 0.2623 (-1.35z)| lr 1.50e-04 | 4157.51 ms | 32.5% bf16 MFU | 125666 tok/s step 13285/19560 | loss 3.293351 (-1.25z)| norm 0.2651 (-1.13z)| lr 1.50e-04 | 4160.55 ms | 32.5% bf16 MFU | 125684 tok/s step 13286/19560 | loss 3.321909 (-0.53z)| norm 0.2791 (-0.11z)| lr 1.49e-04 | 4165.00 ms | 32.4% bf16 MFU | 125693 tok/s step 13287/19560 | loss 3.318623 (-0.61z)| norm 
0.2708 (-0.72z)| lr 1.49e-04 | 4155.87 ms | 32.5% bf16 MFU | 125717 tok/s step 13288/19560 | loss 3.320050 (-0.57z)| norm 0.2736 (-0.50z)| lr 1.49e-04 | 4161.70 ms | 32.4% bf16 MFU | 125730 tok/s step 13289/19560 | loss 3.322497 (-0.51z)| norm 0.2554 (-1.81z)| lr 1.49e-04 | 4151.54 ms | 32.5% bf16 MFU | 125758 tok/s step 13290/19560 | loss 3.301682 (-1.02z)| norm 0.2797 (-0.04z)| lr 1.49e-04 | 4161.12 ms | 32.4% bf16 MFU | 125770 tok/s step 13291/19560 | loss 3.273553 (-1.69z)| norm 0.2709 (-0.69z)| lr 1.49e-04 | 4157.57 ms | 32.5% bf16 MFU | 125786 tok/s step 13292/19560 | loss 3.331755 (-0.23z)| norm 0.2763 (-0.30z)| lr 1.49e-04 | 4155.57 ms | 32.5% bf16 MFU | 125805 tok/s step 13293/19560 | loss 3.273807 (-1.66z)| norm 0.2505 (-2.13z)| lr 1.49e-04 | 4157.14 ms | 32.5% bf16 MFU | 125821 tok/s step 13294/19560 | loss 3.317046 (-0.59z)| norm 0.2806 (+0.04z)| lr 1.49e-04 | 4151.89 ms | 32.5% bf16 MFU | 125844 tok/s step 13295/19560 | loss 3.323252 (-0.42z)| norm 0.2675 (-0.90z)| lr 1.49e-04 | 4155.97 ms | 32.5% bf16 MFU | 125859 tok/s step 13296/19560 | loss 3.273409 (-1.63z)| norm 0.2937 (+1.02z)| lr 1.49e-04 | 4147.27 ms | 32.6% bf16 MFU | 125887 tok/s step 13297/19560 | loss 3.355227 (+0.39z)| norm 0.2776 (-0.15z)| lr 1.49e-04 | 4154.42 ms | 32.5% bf16 MFU | 125903 tok/s step 13298/19560 | loss 3.334914 (-0.10z)| norm 0.3028 (+1.71z)| lr 1.49e-04 | 4160.65 ms | 32.5% bf16 MFU | 125908 tok/s step 13299/19560 | loss 3.342798 (+0.08z)| norm 0.2916 (+0.88z)| lr 1.49e-04 | 4153.20 ms | 32.5% bf16 MFU | 125925 tok/s step 13300/19560 | loss 3.289842 (-1.23z)| norm 0.2592 (-1.53z)| lr 1.49e-04 | 4153.69 ms | 32.5% bf16 MFU | 125939 tok/s step 13301/19560 | loss 3.333412 (-0.15z)| norm 0.2812 (+0.10z)| lr 1.49e-04 | 4159.79 ms | 32.5% bf16 MFU | 125944 tok/s step 13302/19560 | loss 3.324886 (-0.35z)| norm 0.2527 (-1.98z)| lr 1.49e-04 | 4154.15 ms | 32.5% bf16 MFU | 125958 tok/s step 13303/19560 | loss 3.313307 (-0.63z)| norm 0.2820 (+0.17z)| lr 1.49e-04 | 4159.09 ms | 
32.5% bf16 MFU | 125963 tok/s step 13304/19560 | loss 3.365238 (+0.65z)| norm 0.2740 (-0.41z)| lr 1.49e-04 | 4157.53 ms | 32.5% bf16 MFU | 125970 tok/s step 13305/19560 | loss 3.276700 (-1.51z)| norm 0.2909 (+0.84z)| lr 1.49e-04 | 4153.51 ms | 32.5% bf16 MFU | 125983 tok/s step 13306/19560 | loss 3.321196 (-0.42z)| norm 0.2897 (+0.76z)| lr 1.49e-04 | 4150.39 ms | 32.5% bf16 MFU | 126000 tok/s step 13307/19560 | loss 3.376269 (+0.92z)| norm 0.2785 (-0.07z)| lr 1.49e-04 | 4159.29 ms | 32.5% bf16 MFU | 126002 tok/s step 13308/19560 | loss 3.324251 (-0.36z)| norm 0.3196 (+2.88z)| lr 1.49e-04 | 4150.95 ms | 32.5% bf16 MFU | 126017 tok/s step 13309/19560 | loss 3.307522 (-0.75z)| norm 0.2694 (-0.74z)| lr 1.49e-04 | 4158.82 ms | 32.5% bf16 MFU | 126020 tok/s step 13310/19560 | loss 3.300940 (-0.90z)| norm 0.3030 (+1.66z)| lr 1.48e-04 | 4784.36 ms | 28.2% bf16 MFU | 125198 tok/s step 13311/19560 | loss 3.329290 (-0.20z)| norm 0.2954 (+1.11z)| lr 1.48e-04 | 4152.27 ms | 32.5% bf16 MFU | 125251 tok/s step 13312/19560 | loss 3.371488 (+0.84z)| norm 0.2811 (+0.08z)| lr 1.48e-04 | 4147.41 ms | 32.6% bf16 MFU | 125310 tok/s step 13313/19560 | loss 3.378987 (+1.02z)| norm 0.3121 (+2.25z)| lr 1.48e-04 | 4155.58 ms | 32.5% bf16 MFU | 125352 tok/s step 13314/19560 | loss 3.305652 (-0.82z)| norm 0.2866 (+0.44z)| lr 1.48e-04 | 4158.83 ms | 32.5% bf16 MFU | 125388 tok/s step 13315/19560 | loss 3.310708 (-0.68z)| norm 0.2865 (+0.43z)| lr 1.48e-04 | 4152.32 ms | 32.5% bf16 MFU | 125432 tok/s step 13316/19560 | loss 3.366545 (+0.73z)| norm 0.3039 (+1.63z)| lr 1.48e-04 | 4158.81 ms | 32.5% bf16 MFU | 125464 tok/s step 13317/19560 | loss 3.374453 (+0.95z)| norm 0.2781 (-0.17z)| lr 1.48e-04 | 4164.25 ms | 32.4% bf16 MFU | 125485 tok/s step 13318/19560 | loss 3.307374 (-0.76z)| norm 0.2867 (+0.42z)| lr 1.48e-04 | 4158.34 ms | 32.5% bf16 MFU | 125515 tok/s step 13319/19560 | loss 3.301629 (-0.91z)| norm 0.2848 (+0.28z)| lr 1.48e-04 | 4162.41 ms | 32.4% bf16 MFU | 125537 tok/s step 13320/19560 
| loss 3.338524 (+0.04z)| norm 0.2836 (+0.19z)| lr 1.48e-04 | 4159.16 ms | 32.5% bf16 MFU | 125563 tok/s step 13321/19560 | loss 3.305491 (-0.80z)| norm 0.2882 (+0.50z)| lr 1.48e-04 | 4151.01 ms | 32.5% bf16 MFU | 125600 tok/s step 13322/19560 | loss 3.329194 (-0.19z)| norm 0.2726 (-0.60z)| lr 1.48e-04 | 4154.84 ms | 32.5% bf16 MFU | 125630 tok/s step 13323/19560 | loss 3.320239 (-0.41z)| norm 0.2863 (+0.36z)| lr 1.48e-04 | 4156.30 ms | 32.5% bf16 MFU | 125655 tok/s step 13324/19560 | loss 3.293758 (-1.09z)| norm 0.2603 (-1.49z)| lr 1.48e-04 | 4153.09 ms | 32.5% bf16 MFU | 125685 tok/s step 13325/19560 | loss 3.324008 (-0.30z)| norm 0.2890 (+0.62z)| lr 1.48e-04 | 4162.71 ms | 32.4% bf16 MFU | 125698 tok/s step 13326/19560 | loss 3.359242 (+0.62z)| norm 0.2737 (-0.50z)| lr 1.48e-04 | 4160.68 ms | 32.5% bf16 MFU | 125713 tok/s step 13327/19560 | loss 3.367929 (+0.84z)| norm 0.2705 (-0.72z)| lr 1.48e-04 | 4152.92 ms | 32.5% bf16 MFU | 125740 tok/s step 13328/19560 | loss 3.337063 (+0.02z)| norm 0.2940 (+1.00z)| lr 1.48e-04 | 4153.71 ms | 32.5% bf16 MFU | 125764 tok/s step 13329/19560 | loss 3.266509 (-1.81z)| norm 0.2535 (-1.93z)| lr 1.48e-04 | 4150.44 ms | 32.5% bf16 MFU | 125792 tok/s step 13330/19560 | loss 3.319225 (-0.43z)| norm 0.2894 (+0.68z)| lr 1.48e-04 | 4150.30 ms | 32.5% bf16 MFU | 125819 tok/s step 13331/19560 | loss 3.393176 (+1.49z)| norm 0.2742 (-0.43z)| lr 1.48e-04 | 4160.32 ms | 32.5% bf16 MFU | 125829 tok/s step 13332/19560 | loss 3.308743 (-0.70z)| norm 0.2692 (-0.78z)| lr 1.48e-04 | 4151.09 ms | 32.5% bf16 MFU | 125852 tok/s step 13333/19560 | loss 3.303213 (-0.85z)| norm 0.2687 (-0.81z)| lr 1.47e-04 | 4155.77 ms | 32.5% bf16 MFU | 125868 tok/s step 13334/19560 | loss 3.329453 (-0.16z)| norm 0.2637 (-1.16z)| lr 1.47e-04 | 4157.52 ms | 32.5% bf16 MFU | 125880 tok/s step 13335/19560 | loss 3.364161 (+0.73z)| norm 0.2790 (-0.05z)| lr 1.47e-04 | 4166.10 ms | 32.4% bf16 MFU | 125878 tok/s step 13336/19560 | loss 3.314370 (-0.55z)| norm 0.2781 (-0.11z)| 
lr 1.47e-04 | 4164.49 ms | 32.4% bf16 MFU | 125879 tok/s step 13337/19560 | loss 3.350808 (+0.41z)| norm 0.2947 (+1.09z)| lr 1.47e-04 | 4178.16 ms | 32.3% bf16 MFU | 125859 tok/s step 13338/19560 | loss 3.378006 (+1.12z)| norm 0.2886 (+0.64z)| lr 1.47e-04 | 4153.82 ms | 32.5% bf16 MFU | 125877 tok/s step 13339/19560 | loss 3.322015 (-0.37z)| norm 0.2956 (+1.13z)| lr 1.47e-04 | 4156.04 ms | 32.5% bf16 MFU | 125891 tok/s step 13340/19560 | loss 3.360412 (+0.64z)| norm 0.2898 (+0.70z)| lr 1.47e-04 | 4149.55 ms | 32.5% bf16 MFU | 125914 tok/s step 13341/19560 | loss 3.338549 (+0.06z)| norm 0.3075 (+1.93z)| lr 1.47e-04 | 4157.71 ms | 32.5% bf16 MFU | 125923 tok/s step 13342/19560 | loss 3.309368 (-0.71z)| norm 0.2851 (+0.33z)| lr 1.47e-04 | 4161.99 ms | 32.4% bf16 MFU | 125925 tok/s step 13343/19560 | loss 3.299299 (-0.96z)| norm 0.3201 (+2.71z)| lr 1.47e-04 | 4154.79 ms | 32.5% bf16 MFU | 125938 tok/s step 13344/19560 | loss 3.317429 (-0.47z)| norm 0.2885 (+0.53z)| lr 1.47e-04 | 4159.53 ms | 32.5% bf16 MFU | 125944 tok/s step 13345/19560 | loss 3.314705 (-0.54z)| norm 0.3000 (+1.31z)| lr 1.47e-04 | 4161.99 ms | 32.4% bf16 MFU | 125945 tok/s step 13346/19560 | loss 3.288977 (-1.22z)| norm 0.2904 (+0.64z)| lr 1.47e-04 | 4160.40 ms | 32.5% bf16 MFU | 125949 tok/s step 13347/19560 | loss 3.352635 (+0.51z)| norm 0.2918 (+0.73z)| lr 1.47e-04 | 4170.52 ms | 32.4% bf16 MFU | 125937 tok/s step 13348/19560 | loss 3.301528 (-0.87z)| norm 0.3078 (+1.79z)| lr 1.47e-04 | 4162.97 ms | 32.4% bf16 MFU | 125937 tok/s step 13349/19560 | loss 3.324295 (-0.25z)| norm 0.2746 (-0.46z)| lr 1.47e-04 | 4156.99 ms | 32.5% bf16 MFU | 125946 tok/s step 13350/19560 | loss 3.367096 (+0.93z)| norm 0.3589 (+4.74z)| lr 1.47e-04 | 4165.44 ms | 32.4% bf16 MFU | 125942 tok/s step 13351/19560 | loss 3.323816 (-0.26z)| norm 0.2988 (+1.03z)| lr 1.47e-04 | 4160.71 ms | 32.5% bf16 MFU | 125946 tok/s step 13352/19560 | loss 3.336497 (+0.09z)| norm 0.3065 (+1.48z)| lr 1.47e-04 | 4161.33 ms | 32.4% bf16 MFU | 
125948 tok/s step 13353/19560 | loss 3.316461 (-0.47z)| norm 0.3113 (+1.73z)| lr 1.47e-04 | 4161.45 ms | 32.4% bf16 MFU | 125950 tok/s step 13354/19560 | loss 3.296791 (-1.00z)| norm 0.2816 (-0.07z)| lr 1.47e-04 | 4158.41 ms | 32.5% bf16 MFU | 125956 tok/s step 13355/19560 | loss 3.379151 (+1.26z)| norm 0.3055 (+1.35z)| lr 1.47e-04 | 4151.94 ms | 32.5% bf16 MFU | 125972 tok/s step 13356/19560 | loss 3.285866 (-1.30z)| norm 0.3205 (+2.21z)| lr 1.46e-04 | 4156.33 ms | 32.5% bf16 MFU | 125981 tok/s step 13357/19560 | loss 3.317794 (-0.41z)| norm 0.2679 (-0.93z)| lr 1.46e-04 | 4164.07 ms | 32.4% bf16 MFU | 125977 tok/s step 13358/19560 | loss 3.380362 (+1.35z)| norm 0.2999 (+0.97z)| lr 1.46e-04 | 4154.79 ms | 32.5% bf16 MFU | 125988 tok/s step 13359/19560 | loss 3.336854 (+0.13z)| norm 0.2763 (-0.44z)| lr 1.46e-04 | 4166.41 ms | 32.4% bf16 MFU | 125980 tok/s step 13360/19560 | loss 3.340014 (+0.24z)| norm 0.2888 (+0.31z)| lr 1.46e-04 | 4161.55 ms | 32.4% bf16 MFU | 125980 tok/s step 13361/19560 | loss 3.341926 (+0.31z)| norm 0.3047 (+1.25z)| lr 1.46e-04 | 4164.61 ms | 32.4% bf16 MFU | 125976 tok/s step 13362/19560 | loss 3.329385 (-0.05z)| norm 0.3281 (+2.56z)| lr 1.46e-04 | 4154.13 ms | 32.5% bf16 MFU | 125988 tok/s step 13363/19560 | loss 3.383997 (+1.54z)| norm 0.2898 (+0.32z)| lr 1.46e-04 | 4155.56 ms | 32.5% bf16 MFU | 125996 tok/s step 13364/19560 | loss 3.408696 (+2.29z)| norm 0.2751 (-0.54z)| lr 1.46e-04 | 4172.67 ms | 32.4% bf16 MFU | 125979 tok/s step 13365/19560 | loss 3.360015 (+0.84z)| norm 0.2954 (+0.63z)| lr 1.46e-04 | 4158.43 ms | 32.5% bf16 MFU | 125984 tok/s step 13366/19560 | loss 3.319261 (-0.36z)| norm 0.2517 (-1.89z)| lr 1.46e-04 | 4181.77 ms | 32.3% bf16 MFU | 125954 tok/s step 13367/19560 | loss 3.357342 (+0.76z)| norm 0.3075 (+1.32z)| lr 1.46e-04 | 4158.00 ms | 32.5% bf16 MFU | 125960 tok/s step 13368/19560 | loss 3.285481 (-1.36z)| norm 0.2537 (-1.76z)| lr 1.46e-04 | 4156.39 ms | 32.5% bf16 MFU | 125969 tok/s step 13369/19560 | loss 3.361153 
(+0.86z)| norm 0.3006 (+0.91z)| lr 1.46e-04 | 4162.61 ms | 32.4% bf16 MFU | 125969 tok/s step 13370/19560 | loss 3.379204 (+1.36z)| norm 0.2856 (+0.06z)| lr 1.46e-04 | 4159.52 ms | 32.5% bf16 MFU | 125972 tok/s step 13371/19560 | loss 3.318667 (-0.41z)| norm 0.2842 (-0.03z)| lr 1.46e-04 | 4156.52 ms | 32.5% bf16 MFU | 125981 tok/s step 13372/19560 | loss 3.327632 (-0.15z)| norm 0.2940 (+0.53z)| lr 1.46e-04 | 4158.92 ms | 32.5% bf16 MFU | 125985 tok/s step 13373/19560 | loss 3.289647 (-1.24z)| norm 0.2867 (+0.12z)| lr 1.46e-04 | 4157.64 ms | 32.5% bf16 MFU | 125991 tok/s step 13374/19560 | loss 3.281530 (-1.50z)| norm 0.2753 (-0.53z)| lr 1.46e-04 | 4155.84 ms | 32.5% bf16 MFU | 125999 tok/s step 13375/19560 | loss 3.353106 (+0.66z)| norm 0.2839 (-0.04z)| lr 1.46e-04 | 4156.90 ms | 32.5% bf16 MFU | 126005 tok/s step 13376/19560 | loss 3.285910 (-1.36z)| norm 0.2621 (-1.27z)| lr 1.46e-04 | 4151.07 ms | 32.5% bf16 MFU | 126020 tok/s step 13377/19560 | loss 3.315039 (-0.47z)| norm 0.2685 (-0.90z)| lr 1.46e-04 | 4159.18 ms | 32.5% bf16 MFU | 126022 tok/s step 13378/19560 | loss 3.326453 (-0.12z)| norm 0.2638 (-1.17z)| lr 1.46e-04 | 4162.31 ms | 32.4% bf16 MFU | 126019 tok/s step 13379/19560 | loss 3.291908 (-1.15z)| norm 0.2819 (-0.12z)| lr 1.45e-04 | 4165.76 ms | 32.4% bf16 MFU | 126011 tok/s step 13380/19560 | loss 3.348965 (+0.57z)| norm 0.2753 (-0.49z)| lr 1.45e-04 | 4163.39 ms | 32.4% bf16 MFU | 126007 tok/s step 13381/19560 | loss 3.306771 (-0.70z)| norm 0.2719 (-0.69z)| lr 1.45e-04 | 4168.91 ms | 32.4% bf16 MFU | 125994 tok/s step 13382/19560 | loss 3.291284 (-1.17z)| norm 0.2834 (-0.00z)| lr 1.45e-04 | 4164.02 ms | 32.4% bf16 MFU | 125990 tok/s step 13383/19560 | loss 3.319599 (-0.28z)| norm 0.2805 (-0.16z)| lr 1.45e-04 | 4162.60 ms | 32.4% bf16 MFU | 125988 tok/s step 13384/19560 | loss 3.328606 (-0.00z)| norm 0.2630 (-1.19z)| lr 1.45e-04 | 4158.82 ms | 32.5% bf16 MFU | 125992 tok/s step 13385/19560 | loss 3.384669 (+1.73z)| norm 0.2817 (-0.08z)| lr 1.45e-04 | 
4168.30 ms | 32.4% bf16 MFU | 125981 tok/s step 13386/19560 | loss 3.264150 (-1.98z)| norm 0.2687 (-0.84z)| lr 1.45e-04 | 4158.39 ms | 32.5% bf16 MFU | 125986 tok/s step 13387/19560 | loss 3.310359 (-0.55z)| norm 0.2787 (-0.25z)| lr 1.45e-04 | 4166.21 ms | 32.4% bf16 MFU | 125979 tok/s step 13388/19560 | loss 3.384022 (+1.70z)| norm 0.2748 (-0.47z)| lr 1.45e-04 | 4164.44 ms | 32.4% bf16 MFU | 125975 tok/s step 13389/19560 | loss 3.301502 (-0.81z)| norm 0.2652 (-1.03z)| lr 1.45e-04 | 4161.41 ms | 32.4% bf16 MFU | 125976 tok/s step 13390/19560 | loss 3.313051 (-0.45z)| norm 0.2645 (-1.06z)| lr 1.45e-04 | 4151.29 ms | 32.5% bf16 MFU | 125992 tok/s step 13391/19560 | loss 3.349081 (+0.67z)| norm 0.2599 (-1.31z)| lr 1.45e-04 | 4156.27 ms | 32.5% bf16 MFU | 125999 tok/s step 13392/19560 | loss 3.344715 (+0.55z)| norm 0.2629 (-1.13z)| lr 1.45e-04 | 4158.76 ms | 32.5% bf16 MFU | 126003 tok/s step 13393/19560 | loss 3.280716 (-1.47z)| norm 0.2750 (-0.42z)| lr 1.45e-04 | 4169.93 ms | 32.4% bf16 MFU | 125989 tok/s step 13394/19560 | loss 3.358786 (+0.98z)| norm 0.2776 (-0.26z)| lr 1.45e-04 | 4162.12 ms | 32.4% bf16 MFU | 125988 tok/s step 13395/19560 | loss 3.367206 (+1.23z)| norm 0.2679 (-0.83z)| lr 1.45e-04 | 4160.10 ms | 32.5% bf16 MFU | 125990 tok/s step 13396/19560 | loss 3.354451 (+0.82z)| norm 0.2719 (-0.59z)| lr 1.45e-04 | 4156.94 ms | 32.5% bf16 MFU | 125997 tok/s step 13397/19560 | loss 3.341177 (+0.40z)| norm 0.2804 (-0.10z)| lr 1.45e-04 | 4163.49 ms | 32.4% bf16 MFU | 125993 tok/s step 13398/19560 | loss 3.333089 (+0.15z)| norm 0.2825 (+0.03z)| lr 1.45e-04 | 4150.78 ms | 32.5% bf16 MFU | 126009 tok/s step 13399/19560 | loss 3.312564 (-0.49z)| norm 0.2808 (-0.08z)| lr 1.45e-04 | 4156.32 ms | 32.5% bf16 MFU | 126016 tok/s step 13400/19560 | loss 3.287720 (-1.28z)| norm 0.2738 (-0.49z)| lr 1.45e-04 | 4159.93 ms | 32.5% bf16 MFU | 126016 tok/s step 13401/19560 | loss 3.330936 (+0.09z)| norm 0.2762 (-0.37z)| lr 1.45e-04 | 4153.29 ms | 32.5% bf16 MFU | 126027 tok/s step 
13402/19560 | loss 3.287715 (-1.27z)| norm 0.2896 (+0.42z)| lr 1.45e-04 | 4167.07 ms | 32.4% bf16 MFU | 126017 tok/s step 13403/19560 | loss 3.345021 (+0.55z)| norm 0.2847 (+0.13z)| lr 1.44e-04 | 4156.31 ms | 32.5% bf16 MFU | 126023 tok/s step 13404/19560 | loss 3.368967 (+1.29z)| norm 0.2863 (+0.23z)| lr 1.44e-04 | 4166.36 ms | 32.4% bf16 MFU | 126014 tok/s step 13405/19560 | loss 3.425066 (+2.94z)| norm 0.3162 (+1.96z)| lr 1.44e-04 | 4157.28 ms | 32.5% bf16 MFU | 126019 tok/s step 13406/19560 | loss 3.341612 (+0.40z)| norm 0.2874 (+0.26z)| lr 1.44e-04 | 4161.16 ms | 32.4% bf16 MFU | 126018 tok/s step 13407/19560 | loss 3.294738 (-1.03z)| norm 0.2770 (-0.35z)| lr 1.44e-04 | 4159.98 ms | 32.5% bf16 MFU | 126018 tok/s step 13408/19560 | loss 3.273945 (-1.65z)| norm 0.2800 (-0.17z)| lr 1.44e-04 | 4148.42 ms | 32.5% bf16 MFU | 126037 tok/s step 13409/19560 | loss 3.318617 (-0.27z)| norm 0.2754 (-0.45z)| lr 1.44e-04 | 4165.27 ms | 32.4% bf16 MFU | 126028 tok/s step 13410/19560 | loss 3.375391 (+1.47z)| norm 0.2655 (-1.03z)| lr 1.44e-04 | 4157.06 ms | 32.5% bf16 MFU | 126033 tok/s step 13411/19560 | loss 3.323880 (-0.13z)| norm 0.2694 (-0.79z)| lr 1.44e-04 | 4155.62 ms | 32.5% bf16 MFU | 126039 tok/s step 13412/19560 | loss 3.312684 (-0.48z)| norm 0.2826 (-0.02z)| lr 1.44e-04 | 4161.09 ms | 32.4% bf16 MFU | 126037 tok/s step 13413/19560 | loss 3.390866 (+1.91z)| norm 0.2799 (-0.19z)| lr 1.44e-04 | 4157.75 ms | 32.5% bf16 MFU | 126040 tok/s step 13414/19560 | loss 3.316718 (-0.37z)| norm 0.2814 (-0.10z)| lr 1.44e-04 | 4157.70 ms | 32.5% bf16 MFU | 126043 tok/s step 13415/19560 | loss 3.289183 (-1.21z)| norm 0.2611 (-1.30z)| lr 1.44e-04 | 4155.57 ms | 32.5% bf16 MFU | 126050 tok/s step 13416/19560 | loss 3.357229 (+0.87z)| norm 0.2681 (-0.88z)| lr 1.44e-04 | 4163.43 ms | 32.4% bf16 MFU | 126043 tok/s step 13417/19560 | loss 3.437215 (+3.15z)| norm 0.2811 (-0.12z)| lr 1.44e-04 | 4155.55 ms | 32.5% bf16 MFU | 126049 tok/s step 13418/19560 | loss 3.347049 (+0.50z)| norm 
0.2674 (-0.94z)| lr 1.44e-04 | 4158.27 ms | 32.5% bf16 MFU | 126051 tok/s step 13419/19560 | loss 3.337933 (+0.22z)| norm 0.2784 (-0.28z)| lr 1.44e-04 | 4169.58 ms | 32.4% bf16 MFU | 126036 tok/s step 13420/19560 | loss 3.302472 (-0.83z)| norm 0.2524 (-1.80z)| lr 1.44e-04 | 4160.16 ms | 32.5% bf16 MFU | 126035 tok/s step 13421/19560 | loss 3.283728 (-1.39z)| norm 0.2922 (+0.54z)| lr 1.44e-04 | 4159.59 ms | 32.5% bf16 MFU | 126036 tok/s step 13422/19560 | loss 3.307944 (-0.67z)| norm 0.2638 (-1.15z)| lr 1.44e-04 | 4169.27 ms | 32.4% bf16 MFU | 126021 tok/s step 13423/19560 | loss 3.367828 (+1.10z)| norm 0.2802 (-0.18z)| lr 1.44e-04 | 4161.70 ms | 32.4% bf16 MFU | 126019 tok/s step 13424/19560 | loss 3.341506 (+0.31z)| norm 0.2619 (-1.25z)| lr 1.44e-04 | 4157.96 ms | 32.5% bf16 MFU | 126023 tok/s step 13425/19560 | loss 3.281962 (-1.45z)| norm 0.2605 (-1.32z)| lr 1.44e-04 | 4157.12 ms | 32.5% bf16 MFU | 126028 tok/s step 13426/19560 | loss 3.379792 (+1.44z)| norm 0.2857 (+0.18z)| lr 1.43e-04 | 4156.80 ms | 32.5% bf16 MFU | 126033 tok/s step 13427/19560 | loss 3.355319 (+0.71z)| norm 0.2728 (-0.58z)| lr 1.43e-04 | 4163.87 ms | 32.4% bf16 MFU | 126027 tok/s step 13428/19560 | loss 3.375088 (+1.27z)| norm 0.2881 (+0.32z)| lr 1.43e-04 | 4165.95 ms | 32.4% bf16 MFU | 126018 tok/s step 13429/19560 | loss 3.330090 (-0.05z)| norm 0.2656 (-1.02z)| lr 1.43e-04 | 4158.96 ms | 32.5% bf16 MFU | 126020 tok/s step 13430/19560 | loss 3.273760 (-1.68z)| norm 0.2922 (+0.56z)| lr 1.43e-04 | 4156.57 ms | 32.5% bf16 MFU | 126026 tok/s step 13431/19560 | loss 3.296334 (-1.01z)| norm 0.2786 (-0.26z)| lr 1.43e-04 | 4160.44 ms | 32.5% bf16 MFU | 126025 tok/s step 13432/19560 | loss 3.290346 (-1.17z)| norm 0.3015 (+1.10z)| lr 1.43e-04 | 4159.27 ms | 32.5% bf16 MFU | 126027 tok/s step 13433/19560 | loss 3.335756 (+0.14z)| norm 0.2731 (-0.60z)| lr 1.43e-04 | 4155.73 ms | 32.5% bf16 MFU | 126033 tok/s step 13434/19560 | loss 3.360334 (+0.84z)| norm 0.2863 (+0.20z)| lr 1.43e-04 | 4155.13 ms | 
32.5% bf16 MFU | 126041 tok/s step 13435/19560 | loss 3.270300 (-1.75z)| norm 0.2667 (-0.97z)| lr 1.43e-04 | 4168.81 ms | 32.4% bf16 MFU | 126027 tok/s step 13436/19560 | loss 3.315477 (-0.44z)| norm 0.2916 (+0.54z)| lr 1.43e-04 | 4168.33 ms | 32.4% bf16 MFU | 126014 tok/s step 13437/19560 | loss 3.313158 (-0.51z)| norm 0.2574 (-1.52z)| lr 1.43e-04 | 5095.16 ms | 26.5% bf16 MFU | 124859 tok/s step 13438/19560 | loss 3.366576 (+1.03z)| norm 0.2843 (+0.11z)| lr 1.43e-04 | 4152.00 ms | 32.5% bf16 MFU | 124929 tok/s step 13439/19560 | loss 3.327735 (-0.10z)| norm 0.2703 (-0.73z)| lr 1.43e-04 | 4151.81 ms | 32.5% bf16 MFU | 124997 tok/s step 13440/19560 | loss 3.304942 (-0.74z)| norm 0.2761 (-0.37z)| lr 1.43e-04 | 4154.59 ms | 32.5% bf16 MFU | 125057 tok/s step 13441/19560 | loss 3.322007 (-0.24z)| norm 0.2985 (+1.00z)| lr 1.43e-04 | 4151.67 ms | 32.5% bf16 MFU | 125118 tok/s step 13442/19560 | loss 3.279845 (-1.46z)| norm 0.2830 (+0.05z)| lr 1.43e-04 | 4310.64 ms | 31.3% bf16 MFU | 124944 tok/s step 13443/19560 | loss 3.311450 (-0.54z)| norm 0.2827 (+0.04z)| lr 1.43e-04 | 4294.95 ms | 31.4% bf16 MFU | 124800 tok/s step 13444/19560 | loss 3.322745 (-0.20z)| norm 0.2882 (+0.39z)| lr 1.43e-04 | 4163.79 ms | 32.4% bf16 MFU | 124856 tok/s step 13445/19560 | loss 3.319679 (-0.28z)| norm 0.2754 (-0.40z)| lr 1.43e-04 | 4149.92 ms | 32.5% bf16 MFU | 124930 tok/s step 13446/19560 | loss 3.336396 (+0.21z)| norm 0.2890 (+0.43z)| lr 1.43e-04 | 4157.01 ms | 32.5% bf16 MFU | 124989 tok/s step 13447/19560 | loss 3.354379 (+0.72z)| norm 0.2873 (+0.33z)| lr 1.43e-04 | 4162.69 ms | 32.4% bf16 MFU | 125037 tok/s step 13448/19560 | loss 3.330567 (+0.02z)| norm 0.2721 (-0.60z)| lr 1.43e-04 | 4160.12 ms | 32.5% bf16 MFU | 125087 tok/s step 13449/19560 | loss 3.301841 (-0.82z)| norm 0.2676 (-0.87z)| lr 1.43e-04 | 4162.96 ms | 32.4% bf16 MFU | 125130 tok/s step 13450/19560 | loss 3.279636 (-1.45z)| norm 0.2750 (-0.41z)| lr 1.42e-04 | 4158.86 ms | 32.5% bf16 MFU | 125176 tok/s step 13451/19560 
| loss 3.277766 (-1.48z)| norm 0.2714 (-0.63z)| lr 1.42e-04 | 4158.73 ms | 32.5% bf16 MFU | 125221 tok/s step 13452/19560 | loss 3.346178 (+0.49z)| norm 0.2865 (+0.29z)| lr 1.42e-04 | 4160.53 ms | 32.5% bf16 MFU | 125261 tok/s step 13453/19560 | loss 3.313807 (-0.45z)| norm 0.2639 (-1.09z)| lr 1.42e-04 | 4169.86 ms | 32.4% bf16 MFU | 125284 tok/s step 13454/19560 | loss 3.338168 (+0.26z)| norm 0.2713 (-0.63z)| lr 1.42e-04 | 4190.68 ms | 32.2% bf16 MFU | 125275 tok/s step 13455/19560 | loss 3.294410 (-1.00z)| norm 0.2582 (-1.42z)| lr 1.42e-04 | 4201.43 ms | 32.1% bf16 MFU | 125251 tok/s step 13456/19560 | loss 3.356102 (+0.80z)| norm 0.2812 (-0.01z)| lr 1.42e-04 | 4271.00 ms | 31.6% bf16 MFU | 125126 tok/s step 13457/19560 | loss 3.291199 (-1.11z)| norm 0.2572 (-1.49z)| lr 1.42e-04 | 4174.55 ms | 32.3% bf16 MFU | 125150 tok/s step 13458/19560 | loss 3.318836 (-0.30z)| norm 0.2708 (-0.64z)| lr 1.42e-04 | 4168.50 ms | 32.4% bf16 MFU | 125181 tok/s step 13459/19560 | loss 3.350974 (+0.66z)| norm 0.2714 (-0.60z)| lr 1.42e-04 | 4208.80 ms | 32.1% bf16 MFU | 125150 tok/s step 13460/19560 | loss 3.374277 (+1.33z)| norm 0.2650 (-0.99z)| lr 1.42e-04 | 4183.17 ms | 32.3% bf16 MFU | 125159 tok/s step 13461/19560 | loss 3.289375 (-1.17z)| norm 0.2807 (-0.04z)| lr 1.42e-04 | 4168.94 ms | 32.4% bf16 MFU | 125189 tok/s step 13462/19560 | loss 3.316214 (-0.37z)| norm 0.2690 (-0.76z)| lr 1.42e-04 | 4159.60 ms | 32.5% bf16 MFU | 125232 tok/s step 13463/19560 | loss 3.359898 (+0.91z)| norm 0.2913 (+0.61z)| lr 1.42e-04 | 4210.63 ms | 32.1% bf16 MFU | 125196 tok/s step 13464/19560 | loss 3.352276 (+0.68z)| norm 0.2629 (-1.13z)| lr 1.42e-04 | 4167.02 ms | 32.4% bf16 MFU | 125227 tok/s step 13465/19560 | loss 3.318032 (-0.32z)| norm 0.2791 (-0.13z)| lr 1.42e-04 | 4160.02 ms | 32.5% bf16 MFU | 125267 tok/s step 13466/19560 | loss 3.330053 (+0.05z)| norm 0.2668 (-0.87z)| lr 1.42e-04 | 4185.92 ms | 32.3% bf16 MFU | 125267 tok/s step 13467/19560 | loss 3.303507 (-0.74z)| norm 0.2700 (-0.67z)| 
lr 1.42e-04 | 4176.44 ms | 32.3% bf16 MFU | 125280 tok/s step 13468/19560 | loss 3.403371 (+2.18z)| norm 0.2812 (+0.03z)| lr 1.42e-04 | 4169.15 ms | 32.4% bf16 MFU | 125304 tok/s step 13469/19560 | loss 3.282401 (-1.33z)| norm 0.3000 (+1.19z)| lr 1.42e-04 | 4161.02 ms | 32.4% bf16 MFU | 125339 tok/s step 13470/19560 | loss 3.320280 (-0.23z)| norm 0.2939 (+0.81z)| lr 1.42e-04 | 4166.75 ms | 32.4% bf16 MFU | 125363 tok/s step 13471/19560 | loss 3.254612 (-2.10z)| norm 0.2727 (-0.49z)| lr 1.42e-04 | 4163.06 ms | 32.4% bf16 MFU | 125392 tok/s step 13472/19560 | loss 3.372174 (+1.24z)| norm 0.3019 (+1.34z)| lr 1.42e-04 | 4172.65 ms | 32.4% bf16 MFU | 125405 tok/s step 13473/19560 | loss 3.349810 (+0.60z)| norm 0.2539 (-1.64z)| lr 1.41e-04 | 4160.52 ms | 32.5% bf16 MFU | 125435 tok/s step 13474/19560 | loss 3.286901 (-1.19z)| norm 0.2868 (+0.42z)| lr 1.41e-04 | 4164.43 ms | 32.4% bf16 MFU | 125458 tok/s step 13475/19560 | loss 3.309803 (-0.53z)| norm 0.2757 (-0.27z)| lr 1.41e-04 | 4165.07 ms | 32.4% bf16 MFU | 125479 tok/s step 13476/19560 | loss 3.307321 (-0.60z)| norm 0.2873 (+0.47z)| lr 1.41e-04 | 4162.31 ms | 32.4% bf16 MFU | 125503 tok/s step 13477/19560 | loss 3.330830 (+0.07z)| norm 0.3152 (+2.17z)| lr 1.41e-04 | 4171.56 ms | 32.4% bf16 MFU | 125512 tok/s step 13478/19560 | loss 3.365032 (+1.04z)| norm 0.2836 (+0.28z)| lr 1.41e-04 | 4173.33 ms | 32.4% bf16 MFU | 125518 tok/s step 13479/19560 | loss 3.342909 (+0.41z)| norm 0.3075 (+1.90z)| lr 1.41e-04 | 4164.27 ms | 32.4% bf16 MFU | 125537 tok/s step 13480/19560 | loss 3.298937 (-0.83z)| norm 0.2789 (-0.03z)| lr 1.41e-04 | 4159.77 ms | 32.5% bf16 MFU | 125562 tok/s step 13481/19560 | loss 3.347528 (+0.54z)| norm 0.2888 (+0.67z)| lr 1.41e-04 | 4170.53 ms | 32.4% bf16 MFU | 125570 tok/s step 13482/19560 | loss 3.327556 (-0.03z)| norm 0.2785 (-0.05z)| lr 1.41e-04 | 4165.21 ms | 32.4% bf16 MFU | 125585 tok/s step 13483/19560 | loss 3.353936 (+0.73z)| norm 0.2758 (-0.23z)| lr 1.41e-04 | 4174.28 ms | 32.3% bf16 MFU | 
125586 tok/s step 13484/19560 | loss 3.417355 (+2.46z)| norm 0.3123 (+2.41z)| lr 1.41e-04 | 4160.24 ms | 32.5% bf16 MFU | 125607 tok/s step 13485/19560 | loss 3.388926 (+1.63z)| norm 0.2701 (-0.64z)| lr 1.41e-04 | 4176.63 ms | 32.3% bf16 MFU | 125604 tok/s step 13486/19560 | loss 3.351879 (+0.61z)| norm 0.2962 (+1.25z)| lr 1.41e-04 | 4164.89 ms | 32.4% bf16 MFU | 125618 tok/s step 13487/19560 | loss 3.348558 (+0.52z)| norm 0.2946 (+1.12z)| lr 1.41e-04 | 4170.48 ms | 32.4% bf16 MFU | 125622 tok/s step 13488/19560 | loss 3.358988 (+0.80z)| norm 0.2684 (-0.76z)| lr 1.41e-04 | 4158.54 ms | 32.5% bf16 MFU | 125645 tok/s step 13489/19560 | loss 3.386664 (+1.55z)| norm 0.3107 (+2.27z)| lr 1.41e-04 | 4173.09 ms | 32.4% bf16 MFU | 125644 tok/s step 13490/19560 | loss 3.334604 (+0.11z)| norm 0.2861 (+0.56z)| lr 1.41e-04 | 4164.73 ms | 32.4% bf16 MFU | 125657 tok/s step 13491/19560 | loss 3.371525 (+1.14z)| norm 0.2959 (+1.30z)| lr 1.41e-04 | 4178.01 ms | 32.3% bf16 MFU | 125648 tok/s step 13492/19560 | loss 3.338018 (+0.23z)| norm 0.2760 (-0.20z)| lr 1.41e-04 | 4153.77 ms | 32.5% bf16 MFU | 125677 tok/s step 13493/19560 | loss 3.310802 (-0.53z)| norm 0.3051 (+1.96z)| lr 1.41e-04 | 4155.86 ms | 32.5% bf16 MFU | 125701 tok/s step 13494/19560 | loss 3.381414 (+1.44z)| norm 0.3018 (+1.70z)| lr 1.41e-04 | 4205.12 ms | 32.1% bf16 MFU | 125650 tok/s step 13495/19560 | loss 3.341552 (+0.33z)| norm 0.3028 (+1.78z)| lr 1.41e-04 | 5400.19 ms | 25.0% bf16 MFU | 124221 tok/s step 13496/19560 | loss 3.312855 (-0.49z)| norm 0.3159 (+2.69z)| lr 1.41e-04 | 4151.54 ms | 32.5% bf16 MFU | 124325 tok/s step 13497/19560 | loss 3.324417 (-0.15z)| norm 0.3117 (+2.34z)| lr 1.40e-04 | 4156.47 ms | 32.5% bf16 MFU | 124415 tok/s step 13498/19560 | loss 3.383717 (+1.52z)| norm 0.3077 (+2.01z)| lr 1.40e-04 | 4164.33 ms | 32.4% bf16 MFU | 124490 tok/s step 13499/19560 | loss 3.371513 (+1.16z)| norm 0.3065 (+1.88z)| lr 1.40e-04 | 4160.56 ms | 32.5% bf16 MFU | 124566 tok/s step 13500/19560 | loss 3.346272 
(+0.45z)| norm 0.3514 (+4.61z)| lr 1.40e-04 | 4166.57 ms | 32.4% bf16 MFU | 124629 tok/s val loss 3.317923 evaluating HellaSwag: 0/1256 evaluating HellaSwag: 10/1256 evaluating HellaSwag: 20/1256 evaluating HellaSwag: 30/1256 evaluating HellaSwag: 40/1256 evaluating HellaSwag: 50/1256 evaluating HellaSwag: 60/1256 evaluating HellaSwag: 70/1256 evaluating HellaSwag: 80/1256 evaluating HellaSwag: 90/1256 evaluating HellaSwag: 100/1256 evaluating HellaSwag: 110/1256 evaluating HellaSwag: 120/1256 evaluating HellaSwag: 130/1256 evaluating HellaSwag: 140/1256 evaluating HellaSwag: 150/1256 evaluating HellaSwag: 160/1256 evaluating HellaSwag: 170/1256 evaluating HellaSwag: 180/1256 evaluating HellaSwag: 190/1256 evaluating HellaSwag: 200/1256 evaluating HellaSwag: 210/1256 evaluating HellaSwag: 220/1256 evaluating HellaSwag: 230/1256 evaluating HellaSwag: 240/1256 evaluating HellaSwag: 250/1256 evaluating HellaSwag: 260/1256 evaluating HellaSwag: 270/1256 evaluating HellaSwag: 280/1256 evaluating HellaSwag: 290/1256 evaluating HellaSwag: 300/1256 evaluating HellaSwag: 310/1256 evaluating HellaSwag: 320/1256 evaluating HellaSwag: 330/1256 evaluating HellaSwag: 340/1256 evaluating HellaSwag: 350/1256 evaluating HellaSwag: 360/1256 evaluating HellaSwag: 370/1256 evaluating HellaSwag: 380/1256 evaluating HellaSwag: 390/1256 evaluating HellaSwag: 400/1256 evaluating HellaSwag: 410/1256 evaluating HellaSwag: 420/1256 evaluating HellaSwag: 430/1256 evaluating HellaSwag: 440/1256 evaluating HellaSwag: 450/1256 evaluating HellaSwag: 460/1256 evaluating HellaSwag: 470/1256 evaluating HellaSwag: 480/1256 evaluating HellaSwag: 490/1256 evaluating HellaSwag: 500/1256 evaluating HellaSwag: 510/1256 evaluating HellaSwag: 520/1256 evaluating HellaSwag: 530/1256 evaluating HellaSwag: 540/1256 evaluating HellaSwag: 550/1256 evaluating HellaSwag: 560/1256 evaluating HellaSwag: 570/1256 evaluating HellaSwag: 580/1256 evaluating HellaSwag: 590/1256 evaluating HellaSwag: 600/1256 evaluating 
HellaSwag: 610/1256 evaluating HellaSwag: 620/1256 evaluating HellaSwag: 630/1256 evaluating HellaSwag: 640/1256 evaluating HellaSwag: 650/1256 evaluating HellaSwag: 660/1256 evaluating HellaSwag: 670/1256 evaluating HellaSwag: 680/1256 evaluating HellaSwag: 690/1256 evaluating HellaSwag: 700/1256 evaluating HellaSwag: 710/1256 evaluating HellaSwag: 720/1256 evaluating HellaSwag: 730/1256 evaluating HellaSwag: 740/1256 evaluating HellaSwag: 750/1256 evaluating HellaSwag: 760/1256 evaluating HellaSwag: 770/1256 evaluating HellaSwag: 780/1256 evaluating HellaSwag: 790/1256 evaluating HellaSwag: 800/1256 evaluating HellaSwag: 810/1256 evaluating HellaSwag: 820/1256 evaluating HellaSwag: 830/1256 evaluating HellaSwag: 840/1256 evaluating HellaSwag: 850/1256 evaluating HellaSwag: 860/1256 evaluating HellaSwag: 870/1256 evaluating HellaSwag: 880/1256 evaluating HellaSwag: 890/1256 evaluating HellaSwag: 900/1256 evaluating HellaSwag: 910/1256 evaluating HellaSwag: 920/1256 evaluating HellaSwag: 930/1256 evaluating HellaSwag: 940/1256 evaluating HellaSwag: 950/1256 evaluating HellaSwag: 960/1256 evaluating HellaSwag: 970/1256 evaluating HellaSwag: 980/1256 evaluating HellaSwag: 990/1256 evaluating HellaSwag: 1000/1256 evaluating HellaSwag: 1010/1256 evaluating HellaSwag: 1020/1256 evaluating HellaSwag: 1030/1256 evaluating HellaSwag: 1040/1256 evaluating HellaSwag: 1050/1256 evaluating HellaSwag: 1060/1256 evaluating HellaSwag: 1070/1256 evaluating HellaSwag: 1080/1256 evaluating HellaSwag: 1090/1256 evaluating HellaSwag: 1100/1256 evaluating HellaSwag: 1110/1256 evaluating HellaSwag: 1120/1256 evaluating HellaSwag: 1130/1256 evaluating HellaSwag: 1140/1256 evaluating HellaSwag: 1150/1256 evaluating HellaSwag: 1160/1256 evaluating HellaSwag: 1170/1256 evaluating HellaSwag: 1180/1256 evaluating HellaSwag: 1190/1256 evaluating HellaSwag: 1200/1256 evaluating HellaSwag: 1210/1256 evaluating HellaSwag: 1220/1256 evaluating HellaSwag: 1230/1256 evaluating HellaSwag: 1240/1256 
evaluating HellaSwag: 1250/1256 HellaSwag: 2991/10042 = 0.297849 step 13501/19560 | loss 3.292799 (-1.06z)| norm 0.2831 (+0.17z)| lr 1.40e-04 | 4167.94 ms | 32.4% bf16 MFU | 124687 tok/s step 13502/19560 | loss 3.356157 (+0.71z)| norm 0.3143 (+2.14z)| lr 1.40e-04 | 4171.13 ms | 32.4% bf16 MFU | 124738 tok/s step 13503/19560 | loss 3.403786 (+2.02z)| norm 0.2977 (+1.07z)| lr 1.40e-04 | 4181.57 ms | 32.3% bf16 MFU | 124770 tok/s step 13504/19560 | loss 3.309157 (-0.63z)| norm 0.2854 (+0.28z)| lr 1.40e-04 | 4169.98 ms | 32.4% bf16 MFU | 124818 tok/s step 13505/19560 | loss 3.297688 (-0.94z)| norm 0.3324 (+3.13z)| lr 1.40e-04 | 4166.10 ms | 32.4% bf16 MFU | 124869 tok/s step 13506/19560 | loss 3.325877 (-0.16z)| norm 0.2783 (-0.21z)| lr 1.40e-04 | 4181.58 ms | 32.3% bf16 MFU | 124895 tok/s step 13507/19560 | loss 3.380670 (+1.35z)| norm 0.3165 (+2.11z)| lr 1.40e-04 | 4168.85 ms | 32.4% bf16 MFU | 124938 tok/s step 13508/19560 | loss 3.326210 (-0.16z)| norm 0.2825 (+0.03z)| lr 1.40e-04 | 4172.09 ms | 32.4% bf16 MFU | 124974 tok/s step 13509/19560 | loss 3.270820 (-1.68z)| norm 0.2996 (+1.06z)| lr 1.40e-04 | 4168.89 ms | 32.4% bf16 MFU | 125014 tok/s step 13510/19560 | loss 3.289297 (-1.17z)| norm 0.2516 (-1.82z)| lr 1.40e-04 | 4172.83 ms | 32.4% bf16 MFU | 125045 tok/s step 13511/19560 | loss 3.361012 (+0.80z)| norm 0.2778 (-0.25z)| lr 1.40e-04 | 4170.49 ms | 32.4% bf16 MFU | 125079 tok/s step 13512/19560 | loss 3.326297 (-0.16z)| norm 0.2879 (+0.34z)| lr 1.40e-04 | 4182.88 ms | 32.3% bf16 MFU | 125092 tok/s step 13513/19560 | loss 3.370796 (+1.08z)| norm 0.2556 (-1.57z)| lr 1.40e-04 | 4176.71 ms | 32.3% bf16 MFU | 125114 tok/s step 13514/19560 | loss 3.304623 (-0.77z)| norm 0.2743 (-0.46z)| lr 1.40e-04 | 4178.23 ms | 32.3% bf16 MFU | 125132 tok/s step 13515/19560 | loss 3.299091 (-0.92z)| norm 0.2473 (-2.02z)| lr 1.40e-04 | 4181.80 ms | 32.3% bf16 MFU | 125144 tok/s step 13516/19560 | loss 3.327114 (-0.13z)| norm 0.2678 (-0.82z)| lr 1.40e-04 | 4176.84 ms | 32.3% bf16 
MFU | 125163 tok/s step 13517/19560 | loss 3.370274 (+1.07z)| norm 0.2808 (-0.06z)| lr 1.40e-04 | 4170.52 ms | 32.4% bf16 MFU | 125190 tok/s step 13518/19560 | loss 3.357420 (+0.70z)| norm 0.2878 (+0.35z)| lr 1.40e-04 | 4161.17 ms | 32.4% bf16 MFU | 125231 tok/s step 13519/19560 | loss 3.301428 (-0.86z)| norm 0.2881 (+0.35z)| lr 1.40e-04 | 4159.40 ms | 32.5% bf16 MFU | 125272 tok/s step 13520/19560 | loss 3.354105 (+0.61z)| norm 0.2987 (+0.97z)| lr 1.39e-04 | 4176.16 ms | 32.3% bf16 MFU | 125285 tok/s step 13521/19560 | loss 3.306414 (-0.73z)| norm 0.2706 (-0.71z)| lr 1.39e-04 | 4171.53 ms | 32.4% bf16 MFU | 125305 tok/s step 13522/19560 | loss 3.329100 (-0.09z)| norm 0.2845 (+0.12z)| lr 1.39e-04 | 4163.82 ms | 32.4% bf16 MFU | 125336 tok/s step 13523/19560 | loss 3.308125 (-0.67z)| norm 0.2753 (-0.43z)| lr 1.39e-04 | 4161.43 ms | 32.4% bf16 MFU | 125368 tok/s step 13524/19560 | loss 3.284537 (-1.32z)| norm 0.2591 (-1.38z)| lr 1.39e-04 | 4166.50 ms | 32.4% bf16 MFU | 125391 tok/s step 13525/19560 | loss 3.368810 (+1.05z)| norm 0.2718 (-0.62z)| lr 1.39e-04 | 4165.44 ms | 32.4% bf16 MFU | 125415 tok/s step 13526/19560 | loss 3.308981 (-0.62z)| norm 0.2673 (-0.88z)| lr 1.39e-04 | 4161.69 ms | 32.4% bf16 MFU | 125443 tok/s step 13527/19560 | loss 3.387479 (+1.55z)| norm 0.2598 (-1.31z)| lr 1.39e-04 | 4164.05 ms | 32.4% bf16 MFU | 125467 tok/s step 13528/19560 | loss 3.337284 (+0.14z)| norm 0.2928 (+0.62z)| lr 1.39e-04 | 4158.08 ms | 32.5% bf16 MFU | 125498 tok/s step 13529/19560 | loss 3.350366 (+0.50z)| norm 0.2705 (-0.68z)| lr 1.39e-04 | 4171.89 ms | 32.4% bf16 MFU | 125506 tok/s step 13530/19560 | loss 3.344748 (+0.34z)| norm 0.3208 (+2.20z)| lr 1.39e-04 | 4164.24 ms | 32.4% bf16 MFU | 125526 tok/s step 13531/19560 | loss 3.334047 (+0.04z)| norm 0.2821 (-0.02z)| lr 1.39e-04 | 4178.05 ms | 32.3% bf16 MFU | 125524 tok/s step 13532/19560 | loss 3.365938 (+0.93z)| norm 0.2933 (+0.62z)| lr 1.39e-04 | 4161.82 ms | 32.4% bf16 MFU | 125547 tok/s step 13533/19560 | loss 
3.383498 (+1.47z)| norm 0.2742 (-0.46z)| lr 1.39e-04 | 4180.31 ms | 32.3% bf16 MFU | 125540 tok/s step 13534/19560 | loss 3.345602 (+0.38z)| norm 0.3101 (+1.60z)| lr 1.39e-04 | 4164.38 ms | 32.4% bf16 MFU | 125558 tok/s step 13535/19560 | loss 3.292983 (-1.13z)| norm 0.2682 (-0.81z)| lr 1.39e-04 | 4191.23 ms | 32.2% bf16 MFU | 125535 tok/s step 13536/19560 | loss 3.346827 (+0.40z)| norm 0.2977 (+0.88z)| lr 1.39e-04 | 4176.02 ms | 32.3% bf16 MFU | 125536 tok/s step 13537/19560 | loss 3.338178 (+0.15z)| norm 0.2726 (-0.56z)| lr 1.39e-04 | 4161.58 ms | 32.4% bf16 MFU | 125558 tok/s step 13538/19560 | loss 3.330953 (-0.05z)| norm 0.2793 (-0.19z)| lr 1.39e-04 | 4167.02 ms | 32.4% bf16 MFU | 125571 tok/s step 13539/19560 | loss 3.387014 (+1.55z)| norm 0.2752 (-0.42z)| lr 1.39e-04 | 4174.20 ms | 32.3% bf16 MFU | 125573 tok/s step 13540/19560 | loss 3.259250 (-2.09z)| norm 0.2622 (-1.16z)| lr 1.39e-04 | 4162.72 ms | 32.4% bf16 MFU | 125591 tok/s step 13541/19560 | loss 3.360354 (+0.79z)| norm 0.2843 (+0.11z)| lr 1.39e-04 | 4185.62 ms | 32.3% bf16 MFU | 125575 tok/s step 13542/19560 | loss 3.343211 (+0.30z)| norm 0.2759 (-0.37z)| lr 1.39e-04 | 4167.98 ms | 32.4% bf16 MFU | 125585 tok/s step 13543/19560 | loss 3.355239 (+0.63z)| norm 0.2669 (-0.89z)| lr 1.39e-04 | 4186.10 ms | 32.3% bf16 MFU | 125568 tok/s step 13544/19560 | loss 3.309332 (-0.68z)| norm 0.2771 (-0.31z)| lr 1.38e-04 | 4172.29 ms | 32.4% bf16 MFU | 125573 tok/s step 13545/19560 | loss 3.325813 (-0.19z)| norm 0.2739 (-0.49z)| lr 1.38e-04 | 4177.47 ms | 32.3% bf16 MFU | 125570 tok/s step 13546/19560 | loss 3.312542 (-0.57z)| norm 0.2801 (-0.14z)| lr 1.38e-04 | 4169.89 ms | 32.4% bf16 MFU | 125578 tok/s step 13547/19560 | loss 3.292015 (-1.17z)| norm 0.2796 (-0.17z)| lr 1.38e-04 | 4191.10 ms | 32.2% bf16 MFU | 125554 tok/s step 13548/19560 | loss 3.362602 (+0.91z)| norm 0.2851 (+0.14z)| lr 1.38e-04 | 4171.67 ms | 32.4% bf16 MFU | 125560 tok/s step 13549/19560 | loss 3.355489 (+0.69z)| norm 0.2757 (-0.40z)| lr 
1.38e-04 | 4159.06 ms | 32.5% bf16 MFU | 125585 tok/s step 13550/19560 | loss 3.371024 (+1.13z)| norm 0.2508 (-1.84z)| lr 1.38e-04 | 4171.62 ms | 32.4% bf16 MFU | 125589 tok/s step 13551/19560 | loss 3.351391 (+0.55z)| norm 0.2732 (-0.53z)| lr 1.38e-04 | 4167.82 ms | 32.4% bf16 MFU | 125600 tok/s step 13552/19560 | loss 3.295885 (-1.09z)| norm 0.2693 (-0.77z)| lr 1.38e-04 | 4181.54 ms | 32.3% bf16 MFU | 125589 tok/s step 13553/19560 | loss 3.299903 (-0.98z)| norm 0.2843 (+0.09z)| lr 1.38e-04 | 4175.36 ms | 32.3% bf16 MFU | 125588 tok/s step 13554/19560 | loss 3.378458 (+1.37z)| norm 0.2702 (-0.72z)| lr 1.38e-04 | 4165.11 ms | 32.4% bf16 MFU | 125602 tok/s step 13555/19560 | loss 3.308365 (-0.71z)| norm 0.2635 (-1.10z)| lr 1.38e-04 | 4178.58 ms | 32.3% bf16 MFU | 125596 tok/s step 13556/19560 | loss 3.325886 (-0.18z)| norm 0.2723 (-0.59z)| lr 1.38e-04 | 4222.57 ms | 32.0% bf16 MFU | 125524 tok/s step 13557/19560 | loss 3.458925 (+3.59z)| norm 0.2957 (+0.76z)| lr 1.38e-04 | 4173.81 ms | 32.3% bf16 MFU | 125528 tok/s step 13558/19560 | loss 3.320626 (-0.36z)| norm 0.2846 (+0.12z)| lr 1.38e-04 | 4173.25 ms | 32.4% bf16 MFU | 125533 tok/s step 13559/19560 | loss 3.319706 (-0.40z)| norm 0.2855 (+0.16z)| lr 1.38e-04 | 4175.76 ms | 32.3% bf16 MFU | 125535 tok/s step 13560/19560 | loss 3.324147 (-0.28z)| norm 0.2610 (-1.24z)| lr 1.38e-04 | 4161.97 ms | 32.4% bf16 MFU | 125556 tok/s step 13561/19560 | loss 3.340055 (+0.18z)| norm 0.2940 (+0.67z)| lr 1.38e-04 | 4155.73 ms | 32.5% bf16 MFU | 125587 tok/s step 13562/19560 | loss 3.366972 (+0.96z)| norm 0.2807 (-0.10z)| lr 1.38e-04 | 4159.53 ms | 32.5% bf16 MFU | 125610 tok/s step 13563/19560 | loss 3.301338 (-0.96z)| norm 0.2525 (-1.72z)| lr 1.38e-04 | 4170.04 ms | 32.4% bf16 MFU | 125615 tok/s step 13564/19560 | loss 3.283440 (-1.47z)| norm 0.2749 (-0.42z)| lr 1.38e-04 | 4169.62 ms | 32.4% bf16 MFU | 125622 tok/s step 13565/19560 | loss 3.349794 (+0.46z)| norm 0.2626 (-1.14z)| lr 1.38e-04 | 4161.39 ms | 32.4% bf16 MFU | 125640 
tok/s step 13566/19560 | loss 3.310530 (-0.67z)| norm 0.2697 (-0.72z)| lr 1.38e-04 | 4176.51 ms | 32.3% bf16 MFU | 125635 tok/s step 13567/19560 | loss 3.356339 (+0.66z)| norm 0.2740 (-0.47z)| lr 1.38e-04 | 4160.33 ms | 32.5% bf16 MFU | 125654 tok/s step 13568/19560 | loss 3.317743 (-0.47z)| norm 0.2790 (-0.18z)| lr 1.37e-04 | 4160.23 ms | 32.5% bf16 MFU | 125672 tok/s step 13569/19560 | loss 3.335659 (+0.05z)| norm 0.2687 (-0.76z)| lr 1.37e-04 | 4169.47 ms | 32.4% bf16 MFU | 125676 tok/s step 13570/19560 | loss 3.299206 (-1.03z)| norm 0.2748 (-0.41z)| lr 1.37e-04 | 4172.74 ms | 32.4% bf16 MFU | 125674 tok/s step 13571/19560 | loss 3.331954 (-0.07z)| norm 0.2831 (+0.07z)| lr 1.37e-04 | 4168.26 ms | 32.4% bf16 MFU | 125680 tok/s step 13572/19560 | loss 3.314919 (-0.57z)| norm 0.2653 (-0.95z)| lr 1.37e-04 | 4159.47 ms | 32.5% bf16 MFU | 125698 tok/s step 13573/19560 | loss 3.342052 (+0.22z)| norm 0.2817 (-0.00z)| lr 1.37e-04 | 4175.79 ms | 32.3% bf16 MFU | 125691 tok/s step 13574/19560 | loss 3.333175 (-0.04z)| norm 0.2687 (-0.74z)| lr 1.37e-04 | 4175.48 ms | 32.3% bf16 MFU | 125685 tok/s step 13575/19560 | loss 3.291974 (-1.23z)| norm 0.2669 (-0.83z)| lr 1.37e-04 | 4166.79 ms | 32.4% bf16 MFU | 125692 tok/s step 13576/19560 | loss 3.299019 (-1.01z)| norm 0.2782 (-0.19z)| lr 1.37e-04 | 4163.47 ms | 32.4% bf16 MFU | 125703 tok/s step 13577/19560 | loss 3.330213 (-0.11z)| norm 0.2791 (-0.14z)| lr 1.37e-04 | 4175.57 ms | 32.3% bf16 MFU | 125696 tok/s step 13578/19560 | loss 3.325957 (-0.25z)| norm 0.2763 (-0.31z)| lr 1.37e-04 | 4166.21 ms | 32.4% bf16 MFU | 125704 tok/s step 13579/19560 | loss 3.330194 (-0.13z)| norm 0.2832 (+0.09z)| lr 1.37e-04 | 4164.44 ms | 32.4% bf16 MFU | 125713 tok/s step 13580/19560 | loss 3.323937 (-0.32z)| norm 0.2620 (-1.12z)| lr 1.37e-04 | 4208.25 ms | 32.1% bf16 MFU | 125657 tok/s step 13581/19560 | loss 3.260594 (-2.16z)| norm 0.2700 (-0.66z)| lr 1.37e-04 | 4173.51 ms | 32.4% bf16 MFU | 125655 tok/s step 13582/19560 | loss 3.335869 
(+0.05z)| norm 0.2901 (+0.48z)| lr 1.37e-04 | 4159.37 ms | 32.5% bf16 MFU | 125675 tok/s step 13583/19560 | loss 3.348684 (+0.42z)| norm 0.2728 (-0.52z)| lr 1.37e-04 | 4173.48 ms | 32.4% bf16 MFU | 125672 tok/s step 13584/19560 | loss 3.295021 (-1.15z)| norm 0.2601 (-1.24z)| lr 1.37e-04 | 4165.02 ms | 32.4% bf16 MFU | 125683 tok/s step 13585/19560 | loss 3.399819 (+1.90z)| norm 0.2891 (+0.42z)| lr 1.37e-04 | 4160.32 ms | 32.5% bf16 MFU | 125700 tok/s step 13586/19560 | loss 3.378071 (+1.24z)| norm 0.2694 (-0.73z)| lr 1.37e-04 | 4164.55 ms | 32.4% bf16 MFU | 125709 tok/s step 13587/19560 | loss 3.341348 (+0.18z)| norm 0.2466 (-2.00z)| lr 1.37e-04 | 4217.48 ms | 32.0% bf16 MFU | 125639 tok/s step 13588/19560 | loss 3.339282 (+0.12z)| norm 0.2769 (-0.28z)| lr 1.37e-04 | 4261.66 ms | 31.7% bf16 MFU | 125509 tok/s step 13589/19560 | loss 3.318573 (-0.49z)| norm 0.2596 (-1.25z)| lr 1.37e-04 | 4158.26 ms | 32.5% bf16 MFU | 125537 tok/s step 13590/19560 | loss 3.454672 (+3.33z)| norm 0.2769 (-0.27z)| lr 1.37e-04 | 4175.01 ms | 32.3% bf16 MFU | 125539 tok/s step 13591/19560 | loss 3.332630 (-0.10z)| norm 0.2658 (-0.89z)| lr 1.37e-04 | 4187.44 ms | 32.2% bf16 MFU | 125523 tok/s step 13592/19560 | loss 3.330129 (-0.16z)| norm 0.2691 (-0.71z)| lr 1.36e-04 | 4180.65 ms | 32.3% bf16 MFU | 125517 tok/s step 13593/19560 | loss 3.285255 (-1.41z)| norm 0.2603 (-1.20z)| lr 1.36e-04 | 4157.47 ms | 32.5% bf16 MFU | 125546 tok/s step 13594/19560 | loss 3.312663 (-0.64z)| norm 0.2878 (+0.35z)| lr 1.36e-04 | 4154.20 ms | 32.5% bf16 MFU | 125579 tok/s step 13595/19560 | loss 3.402518 (+1.83z)| norm 0.2862 (+0.26z)| lr 1.36e-04 | 4166.69 ms | 32.4% bf16 MFU | 125592 tok/s step 13596/19560 | loss 3.345795 (+0.28z)| norm 0.2744 (-0.41z)| lr 1.36e-04 | 4174.50 ms | 32.3% bf16 MFU | 125592 tok/s step 13597/19560 | loss 3.416403 (+2.20z)| norm 0.3026 (+1.19z)| lr 1.36e-04 | 4169.42 ms | 32.4% bf16 MFU | 125600 tok/s step 13598/19560 | loss 3.334344 (-0.07z)| norm 0.2781 (-0.19z)| lr 1.36e-04 | 
4163.63 ms | 32.4% bf16 MFU | 125616 tok/s step 13599/19560 | loss 3.275438 (-1.73z)| norm 0.2821 (+0.03z)| lr 1.36e-04 | 4157.98 ms | 32.5% bf16 MFU | 125640 tok/s step 13600/19560 | loss 3.263639 (-2.01z)| norm 0.2614 (-1.13z)| lr 1.36e-04 | 4165.02 ms | 32.4% bf16 MFU | 125652 tok/s step 13601/19560 | loss 3.261017 (-2.03z)| norm 0.2791 (-0.14z)| lr 1.36e-04 | 4170.05 ms | 32.4% bf16 MFU | 125655 tok/s step 13602/19560 | loss 3.344547 (+0.23z)| norm 0.2639 (-1.00z)| lr 1.36e-04 | 4163.01 ms | 32.4% bf16 MFU | 125670 tok/s step 13603/19560 | loss 3.314511 (-0.59z)| norm 0.2579 (-1.32z)| lr 1.36e-04 | 4172.99 ms | 32.4% bf16 MFU | 125668 tok/s step 13604/19560 | loss 3.338951 (+0.07z)| norm 0.2812 (+0.01z)| lr 1.36e-04 | 4167.05 ms | 32.4% bf16 MFU | 125675 tok/s step 13605/19560 | loss 3.367666 (+0.85z)| norm 0.2783 (-0.15z)| lr 1.36e-04 | 4160.86 ms | 32.4% bf16 MFU | 125692 tok/s step 13606/19560 | loss 3.327450 (-0.25z)| norm 0.2742 (-0.38z)| lr 1.36e-04 | 4189.45 ms | 32.2% bf16 MFU | 125665 tok/s step 13607/19560 | loss 3.326859 (-0.26z)| norm 0.3012 (+1.19z)| lr 1.36e-04 | 4173.87 ms | 32.3% bf16 MFU | 125662 tok/s step 13608/19560 | loss 3.404257 (+1.83z)| norm 0.2751 (-0.32z)| lr 1.36e-04 | 4164.85 ms | 32.4% bf16 MFU | 125673 tok/s step 13609/19560 | loss 3.435887 (+2.60z)| norm 0.2785 (-0.12z)| lr 1.36e-04 | 4161.64 ms | 32.4% bf16 MFU | 125688 tok/s step 13610/19560 | loss 3.404017 (+1.72z)| norm 0.2780 (-0.15z)| lr 1.36e-04 | 4166.46 ms | 32.4% bf16 MFU | 125696 tok/s step 13611/19560 | loss 3.326123 (-0.31z)| norm 0.2586 (-1.26z)| lr 1.36e-04 | 4171.94 ms | 32.4% bf16 MFU | 125694 tok/s step 13612/19560 | loss 3.387847 (+1.32z)| norm 0.2805 (+0.02z)| lr 1.36e-04 | 4162.58 ms | 32.4% bf16 MFU | 125707 tok/s step 13613/19560 | loss 3.340745 (+0.09z)| norm 0.2663 (-0.81z)| lr 1.36e-04 | 4210.78 ms | 32.1% bf16 MFU | 125648 tok/s step 13614/19560 | loss 3.458485 (+3.08z)| norm 0.2815 (+0.08z)| lr 1.36e-04 | 4167.21 ms | 32.4% bf16 MFU | 125656 tok/s step 
13615/19560 | loss 3.328147 (-0.26z)| norm 0.2830 (+0.18z)| lr 1.36e-04 | 4161.05 ms | 32.4% bf16 MFU | 125673 tok/s step 13616/19560 | loss 3.342380 (+0.11z)| norm 0.2769 (-0.18z)| lr 1.35e-04 | 4166.74 ms | 32.4% bf16 MFU | 125681 tok/s step 13617/19560 | loss 3.351573 (+0.36z)| norm 0.2720 (-0.46z)| lr 1.35e-04 | 4165.01 ms | 32.4% bf16 MFU | 125691 tok/s step 13618/19560 | loss 3.412804 (+1.89z)| norm 0.2706 (-0.54z)| lr 1.35e-04 | 4156.89 ms | 32.5% bf16 MFU | 125712 tok/s step 13619/19560 | loss 3.310823 (-0.69z)| norm 0.2577 (-1.28z)| lr 1.35e-04 | 4153.37 ms | 32.5% bf16 MFU | 125738 tok/s step 13620/19560 | loss 3.386180 (+1.21z)| norm 0.2583 (-1.23z)| lr 1.35e-04 | 4153.38 ms | 32.5% bf16 MFU | 125763 tok/s step 13621/19560 | loss 3.357764 (+0.48z)| norm 0.2662 (-0.76z)| lr 1.35e-04 | 4171.69 ms | 32.4% bf16 MFU | 125759 tok/s step 13622/19560 | loss 3.282012 (-1.41z)| norm 0.2708 (-0.47z)| lr 1.35e-04 | 4172.38 ms | 32.4% bf16 MFU | 125754 tok/s step 13623/19560 | loss 3.329677 (-0.21z)| norm 0.2701 (-0.50z)| lr 1.35e-04 | 4160.73 ms | 32.5% bf16 MFU | 125766 tok/s step 13624/19560 | loss 3.350112 (+0.30z)| norm 0.2818 (+0.23z)| lr 1.35e-04 | 4169.22 ms | 32.4% bf16 MFU | 125766 tok/s step 13625/19560 | loss 3.389907 (+1.29z)| norm 0.2772 (-0.04z)| lr 1.35e-04 | 4180.00 ms | 32.3% bf16 MFU | 125749 tok/s step 13626/19560 | loss 3.303093 (-0.88z)| norm 0.2869 (+0.59z)| lr 1.35e-04 | 4162.07 ms | 32.4% bf16 MFU | 125760 tok/s step 13627/19560 | loss 3.339175 (+0.04z)| norm 0.2635 (-0.89z)| lr 1.35e-04 | 4165.49 ms | 32.4% bf16 MFU | 125765 tok/s step 13628/19560 | loss 3.349321 (+0.29z)| norm 0.2885 (+0.82z)| lr 1.35e-04 | 4177.23 ms | 32.3% bf16 MFU | 125752 tok/s step 13629/19560 | loss 3.392888 (+1.37z)| norm 0.2872 (+0.73z)| lr 1.35e-04 | 4170.25 ms | 32.4% bf16 MFU | 125751 tok/s step 13630/19560 | loss 3.344788 (+0.16z)| norm 0.2932 (+1.19z)| lr 1.35e-04 | 4159.24 ms | 32.5% bf16 MFU | 125766 tok/s step 13631/19560 | loss 3.396674 (+1.47z)| norm 
0.2674 (-0.66z)| lr 1.35e-04 | 4165.15 ms | 32.4% bf16 MFU | 125771 tok/s step 13632/19560 | loss 3.412418 (+1.83z)| norm 0.2782 (+0.13z)| lr 1.35e-04 | 4181.34 ms | 32.3% bf16 MFU | 125752 tok/s step 13633/19560 | loss 3.379823 (+1.00z)| norm 0.2843 (+0.64z)| lr 1.35e-04 | 4154.26 ms | 32.5% bf16 MFU | 125775 tok/s step 13634/19560 | loss 3.273036 (-1.64z)| norm 0.3058 (+2.25z)| lr 1.35e-04 | 4178.64 ms | 32.3% bf16 MFU | 125759 tok/s step 13635/19560 | loss 3.342096 (+0.07z)| norm 0.2646 (-0.89z)| lr 1.35e-04 | 4185.45 ms | 32.3% bf16 MFU | 125735 tok/s step 13636/19560 | loss 3.311144 (-0.69z)| norm 0.3040 (+2.17z)| lr 1.35e-04 | 4163.63 ms | 32.4% bf16 MFU | 125744 tok/s step 13637/19560 | loss 3.289116 (-1.25z)| norm 0.2630 (-0.99z)| lr 1.35e-04 | 4162.36 ms | 32.4% bf16 MFU | 125755 tok/s step 13638/19560 | loss 3.332496 (-0.18z)| norm 0.2898 (+1.09z)| lr 1.35e-04 | 4157.75 ms | 32.5% bf16 MFU | 125772 tok/s step 13639/19560 | loss 3.306365 (-0.82z)| norm 0.2531 (-1.77z)| lr 1.35e-04 | 4163.31 ms | 32.4% bf16 MFU | 125780 tok/s step 13640/19560 | loss 3.372170 (+0.82z)| norm 0.2873 (+0.89z)| lr 1.34e-04 | 4173.10 ms | 32.4% bf16 MFU | 125773 tok/s step 13641/19560 | loss 3.297777 (-1.03z)| norm 0.2461 (-2.29z)| lr 1.34e-04 | 4154.90 ms | 32.5% bf16 MFU | 125793 tok/s step 13642/19560 | loss 3.403018 (+1.57z)| norm 0.3132 (+2.77z)| lr 1.34e-04 | 4174.89 ms | 32.3% bf16 MFU | 125783 tok/s step 13643/19560 | loss 3.379696 (+0.98z)| norm 0.2668 (-0.72z)| lr 1.34e-04 | 4168.37 ms | 32.4% bf16 MFU | 125782 tok/s step 13644/19560 | loss 3.303293 (-0.91z)| norm 0.3047 (+2.11z)| lr 1.34e-04 | 4174.24 ms | 32.3% bf16 MFU | 125773 tok/s step 13645/19560 | loss 3.380308 (+0.99z)| norm 0.2688 (-0.57z)| lr 1.34e-04 | 4209.19 ms | 32.1% bf16 MFU | 125713 tok/s step 13646/19560 | loss 3.355387 (+0.38z)| norm 0.2760 (-0.02z)| lr 1.34e-04 | 4181.21 ms | 32.3% bf16 MFU | 125697 tok/s step 13647/19560 | loss 3.358652 (+0.45z)| norm 0.2771 (+0.07z)| lr 1.34e-04 | 4199.10 ms | 
32.2% bf16 MFU | 125655 tok/s step 13648/19560 | loss 3.421637 (+1.96z)| norm 0.2730 (-0.23z)| lr 1.34e-04 | 4163.99 ms | 32.4% bf16 MFU | 125667 tok/s step 13649/19560 | loss 3.330063 (-0.27z)| norm 0.2901 (+1.05z)| lr 1.34e-04 | 4165.87 ms | 32.4% bf16 MFU | 125677 tok/s step 13650/19560 | loss 3.417335 (+1.82z)| norm 0.2536 (-1.67z)| lr 1.34e-04 | 4175.35 ms | 32.3% bf16 MFU | 125671 tok/s step 13651/19560 | loss 3.381613 (+0.94z)| norm 0.2718 (-0.31z)| lr 1.34e-04 | 4167.01 ms | 32.4% bf16 MFU | 125679 tok/s step 13652/19560 | loss 3.318762 (-0.58z)| norm 0.2924 (+1.21z)| lr 1.34e-04 | 4173.36 ms | 32.4% bf16 MFU | 125676 tok/s step 13653/19560 | loss 3.354217 (+0.28z)| norm 0.3217 (+3.24z)| lr 1.34e-04 | 4849.87 ms | 27.8% bf16 MFU | 124797 tok/s step 13654/19560 | loss 3.328936 (-0.34z)| norm 0.2679 (-0.62z)| lr 1.34e-04 | 4167.75 ms | 32.4% bf16 MFU | 124847 tok/s step 13655/19560 | loss 3.426435 (+2.01z)| norm 0.2645 (-0.87z)| lr 1.34e-04 | 4178.36 ms | 32.3% bf16 MFU | 124879 tok/s step 13656/19560 | loss 3.389268 (+1.10z)| norm 0.2810 (+0.33z)| lr 1.34e-04 | 4186.13 ms | 32.3% bf16 MFU | 124897 tok/s step 13657/19560 | loss 3.381649 (+0.90z)| norm 0.2727 (-0.28z)| lr 1.34e-04 | 4168.73 ms | 32.4% bf16 MFU | 124941 tok/s step 13658/19560 | loss 3.375544 (+0.75z)| norm 0.2725 (-0.27z)| lr 1.34e-04 | 4921.36 ms | 27.4% bf16 MFU | 124020 tok/s step 13659/19560 | loss 3.333083 (-0.26z)| norm 0.2757 (-0.03z)| lr 1.34e-04 | 4169.30 ms | 32.4% bf16 MFU | 124107 tok/s step 13660/19560 | loss 3.364891 (+0.50z)| norm 0.2723 (-0.28z)| lr 1.34e-04 | 4172.82 ms | 32.4% bf16 MFU | 124183 tok/s step 13661/19560 | loss 3.504156 (+3.61z)| norm 0.2704 (-0.42z)| lr 1.34e-04 | 4215.89 ms | 32.0% bf16 MFU | 124192 tok/s step 13662/19560 | loss 3.352426 (+0.17z)| norm 0.2671 (-0.66z)| lr 1.34e-04 | 4186.71 ms | 32.2% bf16 MFU | 124244 tok/s step 13663/19560 | loss 3.439745 (+2.10z)| norm 0.2724 (-0.25z)| lr 1.34e-04 | 4199.80 ms | 32.1% bf16 MFU | 124274 tok/s step 13664/19560 
| loss 3.368976 (+0.51z)| norm 0.2621 (-1.04z)| lr 1.33e-04 | 4156.10 ms | 32.5% bf16 MFU | 124367 tok/s step 13665/19560 | loss 3.332336 (-0.31z)| norm 0.2793 (+0.31z)| lr 1.33e-04 | 4163.80 ms | 32.4% bf16 MFU | 124445 tok/s step 13666/19560 | loss 3.323875 (-0.50z)| norm 0.2668 (-0.67z)| lr 1.33e-04 | 4168.77 ms | 32.4% bf16 MFU | 124511 tok/s step 13667/19560 | loss 3.353290 (+0.16z)| norm 0.2780 (+0.21z)| lr 1.33e-04 | 4182.57 ms | 32.3% bf16 MFU | 124553 tok/s step 13668/19560 | loss 3.361446 (+0.33z)| norm 0.2674 (-0.62z)| lr 1.33e-04 | 4181.71 ms | 32.3% bf16 MFU | 124594 tok/s step 13669/19560 | loss 3.286579 (-1.35z)| norm 0.2756 (+0.02z)| lr 1.33e-04 | 4186.70 ms | 32.2% bf16 MFU | 124626 tok/s step 13670/19560 | loss 3.324662 (-0.48z)| norm 0.2618 (-1.05z)| lr 1.33e-04 | 4172.48 ms | 32.4% bf16 MFU | 124677 tok/s step 13671/19560 | loss 3.320439 (-0.57z)| norm 0.2802 (+0.38z)| lr 1.33e-04 | 4191.94 ms | 32.2% bf16 MFU | 124697 tok/s step 13672/19560 | loss 3.328396 (-0.40z)| norm 0.2538 (-1.66z)| lr 1.33e-04 | 4169.24 ms | 32.4% bf16 MFU | 124749 tok/s step 13673/19560 | loss 3.340735 (-0.12z)| norm 0.2659 (-0.71z)| lr 1.33e-04 | 4158.25 ms | 32.5% bf16 MFU | 124816 tok/s step 13674/19560 | loss 3.330002 (-0.37z)| norm 0.2742 (-0.06z)| lr 1.33e-04 | 4170.74 ms | 32.4% bf16 MFU | 124861 tok/s step 13675/19560 | loss 3.310234 (-0.82z)| norm 0.2694 (-0.43z)| lr 1.33e-04 | 4188.64 ms | 32.2% bf16 MFU | 124876 tok/s step 13676/19560 | loss 3.369495 (+0.53z)| norm 0.2650 (-0.76z)| lr 1.33e-04 | 4189.37 ms | 32.2% bf16 MFU | 124890 tok/s step 13677/19560 | loss 3.314152 (-0.72z)| norm 0.2599 (-1.14z)| lr 1.33e-04 | 4162.84 ms | 32.4% bf16 MFU | 124942 tok/s step 13678/19560 | loss 3.371679 (+0.58z)| norm 0.2627 (-0.94z)| lr 1.33e-04 | 4196.24 ms | 32.2% bf16 MFU | 124942 tok/s step 13679/19560 | loss 3.314790 (-0.70z)| norm 0.2642 (-0.81z)| lr 1.33e-04 | 4167.95 ms | 32.4% bf16 MFU | 124985 tok/s step 13680/19560 | loss 3.348124 (+0.04z)| norm 0.2710 (-0.28z)| 
lr 1.33e-04 | 4177.03 ms | 32.3% bf16 MFU | 125011 tok/s step 13681/19560 | loss 3.326765 (-0.45z)| norm 0.2653 (-0.72z)| lr 1.33e-04 | 4170.62 ms | 32.4% bf16 MFU | 125046 tok/s step 13682/19560 | loss 3.324101 (-0.50z)| norm 0.2684 (-0.48z)| lr 1.33e-04 | 4178.84 ms | 32.3% bf16 MFU | 125067 tok/s step 13683/19560 | loss 3.433267 (+1.95z)| norm 0.2632 (-0.88z)| lr 1.33e-04 | 4183.51 ms | 32.3% bf16 MFU | 125080 tok/s step 13684/19560 | loss 3.371269 (+0.54z)| norm 0.2743 (-0.02z)| lr 1.33e-04 | 4173.76 ms | 32.3% bf16 MFU | 125107 tok/s step 13685/19560 | loss 3.325718 (-0.48z)| norm 0.2747 (+0.02z)| lr 1.33e-04 | 4180.35 ms | 32.3% bf16 MFU | 125122 tok/s step 13686/19560 | loss 3.375956 (+0.68z)| norm 0.2710 (-0.26z)| lr 1.33e-04 | 4181.08 ms | 32.3% bf16 MFU | 125136 tok/s step 13687/19560 | loss 3.329133 (-0.41z)| norm 0.2807 (+0.52z)| lr 1.33e-04 | 4183.90 ms | 32.3% bf16 MFU | 125145 tok/s step 13688/19560 | loss 3.373592 (+0.61z)| norm 0.2777 (+0.26z)| lr 1.32e-04 | 4182.12 ms | 32.3% bf16 MFU | 125156 tok/s step 13689/19560 | loss 3.311615 (-0.82z)| norm 0.2604 (-1.09z)| lr 1.32e-04 | 4176.64 ms | 32.3% bf16 MFU | 125174 tok/s step 13690/19560 | loss 3.347884 (+0.03z)| norm 0.2773 (+0.26z)| lr 1.32e-04 | 4177.45 ms | 32.3% bf16 MFU | 125191 tok/s step 13691/19560 | loss 3.331286 (-0.36z)| norm 0.2749 (+0.05z)| lr 1.32e-04 | 4151.96 ms | 32.5% bf16 MFU | 125245 tok/s step 13692/19560 | loss 3.335183 (-0.29z)| norm 0.2850 (+0.87z)| lr 1.32e-04 | 4203.07 ms | 32.1% bf16 MFU | 125220 tok/s step 13693/19560 | loss 3.330360 (-0.40z)| norm 0.2748 (+0.03z)| lr 1.32e-04 | 4166.71 ms | 32.4% bf16 MFU | 125250 tok/s step 13694/19560 | loss 3.400592 (+1.23z)| norm 0.2804 (+0.48z)| lr 1.32e-04 | 4175.99 ms | 32.3% bf16 MFU | 125265 tok/s step 13695/19560 | loss 3.304941 (-0.99z)| norm 0.2869 (+0.99z)| lr 1.32e-04 | 4179.31 ms | 32.3% bf16 MFU | 125274 tok/s step 13696/19560 | loss 3.400771 (+1.22z)| norm 0.2686 (-0.47z)| lr 1.32e-04 | 4164.27 ms | 32.4% bf16 MFU | 
125306 tok/s step 13697/19560 | loss 3.328482 (-0.46z)| norm 0.2784 (+0.31z)| lr 1.32e-04 | 4177.61 ms | 32.3% bf16 MFU | 125315 tok/s step 13698/19560 | loss 3.374032 (+0.59z)| norm 0.2804 (+0.47z)| lr 1.32e-04 | 4174.95 ms | 32.3% bf16 MFU | 125328 tok/s step 13699/19560 | loss 3.383452 (+0.79z)| norm 0.2901 (+1.23z)| lr 1.32e-04 | 4166.03 ms | 32.4% bf16 MFU | 125354 tok/s step 13700/19560 | loss 3.308574 (-0.94z)| norm 0.2743 (-0.04z)| lr 1.32e-04 | 4171.41 ms | 32.4% bf16 MFU | 125371 tok/s step 13701/19560 | loss 3.351344 (+0.05z)| norm 0.3076 (+2.55z)| lr 1.32e-04 | 4183.48 ms | 32.3% bf16 MFU | 125369 tok/s step 13702/19560 | loss 3.353095 (+0.09z)| norm 0.2710 (-0.31z)| lr 1.32e-04 | 4336.18 ms | 31.1% bf16 MFU | 125146 tok/s step 13703/19560 | loss 3.344721 (-0.12z)| norm 0.2609 (-1.09z)| lr 1.32e-04 | 4170.26 ms | 32.4% bf16 MFU | 125174 tok/s step 13704/19560 | loss 3.355334 (+0.12z)| norm 0.2732 (-0.13z)| lr 1.32e-04 | 4167.08 ms | 32.4% bf16 MFU | 125207 tok/s step 13705/19560 | loss 3.431056 (+1.85z)| norm 0.2742 (-0.05z)| lr 1.32e-04 | 4173.85 ms | 32.3% bf16 MFU | 125227 tok/s step 13706/19560 | loss 3.362196 (+0.25z)| norm 0.2902 (+1.19z)| lr 1.32e-04 | 4180.41 ms | 32.3% bf16 MFU | 125236 tok/s step 13707/19560 | loss 3.326316 (-0.58z)| norm 0.2809 (+0.47z)| lr 1.32e-04 | 4177.78 ms | 32.3% bf16 MFU | 125249 tok/s step 13708/19560 | loss 3.320841 (-0.70z)| norm 0.3070 (+2.42z)| lr 1.32e-04 | 4255.32 ms | 31.7% bf16 MFU | 125147 tok/s step 13709/19560 | loss 3.318581 (-0.78z)| norm 0.3053 (+2.22z)| lr 1.32e-04 | 4197.50 ms | 32.2% bf16 MFU | 125135 tok/s step 13710/19560 | loss 3.354250 (+0.06z)| norm 0.2842 (+0.65z)| lr 1.32e-04 | 4233.67 ms | 31.9% bf16 MFU | 125070 tok/s step 13711/19560 | loss 3.348536 (-0.08z)| norm 0.2788 (+0.25z)| lr 1.32e-04 | 4165.71 ms | 32.4% bf16 MFU | 125110 tok/s step 13712/19560 | loss 3.385049 (+0.77z)| norm 0.2631 (-0.94z)| lr 1.31e-04 | 4155.85 ms | 32.5% bf16 MFU | 125162 tok/s step 13713/19560 | loss 3.353180 
(+0.02z)| norm 0.2706 (-0.36z)| lr 1.31e-04 | 4177.10 ms | 32.3% bf16 MFU | 125180 tok/s step 13714/19560 | loss 3.250685 (-2.34z)| norm 0.2574 (-1.35z)| lr 1.31e-04 | 4167.84 ms | 32.4% bf16 MFU | 125210 tok/s step 13715/19560 | loss 3.380740 (+0.68z)| norm 0.2800 (+0.34z)| lr 1.31e-04 | 4151.31 ms | 32.5% bf16 MFU | 125264 tok/s step 13716/19560 | loss 3.305254 (-1.06z)| norm 0.2490 (-1.98z)| lr 1.31e-04 | 4206.47 ms | 32.1% bf16 MFU | 125233 tok/s step 13717/19560 | loss 3.310559 (-0.94z)| norm 0.2707 (-0.36z)| lr 1.31e-04 | 4184.89 ms | 32.3% bf16 MFU | 125236 tok/s step 13718/19560 | loss 3.324416 (-0.61z)| norm 0.2649 (-0.79z)| lr 1.31e-04 | 4160.68 ms | 32.5% bf16 MFU | 125274 tok/s step 13719/19560 | loss 3.466954 (+2.66z)| norm 0.2818 (+0.48z)| lr 1.31e-04 | 4162.97 ms | 32.4% bf16 MFU | 125308 tok/s step 13720/19560 | loss 3.453766 (+2.29z)| norm 0.2755 (-0.00z)| lr 1.31e-04 | 4174.24 ms | 32.3% bf16 MFU | 125322 tok/s step 13721/19560 | loss 3.316037 (-0.82z)| norm 0.2857 (+0.76z)| lr 1.31e-04 | 4177.14 ms | 32.3% bf16 MFU | 125332 tok/s step 13722/19560 | loss 3.397623 (+1.01z)| norm 0.2821 (+0.49z)| lr 1.31e-04 | 4159.61 ms | 32.5% bf16 MFU | 125367 tok/s step 13723/19560 | loss 3.277697 (-1.67z)| norm 0.2864 (+0.81z)| lr 1.31e-04 | 4162.74 ms | 32.4% bf16 MFU | 125396 tok/s step 13724/19560 | loss 3.383877 (+0.71z)| norm 0.2673 (-0.63z)| lr 1.31e-04 | 4196.68 ms | 32.2% bf16 MFU | 125373 tok/s step 13725/19560 | loss 3.447161 (+2.10z)| norm 0.2934 (+1.36z)| lr 1.31e-04 | 4181.94 ms | 32.3% bf16 MFU | 125373 tok/s step 13726/19560 | loss 3.313257 (-0.87z)| norm 0.2612 (-1.09z)| lr 1.31e-04 | 4177.65 ms | 32.3% bf16 MFU | 125379 tok/s step 13727/19560 | loss 3.343256 (-0.22z)| norm 0.2810 (+0.42z)| lr 1.31e-04 | 4486.70 ms | 30.1% bf16 MFU | 124953 tok/s step 13728/19560 | loss 3.363929 (+0.23z)| norm 0.2733 (-0.17z)| lr 1.31e-04 | 4156.64 ms | 32.5% bf16 MFU | 125012 tok/s step 13729/19560 | loss 3.389499 (+0.80z)| norm 0.2745 (-0.08z)| lr 1.31e-04 | 
4175.39 ms | 32.3% bf16 MFU | 125040 tok/s step 13730/19560 | loss 3.409531 (+1.25z)| norm 0.2971 (+1.62z)| lr 1.31e-04 | 4179.24 ms | 32.3% bf16 MFU | 125060 tok/s step 13731/19560 | loss 3.355545 (-0.00z)| norm 0.2797 (+0.29z)| lr 1.31e-04 | 4166.91 ms | 32.4% bf16 MFU | 125098 tok/s step 13732/19560 | loss 3.354286 (-0.03z)| norm 0.2893 (+1.01z)| lr 1.31e-04 | 4182.19 ms | 32.3% bf16 MFU | 125111 tok/s step 13733/19560 | loss 3.335024 (-0.47z)| norm 0.2872 (+0.85z)| lr 1.31e-04 | 4164.90 ms | 32.4% bf16 MFU | 125150 tok/s step 13734/19560 | loss 3.425413 (+1.58z)| norm 0.2713 (-0.36z)| lr 1.31e-04 | 4181.04 ms | 32.3% bf16 MFU | 125162 tok/s step 13735/19560 | loss 3.386939 (+0.69z)| norm 0.2718 (-0.31z)| lr 1.31e-04 | 4187.24 ms | 32.2% bf16 MFU | 125165 tok/s step 13736/19560 | loss 3.273537 (-1.86z)| norm 0.2647 (-0.85z)| lr 1.30e-04 | 4172.52 ms | 32.4% bf16 MFU | 125189 tok/s step 13737/19560 | loss 3.345635 (-0.21z)| norm 0.2762 (+0.03z)| lr 1.30e-04 | 4176.11 ms | 32.3% bf16 MFU | 125207 tok/s step 13738/19560 | loss 3.375853 (+0.49z)| norm 0.2801 (+0.34z)| lr 1.30e-04 | 4186.47 ms | 32.3% bf16 MFU | 125208 tok/s step 13739/19560 | loss 3.336844 (-0.41z)| norm 0.2748 (-0.08z)| lr 1.30e-04 | 4173.32 ms | 32.4% bf16 MFU | 125229 tok/s step 13740/19560 | loss 3.333927 (-0.47z)| norm 0.2801 (+0.33z)| lr 1.30e-04 | 4176.24 ms | 32.3% bf16 MFU | 125245 tok/s step 13741/19560 | loss 3.311998 (-0.97z)| norm 0.3044 (+2.15z)| lr 1.30e-04 | 4428.53 ms | 30.5% bf16 MFU | 124902 tok/s step 13742/19560 | loss 3.360338 (+0.16z)| norm 0.2750 (-0.08z)| lr 1.30e-04 | 4179.51 ms | 32.3% bf16 MFU | 124929 tok/s step 13743/19560 | loss 3.346772 (-0.16z)| norm 0.2906 (+1.10z)| lr 1.30e-04 | 4180.84 ms | 32.3% bf16 MFU | 124953 tok/s step 13744/19560 | loss 3.349869 (-0.09z)| norm 0.2829 (+0.51z)| lr 1.30e-04 | 4180.88 ms | 32.3% bf16 MFU | 124975 tok/s step 13745/19560 | loss 3.323821 (-0.70z)| norm 0.2782 (+0.15z)| lr 1.30e-04 | 4162.91 ms | 32.4% bf16 MFU | 125023 tok/s step 
13746/19560 | loss 3.378244 (+0.60z)| norm 0.2874 (+0.83z)| lr 1.30e-04 | 4178.94 ms | 32.3% bf16 MFU | 125045 tok/s step 13747/19560 | loss 3.343732 (-0.23z)| norm 0.2951 (+1.40z)| lr 1.30e-04 | 4181.14 ms | 32.3% bf16 MFU | 125063 tok/s step 13748/19560 | loss 3.344208 (-0.21z)| norm 0.2696 (-0.54z)| lr 1.30e-04 | 4174.74 ms | 32.3% bf16 MFU | 125089 tok/s step 13749/19560 | loss 3.354078 (+0.03z)| norm 0.2811 (+0.32z)| lr 1.30e-04 | 4188.80 ms | 32.2% bf16 MFU | 125093 tok/s step 13750/19560 | loss 3.465509 (+2.61z)| norm 0.2756 (-0.10z)| lr 1.30e-04 | 4168.86 ms | 32.4% bf16 MFU | 125126 tok/s val loss 3.312726 evaluating HellaSwag: 0/1256 evaluating HellaSwag: 10/1256 evaluating HellaSwag: 20/1256 evaluating HellaSwag: 30/1256 evaluating HellaSwag: 40/1256 evaluating HellaSwag: 50/1256 evaluating HellaSwag: 60/1256 evaluating HellaSwag: 70/1256 evaluating HellaSwag: 80/1256 evaluating HellaSwag: 90/1256 evaluating HellaSwag: 100/1256 evaluating HellaSwag: 110/1256 evaluating HellaSwag: 120/1256 evaluating HellaSwag: 130/1256 evaluating HellaSwag: 140/1256 evaluating HellaSwag: 150/1256 evaluating HellaSwag: 160/1256 evaluating HellaSwag: 170/1256 evaluating HellaSwag: 180/1256 evaluating HellaSwag: 190/1256 evaluating HellaSwag: 200/1256 evaluating HellaSwag: 210/1256 evaluating HellaSwag: 220/1256 evaluating HellaSwag: 230/1256 evaluating HellaSwag: 240/1256 evaluating HellaSwag: 250/1256 evaluating HellaSwag: 260/1256 evaluating HellaSwag: 270/1256 evaluating HellaSwag: 280/1256 evaluating HellaSwag: 290/1256 evaluating HellaSwag: 300/1256 evaluating HellaSwag: 310/1256 evaluating HellaSwag: 320/1256 evaluating HellaSwag: 330/1256 evaluating HellaSwag: 340/1256 evaluating HellaSwag: 350/1256 evaluating HellaSwag: 360/1256 evaluating HellaSwag: 370/1256 evaluating HellaSwag: 380/1256 evaluating HellaSwag: 390/1256 evaluating HellaSwag: 400/1256 evaluating HellaSwag: 410/1256 evaluating HellaSwag: 420/1256 evaluating HellaSwag: 430/1256 evaluating HellaSwag: 
440/1256 evaluating HellaSwag: 450/1256 evaluating HellaSwag: 460/1256 evaluating HellaSwag: 470/1256 evaluating HellaSwag: 480/1256 evaluating HellaSwag: 490/1256 evaluating HellaSwag: 500/1256 evaluating HellaSwag: 510/1256 evaluating HellaSwag: 520/1256 evaluating HellaSwag: 530/1256 evaluating HellaSwag: 540/1256 evaluating HellaSwag: 550/1256 evaluating HellaSwag: 560/1256 evaluating HellaSwag: 570/1256 evaluating HellaSwag: 580/1256 evaluating HellaSwag: 590/1256 evaluating HellaSwag: 600/1256 evaluating HellaSwag: 610/1256 evaluating HellaSwag: 620/1256 evaluating HellaSwag: 630/1256 evaluating HellaSwag: 640/1256 evaluating HellaSwag: 650/1256 evaluating HellaSwag: 660/1256 evaluating HellaSwag: 670/1256 evaluating HellaSwag: 680/1256 evaluating HellaSwag: 690/1256 evaluating HellaSwag: 700/1256 evaluating HellaSwag: 710/1256 evaluating HellaSwag: 720/1256 evaluating HellaSwag: 730/1256 evaluating HellaSwag: 740/1256 evaluating HellaSwag: 750/1256 evaluating HellaSwag: 760/1256 evaluating HellaSwag: 770/1256 evaluating HellaSwag: 780/1256 evaluating HellaSwag: 790/1256 evaluating HellaSwag: 800/1256 evaluating HellaSwag: 810/1256 evaluating HellaSwag: 820/1256 evaluating HellaSwag: 830/1256 evaluating HellaSwag: 840/1256 evaluating HellaSwag: 850/1256 evaluating HellaSwag: 860/1256 evaluating HellaSwag: 870/1256 evaluating HellaSwag: 880/1256 evaluating HellaSwag: 890/1256 evaluating HellaSwag: 900/1256 evaluating HellaSwag: 910/1256 evaluating HellaSwag: 920/1256 evaluating HellaSwag: 930/1256 evaluating HellaSwag: 940/1256 evaluating HellaSwag: 950/1256 evaluating HellaSwag: 960/1256 evaluating HellaSwag: 970/1256 evaluating HellaSwag: 980/1256 evaluating HellaSwag: 990/1256 evaluating HellaSwag: 1000/1256 evaluating HellaSwag: 1010/1256 evaluating HellaSwag: 1020/1256 evaluating HellaSwag: 1030/1256 evaluating HellaSwag: 1040/1256 evaluating HellaSwag: 1050/1256 evaluating HellaSwag: 1060/1256 evaluating HellaSwag: 1070/1256 evaluating HellaSwag: 
1080/1256 evaluating HellaSwag: 1090/1256 evaluating HellaSwag: 1100/1256 evaluating HellaSwag: 1110/1256 evaluating HellaSwag: 1120/1256 evaluating HellaSwag: 1130/1256 evaluating HellaSwag: 1140/1256 evaluating HellaSwag: 1150/1256 evaluating HellaSwag: 1160/1256 evaluating HellaSwag: 1170/1256 evaluating HellaSwag: 1180/1256 evaluating HellaSwag: 1190/1256 evaluating HellaSwag: 1200/1256 evaluating HellaSwag: 1210/1256 evaluating HellaSwag: 1220/1256 evaluating HellaSwag: 1230/1256 evaluating HellaSwag: 1240/1256 evaluating HellaSwag: 1250/1256 HellaSwag: 2987/10042 = 0.297451 step 13751/19560 | loss 3.297486 (-1.32z)| norm 0.2800 (+0.23z)| lr 1.30e-04 | 4164.05 ms | 32.4% bf16 MFU | 125165 tok/s step 13752/19560 | loss 3.376577 (+0.52z)| norm 0.2830 (+0.46z)| lr 1.30e-04 | 4172.81 ms | 32.4% bf16 MFU | 125189 tok/s step 13753/19560 | loss 3.352476 (-0.04z)| norm 0.2817 (+0.35z)| lr 1.30e-04 | 4226.55 ms | 31.9% bf16 MFU | 125132 tok/s step 13754/19560 | loss 3.382699 (+0.66z)| norm 0.2741 (-0.22z)| lr 1.30e-04 | 4181.71 ms | 32.3% bf16 MFU | 125144 tok/s step 13755/19560 | loss 3.349020 (-0.14z)| norm 0.2738 (-0.25z)| lr 1.30e-04 | 4169.89 ms | 32.4% bf16 MFU | 125174 tok/s step 13756/19560 | loss 3.375779 (+0.49z)| norm 0.2704 (-0.50z)| lr 1.30e-04 | 4161.12 ms | 32.4% bf16 MFU | 125215 tok/s step 13757/19560 | loss 3.375795 (+0.49z)| norm 0.3016 (+1.87z)| lr 1.30e-04 | 4176.48 ms | 32.3% bf16 MFU | 125231 tok/s step 13758/19560 | loss 3.359647 (+0.11z)| norm 0.2633 (-1.03z)| lr 1.30e-04 | 4179.53 ms | 32.3% bf16 MFU | 125241 tok/s step 13759/19560 | loss 3.388231 (+0.78z)| norm 0.3022 (+1.90z)| lr 1.30e-04 | 4406.39 ms | 30.6% bf16 MFU | 124928 tok/s step 13760/19560 | loss 3.404391 (+1.17z)| norm 0.2876 (+0.79z)| lr 1.29e-04 | 4181.07 ms | 32.3% bf16 MFU | 124952 tok/s step 13761/19560 | loss 3.331905 (-0.53z)| norm 0.2754 (-0.12z)| lr 1.29e-04 | 4374.12 ms | 30.9% bf16 MFU | 124697 tok/s step 13762/19560 | loss 3.305945 (-1.16z)| norm 0.2814 (+0.35z)| lr 
1.29e-04 | 4432.16 ms | 30.5% bf16 MFU | 124377 tok/s step 13763/19560 | loss 3.344493 (-0.24z)| norm 0.2893 (+0.94z)| lr 1.29e-04 | 4176.36 ms | 32.3% bf16 MFU | 124435 tok/s step 13764/19560 | loss 3.342349 (-0.30z)| norm 0.2886 (+0.91z)| lr 1.29e-04 | 4164.27 ms | 32.4% bf16 MFU | 124508 tok/s step 13765/19560 | loss 3.312202 (-1.04z)| norm 0.2862 (+0.71z)| lr 1.29e-04 | 4313.90 ms | 31.3% bf16 MFU | 124360 tok/s step 13766/19560 | loss 3.315625 (-0.95z)| norm 0.2916 (+1.13z)| lr 1.29e-04 | 4235.23 ms | 31.9% bf16 MFU | 124331 tok/s step 13767/19560 | loss 3.361959 (+0.16z)| norm 0.2680 (-0.72z)| lr 1.29e-04 | 4162.14 ms | 32.4% bf16 MFU | 124413 tok/s step 13768/19560 | loss 3.315950 (-0.94z)| norm 0.2944 (+1.34z)| lr 1.29e-04 | 4196.13 ms | 32.2% bf16 MFU | 124440 tok/s step 13769/19560 | loss 3.330048 (-0.61z)| norm 0.2709 (-0.53z)| lr 1.29e-04 | 4170.29 ms | 32.4% bf16 MFU | 124504 tok/s step 13770/19560 | loss 3.313322 (-1.00z)| norm 0.3093 (+2.57z)| lr 1.29e-04 | 4167.90 ms | 32.4% bf16 MFU | 124568 tok/s step 13771/19560 | loss 3.352435 (-0.05z)| norm 0.2663 (-0.90z)| lr 1.29e-04 | 4180.05 ms | 32.3% bf16 MFU | 124611 tok/s step 13772/19560 | loss 3.247718 (-2.53z)| norm 0.2917 (+1.18z)| lr 1.29e-04 | 4168.61 ms | 32.4% bf16 MFU | 124669 tok/s step 13773/19560 | loss 3.300392 (-1.26z)| norm 0.2803 (+0.23z)| lr 1.29e-04 | 4155.81 ms | 32.5% bf16 MFU | 124743 tok/s step 13774/19560 | loss 3.326686 (-0.63z)| norm 0.2803 (+0.23z)| lr 1.29e-04 | 4181.11 ms | 32.3% bf16 MFU | 124776 tok/s step 13775/19560 | loss 3.476641 (+2.81z)| norm 0.2996 (+1.78z)| lr 1.29e-04 | 4181.00 ms | 32.3% bf16 MFU | 124807 tok/s step 13776/19560 | loss 3.370537 (+0.39z)| norm 0.2921 (+1.16z)| lr 1.29e-04 | 4199.93 ms | 32.1% bf16 MFU | 124808 tok/s step 13777/19560 | loss 3.395284 (+0.95z)| norm 0.2842 (+0.53z)| lr 1.29e-04 | 4232.39 ms | 31.9% bf16 MFU | 124762 tok/s step 13778/19560 | loss 3.323557 (-0.69z)| norm 0.2808 (+0.23z)| lr 1.29e-04 | 4225.40 ms | 32.0% bf16 MFU | 124728 
tok/s step 13779/19560 | loss 3.334633 (-0.43z)| norm 0.2709 (-0.58z)| lr 1.29e-04 | 4184.89 ms | 32.3% bf16 MFU | 124755 tok/s step 13780/19560 | loss 3.356310 (+0.07z)| norm 0.2777 (-0.01z)| lr 1.29e-04 | 4198.88 ms | 32.2% bf16 MFU | 124761 tok/s step 13781/19560 | loss 3.377896 (+0.57z)| norm 0.2710 (-0.56z)| lr 1.29e-04 | 4178.11 ms | 32.3% bf16 MFU | 124797 tok/s step 13782/19560 | loss 3.355113 (+0.03z)| norm 0.2801 (+0.23z)| lr 1.29e-04 | 4182.28 ms | 32.3% bf16 MFU | 124825 tok/s step 13783/19560 | loss 3.414510 (+1.42z)| norm 0.2727 (-0.43z)| lr 1.29e-04 | 4172.75 ms | 32.4% bf16 MFU | 124866 tok/s step 13784/19560 | loss 3.336434 (-0.39z)| norm 0.2724 (-0.45z)| lr 1.29e-04 | 4178.01 ms | 32.3% bf16 MFU | 124897 tok/s step 13785/19560 | loss 3.347087 (-0.14z)| norm 0.2609 (-1.44z)| lr 1.28e-04 | 4174.13 ms | 32.3% bf16 MFU | 124932 tok/s step 13786/19560 | loss 3.387281 (+0.80z)| norm 0.2593 (-1.55z)| lr 1.28e-04 | 4162.05 ms | 32.4% bf16 MFU | 124984 tok/s step 13787/19560 | loss 3.356806 (+0.08z)| norm 0.2543 (-1.94z)| lr 1.28e-04 | 4189.93 ms | 32.2% bf16 MFU | 124992 tok/s step 13788/19560 | loss 3.302752 (-1.17z)| norm 0.2690 (-0.69z)| lr 1.28e-04 | 4184.82 ms | 32.3% bf16 MFU | 125006 tok/s step 13789/19560 | loss 3.311232 (-0.98z)| norm 0.2670 (-0.85z)| lr 1.28e-04 | 4169.58 ms | 32.4% bf16 MFU | 125043 tok/s step 13790/19560 | loss 3.332778 (-0.45z)| norm 0.2570 (-1.69z)| lr 1.28e-04 | 4182.69 ms | 32.3% bf16 MFU | 125058 tok/s step 13791/19560 | loss 3.347325 (-0.08z)| norm 0.2778 (+0.06z)| lr 1.28e-04 | 4175.19 ms | 32.3% bf16 MFU | 125084 tok/s step 13792/19560 | loss 3.335979 (-0.35z)| norm 0.2609 (-1.36z)| lr 1.28e-04 | 4180.81 ms | 32.3% bf16 MFU | 125100 tok/s step 13793/19560 | loss 3.291375 (-1.45z)| norm 0.2664 (-0.88z)| lr 1.28e-04 | 4163.50 ms | 32.4% bf16 MFU | 125141 tok/s step 13794/19560 | loss 3.399066 (+1.20z)| norm 0.2769 (-0.01z)| lr 1.28e-04 | 4187.16 ms | 32.2% bf16 MFU | 125145 tok/s step 13795/19560 | loss 3.363868 
(+0.33z)| norm 0.2712 (-0.49z)| lr 1.28e-04 | 4178.42 ms | 32.3% bf16 MFU | 125161 tok/s step 13796/19560 | loss 3.393562 (+1.05z)| norm 0.2874 (+0.86z)| lr 1.28e-04 | 4200.56 ms | 32.1% bf16 MFU | 125144 tok/s step 13797/19560 | loss 3.323131 (-0.69z)| norm 0.2639 (-1.10z)| lr 1.28e-04 | 4178.59 ms | 32.3% bf16 MFU | 125160 tok/s step 13798/19560 | loss 3.323270 (-0.69z)| norm 0.2777 (+0.04z)| lr 1.28e-04 | 4167.75 ms | 32.4% bf16 MFU | 125192 tok/s step 13799/19560 | loss 3.322868 (-0.70z)| norm 0.2655 (-0.97z)| lr 1.28e-04 | 4173.12 ms | 32.4% bf16 MFU | 125214 tok/s step 13800/19560 | loss 3.342728 (-0.21z)| norm 0.2673 (-0.83z)| lr 1.28e-04 | 4179.31 ms | 32.3% bf16 MFU | 125226 tok/s step 13801/19560 | loss 3.310081 (-1.01z)| norm 0.2892 (+1.00z)| lr 1.28e-04 | 4240.94 ms | 31.8% bf16 MFU | 125146 tok/s step 13802/19560 | loss 3.333541 (-0.43z)| norm 0.2596 (-1.48z)| lr 1.28e-04 | 4244.15 ms | 31.8% bf16 MFU | 125065 tok/s step 13803/19560 | loss 3.355086 (+0.10z)| norm 0.2906 (+1.11z)| lr 1.28e-04 | 4164.77 ms | 32.4% bf16 MFU | 125106 tok/s step 13804/19560 | loss 3.275932 (-1.83z)| norm 0.2725 (-0.42z)| lr 1.28e-04 | 4173.83 ms | 32.3% bf16 MFU | 125131 tok/s step 13805/19560 | loss 3.414368 (+1.53z)| norm 0.2966 (+1.58z)| lr 1.28e-04 | 4175.94 ms | 32.3% bf16 MFU | 125152 tok/s step 13806/19560 | loss 3.365019 (+0.34z)| norm 0.2926 (+1.22z)| lr 1.28e-04 | 4177.45 ms | 32.3% bf16 MFU | 125170 tok/s step 13807/19560 | loss 3.571667 (+4.82z)| norm 0.2844 (+0.53z)| lr 1.28e-04 | 4179.71 ms | 32.3% bf16 MFU | 125183 tok/s step 13808/19560 | loss 3.280796 (-1.57z)| norm 0.2960 (+1.48z)| lr 1.28e-04 | 4196.99 ms | 32.2% bf16 MFU | 125170 tok/s step 13809/19560 | loss 3.383342 (+0.66z)| norm 0.2924 (+1.16z)| lr 1.27e-04 | 4169.75 ms | 32.4% bf16 MFU | 125198 tok/s step 13810/19560 | loss 3.357258 (+0.08z)| norm 0.2915 (+1.07z)| lr 1.27e-04 | 4192.39 ms | 32.2% bf16 MFU | 125191 tok/s step 13811/19560 | loss 3.321087 (-0.70z)| norm 0.2789 (+0.00z)| lr 1.27e-04 | 
4165.17 ms | 32.4% bf16 MFU | 125226 tok/s step 13812/19560 | loss 3.409141 (+1.24z)| norm 0.3270 (+3.78z)| lr 1.27e-04 | 4194.54 ms | 32.2% bf16 MFU | 125214 tok/s step 13813/19560 | loss 3.378519 (+0.56z)| norm 0.2952 (+1.24z)| lr 1.27e-04 | 4172.64 ms | 32.4% bf16 MFU | 125236 tok/s step 13814/19560 | loss 3.355538 (+0.05z)| norm 0.3064 (+2.07z)| lr 1.27e-04 | 4421.95 ms | 30.5% bf16 MFU | 124902 tok/s step 13815/19560 | loss 3.381094 (+0.61z)| norm 0.2924 (+0.97z)| lr 1.27e-04 | 4410.85 ms | 30.6% bf16 MFU | 124600 tok/s step 13816/19560 | loss 3.425061 (+1.55z)| norm 0.3052 (+1.92z)| lr 1.27e-04 | 4180.32 ms | 32.3% bf16 MFU | 124641 tok/s step 13817/19560 | loss 3.448188 (+2.01z)| norm 0.3279 (+3.46z)| lr 1.27e-04 | 4275.83 ms | 31.6% bf16 MFU | 124540 tok/s step 13818/19560 | loss 3.336398 (-0.40z)| norm 0.2887 (+0.59z)| lr 1.27e-04 | 5483.16 ms | 24.6% bf16 MFU | 123094 tok/s step 13819/19560 | loss 3.375311 (+0.43z)| norm 0.2961 (+1.11z)| lr 1.27e-04 | 4338.50 ms | 31.1% bf16 MFU | 122981 tok/s step 13820/19560 | loss 3.359889 (+0.10z)| norm 0.2888 (+0.58z)| lr 1.27e-04 | 4172.56 ms | 32.4% bf16 MFU | 123115 tok/s step 13821/19560 | loss 3.478756 (+2.57z)| norm 0.3128 (+2.26z)| lr 1.27e-04 | 4166.18 ms | 32.4% bf16 MFU | 123251 tok/s step 13822/19560 | loss 3.305152 (-1.06z)| norm 0.3012 (+1.41z)| lr 1.27e-04 | 4185.19 ms | 32.3% bf16 MFU | 123352 tok/s step 13823/19560 | loss 3.379201 (+0.48z)| norm 0.2935 (+0.86z)| lr 1.27e-04 | 4182.52 ms | 32.3% bf16 MFU | 123452 tok/s step 13824/19560 | loss 3.332793 (-0.49z)| norm 0.2738 (-0.53z)| lr 1.27e-04 | 4243.13 ms | 31.8% bf16 MFU | 123458 tok/s step 13825/19560 | loss 3.345476 (-0.22z)| norm 0.2894 (+0.56z)| lr 1.27e-04 | 4175.47 ms | 32.3% bf16 MFU | 123563 tok/s step 13826/19560 | loss 3.371613 (+0.33z)| norm 0.2866 (+0.36z)| lr 1.27e-04 | 4192.70 ms | 32.2% bf16 MFU | 123637 tok/s step 13827/19560 | loss 3.308687 (-0.99z)| norm 0.2632 (-1.27z)| lr 1.27e-04 | 4188.10 ms | 32.2% bf16 MFU | 123715 tok/s step 
13828/19560 | loss 3.367754 (+0.25z)| norm 0.2978 (+1.14z)| lr 1.27e-04 | 4167.75 ms | 32.4% bf16 MFU | 123819 tok/s step 13829/19560 | loss 3.345065 (-0.23z)| norm 0.2631 (-1.27z)| lr 1.27e-04 | 4169.46 ms | 32.4% bf16 MFU | 123915 tok/s step 13830/19560 | loss 3.368789 (+0.27z)| norm 0.2867 (+0.38z)| lr 1.27e-04 | 4156.62 ms | 32.5% bf16 MFU | 124026 tok/s step 13831/19560 | loss 3.336285 (-0.41z)| norm 0.2956 (+0.99z)| lr 1.27e-04 | 4171.98 ms | 32.4% bf16 MFU | 124108 tok/s step 13832/19560 | loss 3.328914 (-0.57z)| norm 0.2736 (-0.56z)| lr 1.27e-04 | 4162.47 ms | 32.4% bf16 MFU | 124201 tok/s step 13833/19560 | loss 3.332172 (-0.48z)| norm 0.2859 (+0.30z)| lr 1.27e-04 | 4185.82 ms | 32.3% bf16 MFU | 124253 tok/s step 13834/19560 | loss 3.342283 (-0.27z)| norm 0.2754 (-0.43z)| lr 1.26e-04 | 4161.10 ms | 32.4% bf16 MFU | 124340 tok/s step 13835/19560 | loss 3.392882 (+0.80z)| norm 0.2906 (+0.64z)| lr 1.26e-04 | 4176.58 ms | 32.3% bf16 MFU | 124400 tok/s step 13836/19560 | loss 3.251799 (-2.15z)| norm 0.2717 (-0.69z)| lr 1.26e-04 | 4183.85 ms | 32.3% bf16 MFU | 124446 tok/s step 13837/19560 | loss 3.348010 (-0.15z)| norm 0.2783 (-0.20z)| lr 1.26e-04 | 4200.66 ms | 32.1% bf16 MFU | 124464 tok/s step 13838/19560 | loss 3.250871 (-2.13z)| norm 0.2693 (-0.84z)| lr 1.26e-04 | 4170.26 ms | 32.4% bf16 MFU | 124527 tok/s step 13839/19560 | loss 3.326725 (-0.56z)| norm 0.2808 (-0.01z)| lr 1.26e-04 | 4180.39 ms | 32.3% bf16 MFU | 124571 tok/s step 13840/19560 | loss 3.376136 (+0.46z)| norm 0.2787 (-0.17z)| lr 1.26e-04 | 4186.11 ms | 32.3% bf16 MFU | 124605 tok/s step 13841/19560 | loss 3.272347 (-1.65z)| norm 0.3010 (+1.42z)| lr 1.26e-04 | 4180.94 ms | 32.3% bf16 MFU | 124645 tok/s step 13842/19560 | loss 3.295760 (-1.19z)| norm 0.2781 (-0.25z)| lr 1.26e-04 | 4180.09 ms | 32.3% bf16 MFU | 124684 tok/s step 13843/19560 | loss 3.301113 (-1.07z)| norm 0.2972 (+1.13z)| lr 1.26e-04 | 4185.36 ms | 32.3% bf16 MFU | 124713 tok/s step 13844/19560 | loss 3.326491 (-0.55z)| norm 
0.2898 (+0.58z)| lr 1.26e-04 | 4181.04 ms | 32.3% bf16 MFU | 124747 tok/s step 13845/19560 | loss 3.299936 (-1.09z)| norm 0.3210 (+2.79z)| lr 1.26e-04 | 4176.71 ms | 32.3% bf16 MFU | 124786 tok/s step 13846/19560 | loss 3.279414 (-1.50z)| norm 0.2976 (+1.08z)| lr 1.26e-04 | 4169.74 ms | 32.4% bf16 MFU | 124833 tok/s step 13847/19560 | loss 3.367297 (+0.32z)| norm 0.2859 (+0.24z)| lr 1.26e-04 | 4163.29 ms | 32.4% bf16 MFU | 124888 tok/s step 13848/19560 | loss 3.327794 (-0.49z)| norm 0.2964 (+0.98z)| lr 1.26e-04 | 4175.01 ms | 32.3% bf16 MFU | 124923 tok/s step 13849/19560 | loss 3.325772 (-0.54z)| norm 0.2899 (+0.51z)| lr 1.26e-04 | 4163.13 ms | 32.4% bf16 MFU | 124973 tok/s step 13850/19560 | loss 3.304207 (-0.98z)| norm 0.2672 (-1.11z)| lr 1.26e-04 | 4167.44 ms | 32.4% bf16 MFU | 125015 tok/s step 13851/19560 | loss 3.267968 (-1.74z)| norm 0.2981 (+1.09z)| lr 1.26e-04 | 4160.63 ms | 32.5% bf16 MFU | 125065 tok/s step 13852/19560 | loss 3.306404 (-0.91z)| norm 0.2746 (-0.59z)| lr 1.26e-04 | 4176.43 ms | 32.3% bf16 MFU | 125088 tok/s step 13853/19560 | loss 3.350397 (+0.03z)| norm 0.2775 (-0.38z)| lr 1.26e-04 | 4180.37 ms | 32.3% bf16 MFU | 125105 tok/s step 13854/19560 | loss 3.297965 (-1.09z)| norm 0.2816 (-0.09z)| lr 1.26e-04 | 4170.00 ms | 32.4% bf16 MFU | 125136 tok/s step 13855/19560 | loss 3.255438 (-1.96z)| norm 0.2673 (-1.11z)| lr 1.26e-04 | 4174.70 ms | 32.3% bf16 MFU | 125159 tok/s step 13856/19560 | loss 3.400957 (+1.11z)| norm 0.2736 (-0.66z)| lr 1.26e-04 | 4164.10 ms | 32.4% bf16 MFU | 125196 tok/s step 13857/19560 | loss 3.307923 (-0.84z)| norm 0.2842 (+0.09z)| lr 1.26e-04 | 4179.24 ms | 32.3% bf16 MFU | 125209 tok/s step 13858/19560 | loss 3.264616 (-1.72z)| norm 0.2722 (-0.76z)| lr 1.25e-04 | 4177.53 ms | 32.3% bf16 MFU | 125223 tok/s step 13859/19560 | loss 3.302033 (-0.92z)| norm 0.2751 (-0.54z)| lr 1.25e-04 | 4161.48 ms | 32.4% bf16 MFU | 125261 tok/s step 13860/19560 | loss 3.250674 (-1.95z)| norm 0.2937 (+0.80z)| lr 1.25e-04 | 4164.48 ms | 
32.4% bf16 MFU | 125293 tok/s step 13861/19560 | loss 3.257621 (-1.77z)| norm 0.2704 (-0.88z)| lr 1.25e-04 | 4173.54 ms | 32.4% bf16 MFU | 125310 tok/s step 13862/19560 | loss 3.282342 (-1.25z)| norm 0.2783 (-0.31z)| lr 1.25e-04 | 4168.59 ms | 32.4% bf16 MFU | 125333 tok/s step 13863/19560 | loss 3.346786 (+0.07z)| norm 0.2688 (-0.99z)| lr 1.25e-04 | 4169.41 ms | 32.4% bf16 MFU | 125353 tok/s step 13864/19560 | loss 3.359408 (+0.32z)| norm 0.3011 (+1.31z)| lr 1.25e-04 | 4154.96 ms | 32.5% bf16 MFU | 125395 tok/s step 13865/19560 | loss 3.268307 (-1.53z)| norm 0.2795 (-0.24z)| lr 1.25e-04 | 4163.95 ms | 32.4% bf16 MFU | 125421 tok/s step 13866/19560 | loss 3.358982 (+0.32z)| norm 0.3094 (+1.87z)| lr 1.25e-04 | 4173.69 ms | 32.3% bf16 MFU | 125431 tok/s step 13867/19560 | loss 3.383452 (+0.81z)| norm 0.2785 (-0.33z)| lr 1.25e-04 | 4173.52 ms | 32.4% bf16 MFU | 125440 tok/s step 13868/19560 | loss 3.346518 (+0.06z)| norm 0.3429 (+3.95z)| lr 1.25e-04 | 4184.89 ms | 32.3% bf16 MFU | 125432 tok/s step 13869/19560 | loss 3.363579 (+0.40z)| norm 0.2824 (-0.07z)| lr 1.25e-04 | 4176.05 ms | 32.3% bf16 MFU | 125438 tok/s step 13870/19560 | loss 3.280771 (-1.27z)| norm 0.2909 (+0.49z)| lr 1.25e-04 | 4188.79 ms | 32.2% bf16 MFU | 125424 tok/s step 13871/19560 | loss 3.307860 (-0.72z)| norm 0.2794 (-0.28z)| lr 1.25e-04 | 4204.57 ms | 32.1% bf16 MFU | 125388 tok/s step 13872/19560 | loss 3.261122 (-1.63z)| norm 0.2865 (+0.20z)| lr 1.25e-04 | 4183.10 ms | 32.3% bf16 MFU | 125385 tok/s step 13873/19560 | loss 3.297700 (-0.89z)| norm 0.2793 (-0.29z)| lr 1.25e-04 | 4172.69 ms | 32.4% bf16 MFU | 125398 tok/s step 13874/19560 | loss 3.337220 (-0.10z)| norm 0.2804 (-0.20z)| lr 1.25e-04 | 4180.36 ms | 32.3% bf16 MFU | 125399 tok/s step 13875/19560 | loss 3.279359 (-1.24z)| norm 0.2918 (+0.56z)| lr 1.25e-04 | 4169.38 ms | 32.4% bf16 MFU | 125417 tok/s step 13876/19560 | loss 3.307782 (-0.66z)| norm 0.2708 (-0.85z)| lr 1.25e-04 | 4161.22 ms | 32.4% bf16 MFU | 125445 tok/s step 13877/19560 
| loss 3.271238 (-1.37z)| norm 0.2842 (+0.05z)| lr 1.25e-04 | 4178.03 ms | 32.3% bf16 MFU | 125448 tok/s step 13878/19560 | loss 3.248289 (-1.81z)| norm 0.2789 (-0.31z)| lr 1.25e-04 | 4174.39 ms | 32.3% bf16 MFU | 125455 tok/s step 13879/19560 | loss 3.340055 (+0.02z)| norm 0.2906 (+0.47z)| lr 1.25e-04 | 4179.12 ms | 32.3% bf16 MFU | 125455 tok/s step 13880/19560 | loss 3.327281 (-0.23z)| norm 0.2867 (+0.21z)| lr 1.25e-04 | 4179.21 ms | 32.3% bf16 MFU | 125455 tok/s step 13881/19560 | loss 3.307375 (-0.62z)| norm 0.2849 (+0.08z)| lr 1.25e-04 | 4170.03 ms | 32.4% bf16 MFU | 125468 tok/s step 13882/19560 | loss 3.307182 (-0.62z)| norm 0.2892 (+0.37z)| lr 1.25e-04 | 4163.03 ms | 32.4% bf16 MFU | 125492 tok/s step 13883/19560 | loss 3.323686 (-0.28z)| norm 0.2885 (+0.31z)| lr 1.24e-04 | 4182.78 ms | 32.3% bf16 MFU | 125485 tok/s step 13884/19560 | loss 3.277733 (-1.18z)| norm 0.2852 (+0.08z)| lr 1.24e-04 | 4171.11 ms | 32.4% bf16 MFU | 125495 tok/s step 13885/19560 | loss 3.305929 (-0.61z)| norm 0.2828 (-0.07z)| lr 1.24e-04 | 4177.29 ms | 32.3% bf16 MFU | 125496 tok/s step 13886/19560 | loss 3.330399 (-0.11z)| norm 0.2736 (-0.71z)| lr 1.24e-04 | 4164.92 ms | 32.4% bf16 MFU | 125515 tok/s step 13887/19560 | loss 3.294802 (-0.82z)| norm 0.2926 (+0.60z)| lr 1.24e-04 | 4186.11 ms | 32.3% bf16 MFU | 125502 tok/s step 13888/19560 | loss 3.294658 (-0.80z)| norm 0.2706 (-0.90z)| lr 1.24e-04 | 4192.55 ms | 32.2% bf16 MFU | 125479 tok/s step 13889/19560 | loss 3.335135 (+0.01z)| norm 0.2977 (+0.95z)| lr 1.24e-04 | 4179.03 ms | 32.3% bf16 MFU | 125478 tok/s step 13890/19560 | loss 3.385684 (+1.02z)| norm 0.2732 (-0.73z)| lr 1.24e-04 | 4165.52 ms | 32.4% bf16 MFU | 125497 tok/s step 13891/19560 | loss 3.246820 (-1.74z)| norm 0.2888 (+0.34z)| lr 1.24e-04 | 4173.38 ms | 32.4% bf16 MFU | 125504 tok/s step 13892/19560 | loss 3.268575 (-1.29z)| norm 0.2604 (-1.57z)| lr 1.24e-04 | 4184.91 ms | 32.3% bf16 MFU | 125493 tok/s step 13893/19560 | loss 3.307827 (-0.51z)| norm 0.2964 (+0.86z)| 
lr 1.24e-04 | 4155.79 ms | 32.5% bf16 MFU | 125526 tok/s step 13894/19560 | loss 3.278872 (-1.08z)| norm 0.2548 (-1.90z)| lr 1.24e-04 | 4179.81 ms | 32.3% bf16 MFU | 125521 tok/s step 13895/19560 | loss 3.277214 (-1.09z)| norm 0.2595 (-1.58z)| lr 1.24e-04 | 4179.65 ms | 32.3% bf16 MFU | 125517 tok/s step 13896/19560 | loss 3.317425 (-0.30z)| norm 0.2657 (-1.15z)| lr 1.24e-04 | 4174.01 ms | 32.3% bf16 MFU | 125522 tok/s step 13897/19560 | loss 3.321840 (-0.22z)| norm 0.2505 (-2.11z)| lr 1.24e-04 | 4168.46 ms | 32.4% bf16 MFU | 125534 tok/s step 13898/19560 | loss 3.292468 (-0.79z)| norm 0.2906 (+0.51z)| lr 1.24e-04 | 4177.00 ms | 32.3% bf16 MFU | 125533 tok/s step 13899/19560 | loss 3.271997 (-1.17z)| norm 0.2657 (-1.13z)| lr 1.24e-04 | 4173.16 ms | 32.4% bf16 MFU | 125538 tok/s step 13900/19560 | loss 3.202344 (-2.48z)| norm 0.2886 (+0.38z)| lr 1.24e-04 | 4176.06 ms | 32.3% bf16 MFU | 125539 tok/s step 13901/19560 | loss 3.291350 (-0.77z)| norm 0.2769 (-0.39z)| lr 1.24e-04 | 4181.48 ms | 32.3% bf16 MFU | 125531 tok/s step 13902/19560 | loss 3.367971 (+0.69z)| norm 0.2733 (-0.61z)| lr 1.24e-04 | 4172.09 ms | 32.4% bf16 MFU | 125538 tok/s step 13903/19560 | loss 3.355819 (+0.49z)| norm 0.2773 (-0.34z)| lr 1.24e-04 | 4184.57 ms | 32.3% bf16 MFU | 125525 tok/s step 13904/19560 | loss 3.321114 (-0.19z)| norm 0.2863 (+0.25z)| lr 1.24e-04 | 4173.79 ms | 32.3% bf16 MFU | 125530 tok/s step 13905/19560 | loss 3.243748 (-1.69z)| norm 0.2822 (-0.02z)| lr 1.24e-04 | 4160.74 ms | 32.5% bf16 MFU | 125554 tok/s step 13906/19560 | loss 3.265314 (-1.24z)| norm 0.2734 (-0.60z)| lr 1.24e-04 | 4182.51 ms | 32.3% bf16 MFU | 125544 tok/s step 13907/19560 | loss 3.295016 (-0.66z)| norm 0.2790 (-0.23z)| lr 1.24e-04 | 4168.53 ms | 32.4% bf16 MFU | 125555 tok/s step 13908/19560 | loss 3.280445 (-0.93z)| norm 0.2614 (-1.37z)| lr 1.23e-04 | 4164.82 ms | 32.4% bf16 MFU | 125572 tok/s step 13909/19560 | loss 3.330068 (+0.05z)| norm 0.3105 (+1.81z)| lr 1.23e-04 | 4176.68 ms | 32.3% bf16 MFU | 
125569 tok/s step 13910/19560 | loss 3.341181 (+0.27z)| norm 0.2873 (+0.30z)| lr 1.23e-04 | 4180.31 ms | 32.3% bf16 MFU | 125562 tok/s step 13911/19560 | loss 3.275942 (-1.00z)| norm 0.3070 (+1.54z)| lr 1.23e-04 | 4169.31 ms | 32.4% bf16 MFU | 125571 tok/s step 13912/19560 | loss 3.259785 (-1.30z)| norm 0.2923 (+0.59z)| lr 1.23e-04 | 4180.35 ms | 32.3% bf16 MFU | 125564 tok/s step 13913/19560 | loss 3.298146 (-0.54z)| norm 0.3000 (+1.07z)| lr 1.23e-04 | 4161.70 ms | 32.4% bf16 MFU | 125584 tok/s step 13914/19560 | loss 3.331545 (+0.13z)| norm 0.3017 (+1.16z)| lr 1.23e-04 | 4165.38 ms | 32.4% bf16 MFU | 125599 tok/s step 13915/19560 | loss 3.272139 (-1.03z)| norm 0.2923 (+0.54z)| lr 1.23e-04 | 4189.42 ms | 32.2% bf16 MFU | 125576 tok/s step 13916/19560 | loss 3.357678 (+0.64z)| norm 0.2918 (+0.50z)| lr 1.23e-04 | 4168.29 ms | 32.4% bf16 MFU | 125586 tok/s step 13917/19560 | loss 3.339684 (+0.29z)| norm 0.3153 (+2.00z)| lr 1.23e-04 | 4165.40 ms | 32.4% bf16 MFU | 125600 tok/s step 13918/19560 | loss 3.283279 (-0.81z)| norm 0.2660 (-1.23z)| lr 1.23e-04 | 4153.52 ms | 32.5% bf16 MFU | 125632 tok/s step 13919/19560 | loss 3.296391 (-0.55z)| norm 0.2900 (+0.34z)| lr 1.23e-04 | 4164.90 ms | 32.4% bf16 MFU | 125644 tok/s step 13920/19560 | loss 3.373846 (+0.96z)| norm 0.2808 (-0.28z)| lr 1.23e-04 | 4154.57 ms | 32.5% bf16 MFU | 125672 tok/s step 13921/19560 | loss 3.283785 (-0.79z)| norm 0.2586 (-1.74z)| lr 1.23e-04 | 4187.73 ms | 32.2% bf16 MFU | 125648 tok/s step 13922/19560 | loss 3.236253 (-1.69z)| norm 0.2881 (+0.20z)| lr 1.23e-04 | 4171.58 ms | 32.4% bf16 MFU | 125650 tok/s step 13923/19560 | loss 3.298794 (-0.47z)| norm 0.2751 (-0.65z)| lr 1.23e-04 | 4162.82 ms | 32.4% bf16 MFU | 125664 tok/s step 13924/19560 | loss 3.293193 (-0.56z)| norm 0.3079 (+1.49z)| lr 1.23e-04 | 4160.07 ms | 32.5% bf16 MFU | 125683 tok/s step 13925/19560 | loss 3.385709 (+1.23z)| norm 0.2890 (+0.24z)| lr 1.23e-04 | 4167.87 ms | 32.4% bf16 MFU | 125688 tok/s step 13926/19560 | loss 3.318367 
(-0.08z)| norm 0.2820 (-0.22z)| lr 1.23e-04 | 4164.74 ms | 32.4% bf16 MFU | 125698 tok/s step 13927/19560 | loss 3.250096 (-1.39z)| norm 0.2810 (-0.30z)| lr 1.23e-04 | 4162.00 ms | 32.4% bf16 MFU | 125712 tok/s step 13928/19560 | loss 3.370945 (+0.94z)| norm 0.3156 (+1.96z)| lr 1.23e-04 | 4158.07 ms | 32.5% bf16 MFU | 125730 tok/s step 13929/19560 | loss 3.310579 (-0.22z)| norm 0.2598 (-1.68z)| lr 1.23e-04 | 4168.49 ms | 32.4% bf16 MFU | 125733 tok/s step 13930/19560 | loss 3.279805 (-0.80z)| norm 0.3277 (+2.66z)| lr 1.23e-04 | 4172.50 ms | 32.4% bf16 MFU | 125729 tok/s step 13931/19560 | loss 3.311126 (-0.20z)| norm 0.2605 (-1.62z)| lr 1.23e-04 | 4160.81 ms | 32.4% bf16 MFU | 125743 tok/s step 13932/19560 | loss 3.276031 (-0.87z)| norm 0.2914 (+0.34z)| lr 1.22e-04 | 4174.34 ms | 32.3% bf16 MFU | 125735 tok/s step 13933/19560 | loss 3.318064 (-0.05z)| norm 0.2848 (-0.08z)| lr 1.22e-04 | 4174.49 ms | 32.3% bf16 MFU | 125728 tok/s step 13934/19560 | loss 3.393427 (+1.41z)| norm 0.2811 (-0.31z)| lr 1.22e-04 | 4169.21 ms | 32.4% bf16 MFU | 125729 tok/s step 13935/19560 | loss 3.250804 (-1.44z)| norm 0.2781 (-0.49z)| lr 1.22e-04 | 4177.10 ms | 32.3% bf16 MFU | 125719 tok/s step 13936/19560 | loss 3.288700 (-0.63z)| norm 0.2622 (-1.48z)| lr 1.22e-04 | 4191.10 ms | 32.2% bf16 MFU | 125688 tok/s step 13937/19560 | loss 3.260672 (-1.21z)| norm 0.2779 (-0.48z)| lr 1.22e-04 | 4168.38 ms | 32.4% bf16 MFU | 125692 tok/s step 13938/19560 | loss 3.258276 (-1.24z)| norm 0.2772 (-0.52z)| lr 1.22e-04 | 4162.44 ms | 32.4% bf16 MFU | 125705 tok/s step 13939/19560 | loss 3.275928 (-0.86z)| norm 0.2621 (-1.45z)| lr 1.22e-04 | 4155.45 ms | 32.5% bf16 MFU | 125728 tok/s step 13940/19560 | loss 3.263199 (-1.12z)| norm 0.2689 (-1.02z)| lr 1.22e-04 | 4169.37 ms | 32.4% bf16 MFU | 125729 tok/s step 13941/19560 | loss 3.264919 (-1.06z)| norm 0.2665 (-1.16z)| lr 1.22e-04 | 4170.97 ms | 32.4% bf16 MFU | 125728 tok/s step 13942/19560 | loss 3.318079 (+0.09z)| norm 0.2736 (-0.69z)| lr 1.22e-04 | 
4176.08 ms | 32.3% bf16 MFU | 125719 tok/s step 13943/19560 | loss 3.328817 (+0.33z)| norm 0.2510 (-2.09z)| lr 1.22e-04 | 4170.57 ms | 32.4% bf16 MFU | 125718 tok/s step 13944/19560 | loss 3.241971 (-1.55z)| norm 0.2838 (-0.00z)| lr 1.22e-04 | 4175.00 ms | 32.3% bf16 MFU | 125711 tok/s step 13945/19560 | loss 3.353040 (+0.95z)| norm 0.2571 (-1.70z)| lr 1.22e-04 | 4174.18 ms | 32.3% bf16 MFU | 125706 tok/s step 13946/19560 | loss 3.328460 (+0.39z)| norm 0.2552 (-1.78z)| lr 1.22e-04 | 4179.52 ms | 32.3% bf16 MFU | 125693 tok/s step 13947/19560 | loss 3.396162 (+1.93z)| norm 0.2854 (+0.16z)| lr 1.22e-04 | 4175.13 ms | 32.3% bf16 MFU | 125687 tok/s step 13948/19560 | loss 3.377714 (+1.50z)| norm 0.2773 (-0.36z)| lr 1.22e-04 | 4159.65 ms | 32.5% bf16 MFU | 125705 tok/s step 13949/19560 | loss 3.412116 (+2.37z)| norm 0.2744 (-0.53z)| lr 1.22e-04 | 4158.56 ms | 32.5% bf16 MFU | 125723 tok/s step 13950/19560 | loss 3.271692 (-0.92z)| norm 0.2782 (-0.27z)| lr 1.22e-04 | 4174.33 ms | 32.3% bf16 MFU | 125717 tok/s step 13951/19560 | loss 3.298437 (-0.28z)| norm 0.2777 (-0.30z)| lr 1.22e-04 | 4165.61 ms | 32.4% bf16 MFU | 125724 tok/s step 13952/19560 | loss 3.306918 (-0.07z)| norm 0.2482 (-2.18z)| lr 1.22e-04 | 4164.55 ms | 32.4% bf16 MFU | 125732 tok/s step 13953/19560 | loss 3.271139 (-0.90z)| norm 0.2764 (-0.36z)| lr 1.22e-04 | 4176.51 ms | 32.3% bf16 MFU | 125722 tok/s step 13954/19560 | loss 3.322086 (+0.31z)| norm 0.2651 (-1.08z)| lr 1.22e-04 | 4211.51 ms | 32.1% bf16 MFU | 125661 tok/s step 13955/19560 | loss 3.277343 (-0.75z)| norm 0.2750 (-0.44z)| lr 1.22e-04 | 4163.86 ms | 32.4% bf16 MFU | 125673 tok/s step 13956/19560 | loss 3.331461 (+0.55z)| norm 0.2731 (-0.56z)| lr 1.22e-04 | 4164.41 ms | 32.4% bf16 MFU | 125685 tok/s step 13957/19560 | loss 3.292450 (-0.37z)| norm 0.2711 (-0.69z)| lr 1.21e-04 | 4171.03 ms | 32.4% bf16 MFU | 125685 tok/s step 13958/19560 | loss 3.310753 (+0.08z)| norm 0.2961 (+0.93z)| lr 1.21e-04 | 4167.64 ms | 32.4% bf16 MFU | 125691 tok/s step 
13959/19560 | loss 3.369463 (+1.48z)| norm 0.2630 (-1.20z)| lr 1.21e-04 | 4184.92 ms | 32.3% bf16 MFU | 125670 tok/s step 13960/19560 | loss 3.344709 (+0.88z)| norm 0.3126 (+1.96z)| lr 1.21e-04 | 4178.00 ms | 32.3% bf16 MFU | 125661 tok/s step 13961/19560 | loss 3.344337 (+0.87z)| norm 0.3118 (+1.88z)| lr 1.21e-04 | 4177.04 ms | 32.3% bf16 MFU | 125654 tok/s step 13962/19560 | loss 3.309770 (+0.05z)| norm 0.3013 (+1.20z)| lr 1.21e-04 | 4178.28 ms | 32.3% bf16 MFU | 125645 tok/s step 13963/19560 | loss 3.266263 (-0.99z)| norm 0.3164 (+2.09z)| lr 1.21e-04 | 4163.11 ms | 32.4% bf16 MFU | 125660 tok/s step 13964/19560 | loss 3.211222 (-2.29z)| norm 0.2725 (-0.61z)| lr 1.21e-04 | 4162.73 ms | 32.4% bf16 MFU | 125674 tok/s step 13965/19560 | loss 3.293410 (-0.31z)| norm 0.2906 (+0.49z)| lr 1.21e-04 | 4181.48 ms | 32.3% bf16 MFU | 125660 tok/s step 13966/19560 | loss 3.352560 (+1.10z)| norm 0.2954 (+0.78z)| lr 1.21e-04 | 4164.91 ms | 32.4% bf16 MFU | 125671 tok/s step 13967/19560 | loss 3.293825 (-0.31z)| norm 0.3153 (+1.96z)| lr 1.21e-04 | 4162.02 ms | 32.4% bf16 MFU | 125686 tok/s step 13968/19560 | loss 3.283080 (-0.56z)| norm 0.2885 (+0.33z)| lr 1.21e-04 | 4164.07 ms | 32.4% bf16 MFU | 125697 tok/s step 13969/19560 | loss 3.284759 (-0.52z)| norm 0.2831 (+0.01z)| lr 1.21e-04 | 4199.74 ms | 32.1% bf16 MFU | 125654 tok/s step 13970/19560 | loss 3.329736 (+0.57z)| norm 0.2762 (-0.41z)| lr 1.21e-04 | 4183.90 ms | 32.3% bf16 MFU | 125637 tok/s step 13971/19560 | loss 3.343085 (+0.89z)| norm 0.2895 (+0.40z)| lr 1.21e-04 | 4163.76 ms | 32.4% bf16 MFU | 125651 tok/s step 13972/19560 | loss 3.289905 (-0.40z)| norm 0.2802 (-0.16z)| lr 1.21e-04 | 4169.80 ms | 32.4% bf16 MFU | 125655 tok/s step 13973/19560 | loss 3.373208 (+1.60z)| norm 0.2669 (-0.96z)| lr 1.21e-04 | 4158.96 ms | 32.5% bf16 MFU | 125675 tok/s step 13974/19560 | loss 3.309721 (+0.06z)| norm 0.2898 (+0.46z)| lr 1.21e-04 | 4177.80 ms | 32.3% bf16 MFU | 125666 tok/s step 13975/19560 | loss 3.322282 (+0.38z)| norm 
0.2658 (-1.02z)| lr 1.21e-04 | 4168.96 ms | 32.4% bf16 MFU | 125671 tok/s step 13976/19560 | loss 3.337106 (+0.74z)| norm 0.2723 (-0.60z)| lr 1.21e-04 | 4171.75 ms | 32.4% bf16 MFU | 125671 tok/s step 13977/19560 | loss 3.258646 (-1.15z)| norm 0.2609 (-1.29z)| lr 1.21e-04 | 4167.59 ms | 32.4% bf16 MFU | 125678 tok/s step 13978/19560 | loss 3.243990 (-1.48z)| norm 0.2621 (-1.21z)| lr 1.21e-04 | 4165.29 ms | 32.4% bf16 MFU | 125687 tok/s step 13979/19560 | loss 3.277055 (-0.69z)| norm 0.2859 (+0.27z)| lr 1.21e-04 | 4163.57 ms | 32.4% bf16 MFU | 125699 tok/s step 13980/19560 | loss 3.363840 (+1.37z)| norm 0.2638 (-1.09z)| lr 1.21e-04 | 4149.02 ms | 32.5% bf16 MFU | 125732 tok/s step 13981/19560 | loss 3.319190 (+0.31z)| norm 0.2744 (-0.44z)| lr 1.21e-04 | 4284.57 ms | 31.5% bf16 MFU | 125564 tok/s step 13982/19560 | loss 3.300507 (-0.13z)| norm 0.2760 (-0.34z)| lr 1.20e-04 | 4225.34 ms | 32.0% bf16 MFU | 125490 tok/s step 13983/19560 | loss 3.336521 (+0.72z)| norm 0.2606 (-1.28z)| lr 1.20e-04 | 4164.02 ms | 32.4% bf16 MFU | 125511 tok/s step 13984/19560 | loss 3.323711 (+0.43z)| norm 0.2766 (-0.30z)| lr 1.20e-04 | 4153.88 ms | 32.5% bf16 MFU | 125546 tok/s step 13985/19560 | loss 3.331075 (+0.61z)| norm 0.2576 (-1.44z)| lr 1.20e-04 | 4162.54 ms | 32.4% bf16 MFU | 125567 tok/s step 13986/19560 | loss 3.295384 (-0.28z)| norm 0.2769 (-0.27z)| lr 1.20e-04 | 4157.84 ms | 32.5% bf16 MFU | 125593 tok/s step 13987/19560 | loss 3.272228 (-0.84z)| norm 0.2559 (-1.53z)| lr 1.20e-04 | 4163.94 ms | 32.4% bf16 MFU | 125609 tok/s step 13988/19560 | loss 3.387846 (+1.96z)| norm 0.2822 (+0.07z)| lr 1.20e-04 | 4166.27 ms | 32.4% bf16 MFU | 125621 tok/s step 13989/19560 | loss 3.333405 (+0.62z)| norm 0.2570 (-1.44z)| lr 1.20e-04 | 4181.50 ms | 32.3% bf16 MFU | 125609 tok/s step 13990/19560 | loss 3.310463 (+0.05z)| norm 0.2744 (-0.39z)| lr 1.20e-04 | 4161.79 ms | 32.4% bf16 MFU | 125627 tok/s step 13991/19560 | loss 3.324034 (+0.39z)| norm 0.2722 (-0.52z)| lr 1.20e-04 | 4159.86 ms | 
32.5% bf16 MFU | 125647 tok/s step 13992/19560 | loss 3.268723 (-0.95z)| norm 0.3100 (+1.74z)| lr 1.20e-04 | 4173.71 ms | 32.3% bf16 MFU | 125646 tok/s step 13993/19560 | loss 3.294177 (-0.33z)| norm 0.2665 (-0.86z)| lr 1.20e-04 | 4178.16 ms | 32.3% bf16 MFU | 125638 tok/s step 13994/19560 | loss 3.253569 (-1.31z)| norm 0.2519 (-1.71z)| lr 1.20e-04 | 4163.85 ms | 32.4% bf16 MFU | 125652 tok/s step 13995/19560 | loss 3.298213 (-0.20z)| norm 0.2715 (-0.53z)| lr 1.20e-04 | 4175.39 ms | 32.3% bf16 MFU | 125647 tok/s step 13996/19560 | loss 3.302613 (-0.08z)| norm 0.2643 (-0.98z)| lr 1.20e-04 | 4168.93 ms | 32.4% bf16 MFU | 125653 tok/s step 13997/19560 | loss 3.314928 (+0.24z)| norm 0.2644 (-0.96z)| lr 1.20e-04 | 4184.79 ms | 32.3% bf16 MFU | 125635 tok/s step 13998/19560 | loss 3.334043 (+0.72z)| norm 0.2808 (+0.08z)| lr 1.20e-04 | 4172.11 ms | 32.4% bf16 MFU | 125636 tok/s step 13999/19560 | loss 3.280488 (-0.63z)| norm 0.2643 (-0.95z)| lr 1.20e-04 | 4175.10 ms | 32.3% bf16 MFU | 125633 tok/s step 14000/19560 | loss 3.431730 (+3.05z)| norm 0.2744 (-0.31z)| lr 1.20e-04 | 13320.14 ms | 10.1% bf16 MFU | 121319 tok/s val loss 3.309019 evaluating HellaSwag: 0/1256 ... evaluating HellaSwag: 1250/1256
HellaSwag: 2989/10042 = 0.297650 step 14001/19560 | loss 3.340549 (+0.81z)| norm 0.2634 (-0.99z)| lr 1.20e-04 | 4156.45 ms | 32.5% bf16 MFU | 121560 tok/s step 14002/19560 | loss 3.315964 (+0.22z)| norm 0.2744 (-0.30z)| lr 1.20e-04 | 4163.52 ms | 32.4% bf16 MFU | 121779 tok/s step 14003/19560 | loss 3.328971 (+0.53z)| norm 0.2701 (-0.56z)| lr 1.20e-04 | 4153.76 ms | 32.5% bf16 MFU | 122001 tok/s step 14004/19560 | loss 3.344051 (+0.89z)| norm 0.2664 (-0.79z)| lr 1.20e-04 | 4240.65 ms | 31.8% bf16 MFU | 122082 tok/s step 14005/19560 | loss 3.337635 (+0.72z)| norm 0.2663 (-0.78z)| lr 1.20e-04 | 4160.28 ms | 32.5% bf16 MFU | 122279 tok/s step 14006/19560 | loss 3.376631 (+1.64z)| norm 0.2740 (-0.30z)| lr 1.20e-04 | 4150.66 ms | 32.5% bf16 MFU | 122481 tok/s step 14007/19560 | loss 3.311768 (+0.07z)| norm 0.2783 (-0.02z)| lr 1.19e-04 | 4171.35 ms | 32.4% bf16 MFU | 122641 tok/s step 14008/19560 | loss 3.291860
(-0.41z)| norm 0.2558 (-1.41z)| lr 1.19e-04 | 4326.88 ms | 31.2% bf16 MFU | 122568 tok/s step 14009/19560 | loss 3.335749 (+0.65z)| norm 0.2660 (-0.76z)| lr 1.19e-04 | 4157.42 ms | 32.5% bf16 MFU | 122745 tok/s step 14010/19560 | loss 3.307776 (-0.03z)| norm 0.2478 (-1.85z)| lr 1.19e-04 | 4158.07 ms | 32.5% bf16 MFU | 122912 tok/s step 14011/19560 | loss 3.272835 (-0.87z)| norm 0.2697 (-0.50z)| lr 1.19e-04 | 4154.14 ms | 32.5% bf16 MFU | 123077 tok/s step 14012/19560 | loss 3.312788 (+0.10z)| norm 0.2794 (+0.10z)| lr 1.19e-04 | 4158.91 ms | 32.5% bf16 MFU | 123226 tok/s step 14013/19560 | loss 3.290341 (-0.45z)| norm 0.2542 (-1.42z)| lr 1.19e-04 | 4150.77 ms | 32.5% bf16 MFU | 123380 tok/s step 14014/19560 | loss 3.336326 (+0.67z)| norm 0.2865 (+0.53z)| lr 1.19e-04 | 4145.70 ms | 32.6% bf16 MFU | 123535 tok/s step 14015/19560 | loss 3.326756 (+0.43z)| norm 0.2604 (-1.04z)| lr 1.19e-04 | 4152.21 ms | 32.5% bf16 MFU | 123671 tok/s step 14016/19560 | loss 3.261386 (-1.15z)| norm 0.2720 (-0.33z)| lr 1.19e-04 | 4160.41 ms | 32.5% bf16 MFU | 123789 tok/s step 14017/19560 | loss 3.218019 (-2.14z)| norm 0.2724 (-0.30z)| lr 1.19e-04 | 4161.85 ms | 32.4% bf16 MFU | 123898 tok/s step 14018/19560 | loss 3.333204 (+0.62z)| norm 0.2831 (+0.35z)| lr 1.19e-04 | 4167.50 ms | 32.4% bf16 MFU | 123993 tok/s step 14019/19560 | loss 3.234260 (-1.75z)| norm 0.2614 (-0.95z)| lr 1.19e-04 | 4149.24 ms | 32.5% bf16 MFU | 124112 tok/s step 14020/19560 | loss 3.307153 (-0.01z)| norm 0.2768 (-0.03z)| lr 1.19e-04 | 4166.87 ms | 32.4% bf16 MFU | 124197 tok/s step 14021/19560 | loss 3.338919 (+0.75z)| norm 0.2736 (-0.21z)| lr 1.19e-04 | 4183.12 ms | 32.3% bf16 MFU | 124254 tok/s step 14022/19560 | loss 3.281620 (-0.63z)| norm 0.2634 (-0.84z)| lr 1.19e-04 | 4163.79 ms | 32.4% bf16 MFU | 124337 tok/s step 14023/19560 | loss 3.323640 (+0.37z)| norm 0.2761 (-0.07z)| lr 1.19e-04 | 4159.98 ms | 32.5% bf16 MFU | 124422 tok/s step 14024/19560 | loss 3.277153 (-0.74z)| norm 0.2767 (-0.04z)| lr 1.19e-04 | 
4157.94 ms | 32.5% bf16 MFU | 124505 tok/s step 14025/19560 | loss 3.313738 (+0.14z)| norm 0.2806 (+0.19z)| lr 1.19e-04 | 4421.04 ms | 30.5% bf16 MFU | 124210 tok/s step 14026/19560 | loss 3.342428 (+0.82z)| norm 0.2584 (-1.19z)| lr 1.19e-04 | 4212.27 ms | 32.1% bf16 MFU | 124222 tok/s step 14027/19560 | loss 3.265741 (-1.02z)| norm 0.2607 (-1.04z)| lr 1.19e-04 | 4249.13 ms | 31.8% bf16 MFU | 124181 tok/s step 14028/19560 | loss 3.332933 (+0.58z)| norm 0.2701 (-0.44z)| lr 1.19e-04 | 4382.08 ms | 30.8% bf16 MFU | 123954 tok/s step 14029/19560 | loss 3.290262 (-0.47z)| norm 0.2553 (-1.35z)| lr 1.19e-04 | 4298.84 ms | 31.4% bf16 MFU | 123854 tok/s step 14030/19560 | loss 3.310096 (+0.03z)| norm 0.2741 (-0.18z)| lr 1.19e-04 | 4171.96 ms | 32.4% bf16 MFU | 123945 tok/s step 14031/19560 | loss 3.286874 (-0.53z)| norm 0.2714 (-0.34z)| lr 1.19e-04 | 4281.50 ms | 31.5% bf16 MFU | 123870 tok/s step 14032/19560 | loss 3.260485 (-1.17z)| norm 0.2767 (-0.01z)| lr 1.18e-04 | 4206.01 ms | 32.1% bf16 MFU | 123909 tok/s step 14033/19560 | loss 3.316981 (+0.22z)| norm 0.2617 (-0.93z)| lr 1.18e-04 | 4148.82 ms | 32.5% bf16 MFU | 124032 tok/s step 14034/19560 | loss 3.327848 (+0.48z)| norm 0.2719 (-0.29z)| lr 1.18e-04 | 4236.60 ms | 31.9% bf16 MFU | 124018 tok/s step 14035/19560 | loss 3.321990 (+0.32z)| norm 0.2616 (-0.92z)| lr 1.18e-04 | 4162.09 ms | 32.4% bf16 MFU | 124116 tok/s step 14036/19560 | loss 3.285095 (-0.60z)| norm 0.2696 (-0.43z)| lr 1.18e-04 | 4208.67 ms | 32.1% bf16 MFU | 124139 tok/s step 14037/19560 | loss 3.270617 (-0.95z)| norm 0.2638 (-0.78z)| lr 1.18e-04 | 4202.64 ms | 32.1% bf16 MFU | 124169 tok/s step 14038/19560 | loss 3.290631 (-0.44z)| norm 0.2583 (-1.11z)| lr 1.18e-04 | 4171.62 ms | 32.4% bf16 MFU | 124245 tok/s step 14039/19560 | loss 3.284190 (-0.60z)| norm 0.2491 (-1.67z)| lr 1.18e-04 | 4151.31 ms | 32.5% bf16 MFU | 124347 tok/s step 14040/19560 | loss 3.341800 (+0.83z)| norm 0.2796 (+0.26z)| lr 1.18e-04 | 4158.73 ms | 32.5% bf16 MFU | 124434 tok/s step 
14041/19560 | loss 3.282673 (-0.66z)| norm 0.2509 (-1.53z)| lr 1.18e-04 | 4223.57 ms | 32.0% bf16 MFU | 124419 tok/s step 14042/19560 | loss 3.299096 (-0.24z)| norm 0.2768 (+0.12z)| lr 1.18e-04 | 4147.44 ms | 32.6% bf16 MFU | 124518 tok/s step 14043/19560 | loss 3.316077 (+0.18z)| norm 0.2692 (-0.35z)| lr 1.18e-04 | 4171.03 ms | 32.4% bf16 MFU | 124577 tok/s step 14044/19560 | loss 3.250826 (-1.44z)| norm 0.2785 (+0.25z)| lr 1.18e-04 | 4164.66 ms | 32.4% bf16 MFU | 124643 tok/s step 14045/19560 | loss 3.333711 (+0.65z)| norm 0.2741 (-0.01z)| lr 1.18e-04 | 4187.44 ms | 32.2% bf16 MFU | 124671 tok/s step 14046/19560 | loss 3.313914 (+0.14z)| norm 0.2608 (-0.89z)| lr 1.18e-04 | 4148.08 ms | 32.5% bf16 MFU | 124757 tok/s step 14047/19560 | loss 3.328631 (+0.51z)| norm 0.2875 (+0.88z)| lr 1.18e-04 | 4156.94 ms | 32.5% bf16 MFU | 124825 tok/s step 14048/19560 | loss 3.334575 (+0.67z)| norm 0.2752 (+0.07z)| lr 1.18e-04 | 4153.43 ms | 32.5% bf16 MFU | 124896 tok/s step 14049/19560 | loss 3.308559 (+0.00z)| norm 0.2674 (-0.46z)| lr 1.18e-04 | 4153.98 ms | 32.5% bf16 MFU | 124961 tok/s step 14050/19560 | loss 3.295836 (-0.34z)| norm 0.3027 (+1.87z)| lr 1.18e-04 | 4153.60 ms | 32.5% bf16 MFU | 125025 tok/s step 14051/19560 | loss 3.282199 (-0.68z)| norm 0.2706 (-0.25z)| lr 1.18e-04 | 4152.90 ms | 32.5% bf16 MFU | 125086 tok/s step 14052/19560 | loss 3.311234 (+0.06z)| norm 0.2972 (+1.53z)| lr 1.18e-04 | 4150.59 ms | 32.5% bf16 MFU | 125147 tok/s step 14053/19560 | loss 3.301888 (-0.17z)| norm 0.2700 (-0.27z)| lr 1.18e-04 | 4174.46 ms | 32.3% bf16 MFU | 125170 tok/s step 14054/19560 | loss 3.345257 (+0.96z)| norm 0.2732 (-0.05z)| lr 1.18e-04 | 4154.03 ms | 32.5% bf16 MFU | 125222 tok/s step 14055/19560 | loss 3.269192 (-1.03z)| norm 0.2656 (-0.55z)| lr 1.18e-04 | 4155.96 ms | 32.5% bf16 MFU | 125268 tok/s step 14056/19560 | loss 3.283838 (-0.64z)| norm 0.2794 (+0.40z)| lr 1.18e-04 | 4161.55 ms | 32.4% bf16 MFU | 125304 tok/s step 14057/19560 | loss 3.305749 (-0.06z)| norm 
0.2771 (+0.23z)| lr 1.17e-04 | 4152.00 ms | 32.5% bf16 MFU | 125353 tok/s step 14058/19560 | loss 3.311468 (+0.09z)| norm 0.2619 (-0.83z)| lr 1.17e-04 | 4148.87 ms | 32.5% bf16 MFU | 125403 tok/s step 14059/19560 | loss 3.255977 (-1.36z)| norm 0.2586 (-1.07z)| lr 1.17e-04 | 4156.03 ms | 32.5% bf16 MFU | 125441 tok/s step 14060/19560 | loss 3.294977 (-0.34z)| norm 0.2716 (-0.10z)| lr 1.17e-04 | 4156.79 ms | 32.5% bf16 MFU | 125475 tok/s step 14061/19560 | loss 3.368713 (+1.58z)| norm 0.2716 (-0.10z)| lr 1.17e-04 | 4159.32 ms | 32.5% bf16 MFU | 125504 tok/s step 14062/19560 | loss 3.345521 (+1.00z)| norm 0.2714 (-0.11z)| lr 1.17e-04 | 4155.62 ms | 32.5% bf16 MFU | 125537 tok/s step 14063/19560 | loss 3.255018 (-1.41z)| norm 0.2875 (+1.07z)| lr 1.17e-04 | 4158.11 ms | 32.5% bf16 MFU | 125564 tok/s step 14064/19560 | loss 3.328539 (+0.54z)| norm 0.2677 (-0.39z)| lr 1.17e-04 | 4151.45 ms | 32.5% bf16 MFU | 125601 tok/s step 14065/19560 | loss 3.291827 (-0.44z)| norm 0.2951 (+1.61z)| lr 1.17e-04 | 4164.49 ms | 32.4% bf16 MFU | 125615 tok/s step 14066/19560 | loss 3.271209 (-1.00z)| norm 0.2521 (-1.51z)| lr 1.17e-04 | 4149.11 ms | 32.5% bf16 MFU | 125653 tok/s step 14067/19560 | loss 3.311414 (+0.07z)| norm 0.2822 (+0.66z)| lr 1.17e-04 | 4156.00 ms | 32.5% bf16 MFU | 125678 tok/s step 14068/19560 | loss 3.297376 (-0.32z)| norm 0.2621 (-0.79z)| lr 1.17e-04 | 4151.49 ms | 32.5% bf16 MFU | 125708 tok/s step 14069/19560 | loss 3.287451 (-0.59z)| norm 0.2640 (-0.65z)| lr 1.17e-04 | 4152.17 ms | 32.5% bf16 MFU | 125736 tok/s step 14070/19560 | loss 3.291657 (-0.47z)| norm 0.2757 (+0.19z)| lr 1.17e-04 | 4152.81 ms | 32.5% bf16 MFU | 125762 tok/s step 14071/19560 | loss 3.348320 (+1.06z)| norm 0.2629 (-0.74z)| lr 1.17e-04 | 4154.74 ms | 32.5% bf16 MFU | 125783 tok/s step 14072/19560 | loss 3.285510 (-0.66z)| norm 0.2850 (+0.87z)| lr 1.17e-04 | 4157.20 ms | 32.5% bf16 MFU | 125800 tok/s step 14073/19560 | loss 3.354595 (+1.24z)| norm 0.2732 (-0.00z)| lr 1.17e-04 | 4154.40 ms | 
32.5% bf16 MFU | 125820 tok/s step 14074/19560 | loss 3.354171 (+1.21z)| norm 0.2623 (-0.81z)| lr 1.17e-04 | 4155.81 ms | 32.5% bf16 MFU | 125837 tok/s step 14075/19560 | loss 3.270427 (-1.06z)| norm 0.2703 (-0.21z)| lr 1.17e-04 | 4169.51 ms | 32.4% bf16 MFU | 125832 tok/s step 14076/19560 | loss 3.339867 (+0.88z)| norm 0.2849 (+0.86z)| lr 1.17e-04 | 4151.32 ms | 32.5% bf16 MFU | 125855 tok/s step 14077/19560 | loss 3.344320 (+1.05z)| norm 0.2655 (-0.56z)| lr 1.17e-04 | 4155.20 ms | 32.5% bf16 MFU | 125871 tok/s step 14078/19560 | loss 3.373792 (+1.87z)| norm 0.2841 (+0.80z)| lr 1.17e-04 | 4164.76 ms | 32.4% bf16 MFU | 125872 tok/s step 14079/19560 | loss 3.349660 (+1.16z)| norm 0.2761 (+0.21z)| lr 1.17e-04 | 4157.55 ms | 32.5% bf16 MFU | 125884 tok/s step 14080/19560 | loss 3.350816 (+1.17z)| norm 0.2815 (+0.60z)| lr 1.17e-04 | 4151.33 ms | 32.5% bf16 MFU | 125904 tok/s step 14081/19560 | loss 3.267229 (-1.20z)| norm 0.2709 (-0.19z)| lr 1.17e-04 | 4151.78 ms | 32.5% bf16 MFU | 125923 tok/s step 14082/19560 | loss 3.304009 (-0.15z)| norm 0.2877 (+1.04z)| lr 1.17e-04 | 4165.71 ms | 32.4% bf16 MFU | 125920 tok/s step 14083/19560 | loss 3.226867 (-2.29z)| norm 0.2626 (-0.80z)| lr 1.16e-04 | 4157.01 ms | 32.5% bf16 MFU | 125930 tok/s step 14084/19560 | loss 3.285484 (-0.64z)| norm 0.2836 (+0.74z)| lr 1.16e-04 | 4162.23 ms | 32.4% bf16 MFU | 125932 tok/s step 14085/19560 | loss 3.263484 (-1.24z)| norm 0.2506 (-1.67z)| lr 1.16e-04 | 4161.03 ms | 32.4% bf16 MFU | 125935 tok/s step 14086/19560 | loss 3.273499 (-0.95z)| norm 0.2839 (+0.77z)| lr 1.16e-04 | 4153.64 ms | 32.5% bf16 MFU | 125949 tok/s step 14087/19560 | loss 3.299108 (-0.23z)| norm 0.2606 (-0.93z)| lr 1.16e-04 | 4154.68 ms | 32.5% bf16 MFU | 125962 tok/s step 14088/19560 | loss 3.326883 (+0.55z)| norm 0.2779 (+0.37z)| lr 1.16e-04 | 4154.23 ms | 32.5% bf16 MFU | 125974 tok/s step 14089/19560 | loss 3.343946 (+1.03z)| norm 0.2830 (+0.80z)| lr 1.16e-04 | 4152.20 ms | 32.5% bf16 MFU | 125988 tok/s step 14090/19560 
| loss 3.274859 (-0.90z)| norm 0.2610 (-0.92z)| lr 1.16e-04 | 4145.01 ms | 32.6% bf16 MFU | 126013 tok/s step 14091/19560 | loss 3.334614 (+0.76z)| norm 0.2735 (+0.11z)| lr 1.16e-04 | 4155.06 ms | 32.5% bf16 MFU | 126022 tok/s step 14092/19560 | loss 3.386211 (+2.19z)| norm 0.2683 (-0.32z)| lr 1.16e-04 | 4160.56 ms | 32.5% bf16 MFU | 126021 tok/s step 14093/19560 | loss 3.347044 (+1.06z)| norm 0.2636 (-0.70z)| lr 1.16e-04 | 4154.99 ms | 32.5% bf16 MFU | 126029 tok/s step 14094/19560 | loss 3.296081 (-0.36z)| norm 0.2751 (+0.28z)| lr 1.16e-04 | 4160.27 ms | 32.5% bf16 MFU | 126029 tok/s step 14095/19560 | loss 3.331748 (+0.64z)| norm 0.2722 (+0.07z)| lr 1.16e-04 | 4158.06 ms | 32.5% bf16 MFU | 126032 tok/s step 14096/19560 | loss 3.305013 (-0.12z)| norm 0.2628 (-0.78z)| lr 1.16e-04 | 4995.56 ms | 27.0% bf16 MFU | 124978 tok/s step 14097/19560 | loss 3.324036 (+0.41z)| norm 0.2947 (+2.11z)| lr 1.16e-04 | 4156.74 ms | 32.5% bf16 MFU | 125036 tok/s step 14098/19560 | loss 3.330815 (+0.60z)| norm 0.2535 (-1.59z)| lr 1.16e-04 | 4149.82 ms | 32.5% bf16 MFU | 125101 tok/s step 14099/19560 | loss 3.369651 (+1.68z)| norm 0.2640 (-0.63z)| lr 1.16e-04 | 4155.52 ms | 32.5% bf16 MFU | 125154 tok/s step 14100/19560 | loss 3.326459 (+0.46z)| norm 0.2660 (-0.44z)| lr 1.16e-04 | 4163.68 ms | 32.4% bf16 MFU | 125192 tok/s step 14101/19560 | loss 3.277132 (-0.92z)| norm 0.2513 (-1.75z)| lr 1.16e-04 | 4153.53 ms | 32.5% bf16 MFU | 125244 tok/s step 14102/19560 | loss 3.323336 (+0.39z)| norm 0.2622 (-0.75z)| lr 1.16e-04 | 4152.78 ms | 32.5% bf16 MFU | 125294 tok/s step 14103/19560 | loss 3.333113 (+0.67z)| norm 0.2451 (-2.24z)| lr 1.16e-04 | 4152.48 ms | 32.5% bf16 MFU | 125343 tok/s step 14104/19560 | loss 3.352298 (+1.20z)| norm 0.2572 (-1.15z)| lr 1.16e-04 | 4162.97 ms | 32.4% bf16 MFU | 125373 tok/s step 14105/19560 | loss 3.310871 (+0.02z)| norm 0.2679 (-0.22z)| lr 1.16e-04 | 4150.60 ms | 32.5% bf16 MFU | 125420 tok/s step 14106/19560 | loss 3.326399 (+0.45z)| norm 0.2642 (-0.54z)| 
lr 1.16e-04 | 4153.23 ms | 32.5% bf16 MFU | 125461 tok/s step 14107/19560 | loss 3.342798 (+0.91z)| norm 0.2496 (-1.80z)| lr 1.16e-04 | 4153.22 ms | 32.5% bf16 MFU | 125499 tok/s step 14108/19560 | loss 3.310278 (-0.02z)| norm 0.2576 (-1.09z)| lr 1.15e-04 | 4154.95 ms | 32.5% bf16 MFU | 125534 tok/s step 14109/19560 | loss 3.360206 (+1.42z)| norm 0.2820 (+1.05z)| lr 1.15e-04 | 4164.71 ms | 32.4% bf16 MFU | 125551 tok/s step 14110/19560 | loss 3.286820 (-0.70z)| norm 0.2560 (-1.21z)| lr 1.15e-04 | 4157.72 ms | 32.5% bf16 MFU | 125579 tok/s step 14111/19560 | loss 3.369518 (+1.67z)| norm 0.2598 (-0.88z)| lr 1.15e-04 | 4152.19 ms | 32.5% bf16 MFU | 125613 tok/s step 14112/19560 | loss 3.316348 (+0.14z)| norm 0.2698 (-0.00z)| lr 1.15e-04 | 4159.66 ms | 32.5% bf16 MFU | 125635 tok/s step 14113/19560 | loss 3.321308 (+0.29z)| norm 0.2671 (-0.25z)| lr 1.15e-04 | 4159.72 ms | 32.5% bf16 MFU | 125655 tok/s step 14114/19560 | loss 3.284634 (-0.76z)| norm 0.3067 (+3.08z)| lr 1.15e-04 | 4192.52 ms | 32.2% bf16 MFU | 125625 tok/s step 14115/19560 | loss 3.274281 (-1.06z)| norm 0.2798 (+0.80z)| lr 1.15e-04 | 4153.95 ms | 32.5% bf16 MFU | 125654 tok/s step 14116/19560 | loss 3.322550 (+0.35z)| norm 0.2969 (+2.20z)| lr 1.15e-04 | 4186.30 ms | 32.3% bf16 MFU | 125633 tok/s step 14117/19560 | loss 3.301563 (-0.26z)| norm 0.2753 (+0.39z)| lr 1.15e-04 | 4156.76 ms | 32.5% bf16 MFU | 125658 tok/s step 14118/19560 | loss 3.301849 (-0.25z)| norm 0.2686 (-0.16z)| lr 1.15e-04 | 4148.46 ms | 32.5% bf16 MFU | 125694 tok/s step 14119/19560 | loss 3.280831 (-0.85z)| norm 0.3083 (+3.02z)| lr 1.15e-04 | 4161.04 ms | 32.4% bf16 MFU | 125710 tok/s step 14120/19560 | loss 3.341196 (+0.90z)| norm 0.2845 (+1.16z)| lr 1.15e-04 | 4154.36 ms | 32.5% bf16 MFU | 125734 tok/s step 14121/19560 | loss 3.349531 (+1.12z)| norm 0.3006 (+2.43z)| lr 1.15e-04 | 4156.89 ms | 32.5% bf16 MFU | 125754 tok/s step 14122/19560 | loss 3.292634 (-0.55z)| norm 0.2785 (+0.61z)| lr 1.15e-04 | 4157.23 ms | 32.5% bf16 MFU | 
125772 tok/s step 14123/19560 | loss 3.340950 (+0.86z)| norm 0.2825 (+0.93z)| lr 1.15e-04 | 4166.86 ms | 32.4% bf16 MFU | 125774 tok/s step 14124/19560 | loss 3.281250 (-0.89z)| norm 0.2811 (+0.80z)| lr 1.15e-04 | 4168.67 ms | 32.4% bf16 MFU | 125774 tok/s step 14125/19560 | loss 3.302958 (-0.25z)| norm 0.2833 (+0.97z)| lr 1.15e-04 | 4147.56 ms | 32.6% bf16 MFU | 125806 tok/s step 14126/19560 | loss 3.317945 (+0.20z)| norm 0.2917 (+1.64z)| lr 1.15e-04 | 4159.05 ms | 32.5% bf16 MFU | 125818 tok/s step 14127/19560 | loss 3.254369 (-1.65z)| norm 0.2735 (+0.15z)| lr 1.15e-04 | 4155.70 ms | 32.5% bf16 MFU | 125836 tok/s step 14128/19560 | loss 3.325283 (+0.46z)| norm 0.2809 (+0.75z)| lr 1.15e-04 | 4177.53 ms | 32.3% bf16 MFU | 125819 tok/s step 14129/19560 | loss 3.367674 (+1.74z)| norm 0.2894 (+1.41z)| lr 1.15e-04 | 4163.40 ms | 32.4% bf16 MFU | 125824 tok/s step 14130/19560 | loss 3.349498 (+1.17z)| norm 0.2483 (-1.86z)| lr 1.15e-04 | 4155.33 ms | 32.5% bf16 MFU | 125842 tok/s step 14131/19560 | loss 3.345370 (+1.04z)| norm 0.2776 (+0.47z)| lr 1.15e-04 | 4158.11 ms | 32.5% bf16 MFU | 125854 tok/s step 14132/19560 | loss 3.291660 (-0.57z)| norm 0.2835 (+0.92z)| lr 1.15e-04 | 4147.75 ms | 32.6% bf16 MFU | 125882 tok/s step 14133/19560 | loss 3.259583 (-1.51z)| norm 0.2613 (-0.83z)| lr 1.14e-04 | 4154.22 ms | 32.5% bf16 MFU | 125898 tok/s step 14134/19560 | loss 3.309325 (+0.00z)| norm 0.2933 (+1.67z)| lr 1.14e-04 | 4161.99 ms | 32.4% bf16 MFU | 125901 tok/s step 14135/19560 | loss 3.315239 (+0.18z)| norm 0.2766 (+0.36z)| lr 1.14e-04 | 4162.84 ms | 32.4% bf16 MFU | 125904 tok/s step 14136/19560 | loss 3.283311 (-0.79z)| norm 0.2573 (-1.14z)| lr 1.14e-04 | 4153.56 ms | 32.5% bf16 MFU | 125920 tok/s step 14137/19560 | loss 3.309695 (+0.02z)| norm 0.2811 (+0.70z)| lr 1.14e-04 | 4171.12 ms | 32.4% bf16 MFU | 125908 tok/s step 14138/19560 | loss 3.243051 (-1.97z)| norm 0.2693 (-0.23z)| lr 1.14e-04 | 4166.31 ms | 32.4% bf16 MFU | 125905 tok/s step 14139/19560 | loss 3.351782 
(+1.28z)| norm 0.2529 (-1.51z)| lr 1.14e-04 | 4161.82 ms | 32.4% bf16 MFU | 125909 tok/s step 14140/19560 | loss 3.301859 (-0.22z)| norm 0.2891 (+1.32z)| lr 1.14e-04 | 4154.95 ms | 32.5% bf16 MFU | 125922 tok/s step 14141/19560 | loss 3.326315 (+0.51z)| norm 0.2698 (-0.20z)| lr 1.14e-04 | 4166.58 ms | 32.4% bf16 MFU | 125918 tok/s step 14142/19560 | loss 3.340024 (+0.92z)| norm 0.2750 (+0.22z)| lr 1.14e-04 | 4155.72 ms | 32.5% bf16 MFU | 125930 tok/s step 14143/19560 | loss 3.306765 (-0.07z)| norm 0.2766 (+0.34z)| lr 1.14e-04 | 4150.67 ms | 32.5% bf16 MFU | 125949 tok/s step 14144/19560 | loss 3.333221 (+0.71z)| norm 0.2659 (-0.51z)| lr 1.14e-04 | 4154.13 ms | 32.5% bf16 MFU | 125962 tok/s step 14145/19560 | loss 3.311880 (+0.04z)| norm 0.2811 (+0.69z)| lr 1.14e-04 | 4155.34 ms | 32.5% bf16 MFU | 125973 tok/s step 14146/19560 | loss 3.261066 (-1.51z)| norm 0.2704 (-0.15z)| lr 1.14e-04 | 4148.37 ms | 32.5% bf16 MFU | 125993 tok/s step 14147/19560 | loss 3.298675 (-0.37z)| norm 0.2642 (-0.64z)| lr 1.14e-04 | 4151.99 ms | 32.5% bf16 MFU | 126007 tok/s step 14148/19560 | loss 3.277723 (-1.02z)| norm 0.2681 (-0.32z)| lr 1.14e-04 | 4163.40 ms | 32.4% bf16 MFU | 126003 tok/s step 14149/19560 | loss 3.322587 (+0.39z)| norm 0.2863 (+1.10z)| lr 1.14e-04 | 4150.84 ms | 32.5% bf16 MFU | 126019 tok/s step 14150/19560 | loss 3.311525 (+0.04z)| norm 0.2802 (+0.61z)| lr 1.14e-04 | 4155.28 ms | 32.5% bf16 MFU | 126026 tok/s step 14151/19560 | loss 3.306855 (-0.11z)| norm 0.2702 (-0.17z)| lr 1.14e-04 | 4160.93 ms | 32.4% bf16 MFU | 126025 tok/s step 14152/19560 | loss 3.308193 (-0.07z)| norm 0.2591 (-1.04z)| lr 1.14e-04 | 4163.73 ms | 32.4% bf16 MFU | 126020 tok/s step 14153/19560 | loss 3.347975 (+1.18z)| norm 0.2849 (+0.99z)| lr 1.14e-04 | 4151.81 ms | 32.5% bf16 MFU | 126033 tok/s step 14154/19560 | loss 3.285013 (-0.80z)| norm 0.2789 (+0.51z)| lr 1.14e-04 | 4156.81 ms | 32.5% bf16 MFU | 126037 tok/s step 14155/19560 | loss 3.320616 (+0.32z)| norm 0.2607 (-0.93z)| lr 1.14e-04 | 
4159.44 ms | 32.5% bf16 MFU | 126038 tok/s step 14156/19560 | loss 3.324645 (+0.45z)| norm 0.2790 (+0.51z)| lr 1.14e-04 | 4165.31 ms | 32.4% bf16 MFU | 126030 tok/s step 14157/19560 | loss 3.356672 (+1.45z)| norm 0.2554 (-1.35z)| lr 1.14e-04 | 4177.27 ms | 32.3% bf16 MFU | 126004 tok/s step 14158/19560 | loss 3.294798 (-0.52z)| norm 0.2733 (+0.06z)| lr 1.14e-04 | 4151.95 ms | 32.5% bf16 MFU | 126017 tok/s step 14159/19560 | loss 3.294486 (-0.53z)| norm 0.2688 (-0.29z)| lr 1.13e-04 | 4804.56 ms | 28.1% bf16 MFU | 125172 tok/s step 14160/19560 | loss 3.262969 (-1.53z)| norm 0.2567 (-1.23z)| lr 1.13e-04 | 4149.46 ms | 32.5% bf16 MFU | 125231 tok/s step 14161/19560 | loss 3.268444 (-1.34z)| norm 0.2678 (-0.36z)| lr 1.13e-04 | 4172.18 ms | 32.4% bf16 MFU | 125253 tok/s step 14162/19560 | loss 3.317495 (+0.22z)| norm 0.2742 (+0.14z)| lr 1.13e-04 | 4173.05 ms | 32.4% bf16 MFU | 125272 tok/s step 14163/19560 | loss 3.342100 (+0.99z)| norm 0.2738 (+0.10z)| lr 1.13e-04 | 4156.07 ms | 32.5% bf16 MFU | 125316 tok/s step 14164/19560 | loss 3.415091 (+3.14z)| norm 0.2717 (-0.06z)| lr 1.13e-04 | 5366.53 ms | 25.2% bf16 MFU | 123935 tok/s step 14165/19560 | loss 3.304363 (-0.24z)| norm 0.3045 (+2.44z)| lr 1.13e-04 | 4148.58 ms | 32.5% bf16 MFU | 124057 tok/s step 14166/19560 | loss 3.327528 (+0.46z)| norm 0.2931 (+1.53z)| lr 1.13e-04 | 4148.39 ms | 32.5% bf16 MFU | 124173 tok/s step 14167/19560 | loss 3.218282 (-2.78z)| norm 0.2767 (+0.26z)| lr 1.13e-04 | 4143.47 ms | 32.6% bf16 MFU | 124291 tok/s step 14168/19560 | loss 3.303237 (-0.25z)| norm 0.2734 (+0.01z)| lr 1.13e-04 | 4158.52 ms | 32.5% bf16 MFU | 124381 tok/s step 14169/19560 | loss 3.310529 (-0.04z)| norm 0.2747 (+0.10z)| lr 1.13e-04 | 4160.61 ms | 32.5% bf16 MFU | 124462 tok/s step 14170/19560 | loss 3.313334 (+0.04z)| norm 0.2708 (-0.21z)| lr 1.13e-04 | 4153.35 ms | 32.5% bf16 MFU | 124551 tok/s step 14171/19560 | loss 3.313955 (+0.06z)| norm 0.2738 (+0.02z)| lr 1.13e-04 | 4156.53 ms | 32.5% bf16 MFU | 124630 tok/s step 
14172/19560 | loss 3.336565 (+0.73z)| norm 0.2680 (-0.43z)| lr 1.13e-04 | 4150.50 ms | 32.5% bf16 MFU | 124714 tok/s step 14173/19560 | loss 3.297183 (-0.46z)| norm 0.2862 (+1.00z)| lr 1.13e-04 | 4157.15 ms | 32.5% bf16 MFU | 124785 tok/s step 14174/19560 | loss 3.279915 (-0.97z)| norm 0.2618 (-0.92z)| lr 1.13e-04 | 4139.49 ms | 32.6% bf16 MFU | 124878 tok/s step 14175/19560 | loss 3.361069 (+1.46z)| norm 0.2909 (+1.36z)| lr 1.13e-04 | 4148.72 ms | 32.5% bf16 MFU | 124953 tok/s step 14176/19560 | loss 3.348014 (+1.06z)| norm 0.2818 (+0.64z)| lr 1.13e-04 | 4160.15 ms | 32.5% bf16 MFU | 125007 tok/s step 14177/19560 | loss 3.308707 (-0.11z)| norm 0.2737 (+0.01z)| lr 1.13e-04 | 4163.33 ms | 32.4% bf16 MFU | 125053 tok/s step 14178/19560 | loss 3.288168 (-0.72z)| norm 0.2707 (-0.21z)| lr 1.13e-04 | 4152.25 ms | 32.5% bf16 MFU | 125113 tok/s step 14179/19560 | loss 3.309412 (-0.09z)| norm 0.2946 (+1.67z)| lr 1.13e-04 | 4162.05 ms | 32.4% bf16 MFU | 125156 tok/s step 14180/19560 | loss 3.282302 (-0.89z)| norm 0.2762 (+0.23z)| lr 1.13e-04 | 4153.84 ms | 32.5% bf16 MFU | 125209 tok/s step 14181/19560 | loss 3.317048 (+0.14z)| norm 0.2696 (-0.30z)| lr 1.13e-04 | 4161.75 ms | 32.4% bf16 MFU | 125248 tok/s step 14182/19560 | loss 3.296148 (-0.48z)| norm 0.2875 (+1.11z)| lr 1.13e-04 | 4148.98 ms | 32.5% bf16 MFU | 125304 tok/s step 14183/19560 | loss 3.376739 (+1.90z)| norm 0.3061 (+2.52z)| lr 1.13e-04 | 4160.33 ms | 32.5% bf16 MFU | 125339 tok/s step 14184/19560 | loss 3.274954 (-1.12z)| norm 0.2976 (+1.82z)| lr 1.13e-04 | 4155.48 ms | 32.5% bf16 MFU | 125381 tok/s step 14185/19560 | loss 3.228197 (-2.43z)| norm 0.2761 (+0.17z)| lr 1.12e-04 | 4159.65 ms | 32.5% bf16 MFU | 125414 tok/s step 14186/19560 | loss 3.312684 (+0.01z)| norm 0.2903 (+1.24z)| lr 1.12e-04 | 4154.55 ms | 32.5% bf16 MFU | 125453 tok/s step 14187/19560 | loss 3.330934 (+0.53z)| norm 0.2733 (-0.07z)| lr 1.12e-04 | 4163.13 ms | 32.4% bf16 MFU | 125477 tok/s step 14188/19560 | loss 3.313272 (+0.01z)| norm 
0.2799 (+0.43z)| lr 1.12e-04 | 4161.09 ms | 32.4% bf16 MFU | 125503 tok/s step 14189/19560 | loss 3.310848 (-0.05z)| norm 0.2783 (+0.30z)| lr 1.12e-04 | 4150.62 ms | 32.5% bf16 MFU | 125544 tok/s step 14190/19560 | loss 3.289089 (-0.68z)| norm 0.2666 (-0.60z)| lr 1.12e-04 | 4159.53 ms | 32.5% bf16 MFU | 125569 tok/s step 14191/19560 | loss 3.308308 (-0.12z)| norm 0.2653 (-0.69z)| lr 1.12e-04 | 4159.41 ms | 32.5% bf16 MFU | 125593 tok/s step 14192/19560 | loss 3.250896 (-1.80z)| norm 0.2532 (-1.59z)| lr 1.12e-04 | 4152.32 ms | 32.5% bf16 MFU | 125626 tok/s step 14193/19560 | loss 3.286620 (-0.74z)| norm 0.2718 (-0.16z)| lr 1.12e-04 | 4154.47 ms | 32.5% bf16 MFU | 125655 tok/s step 14194/19560 | loss 3.269200 (-1.26z)| norm 0.2835 (+0.73z)| lr 1.12e-04 | 4152.10 ms | 32.5% bf16 MFU | 125686 tok/s step 14195/19560 | loss 3.266921 (-1.31z)| norm 0.2736 (-0.04z)| lr 1.12e-04 | 4155.23 ms | 32.5% bf16 MFU | 125710 tok/s step 14196/19560 | loss 3.329928 (+0.53z)| norm 0.2766 (+0.19z)| lr 1.12e-04 | 4166.08 ms | 32.4% bf16 MFU | 125717 tok/s step 14197/19560 | loss 3.287035 (-0.72z)| norm 0.2509 (-1.80z)| lr 1.12e-04 | 4156.18 ms | 32.5% bf16 MFU | 125739 tok/s step 14198/19560 | loss 3.353929 (+1.22z)| norm 0.2705 (-0.28z)| lr 1.12e-04 | 4149.10 ms | 32.5% bf16 MFU | 125770 tok/s step 14199/19560 | loss 3.343878 (+0.93z)| norm 0.2723 (-0.14z)| lr 1.12e-04 | 4165.74 ms | 32.4% bf16 MFU | 125774 tok/s step 14200/19560 | loss 3.325352 (+0.38z)| norm 0.2703 (-0.29z)| lr 1.12e-04 | 4156.72 ms | 32.5% bf16 MFU | 125792 tok/s step 14201/19560 | loss 3.338614 (+0.77z)| norm 0.2765 (+0.19z)| lr 1.12e-04 | 4159.55 ms | 32.5% bf16 MFU | 125805 tok/s step 14202/19560 | loss 3.261751 (-1.46z)| norm 0.2752 (+0.09z)| lr 1.12e-04 | 4165.29 ms | 32.4% bf16 MFU | 125808 tok/s step 14203/19560 | loss 3.226832 (-2.43z)| norm 0.2649 (-0.71z)| lr 1.12e-04 | 4162.18 ms | 32.4% bf16 MFU | 125816 tok/s step 14204/19560 | loss 3.364653 (+1.52z)| norm 0.2836 (+0.75z)| lr 1.12e-04 | 4154.13 ms | 
32.5% bf16 MFU | 125835 tok/s step 14205/19560 | loss 3.338299 (+0.77z)| norm 0.2888 (+1.13z)| lr 1.12e-04 | 4166.59 ms | 32.4% bf16 MFU | 125835 tok/s step 14206/19560 | loss 3.279631 (-0.90z)| norm 0.2641 (-0.78z)| lr 1.12e-04 | 4152.61 ms | 32.5% bf16 MFU | 125856 tok/s step 14207/19560 | loss 3.397754 (+2.46z)| norm 0.2976 (+1.79z)| lr 1.12e-04 | 4156.86 ms | 32.5% bf16 MFU | 125870 tok/s step 14208/19560 | loss 3.319940 (+0.26z)| norm 0.2578 (-1.25z)| lr 1.12e-04 | 4162.40 ms | 32.4% bf16 MFU | 125874 tok/s step 14209/19560 | loss 3.267920 (-1.22z)| norm 0.2840 (+0.75z)| lr 1.12e-04 | 4164.02 ms | 32.4% bf16 MFU | 125876 tok/s step 14210/19560 | loss 3.365064 (+1.52z)| norm 0.2857 (+0.88z)| lr 1.11e-04 | 4158.39 ms | 32.5% bf16 MFU | 125886 tok/s step 14211/19560 | loss 3.331690 (+0.56z)| norm 0.2690 (-0.40z)| lr 1.11e-04 | 4157.54 ms | 32.5% bf16 MFU | 125897 tok/s step 14212/19560 | loss 3.321438 (+0.26z)| norm 0.2762 (+0.15z)| lr 1.11e-04 | 4155.12 ms | 32.5% bf16 MFU | 125911 tok/s step 14213/19560 | loss 3.324751 (+0.35z)| norm 0.2773 (+0.23z)| lr 1.11e-04 | 4151.92 ms | 32.5% bf16 MFU | 125929 tok/s step 14214/19560 | loss 3.261034 (-1.50z)| norm 0.2782 (+0.30z)| lr 1.11e-04 | 4163.81 ms | 32.4% bf16 MFU | 125929 tok/s step 14215/19560 | loss 3.287914 (-0.72z)| norm 0.2618 (-0.98z)| lr 1.11e-04 | 4168.94 ms | 32.4% bf16 MFU | 125920 tok/s step 14216/19560 | loss 3.289618 (-0.66z)| norm 0.2735 (-0.06z)| lr 1.11e-04 | 4156.49 ms | 32.5% bf16 MFU | 125931 tok/s step 14217/19560 | loss 3.325858 (+0.39z)| norm 0.2715 (-0.21z)| lr 1.11e-04 | 4225.13 ms | 32.0% bf16 MFU | 125839 tok/s step 14218/19560 | loss 3.377083 (+1.84z)| norm 0.2707 (-0.28z)| lr 1.11e-04 | 4168.10 ms | 32.4% bf16 MFU | 125836 tok/s step 14219/19560 | loss 3.309984 (-0.08z)| norm 0.2782 (+0.31z)| lr 1.11e-04 | 4174.59 ms | 32.3% bf16 MFU | 125824 tok/s step 14220/19560 | loss 3.304680 (-0.22z)| norm 0.2942 (+1.53z)| lr 1.11e-04 | 4166.34 ms | 32.4% bf16 MFU | 125825 tok/s step 14221/19560 
| loss 3.341396 (+0.86z)| norm 0.2651 (-0.73z)| lr 1.11e-04 | 4149.02 ms | 32.5% bf16 MFU | 125852 tok/s step 14222/19560 | loss 3.311196 (-0.03z)| norm 0.2822 (+0.59z)| lr 1.11e-04 | 4158.56 ms | 32.5% bf16 MFU | 125863 tok/s step 14223/19560 | loss 3.336559 (+0.71z)| norm 0.2857 (+0.85z)| lr 1.11e-04 | 4157.93 ms | 32.5% bf16 MFU | 125874 tok/s step 14224/19560 | loss 3.342804 (+0.88z)| norm 0.2623 (-0.96z)| lr 1.11e-04 | 4154.36 ms | 32.5% bf16 MFU | 125891 tok/s step 14225/19560 | loss 3.273653 (-1.12z)| norm 0.2776 (+0.24z)| lr 1.11e-04 | 4157.12 ms | 32.5% bf16 MFU | 125902 tok/s step 14226/19560 | loss 3.404728 (+2.61z)| norm 0.2680 (-0.53z)| lr 1.11e-04 | 4161.05 ms | 32.4% bf16 MFU | 125907 tok/s step 14227/19560 | loss 3.448244 (+3.66z)| norm 0.3062 (+2.41z)| lr 1.11e-04 | 4157.03 ms | 32.5% bf16 MFU | 125918 tok/s step 14228/19560 | loss 3.324853 (+0.31z)| norm 0.2609 (-1.09z)| lr 1.11e-04 | 4148.80 ms | 32.5% bf16 MFU | 125940 tok/s step 14229/19560 | loss 3.338013 (+0.66z)| norm 0.2627 (-0.96z)| lr 1.11e-04 | 4161.33 ms | 32.4% bf16 MFU | 125943 tok/s step 14230/19560 | loss 3.312594 (-0.03z)| norm 0.2630 (-0.94z)| lr 1.11e-04 | 4154.57 ms | 32.5% bf16 MFU | 125955 tok/s step 14231/19560 | loss 3.296119 (-0.47z)| norm 0.2553 (-1.56z)| lr 1.11e-04 | 4160.33 ms | 32.5% bf16 MFU | 125959 tok/s step 14232/19560 | loss 3.343172 (+0.81z)| norm 0.2666 (-0.68z)| lr 1.11e-04 | 4159.76 ms | 32.5% bf16 MFU | 125963 tok/s step 14233/19560 | loss 3.283563 (-0.81z)| norm 0.2680 (-0.57z)| lr 1.11e-04 | 4159.31 ms | 32.5% bf16 MFU | 125967 tok/s step 14234/19560 | loss 3.324014 (+0.29z)| norm 0.2704 (-0.38z)| lr 1.11e-04 | 4150.97 ms | 32.5% bf16 MFU | 125984 tok/s step 14235/19560 | loss 3.314252 (+0.03z)| norm 0.2581 (-1.38z)| lr 1.11e-04 | 4155.16 ms | 32.5% bf16 MFU | 125994 tok/s step 14236/19560 | loss 3.328781 (+0.43z)| norm 0.3014 (+2.05z)| lr 1.10e-04 | 4157.47 ms | 32.5% bf16 MFU | 125999 tok/s step 14237/19560 | loss 3.325027 (+0.33z)| norm 0.2690 (-0.52z)| 
lr 1.10e-04 | 4159.19 ms | 32.5% bf16 MFU | 126002 tok/s step 14238/19560 | loss 3.345474 (+0.88z)| norm 0.2824 (+0.53z)| lr 1.10e-04 | 4146.89 ms | 32.6% bf16 MFU | 126023 tok/s step 14239/19560 | loss 3.346557 (+0.92z)| norm 0.2635 (-0.99z)| lr 1.10e-04 | 4161.51 ms | 32.4% bf16 MFU | 126022 tok/s step 14240/19560 | loss 3.307340 (-0.16z)| norm 0.2780 (+0.17z)| lr 1.10e-04 | 4159.75 ms | 32.5% bf16 MFU | 126022 tok/s step 14241/19560 | loss 3.289874 (-0.63z)| norm 0.2717 (-0.34z)| lr 1.10e-04 | 4164.57 ms | 32.4% bf16 MFU | 126016 tok/s step 14242/19560 | loss 3.286200 (-0.73z)| norm 0.2673 (-0.69z)| lr 1.10e-04 | 4156.84 ms | 32.5% bf16 MFU | 126021 tok/s step 14243/19560 | loss 3.360008 (+1.28z)| norm 0.2629 (-1.03z)| lr 1.10e-04 | 4149.37 ms | 32.5% bf16 MFU | 126038 tok/s step 14244/19560 | loss 3.351104 (+1.02z)| norm 0.2915 (+1.32z)| lr 1.10e-04 | 4163.00 ms | 32.4% bf16 MFU | 126033 tok/s step 14245/19560 | loss 3.336672 (+0.62z)| norm 0.2903 (+1.21z)| lr 1.10e-04 | 4150.14 ms | 32.5% bf16 MFU | 126048 tok/s step 14246/19560 | loss 3.320660 (+0.18z)| norm 0.2689 (-0.54z)| lr 1.10e-04 | 4154.11 ms | 32.5% bf16 MFU | 126056 tok/s step 14247/19560 | loss 3.400739 (+2.30z)| norm 0.2828 (+0.63z)| lr 1.10e-04 | 4160.55 ms | 32.5% bf16 MFU | 126054 tok/s step 14248/19560 | loss 3.289964 (-0.66z)| norm 0.2785 (+0.27z)| lr 1.10e-04 | 4149.96 ms | 32.5% bf16 MFU | 126068 tok/s step 14249/19560 | loss 3.301168 (-0.35z)| norm 0.2816 (+0.56z)| lr 1.10e-04 | 4159.50 ms | 32.5% bf16 MFU | 126067 tok/s step 14250/19560 | loss 3.329543 (+0.40z)| norm 0.2730 (-0.18z)| lr 1.10e-04 | 4183.30 ms | 32.3% bf16 MFU | 126030 tok/s val loss 3.305268 evaluating HellaSwag: 0/1256 evaluating HellaSwag: 10/1256 evaluating HellaSwag: 20/1256 evaluating HellaSwag: 30/1256 evaluating HellaSwag: 40/1256 evaluating HellaSwag: 50/1256 evaluating HellaSwag: 60/1256 evaluating HellaSwag: 70/1256 evaluating HellaSwag: 80/1256 evaluating HellaSwag: 90/1256 evaluating HellaSwag: 100/1256 
evaluating HellaSwag: 110/1256 evaluating HellaSwag: 120/1256 evaluating HellaSwag: 130/1256 evaluating HellaSwag: 140/1256 evaluating HellaSwag: 150/1256 evaluating HellaSwag: 160/1256 evaluating HellaSwag: 170/1256 evaluating HellaSwag: 180/1256 evaluating HellaSwag: 190/1256 evaluating HellaSwag: 200/1256 evaluating HellaSwag: 210/1256 evaluating HellaSwag: 220/1256 evaluating HellaSwag: 230/1256 evaluating HellaSwag: 240/1256 evaluating HellaSwag: 250/1256 evaluating HellaSwag: 260/1256 evaluating HellaSwag: 270/1256 evaluating HellaSwag: 280/1256 evaluating HellaSwag: 290/1256 evaluating HellaSwag: 300/1256 evaluating HellaSwag: 310/1256 evaluating HellaSwag: 320/1256 evaluating HellaSwag: 330/1256 evaluating HellaSwag: 340/1256 evaluating HellaSwag: 350/1256 evaluating HellaSwag: 360/1256 evaluating HellaSwag: 370/1256 evaluating HellaSwag: 380/1256 evaluating HellaSwag: 390/1256 evaluating HellaSwag: 400/1256 evaluating HellaSwag: 410/1256 evaluating HellaSwag: 420/1256 evaluating HellaSwag: 430/1256 evaluating HellaSwag: 440/1256 evaluating HellaSwag: 450/1256 evaluating HellaSwag: 460/1256 evaluating HellaSwag: 470/1256 evaluating HellaSwag: 480/1256 evaluating HellaSwag: 490/1256 evaluating HellaSwag: 500/1256 evaluating HellaSwag: 510/1256 evaluating HellaSwag: 520/1256 evaluating HellaSwag: 530/1256 evaluating HellaSwag: 540/1256 evaluating HellaSwag: 550/1256 evaluating HellaSwag: 560/1256 evaluating HellaSwag: 570/1256 evaluating HellaSwag: 580/1256 evaluating HellaSwag: 590/1256 evaluating HellaSwag: 600/1256 evaluating HellaSwag: 610/1256 evaluating HellaSwag: 620/1256 evaluating HellaSwag: 630/1256 evaluating HellaSwag: 640/1256 evaluating HellaSwag: 650/1256 evaluating HellaSwag: 660/1256 evaluating HellaSwag: 670/1256 evaluating HellaSwag: 680/1256 evaluating HellaSwag: 690/1256 evaluating HellaSwag: 700/1256 evaluating HellaSwag: 710/1256 evaluating HellaSwag: 720/1256 evaluating HellaSwag: 730/1256 evaluating HellaSwag: 740/1256 evaluating 
HellaSwag: 750/1256 evaluating HellaSwag: 760/1256 evaluating HellaSwag: 770/1256 evaluating HellaSwag: 780/1256 evaluating HellaSwag: 790/1256 evaluating HellaSwag: 800/1256 evaluating HellaSwag: 810/1256 evaluating HellaSwag: 820/1256 evaluating HellaSwag: 830/1256 evaluating HellaSwag: 840/1256 evaluating HellaSwag: 850/1256 evaluating HellaSwag: 860/1256 evaluating HellaSwag: 870/1256 evaluating HellaSwag: 880/1256 evaluating HellaSwag: 890/1256 evaluating HellaSwag: 900/1256 evaluating HellaSwag: 910/1256 evaluating HellaSwag: 920/1256 evaluating HellaSwag: 930/1256 evaluating HellaSwag: 940/1256 evaluating HellaSwag: 950/1256 evaluating HellaSwag: 960/1256 evaluating HellaSwag: 970/1256 evaluating HellaSwag: 980/1256 evaluating HellaSwag: 990/1256 evaluating HellaSwag: 1000/1256 evaluating HellaSwag: 1010/1256 evaluating HellaSwag: 1020/1256 evaluating HellaSwag: 1030/1256 evaluating HellaSwag: 1040/1256 evaluating HellaSwag: 1050/1256 evaluating HellaSwag: 1060/1256 evaluating HellaSwag: 1070/1256 evaluating HellaSwag: 1080/1256 evaluating HellaSwag: 1090/1256 evaluating HellaSwag: 1100/1256 evaluating HellaSwag: 1110/1256 evaluating HellaSwag: 1120/1256 evaluating HellaSwag: 1130/1256 evaluating HellaSwag: 1140/1256 evaluating HellaSwag: 1150/1256 evaluating HellaSwag: 1160/1256 evaluating HellaSwag: 1170/1256 evaluating HellaSwag: 1180/1256 evaluating HellaSwag: 1190/1256 evaluating HellaSwag: 1200/1256 evaluating HellaSwag: 1210/1256 evaluating HellaSwag: 1220/1256 evaluating HellaSwag: 1230/1256 evaluating HellaSwag: 1240/1256 evaluating HellaSwag: 1250/1256 HellaSwag: 3006/10042 = 0.299343 step 14251/19560 | loss 3.291517 (-0.61z)| norm 0.2819 (+0.59z)| lr 1.10e-04 | 4162.86 ms | 32.4% bf16 MFU | 126026 tok/s step 14252/19560 | loss 3.302630 (-0.32z)| norm 0.2957 (+1.75z)| lr 1.10e-04 | 4156.00 ms | 32.5% bf16 MFU | 126032 tok/s step 14253/19560 | loss 3.279011 (-0.95z)| norm 0.2841 (+0.76z)| lr 1.10e-04 | 4160.30 ms | 32.5% bf16 MFU | 126032 tok/s step 
14254/19560 | loss 3.348997 (+0.93z)| norm 0.2950 (+1.68z)| lr 1.10e-04 | 4155.99 ms | 32.5% bf16 MFU | 126038 tok/s step 14255/19560 | loss 3.343822 (+0.78z)| norm 0.2755 (+0.02z)| lr 1.10e-04 | 4154.82 ms | 32.5% bf16 MFU | 126045 tok/s step 14256/19560 | loss 3.365711 (+1.35z)| norm 0.2821 (+0.58z)| lr 1.10e-04 | 4163.05 ms | 32.4% bf16 MFU | 126040 tok/s step 14257/19560 | loss 3.343492 (+0.76z)| norm 0.2789 (+0.32z)| lr 1.10e-04 | 4162.43 ms | 32.4% bf16 MFU | 126036 tok/s step 14258/19560 | loss 3.297975 (-0.46z)| norm 0.2874 (+1.03z)| lr 1.10e-04 | 4159.53 ms | 32.5% bf16 MFU | 126036 tok/s step 14259/19560 | loss 3.282585 (-0.86z)| norm 0.2794 (+0.34z)| lr 1.10e-04 | 4156.39 ms | 32.5% bf16 MFU | 126041 tok/s step 14260/19560 | loss 3.324751 (+0.27z)| norm 0.2633 (-1.04z)| lr 1.10e-04 | 4156.49 ms | 32.5% bf16 MFU | 126046 tok/s step 14261/19560 | loss 3.333712 (+0.51z)| norm 0.2788 (+0.29z)| lr 1.10e-04 | 4166.04 ms | 32.4% bf16 MFU | 126036 tok/s step 14262/19560 | loss 3.274257 (-1.11z)| norm 0.2708 (-0.39z)| lr 1.09e-04 | 4158.32 ms | 32.5% bf16 MFU | 126038 tok/s step 14263/19560 | loss 3.333795 (+0.51z)| norm 0.2751 (-0.02z)| lr 1.09e-04 | 4152.59 ms | 32.5% bf16 MFU | 126049 tok/s step 14264/19560 | loss 3.306113 (-0.25z)| norm 0.2643 (-0.97z)| lr 1.09e-04 | 4155.56 ms | 32.5% bf16 MFU | 126055 tok/s step 14265/19560 | loss 3.293159 (-0.60z)| norm 0.2734 (-0.17z)| lr 1.09e-04 | 4161.36 ms | 32.4% bf16 MFU | 126052 tok/s step 14266/19560 | loss 3.358365 (+1.16z)| norm 0.2724 (-0.26z)| lr 1.09e-04 | 4163.06 ms | 32.4% bf16 MFU | 126046 tok/s step 14267/19560 | loss 3.318691 (+0.08z)| norm 0.2667 (-0.78z)| lr 1.09e-04 | 4149.97 ms | 32.5% bf16 MFU | 126061 tok/s step 14268/19560 | loss 3.355247 (+1.07z)| norm 0.2644 (-0.97z)| lr 1.09e-04 | 4156.63 ms | 32.5% bf16 MFU | 126064 tok/s step 14269/19560 | loss 3.276047 (-1.09z)| norm 0.2599 (-1.36z)| lr 1.09e-04 | 4160.80 ms | 32.4% bf16 MFU | 126061 tok/s step 14270/19560 | loss 3.315856 (+0.01z)| norm 
0.2600 (-1.33z)| lr 1.09e-04 | 4154.78 ms | 32.5% bf16 MFU | 126068 tok/s step 14271/19560 | loss 3.358512 (+1.16z)| norm 0.2803 (+0.46z)| lr 1.09e-04 | 4159.19 ms | 32.5% bf16 MFU | 126067 tok/s step 14272/19560 | loss 3.313684 (-0.06z)| norm 0.2536 (-1.87z)| lr 1.09e-04 | 4154.28 ms | 32.5% bf16 MFU | 126074 tok/s step 14273/19560 | loss 3.363648 (+1.29z)| norm 0.2926 (+1.52z)| lr 1.09e-04 | 4158.72 ms | 32.5% bf16 MFU | 126074 tok/s step 14274/19560 | loss 3.295505 (-0.57z)| norm 0.2644 (-0.92z)| lr 1.09e-04 | 4159.23 ms | 32.5% bf16 MFU | 126073 tok/s step 14275/19560 | loss 3.278898 (-1.02z)| norm 0.2573 (-1.52z)| lr 1.09e-04 | 4153.94 ms | 32.5% bf16 MFU | 126080 tok/s step 14276/19560 | loss 3.339254 (+0.61z)| norm 0.2597 (-1.30z)| lr 1.09e-04 | 4161.24 ms | 32.4% bf16 MFU | 126076 tok/s step 14277/19560 | loss 3.269616 (-1.27z)| norm 0.2510 (-2.00z)| lr 1.09e-04 | 4157.51 ms | 32.5% bf16 MFU | 126077 tok/s step 14278/19560 | loss 3.319438 (+0.08z)| norm 0.2596 (-1.25z)| lr 1.09e-04 | 4156.77 ms | 32.5% bf16 MFU | 126080 tok/s step 14279/19560 | loss 3.386568 (+1.86z)| norm 0.2625 (-1.00z)| lr 1.09e-04 | 4149.39 ms | 32.5% bf16 MFU | 126093 tok/s step 14280/19560 | loss 3.352942 (+0.95z)| norm 0.2505 (-1.98z)| lr 1.09e-04 | 4163.48 ms | 32.4% bf16 MFU | 126085 tok/s step 14281/19560 | loss 3.310429 (-0.18z)| norm 0.2654 (-0.73z)| lr 1.09e-04 | 4168.36 ms | 32.4% bf16 MFU | 126070 tok/s step 14282/19560 | loss 3.371826 (+1.44z)| norm 0.2593 (-1.22z)| lr 1.09e-04 | 4159.65 ms | 32.5% bf16 MFU | 126068 tok/s step 14283/19560 | loss 3.315200 (-0.07z)| norm 0.2721 (-0.17z)| lr 1.09e-04 | 4148.15 ms | 32.5% bf16 MFU | 126084 tok/s step 14284/19560 | loss 3.305085 (-0.34z)| norm 0.2728 (-0.10z)| lr 1.09e-04 | 4155.54 ms | 32.5% bf16 MFU | 126088 tok/s step 14285/19560 | loss 3.310689 (-0.18z)| norm 0.2680 (-0.52z)| lr 1.09e-04 | 4168.24 ms | 32.4% bf16 MFU | 126073 tok/s step 14286/19560 | loss 3.334723 (+0.46z)| norm 0.2937 (+1.61z)| lr 1.09e-04 | 4153.09 ms | 
32.5% bf16 MFU | 126081 tok/s step 14287/19560 | loss 3.364224 (+1.23z)| norm 0.2684 (-0.49z)| lr 1.09e-04 | 4153.54 ms | 32.5% bf16 MFU | 126089 tok/s step 14288/19560 | loss 3.295786 (-0.61z)| norm 0.2716 (-0.24z)| lr 1.08e-04 | 4152.61 ms | 32.5% bf16 MFU | 126097 tok/s step 14289/19560 | loss 3.360873 (+1.12z)| norm 0.2652 (-0.77z)| lr 1.08e-04 | 4164.19 ms | 32.4% bf16 MFU | 126087 tok/s step 14290/19560 | loss 3.334097 (+0.40z)| norm 0.2802 (+0.48z)| lr 1.08e-04 | 4150.03 ms | 32.5% bf16 MFU | 126100 tok/s step 14291/19560 | loss 3.342528 (+0.62z)| norm 0.2771 (+0.22z)| lr 1.08e-04 | 4159.06 ms | 32.5% bf16 MFU | 126098 tok/s step 14292/19560 | loss 3.350724 (+0.88z)| norm 0.2596 (-1.23z)| lr 1.08e-04 | 4160.55 ms | 32.5% bf16 MFU | 126093 tok/s step 14293/19560 | loss 3.324814 (+0.16z)| norm 0.2761 (+0.17z)| lr 1.08e-04 | 4149.21 ms | 32.5% bf16 MFU | 126107 tok/s step 14294/19560 | loss 3.345228 (+0.72z)| norm 0.2925 (+1.56z)| lr 1.08e-04 | 4151.61 ms | 32.5% bf16 MFU | 126116 tok/s step 14295/19560 | loss 3.252286 (-1.87z)| norm 0.2826 (+0.71z)| lr 1.08e-04 | 4157.60 ms | 32.5% bf16 MFU | 126115 tok/s step 14296/19560 | loss 3.306118 (-0.37z)| norm 0.2926 (+1.54z)| lr 1.08e-04 | 4155.83 ms | 32.5% bf16 MFU | 126117 tok/s step 14297/19560 | loss 3.359735 (+1.11z)| norm 0.2630 (-0.94z)| lr 1.08e-04 | 4175.82 ms | 32.3% bf16 MFU | 126089 tok/s step 14298/19560 | loss 3.328730 (+0.25z)| norm 0.3335 (+4.52z)| lr 1.08e-04 | 4156.83 ms | 32.5% bf16 MFU | 126091 tok/s step 14299/19560 | loss 3.334529 (+0.40z)| norm 0.2792 (+0.34z)| lr 1.08e-04 | 4148.97 ms | 32.5% bf16 MFU | 126105 tok/s step 14300/19560 | loss 3.378738 (+1.61z)| norm 0.2639 (-0.84z)| lr 1.08e-04 | 4156.00 ms | 32.5% bf16 MFU | 126107 tok/s step 14301/19560 | loss 3.293207 (-0.75z)| norm 0.2691 (-0.42z)| lr 1.08e-04 | 4162.29 ms | 32.4% bf16 MFU | 126100 tok/s step 14302/19560 | loss 3.384585 (+1.73z)| norm 0.2747 (-0.00z)| lr 1.08e-04 | 4166.95 ms | 32.4% bf16 MFU | 126086 tok/s step 14303/19560 
| loss 3.273112 (-1.29z)| norm 0.2588 (-1.21z)| lr 1.08e-04 | 4150.32 ms | 32.5% bf16 MFU | 126098 tok/s step 14304/19560 | loss 3.310139 (-0.27z)| norm 0.2686 (-0.45z)| lr 1.08e-04 | 4159.53 ms | 32.5% bf16 MFU | 126095 tok/s step 14305/19560 | loss 3.315424 (-0.13z)| norm 0.2561 (-1.39z)| lr 1.08e-04 | 4159.38 ms | 32.5% bf16 MFU | 126093 tok/s step 14306/19560 | loss 3.291260 (-0.79z)| norm 0.2690 (-0.40z)| lr 1.08e-04 | 4157.09 ms | 32.5% bf16 MFU | 126094 tok/s step 14307/19560 | loss 3.260879 (-1.60z)| norm 0.2623 (-0.90z)| lr 1.08e-04 | 4146.70 ms | 32.6% bf16 MFU | 126111 tok/s step 14308/19560 | loss 3.319860 (-0.01z)| norm 0.2816 (+0.58z)| lr 1.08e-04 | 4152.05 ms | 32.5% bf16 MFU | 126119 tok/s step 14309/19560 | loss 3.382498 (+1.66z)| norm 0.2779 (+0.30z)| lr 1.08e-04 | 4155.19 ms | 32.5% bf16 MFU | 126122 tok/s step 14310/19560 | loss 3.323202 (+0.06z)| norm 0.2856 (+0.89z)| lr 1.08e-04 | 4159.09 ms | 32.5% bf16 MFU | 126119 tok/s step 14311/19560 | loss 3.384873 (+1.72z)| norm 0.2601 (-1.08z)| lr 1.08e-04 | 4151.62 ms | 32.5% bf16 MFU | 126127 tok/s step 14312/19560 | loss 3.298183 (-0.62z)| norm 0.2974 (+1.87z)| lr 1.08e-04 | 4156.95 ms | 32.5% bf16 MFU | 126127 tok/s step 14313/19560 | loss 3.267907 (-1.47z)| norm 0.2739 (+0.02z)| lr 1.08e-04 | 4159.88 ms | 32.5% bf16 MFU | 126122 tok/s step 14314/19560 | loss 3.334486 (+0.35z)| norm 0.2717 (-0.15z)| lr 1.07e-04 | 4154.12 ms | 32.5% bf16 MFU | 126127 tok/s step 14315/19560 | loss 3.335400 (+0.38z)| norm 0.2805 (+0.55z)| lr 1.07e-04 | 4153.02 ms | 32.5% bf16 MFU | 126132 tok/s step 14316/19560 | loss 3.355209 (+0.91z)| norm 0.2799 (+0.50z)| lr 1.07e-04 | 4157.55 ms | 32.5% bf16 MFU | 126131 tok/s step 14317/19560 | loss 3.364253 (+1.14z)| norm 0.2760 (+0.19z)| lr 1.07e-04 | 4167.01 ms | 32.4% bf16 MFU | 126115 tok/s step 14318/19560 | loss 3.312650 (-0.27z)| norm 0.2765 (+0.23z)| lr 1.07e-04 | 4157.49 ms | 32.5% bf16 MFU | 126115 tok/s step 14319/19560 | loss 3.320997 (-0.05z)| norm 0.2917 (+1.40z)| 
lr 1.07e-04 | 4242.95 ms | 31.8% bf16 MFU | 125988 tok/s step 14320/19560 | loss 3.324787 (+0.04z)| norm 0.2863 (+0.97z)| lr 1.07e-04 | 4153.94 ms | 32.5% bf16 MFU | 125999 tok/s step 14321/19560 | loss 3.366004 (+1.17z)| norm 0.2562 (-1.40z)| lr 1.07e-04 | 4148.66 ms | 32.5% bf16 MFU | 126018 tok/s step 14322/19560 | loss 3.260331 (-1.75z)| norm 0.2797 (+0.45z)| lr 1.07e-04 | 4151.23 ms | 32.5% bf16 MFU | 126032 tok/s step 14323/19560 | loss 3.311172 (-0.36z)| norm 0.2846 (+0.83z)| lr 1.07e-04 | 4153.30 ms | 32.5% bf16 MFU | 126042 tok/s step 14324/19560 | loss 3.341537 (+0.48z)| norm 0.2568 (-1.34z)| lr 1.07e-04 | 4153.44 ms | 32.5% bf16 MFU | 126051 tok/s step 14325/19560 | loss 3.336667 (+0.34z)| norm 0.2833 (+0.72z)| lr 1.07e-04 | 4154.72 ms | 32.5% bf16 MFU | 126058 tok/s step 14326/19560 | loss 3.313803 (-0.29z)| norm 0.2906 (+1.28z)| lr 1.07e-04 | 4143.40 ms | 32.6% bf16 MFU | 126082 tok/s step 14327/19560 | loss 3.333247 (+0.25z)| norm 0.2553 (-1.47z)| lr 1.07e-04 | 4146.57 ms | 32.6% bf16 MFU | 126100 tok/s step 14328/19560 | loss 3.303027 (-0.59z)| norm 0.2699 (-0.33z)| lr 1.07e-04 | 4153.68 ms | 32.5% bf16 MFU | 126106 tok/s step 14329/19560 | loss 3.302082 (-0.61z)| norm 0.2651 (-0.70z)| lr 1.07e-04 | 4154.07 ms | 32.5% bf16 MFU | 126111 tok/s step 14330/19560 | loss 3.336008 (+0.33z)| norm 0.2595 (-1.12z)| lr 1.07e-04 | 4164.93 ms | 32.4% bf16 MFU | 126100 tok/s step 14331/19560 | loss 3.295160 (-0.87z)| norm 0.2705 (-0.27z)| lr 1.07e-04 | 4150.61 ms | 32.5% bf16 MFU | 126111 tok/s step 14332/19560 | loss 3.348525 (+0.69z)| norm 0.2823 (+0.64z)| lr 1.07e-04 | 4154.41 ms | 32.5% bf16 MFU | 126115 tok/s step 14333/19560 | loss 3.333379 (+0.25z)| norm 0.2694 (-0.34z)| lr 1.07e-04 | 4161.03 ms | 32.4% bf16 MFU | 126109 tok/s step 14334/19560 | loss 3.268787 (-1.63z)| norm 0.2685 (-0.42z)| lr 1.07e-04 | 4164.61 ms | 32.4% bf16 MFU | 126098 tok/s step 14335/19560 | loss 3.351462 (+0.80z)| norm 0.2911 (+1.35z)| lr 1.07e-04 | 4162.75 ms | 32.4% bf16 MFU | 
126091 tok/s step 14336/19560 | loss 3.329542 (+0.15z)| norm 0.2515 (-1.74z)| lr 1.07e-04 | 4167.16 ms | 32.4% bf16 MFU | 126077 tok/s step 14337/19560 | loss 3.340290 (+0.46z)| norm 0.3148 (+3.07z)| lr 1.07e-04 | 4159.80 ms | 32.5% bf16 MFU | 126075 tok/s step 14338/19560 | loss 3.286952 (-1.12z)| norm 0.2524 (-1.59z)| lr 1.07e-04 | 4155.04 ms | 32.5% bf16 MFU | 126080 tok/s step 14339/19560 | loss 3.324373 (+0.00z)| norm 0.2664 (-0.55z)| lr 1.07e-04 | 4157.14 ms | 32.5% bf16 MFU | 126082 tok/s step 14340/19560 | loss 3.276024 (-1.42z)| norm 0.2659 (-0.58z)| lr 1.06e-04 | 4166.25 ms | 32.4% bf16 MFU | 126070 tok/s step 14341/19560 | loss 3.298002 (-0.76z)| norm 0.2754 (+0.13z)| lr 1.06e-04 | 4163.27 ms | 32.4% bf16 MFU | 126063 tok/s step 14342/19560 | loss 3.319778 (-0.13z)| norm 0.2704 (-0.24z)| lr 1.06e-04 | 4158.89 ms | 32.5% bf16 MFU | 126063 tok/s step 14343/19560 | loss 3.263012 (-1.81z)| norm 0.2684 (-0.40z)| lr 1.06e-04 | 4161.46 ms | 32.4% bf16 MFU | 126059 tok/s step 14344/19560 | loss 3.354157 (+0.89z)| norm 0.2657 (-0.58z)| lr 1.06e-04 | 4161.29 ms | 32.4% bf16 MFU | 126056 tok/s step 14345/19560 | loss 3.299073 (-0.75z)| norm 0.2943 (+1.52z)| lr 1.06e-04 | 4160.08 ms | 32.5% bf16 MFU | 126055 tok/s step 14346/19560 | loss 3.480312 (+4.31z)| norm 0.2721 (-0.13z)| lr 1.06e-04 | 4156.48 ms | 32.5% bf16 MFU | 126059 tok/s step 14347/19560 | loss 3.317799 (-0.20z)| norm 0.2716 (-0.16z)| lr 1.06e-04 | 4156.82 ms | 32.5% bf16 MFU | 126062 tok/s step 14348/19560 | loss 3.283408 (-1.15z)| norm 0.2759 (+0.18z)| lr 1.06e-04 | 4160.99 ms | 32.4% bf16 MFU | 126059 tok/s step 14349/19560 | loss 3.305023 (-0.54z)| norm 0.2982 (+1.80z)| lr 1.06e-04 | 4157.41 ms | 32.5% bf16 MFU | 126062 tok/s step 14350/19560 | loss 3.328096 (+0.09z)| norm 0.2667 (-0.52z)| lr 1.06e-04 | 4170.10 ms | 32.4% bf16 MFU | 126045 tok/s step 14351/19560 | loss 3.411655 (+2.34z)| norm 0.2815 (+0.58z)| lr 1.06e-04 | 4152.16 ms | 32.5% bf16 MFU | 126056 tok/s step 14352/19560 | loss 3.345791 
(+0.55z)| norm 0.2695 (-0.32z)| lr 1.06e-04 | 4164.63 ms | 32.4% bf16 MFU | 126048 tok/s step 14353/19560 | loss 3.295729 (-0.81z)| norm 0.2872 (+0.99z)| lr 1.06e-04 | 4167.11 ms | 32.4% bf16 MFU | 126036 tok/s step 14354/19560 | loss 3.407121 (+2.22z)| norm 0.2690 (-0.36z)| lr 1.06e-04 | 4161.74 ms | 32.4% bf16 MFU | 126033 tok/s step 14355/19560 | loss 3.376525 (+1.45z)| norm 0.2739 (+0.03z)| lr 1.06e-04 | 4165.21 ms | 32.4% bf16 MFU | 126025 tok/s step 14356/19560 | loss 3.288593 (-1.02z)| norm 0.2707 (-0.23z)| lr 1.06e-04 | 4162.67 ms | 32.4% bf16 MFU | 126021 tok/s step 14357/19560 | loss 3.348758 (+0.67z)| norm 0.2709 (-0.21z)| lr 1.06e-04 | 4165.54 ms | 32.4% bf16 MFU | 126014 tok/s step 14358/19560 | loss 3.284443 (-1.12z)| norm 0.2603 (-1.02z)| lr 1.06e-04 | 4150.56 ms | 32.5% bf16 MFU | 126029 tok/s step 14359/19560 | loss 3.375060 (+1.38z)| norm 0.2858 (+0.91z)| lr 1.06e-04 | 4159.76 ms | 32.5% bf16 MFU | 126029 tok/s step 14360/19560 | loss 3.330758 (+0.16z)| norm 0.2645 (-0.72z)| lr 1.06e-04 | 4159.06 ms | 32.5% bf16 MFU | 126031 tok/s step 14361/19560 | loss 3.289834 (-0.98z)| norm 0.2866 (+0.96z)| lr 1.06e-04 | 4161.62 ms | 32.4% bf16 MFU | 126028 tok/s step 14362/19560 | loss 3.337233 (+0.33z)| norm 0.2644 (-0.73z)| lr 1.06e-04 | 4158.20 ms | 32.5% bf16 MFU | 126031 tok/s step 14363/19560 | loss 3.337816 (+0.34z)| norm 0.2831 (+0.68z)| lr 1.06e-04 | 4161.26 ms | 32.4% bf16 MFU | 126029 tok/s step 14364/19560 | loss 3.358485 (+0.91z)| norm 0.2860 (+0.92z)| lr 1.06e-04 | 4219.06 ms | 32.0% bf16 MFU | 125941 tok/s step 14365/19560 | loss 3.328572 (+0.08z)| norm 0.2649 (-0.71z)| lr 1.06e-04 | 4167.28 ms | 32.4% bf16 MFU | 125935 tok/s step 14366/19560 | loss 3.349216 (+0.65z)| norm 0.2768 (+0.22z)| lr 1.05e-04 | 4154.85 ms | 32.5% bf16 MFU | 125947 tok/s step 14367/19560 | loss 3.347307 (+0.60z)| norm 0.2690 (-0.39z)| lr 1.05e-04 | 4161.22 ms | 32.4% bf16 MFU | 125950 tok/s step 14368/19560 | loss 3.300565 (-0.70z)| norm 0.2613 (-0.97z)| lr 1.05e-04 | 
4163.43 ms | 32.4% bf16 MFU | 125948 tok/s step 14369/19560 | loss 3.327954 (+0.05z)| norm 0.2551 (-1.43z)| lr 1.05e-04 | 4154.52 ms | 32.5% bf16 MFU | 125961 tok/s step 14370/19560 | loss 3.370782 (+1.23z)| norm 0.2809 (+0.53z)| lr 1.05e-04 | 4167.09 ms | 32.4% bf16 MFU | 125954 tok/s step 14371/19560 | loss 3.363005 (+1.01z)| norm 0.2745 (+0.04z)| lr 1.05e-04 | 4161.50 ms | 32.4% bf16 MFU | 125955 tok/s step 14372/19560 | loss 3.322915 (-0.10z)| norm 0.2457 (-2.13z)| lr 1.05e-04 | 4158.00 ms | 32.5% bf16 MFU | 125962 tok/s step 14373/19560 | loss 3.389674 (+1.73z)| norm 0.2800 (+0.50z)| lr 1.05e-04 | 4180.06 ms | 32.3% bf16 MFU | 125935 tok/s step 14374/19560 | loss 3.315649 (-0.31z)| norm 0.2585 (-1.14z)| lr 1.05e-04 | 4166.36 ms | 32.4% bf16 MFU | 125930 tok/s step 14375/19560 | loss 3.316674 (-0.27z)| norm 0.2805 (+0.54z)| lr 1.05e-04 | 4168.55 ms | 32.4% bf16 MFU | 125922 tok/s step 14376/19560 | loss 3.303277 (-0.64z)| norm 0.2578 (-1.18z)| lr 1.05e-04 | 4173.48 ms | 32.4% bf16 MFU | 125908 tok/s step 14377/19560 | loss 3.282713 (-1.21z)| norm 0.2737 (+0.04z)| lr 1.05e-04 | 4191.48 ms | 32.2% bf16 MFU | 125866 tok/s step 14378/19560 | loss 3.300854 (-0.70z)| norm 0.2531 (-1.50z)| lr 1.05e-04 | 4167.25 ms | 32.4% bf16 MFU | 125864 tok/s step 14379/19560 | loss 3.369912 (+1.21z)| norm 0.2768 (+0.28z)| lr 1.05e-04 | 4160.89 ms | 32.4% bf16 MFU | 125871 tok/s step 14380/19560 | loss 3.292761 (-0.93z)| norm 0.2684 (-0.34z)| lr 1.05e-04 | 4163.83 ms | 32.4% bf16 MFU | 125873 tok/s step 14381/19560 | loss 3.368659 (+1.15z)| norm 0.2754 (+0.20z)| lr 1.05e-04 | 4157.72 ms | 32.5% bf16 MFU | 125884 tok/s step 14382/19560 | loss 3.334662 (+0.21z)| norm 0.2692 (-0.26z)| lr 1.05e-04 | 4162.87 ms | 32.4% bf16 MFU | 125887 tok/s step 14383/19560 | loss 3.409436 (+2.23z)| norm 0.2972 (+1.87z)| lr 1.05e-04 | 4161.95 ms | 32.4% bf16 MFU | 125891 tok/s step 14384/19560 | loss 3.347245 (+0.54z)| norm 0.2816 (+0.68z)| lr 1.05e-04 | 4164.33 ms | 32.4% bf16 MFU | 125892 tok/s step 
14385/19560 | loss 3.323639 (-0.10z)| norm 0.2613 (-0.86z)| lr 1.05e-04 | 4165.31 ms | 32.4% bf16 MFU | 125891 tok/s step 14386/19560 | loss 3.391099 (+1.71z)| norm 0.2749 (+0.19z)| lr 1.05e-04 | 4159.71 ms | 32.5% bf16 MFU | 125898 tok/s step 14387/19560 | loss 3.337043 (+0.24z)| norm 0.2917 (+1.45z)| lr 1.05e-04 | 4162.94 ms | 32.4% bf16 MFU | 125900 tok/s step 14388/19560 | loss 3.303117 (-0.68z)| norm 0.2666 (-0.46z)| lr 1.05e-04 | 4160.06 ms | 32.5% bf16 MFU | 125907 tok/s step 14389/19560 | loss 3.357683 (+0.80z)| norm 0.2678 (-0.36z)| lr 1.05e-04 | 4158.14 ms | 32.5% bf16 MFU | 125916 tok/s step 14390/19560 | loss 3.292044 (-1.00z)| norm 0.2903 (+1.33z)| lr 1.05e-04 | 4158.64 ms | 32.5% bf16 MFU | 125924 tok/s step 14391/19560 | loss 3.310489 (-0.49z)| norm 0.2642 (-0.63z)| lr 1.05e-04 | 4162.45 ms | 32.4% bf16 MFU | 125925 tok/s step 14392/19560 | loss 3.329341 (+0.02z)| norm 0.2687 (-0.29z)| lr 1.05e-04 | 4161.77 ms | 32.4% bf16 MFU | 125928 tok/s step 14393/19560 | loss 3.295102 (-0.92z)| norm 0.2894 (+1.25z)| lr 1.04e-04 | 4160.62 ms | 32.5% bf16 MFU | 125932 tok/s step 14394/19560 | loss 3.450340 (+3.18z)| norm 0.2981 (+1.86z)| lr 1.04e-04 | 4155.27 ms | 32.5% bf16 MFU | 125944 tok/s step 14395/19560 | loss 3.291648 (-0.98z)| norm 0.2806 (+0.56z)| lr 1.04e-04 | 4160.52 ms | 32.5% bf16 MFU | 125948 tok/s step 14396/19560 | loss 3.319504 (-0.24z)| norm 0.2849 (+0.87z)| lr 1.04e-04 | 4157.75 ms | 32.5% bf16 MFU | 125955 tok/s step 14397/19560 | loss 3.273162 (-1.46z)| norm 0.2940 (+1.51z)| lr 1.04e-04 | 4164.15 ms | 32.4% bf16 MFU | 125953 tok/s step 14398/19560 | loss 3.346071 (+0.45z)| norm 0.2838 (+0.75z)| lr 1.04e-04 | 4163.30 ms | 32.4% bf16 MFU | 125952 tok/s step 14399/19560 | loss 3.307699 (-0.55z)| norm 0.3041 (+2.19z)| lr 1.04e-04 | 4167.07 ms | 32.4% bf16 MFU | 125945 tok/s step 14400/19560 | loss 3.371124 (+1.10z)| norm 0.2734 (-0.04z)| lr 1.04e-04 | 4154.68 ms | 32.5% bf16 MFU | 125957 tok/s step 14401/19560 | loss 3.311677 (-0.45z)| norm 
0.2789 (+0.36z)| lr 1.04e-04 | 4149.76 ms | 32.5% bf16 MFU | 125976 tok/s step 14402/19560 | loss 3.309762 (-0.50z)| norm 0.2798 (+0.42z)| lr 1.04e-04 | 4164.93 ms | 32.4% bf16 MFU | 125972 tok/s step 14403/19560 | loss 3.355333 (+0.68z)| norm 0.2759 (+0.13z)| lr 1.04e-04 | 4156.97 ms | 32.5% bf16 MFU | 125979 tok/s step 14404/19560 | loss 3.331873 (+0.07z)| norm 0.2643 (-0.73z)| lr 1.04e-04 | 4156.60 ms | 32.5% bf16 MFU | 125987 tok/s step 14405/19560 | loss 3.345701 (+0.42z)| norm 0.2900 (+1.16z)| lr 1.04e-04 | 4157.50 ms | 32.5% bf16 MFU | 125993 tok/s step 14406/19560 | loss 3.318631 (-0.30z)| norm 0.2902 (+1.15z)| lr 1.04e-04 | 4155.60 ms | 32.5% bf16 MFU | 126002 tok/s step 14407/19560 | loss 3.310341 (-0.51z)| norm 0.2651 (-0.72z)| lr 1.04e-04 | 4165.44 ms | 32.4% bf16 MFU | 125995 tok/s step 14408/19560 | loss 3.308725 (-0.54z)| norm 0.2786 (+0.27z)| lr 1.04e-04 | 4222.98 ms | 32.0% bf16 MFU | 125903 tok/s step 14409/19560 | loss 3.277495 (-1.37z)| norm 0.2880 (+0.97z)| lr 1.04e-04 | 4171.01 ms | 32.4% bf16 MFU | 125892 tok/s step 14410/19560 | loss 3.375500 (+1.25z)| norm 0.2776 (+0.18z)| lr 1.04e-04 | 4171.95 ms | 32.4% bf16 MFU | 125881 tok/s step 14411/19560 | loss 3.313543 (-0.40z)| norm 0.2771 (+0.13z)| lr 1.04e-04 | 4164.97 ms | 32.4% bf16 MFU | 125881 tok/s step 14412/19560 | loss 3.371164 (+1.11z)| norm 0.2849 (+0.72z)| lr 1.04e-04 | 4175.77 ms | 32.3% bf16 MFU | 125865 tok/s step 14413/19560 | loss 3.334196 (+0.13z)| norm 0.2718 (-0.28z)| lr 1.04e-04 | 4162.93 ms | 32.4% bf16 MFU | 125869 tok/s step 14414/19560 | loss 3.333404 (+0.11z)| norm 0.2675 (-0.59z)| lr 1.04e-04 | 4182.07 ms | 32.3% bf16 MFU | 125844 tok/s step 14415/19560 | loss 3.308770 (-0.54z)| norm 0.2867 (+0.86z)| lr 1.04e-04 | 4160.94 ms | 32.4% bf16 MFU | 125852 tok/s step 14416/19560 | loss 3.314269 (-0.40z)| norm 0.2616 (-1.04z)| lr 1.04e-04 | 4169.01 ms | 32.4% bf16 MFU | 125847 tok/s step 14417/19560 | loss 3.344890 (+0.43z)| norm 0.2821 (+0.51z)| lr 1.04e-04 | 4177.54 ms | 
32.3% bf16 MFU | 125830 tok/s step 14418/19560 | loss 3.324244 (-0.13z)| norm 0.3000 (+1.83z)| lr 1.04e-04 | 4168.40 ms | 32.4% bf16 MFU | 125827 tok/s step 14419/19560 | loss 3.371308 (+1.12z)| norm 0.2896 (+1.04z)| lr 1.03e-04 | 4175.14 ms | 32.3% bf16 MFU | 125814 tok/s step 14420/19560 | loss 3.324290 (-0.12z)| norm 0.2728 (-0.23z)| lr 1.03e-04 | 4167.12 ms | 32.4% bf16 MFU | 125814 tok/s step 14421/19560 | loss 3.342982 (+0.37z)| norm 0.2678 (-0.59z)| lr 1.03e-04 | 4177.51 ms | 32.3% bf16 MFU | 125799 tok/s step 14422/19560 | loss 3.363363 (+0.91z)| norm 0.2777 (+0.15z)| lr 1.03e-04 | 4164.30 ms | 32.4% bf16 MFU | 125804 tok/s step 14423/19560 | loss 3.329495 (-0.01z)| norm 0.2652 (-0.77z)| lr 1.03e-04 | 4171.74 ms | 32.4% bf16 MFU | 125797 tok/s step 14424/19560 | loss 3.410563 (+2.13z)| norm 0.3020 (+1.97z)| lr 1.03e-04 | 4177.91 ms | 32.3% bf16 MFU | 125782 tok/s step 14425/19560 | loss 3.309270 (-0.56z)| norm 0.2746 (-0.08z)| lr 1.03e-04 | 4172.19 ms | 32.4% bf16 MFU | 125776 tok/s step 14426/19560 | loss 3.385173 (+1.44z)| norm 0.2774 (+0.17z)| lr 1.03e-04 | 4154.95 ms | 32.5% bf16 MFU | 125797 tok/s step 14427/19560 | loss 3.331467 (+0.02z)| norm 0.2823 (+0.57z)| lr 1.03e-04 | 4170.60 ms | 32.4% bf16 MFU | 125792 tok/s step 14428/19560 | loss 3.305842 (-0.64z)| norm 0.2781 (+0.22z)| lr 1.03e-04 | 4169.81 ms | 32.4% bf16 MFU | 125789 tok/s step 14429/19560 | loss 3.316097 (-0.38z)| norm 0.2837 (+0.67z)| lr 1.03e-04 | 4159.03 ms | 32.5% bf16 MFU | 125803 tok/s step 14430/19560 | loss 3.370397 (+1.08z)| norm 0.2802 (+0.38z)| lr 1.03e-04 | 4164.08 ms | 32.4% bf16 MFU | 125808 tok/s step 14431/19560 | loss 3.320105 (-0.28z)| norm 0.2667 (-0.73z)| lr 1.03e-04 | 4164.33 ms | 32.4% bf16 MFU | 125813 tok/s step 14432/19560 | loss 3.291118 (-1.06z)| norm 0.2622 (-1.08z)| lr 1.03e-04 | 4170.76 ms | 32.4% bf16 MFU | 125807 tok/s step 14433/19560 | loss 3.315212 (-0.41z)| norm 0.3118 (+2.85z)| lr 1.03e-04 | 4170.98 ms | 32.4% bf16 MFU | 125802 tok/s step 14434/19560 
| loss 3.312590 (-0.48z)| norm 0.2561 (-1.56z)| lr 1.03e-04 | 4158.44 ms | 32.5% bf16 MFU | 125816 tok/s step 14435/19560 | loss 3.290592 (-1.09z)| norm 0.2791 (+0.25z)| lr 1.03e-04 | 4159.89 ms | 32.5% bf16 MFU | 125827 tok/s step 14436/19560 | loss 3.346190 (+0.41z)| norm 0.2650 (-0.86z)| lr 1.03e-04 | 4157.50 ms | 32.5% bf16 MFU | 125841 tok/s step 14437/19560 | loss 3.298499 (-0.87z)| norm 0.2666 (-0.72z)| lr 1.03e-04 | 4167.04 ms | 32.4% bf16 MFU | 125839 tok/s step 14438/19560 | loss 3.327541 (-0.08z)| norm 0.2757 (+0.00z)| lr 1.03e-04 | 4163.90 ms | 32.4% bf16 MFU | 125843 tok/s step 14439/19560 | loss 3.315579 (-0.39z)| norm 0.2707 (-0.40z)| lr 1.03e-04 | 4166.03 ms | 32.4% bf16 MFU | 125843 tok/s step 14440/19560 | loss 3.354022 (+0.66z)| norm 0.2587 (-1.34z)| lr 1.03e-04 | 4172.56 ms | 32.4% bf16 MFU | 125834 tok/s step 14441/19560 | loss 3.368338 (+1.04z)| norm 0.2689 (-0.52z)| lr 1.03e-04 | 4178.72 ms | 32.3% bf16 MFU | 125815 tok/s step 14442/19560 | loss 3.255141 (-2.06z)| norm 0.2821 (+0.53z)| lr 1.03e-04 | 4166.76 ms | 32.4% bf16 MFU | 125816 tok/s step 14443/19560 | loss 3.293900 (-0.99z)| norm 0.2746 (-0.07z)| lr 1.03e-04 | 4160.68 ms | 32.5% bf16 MFU | 125826 tok/s step 14444/19560 | loss 3.316890 (-0.35z)| norm 0.2786 (+0.25z)| lr 1.03e-04 | 4177.00 ms | 32.3% bf16 MFU | 125810 tok/s step 14445/19560 | loss 3.287581 (-1.13z)| norm 0.2533 (-1.73z)| lr 1.03e-04 | 4171.66 ms | 32.4% bf16 MFU | 125804 tok/s step 14446/19560 | loss 3.282313 (-1.26z)| norm 0.2599 (-1.20z)| lr 1.02e-04 | 4178.36 ms | 32.3% bf16 MFU | 125787 tok/s step 14447/19560 | loss 3.345883 (+0.45z)| norm 0.2708 (-0.33z)| lr 1.02e-04 | 4176.88 ms | 32.3% bf16 MFU | 125774 tok/s step 14448/19560 | loss 3.393088 (+1.70z)| norm 0.2582 (-1.30z)| lr 1.02e-04 | 4166.82 ms | 32.4% bf16 MFU | 125777 tok/s step 14449/19560 | loss 3.303743 (-0.68z)| norm 0.2627 (-0.95z)| lr 1.02e-04 | 4182.49 ms | 32.3% bf16 MFU | 125755 tok/s step 14450/19560 | loss 3.287626 (-1.13z)| norm 0.2584 (-1.27z)| 
lr 1.02e-04 | 4173.55 ms | 32.4% bf16 MFU | 125749 tok/s step 14451/19560 | loss 3.273745 (-1.49z)| norm 0.2635 (-0.86z)| lr 1.02e-04 | 4159.52 ms | 32.5% bf16 MFU | 125764 tok/s step 14452/19560 | loss 3.323422 (-0.15z)| norm 0.2650 (-0.76z)| lr 1.02e-04 | 4164.53 ms | 32.4% bf16 MFU | 125770 tok/s step 14453/19560 | loss 3.278473 (-1.34z)| norm 0.2649 (-0.75z)| lr 1.02e-04 | 4175.81 ms | 32.3% bf16 MFU | 125759 tok/s step 14454/19560 | loss 3.336459 (+0.21z)| norm 0.2721 (-0.17z)| lr 1.02e-04 | 4163.69 ms | 32.4% bf16 MFU | 125767 tok/s step 14455/19560 | loss 3.381383 (+1.39z)| norm 0.2665 (-0.63z)| lr 1.02e-04 | 4159.42 ms | 32.5% bf16 MFU | 125781 tok/s step 14456/19560 | loss 3.331407 (+0.06z)| norm 0.2697 (-0.37z)| lr 1.02e-04 | 4172.05 ms | 32.4% bf16 MFU | 125776 tok/s step 14457/19560 | loss 3.331684 (+0.06z)| norm 0.2733 (-0.09z)| lr 1.02e-04 | 4159.14 ms | 32.5% bf16 MFU | 125790 tok/s step 14458/19560 | loss 3.320404 (-0.24z)| norm 0.2683 (-0.50z)| lr 1.02e-04 | 4161.31 ms | 32.4% bf16 MFU | 125800 tok/s step 14459/19560 | loss 3.281984 (-1.25z)| norm 0.2646 (-0.79z)| lr 1.02e-04 | 4163.95 ms | 32.4% bf16 MFU | 125805 tok/s step 14460/19560 | loss 3.293101 (-0.95z)| norm 0.2653 (-0.73z)| lr 1.02e-04 | 4168.71 ms | 32.4% bf16 MFU | 125803 tok/s step 14461/19560 | loss 3.349731 (+0.55z)| norm 0.2751 (+0.06z)| lr 1.02e-04 | 4175.28 ms | 32.3% bf16 MFU | 125792 tok/s step 14462/19560 | loss 3.291913 (-0.99z)| norm 0.2741 (-0.02z)| lr 1.02e-04 | 4172.34 ms | 32.4% bf16 MFU | 125785 tok/s step 14463/19560 | loss 3.300260 (-0.76z)| norm 0.2864 (+0.98z)| lr 1.02e-04 | 4181.61 ms | 32.3% bf16 MFU | 125765 tok/s step 14464/19560 | loss 3.305030 (-0.62z)| norm 0.2677 (-0.56z)| lr 1.02e-04 | 4177.04 ms | 32.3% bf16 MFU | 125752 tok/s step 14465/19560 | loss 3.293013 (-0.93z)| norm 0.3028 (+2.37z)| lr 1.02e-04 | 4165.71 ms | 32.4% bf16 MFU | 125758 tok/s step 14466/19560 | loss 3.309567 (-0.50z)| norm 0.2638 (-0.90z)| lr 1.02e-04 | 4165.41 ms | 32.4% bf16 MFU | 
125763 tok/s step 14467/19560 | loss 3.382621 (+1.42z)| norm 0.2561 (-1.53z)| lr 1.02e-04 | 4185.01 ms | 32.3% bf16 MFU | 125739 tok/s step 14468/19560 | loss 3.340482 (+0.29z)| norm 0.2718 (-0.22z)| lr 1.02e-04 | 4171.06 ms | 32.4% bf16 MFU | 125737 tok/s step 14469/19560 | loss 3.315987 (-0.36z)| norm 0.2540 (-1.68z)| lr 1.02e-04 | 4176.38 ms | 32.3% bf16 MFU | 125727 tok/s step 14470/19560 | loss 3.306645 (-0.60z)| norm 0.2683 (-0.49z)| lr 1.02e-04 | 4164.73 ms | 32.4% bf16 MFU | 125735 tok/s step 14471/19560 | loss 3.289565 (-1.07z)| norm 0.2766 (+0.19z)| lr 1.02e-04 | 4174.13 ms | 32.3% bf16 MFU | 125728 tok/s step 14472/19560 | loss 3.306385 (-0.61z)| norm 0.2738 (-0.05z)| lr 1.01e-04 | 4165.27 ms | 32.4% bf16 MFU | 125735 tok/s step 14473/19560 | loss 3.344354 (+0.40z)| norm 0.3129 (+3.10z)| lr 1.01e-04 | 4169.55 ms | 32.4% bf16 MFU | 125736 tok/s step 14474/19560 | loss 3.290519 (-1.07z)| norm 0.2631 (-0.92z)| lr 1.01e-04 | 4156.12 ms | 32.5% bf16 MFU | 125756 tok/s step 14475/19560 | loss 3.313487 (-0.42z)| norm 0.2661 (-0.67z)| lr 1.01e-04 | 4162.81 ms | 32.4% bf16 MFU | 125766 tok/s step 14476/19560 | loss 3.329612 (+0.03z)| norm 0.2834 (+0.71z)| lr 1.01e-04 | 4168.42 ms | 32.4% bf16 MFU | 125766 tok/s step 14477/19560 | loss 3.329998 (+0.04z)| norm 0.2599 (-1.15z)| lr 1.01e-04 | 4157.01 ms | 32.5% bf16 MFU | 125784 tok/s step 14478/19560 | loss 3.315035 (-0.39z)| norm 0.2809 (+0.53z)| lr 1.01e-04 | 4155.51 ms | 32.5% bf16 MFU | 125803 tok/s step 14479/19560 | loss 3.316676 (-0.33z)| norm 0.2649 (-0.75z)| lr 1.01e-04 | 4156.17 ms | 32.5% bf16 MFU | 125820 tok/s step 14480/19560 | loss 3.289189 (-1.12z)| norm 0.2738 (-0.03z)| lr 1.01e-04 | 4172.47 ms | 32.4% bf16 MFU | 125812 tok/s step 14481/19560 | loss 3.316279 (-0.33z)| norm 0.2584 (-1.26z)| lr 1.01e-04 | 4151.12 ms | 32.5% bf16 MFU | 125836 tok/s step 14482/19560 | loss 3.334182 (+0.22z)| norm 0.2639 (-0.81z)| lr 1.01e-04 | 4175.86 ms | 32.3% bf16 MFU | 125822 tok/s step 14483/19560 | loss 3.386400 
(+1.78z)| norm 0.2890 (+1.20z)| lr 1.01e-04 | 4178.64 ms | 32.3% bf16 MFU | 125805 tok/s step 14484/19560 | loss 3.372283 (+1.34z)| norm 0.2711 (-0.24z)| lr 1.01e-04 | 4174.37 ms | 32.3% bf16 MFU | 125794 tok/s step 14485/19560 | loss 3.334159 (+0.20z)| norm 0.2644 (-0.77z)| lr 1.01e-04 | 4168.51 ms | 32.4% bf16 MFU | 125793 tok/s step 14486/19560 | loss 3.342963 (+0.45z)| norm 0.2663 (-0.62z)| lr 1.01e-04 | 4161.12 ms | 32.4% bf16 MFU | 125803 tok/s step 14487/19560 | loss 3.327765 (+0.00z)| norm 0.2648 (-0.73z)| lr 1.01e-04 | 4175.95 ms | 32.3% bf16 MFU | 125791 tok/s step 14488/19560 | loss 3.444429 (+3.36z)| norm 0.2722 (-0.14z)| lr 1.01e-04 | 4182.03 ms | 32.3% bf16 MFU | 125769 tok/s step 14489/19560 | loss 3.281456 (-1.36z)| norm 0.2501 (-1.88z)| lr 1.01e-04 | 4169.33 ms | 32.4% bf16 MFU | 125768 tok/s step 14490/19560 | loss 3.303853 (-0.70z)| norm 0.2823 (+0.68z)| lr 1.01e-04 | 4174.23 ms | 32.3% bf16 MFU | 125760 tok/s step 14491/19560 | loss 3.319138 (-0.26z)| norm 0.2655 (-0.65z)| lr 1.01e-04 | 4169.16 ms | 32.4% bf16 MFU | 125760 tok/s step 14492/19560 | loss 3.357666 (+0.86z)| norm 0.2719 (-0.13z)| lr 1.01e-04 | 4155.41 ms | 32.5% bf16 MFU | 125780 tok/s step 14493/19560 | loss 3.351211 (+0.66z)| norm 0.2702 (-0.27z)| lr 1.01e-04 | 4181.97 ms | 32.3% bf16 MFU | 125760 tok/s step 14494/19560 | loss 3.382255 (+1.54z)| norm 0.2773 (+0.30z)| lr 1.01e-04 | 4166.12 ms | 32.4% bf16 MFU | 125764 tok/s step 14495/19560 | loss 3.380672 (+1.47z)| norm 0.2770 (+0.26z)| lr 1.01e-04 | 4168.02 ms | 32.4% bf16 MFU | 125765 tok/s step 14496/19560 | loss 3.321403 (-0.21z)| norm 0.2703 (-0.28z)| lr 1.01e-04 | 4164.10 ms | 32.4% bf16 MFU | 125772 tok/s step 14497/19560 | loss 3.336983 (+0.23z)| norm 0.2773 (+0.28z)| lr 1.01e-04 | 4163.00 ms | 32.4% bf16 MFU | 125781 tok/s step 14498/19560 | loss 3.329527 (+0.03z)| norm 0.2601 (-1.10z)| lr 1.01e-04 | 4166.56 ms | 32.4% bf16 MFU | 125783 tok/s step 14499/19560 | loss 3.303888 (-0.70z)| norm 0.2670 (-0.54z)| lr 1.00e-04 | 
4161.06 ms | 32.4% bf16 MFU | 125794 tok/s step 14500/19560 | loss 3.325466 (-0.08z)| norm 0.2729 (-0.08z)| lr 1.00e-04 | 4179.42 ms | 32.3% bf16 MFU | 125777 tok/s val loss 3.300595
HellaSwag: 2997/10042 = 0.298447 step 14501/19560 | loss 3.303594 (-0.69z)| norm 0.2667 (-0.59z)| lr 1.00e-04 | 4161.88 ms | 32.4% bf16 MFU | 125786 tok/s step 14502/19560 | loss 3.329575 (+0.06z)| norm 0.2658 (-0.67z)| lr 1.00e-04 | 4155.02 ms | 32.5% bf16 MFU | 125806 tok/s step 14503/19560 | loss 3.315989 (-0.34z)| norm 0.2759 (+0.17z)| lr 1.00e-04 | 4176.46 ms | 32.3% bf16 MFU | 125793 tok/s step 14504/19560 | loss 3.334344 (+0.19z)| norm 0.2738 (-0.01z)| lr 1.00e-04 | 4161.01 ms | 32.4% bf16 MFU | 125803 tok/s step 14505/19560 | loss 3.352962 (+0.72z)| norm 0.2690 (-0.41z)| lr 1.00e-04 | 4173.48 ms | 32.4% bf16 MFU | 125794 tok/s step 14506/19560 | loss 3.323411 (-0.15z)| norm 0.2874 (+1.12z)| lr 1.00e-04 | 4165.41 ms | 32.4% bf16 MFU | 125798 tok/s step 14507/19560 | loss 3.317596 (-0.31z)| norm 0.2776 (+0.29z)| lr 1.00e-04 | 4159.18 ms | 32.5% bf16 MFU | 125811 tok/s step 14508/19560 | loss 3.307217 (-0.62z)| norm 0.2744 (+0.02z)| lr 1.00e-04 | 4170.51 ms | 32.4% bf16 MFU | 125806 tok/s step 14509/19560 | loss 3.324123 (-0.11z)| norm 0.2775 (+0.28z)| lr 1.00e-04 | 4159.16 ms | 32.5% bf16 MFU | 125818 tok/s step 14510/19560 | loss 3.372996 (+1.32z)| norm 0.2683 (-0.50z)| lr 1.00e-04 | 4166.71 ms | 32.4% bf16 MFU | 125819 tok/s step 14511/19560 | loss 3.300987 (-0.79z)| norm 0.2532 (-1.75z)| lr 1.00e-04 | 4163.73 ms | 32.4% bf16 MFU | 125824 tok/s step 14512/19560 | loss 3.311865 (-0.46z)| norm 0.2794 (+0.47z)| lr 1.00e-04 | 4169.16 ms | 32.4% bf16 MFU | 125820 tok/s step 14513/19560 | loss 3.366333 (+1.16z)| norm 0.2699 (-0.34z)| lr 1.00e-04 | 4163.15 ms | 32.4% bf16 MFU | 125826 tok/s step 14514/19560 | loss 3.241118 (-2.52z)| norm 0.2692 (-0.40z)| lr 9.99e-05 | 4158.53 ms | 32.5% bf16 MFU | 125838 tok/s step 14515/19560 | loss 3.292138 (-1.00z)| norm 0.2602 (-1.15z)| lr 9.99e-05 | 4167.30 ms | 32.4% bf16 MFU | 125837 tok/s step 14516/19560 | 
loss 3.312658 (-0.39z)| norm 0.2598 (-1.17z)| lr 9.98e-05 | 4168.92 ms | 32.4% bf16 MFU | 125833 tok/s step 14517/19560 | loss 3.299797 (-0.76z)| norm 0.2546 (-1.59z)| lr 9.98e-05 | 4166.19 ms | 32.4% bf16 MFU | 125834 tok/s step 14518/19560 | loss 3.289062 (-1.08z)| norm 0.2785 (+0.44z)| lr 9.98e-05 | 4161.35 ms | 32.4% bf16 MFU | 125841 tok/s step 14519/19560 | loss 3.352953 (+0.80z)| norm 0.2565 (-1.42z)| lr 9.97e-05 | 4167.41 ms | 32.4% bf16 MFU | 125840 tok/s step 14520/19560 | loss 3.298080 (-0.81z)| norm 0.2638 (-0.80z)| lr 9.97e-05 | 4172.46 ms | 32.4% bf16 MFU | 125830 tok/s step 14521/19560 | loss 3.349570 (+0.69z)| norm 0.2866 (+1.12z)| lr 9.97e-05 | 4152.13 ms | 32.5% bf16 MFU | 125852 tok/s step 14522/19560 | loss 3.339628 (+0.45z)| norm 0.2619 (-0.95z)| lr 9.96e-05 | 4157.09 ms | 32.5% bf16 MFU | 125866 tok/s step 14523/19560 | loss 3.422189 (+2.90z)| norm 0.2748 (+0.16z)| lr 9.96e-05 | 4170.83 ms | 32.4% bf16 MFU | 125858 tok/s step 14524/19560 | loss 3.318502 (-0.23z)| norm 0.2798 (+0.59z)| lr 9.95e-05 | 4164.46 ms | 32.4% bf16 MFU | 125860 tok/s step 14525/19560 | loss 3.325798 (-0.03z)| norm 0.2643 (-0.73z)| lr 9.95e-05 | 4163.08 ms | 32.4% bf16 MFU | 125863 tok/s step 14526/19560 | loss 3.359067 (+0.98z)| norm 0.3012 (+2.41z)| lr 9.95e-05 | 4158.16 ms | 32.5% bf16 MFU | 125875 tok/s step 14527/19560 | loss 3.317329 (-0.29z)| norm 0.2676 (-0.43z)| lr 9.94e-05 | 4181.70 ms | 32.3% bf16 MFU | 125850 tok/s step 14528/19560 | loss 3.396244 (+2.09z)| norm 0.2824 (+0.85z)| lr 9.94e-05 | 4174.73 ms | 32.3% bf16 MFU | 125836 tok/s step 14529/19560 | loss 3.344211 (+0.51z)| norm 0.2808 (+0.72z)| lr 9.94e-05 | 4171.03 ms | 32.4% bf16 MFU | 125830 tok/s step 14530/19560 | loss 3.330126 (+0.08z)| norm 0.2806 (+0.69z)| lr 9.93e-05 | 4157.77 ms | 32.5% bf16 MFU | 125843 tok/s step 14531/19560 | loss 3.319005 (-0.25z)| norm 0.2849 (+1.06z)| lr 9.93e-05 | 4164.56 ms | 32.4% bf16 MFU | 125845 tok/s step 14532/19560 | loss 3.366449 (+1.18z)| norm 0.2721 (-0.05z)| 
lr 9.92e-05 | 4168.28 ms | 32.4% bf16 MFU | 125842 tok/s step 14533/19560 | loss 3.411183 (+2.45z)| norm 0.2752 (+0.22z)| lr 9.92e-05 | 4167.12 ms | 32.4% bf16 MFU | 125841 tok/s step 14534/19560 | loss 3.359931 (+0.93z)| norm 0.2884 (+1.39z)| lr 9.92e-05 | 4170.87 ms | 32.4% bf16 MFU | 125834 tok/s step 14535/19560 | loss 3.385062 (+1.63z)| norm 0.2906 (+1.55z)| lr 9.91e-05 | 4169.76 ms | 32.4% bf16 MFU | 125829 tok/s step 14536/19560 | loss 3.386747 (+1.65z)| norm 0.2750 (+0.19z)| lr 9.91e-05 | 4164.71 ms | 32.4% bf16 MFU | 125832 tok/s step 14537/19560 | loss 3.377420 (+1.36z)| norm 0.2735 (+0.07z)| lr 9.91e-05 | 4159.52 ms | 32.5% bf16 MFU | 125843 tok/s step 14538/19560 | loss 3.328993 (-0.02z)| norm 0.2897 (+1.47z)| lr 9.90e-05 | 4174.52 ms | 32.3% bf16 MFU | 125830 tok/s step 14539/19560 | loss 3.320921 (-0.26z)| norm 0.2757 (+0.26z)| lr 9.90e-05 | 4171.86 ms | 32.4% bf16 MFU | 125822 tok/s step 14540/19560 | loss 3.286794 (-1.23z)| norm 0.2956 (+1.96z)| lr 9.90e-05 | 4167.04 ms | 32.4% bf16 MFU | 125822 tok/s step 14541/19560 | loss 3.288161 (-1.17z)| norm 0.2965 (+2.00z)| lr 9.89e-05 | 4171.07 ms | 32.4% bf16 MFU | 125816 tok/s step 14542/19560 | loss 3.344669 (+0.45z)| norm 0.3072 (+2.79z)| lr 9.89e-05 | 4170.50 ms | 32.4% bf16 MFU | 125811 tok/s step 14543/19560 | loss 3.307895 (-0.61z)| norm 0.3022 (+2.33z)| lr 9.88e-05 | 4176.15 ms | 32.3% bf16 MFU | 125797 tok/s step 14544/19560 | loss 3.283673 (-1.29z)| norm 0.3041 (+2.40z)| lr 9.88e-05 | 4160.30 ms | 32.5% bf16 MFU | 125808 tok/s step 14545/19560 | loss 3.324166 (-0.13z)| norm 0.2897 (+1.25z)| lr 9.88e-05 | 4172.78 ms | 32.4% bf16 MFU | 125800 tok/s step 14546/19560 | loss 3.363595 (+0.99z)| norm 0.3071 (+2.59z)| lr 9.87e-05 | 4148.79 ms | 32.5% bf16 MFU | 125829 tok/s step 14547/19560 | loss 3.312795 (-0.45z)| norm 0.2918 (+1.39z)| lr 9.87e-05 | 4163.68 ms | 32.4% bf16 MFU | 125833 tok/s step 14548/19560 | loss 3.310991 (-0.50z)| norm 0.2869 (+1.00z)| lr 9.87e-05 | 4162.43 ms | 32.4% bf16 MFU | 
125840 tok/s step 14549/19560 | loss 3.340975 (+0.36z)| norm 0.2842 (+0.78z)| lr 9.86e-05 | 4168.60 ms | 32.4% bf16 MFU | 125836 tok/s step 14550/19560 | loss 3.318297 (-0.28z)| norm 0.3001 (+1.97z)| lr 9.86e-05 | 4163.03 ms | 32.4% bf16 MFU | 125841 tok/s step 14551/19560 | loss 3.321935 (-0.17z)| norm 0.2704 (-0.31z)| lr 9.85e-05 | 4163.78 ms | 32.4% bf16 MFU | 125845 tok/s step 14552/19560 | loss 3.290267 (-1.08z)| norm 0.2953 (+1.62z)| lr 9.85e-05 | 4166.13 ms | 32.4% bf16 MFU | 125845 tok/s step 14553/19560 | loss 3.312648 (-0.42z)| norm 0.2723 (-0.16z)| lr 9.85e-05 | 4171.05 ms | 32.4% bf16 MFU | 125838 tok/s step 14554/19560 | loss 3.329924 (+0.10z)| norm 0.2872 (+0.98z)| lr 9.84e-05 | 4165.80 ms | 32.4% bf16 MFU | 125838 tok/s step 14555/19560 | loss 3.323388 (-0.09z)| norm 0.2696 (-0.36z)| lr 9.84e-05 | 4159.27 ms | 32.5% bf16 MFU | 125849 tok/s step 14556/19560 | loss 3.308480 (-0.53z)| norm 0.2885 (+1.08z)| lr 9.84e-05 | 4163.08 ms | 32.4% bf16 MFU | 125854 tok/s step 14557/19560 | loss 3.371948 (+1.33z)| norm 0.2731 (-0.09z)| lr 9.83e-05 | 4158.31 ms | 32.5% bf16 MFU | 125865 tok/s step 14558/19560 | loss 3.408876 (+2.37z)| norm 0.2946 (+1.54z)| lr 9.83e-05 | 4176.67 ms | 32.3% bf16 MFU | 125848 tok/s step 14559/19560 | loss 3.329877 (+0.07z)| norm 0.2748 (+0.02z)| lr 9.82e-05 | 4165.49 ms | 32.4% bf16 MFU | 125849 tok/s step 14560/19560 | loss 3.371036 (+1.25z)| norm 0.2744 (-0.01z)| lr 9.82e-05 | 4169.87 ms | 32.4% bf16 MFU | 125843 tok/s step 14561/19560 | loss 3.335865 (+0.23z)| norm 0.2734 (-0.07z)| lr 9.82e-05 | 4180.51 ms | 32.3% bf16 MFU | 125822 tok/s step 14562/19560 | loss 3.334653 (+0.19z)| norm 0.2721 (-0.19z)| lr 9.81e-05 | 4170.24 ms | 32.4% bf16 MFU | 125817 tok/s step 14563/19560 | loss 3.334914 (+0.18z)| norm 0.2634 (-0.86z)| lr 9.81e-05 | 4200.44 ms | 32.1% bf16 MFU | 125767 tok/s step 14564/19560 | loss 3.334903 (+0.19z)| norm 0.2928 (+1.45z)| lr 9.81e-05 | 4157.89 ms | 32.5% bf16 MFU | 125783 tok/s step 14565/19560 | loss 3.321598 
(-0.21z)| norm 0.2680 (-0.52z)| lr 9.80e-05 | 4159.97 ms | 32.5% bf16 MFU | 125795 tok/s step 14566/19560 | loss 3.296214 (-0.94z)| norm 0.2736 (-0.07z)| lr 9.80e-05 | 4160.59 ms | 32.5% bf16 MFU | 125806 tok/s step 14567/19560 | loss 3.332296 (+0.11z)| norm 0.2790 (+0.35z)| lr 9.80e-05 | 4167.16 ms | 32.4% bf16 MFU | 125807 tok/s step 14568/19560 | loss 3.240286 (-2.49z)| norm 0.2666 (-0.63z)| lr 9.79e-05 | 4163.33 ms | 32.4% bf16 MFU | 125813 tok/s step 14569/19560 | loss 3.306363 (-0.60z)| norm 0.2611 (-1.06z)| lr 9.79e-05 | 4166.40 ms | 32.4% bf16 MFU | 125814 tok/s step 14570/19560 | loss 3.382989 (+1.58z)| norm 0.2742 (-0.02z)| lr 9.78e-05 | 4189.17 ms | 32.2% bf16 MFU | 125781 tok/s step 14571/19560 | loss 3.286133 (-1.21z)| norm 0.2878 (+1.04z)| lr 9.78e-05 | 4181.97 ms | 32.3% bf16 MFU | 125760 tok/s step 14572/19560 | loss 3.357758 (+0.84z)| norm 0.2676 (-0.54z)| lr 9.78e-05 | 4156.33 ms | 32.5% bf16 MFU | 125780 tok/s step 14573/19560 | loss 3.414358 (+2.39z)| norm 0.2956 (+1.64z)| lr 9.77e-05 | 4154.04 ms | 32.5% bf16 MFU | 125801 tok/s step 14574/19560 | loss 3.292117 (-1.06z)| norm 0.2649 (-0.79z)| lr 9.77e-05 | 4167.43 ms | 32.4% bf16 MFU | 125801 tok/s step 14575/19560 | loss 3.334366 (+0.14z)| norm 0.2679 (-0.55z)| lr 9.77e-05 | 4174.73 ms | 32.3% bf16 MFU | 125791 tok/s step 14576/19560 | loss 3.287280 (-1.18z)| norm 0.2736 (-0.11z)| lr 9.76e-05 | 4161.72 ms | 32.4% bf16 MFU | 125800 tok/s step 14577/19560 | loss 3.266935 (-1.73z)| norm 0.2437 (-2.43z)| lr 9.76e-05 | 4170.95 ms | 32.4% bf16 MFU | 125795 tok/s step 14578/19560 | loss 3.337894 (+0.26z)| norm 0.2691 (-0.46z)| lr 9.75e-05 | 4173.41 ms | 32.4% bf16 MFU | 125787 tok/s step 14579/19560 | loss 3.288463 (-1.15z)| norm 0.2610 (-1.09z)| lr 9.75e-05 | 4179.81 ms | 32.3% bf16 MFU | 125769 tok/s step 14580/19560 | loss 3.306338 (-0.64z)| norm 0.2579 (-1.32z)| lr 9.75e-05 | 4163.30 ms | 32.4% bf16 MFU | 125777 tok/s step 14581/19560 | loss 3.295226 (-0.96z)| norm 0.2551 (-1.52z)| lr 9.74e-05 | 
4178.12 ms | 32.3% bf16 MFU | 125762 tok/s step 14582/19560 | loss 3.319593 (-0.26z)| norm 0.2747 (-0.01z)| lr 9.74e-05 | 4168.93 ms | 32.4% bf16 MFU | 125762 tok/s step 14583/19560 | loss 3.443738 (+3.17z)| norm 0.2696 (-0.41z)| lr 9.74e-05 | 4161.32 ms | 32.4% bf16 MFU | 125774 tok/s step 14584/19560 | loss 3.297389 (-0.87z)| norm 0.2719 (-0.22z)| lr 9.73e-05 | 4166.01 ms | 32.4% bf16 MFU | 125777 tok/s step 14585/19560 | loss 3.288090 (-1.11z)| norm 0.2984 (+1.79z)| lr 9.73e-05 | 4194.31 ms | 32.2% bf16 MFU | 125739 tok/s step 14586/19560 | loss 3.351717 (+0.63z)| norm 0.2686 (-0.49z)| lr 9.73e-05 | 4175.51 ms | 32.3% bf16 MFU | 125730 tok/s step 14587/19560 | loss 3.262331 (-1.81z)| norm 0.3680 (+5.99z)| lr 9.72e-05 | 4174.72 ms | 32.3% bf16 MFU | 125723 tok/s step 14588/19560 | loss 3.281275 (-1.28z)| norm 0.2845 (+0.56z)| lr 9.72e-05 | 4161.09 ms | 32.4% bf16 MFU | 125736 tok/s step 14589/19560 | loss 3.319528 (-0.24z)| norm 0.2782 (+0.14z)| lr 9.71e-05 | 4159.41 ms | 32.5% bf16 MFU | 125752 tok/s step 14590/19560 | loss 3.334546 (+0.16z)| norm 0.2580 (-1.16z)| lr 9.71e-05 | 4173.65 ms | 32.3% bf16 MFU | 125745 tok/s step 14591/19560 | loss 3.332031 (+0.08z)| norm 0.2668 (-0.58z)| lr 9.71e-05 | 4174.38 ms | 32.3% bf16 MFU | 125738 tok/s step 14592/19560 | loss 3.339324 (+0.28z)| norm 0.2755 (-0.02z)| lr 9.70e-05 | 4163.04 ms | 32.4% bf16 MFU | 125748 tok/s step 14593/19560 | loss 3.309762 (-0.54z)| norm 0.2719 (-0.24z)| lr 9.70e-05 | 4165.50 ms | 32.4% bf16 MFU | 125754 tok/s step 14594/19560 | loss 3.310061 (-0.53z)| norm 0.2741 (-0.10z)| lr 9.70e-05 | 4159.44 ms | 32.5% bf16 MFU | 125768 tok/s step 14595/19560 | loss 3.230218 (-2.64z)| norm 0.2785 (+0.18z)| lr 9.69e-05 | 4182.37 ms | 32.3% bf16 MFU | 125748 tok/s step 14596/19560 | loss 3.332083 (+0.11z)| norm 0.2577 (-1.18z)| lr 9.69e-05 | 4166.06 ms | 32.4% bf16 MFU | 125753 tok/s step 14597/19560 | loss 3.369183 (+1.09z)| norm 0.2704 (-0.36z)| lr 9.68e-05 | 4161.31 ms | 32.4% bf16 MFU | 125765 tok/s step 
14598/19560 | loss 3.302752 (-0.69z)| norm 0.2672 (-0.57z)| lr 9.68e-05 | 4188.17 ms | 32.2% bf16 MFU | 125736 tok/s step 14599/19560 | loss 3.320273 (-0.23z)| norm 0.2600 (-1.03z)| lr 9.68e-05 | 4214.04 ms | 32.0% bf16 MFU | 125670 tok/s step 14600/19560 | loss 3.301330 (-0.74z)| norm 0.2507 (-1.62z)| lr 9.67e-05 | 4179.28 ms | 32.3% bf16 MFU | 125659 tok/s step 14601/19560 | loss 3.303535 (-0.67z)| norm 0.2716 (-0.24z)| lr 9.67e-05 | 4193.94 ms | 32.2% bf16 MFU | 125626 tok/s step 14602/19560 | loss 3.336584 (+0.21z)| norm 0.2592 (-1.06z)| lr 9.67e-05 | 4170.06 ms | 32.4% bf16 MFU | 125631 tok/s step 14603/19560 | loss 3.339626 (+0.29z)| norm 0.2771 (+0.13z)| lr 9.66e-05 | 4167.99 ms | 32.4% bf16 MFU | 125639 tok/s step 14604/19560 | loss 3.348854 (+0.53z)| norm 0.2774 (+0.15z)| lr 9.66e-05 | 4163.16 ms | 32.4% bf16 MFU | 125654 tok/s step 14605/19560 | loss 3.375100 (+1.22z)| norm 0.3022 (+1.77z)| lr 9.66e-05 | 4162.72 ms | 32.4% bf16 MFU | 125669 tok/s step 14606/19560 | loss 3.370362 (+1.08z)| norm 0.2996 (+1.57z)| lr 9.65e-05 | 4180.56 ms | 32.3% bf16 MFU | 125656 tok/s step 14607/19560 | loss 3.331309 (+0.04z)| norm 0.2776 (+0.12z)| lr 9.65e-05 | 4166.01 ms | 32.4% bf16 MFU | 125665 tok/s step 14608/19560 | loss 3.342629 (+0.33z)| norm 0.2843 (+0.56z)| lr 9.64e-05 | 4176.52 ms | 32.3% bf16 MFU | 125659 tok/s step 14609/19560 | loss 3.259211 (-1.87z)| norm 0.2663 (-0.64z)| lr 9.64e-05 | 4188.08 ms | 32.2% bf16 MFU | 125635 tok/s step 14610/19560 | loss 3.348419 (+0.49z)| norm 0.2800 (+0.26z)| lr 9.64e-05 | 4162.13 ms | 32.4% bf16 MFU | 125652 tok/s step 14611/19560 | loss 3.309420 (-0.53z)| norm 0.2681 (-0.52z)| lr 9.63e-05 | 4190.22 ms | 32.2% bf16 MFU | 125625 tok/s step 14612/19560 | loss 3.335278 (+0.16z)| norm 0.2709 (-0.33z)| lr 9.63e-05 | 4167.09 ms | 32.4% bf16 MFU | 125635 tok/s step 14613/19560 | loss 3.297172 (-0.85z)| norm 0.2656 (-0.68z)| lr 9.63e-05 | 4174.05 ms | 32.3% bf16 MFU | 125633 tok/s step 14614/19560 | loss 3.314050 (-0.39z)| norm 
0.2979 (+1.44z)| lr 9.62e-05 | 4160.13 ms | 32.5% bf16 MFU | 125653 tok/s step 14615/19560 | loss 3.350849 (+0.59z)| norm 0.2611 (-0.99z)| lr 9.62e-05 | 4173.46 ms | 32.4% bf16 MFU | 125652 tok/s step 14616/19560 | loss 3.422911 (+2.55z)| norm 0.2801 (+0.26z)| lr 9.61e-05 | 4193.77 ms | 32.2% bf16 MFU | 125620 tok/s step 14617/19560 | loss 3.364636 (+0.96z)| norm 0.2677 (-0.57z)| lr 9.61e-05 | 4169.71 ms | 32.4% bf16 MFU | 125626 tok/s step 14618/19560 | loss 3.329343 (-0.00z)| norm 0.2709 (-0.35z)| lr 9.61e-05 | 4173.57 ms | 32.4% bf16 MFU | 125625 tok/s step 14619/19560 | loss 3.384671 (+1.47z)| norm 0.2775 (+0.08z)| lr 9.60e-05 | 4178.49 ms | 32.3% bf16 MFU | 125618 tok/s step 14620/19560 | loss 3.308054 (-0.58z)| norm 0.2899 (+0.90z)| lr 9.60e-05 | 4181.04 ms | 32.3% bf16 MFU | 125607 tok/s step 14621/19560 | loss 3.217697 (-2.89z)| norm 0.2561 (-1.34z)| lr 9.60e-05 | 4153.16 ms | 32.5% bf16 MFU | 125638 tok/s step 14622/19560 | loss 3.313226 (-0.39z)| norm 0.2903 (+0.91z)| lr 9.59e-05 | 4166.40 ms | 32.4% bf16 MFU | 125648 tok/s step 14623/19560 | loss 3.378486 (+1.33z)| norm 0.2808 (+0.28z)| lr 9.59e-05 | 4158.38 ms | 32.5% bf16 MFU | 125670 tok/s step 14624/19560 | loss 3.343527 (+0.40z)| norm 0.2754 (-0.07z)| lr 9.59e-05 | 4181.38 ms | 32.3% bf16 MFU | 125656 tok/s step 14625/19560 | loss 3.301297 (-0.70z)| norm 0.2768 (+0.02z)| lr 9.58e-05 | 4179.18 ms | 32.3% bf16 MFU | 125646 tok/s step 14626/19560 | loss 3.303364 (-0.64z)| norm 0.2562 (-1.33z)| lr 9.58e-05 | 4187.72 ms | 32.2% bf16 MFU | 125623 tok/s step 14627/19560 | loss 3.386445 (+1.51z)| norm 0.2920 (+1.00z)| lr 9.57e-05 | 4186.98 ms | 32.2% bf16 MFU | 125603 tok/s step 14628/19560 | loss 3.310050 (-0.47z)| norm 0.2877 (+0.72z)| lr 9.57e-05 | 4162.53 ms | 32.4% bf16 MFU | 125620 tok/s step 14629/19560 | loss 3.407411 (+2.00z)| norm 0.2717 (-0.34z)| lr 9.57e-05 | 4161.02 ms | 32.4% bf16 MFU | 125639 tok/s step 14630/19560 | loss 3.370613 (+1.05z)| norm 0.2786 (+0.11z)| lr 9.56e-05 | 4155.36 ms | 
32.5% bf16 MFU | 125666 tok/s step 14631/19560 | loss 3.327776 (-0.04z)| norm 0.2711 (-0.38z)| lr 9.56e-05 | 4173.56 ms | 32.4% bf16 MFU | 125664 tok/s step 14632/19560 | loss 3.361740 (+0.81z)| norm 0.2889 (+0.78z)| lr 9.56e-05 | 4157.63 ms | 32.5% bf16 MFU | 125686 tok/s step 14633/19560 | loss 3.349673 (+0.51z)| norm 0.2664 (-0.69z)| lr 9.55e-05 | 4168.63 ms | 32.4% bf16 MFU | 125690 tok/s step 14634/19560 | loss 3.314108 (-0.39z)| norm 0.2816 (+0.31z)| lr 9.55e-05 | 4160.28 ms | 32.5% bf16 MFU | 125707 tok/s step 14635/19560 | loss 3.412292 (+2.05z)| norm 0.2933 (+1.06z)| lr 9.55e-05 | 4172.00 ms | 32.4% bf16 MFU | 125705 tok/s step 14636/19560 | loss 3.296330 (-0.85z)| norm 0.2834 (+0.41z)| lr 9.54e-05 | 4168.98 ms | 32.4% bf16 MFU | 125707 tok/s step 14637/19560 | loss 3.250789 (-1.94z)| norm 0.2870 (+0.64z)| lr 9.54e-05 | 4161.53 ms | 32.4% bf16 MFU | 125721 tok/s step 14638/19560 | loss 3.343538 (+0.35z)| norm 0.2646 (-0.82z)| lr 9.53e-05 | 4180.13 ms | 32.3% bf16 MFU | 125706 tok/s step 14639/19560 | loss 3.304757 (-0.61z)| norm 0.2835 (+0.40z)| lr 9.53e-05 | 4168.32 ms | 32.4% bf16 MFU | 125710 tok/s step 14640/19560 | loss 3.330094 (+0.01z)| norm 0.2683 (-0.59z)| lr 9.53e-05 | 4183.65 ms | 32.3% bf16 MFU | 125690 tok/s step 14641/19560 | loss 3.338732 (+0.23z)| norm 0.2877 (+0.67z)| lr 9.52e-05 | 4167.14 ms | 32.4% bf16 MFU | 125697 tok/s step 14642/19560 | loss 3.333441 (+0.08z)| norm 0.2636 (-0.90z)| lr 9.52e-05 | 4167.56 ms | 32.4% bf16 MFU | 125702 tok/s step 14643/19560 | loss 3.280397 (-1.25z)| norm 0.2749 (-0.17z)| lr 9.52e-05 | 4191.29 ms | 32.2% bf16 MFU | 125671 tok/s step 14644/19560 | loss 3.328033 (-0.05z)| norm 0.2754 (-0.15z)| lr 9.51e-05 | 4176.44 ms | 32.3% bf16 MFU | 125664 tok/s step 14645/19560 | loss 3.340314 (+0.25z)| norm 0.2732 (-0.31z)| lr 9.51e-05 | 4157.54 ms | 32.5% bf16 MFU | 125686 tok/s step 14646/19560 | loss 3.309947 (-0.52z)| norm 0.2621 (-1.03z)| lr 9.51e-05 | 4175.69 ms | 32.3% bf16 MFU | 125680 tok/s step 14647/19560 
| loss 3.369629 (+0.98z)| norm 0.6309 (+10.15z)| lr 9.50e-05 | 4186.65 ms | 32.2% bf16 MFU | 125657 tok/s step 14648/19560 | loss 3.345515 (+0.37z)| norm 0.2789 (-0.05z)| lr 9.50e-05 | 4186.84 ms | 32.2% bf16 MFU | 125636 tok/s step 14649/19560 | loss 3.267445 (-1.58z)| norm 0.2600 (-0.59z)| lr 9.49e-05 | 4182.26 ms | 32.3% bf16 MFU | 125622 tok/s step 14650/19560 | loss 3.322758 (-0.19z)| norm 0.2725 (-0.23z)| lr 9.49e-05 | 4181.99 ms | 32.3% bf16 MFU | 125609 tok/s step 14651/19560 | loss 3.383048 (+1.35z)| norm 0.2861 (+0.16z)| lr 9.49e-05 | 4167.14 ms | 32.4% bf16 MFU | 125620 tok/s step 14652/19560 | loss 3.310289 (-0.50z)| norm 0.2640 (-0.48z)| lr 9.48e-05 | 4188.92 ms | 32.2% bf16 MFU | 125597 tok/s step 14653/19560 | loss 3.293666 (-0.92z)| norm 0.2670 (-0.39z)| lr 9.48e-05 | 4173.02 ms | 32.4% bf16 MFU | 125599 tok/s step 14654/19560 | loss 3.372302 (+1.08z)| norm 0.2830 (+0.07z)| lr 9.48e-05 | 4171.83 ms | 32.4% bf16 MFU | 125602 tok/s step 14655/19560 | loss 3.311219 (-0.47z)| norm 0.2740 (-0.19z)| lr 9.47e-05 | 4163.86 ms | 32.4% bf16 MFU | 125618 tok/s step 14656/19560 | loss 3.300387 (-0.73z)| norm 0.2683 (-0.35z)| lr 9.47e-05 | 4189.59 ms | 32.2% bf16 MFU | 125594 tok/s step 14657/19560 | loss 3.356895 (+0.71z)| norm 0.2709 (-0.27z)| lr 9.47e-05 | 4173.35 ms | 32.4% bf16 MFU | 125596 tok/s step 14658/19560 | loss 3.297296 (-0.80z)| norm 0.2759 (-0.13z)| lr 9.46e-05 | 4165.07 ms | 32.4% bf16 MFU | 125610 tok/s step 14659/19560 | loss 3.349536 (+0.52z)| norm 0.2854 (+0.15z)| lr 9.46e-05 | 4166.54 ms | 32.4% bf16 MFU | 125621 tok/s step 14660/19560 | loss 3.324927 (-0.10z)| norm 0.2728 (-0.22z)| lr 9.45e-05 | 4176.03 ms | 32.3% bf16 MFU | 125617 tok/s step 14661/19560 | loss 3.361449 (+0.86z)| norm 0.2821 (+0.05z)| lr 9.45e-05 | 4180.56 ms | 32.3% bf16 MFU | 125607 tok/s step 14662/19560 | loss 3.271172 (-1.46z)| norm 0.2719 (-0.24z)| lr 9.45e-05 | 4172.35 ms | 32.4% bf16 MFU | 125610 tok/s step 14663/19560 | loss 3.289433 (-0.97z)| norm 0.2769 
(-0.09z)| lr 9.44e-05 | 4167.81 ms | 32.4% bf16 MFU | 125619 tok/s step 14664/19560 | loss 3.309914 (-0.43z)| norm 0.2800 (-0.00z)| lr 9.44e-05 | 4180.70 ms | 32.3% bf16 MFU | 125608 tok/s step 14665/19560 | loss 3.410659 (+2.17z)| norm 0.2819 (+0.05z)| lr 9.44e-05 | 4167.59 ms | 32.4% bf16 MFU | 125618 tok/s step 14666/19560 | loss 3.313042 (-0.35z)| norm 0.2595 (-0.59z)| lr 9.43e-05 | 4167.97 ms | 32.4% bf16 MFU | 125626 tok/s step 14667/19560 | loss 3.333446 (+0.18z)| norm 0.2734 (-0.19z)| lr 9.43e-05 | 4170.76 ms | 32.4% bf16 MFU | 125630 tok/s step 14668/19560 | loss 3.346187 (+0.49z)| norm 0.2823 (+0.07z)| lr 9.43e-05 | 4164.91 ms | 32.4% bf16 MFU | 125643 tok/s step 14669/19560 | loss 3.298051 (-0.75z)| norm 0.2648 (-0.43z)| lr 9.42e-05 | 4177.20 ms | 32.3% bf16 MFU | 125636 tok/s step 14670/19560 | loss 3.374779 (+1.22z)| norm 0.2825 (+0.09z)| lr 9.42e-05 | 4153.12 ms | 32.5% bf16 MFU | 125667 tok/s step 14671/19560 | loss 3.326676 (-0.02z)| norm 0.2841 (+0.14z)| lr 9.41e-05 | 4175.52 ms | 32.3% bf16 MFU | 125661 tok/s step 14672/19560 | loss 3.380865 (+1.35z)| norm 0.2706 (-0.24z)| lr 9.41e-05 | 4170.05 ms | 32.4% bf16 MFU | 125665 tok/s step 14673/19560 | loss 3.321270 (-0.18z)| norm 0.2748 (-0.12z)| lr 9.41e-05 | 4166.89 ms | 32.4% bf16 MFU | 125672 tok/s step 14674/19560 | loss 3.347688 (+0.50z)| norm 0.2756 (-0.09z)| lr 9.40e-05 | 4166.74 ms | 32.4% bf16 MFU | 125680 tok/s step 14675/19560 | loss 3.381773 (+1.36z)| norm 0.2779 (-0.02z)| lr 9.40e-05 | 4151.06 ms | 32.5% bf16 MFU | 125711 tok/s step 14676/19560 | loss 3.363015 (+0.87z)| norm 0.2787 (+0.01z)| lr 9.40e-05 | 4183.43 ms | 32.3% bf16 MFU | 125692 tok/s step 14677/19560 | loss 3.343946 (+0.38z)| norm 0.2597 (-0.54z)| lr 9.39e-05 | 4312.25 ms | 31.3% bf16 MFU | 125486 tok/s step 14678/19560 | loss 3.277236 (-1.31z)| norm 0.2615 (-0.48z)| lr 9.39e-05 | 4243.46 ms | 31.8% bf16 MFU | 125390 tok/s step 14679/19560 | loss 3.325651 (-0.08z)| norm 0.2761 (-0.06z)| lr 9.39e-05 | 4161.18 ms | 32.4% bf16 
MFU | 125420 tok/s step 14680/19560 | loss 3.297338 (-0.80z)| norm 0.2590 (-0.55z)| lr 9.38e-05 | 4185.95 ms | 32.3% bf16 MFU | 125411 tok/s step 14681/19560 | loss 3.355559 (+0.67z)| norm 0.2589 (-0.55z)| lr 9.38e-05 | 4174.80 ms | 32.3% bf16 MFU | 125420 tok/s step 14682/19560 | loss 3.306807 (-0.56z)| norm 0.2658 (-0.34z)| lr 9.37e-05 | 4177.35 ms | 32.3% bf16 MFU | 125424 tok/s step 14683/19560 | loss 3.378508 (+1.24z)| norm 0.2689 (-0.25z)| lr 9.37e-05 | 4164.06 ms | 32.4% bf16 MFU | 125449 tok/s step 14684/19560 | loss 3.329910 (+0.01z)| norm 0.2548 (-0.65z)| lr 9.37e-05 | 4173.01 ms | 32.4% bf16 MFU | 125458 tok/s step 14685/19560 | loss 3.350946 (+0.54z)| norm 0.2658 (-0.33z)| lr 9.36e-05 | 4159.29 ms | 32.5% bf16 MFU | 125488 tok/s step 14686/19560 | loss 3.333291 (+0.11z)| norm 0.2597 (-0.50z)| lr 9.36e-05 | 4152.00 ms | 32.5% bf16 MFU | 125527 tok/s step 14687/19560 | loss 3.315125 (-0.35z)| norm 0.2598 (-0.49z)| lr 9.36e-05 | 4181.79 ms | 32.3% bf16 MFU | 125519 tok/s step 14688/19560 | loss 3.312019 (-0.42z)| norm 0.2792 (+0.07z)| lr 9.35e-05 | 4167.47 ms | 32.4% bf16 MFU | 125534 tok/s step 14689/19560 | loss 3.324828 (-0.09z)| norm 0.2579 (-0.55z)| lr 9.35e-05 | 4179.27 ms | 32.3% bf16 MFU | 125530 tok/s step 14690/19560 | loss 3.304029 (-0.62z)| norm 0.2545 (-0.64z)| lr 9.35e-05 | 4162.14 ms | 32.4% bf16 MFU | 125551 tok/s step 14691/19560 | loss 3.401981 (+1.87z)| norm 0.2686 (-0.23z)| lr 9.34e-05 | 4167.50 ms | 32.4% bf16 MFU | 125564 tok/s step 14692/19560 | loss 3.347897 (+0.49z)| norm 0.2680 (-0.24z)| lr 9.34e-05 | 4164.59 ms | 32.4% bf16 MFU | 125580 tok/s step 14693/19560 | loss 3.339382 (+0.27z)| norm 0.2665 (-0.29z)| lr 9.33e-05 | 4182.01 ms | 32.3% bf16 MFU | 125570 tok/s step 14694/19560 | loss 3.379228 (+1.26z)| norm 0.2750 (-0.04z)| lr 9.33e-05 | 4168.76 ms | 32.4% bf16 MFU | 125580 tok/s step 14695/19560 | loss 3.366683 (+0.94z)| norm 0.2698 (-0.19z)| lr 9.33e-05 | 4174.79 ms | 32.3% bf16 MFU | 125580 tok/s step 14696/19560 | loss 
3.363174 (+0.84z)| norm 0.2612 (-0.44z)| lr 9.32e-05 | 4164.01 ms | 32.4% bf16 MFU | 125596 tok/s step 14697/19560 | loss 3.383513 (+1.34z)| norm 0.2917 (+0.44z)| lr 9.32e-05 | 4171.31 ms | 32.4% bf16 MFU | 125601 tok/s step 14698/19560 | loss 3.363310 (+0.83z)| norm 0.2734 (-0.09z)| lr 9.32e-05 | 4174.83 ms | 32.3% bf16 MFU | 125600 tok/s step 14699/19560 | loss 3.338851 (+0.19z)| norm 0.2803 (+0.11z)| lr 9.31e-05 | 4159.78 ms | 32.5% bf16 MFU | 125622 tok/s step 14700/19560 | loss 3.346754 (+0.40z)| norm 0.2885 (+0.35z)| lr 9.31e-05 | 4166.64 ms | 32.4% bf16 MFU | 125632 tok/s step 14701/19560 | loss 3.330257 (-0.01z)| norm 0.2793 (+0.08z)| lr 9.31e-05 | 4171.06 ms | 32.4% bf16 MFU | 125635 tok/s step 14702/19560 | loss 3.289901 (-1.07z)| norm 0.2784 (+0.05z)| lr 9.30e-05 | 4181.27 ms | 32.3% bf16 MFU | 125623 tok/s step 14703/19560 | loss 3.234771 (-2.44z)| norm 0.2846 (+0.23z)| lr 9.30e-05 | 4171.99 ms | 32.4% bf16 MFU | 125625 tok/s step 14704/19560 | loss 3.323539 (-0.17z)| norm 0.2866 (+0.29z)| lr 9.29e-05 | 4159.31 ms | 32.5% bf16 MFU | 125647 tok/s step 14705/19560 | loss 3.311344 (-0.50z)| norm 0.2683 (-0.25z)| lr 9.29e-05 | 4171.73 ms | 32.4% bf16 MFU | 125648 tok/s step 14706/19560 | loss 3.341849 (+0.29z)| norm 0.2709 (-0.18z)| lr 9.29e-05 | 4217.66 ms | 32.0% bf16 MFU | 125581 tok/s step 14707/19560 | loss 3.351352 (+0.53z)| norm 0.2695 (-0.22z)| lr 9.28e-05 | 4167.34 ms | 32.4% bf16 MFU | 125593 tok/s step 14708/19560 | loss 3.354931 (+0.61z)| norm 0.2797 (+0.07z)| lr 9.28e-05 | 4215.34 ms | 32.0% bf16 MFU | 125532 tok/s step 14709/19560 | loss 3.275545 (-1.45z)| norm 0.2551 (-0.65z)| lr 9.28e-05 | 4162.94 ms | 32.4% bf16 MFU | 125552 tok/s step 14710/19560 | loss 3.387125 (+1.43z)| norm 0.2719 (-0.16z)| lr 9.27e-05 | 4163.29 ms | 32.4% bf16 MFU | 125571 tok/s step 14711/19560 | loss 3.313813 (-0.45z)| norm 0.2473 (-0.87z)| lr 9.27e-05 | 4175.58 ms | 32.3% bf16 MFU | 125571 tok/s step 14712/19560 | loss 3.326881 (-0.11z)| norm 0.2494 (-0.80z)| lr 
9.27e-05 | 4215.18 ms | 32.0% bf16 MFU | 125511 tok/s step 14713/19560 | loss 3.275381 (-1.48z)| norm 0.2619 (-0.43z)| lr 9.26e-05 | 4206.16 ms | 32.1% bf16 MFU | 125468 tok/s step 14714/19560 | loss 3.327703 (-0.08z)| norm 0.2605 (-0.47z)| lr 9.26e-05 | 4176.06 ms | 32.3% bf16 MFU | 125472 tok/s step 14715/19560 | loss 3.301927 (-0.79z)| norm 0.2672 (-0.26z)| lr 9.25e-05 | 4168.02 ms | 32.4% bf16 MFU | 125488 tok/s step 14716/19560 | loss 3.388325 (+1.52z)| norm 0.2566 (-0.57z)| lr 9.25e-05 | 4158.61 ms | 32.5% bf16 MFU | 125517 tok/s step 14717/19560 | loss 3.370772 (+1.03z)| norm 0.2765 (+0.03z)| lr 9.25e-05 | 4172.84 ms | 32.4% bf16 MFU | 125523 tok/s step 14718/19560 | loss 3.320016 (-0.33z)| norm 0.2773 (+0.05z)| lr 9.24e-05 | 4174.08 ms | 32.3% bf16 MFU | 125527 tok/s step 14719/19560 | loss 3.334554 (+0.06z)| norm 0.2602 (-0.46z)| lr 9.24e-05 | 4163.55 ms | 32.4% bf16 MFU | 125547 tok/s step 14720/19560 | loss 3.350920 (+0.50z)| norm 0.2798 (+0.13z)| lr 9.24e-05 | 4165.10 ms | 32.4% bf16 MFU | 125564 tok/s step 14721/19560 | loss 3.327599 (-0.13z)| norm 0.2584 (-0.51z)| lr 9.23e-05 | 4182.05 ms | 32.3% bf16 MFU | 125554 tok/s step 14722/19560 | loss 3.362106 (+0.78z)| norm 0.2731 (-0.07z)| lr 9.23e-05 | 4171.60 ms | 32.4% bf16 MFU | 125560 tok/s step 14723/19560 | loss 3.317472 (-0.44z)| norm 0.2584 (-0.51z)| lr 9.23e-05 | 4176.78 ms | 32.3% bf16 MFU | 125558 tok/s step 14724/19560 | loss 3.333320 (-0.01z)| norm 0.2604 (-0.45z)| lr 9.22e-05 | 4161.89 ms | 32.4% bf16 MFU | 125579 tok/s step 14725/19560 | loss 3.348261 (+0.41z)| norm 0.2687 (-0.20z)| lr 9.22e-05 | 4178.87 ms | 32.3% bf16 MFU | 125573 tok/s step 14726/19560 | loss 3.266438 (-1.83z)| norm 0.2604 (-0.45z)| lr 9.22e-05 | 4156.51 ms | 32.5% bf16 MFU | 125601 tok/s step 14727/19560 | loss 3.274336 (-1.59z)| norm 0.2744 (-0.03z)| lr 9.21e-05 | 4164.69 ms | 32.4% bf16 MFU | 125616 tok/s step 14728/19560 | loss 3.331827 (-0.03z)| norm 0.2682 (-0.22z)| lr 9.21e-05 | 4168.83 ms | 32.4% bf16 MFU | 125623 
tok/s step 14729/19560 | loss 3.323422 (-0.26z)| norm 0.2667 (-0.26z)| lr 9.20e-05 | 4174.24 ms | 32.3% bf16 MFU | 125622 tok/s step 14730/19560 | loss 3.361229 (+0.76z)| norm 0.2621 (-0.40z)| lr 9.20e-05 | 4167.98 ms | 32.4% bf16 MFU | 125630 tok/s step 14731/19560 | loss 3.377122 (+1.18z)| norm 0.2644 (-0.33z)| lr 9.20e-05 | 4177.08 ms | 32.3% bf16 MFU | 125625 tok/s step 14732/19560 | loss 3.353730 (+0.54z)| norm 0.2671 (-0.25z)| lr 9.19e-05 | 4188.64 ms | 32.2% bf16 MFU | 125602 tok/s step 14733/19560 | loss 3.283209 (-1.34z)| norm 0.2814 (+0.19z)| lr 9.19e-05 | 4159.78 ms | 32.5% bf16 MFU | 125624 tok/s step 14734/19560 | loss 3.324539 (-0.22z)| norm 0.2621 (-0.39z)| lr 9.19e-05 | 4238.66 ms | 31.9% bf16 MFU | 125527 tok/s step 14735/19560 | loss 3.290334 (-1.13z)| norm 0.2613 (-0.41z)| lr 9.18e-05 | 4180.19 ms | 32.3% bf16 MFU | 125522 tok/s step 14736/19560 | loss 3.437093 (+2.73z)| norm 0.2966 (+0.65z)| lr 9.18e-05 | 4178.36 ms | 32.3% bf16 MFU | 125520 tok/s step 14737/19560 | loss 3.310214 (-0.62z)| norm 0.2557 (-0.57z)| lr 9.18e-05 | 4164.74 ms | 32.4% bf16 MFU | 125538 tok/s step 14738/19560 | loss 3.363925 (+0.81z)| norm 0.2829 (+0.24z)| lr 9.17e-05 | 4169.13 ms | 32.4% bf16 MFU | 125549 tok/s step 14739/19560 | loss 3.338076 (+0.12z)| norm 0.2923 (+0.52z)| lr 9.17e-05 | 4159.57 ms | 32.5% bf16 MFU | 125574 tok/s step 14740/19560 | loss 3.317294 (-0.43z)| norm 0.2896 (+0.43z)| lr 9.16e-05 | 4157.65 ms | 32.5% bf16 MFU | 125600 tok/s step 14741/19560 | loss 3.310693 (-0.61z)| norm 0.2974 (+0.66z)| lr 9.16e-05 | 4163.04 ms | 32.4% bf16 MFU | 125617 tok/s step 14742/19560 | loss 3.339600 (+0.15z)| norm 0.2810 (+0.17z)| lr 9.16e-05 | 4167.76 ms | 32.4% bf16 MFU | 125626 tok/s step 14743/19560 | loss 3.355966 (+0.59z)| norm 0.3034 (+0.83z)| lr 9.15e-05 | 4186.34 ms | 32.3% bf16 MFU | 125606 tok/s step 14744/19560 | loss 3.240633 (-2.44z)| norm 0.2626 (-0.39z)| lr 9.15e-05 | 4196.15 ms | 32.2% bf16 MFU | 125573 tok/s step 14745/19560 | loss 3.333598 
(+0.04z)| norm 0.2898 (+0.42z)| lr 9.15e-05 | 4167.81 ms | 32.4% bf16 MFU | 125584 tok/s step 14746/19560 | loss 3.301056 (-0.82z)| norm 0.2468 (-0.85z)| lr 9.14e-05 | 4163.69 ms | 32.4% bf16 MFU | 125601 tok/s step 14747/19560 | loss 3.286865 (-1.18z)| norm 0.2624 (-0.39z)| lr 9.14e-05 | 4181.06 ms | 32.3% bf16 MFU | 125591 tok/s step 14748/19560 | loss 3.340889 (+0.25z)| norm 0.2828 (+0.23z)| lr 9.14e-05 | 4182.66 ms | 32.3% bf16 MFU | 125579 tok/s step 14749/19560 | loss 3.270560 (-1.68z)| norm 0.2614 (-0.41z)| lr 9.13e-05 | 4158.79 ms | 32.5% bf16 MFU | 125603 tok/s step 14750/19560 | loss 3.341437 (+0.25z)| norm 0.2651 (-0.30z)| lr 9.13e-05 | 4151.29 ms | 32.5% bf16 MFU | 125638 tok/s val loss 3.297497 evaluating HellaSwag: 0/1256 evaluating HellaSwag: 10/1256 evaluating HellaSwag: 20/1256 evaluating HellaSwag: 30/1256 evaluating HellaSwag: 40/1256 evaluating HellaSwag: 50/1256 evaluating HellaSwag: 60/1256 evaluating HellaSwag: 70/1256 evaluating HellaSwag: 80/1256 evaluating HellaSwag: 90/1256 evaluating HellaSwag: 100/1256 evaluating HellaSwag: 110/1256 evaluating HellaSwag: 120/1256 evaluating HellaSwag: 130/1256 evaluating HellaSwag: 140/1256 evaluating HellaSwag: 150/1256 evaluating HellaSwag: 160/1256 evaluating HellaSwag: 170/1256 evaluating HellaSwag: 180/1256 evaluating HellaSwag: 190/1256 evaluating HellaSwag: 200/1256 evaluating HellaSwag: 210/1256 evaluating HellaSwag: 220/1256 evaluating HellaSwag: 230/1256 evaluating HellaSwag: 240/1256 evaluating HellaSwag: 250/1256 evaluating HellaSwag: 260/1256 evaluating HellaSwag: 270/1256 evaluating HellaSwag: 280/1256 evaluating HellaSwag: 290/1256 evaluating HellaSwag: 300/1256 evaluating HellaSwag: 310/1256 evaluating HellaSwag: 320/1256 evaluating HellaSwag: 330/1256 evaluating HellaSwag: 340/1256 evaluating HellaSwag: 350/1256 evaluating HellaSwag: 360/1256 evaluating HellaSwag: 370/1256 evaluating HellaSwag: 380/1256 evaluating HellaSwag: 390/1256 evaluating HellaSwag: 400/1256 evaluating HellaSwag: 
410/1256 evaluating HellaSwag: 420/1256 evaluating HellaSwag: 430/1256 evaluating HellaSwag: 440/1256 evaluating HellaSwag: 450/1256 evaluating HellaSwag: 460/1256 evaluating HellaSwag: 470/1256 evaluating HellaSwag: 480/1256 evaluating HellaSwag: 490/1256 evaluating HellaSwag: 500/1256 evaluating HellaSwag: 510/1256 evaluating HellaSwag: 520/1256 evaluating HellaSwag: 530/1256 evaluating HellaSwag: 540/1256 evaluating HellaSwag: 550/1256 evaluating HellaSwag: 560/1256 evaluating HellaSwag: 570/1256 evaluating HellaSwag: 580/1256 evaluating HellaSwag: 590/1256 evaluating HellaSwag: 600/1256 evaluating HellaSwag: 610/1256 evaluating HellaSwag: 620/1256 evaluating HellaSwag: 630/1256 evaluating HellaSwag: 640/1256 evaluating HellaSwag: 650/1256 evaluating HellaSwag: 660/1256 evaluating HellaSwag: 670/1256 evaluating HellaSwag: 680/1256 evaluating HellaSwag: 690/1256 evaluating HellaSwag: 700/1256 evaluating HellaSwag: 710/1256 evaluating HellaSwag: 720/1256 evaluating HellaSwag: 730/1256 evaluating HellaSwag: 740/1256 evaluating HellaSwag: 750/1256 evaluating HellaSwag: 760/1256 evaluating HellaSwag: 770/1256 evaluating HellaSwag: 780/1256 evaluating HellaSwag: 790/1256 evaluating HellaSwag: 800/1256 evaluating HellaSwag: 810/1256 evaluating HellaSwag: 820/1256 evaluating HellaSwag: 830/1256 evaluating HellaSwag: 840/1256 evaluating HellaSwag: 850/1256 evaluating HellaSwag: 860/1256 evaluating HellaSwag: 870/1256 evaluating HellaSwag: 880/1256 evaluating HellaSwag: 890/1256 evaluating HellaSwag: 900/1256 evaluating HellaSwag: 910/1256 evaluating HellaSwag: 920/1256 evaluating HellaSwag: 930/1256 evaluating HellaSwag: 940/1256 evaluating HellaSwag: 950/1256 evaluating HellaSwag: 960/1256 evaluating HellaSwag: 970/1256 evaluating HellaSwag: 980/1256 evaluating HellaSwag: 990/1256 evaluating HellaSwag: 1000/1256 evaluating HellaSwag: 1010/1256 evaluating HellaSwag: 1020/1256 evaluating HellaSwag: 1030/1256 evaluating HellaSwag: 1040/1256 evaluating HellaSwag: 1050/1256 
evaluating HellaSwag: 1060/1256 evaluating HellaSwag: 1070/1256 evaluating HellaSwag: 1080/1256 evaluating HellaSwag: 1090/1256 evaluating HellaSwag: 1100/1256 evaluating HellaSwag: 1110/1256 evaluating HellaSwag: 1120/1256 evaluating HellaSwag: 1130/1256 evaluating HellaSwag: 1140/1256 evaluating HellaSwag: 1150/1256 evaluating HellaSwag: 1160/1256 evaluating HellaSwag: 1170/1256 evaluating HellaSwag: 1180/1256 evaluating HellaSwag: 1190/1256 evaluating HellaSwag: 1200/1256 evaluating HellaSwag: 1210/1256 evaluating HellaSwag: 1220/1256 evaluating HellaSwag: 1230/1256 evaluating HellaSwag: 1240/1256 evaluating HellaSwag: 1250/1256 HellaSwag: 3002/10042 = 0.298944 step 14751/19560 | loss 3.335433 (+0.10z)| norm 0.2757 (+0.02z)| lr 9.13e-05 | 4174.73 ms | 32.3% bf16 MFU | 125635 tok/s step 14752/19560 | loss 3.298778 (-0.90z)| norm 0.2657 (-0.28z)| lr 9.12e-05 | 4171.43 ms | 32.4% bf16 MFU | 125638 tok/s step 14753/19560 | loss 3.386206 (+1.48z)| norm 0.2977 (+0.67z)| lr 9.12e-05 | 4244.65 ms | 31.8% bf16 MFU | 125532 tok/s step 14754/19560 | loss 3.360735 (+0.77z)| norm 0.2608 (-0.43z)| lr 9.11e-05 | 4183.47 ms | 32.3% bf16 MFU | 125521 tok/s step 14755/19560 | loss 3.263839 (-1.84z)| norm 0.2651 (-0.29z)| lr 9.11e-05 | 4181.53 ms | 32.3% bf16 MFU | 125514 tok/s step 14756/19560 | loss 3.294494 (-1.00z)| norm 0.2574 (-0.52z)| lr 9.11e-05 | 4222.38 ms | 32.0% bf16 MFU | 125447 tok/s step 14757/19560 | loss 3.305501 (-0.69z)| norm 0.2588 (-0.47z)| lr 9.10e-05 | 4152.60 ms | 32.5% bf16 MFU | 125487 tok/s step 14758/19560 | loss 3.356979 (+0.73z)| norm 0.2699 (-0.14z)| lr 9.10e-05 | 4199.56 ms | 32.2% bf16 MFU | 125455 tok/s step 14759/19560 | loss 3.351476 (+0.57z)| norm 0.2561 (-0.55z)| lr 9.10e-05 | 4174.40 ms | 32.3% bf16 MFU | 125462 tok/s step 14760/19560 | loss 3.298616 (-0.87z)| norm 0.2688 (-0.16z)| lr 9.09e-05 | 4160.61 ms | 32.5% bf16 MFU | 125490 tok/s step 14761/19560 | loss 3.325787 (-0.12z)| norm 0.2603 (-0.41z)| lr 9.09e-05 | 4206.53 ms | 32.1% bf16 MFU 
| 125447 tok/s step 14762/19560 | loss 3.281858 (-1.32z)| norm 0.2645 (-0.29z)| lr 9.09e-05 | 4158.05 ms | 32.5% bf16 MFU | 125479 tok/s step 14763/19560 | loss 3.351636 (+0.62z)| norm 0.2752 (+0.04z)| lr 9.08e-05 | 4157.99 ms | 32.5% bf16 MFU | 125510 tok/s step 14764/19560 | loss 3.323657 (-0.17z)| norm 0.2589 (-0.44z)| lr 9.08e-05 | 4167.55 ms | 32.4% bf16 MFU | 125524 tok/s step 14765/19560 | loss 3.338362 (+0.23z)| norm 0.2719 (-0.05z)| lr 9.07e-05 | 4176.41 ms | 32.3% bf16 MFU | 125525 tok/s step 14766/19560 | loss 3.394506 (+1.80z)| norm 0.2709 (-0.09z)| lr 9.07e-05 | 4167.75 ms | 32.4% bf16 MFU | 125539 tok/s step 14767/19560 | loss 3.320244 (-0.30z)| norm 0.2587 (-0.44z)| lr 9.07e-05 | 4211.77 ms | 32.1% bf16 MFU | 125486 tok/s step 14768/19560 | loss 3.316802 (-0.39z)| norm 0.2569 (-0.49z)| lr 9.06e-05 | 4177.71 ms | 32.3% bf16 MFU | 125486 tok/s step 14769/19560 | loss 3.312944 (-0.50z)| norm 0.2720 (-0.04z)| lr 9.06e-05 | 4183.31 ms | 32.3% bf16 MFU | 125478 tok/s step 14770/19560 | loss 3.354686 (+0.68z)| norm 0.2803 (+0.20z)| lr 9.06e-05 | 4155.05 ms | 32.5% bf16 MFU | 125514 tok/s step 14771/19560 | loss 3.289041 (-1.18z)| norm 0.2716 (-0.06z)| lr 9.05e-05 | 4159.08 ms | 32.5% bf16 MFU | 125541 tok/s step 14772/19560 | loss 3.373610 (+1.20z)| norm 0.2657 (-0.23z)| lr 9.05e-05 | 4162.39 ms | 32.4% bf16 MFU | 125562 tok/s step 14773/19560 | loss 3.398154 (+1.85z)| norm 0.2804 (+0.21z)| lr 9.05e-05 | 4163.09 ms | 32.4% bf16 MFU | 125580 tok/s step 14774/19560 | loss 3.379055 (+1.30z)| norm 0.2661 (-0.22z)| lr 9.04e-05 | 4157.33 ms | 32.5% bf16 MFU | 125607 tok/s step 14775/19560 | loss 3.293649 (-1.04z)| norm 0.2705 (-0.01z)| lr 9.04e-05 | 4185.23 ms | 32.3% bf16 MFU | 125590 tok/s step 14776/19560 | loss 3.287027 (-1.21z)| norm 0.2496 (-1.86z)| lr 9.04e-05 | 4153.29 ms | 32.5% bf16 MFU | 125622 tok/s step 14777/19560 | loss 3.389338 (+1.58z)| norm 0.2734 (+0.26z)| lr 9.03e-05 | 4164.77 ms | 32.4% bf16 MFU | 125636 tok/s step 14778/19560 | loss 3.378343 
(+1.26z)| norm 0.2780 (+0.67z)| lr 9.03e-05 | 4169.33 ms | 32.4% bf16 MFU | 125641 tok/s step 14779/19560 | loss 3.314694 (-0.47z)| norm 0.2667 (-0.33z)| lr 9.02e-05 | 4171.12 ms | 32.4% bf16 MFU | 125644 tok/s step 14780/19560 | loss 3.280966 (-1.39z)| norm 0.2777 (+0.65z)| lr 9.02e-05 | 4160.75 ms | 32.5% bf16 MFU | 125662 tok/s step 14781/19560 | loss 3.368362 (+0.99z)| norm 0.2888 (+1.62z)| lr 9.02e-05 | 4155.12 ms | 32.5% bf16 MFU | 125688 tok/s step 14782/19560 | loss 3.313788 (-0.49z)| norm 0.2797 (+0.81z)| lr 9.01e-05 | 4161.25 ms | 32.4% bf16 MFU | 125703 tok/s step 14783/19560 | loss 3.324343 (-0.21z)| norm 0.2868 (+1.43z)| lr 9.01e-05 | 4157.58 ms | 32.5% bf16 MFU | 125723 tok/s step 14784/19560 | loss 3.348919 (+0.46z)| norm 0.2947 (+2.08z)| lr 9.01e-05 | 4167.08 ms | 32.4% bf16 MFU | 125728 tok/s step 14785/19560 | loss 3.283421 (-1.32z)| norm 0.2793 (+0.72z)| lr 9.00e-05 | 4169.23 ms | 32.4% bf16 MFU | 125729 tok/s step 14786/19560 | loss 3.281365 (-1.37z)| norm 0.2811 (+0.88z)| lr 9.00e-05 | 4165.34 ms | 32.4% bf16 MFU | 125736 tok/s step 14787/19560 | loss 3.282381 (-1.32z)| norm 0.2790 (+0.70z)| lr 9.00e-05 | 4161.00 ms | 32.4% bf16 MFU | 125749 tok/s step 14788/19560 | loss 3.393058 (+1.65z)| norm 0.2836 (+1.09z)| lr 8.99e-05 | 4180.76 ms | 32.3% bf16 MFU | 125732 tok/s step 14789/19560 | loss 3.420176 (+2.32z)| norm 0.2798 (+0.77z)| lr 8.99e-05 | 4209.93 ms | 32.1% bf16 MFU | 125672 tok/s step 14790/19560 | loss 3.343680 (+0.29z)| norm 0.3115 (+3.34z)| lr 8.99e-05 | 4205.20 ms | 32.1% bf16 MFU | 125622 tok/s step 14791/19560 | loss 3.337229 (+0.11z)| norm 0.2873 (+1.31z)| lr 8.98e-05 | 4371.24 ms | 30.9% bf16 MFU | 125338 tok/s step 14792/19560 | loss 3.329478 (-0.10z)| norm 0.2937 (+1.82z)| lr 8.98e-05 | 4269.72 ms | 31.6% bf16 MFU | 125211 tok/s step 14793/19560 | loss 3.314855 (-0.48z)| norm 0.2840 (+1.02z)| lr 8.97e-05 | 4166.76 ms | 32.4% bf16 MFU | 125242 tok/s step 14794/19560 | loss 3.390141 (+1.54z)| norm 0.2826 (+0.89z)| lr 8.97e-05 | 
4159.18 ms | 32.5% bf16 MFU | 125283 tok/s step 14795/19560 | loss 3.349590 (+0.44z)| norm 0.2850 (+1.08z)| lr 8.97e-05 | 4166.08 ms | 32.4% bf16 MFU | 125311 tok/s step 14796/19560 | loss 3.376554 (+1.16z)| norm 0.2940 (+1.78z)| lr 8.96e-05 | 4175.33 ms | 32.3% bf16 MFU | 125324 tok/s step 14797/19560 | loss 3.307786 (-0.69z)| norm 0.2626 (-0.75z)| lr 8.96e-05 | 4171.88 ms | 32.4% bf16 MFU | 125341 tok/s step 14798/19560 | loss 3.297565 (-0.95z)| norm 0.2833 (+0.92z)| lr 8.96e-05 | 4161.84 ms | 32.4% bf16 MFU | 125373 tok/s step 14799/19560 | loss 3.269063 (-1.69z)| norm 0.2717 (-0.01z)| lr 8.95e-05 | 4194.90 ms | 32.2% bf16 MFU | 125353 tok/s step 14800/19560 | loss 3.314714 (-0.46z)| norm 0.2682 (-0.29z)| lr 8.95e-05 | 4155.18 ms | 32.5% bf16 MFU | 125394 tok/s step 14801/19560 | loss 3.261837 (-1.84z)| norm 0.2670 (-0.39z)| lr 8.95e-05 | 4160.28 ms | 32.5% bf16 MFU | 125426 tok/s step 14802/19560 | loss 3.294619 (-0.96z)| norm 0.2720 (+0.03z)| lr 8.94e-05 | 4205.20 ms | 32.1% bf16 MFU | 125388 tok/s step 14803/19560 | loss 3.257362 (-1.90z)| norm 0.2595 (-0.97z)| lr 8.94e-05 | 4157.83 ms | 32.5% bf16 MFU | 125424 tok/s step 14804/19560 | loss 3.283030 (-1.21z)| norm 0.2730 (+0.12z)| lr 8.94e-05 | 4168.38 ms | 32.4% bf16 MFU | 125441 tok/s step 14805/19560 | loss 3.315772 (-0.35z)| norm 0.2853 (+1.09z)| lr 8.93e-05 | 4161.78 ms | 32.4% bf16 MFU | 125468 tok/s step 14806/19560 | loss 3.336062 (+0.17z)| norm 0.2763 (+0.36z)| lr 8.93e-05 | 4159.15 ms | 32.5% bf16 MFU | 125498 tok/s step 14807/19560 | loss 3.274298 (-1.43z)| norm 0.2571 (-1.17z)| lr 8.93e-05 | 4159.87 ms | 32.5% bf16 MFU | 125524 tok/s step 14808/19560 | loss 3.321826 (-0.20z)| norm 0.2846 (+1.02z)| lr 8.92e-05 | 4161.76 ms | 32.4% bf16 MFU | 125547 tok/s step 14809/19560 | loss 3.265999 (-1.62z)| norm 0.2785 (+0.52z)| lr 8.92e-05 | 4162.44 ms | 32.4% bf16 MFU | 125568 tok/s step 14810/19560 | loss 3.231255 (-2.45z)| norm 0.2522 (-1.58z)| lr 8.91e-05 | 4165.96 ms | 32.4% bf16 MFU | 125582 tok/s step 
14811/19560 | loss 3.381586 (+1.35z)| norm 0.2846 (+1.00z)| lr 8.91e-05 | 4155.27 ms | 32.5% bf16 MFU | 125611 tok/s step 14812/19560 | loss 3.259923 (-1.69z)| norm 0.2570 (-1.20z)| lr 8.91e-05 | 4166.81 ms | 32.4% bf16 MFU | 125622 tok/s step 14813/19560 | loss 3.275537 (-1.28z)| norm 0.2559 (-1.28z)| lr 8.90e-05 | 4151.58 ms | 32.5% bf16 MFU | 125655 tok/s step 14814/19560 | loss 3.309149 (-0.44z)| norm 0.2574 (-1.16z)| lr 8.90e-05 | 4217.34 ms | 32.0% bf16 MFU | 125588 tok/s step 14815/19560 | loss 3.344190 (+0.43z)| norm 0.2589 (-1.03z)| lr 8.90e-05 | 4162.19 ms | 32.4% bf16 MFU | 125607 tok/s step 14816/19560 | loss 3.312304 (-0.37z)| norm 0.2589 (-1.02z)| lr 8.89e-05 | 4171.89 ms | 32.4% bf16 MFU | 125610 tok/s step 14817/19560 | loss 3.456871 (+3.08z)| norm 0.3146 (+3.22z)| lr 8.89e-05 | 4157.19 ms | 32.5% bf16 MFU | 125636 tok/s step 14818/19560 | loss 3.322588 (-0.13z)| norm 0.2794 (+0.53z)| lr 8.89e-05 | 4149.87 ms | 32.5% bf16 MFU | 125671 tok/s step 14819/19560 | loss 3.289913 (-0.90z)| norm 0.2700 (-0.19z)| lr 8.88e-05 | 4156.69 ms | 32.5% bf16 MFU | 125694 tok/s step 14820/19560 | loss 3.332066 (+0.12z)| norm 0.2677 (-0.36z)| lr 8.88e-05 | 4159.13 ms | 32.5% bf16 MFU | 125712 tok/s step 14821/19560 | loss 3.292679 (-0.82z)| norm 0.2738 (+0.09z)| lr 8.88e-05 | 4165.82 ms | 32.4% bf16 MFU | 125719 tok/s step 14822/19560 | loss 3.290531 (-0.86z)| norm 0.2571 (-1.16z)| lr 8.87e-05 | 4155.91 ms | 32.5% bf16 MFU | 125741 tok/s step 14823/19560 | loss 3.293057 (-0.79z)| norm 0.2611 (-0.86z)| lr 8.87e-05 | 4158.26 ms | 32.5% bf16 MFU | 125758 tok/s step 14824/19560 | loss 3.305790 (-0.47z)| norm 0.2727 (+0.02z)| lr 8.86e-05 | 4155.95 ms | 32.5% bf16 MFU | 125778 tok/s step 14825/19560 | loss 3.283509 (-1.00z)| norm 0.2617 (-0.80z)| lr 8.86e-05 | 4166.89 ms | 32.4% bf16 MFU | 125780 tok/s step 14826/19560 | loss 3.330353 (+0.15z)| norm 0.2648 (-0.56z)| lr 8.86e-05 | 4165.72 ms | 32.4% bf16 MFU | 125784 tok/s step 14827/19560 | loss 3.337203 (+0.32z)| norm 
0.2613 (-0.82z)| lr 8.85e-05 | 4153.79 ms | 32.5% bf16 MFU | 125806 tok/s step 14828/19560 | loss 3.334827 (+0.27z)| norm 0.2622 (-0.74z)| lr 8.85e-05 | 4160.82 ms | 32.4% bf16 MFU | 125816 tok/s step 14829/19560 | loss 3.297207 (-0.65z)| norm 0.2627 (-0.69z)| lr 8.85e-05 | 4171.77 ms | 32.4% bf16 MFU | 125809 tok/s step 14830/19560 | loss 3.314737 (-0.23z)| norm 0.2737 (+0.16z)| lr 8.84e-05 | 4162.70 ms | 32.4% bf16 MFU | 125816 tok/s step 14831/19560 | loss 3.365878 (+1.02z)| norm 0.3940 (+7.21z)| lr 8.84e-05 | 4150.65 ms | 32.5% bf16 MFU | 125841 tok/s step 14832/19560 | loss 3.351718 (+0.66z)| norm 0.3045 (+1.87z)| lr 8.84e-05 | 4155.86 ms | 32.5% bf16 MFU | 125856 tok/s step 14833/19560 | loss 3.319126 (-0.15z)| norm 0.2844 (+0.68z)| lr 8.83e-05 | 4162.37 ms | 32.4% bf16 MFU | 125862 tok/s step 14834/19560 | loss 3.251348 (-1.80z)| norm 0.3089 (+2.06z)| lr 8.83e-05 | 4159.32 ms | 32.5% bf16 MFU | 125871 tok/s step 14835/19560 | loss 3.319724 (-0.11z)| norm 0.2700 (-0.17z)| lr 8.83e-05 | 4158.75 ms | 32.5% bf16 MFU | 125881 tok/s step 14836/19560 | loss 3.262853 (-1.48z)| norm 0.2864 (+0.77z)| lr 8.82e-05 | 4163.73 ms | 32.4% bf16 MFU | 125883 tok/s step 14837/19560 | loss 3.274519 (-1.20z)| norm 0.2683 (-0.28z)| lr 8.82e-05 | 4349.40 ms | 31.0% bf16 MFU | 125616 tok/s step 14838/19560 | loss 3.257886 (-1.58z)| norm 0.2852 (+0.69z)| lr 8.82e-05 | 4156.39 ms | 32.5% bf16 MFU | 125642 tok/s step 14839/19560 | loss 3.237505 (-2.03z)| norm 0.2698 (-0.21z)| lr 8.81e-05 | 4160.69 ms | 32.5% bf16 MFU | 125660 tok/s step 14840/19560 | loss 3.275996 (-1.09z)| norm 0.2594 (-0.83z)| lr 8.81e-05 | 4158.86 ms | 32.5% bf16 MFU | 125681 tok/s step 14841/19560 | loss 3.321949 (+0.00z)| norm 0.2763 (+0.16z)| lr 8.80e-05 | 4166.60 ms | 32.4% bf16 MFU | 125688 tok/s step 14842/19560 | loss 3.248173 (-1.74z)| norm 0.2696 (-0.24z)| lr 8.80e-05 | 4884.71 ms | 27.6% bf16 MFU | 124770 tok/s step 14843/19560 | loss 3.287584 (-0.80z)| norm 0.2754 (+0.09z)| lr 8.80e-05 | 4161.99 ms | 
32.4% bf16 MFU | 124830 tok/s step 14844/19560 | loss 3.326470 (+0.14z)| norm 0.2593 (-0.85z)| lr 8.79e-05 | 4153.05 ms | 32.5% bf16 MFU | 124901 tok/s step 14845/19560 | loss 3.328161 (+0.19z)| norm 0.2532 (-1.19z)| lr 8.79e-05 | 4163.87 ms | 32.4% bf16 MFU | 124952 tok/s step 14846/19560 | loss 3.311121 (-0.22z)| norm 0.2673 (-0.36z)| lr 8.79e-05 | 4158.33 ms | 32.5% bf16 MFU | 125008 tok/s step 14847/19560 | loss 3.280533 (-0.94z)| norm 0.2814 (+0.45z)| lr 8.78e-05 | 4160.44 ms | 32.5% bf16 MFU | 125058 tok/s step 14848/19560 | loss 3.339746 (+0.48z)| norm 0.2545 (-1.11z)| lr 8.78e-05 | 4161.30 ms | 32.4% bf16 MFU | 125105 tok/s step 14849/19560 | loss 3.365788 (+1.09z)| norm 0.2740 (+0.02z)| lr 8.78e-05 | 4166.07 ms | 32.4% bf16 MFU | 125142 tok/s step 14850/19560 | loss 3.339619 (+0.47z)| norm 0.2729 (-0.04z)| lr 8.77e-05 | 4154.43 ms | 32.5% bf16 MFU | 125195 tok/s step 14851/19560 | loss 3.269310 (-1.20z)| norm 0.2674 (-0.37z)| lr 8.77e-05 | 4165.61 ms | 32.4% bf16 MFU | 125228 tok/s step 14852/19560 | loss 3.312485 (-0.16z)| norm 0.2483 (-1.47z)| lr 8.77e-05 | 4159.75 ms | 32.5% bf16 MFU | 125269 tok/s step 14853/19560 | loss 3.331621 (+0.30z)| norm 0.2892 (+0.90z)| lr 8.76e-05 | 4156.25 ms | 32.5% bf16 MFU | 125313 tok/s step 14854/19560 | loss 3.294270 (-0.61z)| norm 0.2590 (-0.85z)| lr 8.76e-05 | 4161.60 ms | 32.4% bf16 MFU | 125346 tok/s step 14855/19560 | loss 3.318269 (-0.04z)| norm 0.2673 (-0.37z)| lr 8.76e-05 | 4161.93 ms | 32.4% bf16 MFU | 125377 tok/s step 14856/19560 | loss 3.295496 (-0.58z)| norm 0.2742 (+0.03z)| lr 8.75e-05 | 4169.69 ms | 32.4% bf16 MFU | 125395 tok/s step 14857/19560 | loss 3.345294 (+0.62z)| norm 0.2434 (-1.73z)| lr 8.75e-05 | 4158.39 ms | 32.5% bf16 MFU | 125430 tok/s step 14858/19560 | loss 3.338262 (+0.45z)| norm 0.2823 (+0.49z)| lr 8.74e-05 | 4153.52 ms | 32.5% bf16 MFU | 125470 tok/s step 14859/19560 | loss 3.288394 (-0.74z)| norm 0.2599 (-0.79z)| lr 8.74e-05 | 4170.11 ms | 32.4% bf16 MFU | 125482 tok/s step 14860/19560 
| loss 3.300405 (-0.44z)| norm 0.2719 (-0.10z)| lr 8.74e-05 | 4147.80 ms | 32.6% bf16 MFU | 125528 tok/s step 14861/19560 | loss 3.284336 (-0.83z)| norm 0.2647 (-0.51z)| lr 8.73e-05 | 4157.00 ms | 32.5% bf16 MFU | 125558 tok/s step 14862/19560 | loss 3.299161 (-0.46z)| norm 0.2751 (+0.08z)| lr 8.73e-05 | 4156.49 ms | 32.5% bf16 MFU | 125587 tok/s step 14863/19560 | loss 3.270540 (-1.15z)| norm 0.2528 (-1.19z)| lr 8.73e-05 | 4163.20 ms | 32.4% bf16 MFU | 125604 tok/s step 14864/19560 | loss 3.260000 (-1.41z)| norm 0.2669 (-0.37z)| lr 8.72e-05 | 4170.63 ms | 32.4% bf16 MFU | 125610 tok/s step 14865/19560 | loss 3.288435 (-0.70z)| norm 0.2540 (-1.11z)| lr 8.72e-05 | 4156.42 ms | 32.5% bf16 MFU | 125636 tok/s step 14866/19560 | loss 3.263322 (-1.30z)| norm 0.2646 (-0.50z)| lr 8.72e-05 | 4167.43 ms | 32.4% bf16 MFU | 125645 tok/s step 14867/19560 | loss 3.322544 (+0.17z)| norm 0.2538 (-1.10z)| lr 8.71e-05 | 4156.52 ms | 32.5% bf16 MFU | 125669 tok/s step 14868/19560 | loss 3.346137 (+0.75z)| norm 0.2906 (+1.02z)| lr 8.71e-05 | 4160.37 ms | 32.5% bf16 MFU | 125687 tok/s step 14869/19560 | loss 3.381742 (+1.60z)| norm 0.2875 (+0.85z)| lr 8.71e-05 | 4164.08 ms | 32.4% bf16 MFU | 125698 tok/s step 14870/19560 | loss 3.387400 (+1.72z)| norm 0.2727 (-0.00z)| lr 8.70e-05 | 4166.67 ms | 32.4% bf16 MFU | 125704 tok/s step 14871/19560 | loss 3.297563 (-0.46z)| norm 0.2694 (-0.18z)| lr 8.70e-05 | 4161.64 ms | 32.4% bf16 MFU | 125718 tok/s step 14872/19560 | loss 3.338091 (+0.52z)| norm 0.2675 (-0.30z)| lr 8.70e-05 | 4158.11 ms | 32.5% bf16 MFU | 125737 tok/s step 14873/19560 | loss 3.317542 (+0.02z)| norm 0.2877 (+0.89z)| lr 8.69e-05 | 4159.37 ms | 32.5% bf16 MFU | 125752 tok/s step 14874/19560 | loss 3.353651 (+0.89z)| norm 0.2701 (-0.16z)| lr 8.69e-05 | 4153.45 ms | 32.5% bf16 MFU | 125776 tok/s step 14875/19560 | loss 3.286087 (-0.77z)| norm 0.2615 (-0.66z)| lr 8.68e-05 | 4158.50 ms | 32.5% bf16 MFU | 125791 tok/s step 14876/19560 | loss 3.277107 (-0.98z)| norm 0.2771 (+0.26z)| 
lr 8.68e-05 | 4161.42 ms | 32.4% bf16 MFU | 125801 tok/s step 14877/19560 | loss 3.290805 (-0.65z)| norm 0.2517 (-1.23z)| lr 8.68e-05 | 4170.15 ms | 32.4% bf16 MFU | 125797 tok/s step 14878/19560 | loss 3.290658 (-0.64z)| norm 0.2705 (-0.13z)| lr 8.67e-05 | 4169.49 ms | 32.4% bf16 MFU | 125794 tok/s step 14879/19560 | loss 3.292325 (-0.59z)| norm 0.2462 (-1.54z)| lr 8.67e-05 | 4163.04 ms | 32.4% bf16 MFU | 125802 tok/s step 14880/19560 | loss 3.451606 (+3.17z)| norm 0.2856 (+0.76z)| lr 8.67e-05 | 4161.17 ms | 32.4% bf16 MFU | 125811 tok/s step 14881/19560 | loss 3.302319 (-0.35z)| norm 0.2664 (-0.35z)| lr 8.66e-05 | 4156.05 ms | 32.5% bf16 MFU | 125828 tok/s step 14882/19560 | loss 3.383737 (+1.59z)| norm 0.2643 (-0.48z)| lr 8.66e-05 | 4158.44 ms | 32.5% bf16 MFU | 125841 tok/s step 14883/19560 | loss 3.305141 (-0.29z)| norm 0.2680 (-0.26z)| lr 8.66e-05 | 4176.83 ms | 32.3% bf16 MFU | 125825 tok/s step 14884/19560 | loss 3.257007 (-1.42z)| norm 0.2628 (-0.57z)| lr 8.65e-05 | 4168.60 ms | 32.4% bf16 MFU | 125822 tok/s step 14885/19560 | loss 3.237019 (-1.86z)| norm 0.2743 (+0.10z)| lr 8.65e-05 | 4162.79 ms | 32.4% bf16 MFU | 125828 tok/s step 14886/19560 | loss 3.266881 (-1.14z)| norm 0.2487 (-1.39z)| lr 8.65e-05 | 4162.04 ms | 32.4% bf16 MFU | 125835 tok/s step 14887/19560 | loss 3.284468 (-0.72z)| norm 0.2802 (+0.45z)| lr 8.64e-05 | 4154.97 ms | 32.5% bf16 MFU | 125853 tok/s step 14888/19560 | loss 3.273761 (-0.96z)| norm 0.2631 (-0.55z)| lr 8.64e-05 | 4157.47 ms | 32.5% bf16 MFU | 125866 tok/s step 14889/19560 | loss 3.243660 (-1.63z)| norm 0.2626 (-0.59z)| lr 8.64e-05 | 4159.81 ms | 32.5% bf16 MFU | 125874 tok/s step 14890/19560 | loss 3.300103 (-0.33z)| norm 0.2577 (-0.87z)| lr 8.63e-05 | 4154.30 ms | 32.5% bf16 MFU | 125891 tok/s step 14891/19560 | loss 3.303100 (-0.26z)| norm 0.2602 (-0.71z)| lr 8.63e-05 | 4155.11 ms | 32.5% bf16 MFU | 125905 tok/s step 14892/19560 | loss 3.287040 (-0.62z)| norm 0.2685 (-0.23z)| lr 8.62e-05 | 4165.21 ms | 32.4% bf16 MFU | 
125903 tok/s step 14893/19560 | loss 3.317037 (+0.08z)| norm 0.2517 (-1.20z)| lr 8.62e-05 | 4163.83 ms | 32.4% bf16 MFU | 125904 tok/s step 14894/19560 | loss 3.381351 (+1.58z)| norm 0.2722 (-0.00z)| lr 8.62e-05 | 4159.77 ms | 32.5% bf16 MFU | 125911 tok/s step 14895/19560 | loss 3.294327 (-0.44z)| norm 0.2599 (-0.72z)| lr 8.61e-05 | 4171.38 ms | 32.4% bf16 MFU | 125899 tok/s step 14896/19560 | loss 3.253686 (-1.37z)| norm 0.2738 (+0.08z)| lr 8.61e-05 | 4147.10 ms | 32.6% bf16 MFU | 125926 tok/s step 14897/19560 | loss 3.261272 (-1.18z)| norm 0.2764 (+0.23z)| lr 8.61e-05 | 4172.61 ms | 32.4% bf16 MFU | 125912 tok/s step 14898/19560 | loss 3.267755 (-1.01z)| norm 0.2616 (-0.63z)| lr 8.60e-05 | 4160.42 ms | 32.5% bf16 MFU | 125917 tok/s step 14899/19560 | loss 3.312431 (+0.01z)| norm 0.2687 (-0.21z)| lr 8.60e-05 | 4154.23 ms | 32.5% bf16 MFU | 125932 tok/s step 14900/19560 | loss 3.351942 (+0.93z)| norm 0.2654 (-0.41z)| lr 8.60e-05 | 4164.80 ms | 32.4% bf16 MFU | 125929 tok/s step 14901/19560 | loss 3.308929 (-0.05z)| norm 0.2834 (+0.65z)| lr 8.59e-05 | 4155.79 ms | 32.5% bf16 MFU | 125941 tok/s step 14902/19560 | loss 3.303399 (-0.17z)| norm 0.2733 (+0.05z)| lr 8.59e-05 | 4164.10 ms | 32.4% bf16 MFU | 125939 tok/s step 14903/19560 | loss 3.288952 (-0.51z)| norm 0.2736 (+0.07z)| lr 8.59e-05 | 4161.08 ms | 32.4% bf16 MFU | 125942 tok/s step 14904/19560 | loss 3.347464 (+0.86z)| norm 0.2883 (+0.92z)| lr 8.58e-05 | 4166.34 ms | 32.4% bf16 MFU | 125937 tok/s step 14905/19560 | loss 3.270377 (-0.95z)| norm 0.2718 (-0.05z)| lr 8.58e-05 | 4175.06 ms | 32.3% bf16 MFU | 125919 tok/s step 14906/19560 | loss 3.358998 (+1.18z)| norm 0.2884 (+0.92z)| lr 8.58e-05 | 4151.69 ms | 32.5% bf16 MFU | 125937 tok/s step 14907/19560 | loss 3.244609 (-1.54z)| norm 0.2574 (-0.89z)| lr 8.57e-05 | 4152.90 ms | 32.5% bf16 MFU | 125952 tok/s step 14908/19560 | loss 3.328813 (+0.45z)| norm 0.2869 (+0.82z)| lr 8.57e-05 | 4160.48 ms | 32.5% bf16 MFU | 125956 tok/s step 14909/19560 | loss 3.277742 
(-0.75z)| norm 0.2650 (-0.45z)| lr 8.57e-05 | 4152.21 ms | 32.5% bf16 MFU | 125971 tok/s step 14910/19560 | loss 3.359752 (+1.20z)| norm 0.2985 (+1.50z)| lr 8.56e-05 | 4157.86 ms | 32.5% bf16 MFU | 125977 tok/s step 14911/19560 | loss 3.338871 (+0.70z)| norm 0.2635 (-0.52z)| lr 8.56e-05 | 4164.53 ms | 32.4% bf16 MFU | 125973 tok/s step 14912/19560 | loss 3.270463 (-0.91z)| norm 0.2749 (+0.14z)| lr 8.55e-05 | 4152.73 ms | 32.5% bf16 MFU | 125987 tok/s step 14913/19560 | loss 3.317244 (+0.19z)| norm 0.2635 (-0.51z)| lr 8.55e-05 | 4152.04 ms | 32.5% bf16 MFU | 126001 tok/s step 14914/19560 | loss 3.302201 (-0.17z)| norm 0.2566 (-0.90z)| lr 8.55e-05 | 4165.34 ms | 32.4% bf16 MFU | 125995 tok/s step 14915/19560 | loss 3.312176 (+0.06z)| norm 0.2660 (-0.35z)| lr 8.54e-05 | 4154.51 ms | 32.5% bf16 MFU | 126005 tok/s step 14916/19560 | loss 3.311251 (+0.06z)| norm 0.2601 (-0.68z)| lr 8.54e-05 | 4152.75 ms | 32.5% bf16 MFU | 126017 tok/s step 14917/19560 | loss 3.270721 (-0.92z)| norm 0.2613 (-0.60z)| lr 8.54e-05 | 4155.72 ms | 32.5% bf16 MFU | 126024 tok/s step 14918/19560 | loss 3.248596 (-1.45z)| norm 0.2643 (-0.42z)| lr 8.53e-05 | 4160.33 ms | 32.5% bf16 MFU | 126024 tok/s step 14919/19560 | loss 3.290437 (-0.40z)| norm 0.2654 (-0.34z)| lr 8.53e-05 | 4166.55 ms | 32.4% bf16 MFU | 126015 tok/s step 14920/19560 | loss 3.292695 (-0.34z)| norm 0.2878 (+1.00z)| lr 8.53e-05 | 4152.11 ms | 32.5% bf16 MFU | 126027 tok/s step 14921/19560 | loss 3.372819 (+1.62z)| norm 0.2828 (+0.70z)| lr 8.52e-05 | 4153.00 ms | 32.5% bf16 MFU | 126038 tok/s step 14922/19560 | loss 3.227762 (-1.92z)| norm 0.2689 (-0.12z)| lr 8.52e-05 | 4149.02 ms | 32.5% bf16 MFU | 126054 tok/s step 14923/19560 | loss 3.328927 (+0.58z)| norm 0.2685 (-0.14z)| lr 8.52e-05 | 4165.25 ms | 32.4% bf16 MFU | 126045 tok/s step 14924/19560 | loss 3.281274 (-0.58z)| norm 0.2556 (-0.90z)| lr 8.51e-05 | 4158.15 ms | 32.5% bf16 MFU | 126047 tok/s step 14925/19560 | loss 3.294606 (-0.25z)| norm 0.2886 (+1.08z)| lr 8.51e-05 | 
4155.82 ms | 32.5% bf16 MFU | 126053 tok/s step 14926/19560 | loss 3.262310 (-1.04z)| norm 0.2642 (-0.39z)| lr 8.51e-05 | 4147.11 ms | 32.6% bf16 MFU | 126071 tok/s step 14927/19560 | loss 3.300815 (-0.09z)| norm 0.2507 (-1.18z)| lr 8.50e-05 | 4162.80 ms | 32.4% bf16 MFU | 126065 tok/s step 14928/19560 | loss 3.291907 (-0.31z)| norm 0.2550 (-0.92z)| lr 8.50e-05 | 4156.41 ms | 32.5% bf16 MFU | 126069 tok/s step 14929/19560 | loss 3.319877 (+0.38z)| norm 0.2682 (-0.13z)| lr 8.50e-05 | 4153.32 ms | 32.5% bf16 MFU | 126077 tok/s step 14930/19560 | loss 3.241186 (-1.57z)| norm 0.2506 (-1.16z)| lr 8.49e-05 | 4156.26 ms | 32.5% bf16 MFU | 126080 tok/s step 14931/19560 | loss 3.302643 (-0.05z)| norm 0.2528 (-1.03z)| lr 8.49e-05 | 4155.87 ms | 32.5% bf16 MFU | 126084 tok/s step 14932/19560 | loss 3.262730 (-1.04z)| norm 0.2499 (-1.18z)| lr 8.49e-05 | 4156.69 ms | 32.5% bf16 MFU | 126087 tok/s step 14933/19560 | loss 3.313680 (+0.23z)| norm 0.2667 (-0.18z)| lr 8.48e-05 | 4146.96 ms | 32.6% bf16 MFU | 126104 tok/s step 14934/19560 | loss 3.239407 (-1.59z)| norm 0.2543 (-0.90z)| lr 8.48e-05 | 4157.20 ms | 32.5% bf16 MFU | 126104 tok/s step 14935/19560 | loss 3.305024 (+0.02z)| norm 0.2507 (-1.11z)| lr 8.47e-05 | 4159.55 ms | 32.5% bf16 MFU | 126101 tok/s step 14936/19560 | loss 3.355702 (+1.26z)| norm 0.2839 (+0.85z)| lr 8.47e-05 | 4161.98 ms | 32.4% bf16 MFU | 126095 tok/s step 14937/19560 | loss 3.334471 (+0.73z)| norm 0.2781 (+0.51z)| lr 8.47e-05 | 4156.48 ms | 32.5% bf16 MFU | 126097 tok/s step 14938/19560 | loss 3.360238 (+1.35z)| norm 0.2875 (+1.04z)| lr 8.46e-05 | 4162.30 ms | 32.4% bf16 MFU | 126090 tok/s step 14939/19560 | loss 3.336105 (+0.77z)| norm 0.2696 (-0.01z)| lr 8.46e-05 | 4149.22 ms | 32.5% bf16 MFU | 126103 tok/s step 14940/19560 | loss 3.309148 (+0.08z)| norm 0.2873 (+1.03z)| lr 8.46e-05 | 4167.84 ms | 32.4% bf16 MFU | 126088 tok/s step 14941/19560 | loss 3.353821 (+1.19z)| norm 0.3472 (+4.20z)| lr 8.45e-05 | 4156.46 ms | 32.5% bf16 MFU | 126090 tok/s step 
14942/19560 | loss 3.266762 (-0.98z)| norm 0.2679 (-0.15z)| lr 8.45e-05 | 4156.49 ms | 32.5% bf16 MFU | 126093 tok/s step 14943/19560 | loss 3.300791 (-0.13z)| norm 0.2973 (+1.43z)| lr 8.45e-05 | 4698.77 ms | 28.7% bf16 MFU | 125367 tok/s step 14944/19560 | loss 3.457645 (+3.59z)| norm 0.2957 (+1.32z)| lr 8.44e-05 | 4156.25 ms | 32.5% bf16 MFU | 125406 tok/s step 14945/19560 | loss 3.315995 (+0.25z)| norm 0.2847 (+0.75z)| lr 8.44e-05 | 4157.14 ms | 32.5% bf16 MFU | 125442 tok/s step 14946/19560 | loss 3.314720 (+0.22z)| norm 0.2645 (-0.36z)| lr 8.44e-05 | 4155.15 ms | 32.5% bf16 MFU | 125478 tok/s step 14947/19560 | loss 3.273870 (-0.80z)| norm 0.2648 (-0.34z)| lr 8.43e-05 | 4162.87 ms | 32.4% bf16 MFU | 125502 tok/s step 14948/19560 | loss 3.327744 (+0.55z)| norm 0.2937 (+1.24z)| lr 8.43e-05 | 4166.61 ms | 32.4% bf16 MFU | 125518 tok/s step 14949/19560 | loss 3.313949 (+0.20z)| norm 0.2581 (-0.71z)| lr 8.43e-05 | 4166.14 ms | 32.4% bf16 MFU | 125534 tok/s step 14950/19560 | loss 3.314519 (+0.21z)| norm 0.2759 (+0.26z)| lr 8.42e-05 | 4156.37 ms | 32.5% bf16 MFU | 125565 tok/s step 14951/19560 | loss 3.269872 (-0.90z)| norm 0.2731 (+0.10z)| lr 8.42e-05 | 4165.24 ms | 32.4% bf16 MFU | 125580 tok/s step 14952/19560 | loss 3.319063 (+0.33z)| norm 0.2620 (-0.50z)| lr 8.42e-05 | 4161.06 ms | 32.4% bf16 MFU | 125601 tok/s step 14953/19560 | loss 3.317322 (+0.28z)| norm 0.2675 (-0.20z)| lr 8.41e-05 | 4154.15 ms | 32.5% bf16 MFU | 125631 tok/s step 14954/19560 | loss 3.301889 (-0.10z)| norm 0.2763 (+0.27z)| lr 8.41e-05 | 4159.21 ms | 32.5% bf16 MFU | 125653 tok/s step 14955/19560 | loss 3.334549 (+0.72z)| norm 0.2519 (-1.07z)| lr 8.41e-05 | 4156.68 ms | 32.5% bf16 MFU | 125677 tok/s step 14956/19560 | loss 3.291553 (-0.35z)| norm 0.2659 (-0.29z)| lr 8.40e-05 | 4161.34 ms | 32.4% bf16 MFU | 125692 tok/s step 14957/19560 | loss 3.246597 (-1.46z)| norm 0.2527 (-1.01z)| lr 8.40e-05 | 4159.70 ms | 32.5% bf16 MFU | 125710 tok/s step 14958/19560 | loss 3.301354 (-0.09z)| norm 
0.2562 (-0.81z)| lr 8.39e-05 | 4150.61 ms | 32.5% bf16 MFU | 125740 tok/s step 14959/19560 | loss 3.325330 (+0.52z)| norm 0.2680 (-0.14z)| lr 8.39e-05 | 4162.67 ms | 32.4% bf16 MFU | 125750 tok/s step 14960/19560 | loss 3.311898 (+0.19z)| norm 0.2553 (-1.00z)| lr 8.39e-05 | 4166.05 ms | 32.4% bf16 MFU | 125755 tok/s step 14961/19560 | loss 3.278008 (-0.66z)| norm 0.2777 (+0.56z)| lr 8.38e-05 | 4160.78 ms | 32.4% bf16 MFU | 125768 tok/s step 14962/19560 | loss 3.273770 (-0.77z)| norm 0.2582 (-0.79z)| lr 8.38e-05 | 4153.60 ms | 32.5% bf16 MFU | 125791 tok/s step 14963/19560 | loss 3.306162 (+0.05z)| norm 0.2572 (-0.85z)| lr 8.38e-05 | 4156.68 ms | 32.5% bf16 MFU | 125808 tok/s step 14964/19560 | loss 3.218470 (-2.13z)| norm 0.2655 (-0.25z)| lr 8.37e-05 | 4163.06 ms | 32.4% bf16 MFU | 125814 tok/s step 14965/19560 | loss 3.283578 (-0.51z)| norm 0.2662 (-0.20z)| lr 8.37e-05 | 4163.87 ms | 32.4% bf16 MFU | 125819 tok/s step 14966/19560 | loss 3.303289 (-0.02z)| norm 0.2678 (-0.07z)| lr 8.37e-05 | 4153.15 ms | 32.5% bf16 MFU | 125840 tok/s step 14967/19560 | loss 3.235561 (-1.73z)| norm 0.2612 (-0.54z)| lr 8.36e-05 | 4155.38 ms | 32.5% bf16 MFU | 125857 tok/s step 14968/19560 | loss 3.281146 (-0.58z)| norm 0.2710 (+0.16z)| lr 8.36e-05 | 4209.95 ms | 32.1% bf16 MFU | 125791 tok/s step 14969/19560 | loss 3.287768 (-0.41z)| norm 0.2614 (-0.53z)| lr 8.36e-05 | 4149.21 ms | 32.5% bf16 MFU | 125819 tok/s step 14970/19560 | loss 3.316683 (+0.31z)| norm 0.2530 (-1.12z)| lr 8.35e-05 | 4158.56 ms | 32.5% bf16 MFU | 125832 tok/s step 14971/19560 | loss 3.330396 (+0.65z)| norm 0.2704 (+0.14z)| lr 8.35e-05 | 4167.29 ms | 32.4% bf16 MFU | 125831 tok/s step 14972/19560 | loss 3.313467 (+0.22z)| norm 0.2651 (-0.25z)| lr 8.35e-05 | 4153.41 ms | 32.5% bf16 MFU | 125851 tok/s step 14973/19560 | loss 3.280273 (-0.61z)| norm 0.2535 (-1.09z)| lr 8.34e-05 | 4153.86 ms | 32.5% bf16 MFU | 125869 tok/s step 14974/19560 | loss 3.304335 (-0.00z)| norm 0.2612 (-0.53z)| lr 8.34e-05 | 4157.31 ms | 
32.5% bf16 MFU | 125881 tok/s step 14975/19560 | loss 3.324475 (+0.50z)| norm 0.2724 (+0.28z)| lr 8.34e-05 | 4152.26 ms | 32.5% bf16 MFU | 125900 tok/s step 14976/19560 | loss 3.395351 (+2.25z)| norm 0.2672 (-0.10z)| lr 8.33e-05 | 4159.13 ms | 32.5% bf16 MFU | 125908 tok/s step 14977/19560 | loss 3.289335 (-0.38z)| norm 0.2547 (-0.99z)| lr 8.33e-05 | 4161.56 ms | 32.4% bf16 MFU | 125912 tok/s step 14978/19560 | loss 3.336057 (+0.79z)| norm 0.2632 (-0.37z)| lr 8.33e-05 | 4160.81 ms | 32.4% bf16 MFU | 125917 tok/s step 14979/19560 | loss 3.324682 (+0.50z)| norm 0.2590 (-0.67z)| lr 8.32e-05 | 4185.77 ms | 32.3% bf16 MFU | 125884 tok/s step 14980/19560 | loss 3.329089 (+0.60z)| norm 0.2772 (+0.63z)| lr 8.32e-05 | 4396.08 ms | 30.7% bf16 MFU | 125553 tok/s step 14981/19560 | loss 3.282997 (-0.55z)| norm 0.2573 (-0.80z)| lr 8.32e-05 | 4441.63 ms | 30.4% bf16 MFU | 125177 tok/s step 14982/19560 | loss 3.293674 (-0.28z)| norm 0.2518 (-1.19z)| lr 8.31e-05 | 4212.48 ms | 32.1% bf16 MFU | 125141 tok/s step 14983/19560 | loss 3.309996 (+0.14z)| norm 0.2862 (+1.29z)| lr 8.31e-05 | 4280.37 ms | 31.5% bf16 MFU | 125008 tok/s step 14984/19560 | loss 3.296481 (-0.21z)| norm 0.2614 (-0.49z)| lr 8.30e-05 | 4281.32 ms | 31.5% bf16 MFU | 124881 tok/s step 14985/19560 | loss 3.270754 (-0.84z)| norm 0.2791 (+0.77z)| lr 8.30e-05 | 4165.01 ms | 32.4% bf16 MFU | 124931 tok/s step 14986/19560 | loss 3.277907 (-0.65z)| norm 0.2786 (+0.74z)| lr 8.30e-05 | 4169.22 ms | 32.4% bf16 MFU | 124972 tok/s step 14987/19560 | loss 3.336178 (+0.81z)| norm 0.3052 (+2.60z)| lr 8.29e-05 | 4184.38 ms | 32.3% bf16 MFU | 124988 tok/s step 14988/19560 | loss 3.294031 (-0.25z)| norm 0.2599 (-0.63z)| lr 8.29e-05 | 4157.12 ms | 32.5% bf16 MFU | 125045 tok/s step 14989/19560 | loss 3.329570 (+0.64z)| norm 0.2590 (-0.69z)| lr 8.29e-05 | 4173.93 ms | 32.3% bf16 MFU | 125073 tok/s step 14990/19560 | loss 3.285339 (-0.47z)| norm 0.2665 (-0.15z)| lr 8.28e-05 | 4155.65 ms | 32.5% bf16 MFU | 125127 tok/s step 14991/19560 
| loss 3.339826 (+0.88z)| norm 0.2707 (+0.13z)| lr 8.28e-05 | 4202.73 ms | 32.1% bf16 MFU | 125108 tok/s step 14992/19560 | loss 3.255754 (-1.23z)| norm 0.2427 (-1.83z)| lr 8.28e-05 | 4313.47 ms | 31.3% bf16 MFU | 124930 tok/s step 14993/19560 | loss 3.328958 (+0.60z)| norm 0.2731 (+0.31z)| lr 8.27e-05 | 4162.68 ms | 32.4% bf16 MFU | 124981 tok/s step 14994/19560 | loss 3.268838 (-0.91z)| norm 0.2487 (-1.40z)| lr 8.27e-05 | 4163.92 ms | 32.4% bf16 MFU | 125028 tok/s step 14995/19560 | loss 3.319145 (+0.36z)| norm 0.2416 (-1.88z)| lr 8.27e-05 | 4155.61 ms | 32.5% bf16 MFU | 125085 tok/s step 14996/19560 | loss 3.337304 (+0.82z)| norm 0.2696 (+0.09z)| lr 8.26e-05 | 4166.50 ms | 32.4% bf16 MFU | 125122 tok/s step 14997/19560 | loss 3.277579 (-0.68z)| norm 0.2540 (-0.99z)| lr 8.26e-05 | 4163.79 ms | 32.4% bf16 MFU | 125162 tok/s step 14998/19560 | loss 3.340389 (+0.95z)| norm 0.2742 (+0.43z)| lr 8.26e-05 | 4162.24 ms | 32.4% bf16 MFU | 125202 tok/s step 14999/19560 | loss 3.369367 (+1.67z)| norm 0.2606 (-0.53z)| lr 8.25e-05 | 4164.61 ms | 32.4% bf16 MFU | 125236 tok/s step 15000/19560 | loss 3.236314 (-1.71z)| norm 0.2589 (-0.64z)| lr 8.25e-05 | 4196.33 ms | 32.2% bf16 MFU | 125222 tok/s val loss 3.294779 evaluating HellaSwag: 0/1256 evaluating HellaSwag: 10/1256 evaluating HellaSwag: 20/1256 evaluating HellaSwag: 30/1256 evaluating HellaSwag: 40/1256 evaluating HellaSwag: 50/1256 evaluating HellaSwag: 60/1256 evaluating HellaSwag: 70/1256 evaluating HellaSwag: 80/1256 evaluating HellaSwag: 90/1256 evaluating HellaSwag: 100/1256 evaluating HellaSwag: 110/1256 evaluating HellaSwag: 120/1256 evaluating HellaSwag: 130/1256 evaluating HellaSwag: 140/1256 evaluating HellaSwag: 150/1256 evaluating HellaSwag: 160/1256 evaluating HellaSwag: 170/1256 evaluating HellaSwag: 180/1256 evaluating HellaSwag: 190/1256 evaluating HellaSwag: 200/1256 evaluating HellaSwag: 210/1256 evaluating HellaSwag: 220/1256 evaluating HellaSwag: 230/1256 evaluating HellaSwag: 240/1256 evaluating 
HellaSwag: 250/1256 evaluating HellaSwag: 260/1256 evaluating HellaSwag: 270/1256 evaluating HellaSwag: 280/1256 evaluating HellaSwag: 290/1256 evaluating HellaSwag: 300/1256 evaluating HellaSwag: 310/1256 evaluating HellaSwag: 320/1256 evaluating HellaSwag: 330/1256 evaluating HellaSwag: 340/1256 evaluating HellaSwag: 350/1256 evaluating HellaSwag: 360/1256 evaluating HellaSwag: 370/1256 evaluating HellaSwag: 380/1256 evaluating HellaSwag: 390/1256 evaluating HellaSwag: 400/1256 evaluating HellaSwag: 410/1256 evaluating HellaSwag: 420/1256 evaluating HellaSwag: 430/1256 evaluating HellaSwag: 440/1256 evaluating HellaSwag: 450/1256 evaluating HellaSwag: 460/1256 evaluating HellaSwag: 470/1256 evaluating HellaSwag: 480/1256 evaluating HellaSwag: 490/1256 evaluating HellaSwag: 500/1256 evaluating HellaSwag: 510/1256 evaluating HellaSwag: 520/1256 evaluating HellaSwag: 530/1256 evaluating HellaSwag: 540/1256 evaluating HellaSwag: 550/1256 evaluating HellaSwag: 560/1256 evaluating HellaSwag: 570/1256 evaluating HellaSwag: 580/1256 evaluating HellaSwag: 590/1256 evaluating HellaSwag: 600/1256 evaluating HellaSwag: 610/1256 evaluating HellaSwag: 620/1256 evaluating HellaSwag: 630/1256 evaluating HellaSwag: 640/1256 evaluating HellaSwag: 650/1256 evaluating HellaSwag: 660/1256 evaluating HellaSwag: 670/1256 evaluating HellaSwag: 680/1256 evaluating HellaSwag: 690/1256 evaluating HellaSwag: 700/1256 evaluating HellaSwag: 710/1256 evaluating HellaSwag: 720/1256 evaluating HellaSwag: 730/1256 evaluating HellaSwag: 740/1256 evaluating HellaSwag: 750/1256 evaluating HellaSwag: 760/1256 evaluating HellaSwag: 770/1256 evaluating HellaSwag: 780/1256 evaluating HellaSwag: 790/1256 evaluating HellaSwag: 800/1256 evaluating HellaSwag: 810/1256 evaluating HellaSwag: 820/1256 evaluating HellaSwag: 830/1256 evaluating HellaSwag: 840/1256 evaluating HellaSwag: 850/1256 evaluating HellaSwag: 860/1256 evaluating HellaSwag: 870/1256 evaluating HellaSwag: 880/1256 evaluating HellaSwag: 
890/1256 evaluating HellaSwag: 900/1256 evaluating HellaSwag: 910/1256 evaluating HellaSwag: 920/1256 evaluating HellaSwag: 930/1256 evaluating HellaSwag: 940/1256 evaluating HellaSwag: 950/1256 evaluating HellaSwag: 960/1256 evaluating HellaSwag: 970/1256 evaluating HellaSwag: 980/1256 evaluating HellaSwag: 990/1256 evaluating HellaSwag: 1000/1256 evaluating HellaSwag: 1010/1256 evaluating HellaSwag: 1020/1256 evaluating HellaSwag: 1030/1256 evaluating HellaSwag: 1040/1256 evaluating HellaSwag: 1050/1256 evaluating HellaSwag: 1060/1256 evaluating HellaSwag: 1070/1256 evaluating HellaSwag: 1080/1256 evaluating HellaSwag: 1090/1256 evaluating HellaSwag: 1100/1256 evaluating HellaSwag: 1110/1256 evaluating HellaSwag: 1120/1256 evaluating HellaSwag: 1130/1256 evaluating HellaSwag: 1140/1256 evaluating HellaSwag: 1150/1256 evaluating HellaSwag: 1160/1256 evaluating HellaSwag: 1170/1256 evaluating HellaSwag: 1180/1256 evaluating HellaSwag: 1190/1256 evaluating HellaSwag: 1200/1256 evaluating HellaSwag: 1210/1256 evaluating HellaSwag: 1220/1256 evaluating HellaSwag: 1230/1256 evaluating HellaSwag: 1240/1256 evaluating HellaSwag: 1250/1256 HellaSwag: 3004/10042 = 0.299144 Writing checkpoint at step 15000 Writing model to log124M/model_00015000.bin Writing state to log124M/state_00015000_00000.bin step 15001/19560 | loss 3.252533 (-1.27z)| norm 0.2640 (-0.27z)| lr 8.25e-05 | 4189.57 ms | 32.2% bf16 MFU | 125218 tok/s step 15002/19560 | loss 3.428036 (+3.04z)| norm 0.2657 (-0.15z)| lr 8.24e-05 | 4267.35 ms | 31.6% bf16 MFU | 125100 tok/s step 15003/19560 | loss 3.344987 (+1.00z)| norm 0.3516 (+5.23z)| lr 8.24e-05 | 4144.24 ms | 32.6% bf16 MFU | 125170 tok/s step 15004/19560 | loss 3.283189 (-0.51z)| norm 0.2742 (+0.36z)| lr 8.24e-05 | 4158.43 ms | 32.5% bf16 MFU | 125216 tok/s step 15005/19560 | loss 3.343169 (+0.94z)| norm 0.2615 (-0.45z)| lr 8.23e-05 | 4159.50 ms | 32.5% bf16 MFU | 125257 tok/s step 15006/19560 | loss 3.344877 (+0.97z)| norm 0.2705 (+0.13z)| lr 8.23e-05 | 
4212.90 ms | 32.0% bf16 MFU | 125217 tok/s step 15007/19560 | loss 3.301156 (-0.09z)| norm 0.2744 (+0.36z)| lr 8.23e-05 | 4153.00 ms | 32.5% bf16 MFU | 125268 tok/s step 15008/19560 | loss 3.287058 (-0.43z)| norm 0.2563 (-0.78z)| lr 8.22e-05 | 4223.43 ms | 32.0% bf16 MFU | 125211 tok/s step 15009/19560 | loss 3.250527 (-1.34z)| norm 0.2690 (+0.03z)| lr 8.22e-05 | 4149.98 ms | 32.5% bf16 MFU | 125268 tok/s step 15010/19560 | loss 3.270562 (-0.82z)| norm 0.2507 (-1.13z)| lr 8.22e-05 | 4165.18 ms | 32.4% bf16 MFU | 125298 tok/s step 15011/19560 | loss 3.297056 (-0.14z)| norm 0.2636 (-0.31z)| lr 8.21e-05 | 4168.41 ms | 32.4% bf16 MFU | 125322 tok/s step 15012/19560 | loss 3.459073 (+3.78z)| norm 0.3009 (+2.01z)| lr 8.21e-05 | 4158.90 ms | 32.5% bf16 MFU | 125359 tok/s step 15013/19560 | loss 3.362792 (+1.41z)| norm 0.2873 (+1.15z)| lr 8.20e-05 | 4226.64 ms | 31.9% bf16 MFU | 125293 tok/s step 15014/19560 | loss 3.327160 (+0.53z)| norm 0.2715 (+0.16z)| lr 8.20e-05 | 4163.92 ms | 32.4% bf16 MFU | 125324 tok/s step 15015/19560 | loss 3.291971 (-0.33z)| norm 0.2791 (+0.63z)| lr 8.20e-05 | 4156.93 ms | 32.5% bf16 MFU | 125364 tok/s step 15016/19560 | loss 3.379251 (+1.77z)| norm 0.2747 (+0.35z)| lr 8.19e-05 | 4157.95 ms | 32.5% bf16 MFU | 125401 tok/s step 15017/19560 | loss 3.287192 (-0.47z)| norm 0.2606 (-0.53z)| lr 8.19e-05 | 4168.03 ms | 32.4% bf16 MFU | 125420 tok/s step 15018/19560 | loss 3.307213 (+0.01z)| norm 0.2697 (+0.04z)| lr 8.19e-05 | 4213.71 ms | 32.0% bf16 MFU | 125370 tok/s step 15019/19560 | loss 3.219635 (-2.08z)| norm 0.2557 (-0.84z)| lr 8.18e-05 | 4162.78 ms | 32.4% bf16 MFU | 125399 tok/s step 15020/19560 | loss 3.320802 (+0.35z)| norm 0.2694 (+0.02z)| lr 8.18e-05 | 4165.30 ms | 32.4% bf16 MFU | 125423 tok/s step 15021/19560 | loss 3.268577 (-0.89z)| norm 0.2703 (+0.07z)| lr 8.18e-05 | 4196.08 ms | 32.2% bf16 MFU | 125399 tok/s step 15022/19560 | loss 3.258123 (-1.13z)| norm 0.2473 (-1.36z)| lr 8.17e-05 | 4160.53 ms | 32.5% bf16 MFU | 125430 tok/s step 
15023/19560 | loss 3.296238 (-0.21z)| norm 0.2575 (-0.72z)| lr 8.17e-05 | 4167.87 ms | 32.4% bf16 MFU | 125448 tok/s step 15024/19560 | loss 3.264353 (-0.99z)| norm 0.2575 (-0.71z)| lr 8.17e-05 | 4166.68 ms | 32.4% bf16 MFU | 125467 tok/s step 15025/19560 | loss 3.340896 (+0.85z)| norm 0.2585 (-0.64z)| lr 8.16e-05 | 4176.65 ms | 32.3% bf16 MFU | 125470 tok/s step 15026/19560 | loss 3.296423 (-0.23z)| norm 0.2584 (-0.65z)| lr 8.16e-05 | 4162.92 ms | 32.4% bf16 MFU | 125493 tok/s step 15027/19560 | loss 3.278527 (-0.66z)| norm 0.2556 (-0.81z)| lr 8.16e-05 | 4166.62 ms | 32.4% bf16 MFU | 125510 tok/s step 15028/19560 | loss 3.340913 (+0.86z)| norm 0.2506 (-1.11z)| lr 8.15e-05 | 4156.90 ms | 32.5% bf16 MFU | 125541 tok/s step 15029/19560 | loss 3.388453 (+1.97z)| norm 0.2583 (-0.62z)| lr 8.15e-05 | 4158.10 ms | 32.5% bf16 MFU | 125568 tok/s step 15030/19560 | loss 3.310852 (+0.11z)| norm 0.2547 (-0.83z)| lr 8.15e-05 | 4159.30 ms | 32.5% bf16 MFU | 125593 tok/s step 15031/19560 | loss 3.248872 (-1.36z)| norm 0.2722 (+0.25z)| lr 8.14e-05 | 4199.24 ms | 32.2% bf16 MFU | 125556 tok/s step 15032/19560 | loss 3.301509 (-0.10z)| norm 0.2500 (-1.10z)| lr 8.14e-05 | 4154.57 ms | 32.5% bf16 MFU | 125588 tok/s step 15033/19560 | loss 3.306753 (+0.02z)| norm 0.2727 (+0.30z)| lr 8.14e-05 | 4158.53 ms | 32.5% bf16 MFU | 125612 tok/s step 15034/19560 | loss 3.351340 (+1.09z)| norm 0.2653 (-0.15z)| lr 8.13e-05 | 4165.01 ms | 32.4% bf16 MFU | 125625 tok/s step 15035/19560 | loss 3.303959 (-0.05z)| norm 0.2636 (-0.26z)| lr 8.13e-05 | 4158.12 ms | 32.5% bf16 MFU | 125648 tok/s step 15036/19560 | loss 3.309937 (+0.09z)| norm 0.2809 (+0.82z)| lr 8.13e-05 | 4158.58 ms | 32.5% bf16 MFU | 125670 tok/s step 15037/19560 | loss 3.251926 (-1.30z)| norm 0.2648 (-0.18z)| lr 8.12e-05 | 4158.65 ms | 32.5% bf16 MFU | 125690 tok/s step 15038/19560 | loss 3.331491 (+0.63z)| norm 0.2607 (-0.43z)| lr 8.12e-05 | 4228.34 ms | 31.9% bf16 MFU | 125605 tok/s step 15039/19560 | loss 3.249616 (-1.34z)| norm 
0.2763 (+0.55z)| lr 8.12e-05 | 4167.01 ms | 32.4% bf16 MFU | 125616 tok/s step 15040/19560 | loss 3.266728 (-0.92z)| norm 0.2654 (-0.13z)| lr 8.11e-05 | 4166.92 ms | 32.4% bf16 MFU | 125626 tok/s step 15041/19560 | loss 3.298321 (-0.16z)| norm 0.2885 (+1.31z)| lr 8.11e-05 | 4176.15 ms | 32.3% bf16 MFU | 125622 tok/s step 15042/19560 | loss 3.255637 (-1.17z)| norm 0.2666 (-0.07z)| lr 8.11e-05 | 4166.89 ms | 32.4% bf16 MFU | 125632 tok/s step 15043/19560 | loss 3.232659 (-1.69z)| norm 0.2719 (+0.26z)| lr 8.10e-05 | 4160.92 ms | 32.4% bf16 MFU | 125650 tok/s step 15044/19560 | loss 3.266615 (-0.87z)| norm 0.2942 (+1.63z)| lr 8.10e-05 | 4160.47 ms | 32.5% bf16 MFU | 125669 tok/s step 15045/19560 | loss 3.266520 (-0.87z)| norm 0.2657 (-0.15z)| lr 8.10e-05 | 4159.98 ms | 32.5% bf16 MFU | 125687 tok/s step 15046/19560 | loss 3.299260 (-0.11z)| norm 0.2836 (+0.96z)| lr 8.09e-05 | 4157.56 ms | 32.5% bf16 MFU | 125708 tok/s step 15047/19560 | loss 3.270164 (-0.80z)| norm 0.2877 (+1.19z)| lr 8.09e-05 | 4158.41 ms | 32.5% bf16 MFU | 125726 tok/s step 15048/19560 | loss 3.317475 (+0.32z)| norm 0.2675 (-0.05z)| lr 8.09e-05 | 4166.80 ms | 32.4% bf16 MFU | 125731 tok/s step 15049/19560 | loss 3.239141 (-1.51z)| norm 0.2514 (-1.03z)| lr 8.08e-05 | 4166.64 ms | 32.4% bf16 MFU | 125736 tok/s step 15050/19560 | loss 3.301811 (-0.04z)| norm 0.2714 (+0.21z)| lr 8.08e-05 | 4159.69 ms | 32.5% bf16 MFU | 125751 tok/s step 15051/19560 | loss 3.238530 (-1.53z)| norm 0.2834 (+0.95z)| lr 8.07e-05 | 4159.08 ms | 32.5% bf16 MFU | 125767 tok/s step 15052/19560 | loss 3.366502 (+1.50z)| norm 0.2626 (-0.34z)| lr 8.07e-05 | 4161.35 ms | 32.4% bf16 MFU | 125778 tok/s step 15053/19560 | loss 3.249285 (-1.27z)| norm 0.2842 (+1.00z)| lr 8.07e-05 | 4150.66 ms | 32.5% bf16 MFU | 125805 tok/s step 15054/19560 | loss 3.286783 (-0.39z)| norm 0.2766 (+0.52z)| lr 8.06e-05 | 4168.20 ms | 32.4% bf16 MFU | 125804 tok/s step 15055/19560 | loss 3.257490 (-1.07z)| norm 0.2653 (-0.19z)| lr 8.06e-05 | 4155.14 ms | 
32.5% bf16 MFU | 125822 tok/s step 15056/19560 | loss 3.339493 (+0.85z)| norm 0.2849 (+1.02z)| lr 8.06e-05 | 4162.79 ms | 32.4% bf16 MFU | 125829 tok/s step 15057/19560 | loss 3.304909 (+0.04z)| norm 0.2678 (-0.05z)| lr 8.05e-05 | 4154.88 ms | 32.5% bf16 MFU | 125846 tok/s step 15058/19560 | loss 3.314095 (+0.25z)| norm 0.3040 (+2.15z)| lr 8.05e-05 | 4156.20 ms | 32.5% bf16 MFU | 125861 tok/s step 15059/19560 | loss 3.258999 (-1.05z)| norm 0.2897 (+1.26z)| lr 8.05e-05 | 4166.17 ms | 32.4% bf16 MFU | 125861 tok/s step 15060/19560 | loss 3.263227 (-0.95z)| norm 0.2886 (+1.17z)| lr 8.04e-05 | 4160.78 ms | 32.4% bf16 MFU | 125868 tok/s step 15061/19560 | loss 3.277041 (-0.61z)| norm 0.2818 (+0.74z)| lr 8.04e-05 | 4160.61 ms | 32.5% bf16 MFU | 125875 tok/s step 15062/19560 | loss 3.312495 (+0.21z)| norm 0.2835 (+0.83z)| lr 8.04e-05 | 4167.45 ms | 32.4% bf16 MFU | 125872 tok/s step 15063/19560 | loss 3.256647 (-1.10z)| norm 0.2830 (+0.79z)| lr 8.03e-05 | 4161.31 ms | 32.4% bf16 MFU | 125878 tok/s step 15064/19560 | loss 3.274494 (-0.67z)| norm 0.2623 (-0.47z)| lr 8.03e-05 | 4171.12 ms | 32.4% bf16 MFU | 125868 tok/s step 15065/19560 | loss 3.285868 (-0.39z)| norm 0.2568 (-0.80z)| lr 8.03e-05 | 4165.23 ms | 32.4% bf16 MFU | 125869 tok/s step 15066/19560 | loss 3.339897 (+0.90z)| norm 0.2742 (+0.28z)| lr 8.02e-05 | 4168.33 ms | 32.4% bf16 MFU | 125864 tok/s step 15067/19560 | loss 3.395692 (+2.19z)| norm 0.2708 (+0.07z)| lr 8.02e-05 | 4181.51 ms | 32.3% bf16 MFU | 125840 tok/s step 15068/19560 | loss 3.310312 (+0.18z)| norm 0.2591 (-0.65z)| lr 8.02e-05 | 4157.38 ms | 32.5% bf16 MFU | 125854 tok/s step 15069/19560 | loss 3.332031 (+0.70z)| norm 0.2821 (+0.90z)| lr 8.01e-05 | 4202.85 ms | 32.1% bf16 MFU | 125798 tok/s step 15070/19560 | loss 3.316142 (+0.32z)| norm 0.2834 (+0.97z)| lr 8.01e-05 | 4161.32 ms | 32.4% bf16 MFU | 125808 tok/s step 15071/19560 | loss 3.361966 (+1.38z)| norm 0.2714 (+0.17z)| lr 8.01e-05 | 4155.50 ms | 32.5% bf16 MFU | 125826 tok/s step 15072/19560 
| loss 3.280497 (-0.53z)| norm 0.2682 (-0.04z)| lr 8.00e-05 | 4154.78 ms | 32.5% bf16 MFU | 125844 tok/s step 15073/19560 | loss 3.352226 (+1.23z)| norm 0.2819 (+0.92z)| lr 8.00e-05 | 4167.29 ms | 32.4% bf16 MFU | 125842 tok/s step 15074/19560 | loss 3.408853 (+2.54z)| norm 0.8177 (+10.81z)| lr 8.00e-05 | 4162.04 ms | 32.4% bf16 MFU | 125849 tok/s step 15075/19560 | loss 3.269258 (-0.81z)| norm 0.2752 (+0.04z)| lr 7.99e-05 | 4160.51 ms | 32.5% bf16 MFU | 125857 tok/s step 15076/19560 | loss 3.267595 (-0.83z)| norm 0.2813 (+0.16z)| lr 7.99e-05 | 4163.44 ms | 32.4% bf16 MFU | 125860 tok/s step 15077/19560 | loss 3.294636 (-0.18z)| norm 0.2834 (+0.20z)| lr 7.99e-05 | 4157.93 ms | 32.5% bf16 MFU | 125872 tok/s step 15078/19560 | loss 3.311196 (+0.22z)| norm 0.2936 (+0.40z)| lr 7.98e-05 | 4164.41 ms | 32.4% bf16 MFU | 125873 tok/s step 15079/19560 | loss 3.327397 (+0.59z)| norm 0.2711 (-0.04z)| lr 7.98e-05 | 4161.34 ms | 32.4% bf16 MFU | 125879 tok/s step 15080/19560 | loss 3.296794 (-0.14z)| norm 0.2724 (-0.02z)| lr 7.98e-05 | 4165.53 ms | 32.4% bf16 MFU | 125878 tok/s step 15081/19560 | loss 3.236206 (-1.56z)| norm 0.2683 (-0.10z)| lr 7.97e-05 | 4179.54 ms | 32.3% bf16 MFU | 125857 tok/s step 15082/19560 | loss 3.276292 (-0.60z)| norm 0.2686 (-0.09z)| lr 7.97e-05 | 4150.98 ms | 32.5% bf16 MFU | 125879 tok/s step 15083/19560 | loss 3.272400 (-0.68z)| norm 0.2756 (+0.04z)| lr 7.97e-05 | 4161.67 ms | 32.4% bf16 MFU | 125884 tok/s step 15084/19560 | loss 3.297733 (-0.08z)| norm 0.2711 (-0.05z)| lr 7.96e-05 | 4159.84 ms | 32.5% bf16 MFU | 125892 tok/s step 15085/19560 | loss 3.289789 (-0.28z)| norm 0.2765 (+0.05z)| lr 7.96e-05 | 4160.46 ms | 32.5% bf16 MFU | 125898 tok/s step 15086/19560 | loss 3.319670 (+0.43z)| norm 0.2705 (-0.07z)| lr 7.96e-05 | 4165.30 ms | 32.4% bf16 MFU | 125896 tok/s step 15087/19560 | loss 3.320651 (+0.45z)| norm 0.2715 (-0.05z)| lr 7.95e-05 | 4153.29 ms | 32.5% bf16 MFU | 125913 tok/s step 15088/19560 | loss 3.270849 (-0.73z)| norm 0.2872 
(+0.26z)| lr 7.95e-05 | 4161.83 ms | 32.4% bf16 MFU | 125916 tok/s step 15089/19560 | loss 3.281537 (-0.47z)| norm 0.2536 (-0.41z)| lr 7.95e-05 | 4158.24 ms | 32.5% bf16 MFU | 125925 tok/s step 15090/19560 | loss 3.301304 (-0.01z)| norm 0.2578 (-0.32z)| lr 7.94e-05 | 4165.85 ms | 32.4% bf16 MFU | 125921 tok/s step 15091/19560 | loss 3.262181 (-0.93z)| norm 0.2663 (-0.15z)| lr 7.94e-05 | 4155.18 ms | 32.5% bf16 MFU | 125934 tok/s step 15092/19560 | loss 3.332042 (+0.72z)| norm 0.2664 (-0.15z)| lr 7.94e-05 | 4167.24 ms | 32.4% bf16 MFU | 125928 tok/s step 15093/19560 | loss 3.263890 (-0.92z)| norm 0.2597 (-0.28z)| lr 7.93e-05 | 4156.39 ms | 32.5% bf16 MFU | 125939 tok/s step 15094/19560 | loss 3.342935 (+0.98z)| norm 0.2881 (+0.28z)| lr 7.93e-05 | 4166.20 ms | 32.4% bf16 MFU | 125934 tok/s step 15095/19560 | loss 3.232276 (-1.68z)| norm 0.2757 (+0.03z)| lr 7.93e-05 | 4173.76 ms | 32.3% bf16 MFU | 125918 tok/s step 15096/19560 | loss 3.248535 (-1.27z)| norm 0.2783 (+0.08z)| lr 7.92e-05 | 4165.61 ms | 32.4% bf16 MFU | 125915 tok/s step 15097/19560 | loss 3.266330 (-0.84z)| norm 0.2651 (-0.18z)| lr 7.92e-05 | 4164.55 ms | 32.4% bf16 MFU | 125914 tok/s step 15098/19560 | loss 3.316644 (+0.35z)| norm 0.2659 (-0.17z)| lr 7.92e-05 | 4159.58 ms | 32.5% bf16 MFU | 125920 tok/s step 15099/19560 | loss 3.249730 (-1.22z)| norm 0.2620 (-0.25z)| lr 7.91e-05 | 4163.65 ms | 32.4% bf16 MFU | 125920 tok/s step 15100/19560 | loss 3.253844 (-1.11z)| norm 0.2928 (+0.36z)| lr 7.91e-05 | 4154.47 ms | 32.5% bf16 MFU | 125934 tok/s step 15101/19560 | loss 3.362952 (+1.44z)| norm 0.2914 (+0.33z)| lr 7.91e-05 | 4150.87 ms | 32.5% bf16 MFU | 125953 tok/s step 15102/19560 | loss 3.242665 (-1.35z)| norm 0.2802 (+0.10z)| lr 7.90e-05 | 4159.30 ms | 32.5% bf16 MFU | 125958 tok/s step 15103/19560 | loss 3.331660 (+0.71z)| norm 0.2816 (+0.13z)| lr 7.90e-05 | 4151.12 ms | 32.5% bf16 MFU | 125975 tok/s step 15104/19560 | loss 3.301222 (+0.02z)| norm 0.2850 (+0.19z)| lr 7.90e-05 | 4158.48 ms | 32.5% bf16 
MFU | 125980 tok/s step 15105/19560 | loss 3.323580 (+0.55z)| norm 0.2636 (-0.23z)| lr 7.89e-05 | 4167.22 ms | 32.4% bf16 MFU | 125972 tok/s step 15106/19560 | loss 3.319518 (+0.45z)| norm 0.2705 (-0.10z)| lr 7.89e-05 | 4155.73 ms | 32.5% bf16 MFU | 125981 tok/s step 15107/19560 | loss 3.300554 (+0.01z)| norm 0.2668 (-0.17z)| lr 7.88e-05 | 4164.61 ms | 32.4% bf16 MFU | 125977 tok/s step 15108/19560 | loss 3.340415 (+0.95z)| norm 0.2655 (-0.20z)| lr 7.88e-05 | 4162.33 ms | 32.4% bf16 MFU | 125976 tok/s step 15109/19560 | loss 3.247781 (-1.23z)| norm 0.2661 (-0.19z)| lr 7.88e-05 | 4160.27 ms | 32.5% bf16 MFU | 125978 tok/s step 15110/19560 | loss 3.279558 (-0.48z)| norm 0.2733 (-0.05z)| lr 7.87e-05 | 4169.90 ms | 32.4% bf16 MFU | 125966 tok/s step 15111/19560 | loss 3.241907 (-1.34z)| norm 0.2820 (+0.13z)| lr 7.87e-05 | 4155.33 ms | 32.5% bf16 MFU | 125976 tok/s step 15112/19560 | loss 3.357174 (+1.33z)| norm 0.2713 (-0.09z)| lr 7.87e-05 | 4159.32 ms | 32.5% bf16 MFU | 125980 tok/s step 15113/19560 | loss 3.262831 (-0.86z)| norm 0.2626 (-0.26z)| lr 7.86e-05 | 4160.85 ms | 32.4% bf16 MFU | 125981 tok/s step 15114/19560 | loss 3.342078 (+0.97z)| norm 0.2760 (+0.01z)| lr 7.86e-05 | 4151.19 ms | 32.5% bf16 MFU | 125997 tok/s step 15115/19560 | loss 3.289219 (-0.25z)| norm 0.2552 (-0.40z)| lr 7.86e-05 | 4156.48 ms | 32.5% bf16 MFU | 126004 tok/s step 15116/19560 | loss 3.244965 (-1.26z)| norm 0.2658 (-0.19z)| lr 7.85e-05 | 4157.04 ms | 32.5% bf16 MFU | 126010 tok/s step 15117/19560 | loss 3.310447 (+0.26z)| norm 0.2766 (+0.03z)| lr 7.85e-05 | 4161.08 ms | 32.4% bf16 MFU | 126009 tok/s step 15118/19560 | loss 3.259392 (-0.91z)| norm 0.2602 (-0.30z)| lr 7.85e-05 | 4157.88 ms | 32.5% bf16 MFU | 126014 tok/s step 15119/19560 | loss 3.232870 (-1.50z)| norm 0.2631 (-0.24z)| lr 7.84e-05 | 4162.25 ms | 32.4% bf16 MFU | 126011 tok/s step 15120/19560 | loss 3.271949 (-0.61z)| norm 0.2687 (-0.13z)| lr 7.84e-05 | 4184.02 ms | 32.3% bf16 MFU | 125976 tok/s step 15121/19560 | loss 
3.345531 (+1.08z)| norm 0.2638 (-0.23z)| lr 7.84e-05 | 4160.37 ms | 32.5% bf16 MFU | 125978 tok/s step 15122/19560 | loss 3.317461 (+0.42z)| norm 0.2579 (-0.35z)| lr 7.83e-05 | 4200.81 ms | 32.1% bf16 MFU | 125919 tok/s step 15123/19560 | loss 3.267284 (-0.71z)| norm 0.2655 (-0.20z)| lr 7.83e-05 | 4166.33 ms | 32.4% bf16 MFU | 125915 tok/s step 15124/19560 | loss 3.297656 (-0.01z)| norm 0.2746 (-0.02z)| lr 7.83e-05 | 4162.78 ms | 32.4% bf16 MFU | 125917 tok/s step 15125/19560 | loss 3.259718 (-0.88z)| norm 0.2523 (-0.47z)| lr 7.82e-05 | 4170.83 ms | 32.4% bf16 MFU | 125906 tok/s step 15126/19560 | loss 3.315074 (+0.40z)| norm 0.2668 (-0.17z)| lr 7.82e-05 | 4158.52 ms | 32.5% bf16 MFU | 125915 tok/s step 15127/19560 | loss 3.269475 (-0.64z)| norm 0.2592 (-0.33z)| lr 7.82e-05 | 4183.71 ms | 32.3% bf16 MFU | 125885 tok/s step 15128/19560 | loss 3.250246 (-1.09z)| norm 0.2613 (-0.29z)| lr 7.81e-05 | 4158.68 ms | 32.5% bf16 MFU | 125894 tok/s step 15129/19560 | loss 3.313058 (+0.36z)| norm 0.2792 (+0.07z)| lr 7.81e-05 | 4173.28 ms | 32.4% bf16 MFU | 125881 tok/s step 15130/19560 | loss 3.346855 (+1.20z)| norm 0.2867 (+0.22z)| lr 7.81e-05 | 4156.34 ms | 32.5% bf16 MFU | 125894 tok/s step 15131/19560 | loss 3.286056 (-0.26z)| norm 0.2866 (+0.23z)| lr 7.80e-05 | 4162.26 ms | 32.4% bf16 MFU | 125897 tok/s step 15132/19560 | loss 3.264132 (-0.78z)| norm 0.2824 (+0.14z)| lr 7.80e-05 | 4159.89 ms | 32.5% bf16 MFU | 125904 tok/s step 15133/19560 | loss 3.280221 (-0.38z)| norm 0.2858 (+0.21z)| lr 7.80e-05 | 4159.54 ms | 32.5% bf16 MFU | 125911 tok/s step 15134/19560 | loss 3.318783 (+0.56z)| norm 0.2674 (-0.16z)| lr 7.79e-05 | 4150.90 ms | 32.5% bf16 MFU | 125931 tok/s step 15135/19560 | loss 3.373207 (+1.85z)| norm 0.2935 (+0.36z)| lr 7.79e-05 | 4152.41 ms | 32.5% bf16 MFU | 125948 tok/s step 15136/19560 | loss 3.257447 (-0.93z)| norm 0.2819 (+0.12z)| lr 7.79e-05 | 4171.96 ms | 32.4% bf16 MFU | 125934 tok/s step 15137/19560 | loss 3.370266 (+1.75z)| norm 0.2661 (-0.20z)| lr 
7.78e-05 | 4153.07 ms | 32.5% bf16 MFU | 125949 tok/s step 15138/19560 | loss 3.331445 (+0.81z)| norm 0.2882 (+0.24z)| lr 7.78e-05 | 4154.04 ms | 32.5% bf16 MFU | 125962 tok/s step 15139/19560 | loss 3.277121 (-0.48z)| norm 0.2659 (-0.21z)| lr 7.78e-05 | 4157.85 ms | 32.5% bf16 MFU | 125969 tok/s step 15140/19560 | loss 3.238536 (-1.43z)| norm 0.2728 (-0.06z)| lr 7.77e-05 | 4167.19 ms | 32.4% bf16 MFU | 125961 tok/s step 15141/19560 | loss 3.296087 (+0.03z)| norm 0.2732 (-0.05z)| lr 7.77e-05 | 4153.90 ms | 32.5% bf16 MFU | 125974 tok/s step 15142/19560 | loss 3.249472 (-1.14z)| norm 0.2454 (-0.61z)| lr 7.77e-05 | 4164.58 ms | 32.4% bf16 MFU | 125970 tok/s step 15143/19560 | loss 3.301849 (+0.18z)| norm 0.2702 (-0.11z)| lr 7.76e-05 | 4156.05 ms | 32.5% bf16 MFU | 125979 tok/s step 15144/19560 | loss 3.306217 (+0.32z)| norm 0.2792 (+0.07z)| lr 7.76e-05 | 4156.18 ms | 32.5% bf16 MFU | 125987 tok/s step 15145/19560 | loss 3.286813 (-0.18z)| norm 0.2533 (-0.45z)| lr 7.76e-05 | 4150.03 ms | 32.5% bf16 MFU | 126004 tok/s step 15146/19560 | loss 3.259878 (-0.87z)| norm 0.2612 (-0.29z)| lr 7.75e-05 | 4158.76 ms | 32.5% bf16 MFU | 126008 tok/s step 15147/19560 | loss 3.293578 (-0.02z)| norm 0.2820 (+0.13z)| lr 7.75e-05 | 4163.72 ms | 32.4% bf16 MFU | 126003 tok/s step 15148/19560 | loss 3.256899 (-0.96z)| norm 0.2519 (-0.48z)| lr 7.75e-05 | 4154.68 ms | 32.5% bf16 MFU | 126013 tok/s step 15149/19560 | loss 3.291576 (-0.06z)| norm 0.2669 (-0.17z)| lr 7.74e-05 | 4155.18 ms | 32.5% bf16 MFU | 126021 tok/s step 15150/19560 | loss 3.300106 (+0.15z)| norm 0.2752 (-0.01z)| lr 7.74e-05 | 4150.29 ms | 32.5% bf16 MFU | 126036 tok/s step 15151/19560 | loss 3.229938 (-1.65z)| norm 0.2678 (-0.16z)| lr 7.74e-05 | 4229.40 ms | 31.9% bf16 MFU | 125932 tok/s step 15152/19560 | loss 3.281118 (-0.33z)| norm 0.2642 (-0.24z)| lr 7.73e-05 | 4169.63 ms | 32.4% bf16 MFU | 125923 tok/s step 15153/19560 | loss 3.318197 (+0.64z)| norm 0.2687 (-0.15z)| lr 7.73e-05 | 4166.08 ms | 32.4% bf16 MFU | 125919 
tok/s step 15154/19560 | loss 3.329169 (+0.92z)| norm 0.2543 (-0.44z)| lr 7.73e-05 | 4156.50 ms | 32.5% bf16 MFU | 125930 tok/s step 15155/19560 | loss 3.304607 (+0.27z)| norm 0.3057 (+0.60z)| lr 7.72e-05 | 4153.28 ms | 32.5% bf16 MFU | 125945 tok/s step 15156/19560 | loss 3.256202 (-0.97z)| norm 0.2618 (-0.30z)| lr 7.72e-05 | 4152.50 ms | 32.5% bf16 MFU | 125961 tok/s step 15157/19560 | loss 3.291751 (-0.02z)| norm 0.2882 (+0.23z)| lr 7.72e-05 | 4164.30 ms | 32.4% bf16 MFU | 125958 tok/s step 15158/19560 | loss 3.325773 (+0.88z)| norm 0.2811 (+0.09z)| lr 7.71e-05 | 4159.09 ms | 32.5% bf16 MFU | 125963 tok/s step 15159/19560 | loss 3.302437 (+0.25z)| norm 0.2607 (-0.32z)| lr 7.71e-05 | 4170.01 ms | 32.4% bf16 MFU | 125951 tok/s step 15160/19560 | loss 3.274444 (-0.50z)| norm 0.2692 (-0.16z)| lr 7.71e-05 | 4149.63 ms | 32.5% bf16 MFU | 125971 tok/s step 15161/19560 | loss 3.277900 (-0.40z)| norm 0.2681 (-0.18z)| lr 7.70e-05 | 4157.78 ms | 32.5% bf16 MFU | 125977 tok/s step 15162/19560 | loss 3.247878 (-1.18z)| norm 0.2648 (-0.25z)| lr 7.70e-05 | 4166.61 ms | 32.4% bf16 MFU | 125970 tok/s step 15163/19560 | loss 3.281297 (-0.28z)| norm 0.2915 (+0.29z)| lr 7.70e-05 | 4159.63 ms | 32.5% bf16 MFU | 125973 tok/s step 15164/19560 | loss 3.263857 (-0.74z)| norm 0.2779 (+0.02z)| lr 7.69e-05 | 4160.27 ms | 32.5% bf16 MFU | 125976 tok/s step 15165/19560 | loss 3.247156 (-1.18z)| norm 0.2636 (-0.27z)| lr 7.69e-05 | 4167.82 ms | 32.4% bf16 MFU | 125967 tok/s step 15166/19560 | loss 3.294853 (+0.10z)| norm 0.2701 (-0.14z)| lr 7.69e-05 | 4148.08 ms | 32.5% bf16 MFU | 125988 tok/s step 15167/19560 | loss 3.271153 (-0.54z)| norm 0.2709 (-0.13z)| lr 7.68e-05 | 4153.08 ms | 32.5% bf16 MFU | 126001 tok/s step 15168/19560 | loss 3.293584 (+0.06z)| norm 0.2965 (+0.39z)| lr 7.68e-05 | 4157.33 ms | 32.5% bf16 MFU | 126006 tok/s step 15169/19560 | loss 3.309384 (+0.48z)| norm 0.2541 (-0.47z)| lr 7.68e-05 | 4158.51 ms | 32.5% bf16 MFU | 126010 tok/s step 15170/19560 | loss 3.313389 
(+0.58z)| norm 0.2773 (+0.00z)| lr 7.67e-05 | 4165.30 ms | 32.4% bf16 MFU | 126003 tok/s step 15171/19560 | loss 3.305986 (+0.37z)| norm 0.2668 (-0.21z)| lr 7.67e-05 | 4190.36 ms | 32.2% bf16 MFU | 125959 tok/s step 15172/19560 | loss 3.338643 (+1.24z)| norm 0.3079 (+0.62z)| lr 7.67e-05 | 4227.80 ms | 31.9% bf16 MFU | 125861 tok/s step 15173/19560 | loss 3.245866 (-1.28z)| norm 0.2617 (-0.31z)| lr 7.66e-05 | 4167.52 ms | 32.4% bf16 MFU | 125858 tok/s step 15174/19560 | loss 3.304827 (+0.32z)| norm 0.2706 (-0.13z)| lr 7.66e-05 | 4163.09 ms | 32.4% bf16 MFU | 125862 tok/s step 15175/19560 | loss 3.316509 (+0.63z)| norm 0.2638 (-0.27z)| lr 7.66e-05 | 4159.80 ms | 32.5% bf16 MFU | 125871 tok/s step 15176/19560 | loss 3.246564 (-1.25z)| norm 0.2639 (-0.26z)| lr 7.65e-05 | 4174.03 ms | 32.3% bf16 MFU | 125858 tok/s step 15177/19560 | loss 3.342993 (+1.33z)| norm 0.2653 (-0.24z)| lr 7.65e-05 | 4181.13 ms | 32.3% bf16 MFU | 125835 tok/s step 15178/19560 | loss 3.315156 (+0.58z)| norm 0.2630 (-0.28z)| lr 7.65e-05 | 4173.97 ms | 32.3% bf16 MFU | 125823 tok/s step 15179/19560 | loss 3.291011 (-0.09z)| norm 0.2573 (-0.39z)| lr 7.64e-05 | 4168.78 ms | 32.4% bf16 MFU | 125820 tok/s step 15180/19560 | loss 3.313416 (+0.54z)| norm 0.2654 (-0.23z)| lr 7.64e-05 | 4168.24 ms | 32.4% bf16 MFU | 125818 tok/s step 15181/19560 | loss 3.337320 (+1.19z)| norm 0.2657 (-0.22z)| lr 7.64e-05 | 4161.81 ms | 32.4% bf16 MFU | 125826 tok/s step 15182/19560 | loss 3.277274 (-0.47z)| norm 0.2514 (-0.51z)| lr 7.63e-05 | 4167.40 ms | 32.4% bf16 MFU | 125825 tok/s step 15183/19560 | loss 3.236063 (-1.60z)| norm 0.2605 (-0.32z)| lr 7.63e-05 | 4171.86 ms | 32.4% bf16 MFU | 125818 tok/s step 15184/19560 | loss 3.261853 (-0.87z)| norm 0.2680 (-0.17z)| lr 7.63e-05 | 4164.15 ms | 32.4% bf16 MFU | 125822 tok/s step 15185/19560 | loss 3.264189 (-0.80z)| norm 0.2879 (+0.23z)| lr 7.62e-05 | 4172.19 ms | 32.4% bf16 MFU | 125814 tok/s step 15186/19560 | loss 3.347155 (+1.46z)| norm 0.2638 (-0.25z)| lr 7.62e-05 | 
8753.50 ms | 15.4% bf16 MFU | 122518 tok/s step 15187/19560 | loss 3.306212 (+0.34z)| norm 0.2786 (+0.05z)| lr 7.62e-05 | 4131.59 ms | 32.7% bf16 MFU | 122737 tok/s step 15188/19560 | loss 3.272896 (-0.58z)| norm 0.2651 (-0.22z)| lr 7.61e-05 | 4150.88 ms | 32.5% bf16 MFU | 122916 tok/s step 15189/19560 | loss 3.297087 (+0.08z)| norm 0.2704 (-0.11z)| lr 7.61e-05 | 4158.29 ms | 32.5% bf16 MFU | 123074 tok/s step 15190/19560 | loss 3.351873 (+1.56z)| norm 0.2846 (+0.18z)| lr 7.61e-05 | 4161.53 ms | 32.4% bf16 MFU | 123219 tok/s step 15191/19560 | loss 3.314884 (+0.55z)| norm 0.2615 (-0.28z)| lr 7.60e-05 | 4162.77 ms | 32.4% bf16 MFU | 123356 tok/s step 15192/19560 | loss 3.317500 (+0.61z)| norm 0.2690 (-0.13z)| lr 7.60e-05 | 4167.44 ms | 32.4% bf16 MFU | 123478 tok/s step 15193/19560 | loss 3.347237 (+1.39z)| norm 0.2610 (-0.30z)| lr 7.60e-05 | 4147.90 ms | 32.6% bf16 MFU | 123624 tok/s step 15194/19560 | loss 3.321474 (+0.70z)| norm 0.2730 (-0.05z)| lr 7.59e-05 | 4161.04 ms | 32.4% bf16 MFU | 123743 tok/s step 15195/19560 | loss 3.266889 (-0.77z)| norm 0.2830 (+0.15z)| lr 7.59e-05 | 4158.49 ms | 32.5% bf16 MFU | 123860 tok/s step 15196/19560 | loss 3.294191 (-0.01z)| norm 0.2667 (-0.18z)| lr 7.59e-05 | 4171.94 ms | 32.4% bf16 MFU | 123950 tok/s step 15197/19560 | loss 3.255492 (-1.07z)| norm 0.2535 (-0.45z)| lr 7.58e-05 | 4160.57 ms | 32.5% bf16 MFU | 124053 tok/s step 15198/19560 | loss 3.290744 (-0.08z)| norm 0.2662 (-0.19z)| lr 7.58e-05 | 4187.03 ms | 32.2% bf16 MFU | 124112 tok/s step 15199/19560 | loss 3.348356 (+1.54z)| norm 0.2562 (-0.39z)| lr 7.58e-05 | 4166.32 ms | 32.4% bf16 MFU | 124198 tok/s step 15200/19560 | loss 3.343330 (+1.38z)| norm 0.2612 (-0.28z)| lr 7.57e-05 | 4151.38 ms | 32.5% bf16 MFU | 124303 tok/s step 15201/19560 | loss 3.268337 (-0.71z)| norm 0.2701 (-0.10z)| lr 7.57e-05 | 4163.60 ms | 32.4% bf16 MFU | 124384 tok/s step 15202/19560 | loss 3.320295 (+0.81z)| norm 0.2662 (-0.40z)| lr 7.57e-05 | 4167.38 ms | 32.4% bf16 MFU | 124455 tok/s step 
15203/19560 | loss 3.337105 (+1.28z)| norm 0.2654 (-0.47z)| lr 7.56e-05 | 4176.28 ms | 32.3% bf16 MFU | 124509 tok/s step 15204/19560 | loss 3.310436 (+0.49z)| norm 0.2605 (-0.89z)| lr 7.56e-05 | 4158.78 ms | 32.5% bf16 MFU | 124587 tok/s step 15205/19560 | loss 3.258703 (-1.01z)| norm 0.2666 (-0.34z)| lr 7.56e-05 | 4173.45 ms | 32.4% bf16 MFU | 124639 tok/s step 15206/19560 | loss 3.284312 (-0.25z)| norm 0.2632 (-0.63z)| lr 7.55e-05 | 4167.95 ms | 32.4% bf16 MFU | 124696 tok/s step 15207/19560 | loss 3.308386 (+0.45z)| norm 0.2654 (-0.43z)| lr 7.55e-05 | 4159.75 ms | 32.5% bf16 MFU | 124764 tok/s step 15208/19560 | loss 3.340054 (+1.36z)| norm 0.2806 (+0.92z)| lr 7.55e-05 | 4175.74 ms | 32.3% bf16 MFU | 124803 tok/s step 15209/19560 | loss 3.290383 (-0.10z)| norm 0.2642 (-0.54z)| lr 7.54e-05 | 4183.31 ms | 32.3% bf16 MFU | 124829 tok/s step 15210/19560 | loss 3.353712 (+1.73z)| norm 0.2867 (+1.44z)| lr 7.54e-05 | 4179.21 ms | 32.3% bf16 MFU | 124861 tok/s step 15211/19560 | loss 3.267305 (-0.78z)| norm 0.2779 (+0.67z)| lr 7.54e-05 | 4166.04 ms | 32.4% bf16 MFU | 124910 tok/s step 15212/19560 | loss 3.384662 (+2.54z)| norm 0.2787 (+0.73z)| lr 7.53e-05 | 4176.07 ms | 32.3% bf16 MFU | 124942 tok/s step 15213/19560 | loss 3.236494 (-1.62z)| norm 0.2883 (+1.56z)| lr 7.53e-05 | 4167.92 ms | 32.4% bf16 MFU | 124984 tok/s step 15214/19560 | loss 3.244046 (-1.39z)| norm 0.2593 (-0.97z)| lr 7.53e-05 | 4169.85 ms | 32.4% bf16 MFU | 125022 tok/s step 15215/19560 | loss 3.264642 (-0.80z)| norm 0.2725 (+0.18z)| lr 7.52e-05 | 4167.69 ms | 32.4% bf16 MFU | 125060 tok/s step 15216/19560 | loss 3.351536 (+1.58z)| norm 0.2765 (+0.54z)| lr 7.52e-05 | 4165.10 ms | 32.4% bf16 MFU | 125101 tok/s step 15217/19560 | loss 3.407318 (+2.99z)| norm 0.2675 (-0.26z)| lr 7.52e-05 | 4173.81 ms | 32.3% bf16 MFU | 125127 tok/s step 15218/19560 | loss 3.363800 (+1.79z)| norm 0.2769 (+0.56z)| lr 7.51e-05 | 4169.36 ms | 32.4% bf16 MFU | 125158 tok/s step 15219/19560 | loss 3.296033 (+0.01z)| norm 
0.2624 (-0.72z)| lr 7.51e-05 | 4190.41 ms | 32.2% bf16 MFU | 125156 tok/s step 15220/19560 | loss 3.376937 (+2.10z)| norm 0.2669 (-0.33z)| lr 7.51e-05 | 4161.64 ms | 32.4% bf16 MFU | 125197 tok/s step 15221/19560 | loss 3.277663 (-0.48z)| norm 0.2667 (-0.35z)| lr 7.50e-05 | 4176.63 ms | 32.3% bf16 MFU | 125214 tok/s step 15222/19560 | loss 3.381672 (+2.19z)| norm 0.2657 (-0.43z)| lr 7.50e-05 | 4169.44 ms | 32.4% bf16 MFU | 125240 tok/s step 15223/19560 | loss 3.326125 (+0.75z)| norm 0.2577 (-1.13z)| lr 7.50e-05 | 4170.01 ms | 32.4% bf16 MFU | 125265 tok/s step 15224/19560 | loss 3.364205 (+1.70z)| norm 0.2765 (+0.56z)| lr 7.49e-05 | 4180.42 ms | 32.3% bf16 MFU | 125272 tok/s step 15225/19560 | loss 3.312536 (+0.36z)| norm 0.2762 (+0.52z)| lr 7.49e-05 | 4168.06 ms | 32.4% bf16 MFU | 125298 tok/s step 15226/19560 | loss 3.312269 (+0.35z)| norm 0.2691 (-0.12z)| lr 7.49e-05 | 4175.25 ms | 32.3% bf16 MFU | 125312 tok/s step 15227/19560 | loss 3.303167 (+0.11z)| norm 0.2710 (+0.04z)| lr 7.48e-05 | 4180.39 ms | 32.3% bf16 MFU | 125317 tok/s step 15228/19560 | loss 3.308592 (+0.24z)| norm 0.2900 (+1.76z)| lr 7.48e-05 | 4173.02 ms | 32.4% bf16 MFU | 125333 tok/s step 15229/19560 | loss 3.303826 (+0.13z)| norm 0.2844 (+1.27z)| lr 7.48e-05 | 4169.75 ms | 32.4% bf16 MFU | 125353 tok/s step 15230/19560 | loss 3.247602 (-1.36z)| norm 0.2678 (-0.23z)| lr 7.47e-05 | 4190.82 ms | 32.2% bf16 MFU | 125341 tok/s step 15231/19560 | loss 3.330881 (+0.84z)| norm 0.2762 (+0.55z)| lr 7.47e-05 | 4159.61 ms | 32.5% bf16 MFU | 125376 tok/s step 15232/19560 | loss 3.294528 (-0.12z)| norm 0.2649 (-0.48z)| lr 7.47e-05 | 4173.62 ms | 32.4% bf16 MFU | 125388 tok/s step 15233/19560 | loss 3.344280 (+1.19z)| norm 0.2952 (+2.24z)| lr 7.46e-05 | 4160.98 ms | 32.4% bf16 MFU | 125418 tok/s step 15234/19560 | loss 3.360834 (+1.61z)| norm 0.2816 (+1.00z)| lr 7.46e-05 | 4171.95 ms | 32.4% bf16 MFU | 125431 tok/s step 15235/19560 | loss 3.322465 (+0.60z)| norm 0.2639 (-0.59z)| lr 7.46e-05 | 4176.35 ms | 
32.3% bf16 MFU | 125436 tok/s step 15236/19560 | loss 3.332632 (+0.87z)| norm 0.2644 (-0.54z)| lr 7.45e-05 | 4170.52 ms | 32.4% bf16 MFU | 125450 tok/s step 15237/19560 | loss 3.350002 (+1.30z)| norm 0.2661 (-0.38z)| lr 7.45e-05 | 4174.44 ms | 32.3% bf16 MFU | 125457 tok/s step 15238/19560 | loss 3.304044 (+0.09z)| norm 0.2718 (+0.13z)| lr 7.45e-05 | 4177.16 ms | 32.3% bf16 MFU | 125460 tok/s step 15239/19560 | loss 3.323910 (+0.60z)| norm 0.2719 (+0.14z)| lr 7.44e-05 | 4161.15 ms | 32.4% bf16 MFU | 125487 tok/s step 15240/19560 | loss 3.317188 (+0.44z)| norm 0.2665 (-0.34z)| lr 7.44e-05 | 4161.38 ms | 32.4% bf16 MFU | 125512 tok/s step 15241/19560 | loss 3.290913 (-0.27z)| norm 0.2663 (-0.36z)| lr 7.44e-05 | 4162.65 ms | 32.4% bf16 MFU | 125534 tok/s step 15242/19560 | loss 3.329582 (+0.77z)| norm 0.2807 (+0.93z)| lr 7.43e-05 | 4168.53 ms | 32.4% bf16 MFU | 125546 tok/s step 15243/19560 | loss 3.295161 (-0.16z)| norm 0.2637 (-0.61z)| lr 7.43e-05 | 4167.85 ms | 32.4% bf16 MFU | 125558 tok/s step 15244/19560 | loss 3.394516 (+2.44z)| norm 0.2749 (+0.40z)| lr 7.43e-05 | 4178.02 ms | 32.3% bf16 MFU | 125555 tok/s step 15245/19560 | loss 3.266605 (-0.93z)| norm 0.2676 (-0.26z)| lr 7.42e-05 | 4175.77 ms | 32.3% bf16 MFU | 125555 tok/s step 15246/19560 | loss 3.298917 (-0.08z)| norm 0.2790 (+0.77z)| lr 7.42e-05 | 4176.61 ms | 32.3% bf16 MFU | 125553 tok/s step 15247/19560 | loss 3.344053 (+1.09z)| norm 0.2669 (-0.34z)| lr 7.42e-05 | 4169.86 ms | 32.4% bf16 MFU | 125562 tok/s step 15248/19560 | loss 3.276656 (-0.70z)| norm 0.2648 (-0.53z)| lr 7.41e-05 | 4167.56 ms | 32.4% bf16 MFU | 125574 tok/s step 15249/19560 | loss 3.292253 (-0.28z)| norm 0.2618 (-0.79z)| lr 7.41e-05 | 4168.52 ms | 32.4% bf16 MFU | 125584 tok/s step 15250/19560 | loss 3.379648 (+2.02z)| norm 0.2658 (-0.44z)| lr 7.41e-05 | 4168.41 ms | 32.4% bf16 MFU | 125594 tok/s val loss 3.293086 evaluating HellaSwag: 0/1256 evaluating HellaSwag: 10/1256 evaluating HellaSwag: 20/1256 evaluating HellaSwag: 30/1256 
evaluating HellaSwag: 40/1256 evaluating HellaSwag: 50/1256 evaluating HellaSwag: 60/1256 evaluating HellaSwag: 70/1256 evaluating HellaSwag: 80/1256 evaluating HellaSwag: 90/1256 evaluating HellaSwag: 100/1256 evaluating HellaSwag: 110/1256 evaluating HellaSwag: 120/1256 evaluating HellaSwag: 130/1256 evaluating HellaSwag: 140/1256 evaluating HellaSwag: 150/1256 evaluating HellaSwag: 160/1256 evaluating HellaSwag: 170/1256 evaluating HellaSwag: 180/1256 evaluating HellaSwag: 190/1256 evaluating HellaSwag: 200/1256 evaluating HellaSwag: 210/1256 evaluating HellaSwag: 220/1256 evaluating HellaSwag: 230/1256 evaluating HellaSwag: 240/1256 evaluating HellaSwag: 250/1256 evaluating HellaSwag: 260/1256 evaluating HellaSwag: 270/1256 evaluating HellaSwag: 280/1256 evaluating HellaSwag: 290/1256 evaluating HellaSwag: 300/1256 evaluating HellaSwag: 310/1256 evaluating HellaSwag: 320/1256 evaluating HellaSwag: 330/1256 evaluating HellaSwag: 340/1256 evaluating HellaSwag: 350/1256 evaluating HellaSwag: 360/1256 evaluating HellaSwag: 370/1256 evaluating HellaSwag: 380/1256 evaluating HellaSwag: 390/1256 evaluating HellaSwag: 400/1256 evaluating HellaSwag: 410/1256 evaluating HellaSwag: 420/1256 evaluating HellaSwag: 430/1256 evaluating HellaSwag: 440/1256 evaluating HellaSwag: 450/1256 evaluating HellaSwag: 460/1256 evaluating HellaSwag: 470/1256 evaluating HellaSwag: 480/1256 evaluating HellaSwag: 490/1256 evaluating HellaSwag: 500/1256 evaluating HellaSwag: 510/1256 evaluating HellaSwag: 520/1256 evaluating HellaSwag: 530/1256 evaluating HellaSwag: 540/1256 evaluating HellaSwag: 550/1256 evaluating HellaSwag: 560/1256 evaluating HellaSwag: 570/1256 evaluating HellaSwag: 580/1256 evaluating HellaSwag: 590/1256 evaluating HellaSwag: 600/1256 evaluating HellaSwag: 610/1256 evaluating HellaSwag: 620/1256 evaluating HellaSwag: 630/1256 evaluating HellaSwag: 640/1256 evaluating HellaSwag: 650/1256 evaluating HellaSwag: 660/1256 evaluating HellaSwag: 670/1256 evaluating HellaSwag: 
680/1256 evaluating HellaSwag: 690/1256 evaluating HellaSwag: 700/1256 evaluating HellaSwag: 710/1256 evaluating HellaSwag: 720/1256 evaluating HellaSwag: 730/1256 evaluating HellaSwag: 740/1256 evaluating HellaSwag: 750/1256 evaluating HellaSwag: 760/1256 evaluating HellaSwag: 770/1256 evaluating HellaSwag: 780/1256 evaluating HellaSwag: 790/1256 evaluating HellaSwag: 800/1256 evaluating HellaSwag: 810/1256 evaluating HellaSwag: 820/1256 evaluating HellaSwag: 830/1256 evaluating HellaSwag: 840/1256 evaluating HellaSwag: 850/1256 evaluating HellaSwag: 860/1256 evaluating HellaSwag: 870/1256 evaluating HellaSwag: 880/1256 evaluating HellaSwag: 890/1256 evaluating HellaSwag: 900/1256 evaluating HellaSwag: 910/1256 evaluating HellaSwag: 920/1256 evaluating HellaSwag: 930/1256 evaluating HellaSwag: 940/1256 evaluating HellaSwag: 950/1256 evaluating HellaSwag: 960/1256 evaluating HellaSwag: 970/1256 evaluating HellaSwag: 980/1256 evaluating HellaSwag: 990/1256 evaluating HellaSwag: 1000/1256 evaluating HellaSwag: 1010/1256 evaluating HellaSwag: 1020/1256 evaluating HellaSwag: 1030/1256 evaluating HellaSwag: 1040/1256 evaluating HellaSwag: 1050/1256 evaluating HellaSwag: 1060/1256 evaluating HellaSwag: 1070/1256 evaluating HellaSwag: 1080/1256 evaluating HellaSwag: 1090/1256 evaluating HellaSwag: 1100/1256 evaluating HellaSwag: 1110/1256 evaluating HellaSwag: 1120/1256 evaluating HellaSwag: 1130/1256 evaluating HellaSwag: 1140/1256 evaluating HellaSwag: 1150/1256 evaluating HellaSwag: 1160/1256 evaluating HellaSwag: 1170/1256 evaluating HellaSwag: 1180/1256 evaluating HellaSwag: 1190/1256 evaluating HellaSwag: 1200/1256 evaluating HellaSwag: 1210/1256 evaluating HellaSwag: 1220/1256 evaluating HellaSwag: 1230/1256 evaluating HellaSwag: 1240/1256 evaluating HellaSwag: 1250/1256 HellaSwag: 2996/10042 = 0.298347 step 15251/19560 | loss 3.335257 (+0.83z)| norm 0.2694 (-0.11z)| lr 7.41e-05 | 4156.09 ms | 32.5% bf16 MFU | 125622 tok/s step 15252/19560 | loss 3.339286 (+0.93z)| 
norm 0.2667 (-0.35z)| lr 7.40e-05 | 4177.83 ms | 32.3% bf16 MFU | 125615 tok/s step 15253/19560 | loss 3.330402 (+0.68z)| norm 0.2645 (-0.57z)| lr 7.40e-05 | 4165.96 ms | 32.4% bf16 MFU | 125627 tok/s step 15254/19560 | loss 3.364158 (+1.55z)| norm 0.2660 (-0.43z)| lr 7.40e-05 | 4157.93 ms | 32.5% bf16 MFU | 125650 tok/s step 15255/19560 | loss 3.326196 (+0.55z)| norm 0.2639 (-0.63z)| lr 7.39e-05 | 4162.15 ms | 32.4% bf16 MFU | 125666 tok/s step 15256/19560 | loss 3.299790 (-0.16z)| norm 0.2729 (+0.20z)| lr 7.39e-05 | 4168.63 ms | 32.4% bf16 MFU | 125671 tok/s step 15257/19560 | loss 3.301504 (-0.11z)| norm 0.2633 (-0.68z)| lr 7.39e-05 | 4174.98 ms | 32.3% bf16 MFU | 125667 tok/s step 15258/19560 | loss 3.272665 (-0.86z)| norm 0.2821 (+1.06z)| lr 7.38e-05 | 4167.11 ms | 32.4% bf16 MFU | 125674 tok/s step 15259/19560 | loss 3.303683 (-0.04z)| norm 0.2524 (-1.67z)| lr 7.38e-05 | 4164.95 ms | 32.4% bf16 MFU | 125684 tok/s step 15260/19560 | loss 3.316148 (+0.28z)| norm 0.2496 (-1.89z)| lr 7.38e-05 | 4163.85 ms | 32.4% bf16 MFU | 125696 tok/s step 15261/19560 | loss 3.349795 (+1.16z)| norm 0.2732 (+0.30z)| lr 7.37e-05 | 4160.92 ms | 32.4% bf16 MFU | 125711 tok/s step 15262/19560 | loss 3.336446 (+0.80z)| norm 0.2645 (-0.51z)| lr 7.37e-05 | 4158.91 ms | 32.5% bf16 MFU | 125729 tok/s step 15263/19560 | loss 3.301841 (-0.10z)| norm 0.2740 (+0.39z)| lr 7.37e-05 | 4159.35 ms | 32.5% bf16 MFU | 125745 tok/s step 15264/19560 | loss 3.336600 (+0.81z)| norm 0.2538 (-1.49z)| lr 7.36e-05 | 4160.10 ms | 32.5% bf16 MFU | 125759 tok/s step 15265/19560 | loss 3.298131 (-0.21z)| norm 0.2756 (+0.55z)| lr 7.36e-05 | 4163.68 ms | 32.4% bf16 MFU | 125767 tok/s step 15266/19560 | loss 3.283743 (-0.59z)| norm 0.2631 (-0.61z)| lr 7.36e-05 | 4161.18 ms | 32.4% bf16 MFU | 125779 tok/s step 15267/19560 | loss 3.283644 (-0.59z)| norm 0.2575 (-1.13z)| lr 7.35e-05 | 4156.96 ms | 32.5% bf16 MFU | 125796 tok/s step 15268/19560 | loss 3.301100 (-0.13z)| norm 0.2698 (+0.04z)| lr 7.35e-05 | 4166.99 ms 
| 32.4% bf16 MFU | 125797 tok/s step 15269/19560 | loss 3.376387 (+1.90z)| norm 0.2855 (+1.50z)| lr 7.35e-05 | 4174.10 ms | 32.3% bf16 MFU | 125787 tok/s step 15270/19560 | loss 3.286734 (-0.55z)| norm 0.2580 (-1.11z)| lr 7.34e-05 | 4169.65 ms | 32.4% bf16 MFU | 125785 tok/s step 15271/19560 | loss 3.324624 (+0.48z)| norm 0.2730 (+0.32z)| lr 7.34e-05 | 4163.68 ms | 32.4% bf16 MFU | 125792 tok/s step 15272/19560 | loss 3.270580 (-0.99z)| norm 0.2648 (-0.45z)| lr 7.34e-05 | 4164.51 ms | 32.4% bf16 MFU | 125797 tok/s step 15273/19560 | loss 3.321561 (+0.40z)| norm 0.2884 (+1.77z)| lr 7.33e-05 | 4156.88 ms | 32.5% bf16 MFU | 125813 tok/s step 15274/19560 | loss 3.323050 (+0.43z)| norm 0.2650 (-0.46z)| lr 7.33e-05 | 4168.14 ms | 32.4% bf16 MFU | 125812 tok/s step 15275/19560 | loss 3.385006 (+2.08z)| norm 0.2797 (+0.94z)| lr 7.33e-05 | 4172.27 ms | 32.4% bf16 MFU | 125804 tok/s step 15276/19560 | loss 3.467628 (+4.02z)| norm 0.3130 (+3.87z)| lr 7.32e-05 | 4171.70 ms | 32.4% bf16 MFU | 125798 tok/s step 15277/19560 | loss 3.290152 (-0.50z)| norm 0.2580 (-1.10z)| lr 7.32e-05 | 4166.66 ms | 32.4% bf16 MFU | 125799 tok/s step 15278/19560 | loss 3.292368 (-0.44z)| norm 0.2874 (+1.53z)| lr 7.32e-05 | 4174.52 ms | 32.3% bf16 MFU | 125789 tok/s step 15279/19560 | loss 3.368401 (+1.48z)| norm 0.2909 (+1.80z)| lr 7.31e-05 | 4169.97 ms | 32.4% bf16 MFU | 125786 tok/s step 15280/19560 | loss 3.316327 (+0.13z)| norm 0.2705 (-0.00z)| lr 7.31e-05 | 4154.11 ms | 32.5% bf16 MFU | 125807 tok/s step 15281/19560 | loss 3.358624 (+1.21z)| norm 0.2759 (+0.47z)| lr 7.31e-05 | 4179.79 ms | 32.3% bf16 MFU | 125789 tok/s step 15282/19560 | loss 3.342635 (+0.79z)| norm 0.2467 (-2.08z)| lr 7.30e-05 | 4165.47 ms | 32.4% bf16 MFU | 125792 tok/s step 15283/19560 | loss 3.340952 (+0.74z)| norm 0.2905 (+1.80z)| lr 7.30e-05 | 4172.28 ms | 32.4% bf16 MFU | 125786 tok/s step 15284/19560 | loss 3.325686 (+0.34z)| norm 0.2648 (-0.51z)| lr 7.30e-05 | 4158.26 ms | 32.5% bf16 MFU | 125801 tok/s step 
15285/19560 | loss 3.296499 (-0.41z)| norm 0.2799 (+0.86z)| lr 7.29e-05 | 4175.77 ms | 32.3% bf16 MFU | 125788 tok/s step 15286/19560 | loss 3.335540 (+0.59z)| norm 0.2757 (+0.49z)| lr 7.29e-05 | 4171.55 ms | 32.4% bf16 MFU | 125783 tok/s step 15287/19560 | loss 3.261367 (-1.30z)| norm 0.2594 (-0.99z)| lr 7.29e-05 | 4165.73 ms | 32.4% bf16 MFU | 125787 tok/s step 15288/19560 | loss 3.268622 (-1.11z)| norm 0.2688 (-0.14z)| lr 7.28e-05 | 4173.90 ms | 32.3% bf16 MFU | 125778 tok/s step 15289/19560 | loss 3.227882 (-2.11z)| norm 0.2851 (+1.32z)| lr 7.28e-05 | 4168.47 ms | 32.4% bf16 MFU | 125778 tok/s step 15290/19560 | loss 3.300656 (-0.29z)| norm 0.2559 (-1.30z)| lr 7.28e-05 | 4161.10 ms | 32.4% bf16 MFU | 125789 tok/s step 15291/19560 | loss 3.424139 (+2.73z)| norm 0.3080 (+3.27z)| lr 7.27e-05 | 4159.90 ms | 32.5% bf16 MFU | 125801 tok/s step 15292/19560 | loss 3.250252 (-1.55z)| norm 0.2981 (+2.34z)| lr 7.27e-05 | 4171.91 ms | 32.4% bf16 MFU | 125795 tok/s step 15293/19560 | loss 3.314519 (+0.02z)| norm 0.2819 (+0.95z)| lr 7.27e-05 | 4156.03 ms | 32.5% bf16 MFU | 125812 tok/s step 15294/19560 | loss 3.372363 (+1.43z)| norm 0.2805 (+0.82z)| lr 7.26e-05 | 4172.42 ms | 32.4% bf16 MFU | 125805 tok/s step 15295/19560 | loss 3.321602 (+0.17z)| norm 0.2688 (-0.18z)| lr 7.26e-05 | 4160.18 ms | 32.5% bf16 MFU | 125816 tok/s step 15296/19560 | loss 3.249108 (-1.60z)| norm 0.2621 (-0.73z)| lr 7.26e-05 | 4182.57 ms | 32.3% bf16 MFU | 125792 tok/s step 15297/19560 | loss 3.327442 (+0.32z)| norm 0.2770 (+0.54z)| lr 7.25e-05 | 4151.56 ms | 32.5% bf16 MFU | 125817 tok/s step 15298/19560 | loss 3.252694 (-1.49z)| norm 0.2732 (+0.21z)| lr 7.25e-05 | 4171.55 ms | 32.4% bf16 MFU | 125810 tok/s step 15299/19560 | loss 3.357545 (+1.04z)| norm 0.2958 (+2.12z)| lr 7.25e-05 | 4168.55 ms | 32.4% bf16 MFU | 125808 tok/s step 15300/19560 | loss 3.258408 (-1.33z)| norm 0.2715 (+0.07z)| lr 7.24e-05 | 4164.33 ms | 32.4% bf16 MFU | 125813 tok/s step 15301/19560 | loss 3.384006 (+1.66z)| norm 
0.2603 (-0.92z)| lr 7.24e-05 | 4154.37 ms | 32.5% bf16 MFU | 125832 tok/s step 15302/19560 | loss 3.260694 (-1.29z)| norm 0.3009 (+2.60z)| lr 7.24e-05 | 4161.38 ms | 32.4% bf16 MFU | 125840 tok/s step 15303/19560 | loss 3.352619 (+0.90z)| norm 0.2483 (-1.92z)| lr 7.23e-05 | 4163.05 ms | 32.4% bf16 MFU | 125845 tok/s step 15304/19560 | loss 3.322293 (+0.17z)| norm 0.2749 (+0.34z)| lr 7.23e-05 | 4157.40 ms | 32.5% bf16 MFU | 125858 tok/s step 15305/19560 | loss 3.270294 (-1.07z)| norm 0.2761 (+0.44z)| lr 7.23e-05 | 4157.96 ms | 32.5% bf16 MFU | 125870 tok/s step 15306/19560 | loss 3.283760 (-0.74z)| norm 0.2670 (-0.34z)| lr 7.23e-05 | 4162.46 ms | 32.4% bf16 MFU | 125874 tok/s step 15307/19560 | loss 3.270513 (-1.05z)| norm 0.2571 (-1.19z)| lr 7.22e-05 | 4167.44 ms | 32.4% bf16 MFU | 125871 tok/s step 15308/19560 | loss 3.393682 (+1.85z)| norm 0.2808 (+0.83z)| lr 7.22e-05 | 4159.19 ms | 32.5% bf16 MFU | 125880 tok/s step 15309/19560 | loss 3.253648 (-1.42z)| norm 0.2563 (-1.25z)| lr 7.22e-05 | 4168.07 ms | 32.4% bf16 MFU | 125875 tok/s step 15310/19560 | loss 3.292555 (-0.52z)| norm 0.2821 (+0.93z)| lr 7.21e-05 | 4167.28 ms | 32.4% bf16 MFU | 125872 tok/s step 15311/19560 | loss 3.345222 (+0.71z)| norm 0.2804 (+0.77z)| lr 7.21e-05 | 4163.93 ms | 32.4% bf16 MFU | 125874 tok/s step 15312/19560 | loss 3.314339 (-0.04z)| norm 0.2929 (+1.80z)| lr 7.21e-05 | 4165.34 ms | 32.4% bf16 MFU | 125874 tok/s step 15313/19560 | loss 3.586362 (+5.59z)| norm 0.3150 (+3.50z)| lr 7.20e-05 | 4167.48 ms | 32.4% bf16 MFU | 125870 tok/s step 15314/19560 | loss 3.309939 (-0.17z)| norm 0.2753 (+0.27z)| lr 7.20e-05 | 4168.97 ms | 32.4% bf16 MFU | 125865 tok/s step 15315/19560 | loss 3.329866 (+0.24z)| norm 0.3036 (+2.50z)| lr 7.20e-05 | 4158.48 ms | 32.5% bf16 MFU | 125876 tok/s step 15316/19560 | loss 3.246955 (-1.48z)| norm 0.2815 (+0.73z)| lr 7.19e-05 | 4160.05 ms | 32.5% bf16 MFU | 125883 tok/s step 15317/19560 | loss 3.380928 (+1.29z)| norm 0.3095 (+2.83z)| lr 7.19e-05 | 4160.52 ms | 
32.5% bf16 MFU | 125890 tok/s step 15318/19560 | loss 3.319885 (+0.03z)| norm 0.3057 (+2.48z)| lr 7.19e-05 | 4167.74 ms | 32.4% bf16 MFU | 125885 tok/s step 15319/19560 | loss 3.332523 (+0.29z)| norm 0.2859 (+0.97z)| lr 7.18e-05 | 4176.99 ms | 32.3% bf16 MFU | 125867 tok/s step 15320/19560 | loss 3.387478 (+1.40z)| norm 0.2952 (+1.64z)| lr 7.18e-05 | 4160.78 ms | 32.5% bf16 MFU | 125874 tok/s step 15321/19560 | loss 3.294024 (-0.51z)| norm 0.3049 (+2.30z)| lr 7.18e-05 | 4164.20 ms | 32.4% bf16 MFU | 125875 tok/s step 15322/19560 | loss 3.340077 (+0.44z)| norm 0.2769 (+0.25z)| lr 7.17e-05 | 4155.11 ms | 32.5% bf16 MFU | 125891 tok/s step 15323/19560 | loss 3.248840 (-1.43z)| norm 0.2893 (+1.14z)| lr 7.17e-05 | 4170.19 ms | 32.4% bf16 MFU | 125882 tok/s step 15324/19560 | loss 3.303842 (-0.31z)| norm 0.2970 (+1.67z)| lr 7.17e-05 | 4157.27 ms | 32.5% bf16 MFU | 125894 tok/s step 15325/19560 | loss 3.422628 (+2.07z)| norm 0.2824 (+0.61z)| lr 7.16e-05 | 4154.66 ms | 32.5% bf16 MFU | 125909 tok/s step 15326/19560 | loss 3.305931 (-0.29z)| norm 0.2823 (+0.59z)| lr 7.16e-05 | 4158.57 ms | 32.5% bf16 MFU | 125917 tok/s step 15327/19560 | loss 3.253349 (-1.33z)| norm 0.2748 (+0.04z)| lr 7.16e-05 | 4158.14 ms | 32.5% bf16 MFU | 125925 tok/s step 15328/19560 | loss 3.302994 (-0.33z)| norm 0.2635 (-0.79z)| lr 7.15e-05 | 4167.63 ms | 32.4% bf16 MFU | 125919 tok/s step 15329/19560 | loss 3.363196 (+0.87z)| norm 0.2735 (-0.06z)| lr 7.15e-05 | 4162.35 ms | 32.4% bf16 MFU | 125921 tok/s step 15330/19560 | loss 3.291297 (-0.57z)| norm 0.2674 (-0.51z)| lr 7.15e-05 | 4162.20 ms | 32.4% bf16 MFU | 125923 tok/s step 15331/19560 | loss 3.299745 (-0.40z)| norm 0.2655 (-0.64z)| lr 7.14e-05 | 4161.40 ms | 32.4% bf16 MFU | 125927 tok/s step 15332/19560 | loss 3.318776 (-0.01z)| norm 0.2576 (-1.22z)| lr 7.14e-05 | 4157.42 ms | 32.5% bf16 MFU | 125936 tok/s step 15333/19560 | loss 3.336304 (+0.33z)| norm 0.2715 (-0.21z)| lr 7.14e-05 | 4156.38 ms | 32.5% bf16 MFU | 125946 tok/s step 15334/19560 
| loss 3.393536 (+1.46z)| norm 0.2721 (-0.17z)| lr 7.13e-05 | 4167.03 ms | 32.4% bf16 MFU | 125940 tok/s step 15335/19560 | loss 3.351689 (+0.61z)| norm 0.2625 (-0.87z)| lr 7.13e-05 | 4167.75 ms | 32.4% bf16 MFU | 125932 tok/s step 15336/19560 | loss 3.301795 (-0.38z)| norm 0.2705 (-0.28z)| lr 7.13e-05 | 4159.32 ms | 32.5% bf16 MFU | 125938 tok/s step 15337/19560 | loss 3.281019 (-0.80z)| norm 0.2812 (+0.49z)| lr 7.12e-05 | 4175.21 ms | 32.3% bf16 MFU | 125920 tok/s step 15338/19560 | loss 3.340743 (+0.40z)| norm 0.2517 (-1.63z)| lr 7.12e-05 | 4172.80 ms | 32.4% bf16 MFU | 125906 tok/s step 15339/19560 | loss 3.348589 (+0.55z)| norm 0.2630 (-0.80z)| lr 7.12e-05 | 4197.08 ms | 32.2% bf16 MFU | 125857 tok/s step 15340/19560 | loss 3.310164 (-0.22z)| norm 0.2722 (-0.13z)| lr 7.11e-05 | 4173.36 ms | 32.4% bf16 MFU | 125845 tok/s step 15341/19560 | loss 3.331399 (+0.20z)| norm 0.2792 (+0.38z)| lr 7.11e-05 | 4172.03 ms | 32.4% bf16 MFU | 125836 tok/s step 15342/19560 | loss 3.336865 (+0.30z)| norm 0.2608 (-0.96z)| lr 7.11e-05 | 4387.90 ms | 30.8% bf16 MFU | 125519 tok/s step 15343/19560 | loss 3.256867 (-1.35z)| norm 0.2706 (-0.24z)| lr 7.11e-05 | 4172.95 ms | 32.4% bf16 MFU | 125525 tok/s step 15344/19560 | loss 3.368507 (+0.95z)| norm 0.2759 (+0.15z)| lr 7.10e-05 | 4157.15 ms | 32.5% bf16 MFU | 125554 tok/s step 15345/19560 | loss 3.323905 (+0.05z)| norm 0.2662 (-0.56z)| lr 7.10e-05 | 4151.27 ms | 32.5% bf16 MFU | 125592 tok/s step 15346/19560 | loss 3.359100 (+0.78z)| norm 0.2740 (+0.00z)| lr 7.10e-05 | 6694.75 ms | 20.2% bf16 MFU | 123228 tok/s step 15347/19560 | loss 3.309786 (-0.25z)| norm 0.2805 (+0.47z)| lr 7.09e-05 | 9922.75 ms | 13.6% bf16 MFU | 119708 tok/s step 15348/19560 | loss 3.279727 (-0.87z)| norm 0.2736 (-0.03z)| lr 7.09e-05 | 4140.82 ms | 32.6% bf16 MFU | 120053 tok/s step 15349/19560 | loss 3.366408 (+0.94z)| norm 0.2689 (-0.38z)| lr 7.09e-05 | 4154.80 ms | 32.5% bf16 MFU | 120360 tok/s step 15350/19560 | loss 3.394697 (+1.53z)| norm 0.2761 (+0.14z)| 
lr 7.08e-05 | 4134.80 ms | 32.7% bf16 MFU | 120682 tok/s step 15351/19560 | loss 3.362434 (+0.84z)| norm 0.2759 (+0.12z)| lr 7.08e-05 | 4146.50 ms | 32.6% bf16 MFU | 120970 tok/s step 15352/19560 | loss 3.281746 (-0.83z)| norm 0.2679 (-0.47z)| lr 7.08e-05 | 4163.95 ms | 32.4% bf16 MFU | 121217 tok/s step 15353/19560 | loss 3.242711 (-1.62z)| norm 0.2808 (+0.48z)| lr 7.07e-05 | 4137.92 ms | 32.6% bf16 MFU | 121491 tok/s step 15354/19560 | loss 3.287539 (-0.68z)| norm 0.2900 (+1.13z)| lr 7.07e-05 | 4146.86 ms | 32.6% bf16 MFU | 121738 tok/s step 15355/19560 | loss 3.302200 (-0.38z)| norm 0.2999 (+1.82z)| lr 7.07e-05 | 4150.19 ms | 32.5% bf16 MFU | 121968 tok/s step 15356/19560 | loss 3.365993 (+0.92z)| norm 0.2526 (-1.57z)| lr 7.06e-05 | 4158.70 ms | 32.5% bf16 MFU | 122173 tok/s step 15357/19560 | loss 3.308424 (-0.26z)| norm 0.2705 (-0.27z)| lr 7.06e-05 | 4148.82 ms | 32.5% bf16 MFU | 122383 tok/s step 15358/19560 | loss 3.235451 (-1.76z)| norm 0.2644 (-0.71z)| lr 7.06e-05 | 4151.38 ms | 32.5% bf16 MFU | 122578 tok/s step 15359/19560 | loss 3.322078 (+0.02z)| norm 0.2586 (-1.11z)| lr 7.05e-05 | 4148.50 ms | 32.5% bf16 MFU | 122768 tok/s step 15360/19560 | loss 3.353649 (+0.66z)| norm 0.2752 (+0.07z)| lr 7.05e-05 | 4145.44 ms | 32.6% bf16 MFU | 122954 tok/s step 15361/19560 | loss 3.309590 (-0.24z)| norm 0.2855 (+0.82z)| lr 7.05e-05 | 4216.98 ms | 32.0% bf16 MFU | 123022 tok/s step 15362/19560 | loss 3.285581 (-0.72z)| norm 0.2646 (-0.68z)| lr 7.04e-05 | 4543.03 ms | 29.7% bf16 MFU | 122641 tok/s step 15363/19560 | loss 3.272808 (-0.97z)| norm 0.2646 (-0.68z)| lr 7.04e-05 | 4160.78 ms | 32.4% bf16 MFU | 122810 tok/s step 15364/19560 | loss 3.248600 (-1.44z)| norm 0.2573 (-1.20z)| lr 7.04e-05 | 4207.13 ms | 32.1% bf16 MFU | 122900 tok/s step 15365/19560 | loss 3.256369 (-1.26z)| norm 0.2623 (-0.83z)| lr 7.03e-05 | 4156.44 ms | 32.5% bf16 MFU | 123062 tok/s step 15366/19560 | loss 3.291568 (-0.55z)| norm 0.2862 (+0.87z)| lr 7.03e-05 | 4157.04 ms | 32.5% bf16 MFU | 
123215 tok/s step 15367/19560 | loss 3.342341 (+0.47z)| norm 0.2461 (-1.95z)| lr 7.03e-05 | 4157.36 ms | 32.5% bf16 MFU | 123360 tok/s step 15368/19560 | loss 3.385339 (+1.32z)| norm 0.2744 (+0.04z)| lr 7.02e-05 | 4149.26 ms | 32.5% bf16 MFU | 123510 tok/s step 15369/19560 | loss 3.320575 (+0.02z)| norm 0.2738 (-0.01z)| lr 7.02e-05 | 4158.63 ms | 32.5% bf16 MFU | 123638 tok/s step 15370/19560 | loss 3.269519 (-0.99z)| norm 0.2533 (-1.43z)| lr 7.02e-05 | 4177.39 ms | 32.3% bf16 MFU | 123731 tok/s step 15371/19560 | loss 3.351166 (+0.63z)| norm 0.2939 (+1.39z)| lr 7.02e-05 | 4157.73 ms | 32.5% bf16 MFU | 123850 tok/s step 15372/19560 | loss 3.311135 (-0.16z)| norm 0.2629 (-0.76z)| lr 7.01e-05 | 4166.99 ms | 32.4% bf16 MFU | 123948 tok/s step 15373/19560 | loss 3.278198 (-0.82z)| norm 0.2645 (-0.65z)| lr 7.01e-05 | 4177.77 ms | 32.3% bf16 MFU | 124025 tok/s step 15374/19560 | loss 3.380095 (+1.21z)| norm 0.2844 (+0.73z)| lr 7.01e-05 | 4151.65 ms | 32.5% bf16 MFU | 124138 tok/s step 15375/19560 | loss 3.334606 (+0.30z)| norm 0.2736 (-0.03z)| lr 7.00e-05 | 4159.08 ms | 32.5% bf16 MFU | 124234 tok/s step 15376/19560 | loss 3.301962 (-0.36z)| norm 0.2725 (-0.10z)| lr 7.00e-05 | 4152.82 ms | 32.5% bf16 MFU | 124335 tok/s step 15377/19560 | loss 3.323291 (+0.07z)| norm 0.2881 (+0.97z)| lr 7.00e-05 | 4148.55 ms | 32.5% bf16 MFU | 124437 tok/s step 15378/19560 | loss 3.315097 (-0.09z)| norm 0.2747 (+0.03z)| lr 6.99e-05 | 4149.04 ms | 32.5% bf16 MFU | 124534 tok/s step 15379/19560 | loss 3.252551 (-1.34z)| norm 0.2679 (-0.44z)| lr 6.99e-05 | 4169.38 ms | 32.4% bf16 MFU | 124594 tok/s step 15380/19560 | loss 3.329323 (+0.21z)| norm 0.2660 (-0.58z)| lr 6.99e-05 | 4153.94 ms | 32.5% bf16 MFU | 124675 tok/s step 15381/19560 | loss 3.272871 (-0.91z)| norm 0.2781 (+0.26z)| lr 6.98e-05 | 4142.25 ms | 32.6% bf16 MFU | 124770 tok/s step 15382/19560 | loss 3.293598 (-0.49z)| norm 0.2579 (-1.14z)| lr 6.98e-05 | 4157.27 ms | 32.5% bf16 MFU | 124837 tok/s step 15383/19560 | loss 3.274415 
(-0.86z)| norm 0.2521 (-1.53z)| lr 6.98e-05 | 4147.87 ms | 32.6% bf16 MFU | 124915 tok/s step 15384/19560 | loss 3.296417 (-0.42z)| norm 0.2861 (+0.81z)| lr 6.97e-05 | 4151.87 ms | 32.5% bf16 MFU | 124983 tok/s step 15385/19560 | loss 3.265530 (-1.03z)| norm 0.2507 (-1.60z)| lr 6.97e-05 | 4153.50 ms | 32.5% bf16 MFU | 125046 tok/s step 15386/19560 | loss 3.267607 (-0.99z)| norm 0.2550 (-1.29z)| lr 6.97e-05 | 4243.48 ms | 31.8% bf16 MFU | 124971 tok/s step 15387/19560 | loss 3.269931 (-0.93z)| norm 0.2650 (-0.62z)| lr 6.96e-05 | 4152.66 ms | 32.5% bf16 MFU | 125035 tok/s step 15388/19560 | loss 3.300799 (-0.32z)| norm 0.2572 (-1.17z)| lr 6.96e-05 | 4146.11 ms | 32.6% bf16 MFU | 125106 tok/s step 15389/19560 | loss 3.321816 (+0.11z)| norm 0.2757 (+0.11z)| lr 6.96e-05 | 4150.48 ms | 32.5% bf16 MFU | 125167 tok/s step 15390/19560 | loss 3.283796 (-0.64z)| norm 0.2478 (-1.79z)| lr 6.95e-05 | 4178.74 ms | 32.3% bf16 MFU | 125182 tok/s step 15391/19560 | loss 3.262139 (-1.06z)| norm 0.2863 (+0.83z)| lr 6.95e-05 | 4158.61 ms | 32.5% bf16 MFU | 125226 tok/s step 15392/19560 | loss 3.281249 (-0.67z)| norm 0.2600 (-0.97z)| lr 6.95e-05 | 4155.22 ms | 32.5% bf16 MFU | 125274 tok/s step 15393/19560 | loss 3.290657 (-0.49z)| norm 0.2492 (-1.67z)| lr 6.94e-05 | 4149.84 ms | 32.5% bf16 MFU | 125327 tok/s step 15394/19560 | loss 3.277371 (-0.75z)| norm 0.2658 (-0.55z)| lr 6.94e-05 | 4154.91 ms | 32.5% bf16 MFU | 125370 tok/s step 15395/19560 | loss 3.284174 (-0.61z)| norm 0.2523 (-1.46z)| lr 6.94e-05 | 4156.00 ms | 32.5% bf16 MFU | 125409 tok/s step 15396/19560 | loss 3.313833 (-0.03z)| norm 0.2588 (-1.01z)| lr 6.94e-05 | 4155.19 ms | 32.5% bf16 MFU | 125447 tok/s step 15397/19560 | loss 3.307757 (-0.14z)| norm 0.2573 (-1.09z)| lr 6.93e-05 | 4145.52 ms | 32.6% bf16 MFU | 125499 tok/s step 15398/19560 | loss 3.333982 (+0.37z)| norm 0.2764 (+0.17z)| lr 6.93e-05 | 4151.08 ms | 32.5% bf16 MFU | 125539 tok/s step 15399/19560 | loss 3.309447 (-0.11z)| norm 0.2623 (-0.76z)| lr 6.93e-05 | 
4152.70 ms | 32.5% bf16 MFU | 125574 tok/s step 15400/19560 | loss 3.294721 (-0.41z)| norm 0.2537 (-1.33z)| lr 6.92e-05 | 4148.52 ms | 32.5% bf16 MFU | 125615 tok/s step 15401/19560 | loss 3.297522 (-0.35z)| norm 0.2535 (-1.32z)| lr 6.92e-05 | 4151.88 ms | 32.5% bf16 MFU | 125648 tok/s step 15402/19560 | loss 3.328351 (+0.26z)| norm 0.2612 (-0.81z)| lr 6.92e-05 | 4138.64 ms | 32.6% bf16 MFU | 125699 tok/s step 15403/19560 | loss 3.337439 (+0.46z)| norm 0.2580 (-1.00z)| lr 6.91e-05 | 4154.36 ms | 32.5% bf16 MFU | 125725 tok/s step 15404/19560 | loss 3.322903 (+0.19z)| norm 0.2530 (-1.32z)| lr 6.91e-05 | 4158.67 ms | 32.5% bf16 MFU | 125742 tok/s step 15405/19560 | loss 3.294682 (-0.40z)| norm 0.2725 (-0.02z)| lr 6.91e-05 | 4141.78 ms | 32.6% bf16 MFU | 125784 tok/s step 15406/19560 | loss 3.344656 (+0.64z)| norm 0.2696 (-0.21z)| lr 6.90e-05 | 4153.42 ms | 32.5% bf16 MFU | 125806 tok/s step 15407/19560 | loss 3.318305 (+0.10z)| norm 0.2490 (-1.58z)| lr 6.90e-05 | 4154.95 ms | 32.5% bf16 MFU | 125825 tok/s step 15408/19560 | loss 3.284243 (-0.61z)| norm 0.2718 (-0.04z)| lr 6.90e-05 | 4144.99 ms | 32.6% bf16 MFU | 125858 tok/s step 15409/19560 | loss 3.209406 (-2.12z)| norm 0.2629 (-0.63z)| lr 6.89e-05 | 4154.65 ms | 32.5% bf16 MFU | 125875 tok/s step 15410/19560 | loss 3.254715 (-1.17z)| norm 0.2766 (+0.28z)| lr 6.89e-05 | 4154.13 ms | 32.5% bf16 MFU | 125892 tok/s step 15411/19560 | loss 3.301947 (-0.19z)| norm 0.2640 (-0.57z)| lr 6.89e-05 | 4146.25 ms | 32.6% bf16 MFU | 125920 tok/s step 15412/19560 | loss 3.330323 (+0.39z)| norm 0.2597 (-0.86z)| lr 6.88e-05 | 4155.02 ms | 32.5% bf16 MFU | 125933 tok/s step 15413/19560 | loss 3.221994 (-1.80z)| norm 0.2809 (+0.59z)| lr 6.88e-05 | 4155.91 ms | 32.5% bf16 MFU | 125944 tok/s step 15414/19560 | loss 3.327229 (+0.34z)| norm 0.2752 (+0.20z)| lr 6.88e-05 | 4147.15 ms | 32.6% bf16 MFU | 125968 tok/s step 15415/19560 | loss 3.278272 (-0.66z)| norm 0.2642 (-0.56z)| lr 6.87e-05 | 4149.18 ms | 32.5% bf16 MFU | 125987 tok/s step 
15416/19560 | loss 3.341726 (+0.62z)| norm 0.2766 (+0.29z)| lr 6.87e-05 | 4176.07 ms | 32.3% bf16 MFU | 125965 tok/s step 15417/19560 | loss 3.299981 (-0.25z)| norm 0.2924 (+1.37z)| lr 6.87e-05 | 4156.14 ms | 32.5% bf16 MFU | 125974 tok/s step 15418/19560 | loss 3.337106 (+0.51z)| norm 0.2784 (+0.40z)| lr 6.86e-05 | 4163.00 ms | 32.4% bf16 MFU | 125973 tok/s step 15419/19560 | loss 3.280897 (-0.63z)| norm 0.2614 (-0.76z)| lr 6.86e-05 | 4151.58 ms | 32.5% bf16 MFU | 125988 tok/s step 15420/19560 | loss 3.257665 (-1.13z)| norm 0.3113 (+2.69z)| lr 6.86e-05 | 4147.92 ms | 32.6% bf16 MFU | 126009 tok/s step 15421/19560 | loss 3.288173 (-0.48z)| norm 0.2678 (-0.30z)| lr 6.86e-05 | 4154.90 ms | 32.5% bf16 MFU | 126018 tok/s step 15422/19560 | loss 3.292713 (-0.37z)| norm 0.2713 (-0.06z)| lr 6.85e-05 | 4148.11 ms | 32.5% bf16 MFU | 126036 tok/s step 15423/19560 | loss 3.277251 (-0.69z)| norm 0.3083 (+2.42z)| lr 6.85e-05 | 4192.92 ms | 32.2% bf16 MFU | 125987 tok/s step 15424/19560 | loss 3.284413 (-0.55z)| norm 0.2711 (-0.10z)| lr 6.85e-05 | 4159.78 ms | 32.5% bf16 MFU | 125989 tok/s step 15425/19560 | loss 3.253130 (-1.19z)| norm 0.2785 (+0.40z)| lr 6.84e-05 | 4153.35 ms | 32.5% bf16 MFU | 126001 tok/s step 15426/19560 | loss 3.279846 (-0.64z)| norm 0.2695 (-0.20z)| lr 6.84e-05 | 4147.12 ms | 32.6% bf16 MFU | 126022 tok/s step 15427/19560 | loss 3.251013 (-1.23z)| norm 0.2682 (-0.28z)| lr 6.84e-05 | 4165.76 ms | 32.4% bf16 MFU | 126014 tok/s step 15428/19560 | loss 3.306994 (-0.05z)| norm 0.2570 (-1.03z)| lr 6.83e-05 | 4151.47 ms | 32.5% bf16 MFU | 126028 tok/s step 15429/19560 | loss 3.317847 (+0.19z)| norm 0.2707 (-0.11z)| lr 6.83e-05 | 4143.25 ms | 32.6% bf16 MFU | 126053 tok/s step 15430/19560 | loss 3.252542 (-1.21z)| norm 0.2667 (-0.36z)| lr 6.83e-05 | 4155.72 ms | 32.5% bf16 MFU | 126059 tok/s step 15431/19560 | loss 3.293171 (-0.33z)| norm 0.2761 (+0.27z)| lr 6.82e-05 | 4150.62 ms | 32.5% bf16 MFU | 126072 tok/s step 15432/19560 | loss 3.354223 (+0.98z)| norm 
0.2686 (-0.25z)| lr 6.82e-05 | 4159.18 ms | 32.5% bf16 MFU | 126071 tok/s step 15433/19560 | loss 3.312409 (+0.08z)| norm 0.2702 (-0.13z)| lr 6.82e-05 | 4161.06 ms | 32.4% bf16 MFU | 126067 tok/s step 15434/19560 | loss 3.296675 (-0.27z)| norm 0.2850 (+0.89z)| lr 6.81e-05 | 4144.56 ms | 32.6% bf16 MFU | 126089 tok/s step 15435/19560 | loss 3.294385 (-0.32z)| norm 0.2810 (+0.60z)| lr 6.81e-05 | 4158.77 ms | 32.5% bf16 MFU | 126088 tok/s step 15436/19560 | loss 3.329305 (+0.45z)| norm 0.2608 (-0.80z)| lr 6.81e-05 | 4154.04 ms | 32.5% bf16 MFU | 126094 tok/s step 15437/19560 | loss 3.307170 (-0.04z)| norm 0.2907 (+1.26z)| lr 6.80e-05 | 4151.33 ms | 32.5% bf16 MFU | 126104 tok/s step 15438/19560 | loss 3.271010 (-0.83z)| norm 0.2391 (-2.27z)| lr 6.80e-05 | 4156.77 ms | 32.5% bf16 MFU | 126105 tok/s step 15439/19560 | loss 3.336494 (+0.61z)| norm 0.2614 (-0.73z)| lr 6.80e-05 | 4156.82 ms | 32.5% bf16 MFU | 126106 tok/s step 15440/19560 | loss 3.305252 (-0.08z)| norm 0.2643 (-0.52z)| lr 6.80e-05 | 4151.20 ms | 32.5% bf16 MFU | 126116 tok/s step 15441/19560 | loss 3.264042 (-1.10z)| norm 0.2727 (+0.08z)| lr 6.79e-05 | 4155.59 ms | 32.5% bf16 MFU | 126118 tok/s step 15442/19560 | loss 3.318765 (+0.32z)| norm 0.2544 (-1.21z)| lr 6.79e-05 | 4148.76 ms | 32.5% bf16 MFU | 126131 tok/s step 15443/19560 | loss 3.376200 (+1.78z)| norm 0.2829 (+0.84z)| lr 6.79e-05 | 4154.20 ms | 32.5% bf16 MFU | 126135 tok/s step 15444/19560 | loss 3.349672 (+1.09z)| norm 0.2754 (+0.31z)| lr 6.78e-05 | 4155.41 ms | 32.5% bf16 MFU | 126137 tok/s step 15445/19560 | loss 3.351533 (+1.15z)| norm 0.2588 (-0.89z)| lr 6.78e-05 | 4147.46 ms | 32.6% bf16 MFU | 126150 tok/s step 15446/19560 | loss 3.264529 (-1.10z)| norm 0.2690 (-0.11z)| lr 6.78e-05 | 4150.13 ms | 32.5% bf16 MFU | 126159 tok/s step 15447/19560 | loss 3.333021 (+0.68z)| norm 0.2707 (+0.03z)| lr 6.77e-05 | 4155.92 ms | 32.5% bf16 MFU | 126159 tok/s step 15448/19560 | loss 3.273935 (-0.84z)| norm 0.2583 (-0.91z)| lr 6.77e-05 | 4163.90 ms | 
32.4% bf16 MFU | 126147 tok/s step 15449/19560 | loss 3.305890 (-0.00z)| norm 0.2670 (-0.22z)| lr 6.77e-05 | 4153.66 ms | 32.5% bf16 MFU | 126151 tok/s step 15450/19560 | loss 3.273146 (-0.85z)| norm 0.2563 (-1.07z)| lr 6.76e-05 | 4150.80 ms | 32.5% bf16 MFU | 126159 tok/s step 15451/19560 | loss 3.415281 (+2.79z)| norm 0.2678 (-0.13z)| lr 6.76e-05 | 4146.92 ms | 32.6% bf16 MFU | 126172 tok/s step 15452/19560 | loss 3.274774 (-0.82z)| norm 0.2766 (+0.60z)| lr 6.76e-05 | 4145.49 ms | 32.6% bf16 MFU | 126187 tok/s step 15453/19560 | loss 3.353827 (+1.26z)| norm 0.2734 (+0.35z)| lr 6.75e-05 | 4151.09 ms | 32.5% bf16 MFU | 126193 tok/s step 15454/19560 | loss 3.326218 (+0.53z)| norm 0.2599 (-0.76z)| lr 6.75e-05 | 4149.98 ms | 32.5% bf16 MFU | 126200 tok/s step 15455/19560 | loss 3.309111 (+0.06z)| norm 0.2766 (+0.63z)| lr 6.75e-05 | 4152.93 ms | 32.5% bf16 MFU | 126202 tok/s step 15456/19560 | loss 3.282782 (-0.63z)| norm 0.2703 (+0.10z)| lr 6.74e-05 | 4156.14 ms | 32.5% bf16 MFU | 126199 tok/s step 15457/19560 | loss 3.284158 (-0.58z)| norm 0.2643 (-0.39z)| lr 6.74e-05 | 4164.97 ms | 32.4% bf16 MFU | 126183 tok/s step 15458/19560 | loss 3.304314 (-0.05z)| norm 0.2806 (+0.95z)| lr 6.74e-05 | 4157.07 ms | 32.5% bf16 MFU | 126180 tok/s step 15459/19560 | loss 3.282822 (-0.62z)| norm 0.2594 (-0.80z)| lr 6.73e-05 | 4153.35 ms | 32.5% bf16 MFU | 126183 tok/s step 15460/19560 | loss 3.355438 (+1.31z)| norm 0.2653 (-0.31z)| lr 6.73e-05 | 4142.41 ms | 32.6% bf16 MFU | 126202 tok/s step 15461/19560 | loss 3.264765 (-1.08z)| norm 0.2722 (+0.26z)| lr 6.73e-05 | 4148.59 ms | 32.5% bf16 MFU | 126211 tok/s step 15462/19560 | loss 3.316961 (+0.32z)| norm 0.2685 (-0.05z)| lr 6.73e-05 | 4162.68 ms | 32.4% bf16 MFU | 126198 tok/s step 15463/19560 | loss 3.292419 (-0.33z)| norm 0.2671 (-0.17z)| lr 6.72e-05 | 4154.01 ms | 32.5% bf16 MFU | 126199 tok/s step 15464/19560 | loss 3.330906 (+0.71z)| norm 0.2783 (+0.75z)| lr 6.72e-05 | 4149.74 ms | 32.5% bf16 MFU | 126206 tok/s step 15465/19560 
| loss 3.336102 (+0.84z)| norm 0.2746 (+0.45z)| lr 6.72e-05 | 4143.43 ms | 32.6% bf16 MFU | 126222 tok/s step 15466/19560 | loss 3.320570 (+0.42z)| norm 0.2720 (+0.23z)| lr 6.71e-05 | 4150.95 ms | 32.5% bf16 MFU | 126226 tok/s step 15467/19560 | loss 3.290265 (-0.39z)| norm 0.2909 (+1.76z)| lr 6.71e-05 | 4150.87 ms | 32.5% bf16 MFU | 126230 tok/s step 15468/19560 | loss 3.251019 (-1.44z)| norm 0.2827 (+1.07z)| lr 6.71e-05 | 4148.48 ms | 32.5% bf16 MFU | 126238 tok/s step 15469/19560 | loss 3.274808 (-0.78z)| norm 0.2587 (-0.88z)| lr 6.70e-05 | 4159.33 ms | 32.5% bf16 MFU | 126229 tok/s step 15470/19560 | loss 3.307824 (+0.12z)| norm 0.2774 (+0.65z)| lr 6.70e-05 | 4152.23 ms | 32.5% bf16 MFU | 126230 tok/s step 15471/19560 | loss 3.288408 (-0.42z)| norm 0.2683 (-0.11z)| lr 6.70e-05 | 4151.06 ms | 32.5% bf16 MFU | 126234 tok/s step 15472/19560 | loss 3.326355 (+0.64z)| norm 0.2950 (+2.05z)| lr 6.69e-05 | 4150.84 ms | 32.5% bf16 MFU | 126238 tok/s step 15473/19560 | loss 3.290061 (-0.36z)| norm 0.2733 (+0.28z)| lr 6.69e-05 | 4146.50 ms | 32.6% bf16 MFU | 126248 tok/s step 15474/19560 | loss 3.234076 (-1.88z)| norm 0.2532 (-1.32z)| lr 6.69e-05 | 4161.05 ms | 32.4% bf16 MFU | 126236 tok/s step 15475/19560 | loss 3.300272 (-0.05z)| norm 0.2632 (-0.50z)| lr 6.68e-05 | 4154.87 ms | 32.5% bf16 MFU | 126233 tok/s step 15476/19560 | loss 3.262948 (-1.07z)| norm 0.2665 (-0.23z)| lr 6.68e-05 | 4149.96 ms | 32.5% bf16 MFU | 126238 tok/s step 15477/19560 | loss 3.336666 (+0.97z)| norm 0.2723 (+0.23z)| lr 6.68e-05 | 4159.72 ms | 32.5% bf16 MFU | 126228 tok/s step 15478/19560 | loss 3.217539 (-2.31z)| norm 0.2609 (-0.68z)| lr 6.68e-05 | 4153.36 ms | 32.5% bf16 MFU | 126228 tok/s step 15479/19560 | loss 3.280857 (-0.53z)| norm 0.2651 (-0.34z)| lr 6.67e-05 | 4155.06 ms | 32.5% bf16 MFU | 126226 tok/s step 15480/19560 | loss 3.300580 (+0.02z)| norm 0.2612 (-0.65z)| lr 6.67e-05 | 4152.81 ms | 32.5% bf16 MFU | 126227 tok/s step 15481/19560 | loss 3.287200 (-0.37z)| norm 0.2736 (+0.36z)| 
lr 6.67e-05 | 4164.55 ms | 32.4% bf16 MFU | 126210 tok/s step 15482/19560 | loss 3.411593 (+3.04z)| norm 0.2695 (+0.05z)| lr 6.66e-05 | 4159.15 ms | 32.5% bf16 MFU | 126203 tok/s step 15483/19560 | loss 3.286298 (-0.40z)| norm 0.2463 (-1.84z)| lr 6.66e-05 | 4149.84 ms | 32.5% bf16 MFU | 126210 tok/s step 15484/19560 | loss 3.296123 (-0.12z)| norm 0.2589 (-0.80z)| lr 6.66e-05 | 4152.47 ms | 32.5% bf16 MFU | 126212 tok/s step 15485/19560 | loss 3.217962 (-2.23z)| norm 0.2849 (+1.34z)| lr 6.65e-05 | 4153.53 ms | 32.5% bf16 MFU | 126213 tok/s step 15486/19560 | loss 3.308195 (+0.22z)| norm 0.2518 (-1.38z)| lr 6.65e-05 | 4156.17 ms | 32.5% bf16 MFU | 126210 tok/s step 15487/19560 | loss 3.294014 (-0.17z)| norm 0.2670 (-0.14z)| lr 6.65e-05 | 4152.39 ms | 32.5% bf16 MFU | 126212 tok/s step 15488/19560 | loss 3.244581 (-1.51z)| norm 0.2667 (-0.15z)| lr 6.64e-05 | 4144.16 ms | 32.6% bf16 MFU | 126227 tok/s step 15489/19560 | loss 3.296725 (-0.07z)| norm 0.2577 (-0.88z)| lr 6.64e-05 | 4153.32 ms | 32.5% bf16 MFU | 126228 tok/s step 15490/19560 | loss 3.330802 (+0.86z)| norm 0.2758 (+0.60z)| lr 6.64e-05 | 4151.08 ms | 32.5% bf16 MFU | 126231 tok/s step 15491/19560 | loss 3.315714 (+0.44z)| norm 0.2612 (-0.60z)| lr 6.63e-05 | 4154.29 ms | 32.5% bf16 MFU | 126230 tok/s step 15492/19560 | loss 3.264320 (-0.99z)| norm 0.2712 (+0.21z)| lr 6.63e-05 | 4147.54 ms | 32.6% bf16 MFU | 126239 tok/s step 15493/19560 | loss 3.342520 (+1.16z)| norm 0.2754 (+0.55z)| lr 6.63e-05 | 4145.38 ms | 32.6% bf16 MFU | 126251 tok/s step 15494/19560 | loss 3.318913 (+0.50z)| norm 0.2624 (-0.51z)| lr 6.62e-05 | 4155.91 ms | 32.5% bf16 MFU | 126246 tok/s step 15495/19560 | loss 3.299502 (-0.03z)| norm 0.2655 (-0.26z)| lr 6.62e-05 | 4150.30 ms | 32.5% bf16 MFU | 126250 tok/s step 15496/19560 | loss 3.311516 (+0.33z)| norm 0.2698 (+0.10z)| lr 6.62e-05 | 4157.20 ms | 32.5% bf16 MFU | 126243 tok/s step 15497/19560 | loss 3.254461 (-1.28z)| norm 0.2736 (+0.43z)| lr 6.62e-05 | 4151.67 ms | 32.5% bf16 MFU | 
126245 tok/s step 15498/19560 | loss 3.338786 (+1.10z)| norm 0.2596 (-0.76z)| lr 6.61e-05 | 4164.31 ms | 32.4% bf16 MFU | 126228 tok/s step 15499/19560 | loss 3.278910 (-0.58z)| norm 0.2578 (-0.91z)| lr 6.61e-05 | 4159.51 ms | 32.5% bf16 MFU | 126219 tok/s step 15500/19560 | loss 3.318980 (+0.56z)| norm 0.2494 (-1.61z)| lr 6.61e-05 | 4166.58 ms | 32.4% bf16 MFU | 126199 tok/s val loss 3.288961 evaluating HellaSwag: 0/1256 evaluating HellaSwag: 10/1256 evaluating HellaSwag: 20/1256 evaluating HellaSwag: 30/1256 evaluating HellaSwag: 40/1256 evaluating HellaSwag: 50/1256 evaluating HellaSwag: 60/1256 evaluating HellaSwag: 70/1256 evaluating HellaSwag: 80/1256 evaluating HellaSwag: 90/1256 evaluating HellaSwag: 100/1256 evaluating HellaSwag: 110/1256 evaluating HellaSwag: 120/1256 evaluating HellaSwag: 130/1256 evaluating HellaSwag: 140/1256 evaluating HellaSwag: 150/1256 evaluating HellaSwag: 160/1256 evaluating HellaSwag: 170/1256 evaluating HellaSwag: 180/1256 evaluating HellaSwag: 190/1256 evaluating HellaSwag: 200/1256 evaluating HellaSwag: 210/1256 evaluating HellaSwag: 220/1256 evaluating HellaSwag: 230/1256 evaluating HellaSwag: 240/1256 evaluating HellaSwag: 250/1256 evaluating HellaSwag: 260/1256 evaluating HellaSwag: 270/1256 evaluating HellaSwag: 280/1256 evaluating HellaSwag: 290/1256 evaluating HellaSwag: 300/1256 evaluating HellaSwag: 310/1256 evaluating HellaSwag: 320/1256 evaluating HellaSwag: 330/1256 evaluating HellaSwag: 340/1256 evaluating HellaSwag: 350/1256 evaluating HellaSwag: 360/1256 evaluating HellaSwag: 370/1256 evaluating HellaSwag: 380/1256 evaluating HellaSwag: 390/1256 evaluating HellaSwag: 400/1256 evaluating HellaSwag: 410/1256 evaluating HellaSwag: 420/1256 evaluating HellaSwag: 430/1256 evaluating HellaSwag: 440/1256 evaluating HellaSwag: 450/1256 evaluating HellaSwag: 460/1256 evaluating HellaSwag: 470/1256 evaluating HellaSwag: 480/1256 evaluating HellaSwag: 490/1256 evaluating HellaSwag: 500/1256 evaluating HellaSwag: 510/1256 
evaluating HellaSwag: 520/1256 evaluating HellaSwag: 530/1256 evaluating HellaSwag: 540/1256 evaluating HellaSwag: 550/1256 evaluating HellaSwag: 560/1256 evaluating HellaSwag: 570/1256 evaluating HellaSwag: 580/1256 evaluating HellaSwag: 590/1256 evaluating HellaSwag: 600/1256 evaluating HellaSwag: 610/1256 evaluating HellaSwag: 620/1256 evaluating HellaSwag: 630/1256 evaluating HellaSwag: 640/1256 evaluating HellaSwag: 650/1256 evaluating HellaSwag: 660/1256 evaluating HellaSwag: 670/1256 evaluating HellaSwag: 680/1256 evaluating HellaSwag: 690/1256 evaluating HellaSwag: 700/1256 evaluating HellaSwag: 710/1256 evaluating HellaSwag: 720/1256 evaluating HellaSwag: 730/1256 evaluating HellaSwag: 740/1256 evaluating HellaSwag: 750/1256 evaluating HellaSwag: 760/1256 evaluating HellaSwag: 770/1256 evaluating HellaSwag: 780/1256 evaluating HellaSwag: 790/1256 evaluating HellaSwag: 800/1256 evaluating HellaSwag: 810/1256 evaluating HellaSwag: 820/1256 evaluating HellaSwag: 830/1256 evaluating HellaSwag: 840/1256 evaluating HellaSwag: 850/1256 evaluating HellaSwag: 860/1256 evaluating HellaSwag: 870/1256 evaluating HellaSwag: 880/1256 evaluating HellaSwag: 890/1256 evaluating HellaSwag: 900/1256 evaluating HellaSwag: 910/1256 evaluating HellaSwag: 920/1256 evaluating HellaSwag: 930/1256 evaluating HellaSwag: 940/1256 evaluating HellaSwag: 950/1256 evaluating HellaSwag: 960/1256 evaluating HellaSwag: 970/1256 evaluating HellaSwag: 980/1256 evaluating HellaSwag: 990/1256 evaluating HellaSwag: 1000/1256 evaluating HellaSwag: 1010/1256 evaluating HellaSwag: 1020/1256 evaluating HellaSwag: 1030/1256 evaluating HellaSwag: 1040/1256 evaluating HellaSwag: 1050/1256 evaluating HellaSwag: 1060/1256 evaluating HellaSwag: 1070/1256 evaluating HellaSwag: 1080/1256 evaluating HellaSwag: 1090/1256 evaluating HellaSwag: 1100/1256 evaluating HellaSwag: 1110/1256 evaluating HellaSwag: 1120/1256 evaluating HellaSwag: 1130/1256 evaluating HellaSwag: 1140/1256 evaluating HellaSwag: 1150/1256 
evaluating HellaSwag: 1160/1256 evaluating HellaSwag: 1170/1256 evaluating HellaSwag: 1180/1256 evaluating HellaSwag: 1190/1256 evaluating HellaSwag: 1200/1256 evaluating HellaSwag: 1210/1256 evaluating HellaSwag: 1220/1256 evaluating HellaSwag: 1230/1256 evaluating HellaSwag: 1240/1256 evaluating HellaSwag: 1250/1256 HellaSwag: 3021/10042 = 0.300836 step 15501/19560 | loss 3.320596 (+0.59z)| norm 0.2779 (+0.81z)| lr 6.60e-05 | 4151.50 ms | 32.5% bf16 MFU | 126204 tok/s step 15502/19560 | loss 3.246006 (-1.52z)| norm 0.2458 (-1.88z)| lr 6.60e-05 | 4151.87 ms | 32.5% bf16 MFU | 126208 tok/s step 15503/19560 | loss 3.252565 (-1.31z)| norm 0.2579 (-0.85z)| lr 6.60e-05 | 4155.74 ms | 32.5% bf16 MFU | 126205 tok/s step 15504/19560 | loss 3.321736 (+0.67z)| norm 0.2584 (-0.79z)| lr 6.59e-05 | 4151.95 ms | 32.5% bf16 MFU | 126209 tok/s step 15505/19560 | loss 3.312031 (+0.40z)| norm 0.2698 (+0.18z)| lr 6.59e-05 | 4154.02 ms | 32.5% bf16 MFU | 126209 tok/s step 15506/19560 | loss 3.291915 (-0.17z)| norm 0.2808 (+1.11z)| lr 6.59e-05 | 4159.33 ms | 32.5% bf16 MFU | 126201 tok/s step 15507/19560 | loss 3.263598 (-0.99z)| norm 0.2709 (+0.27z)| lr 6.58e-05 | 4149.39 ms | 32.5% bf16 MFU | 126209 tok/s step 15508/19560 | loss 3.305190 (+0.21z)| norm 0.2627 (-0.43z)| lr 6.58e-05 | 4149.95 ms | 32.5% bf16 MFU | 126215 tok/s step 15509/19560 | loss 3.248496 (-1.42z)| norm 0.2690 (+0.12z)| lr 6.58e-05 | 4147.28 ms | 32.6% bf16 MFU | 126225 tok/s step 15510/19560 | loss 3.347609 (+1.41z)| norm 0.2816 (+1.16z)| lr 6.57e-05 | 4146.99 ms | 32.6% bf16 MFU | 126235 tok/s step 15511/19560 | loss 3.300814 (+0.07z)| norm 0.2567 (-0.95z)| lr 6.57e-05 | 4151.48 ms | 32.5% bf16 MFU | 126238 tok/s step 15512/19560 | loss 3.297039 (-0.04z)| norm 0.2584 (-0.79z)| lr 6.57e-05 | 4152.07 ms | 32.5% bf16 MFU | 126240 tok/s step 15513/19560 | loss 3.298630 (+0.00z)| norm 0.2702 (+0.21z)| lr 6.57e-05 | 4160.77 ms | 32.5% bf16 MFU | 126228 tok/s step 15514/19560 | loss 3.314654 (+0.45z)| norm 0.2568 
(-0.96z)| lr 6.56e-05 | 4159.50 ms | 32.5% bf16 MFU | 126219 tok/s step 15515/19560 | loss 3.259796 (-1.12z)| norm 0.2514 (-1.40z)| lr 6.56e-05 | 4151.17 ms | 32.5% bf16 MFU | 126223 tok/s step 15516/19560 | loss 3.284181 (-0.42z)| norm 0.2722 (+0.38z)| lr 6.56e-05 | 4150.06 ms | 32.5% bf16 MFU | 126228 tok/s step 15517/19560 | loss 3.311522 (+0.37z)| norm 0.2586 (-0.78z)| lr 6.55e-05 | 4147.02 ms | 32.6% bf16 MFU | 126238 tok/s step 15518/19560 | loss 3.388452 (+2.49z)| norm 0.3093 (+3.40z)| lr 6.55e-05 | 4146.68 ms | 32.6% bf16 MFU | 126248 tok/s step 15519/19560 | loss 3.347431 (+1.32z)| norm 0.2797 (+0.96z)| lr 6.55e-05 | 4148.77 ms | 32.5% bf16 MFU | 126254 tok/s step 15520/19560 | loss 3.319734 (+0.54z)| norm 0.2823 (+1.16z)| lr 6.54e-05 | 4147.85 ms | 32.6% bf16 MFU | 126261 tok/s step 15521/19560 | loss 3.322799 (+0.62z)| norm 0.2709 (+0.21z)| lr 6.54e-05 | 4163.91 ms | 32.4% bf16 MFU | 126244 tok/s step 15522/19560 | loss 3.310641 (+0.27z)| norm 0.2605 (-0.67z)| lr 6.54e-05 | 4162.87 ms | 32.4% bf16 MFU | 126229 tok/s step 15523/19560 | loss 3.278586 (-0.62z)| norm 0.2566 (-1.00z)| lr 6.53e-05 | 4147.39 ms | 32.6% bf16 MFU | 126238 tok/s step 15524/19560 | loss 3.325656 (+0.69z)| norm 0.2750 (+0.54z)| lr 6.53e-05 | 4158.48 ms | 32.5% bf16 MFU | 126230 tok/s step 15525/19560 | loss 3.313803 (+0.36z)| norm 0.2755 (+0.57z)| lr 6.53e-05 | 4154.70 ms | 32.5% bf16 MFU | 126228 tok/s step 15526/19560 | loss 3.362878 (+1.70z)| norm 0.2733 (+0.39z)| lr 6.53e-05 | 4153.70 ms | 32.5% bf16 MFU | 126228 tok/s step 15527/19560 | loss 3.247765 (-1.45z)| norm 0.2784 (+0.80z)| lr 6.52e-05 | 4149.15 ms | 32.5% bf16 MFU | 126235 tok/s step 15528/19560 | loss 3.268409 (-0.88z)| norm 0.2626 (-0.53z)| lr 6.52e-05 | 4148.62 ms | 32.5% bf16 MFU | 126242 tok/s step 15529/19560 | loss 3.293387 (-0.20z)| norm 0.2743 (+0.44z)| lr 6.52e-05 | 4143.70 ms | 32.6% bf16 MFU | 126256 tok/s step 15530/19560 | loss 3.256327 (-1.19z)| norm 0.2609 (-0.70z)| lr 6.51e-05 | 4150.31 ms | 32.5% bf16 
MFU | 126259 tok/s step 15531/19560 | loss 3.269132 (-0.83z)| norm 0.2621 (-0.60z)| lr 6.51e-05 | 4161.50 ms | 32.4% bf16 MFU | 126246 tok/s step 15532/19560 | loss 3.359886 (+1.63z)| norm 0.2699 (+0.05z)| lr 6.51e-05 | 4149.97 ms | 32.5% bf16 MFU | 126250 tok/s step 15533/19560 | loss 3.272460 (-0.73z)| norm 0.2798 (+0.90z)| lr 6.50e-05 | 4149.84 ms | 32.5% bf16 MFU | 126255 tok/s step 15534/19560 | loss 3.290166 (-0.24z)| norm 0.2763 (+0.59z)| lr 6.50e-05 | 4153.38 ms | 32.5% bf16 MFU | 126253 tok/s step 15535/19560 | loss 3.292480 (-0.18z)| norm 0.2784 (+0.76z)| lr 6.50e-05 | 4175.62 ms | 32.3% bf16 MFU | 126219 tok/s step 15536/19560 | loss 3.263083 (-0.97z)| norm 0.2641 (-0.47z)| lr 6.49e-05 | 4143.58 ms | 32.6% bf16 MFU | 126234 tok/s step 15537/19560 | loss 3.297988 (-0.04z)| norm 0.2612 (-0.72z)| lr 6.49e-05 | 4163.12 ms | 32.4% bf16 MFU | 126219 tok/s step 15538/19560 | loss 3.313498 (+0.38z)| norm 0.2711 (+0.15z)| lr 6.49e-05 | 4209.21 ms | 32.1% bf16 MFU | 126136 tok/s step 15539/19560 | loss 3.256999 (-1.18z)| norm 0.2748 (+0.45z)| lr 6.48e-05 | 4155.51 ms | 32.5% bf16 MFU | 126138 tok/s step 15540/19560 | loss 3.290600 (-0.24z)| norm 0.2520 (-1.50z)| lr 6.48e-05 | 4150.65 ms | 32.5% bf16 MFU | 126147 tok/s step 15541/19560 | loss 3.339481 (+1.11z)| norm 0.2808 (+0.98z)| lr 6.48e-05 | 4157.88 ms | 32.5% bf16 MFU | 126144 tok/s step 15542/19560 | loss 3.307586 (+0.21z)| norm 0.2765 (+0.60z)| lr 6.48e-05 | 4151.21 ms | 32.5% bf16 MFU | 126152 tok/s step 15543/19560 | loss 3.306809 (+0.19z)| norm 0.2768 (+0.62z)| lr 6.47e-05 | 4154.85 ms | 32.5% bf16 MFU | 126154 tok/s step 15544/19560 | loss 3.286708 (-0.37z)| norm 0.2547 (-1.26z)| lr 6.47e-05 | 4150.91 ms | 32.5% bf16 MFU | 126161 tok/s step 15545/19560 | loss 3.330412 (+0.86z)| norm 0.2644 (-0.42z)| lr 6.47e-05 | 4157.30 ms | 32.5% bf16 MFU | 126159 tok/s step 15546/19560 | loss 3.312428 (+0.36z)| norm 0.2588 (-0.89z)| lr 6.46e-05 | 4160.13 ms | 32.5% bf16 MFU | 126152 tok/s step 15547/19560 | loss 
3.222709 (-2.15z)| norm 0.2646 (-0.39z)| lr 6.46e-05 | 4141.60 ms | 32.6% bf16 MFU | 126174 tok/s step 15548/19560 | loss 3.345146 (+1.26z)| norm 0.2469 (-1.96z)| lr 6.46e-05 | 4149.55 ms | 32.5% bf16 MFU | 126183 tok/s step 15549/19560 | loss 3.330629 (+0.84z)| norm 0.2756 (+0.63z)| lr 6.45e-05 | 4147.96 ms | 32.6% bf16 MFU | 126193 tok/s step 15550/19560 | loss 3.306467 (+0.17z)| norm 0.2831 (+1.29z)| lr 6.45e-05 | 4149.38 ms | 32.5% bf16 MFU | 126201 tok/s step 15551/19560 | loss 3.299947 (-0.02z)| norm 0.2509 (-1.63z)| lr 6.45e-05 | 4155.34 ms | 32.5% bf16 MFU | 126200 tok/s step 15552/19560 | loss 3.323932 (+0.64z)| norm 0.2686 (+0.03z)| lr 6.44e-05 | 4258.37 ms | 31.7% bf16 MFU | 126046 tok/s step 15553/19560 | loss 3.356705 (+1.53z)| norm 0.2641 (-0.38z)| lr 6.44e-05 | 4171.38 ms | 32.4% bf16 MFU | 126028 tok/s step 15554/19560 | loss 3.356694 (+1.50z)| norm 0.2526 (-1.44z)| lr 6.44e-05 | 4170.72 ms | 32.4% bf16 MFU | 126012 tok/s step 15555/19560 | loss 3.341930 (+1.08z)| norm 0.2587 (-0.85z)| lr 6.44e-05 | 4152.26 ms | 32.5% bf16 MFU | 126025 tok/s step 15556/19560 | loss 3.355783 (+1.44z)| norm 0.2486 (-1.77z)| lr 6.43e-05 | 4154.82 ms | 32.5% bf16 MFU | 126033 tok/s step 15557/19560 | loss 3.241973 (-1.66z)| norm 0.2554 (-1.13z)| lr 6.43e-05 | 4155.49 ms | 32.5% bf16 MFU | 126040 tok/s step 15558/19560 | loss 3.267693 (-0.97z)| norm 0.2571 (-0.97z)| lr 6.43e-05 | 4169.34 ms | 32.4% bf16 MFU | 126025 tok/s step 15559/19560 | loss 3.316451 (+0.36z)| norm 0.2742 (+0.59z)| lr 6.42e-05 | 4158.01 ms | 32.5% bf16 MFU | 126028 tok/s step 15560/19560 | loss 3.346355 (+1.18z)| norm 0.2658 (-0.17z)| lr 6.42e-05 | 4167.32 ms | 32.4% bf16 MFU | 126017 tok/s step 15561/19560 | loss 3.315127 (+0.33z)| norm 0.2399 (-2.45z)| lr 6.42e-05 | 4158.52 ms | 32.5% bf16 MFU | 126020 tok/s step 15562/19560 | loss 3.287096 (-0.44z)| norm 0.2569 (-0.92z)| lr 6.41e-05 | 4150.92 ms | 32.5% bf16 MFU | 126035 tok/s step 15563/19560 | loss 3.326563 (+0.63z)| norm 0.2676 (+0.05z)| lr 
6.41e-05 | 4157.66 ms | 32.5% bf16 MFU | 126038 tok/s step 15564/19560 | loss 3.382361 (+2.11z)| norm 0.2708 (+0.33z)| lr 6.41e-05 | 4150.87 ms | 32.5% bf16 MFU | 126051 tok/s step 15565/19560 | loss 3.315401 (+0.31z)| norm 0.2672 (+0.02z)| lr 6.40e-05 | 4163.01 ms | 32.4% bf16 MFU | 126046 tok/s step 15566/19560 | loss 3.450306 (+3.69z)| norm 0.2715 (+0.40z)| lr 6.40e-05 | 4149.58 ms | 32.5% bf16 MFU | 126061 tok/s step 15567/19560 | loss 3.345097 (+1.01z)| norm 0.2558 (-1.08z)| lr 6.40e-05 | 4156.04 ms | 32.5% bf16 MFU | 126065 tok/s step 15568/19560 | loss 3.324171 (+0.48z)| norm 0.2724 (+0.49z)| lr 6.39e-05 | 4159.04 ms | 32.5% bf16 MFU | 126065 tok/s step 15569/19560 | loss 3.312572 (+0.17z)| norm 0.2906 (+2.15z)| lr 6.39e-05 | 4161.82 ms | 32.4% bf16 MFU | 126061 tok/s step 15570/19560 | loss 3.351769 (+1.16z)| norm 0.2598 (-0.71z)| lr 6.39e-05 | 4165.42 ms | 32.4% bf16 MFU | 126051 tok/s step 15571/19560 | loss 3.313575 (+0.20z)| norm 0.2975 (+2.72z)| lr 6.39e-05 | 4157.12 ms | 32.5% bf16 MFU | 126054 tok/s step 15572/19560 | loss 3.341913 (+0.94z)| norm 0.2697 (+0.20z)| lr 6.38e-05 | 4162.75 ms | 32.4% bf16 MFU | 126049 tok/s step 15573/19560 | loss 3.258586 (-1.19z)| norm 0.2571 (-0.95z)| lr 6.38e-05 | 4153.44 ms | 32.5% bf16 MFU | 126058 tok/s step 15574/19560 | loss 3.308394 (+0.08z)| norm 0.2671 (-0.03z)| lr 6.38e-05 | 4166.63 ms | 32.4% bf16 MFU | 126047 tok/s step 15575/19560 | loss 3.336814 (+0.82z)| norm 0.2589 (-0.77z)| lr 6.37e-05 | 4152.05 ms | 32.5% bf16 MFU | 126058 tok/s step 15576/19560 | loss 3.338972 (+0.86z)| norm 0.2720 (+0.41z)| lr 6.37e-05 | 4154.27 ms | 32.5% bf16 MFU | 126065 tok/s step 15577/19560 | loss 3.395034 (+2.24z)| norm 0.2822 (+1.32z)| lr 6.37e-05 | 4162.94 ms | 32.4% bf16 MFU | 126059 tok/s step 15578/19560 | loss 3.272569 (-0.86z)| norm 0.2721 (+0.40z)| lr 6.36e-05 | 4147.10 ms | 32.6% bf16 MFU | 126077 tok/s step 15579/19560 | loss 3.362710 (+1.47z)| norm 0.2581 (-0.86z)| lr 6.36e-05 | 4165.14 ms | 32.4% bf16 MFU | 126067 
tok/s step 15580/19560 | loss 3.285448 (-0.53z)| norm 0.2603 (-0.65z)| lr 6.36e-05 | 4165.18 ms | 32.4% bf16 MFU | 126057 tok/s step 15581/19560 | loss 3.345649 (+1.03z)| norm 0.2700 (+0.23z)| lr 6.35e-05 | 4159.00 ms | 32.5% bf16 MFU | 126058 tok/s step 15582/19560 | loss 3.320121 (+0.37z)| norm 0.2604 (-0.64z)| lr 6.35e-05 | 4208.55 ms | 32.1% bf16 MFU | 125984 tok/s step 15583/19560 | loss 3.325636 (+0.51z)| norm 0.2694 (+0.18z)| lr 6.35e-05 | 4152.71 ms | 32.5% bf16 MFU | 125997 tok/s step 15584/19560 | loss 3.337882 (+0.81z)| norm 0.2556 (-1.06z)| lr 6.35e-05 | 4169.58 ms | 32.4% bf16 MFU | 125984 tok/s step 15585/19560 | loss 3.331030 (+0.63z)| norm 0.2655 (-0.17z)| lr 6.34e-05 | 4175.85 ms | 32.3% bf16 MFU | 125963 tok/s step 15586/19560 | loss 3.351718 (+1.15z)| norm 0.2655 (-0.16z)| lr 6.34e-05 | 4178.27 ms | 32.3% bf16 MFU | 125939 tok/s step 15587/19560 | loss 3.353061 (+1.16z)| norm 0.2800 (+1.14z)| lr 6.34e-05 | 4155.70 ms | 32.5% bf16 MFU | 125950 tok/s step 15588/19560 | loss 3.311524 (+0.10z)| norm 0.2537 (-1.22z)| lr 6.33e-05 | 4154.92 ms | 32.5% bf16 MFU | 125961 tok/s step 15589/19560 | loss 3.257787 (-1.28z)| norm 0.2451 (-1.95z)| lr 6.33e-05 | 4161.93 ms | 32.4% bf16 MFU | 125962 tok/s step 15590/19560 | loss 3.321713 (+0.37z)| norm 0.2652 (-0.16z)| lr 6.33e-05 | 4167.28 ms | 32.4% bf16 MFU | 125954 tok/s step 15591/19560 | loss 3.335241 (+0.71z)| norm 0.2627 (-0.38z)| lr 6.32e-05 | 4155.92 ms | 32.5% bf16 MFU | 125964 tok/s step 15592/19560 | loss 3.335588 (+0.71z)| norm 0.2698 (+0.25z)| lr 6.32e-05 | 4158.28 ms | 32.5% bf16 MFU | 125970 tok/s step 15593/19560 | loss 3.236548 (-1.80z)| norm 0.2519 (-1.31z)| lr 6.32e-05 | 4153.72 ms | 32.5% bf16 MFU | 125983 tok/s step 15594/19560 | loss 3.314554 (+0.19z)| norm 0.2624 (-0.38z)| lr 6.31e-05 | 4176.24 ms | 32.3% bf16 MFU | 125961 tok/s step 15595/19560 | loss 3.300575 (-0.17z)| norm 0.2704 (+0.35z)| lr 6.31e-05 | 4166.93 ms | 32.4% bf16 MFU | 125954 tok/s step 15596/19560 | loss 3.320625 
(+0.33z)| norm 0.2524 (-1.26z)| lr 6.31e-05 | 4148.71 ms | 32.5% bf16 MFU | 125975 tok/s step 15597/19560 | loss 3.233630 (-1.87z)| norm 0.2640 (-0.22z)| lr 6.31e-05 | 4153.80 ms | 32.5% bf16 MFU | 125987 tok/s step 15598/19560 | loss 3.521218 (+4.86z)| norm 0.3011 (+3.03z)| lr 6.30e-05 | 4207.71 ms | 32.1% bf16 MFU | 125918 tok/s step 15599/19560 | loss 3.326156 (+0.39z)| norm 0.2492 (-1.49z)| lr 6.30e-05 | 4170.83 ms | 32.4% bf16 MFU | 125907 tok/s step 15600/19560 | loss 3.350989 (+0.95z)| norm 0.2527 (-1.18z)| lr 6.30e-05 | 4158.03 ms | 32.5% bf16 MFU | 125916 tok/s step 15601/19560 | loss 3.383196 (+1.65z)| norm 0.2534 (-1.10z)| lr 6.29e-05 | 4162.75 ms | 32.4% bf16 MFU | 125918 tok/s step 15602/19560 | loss 3.300978 (-0.22z)| norm 0.2577 (-0.72z)| lr 6.29e-05 | 4219.19 ms | 32.0% bf16 MFU | 125835 tok/s step 15603/19560 | loss 3.315094 (+0.10z)| norm 0.2587 (-0.63z)| lr 6.29e-05 | 4173.18 ms | 32.4% bf16 MFU | 125825 tok/s step 15604/19560 | loss 3.310431 (-0.02z)| norm 0.2655 (-0.04z)| lr 6.28e-05 | 4154.55 ms | 32.5% bf16 MFU | 125843 tok/s step 15605/19560 | loss 3.317526 (+0.15z)| norm 0.2639 (-0.17z)| lr 6.28e-05 | 4175.22 ms | 32.3% bf16 MFU | 125830 tok/s step 15606/19560 | loss 3.372981 (+1.41z)| norm 0.2627 (-0.28z)| lr 6.28e-05 | 4170.25 ms | 32.4% bf16 MFU | 125824 tok/s step 15607/19560 | loss 3.334710 (+0.51z)| norm 0.2694 (+0.31z)| lr 6.28e-05 | 4171.87 ms | 32.4% bf16 MFU | 125817 tok/s step 15608/19560 | loss 3.317056 (+0.10z)| norm 0.2593 (-0.58z)| lr 6.27e-05 | 4162.34 ms | 32.4% bf16 MFU | 125824 tok/s step 15609/19560 | loss 3.320732 (+0.18z)| norm 0.2827 (+1.48z)| lr 6.27e-05 | 4162.39 ms | 32.4% bf16 MFU | 125831 tok/s step 15610/19560 | loss 3.283075 (-0.69z)| norm 0.2661 (+0.02z)| lr 6.27e-05 | 4163.40 ms | 32.4% bf16 MFU | 125836 tok/s step 15611/19560 | loss 3.303557 (-0.21z)| norm 0.2526 (-1.18z)| lr 6.26e-05 | 4155.06 ms | 32.5% bf16 MFU | 125853 tok/s step 15612/19560 | loss 3.252390 (-1.41z)| norm 0.2546 (-0.99z)| lr 6.26e-05 | 
4160.18 ms | 32.5% bf16 MFU | 125861 tok/s step 15613/19560 | loss 3.282557 (-0.72z)| norm 0.2626 (-0.28z)| lr 6.26e-05 | 4160.92 ms | 32.4% bf16 MFU | 125868 tok/s step 15614/19560 | loss 3.339454 (+0.65z)| norm 0.2552 (-0.95z)| lr 6.25e-05 | 4165.06 ms | 32.4% bf16 MFU | 125869 tok/s step 15615/19560 | loss 3.342486 (+0.71z)| norm 0.2545 (-1.00z)| lr 6.25e-05 | 4162.42 ms | 32.4% bf16 MFU | 125873 tok/s step 15616/19560 | loss 3.395590 (+1.95z)| norm 0.2734 (+0.68z)| lr 6.25e-05 | 4156.12 ms | 32.5% bf16 MFU | 125887 tok/s step 15617/19560 | loss 3.259435 (-1.29z)| norm 0.2808 (+1.32z)| lr 6.24e-05 | 4163.11 ms | 32.4% bf16 MFU | 125890 tok/s step 15618/19560 | loss 3.300210 (-0.32z)| norm 0.2709 (+0.44z)| lr 6.24e-05 | 4172.41 ms | 32.4% bf16 MFU | 125878 tok/s step 15619/19560 | loss 3.350646 (+0.87z)| norm 0.2958 (+2.56z)| lr 6.24e-05 | 4152.06 ms | 32.5% bf16 MFU | 125898 tok/s step 15620/19560 | loss 3.277067 (-0.88z)| norm 0.2881 (+1.86z)| lr 6.24e-05 | 4166.34 ms | 32.4% bf16 MFU | 125895 tok/s step 15621/19560 | loss 3.268299 (-1.07z)| norm 0.2617 (-0.38z)| lr 6.23e-05 | 4161.83 ms | 32.4% bf16 MFU | 125899 tok/s step 15622/19560 | loss 3.352877 (+0.93z)| norm 0.2755 (+0.79z)| lr 6.23e-05 | 4157.01 ms | 32.5% bf16 MFU | 125910 tok/s step 15623/19560 | loss 3.275424 (-0.90z)| norm 0.2598 (-0.55z)| lr 6.23e-05 | 4153.76 ms | 32.5% bf16 MFU | 125925 tok/s step 15624/19560 | loss 3.285215 (-0.66z)| norm 0.2715 (+0.45z)| lr 6.22e-05 | 4150.24 ms | 32.5% bf16 MFU | 125945 tok/s step 15625/19560 | loss 3.307337 (-0.15z)| norm 0.2761 (+0.83z)| lr 6.22e-05 | 4154.88 ms | 32.5% bf16 MFU | 125957 tok/s step 15626/19560 | loss 3.312455 (-0.02z)| norm 0.2851 (+1.57z)| lr 6.22e-05 | 4165.94 ms | 32.4% bf16 MFU | 125952 tok/s step 15627/19560 | loss 3.357506 (+1.03z)| norm 0.2821 (+1.29z)| lr 6.21e-05 | 4170.53 ms | 32.4% bf16 MFU | 125940 tok/s step 15628/19560 | loss 3.329365 (+0.36z)| norm 0.2721 (+0.45z)| lr 6.21e-05 | 4172.94 ms | 32.4% bf16 MFU | 125925 tok/s step 
15629/19560 | loss 3.269168 (-1.05z)| norm 0.2686 (+0.16z)| lr 6.21e-05 | 4176.71 ms | 32.3% bf16 MFU | 125905 tok/s step 15630/19560 | loss 3.212425 (-2.35z)| norm 0.2856 (+1.58z)| lr 6.20e-05 | 4154.40 ms | 32.5% bf16 MFU | 125920 tok/s step 15631/19560 | loss 3.285254 (-0.67z)| norm 0.2603 (-0.58z)| lr 6.20e-05 | 4161.31 ms | 32.4% bf16 MFU | 125924 tok/s step 15632/19560 | loss 3.273323 (-0.94z)| norm 0.2561 (-0.93z)| lr 6.20e-05 | 4163.08 ms | 32.4% bf16 MFU | 125924 tok/s step 15633/19560 | loss 3.300208 (-0.31z)| norm 0.2672 (+0.02z)| lr 6.20e-05 | 4150.97 ms | 32.5% bf16 MFU | 125943 tok/s step 15634/19560 | loss 3.264212 (-1.14z)| norm 0.2591 (-0.66z)| lr 6.19e-05 | 4158.65 ms | 32.5% bf16 MFU | 125950 tok/s step 15635/19560 | loss 3.318518 (+0.12z)| norm 0.2703 (+0.29z)| lr 6.19e-05 | 4150.78 ms | 32.5% bf16 MFU | 125968 tok/s step 15636/19560 | loss 3.304864 (-0.20z)| norm 0.2728 (+0.50z)| lr 6.19e-05 | 4158.11 ms | 32.5% bf16 MFU | 125974 tok/s step 15637/19560 | loss 3.302958 (-0.26z)| norm 0.2614 (-0.47z)| lr 6.18e-05 | 4167.09 ms | 32.4% bf16 MFU | 125966 tok/s step 15638/19560 | loss 3.246504 (-1.56z)| norm 0.2686 (+0.16z)| lr 6.18e-05 | 4160.52 ms | 32.5% bf16 MFU | 125968 tok/s step 15639/19560 | loss 3.334464 (+0.49z)| norm 0.2787 (+1.01z)| lr 6.18e-05 | 4152.41 ms | 32.5% bf16 MFU | 125983 tok/s step 15640/19560 | loss 3.329200 (+0.36z)| norm 0.2697 (+0.23z)| lr 6.17e-05 | 4166.84 ms | 32.4% bf16 MFU | 125975 tok/s step 15641/19560 | loss 3.257709 (-1.29z)| norm 0.2776 (+0.90z)| lr 6.17e-05 | 4170.87 ms | 32.4% bf16 MFU | 125961 tok/s step 15642/19560 | loss 3.322984 (+0.22z)| norm 0.2796 (+1.05z)| lr 6.17e-05 | 4166.58 ms | 32.4% bf16 MFU | 125955 tok/s step 15643/19560 | loss 3.302335 (-0.27z)| norm 0.2563 (-0.95z)| lr 6.17e-05 | 4170.75 ms | 32.4% bf16 MFU | 125942 tok/s step 15644/19560 | loss 3.269816 (-1.02z)| norm 0.2706 (+0.28z)| lr 6.16e-05 | 4155.82 ms | 32.5% bf16 MFU | 125953 tok/s step 15645/19560 | loss 3.291937 (-0.50z)| norm 
0.2613 (-0.52z)| lr 6.16e-05 | 4183.63 ms | 32.3% bf16 MFU | 125921 tok/s step 15646/19560 | loss 3.337096 (+0.57z)| norm 0.2621 (-0.44z)| lr 6.16e-05 | 4163.93 ms | 32.4% bf16 MFU | 125921 tok/s step 15647/19560 | loss 3.393102 (+1.86z)| norm 0.2759 (+0.81z)| lr 6.15e-05 | 4157.09 ms | 32.5% bf16 MFU | 125931 tok/s step 15648/19560 | loss 3.314046 (+0.01z)| norm 0.2639 (-0.26z)| lr 6.15e-05 | 4163.06 ms | 32.4% bf16 MFU | 125931 tok/s step 15649/19560 | loss 3.444405 (+2.93z)| norm 0.2667 (-0.00z)| lr 6.15e-05 | 4162.59 ms | 32.4% bf16 MFU | 125932 tok/s step 15650/19560 | loss 3.332193 (+0.40z)| norm 0.2779 (+1.00z)| lr 6.14e-05 | 4159.25 ms | 32.5% bf16 MFU | 125938 tok/s step 15651/19560 | loss 3.285977 (-0.64z)| norm 0.2620 (-0.45z)| lr 6.14e-05 | 4151.15 ms | 32.5% bf16 MFU | 125956 tok/s step 15652/19560 | loss 3.307823 (-0.15z)| norm 0.2571 (-0.89z)| lr 6.14e-05 | 4157.36 ms | 32.5% bf16 MFU | 125964 tok/s step 15653/19560 | loss 3.357205 (+0.95z)| norm 0.2571 (-0.88z)| lr 6.14e-05 | 4159.58 ms | 32.5% bf16 MFU | 125968 tok/s step 15654/19560 | loss 3.298896 (-0.35z)| norm 0.2625 (-0.37z)| lr 6.13e-05 | 4240.30 ms | 31.8% bf16 MFU | 125852 tok/s step 15655/19560 | loss 3.348894 (+0.77z)| norm 0.2615 (-0.45z)| lr 6.13e-05 | 4164.29 ms | 32.4% bf16 MFU | 125854 tok/s step 15656/19560 | loss 3.248871 (-1.49z)| norm 0.2579 (-0.78z)| lr 6.13e-05 | 4170.47 ms | 32.4% bf16 MFU | 125847 tok/s step 15657/19560 | loss 3.503592 (+3.96z)| norm 0.2800 (+1.24z)| lr 6.12e-05 | 4152.02 ms | 32.5% bf16 MFU | 125869 tok/s step 15658/19560 | loss 3.334827 (+0.38z)| norm 0.2550 (-1.04z)| lr 6.12e-05 | 4159.09 ms | 32.5% bf16 MFU | 125878 tok/s step 15659/19560 | loss 3.437422 (+2.48z)| norm 0.2851 (+1.67z)| lr 6.12e-05 | 4154.54 ms | 32.5% bf16 MFU | 125894 tok/s step 15660/19560 | loss 3.349940 (+0.66z)| norm 0.2632 (-0.30z)| lr 6.11e-05 | 4153.86 ms | 32.5% bf16 MFU | 125910 tok/s step 15661/19560 | loss 3.292736 (-0.54z)| norm 0.2459 (-1.82z)| lr 6.11e-05 | 4169.10 ms | 
32.4% bf16 MFU | 125902 tok/s step 15662/19560 | loss 3.285931 (-0.68z)| norm 0.2545 (-1.04z)| lr 6.11e-05 | 4154.47 ms | 32.5% bf16 MFU | 125917 tok/s step 15663/19560 | loss 3.288657 (-0.62z)| norm 0.2805 (+1.28z)| lr 6.10e-05 | 4162.41 ms | 32.4% bf16 MFU | 125919 tok/s step 15664/19560 | loss 3.306000 (-0.27z)| norm 0.2532 (-1.14z)| lr 6.10e-05 | 4157.16 ms | 32.5% bf16 MFU | 125929 tok/s step 15665/19560 | loss 3.302141 (-0.35z)| norm 0.2521 (-1.22z)| lr 6.10e-05 | 4159.67 ms | 32.5% bf16 MFU | 125935 tok/s step 15666/19560 | loss 3.299827 (-0.40z)| norm 0.2758 (+0.86z)| lr 6.10e-05 | 4158.07 ms | 32.5% bf16 MFU | 125942 tok/s step 15667/19560 | loss 3.241320 (-1.62z)| norm 0.2579 (-0.70z)| lr 6.09e-05 | 4156.74 ms | 32.5% bf16 MFU | 125952 tok/s step 15668/19560 | loss 3.252489 (-1.37z)| norm 0.2682 (+0.20z)| lr 6.09e-05 | 4172.53 ms | 32.4% bf16 MFU | 125937 tok/s step 15669/19560 | loss 3.364392 (+0.95z)| norm 0.2707 (+0.43z)| lr 6.09e-05 | 4176.07 ms | 32.3% bf16 MFU | 125917 tok/s step 15670/19560 | loss 3.236449 (-1.67z)| norm 0.2603 (-0.49z)| lr 6.08e-05 | 4150.97 ms | 32.5% bf16 MFU | 125937 tok/s step 15671/19560 | loss 3.319604 (+0.03z)| norm 0.2566 (-0.81z)| lr 6.08e-05 | 4179.46 ms | 32.3% bf16 MFU | 125912 tok/s step 15672/19560 | loss 3.368415 (+1.02z)| norm 0.2524 (-1.18z)| lr 6.08e-05 | 4156.05 ms | 32.5% bf16 MFU | 125924 tok/s step 15673/19560 | loss 3.394572 (+1.53z)| norm 0.2624 (-0.28z)| lr 6.07e-05 | 4159.38 ms | 32.5% bf16 MFU | 125930 tok/s step 15674/19560 | loss 3.307062 (-0.25z)| norm 0.2568 (-0.78z)| lr 6.07e-05 | 4175.63 ms | 32.3% bf16 MFU | 125912 tok/s step 15675/19560 | loss 3.274001 (-0.94z)| norm 0.2492 (-1.44z)| lr 6.07e-05 | 4156.74 ms | 32.5% bf16 MFU | 125923 tok/s step 15676/19560 | loss 3.310488 (-0.18z)| norm 0.2680 (+0.22z)| lr 6.07e-05 | 4157.99 ms | 32.5% bf16 MFU | 125931 tok/s step 15677/19560 | loss 3.250723 (-1.39z)| norm 0.2650 (-0.04z)| lr 6.06e-05 | 4176.67 ms | 32.3% bf16 MFU | 125911 tok/s step 15678/19560 
| loss 3.330857 (+0.24z)| norm 0.2675 (+0.19z)| lr 6.06e-05 | 4159.24 ms | 32.5% bf16 MFU | 125918 tok/s step 15679/19560 | loss 3.299075 (-0.40z)| norm 0.2639 (-0.15z)| lr 6.06e-05 | 4157.51 ms | 32.5% bf16 MFU | 125927 tok/s step 15680/19560 | loss 3.292204 (-0.54z)| norm 0.2982 (+2.87z)| lr 6.05e-05 | 4163.76 ms | 32.4% bf16 MFU | 125927 tok/s step 15681/19560 | loss 3.292543 (-0.52z)| norm 0.2566 (-0.80z)| lr 6.05e-05 | 4153.64 ms | 32.5% bf16 MFU | 125942 tok/s step 15682/19560 | loss 3.335217 (+0.35z)| norm 0.2697 (+0.34z)| lr 6.05e-05 | 4153.86 ms | 32.5% bf16 MFU | 125955 tok/s step 15683/19560 | loss 3.359519 (+0.85z)| norm 0.2533 (-1.10z)| lr 6.04e-05 | 4163.46 ms | 32.4% bf16 MFU | 125954 tok/s step 15684/19560 | loss 3.248952 (-1.39z)| norm 0.2825 (+1.45z)| lr 6.04e-05 | 4151.47 ms | 32.5% bf16 MFU | 125971 tok/s step 15685/19560 | loss 3.310525 (-0.15z)| norm 0.2794 (+1.16z)| lr 6.04e-05 | 4156.71 ms | 32.5% bf16 MFU | 125979 tok/s step 15686/19560 | loss 3.296965 (-0.43z)| norm 0.2545 (-1.03z)| lr 6.04e-05 | 4158.80 ms | 32.5% bf16 MFU | 125983 tok/s step 15687/19560 | loss 3.348928 (+0.63z)| norm 0.2567 (-0.82z)| lr 6.03e-05 | 4167.39 ms | 32.4% bf16 MFU | 125974 tok/s step 15688/19560 | loss 3.306885 (-0.23z)| norm 0.2686 (+0.22z)| lr 6.03e-05 | 4165.85 ms | 32.4% bf16 MFU | 125968 tok/s step 15689/19560 | loss 3.344644 (+0.54z)| norm 0.2581 (-0.73z)| lr 6.03e-05 | 4145.34 ms | 32.6% bf16 MFU | 125994 tok/s step 15690/19560 | loss 3.349688 (+0.64z)| norm 0.2604 (-0.52z)| lr 6.02e-05 | 4162.17 ms | 32.4% bf16 MFU | 125992 tok/s step 15691/19560 | loss 3.276763 (-0.85z)| norm 0.2717 (+0.49z)| lr 6.02e-05 | 4157.01 ms | 32.5% bf16 MFU | 125999 tok/s step 15692/19560 | loss 3.333083 (+0.31z)| norm 0.2663 (+0.01z)| lr 6.02e-05 | 7164.30 ms | 18.8% bf16 MFU | 123358 tok/s step 15693/19560 | loss 3.295599 (-0.46z)| norm 0.2610 (-0.47z)| lr 6.01e-05 | 4160.39 ms | 32.5% bf16 MFU | 123491 tok/s step 15694/19560 | loss 3.312373 (-0.09z)| norm 0.2833 (+1.51z)| 
lr 6.01e-05 | 4138.55 ms | 32.6% bf16 MFU | 123651 tok/s step 15695/19560 | loss 3.277758 (-0.81z)| norm 0.2759 (+0.84z)| lr 6.01e-05 | 4154.00 ms | 32.5% bf16 MFU | 123779 tok/s step 15696/19560 | loss 3.348653 (+0.68z)| norm 0.2695 (+0.27z)| lr 6.01e-05 | 4148.60 ms | 32.5% bf16 MFU | 123909 tok/s step 15697/19560 | loss 3.241850 (-1.55z)| norm 0.2659 (-0.03z)| lr 6.00e-05 | 4153.74 ms | 32.5% bf16 MFU | 124024 tok/s step 15698/19560 | loss 3.321985 (+0.13z)| norm 0.2729 (+0.60z)| lr 6.00e-05 | 4179.02 ms | 32.3% bf16 MFU | 124096 tok/s step 15699/19560 | loss 3.267169 (-1.00z)| norm 0.2645 (-0.15z)| lr 6.00e-05 | 4161.18 ms | 32.4% bf16 MFU | 124191 tok/s step 15700/19560 | loss 3.345145 (+0.63z)| norm 0.2606 (-0.51z)| lr 5.99e-05 | 4143.36 ms | 32.6% bf16 MFU | 124308 tok/s step 15701/19560 | loss 3.359105 (+0.90z)| norm 0.2896 (+2.15z)| lr 5.99e-05 | 4157.11 ms | 32.5% bf16 MFU | 124399 tok/s step 15702/19560 | loss 3.306409 (-0.20z)| norm 0.2716 (+0.49z)| lr 5.99e-05 | 4157.03 ms | 32.5% bf16 MFU | 124485 tok/s step 15703/19560 | loss 3.284528 (-0.65z)| norm 0.2563 (-0.92z)| lr 5.98e-05 | 4146.68 ms | 32.6% bf16 MFU | 124582 tok/s step 15704/19560 | loss 3.317485 (+0.04z)| norm 0.2744 (+0.74z)| lr 5.98e-05 | 4154.42 ms | 32.5% bf16 MFU | 124663 tok/s step 15705/19560 | loss 3.310518 (-0.09z)| norm 0.2801 (+1.27z)| lr 5.98e-05 | 4152.24 ms | 32.5% bf16 MFU | 124743 tok/s step 15706/19560 | loss 3.242190 (-1.52z)| norm 0.2584 (-0.72z)| lr 5.98e-05 | 4149.67 ms | 32.5% bf16 MFU | 124823 tok/s step 15707/19560 | loss 3.322566 (+0.18z)| norm 0.2461 (-1.82z)| lr 5.97e-05 | 4152.49 ms | 32.5% bf16 MFU | 124895 tok/s step 15708/19560 | loss 3.292812 (-0.45z)| norm 0.2725 (+0.58z)| lr 5.97e-05 | 4168.06 ms | 32.4% bf16 MFU | 124940 tok/s step 15709/19560 | loss 3.228020 (-1.79z)| norm 0.2621 (-0.37z)| lr 5.97e-05 | 4157.01 ms | 32.5% bf16 MFU | 124999 tok/s step 15710/19560 | loss 3.265336 (-0.99z)| norm 0.2556 (-0.95z)| lr 5.96e-05 | 4159.75 ms | 32.5% bf16 MFU | 
125051 tok/s step 15711/19560 | loss 3.262538 (-1.04z)| norm 0.2511 (-1.33z)| lr 5.96e-05 | 4152.21 ms | 32.5% bf16 MFU | 125112 tok/s step 15712/19560 | loss 3.264786 (-0.97z)| norm 0.2638 (-0.20z)| lr 5.96e-05 | 4159.55 ms | 32.5% bf16 MFU | 125158 tok/s step 15713/19560 | loss 3.291312 (-0.42z)| norm 0.2505 (-1.38z)| lr 5.95e-05 | 4159.48 ms | 32.5% bf16 MFU | 125203 tok/s step 15714/19560 | loss 3.297807 (-0.28z)| norm 0.2487 (-1.52z)| lr 5.95e-05 | 4165.92 ms | 32.4% bf16 MFU | 125235 tok/s step 15715/19560 | loss 3.294076 (-0.35z)| norm 0.2609 (-0.42z)| lr 5.95e-05 | 4166.68 ms | 32.4% bf16 MFU | 125265 tok/s step 15716/19560 | loss 3.272460 (-0.79z)| norm 0.2620 (-0.33z)| lr 5.95e-05 | 4150.78 ms | 32.5% bf16 MFU | 125317 tok/s step 15717/19560 | loss 3.305330 (-0.11z)| norm 0.2538 (-1.08z)| lr 5.94e-05 | 4152.89 ms | 32.5% bf16 MFU | 125364 tok/s step 15718/19560 | loss 3.414624 (+2.11z)| norm 0.2651 (-0.05z)| lr 5.94e-05 | 4163.49 ms | 32.4% bf16 MFU | 125392 tok/s step 15719/19560 | loss 3.354238 (+0.87z)| norm 0.2549 (-0.97z)| lr 5.94e-05 | 4154.55 ms | 32.5% bf16 MFU | 125432 tok/s step 15720/19560 | loss 3.316971 (+0.11z)| norm 0.2511 (-1.29z)| lr 5.93e-05 | 4146.01 ms | 32.6% bf16 MFU | 125483 tok/s step 15721/19560 | loss 3.304590 (-0.15z)| norm 0.2537 (-1.07z)| lr 5.93e-05 | 4150.60 ms | 32.5% bf16 MFU | 125525 tok/s step 15722/19560 | loss 3.225008 (-1.76z)| norm 0.3112 (+3.83z)| lr 5.93e-05 | 4153.79 ms | 32.5% bf16 MFU | 125559 tok/s step 15723/19560 | loss 3.307548 (-0.08z)| norm 0.2753 (+0.79z)| lr 5.92e-05 | 4154.14 ms | 32.5% bf16 MFU | 125592 tok/s step 15724/19560 | loss 3.337687 (+0.53z)| norm 0.2561 (-0.84z)| lr 5.92e-05 | 4156.57 ms | 32.5% bf16 MFU | 125619 tok/s step 15725/19560 | loss 3.322707 (+0.22z)| norm 0.2497 (-1.36z)| lr 5.92e-05 | 4160.54 ms | 32.5% bf16 MFU | 125639 tok/s step 15726/19560 | loss 3.299696 (-0.24z)| norm 0.2615 (-0.35z)| lr 5.92e-05 | 4144.35 ms | 32.6% bf16 MFU | 125682 tok/s step 15727/19560 | loss 3.295229 
(-0.33z)| norm 0.2572 (-0.74z)| lr 5.91e-05 | 4161.97 ms | 32.4% bf16 MFU | 125697 tok/s step 15728/19560 | loss 3.270517 (-0.87z)| norm 0.2607 (-0.44z)| lr 5.91e-05 | 4159.91 ms | 32.5% bf16 MFU | 125713 tok/s step 15729/19560 | loss 3.335634 (+0.59z)| norm 0.2724 (+0.58z)| lr 5.91e-05 | 4147.23 ms | 32.6% bf16 MFU | 125749 tok/s step 15730/19560 | loss 3.252836 (-1.25z)| norm 0.2491 (-1.46z)| lr 5.90e-05 | 4159.68 ms | 32.5% bf16 MFU | 125763 tok/s step 15731/19560 | loss 3.278578 (-0.67z)| norm 0.2783 (+1.08z)| lr 5.90e-05 | 4158.38 ms | 32.5% bf16 MFU | 125779 tok/s step 15732/19560 | loss 3.317462 (+0.20z)| norm 0.2789 (+1.12z)| lr 5.90e-05 | 4149.58 ms | 32.5% bf16 MFU | 125808 tok/s step 15733/19560 | loss 3.289459 (-0.42z)| norm 0.2764 (+0.89z)| lr 5.90e-05 | 4173.49 ms | 32.4% bf16 MFU | 125798 tok/s step 15734/19560 | loss 3.334924 (+0.60z)| norm 0.2741 (+0.69z)| lr 5.89e-05 | 4155.67 ms | 32.5% bf16 MFU | 125817 tok/s step 15735/19560 | loss 3.293719 (-0.32z)| norm 0.2579 (-0.71z)| lr 5.89e-05 | 4156.90 ms | 32.5% bf16 MFU | 125832 tok/s step 15736/19560 | loss 3.277852 (-0.66z)| norm 0.2571 (-0.77z)| lr 5.89e-05 | 4154.28 ms | 32.5% bf16 MFU | 125851 tok/s step 15737/19560 | loss 3.350006 (+0.95z)| norm 0.2574 (-0.74z)| lr 5.88e-05 | 4157.02 ms | 32.5% bf16 MFU | 125864 tok/s step 15738/19560 | loss 3.311939 (+0.09z)| norm 0.2489 (-1.45z)| lr 5.88e-05 | 4158.83 ms | 32.5% bf16 MFU | 125874 tok/s step 15739/19560 | loss 3.332550 (+0.55z)| norm 0.2619 (-0.34z)| lr 5.88e-05 | 4148.97 ms | 32.5% bf16 MFU | 125899 tok/s step 15740/19560 | loss 3.331666 (+0.52z)| norm 0.2553 (-0.91z)| lr 5.87e-05 | 4157.51 ms | 32.5% bf16 MFU | 125909 tok/s step 15741/19560 | loss 3.360393 (+1.14z)| norm 0.2526 (-1.13z)| lr 5.87e-05 | 4153.83 ms | 32.5% bf16 MFU | 125925 tok/s step 15742/19560 | loss 3.311088 (+0.04z)| norm 0.2516 (-1.22z)| lr 5.87e-05 | 4164.65 ms | 32.4% bf16 MFU | 125923 tok/s step 15743/19560 | loss 3.317597 (+0.19z)| norm 0.2571 (-0.74z)| lr 5.87e-05 | 
4209.49 ms | 32.1% bf16 MFU | 125854 tok/s step 15744/19560 | loss 3.345782 (+0.85z)| norm 0.2672 (+0.13z)| lr 5.86e-05 | 4158.32 ms | 32.5% bf16 MFU | 125866 tok/s step 15745/19560 | loss 3.283041 (-0.59z)| norm 0.2623 (-0.29z)| lr 5.86e-05 | 4184.22 ms | 32.3% bf16 MFU | 125837 tok/s step 15746/19560 | loss 3.254017 (-1.23z)| norm 0.2725 (+0.60z)| lr 5.86e-05 | 4154.77 ms | 32.5% bf16 MFU | 125855 tok/s step 15747/19560 | loss 3.315269 (+0.16z)| norm 0.2662 (+0.07z)| lr 5.85e-05 | 4165.70 ms | 32.4% bf16 MFU | 125855 tok/s step 15748/19560 | loss 3.290956 (-0.39z)| norm 0.2527 (-1.12z)| lr 5.85e-05 | 4165.88 ms | 32.4% bf16 MFU | 125855 tok/s step 15749/19560 | loss 3.316979 (+0.19z)| norm 0.2500 (-1.34z)| lr 5.85e-05 | 4163.20 ms | 32.4% bf16 MFU | 125859 tok/s step 15750/19560 | loss 3.343744 (+0.81z)| norm 0.2363 (-2.49z)| lr 5.84e-05 | 4159.40 ms | 32.5% bf16 MFU | 125868 tok/s val loss 3.286086
HellaSwag: 3008/10042 = 0.299542 step 15751/19560 | loss 3.291425 (-0.40z)| norm 0.2394 (-2.16z)| lr 5.84e-05 | 4154.66 ms | 32.5% bf16 MFU | 125885 tok/s step 15752/19560 | loss 3.305261 (-0.08z)| norm 0.2564 (-0.69z)| lr 5.84e-05 | 4152.42 ms | 32.5% bf16 MFU | 125903 tok/s step 15753/19560 | loss 3.346288 (+0.85z)| norm 0.2485 (-1.35z)| lr 5.84e-05 | 4156.31 ms | 32.5% bf16 MFU | 125915 tok/s step 15754/19560 | loss 3.362778 (+1.21z)| norm 0.2364 (-2.33z)| lr 5.83e-05 | 4162.59 ms | 32.4% bf16 MFU | 125917 tok/s step 15755/19560 | loss 3.293878 (-0.35z)| norm 0.2559 (-0.66z)| lr 5.83e-05 | 4159.94 ms | 32.5% bf16 MFU | 125923 tok/s step 15756/19560 | loss 3.280684 (-0.64z)| norm 0.2463 (-1.46z)| lr 5.83e-05 | 4158.32 ms | 32.5% bf16 MFU | 125931 tok/s step 15757/19560 | loss 3.291386 (-0.40z)| norm 0.2674 (+0.34z)| lr 5.82e-05 | 4157.76 ms | 32.5% bf16 MFU | 125939 tok/s step 15758/19560 | loss 3.336065 (+0.61z)| norm 0.2413 (-1.85z)| lr 5.82e-05 | 4888.01 ms | 27.6% bf16 MFU | 125005 tok/s step 15759/19560 | loss 3.307792 (-0.05z)| norm 0.2548 (-0.70z)| lr 5.82e-05 | 4159.27 ms | 32.5% bf16 MFU | 125058 tok/s step 15760/19560 | loss 
3.306055 (-0.10z)| norm 0.2583 (-0.40z)| lr 5.81e-05 | 4155.75 ms | 32.5% bf16 MFU | 125113 tok/s step 15761/19560 | loss 3.283607 (-0.62z)| norm 0.2493 (-1.15z)| lr 5.81e-05 | 4153.91 ms | 32.5% bf16 MFU | 125168 tok/s step 15762/19560 | loss 3.295081 (-0.36z)| norm 0.2619 (-0.08z)| lr 5.81e-05 | 4148.14 ms | 32.5% bf16 MFU | 125229 tok/s step 15763/19560 | loss 3.330656 (+0.48z)| norm 0.2513 (-0.97z)| lr 5.81e-05 | 4153.09 ms | 32.5% bf16 MFU | 125280 tok/s step 15764/19560 | loss 3.287043 (-0.54z)| norm 0.2730 (+0.86z)| lr 5.80e-05 | 4171.04 ms | 32.4% bf16 MFU | 125301 tok/s step 15765/19560 | loss 3.316696 (+0.15z)| norm 0.2445 (-1.52z)| lr 5.80e-05 | 4156.30 ms | 32.5% bf16 MFU | 125343 tok/s step 15766/19560 | loss 3.337294 (+0.62z)| norm 0.2630 (+0.04z)| lr 5.80e-05 | 4159.48 ms | 32.5% bf16 MFU | 125378 tok/s step 15767/19560 | loss 3.300624 (-0.24z)| norm 0.2660 (+0.30z)| lr 5.79e-05 | 4170.19 ms | 32.4% bf16 MFU | 125395 tok/s step 15768/19560 | loss 3.312095 (+0.03z)| norm 0.2481 (-1.19z)| lr 5.79e-05 | 4154.32 ms | 32.5% bf16 MFU | 125436 tok/s step 15769/19560 | loss 3.477448 (+3.71z)| norm 0.2916 (+2.42z)| lr 5.79e-05 | 4160.94 ms | 32.4% bf16 MFU | 125464 tok/s step 15770/19560 | loss 3.322710 (+0.23z)| norm 0.2707 (+0.70z)| lr 5.79e-05 | 4158.42 ms | 32.5% bf16 MFU | 125495 tok/s step 15771/19560 | loss 3.229069 (-1.84z)| norm 0.2516 (-0.89z)| lr 5.78e-05 | 4158.31 ms | 32.5% bf16 MFU | 125524 tok/s step 15772/19560 | loss 3.305664 (-0.14z)| norm 0.2736 (+0.93z)| lr 5.78e-05 | 4163.65 ms | 32.4% bf16 MFU | 125544 tok/s step 15773/19560 | loss 3.389757 (+1.69z)| norm 0.2726 (+0.84z)| lr 5.78e-05 | 4164.54 ms | 32.4% bf16 MFU | 125561 tok/s step 15774/19560 | loss 3.319829 (+0.16z)| norm 0.2559 (-0.54z)| lr 5.77e-05 | 4147.94 ms | 32.6% bf16 MFU | 125603 tok/s step 15775/19560 | loss 3.344501 (+0.72z)| norm 0.2662 (+0.32z)| lr 5.77e-05 | 4159.13 ms | 32.5% bf16 MFU | 125626 tok/s step 15776/19560 | loss 3.326598 (+0.32z)| norm 0.2789 (+1.36z)| lr 
5.77e-05 | 4169.50 ms | 32.4% bf16 MFU | 125632 tok/s step 15777/19560 | loss 3.376368 (+1.48z)| norm 0.2703 (+0.65z)| lr 5.76e-05 | 4163.46 ms | 32.4% bf16 MFU | 125646 tok/s step 15778/19560 | loss 3.327695 (+0.36z)| norm 0.2733 (+0.90z)| lr 5.76e-05 | 4155.90 ms | 32.5% bf16 MFU | 125672 tok/s step 15779/19560 | loss 3.452166 (+3.07z)| norm 0.2801 (+1.44z)| lr 5.76e-05 | 4165.18 ms | 32.4% bf16 MFU | 125682 tok/s step 15780/19560 | loss 3.296090 (-0.37z)| norm 0.2648 (+0.18z)| lr 5.76e-05 | 4176.65 ms | 32.3% bf16 MFU | 125674 tok/s step 15781/19560 | loss 3.336165 (+0.51z)| norm 0.2673 (+0.37z)| lr 5.75e-05 | 4157.89 ms | 32.5% bf16 MFU | 125695 tok/s step 15782/19560 | loss 3.300220 (-0.28z)| norm 0.2550 (-0.62z)| lr 5.75e-05 | 4160.17 ms | 32.5% bf16 MFU | 125712 tok/s step 15783/19560 | loss 3.298191 (-0.32z)| norm 0.2614 (-0.10z)| lr 5.75e-05 | 4162.71 ms | 32.4% bf16 MFU | 125724 tok/s step 15784/19560 | loss 3.290855 (-0.49z)| norm 0.2582 (-0.36z)| lr 5.74e-05 | 4168.52 ms | 32.4% bf16 MFU | 125726 tok/s step 15785/19560 | loss 3.259056 (-1.24z)| norm 0.2525 (-0.82z)| lr 5.74e-05 | 4166.63 ms | 32.4% bf16 MFU | 125731 tok/s step 15786/19560 | loss 3.282382 (-0.67z)| norm 0.2612 (-0.11z)| lr 5.74e-05 | 4154.25 ms | 32.5% bf16 MFU | 125755 tok/s step 15787/19560 | loss 3.244027 (-1.60z)| norm 0.2752 (+1.07z)| lr 5.74e-05 | 4146.25 ms | 32.6% bf16 MFU | 125790 tok/s step 15788/19560 | loss 3.279915 (-0.70z)| norm 0.2573 (-0.42z)| lr 5.73e-05 | 4168.56 ms | 32.4% bf16 MFU | 125789 tok/s step 15789/19560 | loss 3.357958 (+1.20z)| norm 0.2633 (+0.07z)| lr 5.73e-05 | 4166.82 ms | 32.4% bf16 MFU | 125791 tok/s step 15790/19560 | loss 3.364543 (+1.34z)| norm 0.2523 (-0.85z)| lr 5.73e-05 | 4155.55 ms | 32.5% bf16 MFU | 125809 tok/s step 15791/19560 | loss 3.362680 (+1.27z)| norm 0.2565 (-0.49z)| lr 5.72e-05 | 4152.09 ms | 32.5% bf16 MFU | 125832 tok/s step 15792/19560 | loss 3.244200 (-1.57z)| norm 0.2591 (-0.27z)| lr 5.72e-05 | 4157.51 ms | 32.5% bf16 MFU | 125846 
tok/s step 15793/19560 | loss 3.304128 (-0.13z)| norm 0.2529 (-0.80z)| lr 5.72e-05 | 4158.31 ms | 32.5% bf16 MFU | 125858 tok/s step 15794/19560 | loss 3.284235 (-0.61z)| norm 0.2511 (-0.93z)| lr 5.71e-05 | 4144.61 ms | 32.6% bf16 MFU | 125890 tok/s step 15795/19560 | loss 3.326308 (+0.39z)| norm 0.2956 (+2.72z)| lr 5.71e-05 | 4167.34 ms | 32.4% bf16 MFU | 125886 tok/s step 15796/19560 | loss 3.262848 (-1.15z)| norm 0.2707 (+0.68z)| lr 5.71e-05 | 4157.79 ms | 32.5% bf16 MFU | 125896 tok/s step 15797/19560 | loss 3.309090 (-0.02z)| norm 0.2923 (+2.38z)| lr 5.71e-05 | 4165.06 ms | 32.4% bf16 MFU | 125896 tok/s step 15798/19560 | loss 3.266815 (-1.07z)| norm 0.2558 (-0.54z)| lr 5.70e-05 | 4161.67 ms | 32.4% bf16 MFU | 125900 tok/s step 15799/19560 | loss 3.264614 (-1.11z)| norm 0.2645 (+0.15z)| lr 5.70e-05 | 4167.72 ms | 32.4% bf16 MFU | 125895 tok/s step 15800/19560 | loss 3.307434 (-0.04z)| norm 0.2866 (+1.88z)| lr 5.70e-05 | 4166.57 ms | 32.4% bf16 MFU | 125891 tok/s step 15801/19560 | loss 3.288937 (-0.49z)| norm 0.2575 (-0.42z)| lr 5.69e-05 | 4163.86 ms | 32.4% bf16 MFU | 125893 tok/s step 15802/19560 | loss 3.242083 (-1.64z)| norm 0.2775 (+1.14z)| lr 5.69e-05 | 4153.96 ms | 32.5% bf16 MFU | 125909 tok/s step 15803/19560 | loss 3.319579 (+0.28z)| norm 0.2674 (+0.34z)| lr 5.69e-05 | 4154.11 ms | 32.5% bf16 MFU | 125924 tok/s step 15804/19560 | loss 3.303948 (-0.11z)| norm 0.2625 (-0.05z)| lr 5.69e-05 | 4168.93 ms | 32.4% bf16 MFU | 125916 tok/s step 15805/19560 | loss 3.309194 (+0.01z)| norm 0.2576 (-0.43z)| lr 5.68e-05 | 4146.91 ms | 32.6% bf16 MFU | 125941 tok/s step 15806/19560 | loss 3.302140 (-0.16z)| norm 0.2501 (-1.02z)| lr 5.68e-05 | 4168.79 ms | 32.4% bf16 MFU | 125932 tok/s step 15807/19560 | loss 3.252571 (-1.39z)| norm 0.2693 (+0.50z)| lr 5.68e-05 | 4154.83 ms | 32.5% bf16 MFU | 125945 tok/s step 15808/19560 | loss 3.287796 (-0.51z)| norm 0.2633 (+0.05z)| lr 5.67e-05 | 4159.16 ms | 32.5% bf16 MFU | 125951 tok/s step 15809/19560 | loss 3.280102 
(-0.70z)| norm 0.2725 (+0.78z)| lr 5.67e-05 | 4164.78 ms | 32.4% bf16 MFU | 125947 tok/s step 15810/19560 | loss 3.305649 (-0.05z)| norm 0.2621 (-0.05z)| lr 5.67e-05 | 4158.78 ms | 32.5% bf16 MFU | 125953 tok/s step 15811/19560 | loss 3.327299 (+0.50z)| norm 0.2709 (+0.66z)| lr 5.67e-05 | 4151.21 ms | 32.5% bf16 MFU | 125971 tok/s step 15812/19560 | loss 3.331893 (+0.60z)| norm 0.2750 (+1.00z)| lr 5.66e-05 | 4163.21 ms | 32.4% bf16 MFU | 125969 tok/s step 15813/19560 | loss 3.370287 (+1.55z)| norm 0.2906 (+2.24z)| lr 5.66e-05 | 4159.51 ms | 32.5% bf16 MFU | 125973 tok/s step 15814/19560 | loss 3.256575 (-1.29z)| norm 0.2733 (+0.83z)| lr 5.66e-05 | 4151.72 ms | 32.5% bf16 MFU | 125988 tok/s step 15815/19560 | loss 3.318237 (+0.26z)| norm 0.2826 (+1.55z)| lr 5.65e-05 | 4163.81 ms | 32.4% bf16 MFU | 125984 tok/s step 15816/19560 | loss 3.327511 (+0.48z)| norm 0.2705 (+0.58z)| lr 5.65e-05 | 4157.80 ms | 32.5% bf16 MFU | 125990 tok/s step 15817/19560 | loss 3.344702 (+0.91z)| norm 0.2497 (-1.08z)| lr 5.65e-05 | 4168.21 ms | 32.4% bf16 MFU | 125980 tok/s step 15818/19560 | loss 3.314435 (+0.16z)| norm 0.2763 (+1.03z)| lr 5.64e-05 | 4160.78 ms | 32.5% bf16 MFU | 125981 tok/s step 15819/19560 | loss 3.297505 (-0.27z)| norm 0.2603 (-0.24z)| lr 5.64e-05 | 4154.84 ms | 32.5% bf16 MFU | 125991 tok/s step 15820/19560 | loss 3.327350 (+0.49z)| norm 0.2784 (+1.19z)| lr 5.64e-05 | 4164.64 ms | 32.4% bf16 MFU | 125986 tok/s step 15821/19560 | loss 3.280344 (-0.69z)| norm 0.2546 (-0.69z)| lr 5.64e-05 | 4155.01 ms | 32.5% bf16 MFU | 125996 tok/s step 15822/19560 | loss 3.381539 (+1.81z)| norm 0.2900 (+2.09z)| lr 5.63e-05 | 4147.63 ms | 32.6% bf16 MFU | 126017 tok/s step 15823/19560 | loss 3.291247 (-0.43z)| norm 0.2602 (-0.24z)| lr 5.63e-05 | 4160.15 ms | 32.5% bf16 MFU | 126017 tok/s step 15824/19560 | loss 3.321089 (+0.32z)| norm 0.2682 (+0.39z)| lr 5.63e-05 | 4149.60 ms | 32.5% bf16 MFU | 126034 tok/s step 15825/19560 | loss 3.334003 (+0.63z)| norm 0.2618 (-0.11z)| lr 5.62e-05 | 
4156.20 ms | 32.5% bf16 MFU | 126039 tok/s step 15826/19560 | loss 3.311426 (+0.06z)| norm 0.2746 (+0.89z)| lr 5.62e-05 | 4162.26 ms | 32.4% bf16 MFU | 126035 tok/s step 15827/19560 | loss 3.406428 (+2.38z)| norm 0.2754 (+0.95z)| lr 5.62e-05 | 4163.62 ms | 32.4% bf16 MFU | 126030 tok/s step 15828/19560 | loss 3.311244 (+0.04z)| norm 0.2661 (+0.21z)| lr 5.62e-05 | 4154.73 ms | 32.5% bf16 MFU | 126038 tok/s step 15829/19560 | loss 3.271769 (-0.93z)| norm 0.2900 (+2.09z)| lr 5.61e-05 | 4166.17 ms | 32.4% bf16 MFU | 126028 tok/s step 15830/19560 | loss 3.289735 (-0.48z)| norm 0.2621 (-0.10z)| lr 5.61e-05 | 4167.22 ms | 32.4% bf16 MFU | 126017 tok/s step 15831/19560 | loss 3.317126 (+0.20z)| norm 0.2669 (+0.28z)| lr 5.61e-05 | 4173.93 ms | 32.3% bf16 MFU | 125997 tok/s step 15832/19560 | loss 3.298420 (-0.27z)| norm 0.2731 (+0.77z)| lr 5.60e-05 | 4164.65 ms | 32.4% bf16 MFU | 125992 tok/s step 15833/19560 | loss 3.292020 (-0.42z)| norm 0.2980 (+2.66z)| lr 5.60e-05 | 4160.06 ms | 32.5% bf16 MFU | 125993 tok/s step 15834/19560 | loss 3.326769 (+0.43z)| norm 0.2787 (+1.15z)| lr 5.60e-05 | 4156.89 ms | 32.5% bf16 MFU | 126000 tok/s step 15835/19560 | loss 3.269840 (-0.99z)| norm 0.2677 (+0.30z)| lr 5.60e-05 | 4172.54 ms | 32.4% bf16 MFU | 125983 tok/s step 15836/19560 | loss 3.298577 (-0.27z)| norm 0.2716 (+0.60z)| lr 5.59e-05 | 4162.49 ms | 32.4% bf16 MFU | 125981 tok/s step 15837/19560 | loss 3.296961 (-0.33z)| norm 0.2807 (+1.28z)| lr 5.59e-05 | 4153.72 ms | 32.5% bf16 MFU | 125993 tok/s step 15838/19560 | loss 3.306618 (-0.09z)| norm 0.2532 (-0.82z)| lr 5.59e-05 | 4163.96 ms | 32.4% bf16 MFU | 125989 tok/s step 15839/19560 | loss 3.336794 (+0.67z)| norm 0.2804 (+1.24z)| lr 5.58e-05 | 4156.51 ms | 32.5% bf16 MFU | 125997 tok/s step 15840/19560 | loss 3.327459 (+0.42z)| norm 0.2769 (+0.96z)| lr 5.58e-05 | 4166.98 ms | 32.4% bf16 MFU | 125988 tok/s step 15841/19560 | loss 3.335858 (+0.63z)| norm 0.2810 (+1.25z)| lr 5.58e-05 | 4175.77 ms | 32.3% bf16 MFU | 125966 tok/s step 
15842/19560 | loss 3.323593 (+0.31z)| norm 0.2722 (+0.57z)| lr 5.57e-05 | 4171.35 ms | 32.4% bf16 MFU | 125952 tok/s step 15843/19560 | loss 3.398536 (+2.18z)| norm 0.2853 (+1.55z)| lr 5.57e-05 | 4157.60 ms | 32.5% bf16 MFU | 125960 tok/s step 15844/19560 | loss 3.260291 (-1.32z)| norm 0.2790 (+1.05z)| lr 5.57e-05 | 4157.51 ms | 32.5% bf16 MFU | 125967 tok/s step 15845/19560 | loss 3.308227 (-0.11z)| norm 0.2965 (+2.30z)| lr 5.57e-05 | 4167.74 ms | 32.4% bf16 MFU | 125959 tok/s step 15846/19560 | loss 3.308856 (-0.07z)| norm 0.3039 (+2.74z)| lr 5.56e-05 | 4184.79 ms | 32.3% bf16 MFU | 125925 tok/s step 15847/19560 | loss 3.333262 (+0.57z)| norm 0.3039 (+2.64z)| lr 5.56e-05 | 4169.57 ms | 32.4% bf16 MFU | 125916 tok/s step 15848/19560 | loss 3.324211 (+0.33z)| norm 0.2695 (+0.23z)| lr 5.56e-05 | 4152.74 ms | 32.5% bf16 MFU | 125932 tok/s step 15849/19560 | loss 3.307841 (-0.10z)| norm 0.2957 (+2.02z)| lr 5.55e-05 | 4168.63 ms | 32.4% bf16 MFU | 125924 tok/s step 15850/19560 | loss 3.341042 (+0.76z)| norm 0.3033 (+2.58z)| lr 5.55e-05 | 4162.54 ms | 32.4% bf16 MFU | 125926 tok/s step 15851/19560 | loss 3.382641 (+1.82z)| norm 0.2661 (-0.02z)| lr 5.55e-05 | 4160.81 ms | 32.4% bf16 MFU | 125930 tok/s step 15852/19560 | loss 3.328394 (+0.40z)| norm 0.2640 (-0.17z)| lr 5.55e-05 | 4166.48 ms | 32.4% bf16 MFU | 125925 tok/s step 15853/19560 | loss 3.355156 (+1.09z)| norm 0.2945 (+1.93z)| lr 5.54e-05 | 4200.81 ms | 32.1% bf16 MFU | 125869 tok/s step 15854/19560 | loss 3.266216 (-1.21z)| norm 0.2563 (-0.72z)| lr 5.54e-05 | 4724.48 ms | 28.6% bf16 MFU | 125124 tok/s step 15855/19560 | loss 3.278363 (-0.89z)| norm 0.2501 (-1.15z)| lr 5.54e-05 | 4156.52 ms | 32.5% bf16 MFU | 125175 tok/s step 15856/19560 | loss 3.299218 (-0.36z)| norm 0.2631 (-0.25z)| lr 5.53e-05 | 4149.92 ms | 32.5% bf16 MFU | 125233 tok/s step 15857/19560 | loss 3.370561 (+1.48z)| norm 0.2561 (-0.72z)| lr 5.53e-05 | 4150.77 ms | 32.5% bf16 MFU | 125287 tok/s step 15858/19560 | loss 3.302651 (-0.29z)| norm 
0.2720 (+0.36z)| lr 5.53e-05 | 4164.78 ms | 32.4% bf16 MFU | 125317 tok/s step 15859/19560 | loss 3.282240 (-0.82z)| norm 0.2550 (-0.80z)| lr 5.53e-05 | 4167.69 ms | 32.4% bf16 MFU | 125341 tok/s step 15860/19560 | loss 3.220565 (-2.36z)| norm 0.2612 (-0.37z)| lr 5.52e-05 | 4164.77 ms | 32.4% bf16 MFU | 125368 tok/s step 15861/19560 | loss 3.270352 (-1.08z)| norm 0.2919 (+1.75z)| lr 5.52e-05 | 4159.09 ms | 32.5% bf16 MFU | 125403 tok/s step 15862/19560 | loss 3.302691 (-0.25z)| norm 0.2700 (+0.24z)| lr 5.52e-05 | 4158.55 ms | 32.5% bf16 MFU | 125436 tok/s step 15863/19560 | loss 3.291753 (-0.53z)| norm 0.2586 (-0.55z)| lr 5.51e-05 | 4150.93 ms | 32.5% bf16 MFU | 125480 tok/s step 15864/19560 | loss 3.284312 (-0.72z)| norm 0.2714 (+0.33z)| lr 5.51e-05 | 4149.68 ms | 32.5% bf16 MFU | 125523 tok/s step 15865/19560 | loss 3.271837 (-1.02z)| norm 0.2642 (-0.18z)| lr 5.51e-05 | 4162.73 ms | 32.4% bf16 MFU | 125544 tok/s step 15866/19560 | loss 3.367317 (+1.39z)| norm 0.2572 (-0.67z)| lr 5.51e-05 | 4162.23 ms | 32.4% bf16 MFU | 125565 tok/s step 15867/19560 | loss 3.248154 (-1.59z)| norm 0.2584 (-0.58z)| lr 5.50e-05 | 4156.13 ms | 32.5% bf16 MFU | 125594 tok/s step 15868/19560 | loss 3.255197 (-1.39z)| norm 0.2766 (+0.68z)| lr 5.50e-05 | 4165.72 ms | 32.4% bf16 MFU | 125608 tok/s step 15869/19560 | loss 3.360250 (+1.22z)| norm 0.2654 (-0.11z)| lr 5.50e-05 | 4164.82 ms | 32.4% bf16 MFU | 125621 tok/s step 15870/19560 | loss 3.297158 (-0.35z)| norm 0.2561 (-0.77z)| lr 5.49e-05 | 4159.84 ms | 32.5% bf16 MFU | 125642 tok/s step 15871/19560 | loss 3.329104 (+0.44z)| norm 0.2668 (-0.02z)| lr 5.49e-05 | 4165.55 ms | 32.4% bf16 MFU | 125653 tok/s step 15872/19560 | loss 3.274726 (-0.89z)| norm 0.2617 (-0.38z)| lr 5.49e-05 | 4215.70 ms | 32.0% bf16 MFU | 125589 tok/s step 15873/19560 | loss 3.290807 (-0.49z)| norm 0.2648 (-0.16z)| lr 5.49e-05 | 4162.32 ms | 32.4% bf16 MFU | 125607 tok/s step 15874/19560 | loss 3.286098 (-0.62z)| norm 0.2620 (-0.35z)| lr 5.48e-05 | 4157.53 ms | 
32.5% bf16 MFU | 125632 tok/s step 15875/19560 | loss 3.274736 (-0.89z)| norm 0.2491 (-1.24z)| lr 5.48e-05 | 4153.67 ms | 32.5% bf16 MFU | 125662 tok/s step 15876/19560 | loss 3.351335 (+1.00z)| norm 0.2740 (+0.48z)| lr 5.48e-05 | 4159.52 ms | 32.5% bf16 MFU | 125681 tok/s step 15877/19560 | loss 3.301961 (-0.22z)| norm 0.2614 (-0.40z)| lr 5.47e-05 | 4159.52 ms | 32.5% bf16 MFU | 125699 tok/s step 15878/19560 | loss 3.351779 (+1.01z)| norm 0.2525 (-1.05z)| lr 5.47e-05 | 4160.54 ms | 32.5% bf16 MFU | 125715 tok/s step 15879/19560 | loss 3.322498 (+0.28z)| norm 0.2663 (-0.09z)| lr 5.47e-05 | 4153.60 ms | 32.5% bf16 MFU | 125740 tok/s step 15880/19560 | loss 3.261734 (-1.21z)| norm 0.2649 (-0.19z)| lr 5.46e-05 | 4159.78 ms | 32.5% bf16 MFU | 125755 tok/s step 15881/19560 | loss 3.246202 (-1.57z)| norm 0.2504 (-1.25z)| lr 5.46e-05 | 4151.54 ms | 32.5% bf16 MFU | 125782 tok/s step 15882/19560 | loss 3.356029 (+1.13z)| norm 0.2575 (-0.75z)| lr 5.46e-05 | 4153.39 ms | 32.5% bf16 MFU | 125804 tok/s step 15883/19560 | loss 3.270034 (-0.98z)| norm 0.2761 (+0.61z)| lr 5.46e-05 | 4155.84 ms | 32.5% bf16 MFU | 125822 tok/s step 15884/19560 | loss 3.285484 (-0.60z)| norm 0.2557 (-0.92z)| lr 5.45e-05 | 4149.53 ms | 32.5% bf16 MFU | 125848 tok/s step 15885/19560 | loss 3.327816 (+0.43z)| norm 0.2526 (-1.13z)| lr 5.45e-05 | 4156.92 ms | 32.5% bf16 MFU | 125862 tok/s step 15886/19560 | loss 3.295360 (-0.36z)| norm 0.2726 (+0.34z)| lr 5.45e-05 | 4172.33 ms | 32.4% bf16 MFU | 125852 tok/s step 15887/19560 | loss 3.333045 (+0.56z)| norm 0.2489 (-1.44z)| lr 5.44e-05 | 4156.51 ms | 32.5% bf16 MFU | 125866 tok/s step 15888/19560 | loss 3.290385 (-0.48z)| norm 0.2555 (-0.94z)| lr 5.44e-05 | 4161.17 ms | 32.4% bf16 MFU | 125873 tok/s step 15889/19560 | loss 3.327263 (+0.41z)| norm 0.2485 (-1.46z)| lr 5.44e-05 | 4151.66 ms | 32.5% bf16 MFU | 125893 tok/s step 15890/19560 | loss 3.308791 (-0.04z)| norm 0.2556 (-0.93z)| lr 5.44e-05 | 4158.71 ms | 32.5% bf16 MFU | 125902 tok/s step 15891/19560 
| loss 3.323622 (+0.32z)| norm 0.2681 (+0.00z)| lr 5.43e-05 | 4159.66 ms | 32.5% bf16 MFU | 125909 tok/s step 15892/19560 | loss 3.363801 (+1.29z)| norm 0.2537 (-1.06z)| lr 5.43e-05 | 4159.57 ms | 32.5% bf16 MFU | 125916 tok/s step 15893/19560 | loss 3.253096 (-1.39z)| norm 0.2676 (-0.04z)| lr 5.43e-05 | 4161.13 ms | 32.4% bf16 MFU | 125920 tok/s step 15894/19560 | loss 3.262776 (-1.14z)| norm 0.2602 (-0.60z)| lr 5.42e-05 | 4153.51 ms | 32.5% bf16 MFU | 125935 tok/s step 15895/19560 | loss 3.363158 (+1.26z)| norm 0.2554 (-0.95z)| lr 5.42e-05 | 4156.58 ms | 32.5% bf16 MFU | 125945 tok/s step 15896/19560 | loss 3.261618 (-1.15z)| norm 0.2636 (-0.34z)| lr 5.42e-05 | 4150.23 ms | 32.5% bf16 MFU | 125964 tok/s step 15897/19560 | loss 3.307609 (-0.03z)| norm 0.2577 (-0.78z)| lr 5.42e-05 | 4167.02 ms | 32.4% bf16 MFU | 125957 tok/s step 15898/19560 | loss 3.352170 (+1.10z)| norm 0.2573 (-0.80z)| lr 5.41e-05 | 4150.71 ms | 32.5% bf16 MFU | 125975 tok/s step 15899/19560 | loss 3.283618 (-0.66z)| norm 0.2679 (+0.00z)| lr 5.41e-05 | 4157.99 ms | 32.5% bf16 MFU | 125981 tok/s step 15900/19560 | loss 3.257055 (-1.33z)| norm 0.2649 (-0.23z)| lr 5.41e-05 | 4149.50 ms | 32.5% bf16 MFU | 125999 tok/s step 15901/19560 | loss 3.319857 (+0.30z)| norm 0.2679 (+0.01z)| lr 5.40e-05 | 4164.20 ms | 32.4% bf16 MFU | 125994 tok/s step 15902/19560 | loss 3.265346 (-1.10z)| norm 0.2695 (+0.13z)| lr 5.40e-05 | 4165.56 ms | 32.4% bf16 MFU | 125988 tok/s step 15903/19560 | loss 3.281467 (-0.68z)| norm 0.2848 (+1.29z)| lr 5.40e-05 | 4161.89 ms | 32.4% bf16 MFU | 125987 tok/s step 15904/19560 | loss 3.308788 (+0.04z)| norm 0.2718 (+0.29z)| lr 5.40e-05 | 4156.28 ms | 32.5% bf16 MFU | 125995 tok/s step 15905/19560 | loss 3.312980 (+0.16z)| norm 0.2681 (+0.01z)| lr 5.39e-05 | 4155.40 ms | 32.5% bf16 MFU | 126004 tok/s step 15906/19560 | loss 3.358851 (+1.35z)| norm 0.2721 (+0.32z)| lr 5.39e-05 | 4164.74 ms | 32.4% bf16 MFU | 125998 tok/s step 15907/19560 | loss 3.259505 (-1.27z)| norm 0.2701 (+0.18z)| 
lr 5.39e-05 | 4176.46 ms | 32.3% bf16 MFU | 125975 tok/s step 15908/19560 | loss 3.261932 (-1.19z)| norm 0.2638 (-0.31z)| lr 5.38e-05 | 4151.42 ms | 32.5% bf16 MFU | 125990 tok/s step 15909/19560 | loss 3.300120 (-0.14z)| norm 0.2492 (-1.43z)| lr 5.38e-05 | 4151.97 ms | 32.5% bf16 MFU | 126005 tok/s step 15910/19560 | loss 3.301059 (-0.11z)| norm 0.2590 (-0.67z)| lr 5.38e-05 | 4154.54 ms | 32.5% bf16 MFU | 126014 tok/s step 15911/19560 | loss 3.284824 (-0.55z)| norm 0.2771 (+0.71z)| lr 5.38e-05 | 4168.58 ms | 32.4% bf16 MFU | 126002 tok/s step 15912/19560 | loss 3.286212 (-0.51z)| norm 0.2677 (-0.02z)| lr 5.37e-05 | 4168.31 ms | 32.4% bf16 MFU | 125991 tok/s step 15913/19560 | loss 3.236269 (-1.87z)| norm 0.2579 (-0.78z)| lr 5.37e-05 | 4153.01 ms | 32.5% bf16 MFU | 126004 tok/s step 15914/19560 | loss 3.448481 (+3.68z)| norm 0.2545 (-1.04z)| lr 5.37e-05 | 4163.40 ms | 32.4% bf16 MFU | 126000 tok/s step 15915/19560 | loss 3.283392 (-0.60z)| norm 0.3782 (+6.77z)| lr 5.36e-05 | 4160.93 ms | 32.4% bf16 MFU | 126000 tok/s step 15916/19560 | loss 3.270164 (-0.94z)| norm 0.2668 (-0.12z)| lr 5.36e-05 | 4167.03 ms | 32.4% bf16 MFU | 125991 tok/s step 15917/19560 | loss 3.297729 (-0.21z)| norm 0.2765 (+0.47z)| lr 5.36e-05 | 4151.75 ms | 32.5% bf16 MFU | 126005 tok/s step 15918/19560 | loss 3.273992 (-0.82z)| norm 0.2786 (+0.59z)| lr 5.36e-05 | 4165.04 ms | 32.4% bf16 MFU | 125999 tok/s step 15919/19560 | loss 3.247564 (-1.49z)| norm 0.2518 (-1.07z)| lr 5.35e-05 | 4157.39 ms | 32.5% bf16 MFU | 126004 tok/s step 15920/19560 | loss 3.238719 (-1.72z)| norm 0.2724 (+0.20z)| lr 5.35e-05 | 4159.33 ms | 32.5% bf16 MFU | 126007 tok/s step 15921/19560 | loss 3.277439 (-0.70z)| norm 0.2745 (+0.32z)| lr 5.35e-05 | 4164.43 ms | 32.4% bf16 MFU | 126001 tok/s step 15922/19560 | loss 3.513260 (+4.91z)| norm 0.2897 (+1.25z)| lr 5.34e-05 | 4164.69 ms | 32.4% bf16 MFU | 125996 tok/s step 15923/19560 | loss 3.310932 (+0.13z)| norm 0.2679 (-0.10z)| lr 5.34e-05 | 4156.60 ms | 32.5% bf16 MFU | 
126003 tok/s step 15924/19560 | loss 3.281610 (-0.57z)| norm 0.2692 (-0.01z)| lr 5.34e-05 | 4152.42 ms | 32.5% bf16 MFU | 126016 tok/s step 15925/19560 | loss 3.387591 (+1.90z)| norm 0.2642 (-0.32z)| lr 5.34e-05 | 4165.52 ms | 32.4% bf16 MFU | 126008 tok/s step 15926/19560 | loss 3.324829 (+0.42z)| norm 0.2625 (-0.43z)| lr 5.33e-05 | 4156.58 ms | 32.5% bf16 MFU | 126014 tok/s step 15927/19560 | loss 3.301046 (-0.14z)| norm 0.2690 (-0.02z)| lr 5.33e-05 | 4154.05 ms | 32.5% bf16 MFU | 126024 tok/s step 15928/19560 | loss 3.307614 (+0.01z)| norm 0.2606 (-0.54z)| lr 5.33e-05 | 4157.35 ms | 32.5% bf16 MFU | 126028 tok/s step 15929/19560 | loss 3.259094 (-1.12z)| norm 0.2583 (-0.69z)| lr 5.32e-05 | 4155.10 ms | 32.5% bf16 MFU | 126036 tok/s step 15930/19560 | loss 3.286309 (-0.49z)| norm 0.2685 (-0.03z)| lr 5.32e-05 | 4177.20 ms | 32.3% bf16 MFU | 126010 tok/s step 15931/19560 | loss 3.298694 (-0.20z)| norm 0.2618 (-0.46z)| lr 5.32e-05 | 4162.37 ms | 32.4% bf16 MFU | 126007 tok/s step 15932/19560 | loss 3.313420 (+0.15z)| norm 0.2643 (-0.30z)| lr 5.32e-05 | 4160.71 ms | 32.5% bf16 MFU | 126007 tok/s step 15933/19560 | loss 3.267933 (-0.92z)| norm 0.2556 (-0.85z)| lr 5.31e-05 | 4199.13 ms | 32.2% bf16 MFU | 125950 tok/s step 15934/19560 | loss 3.273566 (-0.78z)| norm 0.2472 (-1.39z)| lr 5.31e-05 | 4184.63 ms | 32.3% bf16 MFU | 125917 tok/s step 15935/19560 | loss 3.310776 (+0.09z)| norm 0.2591 (-0.62z)| lr 5.31e-05 | 4160.87 ms | 32.4% bf16 MFU | 125921 tok/s step 15936/19560 | loss 3.286313 (-0.49z)| norm 0.2519 (-1.07z)| lr 5.31e-05 | 4162.97 ms | 32.4% bf16 MFU | 125922 tok/s step 15937/19560 | loss 3.301686 (-0.13z)| norm 0.2470 (-1.36z)| lr 5.30e-05 | 4154.83 ms | 32.5% bf16 MFU | 125935 tok/s step 15938/19560 | loss 3.264658 (-1.00z)| norm 0.2587 (-0.62z)| lr 5.30e-05 | 4151.16 ms | 32.5% bf16 MFU | 125954 tok/s step 15939/19560 | loss 3.320475 (+0.32z)| norm 0.2761 (+0.47z)| lr 5.30e-05 | 4153.56 ms | 32.5% bf16 MFU | 125967 tok/s step 15940/19560 | loss 3.259067 
(-1.11z)| norm 0.2709 (+0.15z)| lr 5.29e-05 | 4148.14 ms | 32.5% bf16 MFU | 125988 tok/s step 15941/19560 | loss 3.295213 (-0.25z)| norm 0.2728 (+0.28z)| lr 5.29e-05 | 4153.64 ms | 32.5% bf16 MFU | 126000 tok/s step 15942/19560 | loss 3.342618 (+0.86z)| norm 0.2726 (+0.27z)| lr 5.29e-05 | 4148.09 ms | 32.5% bf16 MFU | 126020 tok/s step 15943/19560 | loss 3.287007 (-0.45z)| norm 0.2801 (+0.74z)| lr 5.29e-05 | 4150.61 ms | 32.5% bf16 MFU | 126035 tok/s step 15944/19560 | loss 3.234429 (-1.67z)| norm 0.2682 (-0.01z)| lr 5.28e-05 | 4150.49 ms | 32.5% bf16 MFU | 126049 tok/s step 15945/19560 | loss 3.349724 (+1.04z)| norm 0.2698 (+0.08z)| lr 5.28e-05 | 4148.01 ms | 32.5% bf16 MFU | 126066 tok/s step 15946/19560 | loss 3.359586 (+1.26z)| norm 0.2734 (+0.31z)| lr 5.28e-05 | 4150.73 ms | 32.5% bf16 MFU | 126078 tok/s step 15947/19560 | loss 3.286612 (-0.45z)| norm 0.2792 (+0.67z)| lr 5.27e-05 | 4147.55 ms | 32.6% bf16 MFU | 126095 tok/s step 15948/19560 | loss 3.320520 (+0.35z)| norm 0.2696 (+0.06z)| lr 5.27e-05 | 4152.47 ms | 32.5% bf16 MFU | 126103 tok/s step 15949/19560 | loss 3.343981 (+0.88z)| norm 0.2843 (+0.98z)| lr 5.27e-05 | 4148.30 ms | 32.5% bf16 MFU | 126117 tok/s step 15950/19560 | loss 3.296418 (-0.21z)| norm 0.2849 (+1.03z)| lr 5.27e-05 | 4154.98 ms | 32.5% bf16 MFU | 126121 tok/s step 15951/19560 | loss 3.274324 (-0.73z)| norm 0.2565 (-0.78z)| lr 5.26e-05 | 4146.12 ms | 32.6% bf16 MFU | 126137 tok/s step 15952/19560 | loss 3.314072 (+0.21z)| norm 0.2683 (-0.03z)| lr 5.26e-05 | 4149.91 ms | 32.5% bf16 MFU | 126147 tok/s step 15953/19560 | loss 3.231712 (-1.70z)| norm 0.2924 (+1.48z)| lr 5.26e-05 | 4152.16 ms | 32.5% bf16 MFU | 126153 tok/s step 15954/19560 | loss 3.354756 (+1.16z)| norm 0.2743 (+0.33z)| lr 5.25e-05 | 4147.29 ms | 32.6% bf16 MFU | 126166 tok/s step 15955/19560 | loss 3.329148 (+0.59z)| norm 0.2689 (-0.01z)| lr 5.25e-05 | 4148.77 ms | 32.5% bf16 MFU | 126177 tok/s step 15956/19560 | loss 3.371219 (+1.57z)| norm 0.2773 (+0.52z)| lr 5.25e-05 | 
4152.06 ms | 32.5% bf16 MFU | 126182 tok/s step 15957/19560 | loss 3.297800 (-0.17z)| norm 0.2861 (+1.09z)| lr 5.25e-05 | 4147.71 ms | 32.6% bf16 MFU | 126193 tok/s step 15958/19560 | loss 3.253490 (-1.20z)| norm 0.2779 (+0.56z)| lr 5.24e-05 | 4152.25 ms | 32.5% bf16 MFU | 126196 tok/s step 15959/19560 | loss 3.323653 (+0.45z)| norm 0.2672 (-0.12z)| lr 5.24e-05 | 4151.93 ms | 32.5% bf16 MFU | 126200 tok/s step 15960/19560 | loss 3.312432 (+0.18z)| norm 0.2632 (-0.37z)| lr 5.24e-05 | 4149.21 ms | 32.5% bf16 MFU | 126208 tok/s step 15961/19560 | loss 3.300410 (-0.10z)| norm 0.2602 (-0.55z)| lr 5.23e-05 | 4148.68 ms | 32.5% bf16 MFU | 126217 tok/s step 15962/19560 | loss 3.319775 (+0.35z)| norm 0.2630 (-0.37z)| lr 5.23e-05 | 4150.56 ms | 32.5% bf16 MFU | 126222 tok/s step 15963/19560 | loss 3.335399 (+0.71z)| norm 0.2795 (+0.69z)| lr 5.23e-05 | 4151.42 ms | 32.5% bf16 MFU | 126225 tok/s step 15964/19560 | loss 3.262541 (-0.99z)| norm 0.2546 (-0.90z)| lr 5.23e-05 | 4146.87 ms | 32.6% bf16 MFU | 126235 tok/s step 15965/19560 | loss 3.266637 (-0.89z)| norm 0.2503 (-1.15z)| lr 5.22e-05 | 4149.78 ms | 32.5% bf16 MFU | 126241 tok/s step 15966/19560 | loss 3.331071 (+0.61z)| norm 0.2571 (-0.72z)| lr 5.22e-05 | 4148.26 ms | 32.5% bf16 MFU | 126248 tok/s step 15967/19560 | loss 3.271589 (-0.77z)| norm 0.2605 (-0.49z)| lr 5.22e-05 | 4147.62 ms | 32.6% bf16 MFU | 126256 tok/s step 15968/19560 | loss 3.243263 (-1.40z)| norm 0.2446 (-1.48z)| lr 5.21e-05 | 4149.54 ms | 32.5% bf16 MFU | 126260 tok/s step 15969/19560 | loss 3.320699 (+0.39z)| norm 0.2568 (-0.70z)| lr 5.21e-05 | 4149.80 ms | 32.5% bf16 MFU | 126264 tok/s step 15970/19560 | loss 3.300476 (-0.07z)| norm 0.2554 (-0.78z)| lr 5.21e-05 | 4149.94 ms | 32.5% bf16 MFU | 126268 tok/s step 15971/19560 | loss 3.271555 (-0.73z)| norm 0.2529 (-0.92z)| lr 5.21e-05 | 4145.45 ms | 32.6% bf16 MFU | 126278 tok/s step 15972/19560 | loss 3.266005 (-0.86z)| norm 0.2546 (-0.80z)| lr 5.20e-05 | 4148.88 ms | 32.5% bf16 MFU | 126283 tok/s step 
15973/19560 | loss 3.336394 (+0.79z)| norm 0.2455 (-1.36z)| lr 5.20e-05 | 4146.20 ms | 32.6% bf16 MFU | 126291 tok/s step 15974/19560 | loss 3.229143 (-1.70z)| norm 0.3419 (+4.48z)| lr 5.20e-05 | 4149.55 ms | 32.5% bf16 MFU | 126294 tok/s step 15975/19560 | loss 3.316931 (+0.35z)| norm 0.2524 (-0.87z)| lr 5.19e-05 | 4147.13 ms | 32.6% bf16 MFU | 126300 tok/s step 15976/19560 | loss 3.316954 (+0.35z)| norm 0.2576 (-0.55z)| lr 5.19e-05 | 4149.09 ms | 32.5% bf16 MFU | 126304 tok/s step 15977/19560 | loss 3.326965 (+0.58z)| norm 0.2691 (+0.16z)| lr 5.19e-05 | 4150.48 ms | 32.5% bf16 MFU | 126304 tok/s step 15978/19560 | loss 3.327504 (+0.60z)| norm 0.2650 (-0.07z)| lr 5.19e-05 | 4149.07 ms | 32.5% bf16 MFU | 126307 tok/s step 15979/19560 | loss 3.288933 (-0.29z)| norm 0.2618 (-0.27z)| lr 5.18e-05 | 4147.08 ms | 32.6% bf16 MFU | 126313 tok/s step 15980/19560 | loss 3.340548 (+0.93z)| norm 0.2621 (-0.25z)| lr 5.18e-05 | 4147.29 ms | 32.6% bf16 MFU | 126318 tok/s step 15981/19560 | loss 3.303907 (+0.07z)| norm 0.2632 (-0.17z)| lr 5.18e-05 | 4144.94 ms | 32.6% bf16 MFU | 126327 tok/s step 15982/19560 | loss 3.297477 (-0.09z)| norm 0.3148 (+2.98z)| lr 5.18e-05 | 4147.58 ms | 32.6% bf16 MFU | 126331 tok/s step 15983/19560 | loss 3.275121 (-0.62z)| norm 0.2825 (+0.98z)| lr 5.17e-05 | 4145.74 ms | 32.6% bf16 MFU | 126338 tok/s step 15984/19560 | loss 3.336610 (+0.84z)| norm 0.2741 (+0.46z)| lr 5.17e-05 | 4147.56 ms | 32.6% bf16 MFU | 126341 tok/s step 15985/19560 | loss 3.317901 (+0.41z)| norm 0.2816 (+0.90z)| lr 5.17e-05 | 4148.86 ms | 32.5% bf16 MFU | 126342 tok/s step 15986/19560 | loss 3.293299 (-0.18z)| norm 0.2667 (-0.01z)| lr 5.16e-05 | 4150.04 ms | 32.5% bf16 MFU | 126342 tok/s step 15987/19560 | loss 3.275145 (-0.62z)| norm 0.2667 (-0.01z)| lr 5.16e-05 | 4147.18 ms | 32.6% bf16 MFU | 126346 tok/s step 15988/19560 | loss 3.294694 (-0.17z)| norm 0.2633 (-0.22z)| lr 5.16e-05 | 4149.53 ms | 32.5% bf16 MFU | 126346 tok/s step 15989/19560 | loss 3.377472 (+1.81z)| norm 
0.2695 (+0.17z)| lr 5.16e-05 | 4150.50 ms | 32.5% bf16 MFU | 126345 tok/s step 15990/19560 | loss 3.278579 (-0.57z)| norm 0.2607 (-0.37z)| lr 5.15e-05 | 4146.51 ms | 32.6% bf16 MFU | 126350 tok/s step 15991/19560 | loss 3.354022 (+1.23z)| norm 0.2737 (+0.43z)| lr 5.15e-05 | 4147.82 ms | 32.6% bf16 MFU | 126352 tok/s step 15992/19560 | loss 3.350775 (+1.14z)| norm 0.2766 (+0.61z)| lr 5.15e-05 | 4149.60 ms | 32.5% bf16 MFU | 126352 tok/s step 15993/19560 | loss 3.341259 (+0.89z)| norm 0.2808 (+0.86z)| lr 5.14e-05 | 4144.80 ms | 32.6% bf16 MFU | 126359 tok/s step 15994/19560 | loss 3.281938 (-0.51z)| norm 0.2486 (-1.13z)| lr 5.14e-05 | 4147.55 ms | 32.6% bf16 MFU | 126361 tok/s step 15995/19560 | loss 3.320349 (+0.40z)| norm 0.6765 (+10.28z)| lr 5.14e-05 | 4146.36 ms | 32.6% bf16 MFU | 126366 tok/s step 15996/19560 | loss 3.345544 (+1.00z)| norm 0.2711 (+0.02z)| lr 5.14e-05 | 4148.82 ms | 32.5% bf16 MFU | 126366 tok/s step 15997/19560 | loss 3.324996 (+0.51z)| norm 0.2710 (+0.02z)| lr 5.13e-05 | 4148.19 ms | 32.5% bf16 MFU | 126367 tok/s step 15998/19560 | loss 3.282326 (-0.52z)| norm 0.2595 (-0.27z)| lr 5.13e-05 | 4147.71 ms | 32.6% bf16 MFU | 126369 tok/s step 15999/19560 | loss 3.312332 (+0.21z)| norm 0.2692 (-0.03z)| lr 5.13e-05 | 4146.86 ms | 32.6% bf16 MFU | 126372 tok/s step 16000/19560 | loss 3.316666 (+0.31z)| norm 0.2624 (-0.20z)| lr 5.12e-05 | 4148.22 ms | 32.5% bf16 MFU | 126373 tok/s val loss 3.283572 evaluating HellaSwag: 0/1256 evaluating HellaSwag: 10/1256 evaluating HellaSwag: 20/1256 evaluating HellaSwag: 30/1256 evaluating HellaSwag: 40/1256 evaluating HellaSwag: 50/1256 evaluating HellaSwag: 60/1256 evaluating HellaSwag: 70/1256 evaluating HellaSwag: 80/1256 evaluating HellaSwag: 90/1256 evaluating HellaSwag: 100/1256 evaluating HellaSwag: 110/1256 evaluating HellaSwag: 120/1256 evaluating HellaSwag: 130/1256 evaluating HellaSwag: 140/1256 evaluating HellaSwag: 150/1256 evaluating HellaSwag: 160/1256 evaluating HellaSwag: 170/1256 evaluating 
HellaSwag: 180/1256 evaluating HellaSwag: 190/1256 evaluating HellaSwag: 200/1256 evaluating HellaSwag: 210/1256 evaluating HellaSwag: 220/1256 evaluating HellaSwag: 230/1256 evaluating HellaSwag: 240/1256 evaluating HellaSwag: 250/1256 evaluating HellaSwag: 260/1256 evaluating HellaSwag: 270/1256 evaluating HellaSwag: 280/1256 evaluating HellaSwag: 290/1256 evaluating HellaSwag: 300/1256 evaluating HellaSwag: 310/1256 evaluating HellaSwag: 320/1256 evaluating HellaSwag: 330/1256 evaluating HellaSwag: 340/1256 evaluating HellaSwag: 350/1256 evaluating HellaSwag: 360/1256 evaluating HellaSwag: 370/1256 evaluating HellaSwag: 380/1256 evaluating HellaSwag: 390/1256 evaluating HellaSwag: 400/1256 evaluating HellaSwag: 410/1256 evaluating HellaSwag: 420/1256 evaluating HellaSwag: 430/1256 evaluating HellaSwag: 440/1256 evaluating HellaSwag: 450/1256 evaluating HellaSwag: 460/1256 evaluating HellaSwag: 470/1256 evaluating HellaSwag: 480/1256 evaluating HellaSwag: 490/1256 evaluating HellaSwag: 500/1256 evaluating HellaSwag: 510/1256 evaluating HellaSwag: 520/1256 evaluating HellaSwag: 530/1256 evaluating HellaSwag: 540/1256 evaluating HellaSwag: 550/1256 evaluating HellaSwag: 560/1256 evaluating HellaSwag: 570/1256 evaluating HellaSwag: 580/1256 evaluating HellaSwag: 590/1256 evaluating HellaSwag: 600/1256 evaluating HellaSwag: 610/1256 evaluating HellaSwag: 620/1256 evaluating HellaSwag: 630/1256 evaluating HellaSwag: 640/1256 evaluating HellaSwag: 650/1256 evaluating HellaSwag: 660/1256 evaluating HellaSwag: 670/1256 evaluating HellaSwag: 680/1256 evaluating HellaSwag: 690/1256 evaluating HellaSwag: 700/1256 evaluating HellaSwag: 710/1256 evaluating HellaSwag: 720/1256 evaluating HellaSwag: 730/1256 evaluating HellaSwag: 740/1256 evaluating HellaSwag: 750/1256 evaluating HellaSwag: 760/1256 evaluating HellaSwag: 770/1256 evaluating HellaSwag: 780/1256 evaluating HellaSwag: 790/1256 evaluating HellaSwag: 800/1256 evaluating HellaSwag: 810/1256 evaluating HellaSwag: 
820/1256 evaluating HellaSwag: 830/1256 evaluating HellaSwag: 840/1256 evaluating HellaSwag: 850/1256 evaluating HellaSwag: 860/1256 evaluating HellaSwag: 870/1256 evaluating HellaSwag: 880/1256 evaluating HellaSwag: 890/1256 evaluating HellaSwag: 900/1256 evaluating HellaSwag: 910/1256 evaluating HellaSwag: 920/1256 evaluating HellaSwag: 930/1256 evaluating HellaSwag: 940/1256 evaluating HellaSwag: 950/1256 evaluating HellaSwag: 960/1256 evaluating HellaSwag: 970/1256 evaluating HellaSwag: 980/1256 evaluating HellaSwag: 990/1256 evaluating HellaSwag: 1000/1256 evaluating HellaSwag: 1010/1256 evaluating HellaSwag: 1020/1256 evaluating HellaSwag: 1030/1256 evaluating HellaSwag: 1040/1256 evaluating HellaSwag: 1050/1256 evaluating HellaSwag: 1060/1256 evaluating HellaSwag: 1070/1256 evaluating HellaSwag: 1080/1256 evaluating HellaSwag: 1090/1256 evaluating HellaSwag: 1100/1256 evaluating HellaSwag: 1110/1256 evaluating HellaSwag: 1120/1256 evaluating HellaSwag: 1130/1256 evaluating HellaSwag: 1140/1256 evaluating HellaSwag: 1150/1256 evaluating HellaSwag: 1160/1256 evaluating HellaSwag: 1170/1256 evaluating HellaSwag: 1180/1256 evaluating HellaSwag: 1190/1256 evaluating HellaSwag: 1200/1256 evaluating HellaSwag: 1210/1256 evaluating HellaSwag: 1220/1256 evaluating HellaSwag: 1230/1256 evaluating HellaSwag: 1240/1256 evaluating HellaSwag: 1250/1256 HellaSwag: 3015/10042 = 0.300239 step 16001/19560 | loss 3.331651 (+0.66z)| norm 0.2630 (-0.18z)| lr 5.12e-05 | 4147.89 ms | 32.6% bf16 MFU | 126374 tok/s step 16002/19560 | loss 3.327937 (+0.56z)| norm 0.2609 (-0.24z)| lr 5.12e-05 | 4143.06 ms | 32.6% bf16 MFU | 126383 tok/s step 16003/19560 | loss 3.312329 (+0.18z)| norm 0.3018 (+0.79z)| lr 5.12e-05 | 4147.94 ms | 32.6% bf16 MFU | 126383 tok/s step 16004/19560 | loss 3.340006 (+0.86z)| norm 0.2927 (+0.55z)| lr 5.11e-05 | 4144.39 ms | 32.6% bf16 MFU | 126390 tok/s step 16005/19560 | loss 3.358220 (+1.28z)| norm 0.2625 (-0.21z)| lr 5.11e-05 | 4147.02 ms | 32.6% bf16 MFU | 
126391 tok/s step 16006/19560 | loss 3.297256 (-0.19z)| norm 0.2851 (+0.36z)| lr 5.11e-05 | 4148.79 ms | 32.5% bf16 MFU | 126390 tok/s step 16007/19560 | loss 3.307270 (+0.06z)| norm 0.2614 (-0.24z)| lr 5.11e-05 | 4148.95 ms | 32.5% bf16 MFU | 126389 tok/s step 16008/19560 | loss 3.332498 (+0.66z)| norm 0.2695 (-0.04z)| lr 5.10e-05 | 4145.26 ms | 32.6% bf16 MFU | 126394 tok/s step 16009/19560 | loss 3.305397 (-0.01z)| norm 0.2763 (+0.13z)| lr 5.10e-05 | 4141.00 ms | 32.6% bf16 MFU | 126404 tok/s step 16010/19560 | loss 3.330423 (+0.61z)| norm 0.2759 (+0.12z)| lr 5.10e-05 | 4143.83 ms | 32.6% bf16 MFU | 126410 tok/s step 16011/19560 | loss 3.289527 (-0.41z)| norm 0.2779 (+0.17z)| lr 5.09e-05 | 4145.45 ms | 32.6% bf16 MFU | 126413 tok/s step 16012/19560 | loss 3.276098 (-0.74z)| norm 0.2485 (-0.57z)| lr 5.09e-05 | 4145.11 ms | 32.6% bf16 MFU | 126417 tok/s step 16013/19560 | loss 3.287760 (-0.44z)| norm 0.2739 (+0.06z)| lr 5.09e-05 | 4145.63 ms | 32.6% bf16 MFU | 126419 tok/s step 16014/19560 | loss 3.238845 (-1.63z)| norm 0.2567 (-0.37z)| lr 5.09e-05 | 4141.61 ms | 32.6% bf16 MFU | 126428 tok/s step 16015/19560 | loss 3.289577 (-0.37z)| norm 0.2764 (+0.12z)| lr 5.08e-05 | 4147.76 ms | 32.6% bf16 MFU | 126427 tok/s step 16016/19560 | loss 3.264567 (-0.98z)| norm 0.2499 (-0.55z)| lr 5.08e-05 | 4146.78 ms | 32.6% bf16 MFU | 126427 tok/s step 16017/19560 | loss 3.262353 (-1.02z)| norm 0.2517 (-0.50z)| lr 5.08e-05 | 4147.67 ms | 32.6% bf16 MFU | 126426 tok/s step 16018/19560 | loss 3.284267 (-0.48z)| norm 0.2717 (+0.00z)| lr 5.07e-05 | 4147.75 ms | 32.6% bf16 MFU | 126425 tok/s step 16019/19560 | loss 3.298027 (-0.14z)| norm 0.2542 (-0.44z)| lr 5.07e-05 | 4145.77 ms | 32.6% bf16 MFU | 126427 tok/s step 16020/19560 | loss 3.250750 (-1.27z)| norm 0.2552 (-0.41z)| lr 5.07e-05 | 4145.08 ms | 32.6% bf16 MFU | 126430 tok/s step 16021/19560 | loss 3.272656 (-0.74z)| norm 0.2442 (-0.69z)| lr 5.07e-05 | 4146.36 ms | 32.6% bf16 MFU | 126430 tok/s step 16022/19560 | loss 3.263643 
(-0.97z)| norm 0.2559 (-0.39z)| lr 5.06e-05 | 4144.93 ms | 32.6% bf16 MFU | 126433 tok/s step 16023/19560 | loss 3.326595 (+0.60z)| norm 0.2541 (-0.44z)| lr 5.06e-05 | 4148.65 ms | 32.5% bf16 MFU | 126430 tok/s step 16024/19560 | loss 3.309408 (+0.16z)| norm 0.2556 (-0.40z)| lr 5.06e-05 | 4145.10 ms | 32.6% bf16 MFU | 126433 tok/s step 16025/19560 | loss 3.310168 (+0.18z)| norm 0.2573 (-0.35z)| lr 5.06e-05 | 4151.30 ms | 32.5% bf16 MFU | 126426 tok/s step 16026/19560 | loss 3.277348 (-0.63z)| norm 0.2713 (-0.00z)| lr 5.05e-05 | 4149.45 ms | 32.5% bf16 MFU | 126422 tok/s step 16027/19560 | loss 3.298771 (-0.09z)| norm 0.2556 (-0.40z)| lr 5.05e-05 | 4147.01 ms | 32.6% bf16 MFU | 126423 tok/s step 16028/19560 | loss 3.317228 (+0.36z)| norm 0.2634 (-0.20z)| lr 5.05e-05 | 4150.21 ms | 32.5% bf16 MFU | 126418 tok/s step 16029/19560 | loss 3.320750 (+0.45z)| norm 0.2537 (-0.44z)| lr 5.04e-05 | 4150.71 ms | 32.5% bf16 MFU | 126413 tok/s step 16030/19560 | loss 3.354050 (+1.26z)| norm 0.2919 (+0.52z)| lr 5.04e-05 | 4148.44 ms | 32.5% bf16 MFU | 126411 tok/s step 16031/19560 | loss 3.332036 (+0.70z)| norm 0.2562 (-0.37z)| lr 5.04e-05 | 4148.46 ms | 32.5% bf16 MFU | 126410 tok/s step 16032/19560 | loss 3.232591 (-1.75z)| norm 0.2604 (-0.27z)| lr 5.04e-05 | 4147.71 ms | 32.6% bf16 MFU | 126409 tok/s step 16033/19560 | loss 3.290858 (-0.31z)| norm 0.2524 (-0.47z)| lr 5.03e-05 | 4145.43 ms | 32.6% bf16 MFU | 126413 tok/s step 16034/19560 | loss 3.258400 (-1.09z)| norm 0.2628 (-0.20z)| lr 5.03e-05 | 4150.70 ms | 32.5% bf16 MFU | 126408 tok/s step 16035/19560 | loss 3.315329 (+0.31z)| norm 0.2662 (-0.12z)| lr 5.03e-05 | 4151.98 ms | 32.5% bf16 MFU | 126401 tok/s step 16036/19560 | loss 3.328453 (+0.63z)| norm 0.2542 (-0.41z)| lr 5.02e-05 | 4149.32 ms | 32.5% bf16 MFU | 126399 tok/s step 16037/19560 | loss 3.348588 (+1.11z)| norm 0.2643 (-0.16z)| lr 5.02e-05 | 4152.13 ms | 32.5% bf16 MFU | 126392 tok/s step 16038/19560 | loss 3.200626 (-2.48z)| norm 0.2670 (-0.10z)| lr 5.02e-05 | 
4153.18 ms | 32.5% bf16 MFU | 126384 tok/s step 16039/19560 | loss 3.323661 (+0.49z)| norm 0.2501 (-0.52z)| lr 5.02e-05 | 4148.71 ms | 32.5% bf16 MFU | 126384 tok/s step 16040/19560 | loss 3.255338 (-1.15z)| norm 0.2565 (-0.35z)| lr 5.01e-05 | 4150.14 ms | 32.5% bf16 MFU | 126381 tok/s step 16041/19560 | loss 3.298991 (-0.11z)| norm 0.2524 (-0.46z)| lr 5.01e-05 | 4146.64 ms | 32.6% bf16 MFU | 126384 tok/s step 16042/19560 | loss 3.258163 (-1.12z)| norm 0.2507 (-0.50z)| lr 5.01e-05 | 4150.77 ms | 32.5% bf16 MFU | 126380 tok/s step 16043/19560 | loss 3.294148 (-0.21z)| norm 0.2675 (-0.06z)| lr 5.01e-05 | 4150.65 ms | 32.5% bf16 MFU | 126377 tok/s step 16044/19560 | loss 3.319494 (+0.43z)| norm 0.2572 (-0.32z)| lr 5.00e-05 | 4148.45 ms | 32.5% bf16 MFU | 126377 tok/s step 16045/19560 | loss 3.291443 (-0.28z)| norm 0.2707 (+0.03z)| lr 5.00e-05 | 4152.69 ms | 32.5% bf16 MFU | 126371 tok/s step 16046/19560 | loss 3.297613 (-0.13z)| norm 0.2536 (-0.41z)| lr 5.00e-05 | 4149.96 ms | 32.5% bf16 MFU | 126369 tok/s step 16047/19560 | loss 3.304974 (+0.05z)| norm 0.2744 (+0.13z)| lr 4.99e-05 | 4150.79 ms | 32.5% bf16 MFU | 126366 tok/s step 16048/19560 | loss 3.287010 (-0.43z)| norm 0.2579 (-0.30z)| lr 4.99e-05 | 4148.98 ms | 32.5% bf16 MFU | 126366 tok/s step 16049/19560 | loss 3.288451 (-0.40z)| norm 0.2594 (-0.25z)| lr 4.99e-05 | 4150.72 ms | 32.5% bf16 MFU | 126364 tok/s step 16050/19560 | loss 3.264971 (-1.10z)| norm 0.2460 (-0.59z)| lr 4.99e-05 | 4152.00 ms | 32.5% bf16 MFU | 126359 tok/s step 16051/19560 | loss 3.305649 (+0.12z)| norm 0.2670 (-0.05z)| lr 4.98e-05 | 4152.62 ms | 32.5% bf16 MFU | 126354 tok/s step 16052/19560 | loss 3.307615 (+0.17z)| norm 0.2533 (-0.40z)| lr 4.98e-05 | 4149.55 ms | 32.5% bf16 MFU | 126354 tok/s step 16053/19560 | loss 3.293262 (-0.24z)| norm 0.2538 (-0.38z)| lr 4.98e-05 | 4152.31 ms | 32.5% bf16 MFU | 126349 tok/s step 16054/19560 | loss 3.315579 (+0.45z)| norm 0.2792 (+0.27z)| lr 4.97e-05 | 4156.03 ms | 32.5% bf16 MFU | 126339 tok/s step 
16055/19560 | loss 3.400184 (+2.92z)| norm 0.2670 (-0.05z)| lr 4.97e-05 | 4153.31 ms | 32.5% bf16 MFU | 126334 tok/s step 16056/19560 | loss 3.267272 (-1.01z)| norm 0.2793 (+0.27z)| lr 4.97e-05 | 4150.66 ms | 32.5% bf16 MFU | 126333 tok/s step 16057/19560 | loss 3.277012 (-0.73z)| norm 0.2730 (+0.10z)| lr 4.97e-05 | 4151.18 ms | 32.5% bf16 MFU | 126331 tok/s step 16058/19560 | loss 3.229373 (-2.10z)| norm 0.2646 (-0.12z)| lr 4.96e-05 | 4151.03 ms | 32.5% bf16 MFU | 126330 tok/s step 16059/19560 | loss 3.283873 (-0.50z)| norm 0.2854 (+0.42z)| lr 4.96e-05 | 4154.81 ms | 32.5% bf16 MFU | 126323 tok/s step 16060/19560 | loss 3.282286 (-0.54z)| norm 0.2725 (+0.08z)| lr 4.96e-05 | 4150.41 ms | 32.5% bf16 MFU | 126323 tok/s step 16061/19560 | loss 3.305997 (+0.14z)| norm 0.2545 (-0.38z)| lr 4.96e-05 | 4153.79 ms | 32.5% bf16 MFU | 126318 tok/s step 16062/19560 | loss 3.218543 (-2.35z)| norm 0.2727 (+0.08z)| lr 4.95e-05 | 4149.88 ms | 32.5% bf16 MFU | 126319 tok/s step 16063/19560 | loss 3.269484 (-0.88z)| norm 0.2645 (-0.13z)| lr 4.95e-05 | 4151.09 ms | 32.5% bf16 MFU | 126318 tok/s step 16064/19560 | loss 3.332819 (+0.91z)| norm 0.2630 (-0.17z)| lr 4.95e-05 | 4149.95 ms | 32.5% bf16 MFU | 126319 tok/s step 16065/19560 | loss 3.301055 (+0.01z)| norm 0.2558 (-0.36z)| lr 4.94e-05 | 4154.53 ms | 32.5% bf16 MFU | 126313 tok/s step 16066/19560 | loss 3.301840 (+0.02z)| norm 0.2619 (-0.20z)| lr 4.94e-05 | 4151.59 ms | 32.5% bf16 MFU | 126311 tok/s step 16067/19560 | loss 3.322881 (+0.63z)| norm 0.2606 (-0.23z)| lr 4.94e-05 | 4151.99 ms | 32.5% bf16 MFU | 126309 tok/s step 16068/19560 | loss 3.303023 (+0.05z)| norm 0.2771 (+0.19z)| lr 4.94e-05 | 4150.97 ms | 32.5% bf16 MFU | 126309 tok/s step 16069/19560 | loss 3.358605 (+1.62z)| norm 0.2579 (-0.30z)| lr 4.93e-05 | 4150.88 ms | 32.5% bf16 MFU | 126309 tok/s step 16070/19560 | loss 3.339308 (+1.07z)| norm 0.2608 (-0.22z)| lr 4.93e-05 | 4152.54 ms | 32.5% bf16 MFU | 126306 tok/s step 16071/19560 | loss 3.300720 (-0.04z)| norm 
0.2627 (-0.17z)| lr 4.93e-05 | 4153.65 ms | 32.5% bf16 MFU | 126302 tok/s step 16072/19560 | loss 3.275266 (-0.78z)| norm 0.2618 (-0.19z)| lr 4.93e-05 | 4152.14 ms | 32.5% bf16 MFU | 126301 tok/s step 16073/19560 | loss 3.318343 (+0.47z)| norm 0.2874 (+0.47z)| lr 4.92e-05 | 4153.98 ms | 32.5% bf16 MFU | 126296 tok/s step 16074/19560 | loss 3.316624 (+0.44z)| norm 0.2908 (+0.55z)| lr 4.92e-05 | 4149.64 ms | 32.5% bf16 MFU | 126299 tok/s step 16075/19560 | loss 3.270478 (-0.91z)| norm 0.2802 (+0.27z)| lr 4.92e-05 | 4152.01 ms | 32.5% bf16 MFU | 126297 tok/s step 16076/19560 | loss 3.309002 (+0.22z)| norm 0.2650 (-0.12z)| lr 4.91e-05 | 4157.31 ms | 32.5% bf16 MFU | 126288 tok/s step 16077/19560 | loss 3.307331 (+0.18z)| norm 0.2760 (+0.17z)| lr 4.91e-05 | 4156.33 ms | 32.5% bf16 MFU | 126281 tok/s step 16078/19560 | loss 3.400414 (+2.81z)| norm 0.2787 (+0.24z)| lr 4.91e-05 | 4151.13 ms | 32.5% bf16 MFU | 126282 tok/s step 16079/19560 | loss 3.307536 (+0.15z)| norm 0.2627 (-0.17z)| lr 4.91e-05 | 4152.84 ms | 32.5% bf16 MFU | 126280 tok/s step 16080/19560 | loss 3.327867 (+0.73z)| norm 0.2637 (-0.15z)| lr 4.90e-05 | 4151.68 ms | 32.5% bf16 MFU | 126280 tok/s step 16081/19560 | loss 3.300380 (-0.07z)| norm 0.2803 (+0.28z)| lr 4.90e-05 | 4150.51 ms | 32.5% bf16 MFU | 126282 tok/s step 16082/19560 | loss 3.331496 (+0.84z)| norm 0.2648 (-0.12z)| lr 4.90e-05 | 4151.75 ms | 32.5% bf16 MFU | 126282 tok/s step 16083/19560 | loss 3.393923 (+2.59z)| norm 0.2567 (-0.32z)| lr 4.90e-05 | 4149.56 ms | 32.5% bf16 MFU | 126285 tok/s step 16084/19560 | loss 3.292483 (-0.29z)| norm 0.2644 (-0.12z)| lr 4.89e-05 | 4152.24 ms | 32.5% bf16 MFU | 126284 tok/s step 16085/19560 | loss 3.322462 (+0.57z)| norm 0.2698 (+0.02z)| lr 4.89e-05 | 4154.42 ms | 32.5% bf16 MFU | 126280 tok/s step 16086/19560 | loss 3.270934 (-0.93z)| norm 0.2635 (-0.14z)| lr 4.89e-05 | 4153.60 ms | 32.5% bf16 MFU | 126278 tok/s step 16087/19560 | loss 3.294631 (-0.24z)| norm 0.2549 (-0.36z)| lr 4.88e-05 | 4153.13 ms | 
32.5% bf16 MFU | 126276 tok/s step 16088/19560 | loss 3.303634 (+0.03z)| norm 0.2671 (-0.04z)| lr 4.88e-05 | 4152.88 ms | 32.5% bf16 MFU | 126274 tok/s step 16089/19560 | loss 3.297951 (-0.14z)| norm 0.2523 (-0.42z)| lr 4.88e-05 | 4157.59 ms | 32.5% bf16 MFU | 126266 tok/s step 16090/19560 | loss 3.352049 (+1.42z)| norm 0.2574 (-0.29z)| lr 4.88e-05 | 4152.61 ms | 32.5% bf16 MFU | 126265 tok/s step 16091/19560 | loss 3.312375 (+0.28z)| norm 0.2809 (+0.32z)| lr 4.87e-05 | 4151.47 ms | 32.5% bf16 MFU | 126266 tok/s step 16092/19560 | loss 3.301767 (-0.04z)| norm 0.2815 (+0.33z)| lr 4.87e-05 | 4152.80 ms | 32.5% bf16 MFU | 126265 tok/s step 16093/19560 | loss 3.344827 (+1.20z)| norm 0.2492 (-0.51z)| lr 4.87e-05 | 4155.55 ms | 32.5% bf16 MFU | 126260 tok/s step 16094/19560 | loss 3.292467 (-0.32z)| norm 0.2648 (-0.11z)| lr 4.87e-05 | 4156.60 ms | 32.5% bf16 MFU | 126254 tok/s step 16095/19560 | loss 3.293131 (-0.30z)| norm 0.2786 (+0.25z)| lr 4.86e-05 | 4149.74 ms | 32.5% bf16 MFU | 126259 tok/s step 16096/19560 | loss 3.307288 (+0.10z)| norm 0.2572 (-0.31z)| lr 4.86e-05 | 4151.06 ms | 32.5% bf16 MFU | 126261 tok/s step 16097/19560 | loss 3.295749 (-0.24z)| norm 0.2569 (-0.32z)| lr 4.86e-05 | 4155.40 ms | 32.5% bf16 MFU | 126256 tok/s step 16098/19560 | loss 3.297488 (-0.19z)| norm 0.2802 (+0.28z)| lr 4.85e-05 | 4155.93 ms | 32.5% bf16 MFU | 126251 tok/s step 16099/19560 | loss 3.326784 (+0.67z)| norm 0.2569 (-0.32z)| lr 4.85e-05 | 4153.65 ms | 32.5% bf16 MFU | 126250 tok/s step 16100/19560 | loss 3.294400 (-0.30z)| norm 0.2694 (-0.00z)| lr 4.85e-05 | 4155.44 ms | 32.5% bf16 MFU | 126246 tok/s step 16101/19560 | loss 3.360004 (+1.64z)| norm 0.2679 (-0.05z)| lr 4.85e-05 | 4152.72 ms | 32.5% bf16 MFU | 126246 tok/s step 16102/19560 | loss 3.317305 (+0.36z)| norm 0.2540 (-0.39z)| lr 4.84e-05 | 4151.54 ms | 32.5% bf16 MFU | 126248 tok/s step 16103/19560 | loss 3.311976 (+0.20z)| norm 0.2642 (-0.13z)| lr 4.84e-05 | 4151.60 ms | 32.5% bf16 MFU | 126250 tok/s step 16104/19560 
| loss 3.316841 (+0.35z)| norm 0.2702 (+0.03z)| lr 4.84e-05 | 4153.07 ms | 32.5% bf16 MFU | 126250 tok/s step 16105/19560 | loss 3.272573 (-0.97z)| norm 0.2561 (-0.34z)| lr 4.84e-05 | 4155.88 ms | 32.5% bf16 MFU | 126245 tok/s step 16106/19560 | loss 3.265042 (-1.18z)| norm 0.2693 (+0.01z)| lr 4.83e-05 | 4151.60 ms | 32.5% bf16 MFU | 126247 tok/s step 16107/19560 | loss 3.296185 (-0.25z)| norm 0.2529 (-0.42z)| lr 4.83e-05 | 4150.22 ms | 32.5% bf16 MFU | 126251 tok/s step 16108/19560 | loss 3.287707 (-0.49z)| norm 0.2560 (-0.34z)| lr 4.83e-05 | 4154.38 ms | 32.5% bf16 MFU | 126248 tok/s step 16109/19560 | loss 3.275892 (-0.84z)| norm 0.2685 (-0.01z)| lr 4.82e-05 | 4151.88 ms | 32.5% bf16 MFU | 126250 tok/s step 16110/19560 | loss 3.268168 (-1.06z)| norm 0.2689 (+0.01z)| lr 4.82e-05 | 4153.90 ms | 32.5% bf16 MFU | 126248 tok/s step 16111/19560 | loss 3.298225 (-0.17z)| norm 0.2629 (-0.15z)| lr 4.82e-05 | 4156.17 ms | 32.5% bf16 MFU | 126243 tok/s step 16112/19560 | loss 3.363789 (+1.78z)| norm 0.2720 (+0.09z)| lr 4.82e-05 | 4152.84 ms | 32.5% bf16 MFU | 126243 tok/s step 16113/19560 | loss 3.278404 (-0.75z)| norm 0.2600 (-0.22z)| lr 4.81e-05 | 4152.40 ms | 32.5% bf16 MFU | 126244 tok/s step 16114/19560 | loss 3.302794 (-0.03z)| norm 0.2670 (-0.04z)| lr 4.81e-05 | 4151.61 ms | 32.5% bf16 MFU | 126246 tok/s step 16115/19560 | loss 3.270521 (-0.99z)| norm 0.2570 (-0.30z)| lr 4.81e-05 | 4153.57 ms | 32.5% bf16 MFU | 126245 tok/s step 16116/19560 | loss 3.284876 (-0.56z)| norm 0.2612 (-0.19z)| lr 4.81e-05 | 4154.45 ms | 32.5% bf16 MFU | 126243 tok/s step 16117/19560 | loss 3.291744 (-0.34z)| norm 0.2604 (-0.21z)| lr 4.80e-05 | 4154.05 ms | 32.5% bf16 MFU | 126241 tok/s step 16118/19560 | loss 3.290355 (-0.38z)| norm 0.2758 (+0.20z)| lr 4.80e-05 | 4152.62 ms | 32.5% bf16 MFU | 126242 tok/s step 16119/19560 | loss 3.258375 (-1.33z)| norm 0.2384 (-0.78z)| lr 4.80e-05 | 4153.18 ms | 32.5% bf16 MFU | 126242 tok/s step 16120/19560 | loss 3.330662 (+0.87z)| norm 0.2524 (-0.41z)| 
lr 4.79e-05 | 4153.90 ms | 32.5% bf16 MFU | 126241 tok/s step 16121/19560 | loss 3.288549 (-0.40z)| norm 0.2657 (-0.05z)| lr 4.79e-05 | 4151.36 ms | 32.5% bf16 MFU | 126243 tok/s step 16122/19560 | loss 3.289219 (-0.39z)| norm 0.2462 (-0.57z)| lr 4.79e-05 | 4150.11 ms | 32.5% bf16 MFU | 126248 tok/s step 16123/19560 | loss 3.322915 (+0.65z)| norm 0.2523 (-1.10z)| lr 4.79e-05 | 4153.48 ms | 32.5% bf16 MFU | 126247 tok/s step 16124/19560 | loss 3.281525 (-0.61z)| norm 0.2660 (+0.15z)| lr 4.78e-05 | 4194.41 ms | 32.2% bf16 MFU | 126184 tok/s step 16125/19560 | loss 3.253492 (-1.45z)| norm 0.2570 (-0.66z)| lr 4.78e-05 | 4180.79 ms | 32.3% bf16 MFU | 126145 tok/s step 16126/19560 | loss 3.337824 (+1.12z)| norm 0.2896 (+2.24z)| lr 4.78e-05 | 4181.98 ms | 32.3% bf16 MFU | 126106 tok/s step 16127/19560 | loss 3.321850 (+0.63z)| norm 0.2634 (-0.09z)| lr 4.78e-05 | 4166.96 ms | 32.4% bf16 MFU | 126092 tok/s step 16128/19560 | loss 3.348872 (+1.44z)| norm 0.2676 (+0.28z)| lr 4.77e-05 | 4157.88 ms | 32.5% bf16 MFU | 126092 tok/s step 16129/19560 | loss 3.298643 (-0.08z)| norm 0.2602 (-0.38z)| lr 4.77e-05 | 4160.73 ms | 32.5% bf16 MFU | 126088 tok/s step 16130/19560 | loss 3.418899 (+3.40z)| norm 0.2572 (-0.64z)| lr 4.77e-05 | 4159.51 ms | 32.5% bf16 MFU | 126086 tok/s step 16131/19560 | loss 3.347732 (+1.32z)| norm 0.2557 (-0.78z)| lr 4.76e-05 | 4160.78 ms | 32.5% bf16 MFU | 126082 tok/s step 16132/19560 | loss 3.251998 (-1.43z)| norm 0.2751 (+1.06z)| lr 4.76e-05 | 4160.01 ms | 32.5% bf16 MFU | 126079 tok/s step 16133/19560 | loss 3.307080 (+0.17z)| norm 0.2707 (+0.64z)| lr 4.76e-05 | 4157.05 ms | 32.5% bf16 MFU | 126081 tok/s step 16134/19560 | loss 3.314802 (+0.39z)| norm 0.2523 (-1.10z)| lr 4.76e-05 | 4154.33 ms | 32.5% bf16 MFU | 126087 tok/s step 16135/19560 | loss 3.273182 (-0.81z)| norm 0.2562 (-0.72z)| lr 4.75e-05 | 4157.00 ms | 32.5% bf16 MFU | 126089 tok/s step 16136/19560 | loss 3.326390 (+0.74z)| norm 0.2591 (-0.43z)| lr 4.75e-05 | 4157.30 ms | 32.5% bf16 MFU | 
126090 tok/s step 16137/19560 | loss 3.294271 (-0.19z)| norm 0.2546 (-0.85z)| lr 4.75e-05 | 4152.57 ms | 32.5% bf16 MFU | 126099 tok/s step 16138/19560 | loss 3.271177 (-0.85z)| norm 0.2440 (-1.83z)| lr 4.75e-05 | 4152.33 ms | 32.5% bf16 MFU | 126107 tok/s step 16139/19560 | loss 3.313350 (+0.37z)| norm 0.2545 (-0.82z)| lr 4.74e-05 | 4157.47 ms | 32.5% bf16 MFU | 126107 tok/s step 16140/19560 | loss 3.325580 (+0.72z)| norm 0.2688 (+0.55z)| lr 4.74e-05 | 4162.27 ms | 32.4% bf16 MFU | 126100 tok/s step 16141/19560 | loss 3.284857 (-0.47z)| norm 0.2548 (-0.80z)| lr 4.74e-05 | 4154.83 ms | 32.5% bf16 MFU | 126104 tok/s step 16142/19560 | loss 3.307597 (+0.18z)| norm 0.2534 (-0.93z)| lr 4.74e-05 | 4156.52 ms | 32.5% bf16 MFU | 126106 tok/s step 16143/19560 | loss 3.348923 (+1.37z)| norm 0.2581 (-0.46z)| lr 4.73e-05 | 4153.60 ms | 32.5% bf16 MFU | 126112 tok/s step 16144/19560 | loss 3.342296 (+1.16z)| norm 0.2544 (-0.83z)| lr 4.73e-05 | 4154.16 ms | 32.5% bf16 MFU | 126116 tok/s step 16145/19560 | loss 3.295019 (-0.23z)| norm 0.2404 (-2.15z)| lr 4.73e-05 | 4154.35 ms | 32.5% bf16 MFU | 126121 tok/s step 16146/19560 | loss 3.429023 (+3.50z)| norm 0.2588 (-0.37z)| lr 4.72e-05 | 4157.59 ms | 32.5% bf16 MFU | 126120 tok/s step 16147/19560 | loss 3.334244 (+0.84z)| norm 0.2591 (-0.35z)| lr 4.72e-05 | 4155.96 ms | 32.5% bf16 MFU | 126122 tok/s step 16148/19560 | loss 3.370168 (+1.80z)| norm 0.2495 (-1.27z)| lr 4.72e-05 | 4155.86 ms | 32.5% bf16 MFU | 126123 tok/s step 16149/19560 | loss 3.344898 (+1.09z)| norm 0.2553 (-0.73z)| lr 4.72e-05 | 4156.86 ms | 32.5% bf16 MFU | 126123 tok/s step 16150/19560 | loss 3.358272 (+1.43z)| norm 0.2510 (-1.14z)| lr 4.71e-05 | 4155.22 ms | 32.5% bf16 MFU | 126126 tok/s step 16151/19560 | loss 3.266599 (-1.08z)| norm 0.2541 (-0.83z)| lr 4.71e-05 | 4156.82 ms | 32.5% bf16 MFU | 126126 tok/s step 16152/19560 | loss 3.317644 (+0.32z)| norm 0.2904 (+2.59z)| lr 4.71e-05 | 4154.40 ms | 32.5% bf16 MFU | 126130 tok/s step 16153/19560 | loss 3.254683 
(-1.39z)| norm 0.2550 (-0.75z)| lr 4.71e-05 | 4155.93 ms | 32.5% bf16 MFU | 126131 tok/s step 16154/19560 | loss 3.301619 (-0.11z)| norm 0.2457 (-1.60z)| lr 4.70e-05 | 4153.10 ms | 32.5% bf16 MFU | 126136 tok/s step 16155/19560 | loss 3.343616 (+1.02z)| norm 0.2786 (+1.45z)| lr 4.70e-05 | 4155.87 ms | 32.5% bf16 MFU | 126137 tok/s step 16156/19560 | loss 3.252854 (-1.43z)| norm 0.2656 (+0.25z)| lr 4.70e-05 | 4155.13 ms | 32.5% bf16 MFU | 126140 tok/s step 16157/19560 | loss 3.295443 (-0.27z)| norm 0.2581 (-0.46z)| lr 4.69e-05 | 4152.82 ms | 32.5% bf16 MFU | 126145 tok/s step 16158/19560 | loss 3.332607 (+0.74z)| norm 0.2716 (+0.84z)| lr 4.69e-05 | 4156.62 ms | 32.5% bf16 MFU | 126144 tok/s step 16159/19560 | loss 3.362598 (+1.54z)| norm 0.2640 (+0.11z)| lr 4.69e-05 | 4156.30 ms | 32.5% bf16 MFU | 126144 tok/s step 16160/19560 | loss 3.351124 (+1.22z)| norm 0.2699 (+0.66z)| lr 4.69e-05 | 4153.93 ms | 32.5% bf16 MFU | 126148 tok/s step 16161/19560 | loss 3.267792 (-1.05z)| norm 0.2660 (+0.28z)| lr 4.68e-05 | 4156.14 ms | 32.5% bf16 MFU | 126148 tok/s step 16162/19560 | loss 3.295130 (-0.31z)| norm 0.2597 (-0.33z)| lr 4.68e-05 | 4156.90 ms | 32.5% bf16 MFU | 126147 tok/s step 16163/19560 | loss 3.289053 (-0.47z)| norm 0.2643 (+0.12z)| lr 4.68e-05 | 4156.71 ms | 32.5% bf16 MFU | 126146 tok/s step 16164/19560 | loss 3.310771 (+0.12z)| norm 0.2690 (+0.56z)| lr 4.68e-05 | 4153.39 ms | 32.5% bf16 MFU | 126150 tok/s step 16165/19560 | loss 3.372681 (+1.80z)| norm 0.2575 (-0.54z)| lr 4.67e-05 | 4153.93 ms | 32.5% bf16 MFU | 126153 tok/s step 16166/19560 | loss 3.414381 (+2.89z)| norm 0.2659 (+0.27z)| lr 4.67e-05 | 4155.67 ms | 32.5% bf16 MFU | 126154 tok/s step 16167/19560 | loss 3.320524 (+0.34z)| norm 0.2667 (+0.33z)| lr 4.67e-05 | 4160.07 ms | 32.5% bf16 MFU | 126148 tok/s step 16168/19560 | loss 3.291975 (-0.45z)| norm 0.2686 (+0.51z)| lr 4.67e-05 | 4153.66 ms | 32.5% bf16 MFU | 126151 tok/s step 16169/19560 | loss 3.269007 (-1.06z)| norm 0.2521 (-1.09z)| lr 4.66e-05 | 
4155.76 ms | 32.5% bf16 MFU | 126152 tok/s step 16170/19560 | loss 3.366509 (+1.56z)| norm 0.2572 (-0.60z)| lr 4.66e-05 | 4154.55 ms | 32.5% bf16 MFU | 126154 tok/s step 16171/19560 | loss 3.308643 (-0.01z)| norm 0.2738 (+1.01z)| lr 4.66e-05 | 4150.27 ms | 32.5% bf16 MFU | 126163 tok/s step 16172/19560 | loss 3.341466 (+0.87z)| norm 0.2711 (+0.73z)| lr 4.65e-05 | 4155.42 ms | 32.5% bf16 MFU | 126163 tok/s step 16173/19560 | loss 3.236127 (-1.94z)| norm 0.2828 (+1.83z)| lr 4.65e-05 | 4152.70 ms | 32.5% bf16 MFU | 126167 tok/s step 16174/19560 | loss 3.317574 (+0.23z)| norm 0.2544 (-0.89z)| lr 4.65e-05 | 4151.44 ms | 32.5% bf16 MFU | 126174 tok/s step 16175/19560 | loss 3.343740 (+0.92z)| norm 0.2764 (+1.22z)| lr 4.65e-05 | 4156.77 ms | 32.5% bf16 MFU | 126171 tok/s step 16176/19560 | loss 3.288050 (-0.57z)| norm 0.2570 (-0.63z)| lr 4.64e-05 | 4153.58 ms | 32.5% bf16 MFU | 126174 tok/s step 16177/19560 | loss 3.286827 (-0.60z)| norm 0.2704 (+0.63z)| lr 4.64e-05 | 4154.22 ms | 32.5% bf16 MFU | 126176 tok/s step 16178/19560 | loss 3.271604 (-1.01z)| norm 0.2599 (-0.38z)| lr 4.64e-05 | 4152.95 ms | 32.5% bf16 MFU | 126179 tok/s step 16179/19560 | loss 3.304091 (-0.14z)| norm 0.2616 (-0.21z)| lr 4.64e-05 | 4157.28 ms | 32.5% bf16 MFU | 126176 tok/s step 16180/19560 | loss 3.318615 (+0.25z)| norm 0.2741 (+0.98z)| lr 4.63e-05 | 4151.53 ms | 32.5% bf16 MFU | 126181 tok/s step 16181/19560 | loss 3.292198 (-0.46z)| norm 0.2662 (+0.20z)| lr 4.63e-05 | 4154.00 ms | 32.5% bf16 MFU | 126183 tok/s step 16182/19560 | loss 3.419791 (+2.83z)| norm 0.2796 (+1.51z)| lr 4.63e-05 | 4159.03 ms | 32.5% bf16 MFU | 126177 tok/s step 16183/19560 | loss 3.297272 (-0.32z)| norm 0.2664 (+0.22z)| lr 4.63e-05 | 4155.64 ms | 32.5% bf16 MFU | 126176 tok/s step 16184/19560 | loss 3.258029 (-1.35z)| norm 0.2622 (-0.17z)| lr 4.62e-05 | 4153.81 ms | 32.5% bf16 MFU | 126178 tok/s step 16185/19560 | loss 3.299320 (-0.27z)| norm 0.2767 (+1.25z)| lr 4.62e-05 | 4152.07 ms | 32.5% bf16 MFU | 126183 tok/s step 
16186/19560 | loss 3.268691 (-1.10z)| norm 0.2594 (-0.44z)| lr 4.62e-05 | 4156.36 ms | 32.5% bf16 MFU | 126181 tok/s step 16187/19560 | loss 3.291808 (-0.48z)| norm 0.2502 (-1.32z)| lr 4.61e-05 | 4155.54 ms | 32.5% bf16 MFU | 126180 tok/s step 16188/19560 | loss 3.323276 (+0.35z)| norm 0.2430 (-1.98z)| lr 4.61e-05 | 4156.27 ms | 32.5% bf16 MFU | 126178 tok/s step 16189/19560 | loss 3.379955 (+1.83z)| norm 0.2589 (-0.45z)| lr 4.61e-05 | 4155.75 ms | 32.5% bf16 MFU | 126177 tok/s step 16190/19560 | loss 3.299209 (-0.33z)| norm 0.2604 (-0.28z)| lr 4.61e-05 | 4154.16 ms | 32.5% bf16 MFU | 126179 tok/s step 16191/19560 | loss 3.332690 (+0.57z)| norm 0.2543 (-0.87z)| lr 4.60e-05 | 4153.15 ms | 32.5% bf16 MFU | 126182 tok/s step 16192/19560 | loss 3.251124 (-1.62z)| norm 0.2626 (-0.06z)| lr 4.60e-05 | 4155.29 ms | 32.5% bf16 MFU | 126181 tok/s step 16193/19560 | loss 3.254226 (-1.51z)| norm 0.2455 (-1.71z)| lr 4.60e-05 | 4154.03 ms | 32.5% bf16 MFU | 126183 tok/s step 16194/19560 | loss 3.328764 (+0.47z)| norm 0.2489 (-1.36z)| lr 4.60e-05 | 4152.95 ms | 32.5% bf16 MFU | 126186 tok/s step 16195/19560 | loss 3.344039 (+0.87z)| norm 0.2502 (-1.21z)| lr 4.59e-05 | 4154.18 ms | 32.5% bf16 MFU | 126187 tok/s step 16196/19560 | loss 3.367131 (+1.46z)| norm 0.2626 (-0.02z)| lr 4.59e-05 | 4150.74 ms | 32.5% bf16 MFU | 126193 tok/s step 16197/19560 | loss 3.275060 (-0.95z)| norm 0.2448 (-1.70z)| lr 4.59e-05 | 4153.69 ms | 32.5% bf16 MFU | 126195 tok/s step 16198/19560 | loss 3.311609 (+0.02z)| norm 0.2639 (+0.11z)| lr 4.59e-05 | 4153.82 ms | 32.5% bf16 MFU | 126196 tok/s step 16199/19560 | loss 3.266539 (-1.16z)| norm 0.2504 (-1.16z)| lr 4.58e-05 | 4153.88 ms | 32.5% bf16 MFU | 126197 tok/s step 16200/19560 | loss 3.286481 (-0.64z)| norm 0.2523 (-0.98z)| lr 4.58e-05 | 4153.97 ms | 32.5% bf16 MFU | 126198 tok/s step 16201/19560 | loss 3.358778 (+1.25z)| norm 0.2572 (-0.50z)| lr 4.58e-05 | 4152.04 ms | 32.5% bf16 MFU | 126201 tok/s step 16202/19560 | loss 3.236394 (-1.92z)| norm 
0.2417 (-1.98z)| lr 4.57e-05 | 4152.43 ms | 32.5% bf16 MFU | 126204 tok/s step 16203/19560 | loss 3.290910 (-0.51z)| norm 0.2558 (-0.59z)| lr 4.57e-05 | 4154.55 ms | 32.5% bf16 MFU | 126204 tok/s step 16204/19560 | loss 3.272897 (-0.97z)| norm 0.2646 (+0.28z)| lr 4.57e-05 | 4152.04 ms | 32.5% bf16 MFU | 126207 tok/s step 16205/19560 | loss 3.359002 (+1.24z)| norm 0.2699 (+0.81z)| lr 4.57e-05 | 4151.68 ms | 32.5% bf16 MFU | 126211 tok/s step 16206/19560 | loss 3.275360 (-0.90z)| norm 0.2554 (-0.61z)| lr 4.56e-05 | 4156.68 ms | 32.5% bf16 MFU | 126207 tok/s step 16207/19560 | loss 3.305763 (-0.10z)| norm 0.2501 (-1.14z)| lr 4.56e-05 | 4152.29 ms | 32.5% bf16 MFU | 126210 tok/s step 16208/19560 | loss 3.260808 (-1.26z)| norm 0.2613 (-0.01z)| lr 4.56e-05 | 4153.43 ms | 32.5% bf16 MFU | 126211 tok/s step 16209/19560 | loss 3.390467 (+2.07z)| norm 0.2666 (+0.53z)| lr 4.56e-05 | 4155.63 ms | 32.5% bf16 MFU | 126209 tok/s step 16210/19560 | loss 3.303461 (-0.16z)| norm 0.2547 (-0.66z)| lr 4.55e-05 | 4146.38 ms | 32.6% bf16 MFU | 126221 tok/s step 16211/19560 | loss 3.336820 (+0.72z)| norm 0.2500 (-1.13z)| lr 4.55e-05 | 4150.03 ms | 32.5% bf16 MFU | 126226 tok/s step 16212/19560 | loss 3.275469 (-0.87z)| norm 0.2511 (-1.00z)| lr 4.55e-05 | 4147.57 ms | 32.6% bf16 MFU | 126235 tok/s step 16213/19560 | loss 3.316062 (+0.18z)| norm 0.2714 (+1.02z)| lr 4.55e-05 | 4147.57 ms | 32.6% bf16 MFU | 126244 tok/s step 16214/19560 | loss 3.238995 (-1.80z)| norm 0.2658 (+0.46z)| lr 4.54e-05 | 4147.13 ms | 32.6% bf16 MFU | 126253 tok/s step 16215/19560 | loss 3.348831 (+1.02z)| norm 0.2478 (-1.32z)| lr 4.54e-05 | 4149.50 ms | 32.5% bf16 MFU | 126258 tok/s step 16216/19560 | loss 3.333370 (+0.61z)| norm 0.2756 (+1.42z)| lr 4.54e-05 | 4152.09 ms | 32.5% bf16 MFU | 126258 tok/s step 16217/19560 | loss 3.365301 (+1.41z)| norm 0.2700 (+0.86z)| lr 4.54e-05 | 4152.87 ms | 32.5% bf16 MFU | 126258 tok/s step 16218/19560 | loss 3.256837 (-1.33z)| norm 0.2689 (+0.74z)| lr 4.53e-05 | 4153.03 ms | 
32.5% bf16 MFU | 126257 tok/s step 16219/19560 | loss 3.353161 (+1.10z)| norm 0.2934 (+3.07z)| lr 4.53e-05 | 4149.46 ms | 32.5% bf16 MFU | 126262 tok/s step 16220/19560 | loss 3.266801 (-1.07z)| norm 0.2584 (-0.28z)| lr 4.53e-05 | 4148.60 ms | 32.5% bf16 MFU | 126267 tok/s step 16221/19560 | loss 3.280869 (-0.70z)| norm 0.2634 (+0.19z)| lr 4.52e-05 | 4151.39 ms | 32.5% bf16 MFU | 126269 tok/s step 16222/19560 | loss 3.303793 (-0.13z)| norm 0.2648 (+0.33z)| lr 4.52e-05 | 4151.13 ms | 32.5% bf16 MFU | 126270 tok/s step 16223/19560 | loss 3.323238 (+0.36z)| norm 0.2711 (+0.96z)| lr 4.52e-05 | 4148.61 ms | 32.5% bf16 MFU | 126276 tok/s step 16224/19560 | loss 3.376510 (+1.67z)| norm 0.2615 (+0.01z)| lr 4.52e-05 | 4149.60 ms | 32.5% bf16 MFU | 126279 tok/s step 16225/19560 | loss 3.411803 (+2.47z)| norm 0.2713 (+0.96z)| lr 4.51e-05 | 4153.18 ms | 32.5% bf16 MFU | 126277 tok/s step 16226/19560 | loss 3.386278 (+1.81z)| norm 0.2564 (-0.49z)| lr 4.51e-05 | 4151.52 ms | 32.5% bf16 MFU | 126278 tok/s step 16227/19560 | loss 3.263024 (-1.14z)| norm 0.2786 (+1.69z)| lr 4.51e-05 | 4155.92 ms | 32.5% bf16 MFU | 126271 tok/s step 16228/19560 | loss 3.281667 (-0.69z)| norm 0.2645 (+0.31z)| lr 4.51e-05 | 4150.11 ms | 32.5% bf16 MFU | 126274 tok/s step 16229/19560 | loss 3.354787 (+1.06z)| norm 0.2540 (-0.73z)| lr 4.50e-05 | 4154.13 ms | 32.5% bf16 MFU | 126271 tok/s step 16230/19560 | loss 3.278984 (-0.75z)| norm 0.2599 (-0.15z)| lr 4.50e-05 | 4155.51 ms | 32.5% bf16 MFU | 126266 tok/s step 16231/19560 | loss 3.274863 (-0.84z)| norm 0.2608 (-0.06z)| lr 4.50e-05 | 4156.33 ms | 32.5% bf16 MFU | 126260 tok/s step 16232/19560 | loss 3.341818 (+0.75z)| norm 0.2434 (-1.74z)| lr 4.50e-05 | 4151.26 ms | 32.5% bf16 MFU | 126262 tok/s step 16233/19560 | loss 3.295491 (-0.35z)| norm 0.2600 (-0.11z)| lr 4.49e-05 | 4158.04 ms | 32.5% bf16 MFU | 126253 tok/s step 16234/19560 | loss 3.353051 (+1.00z)| norm 0.2523 (-0.87z)| lr 4.49e-05 | 4153.46 ms | 32.5% bf16 MFU | 126252 tok/s step 16235/19560 
| loss 3.286178 (-0.59z)| norm 0.2458 (-1.48z)| lr 4.49e-05 | 4148.23 ms | 32.5% bf16 MFU | 126259 tok/s step 16236/19560 | loss 3.351478 (+0.95z)| norm 0.2613 (+0.03z)| lr 4.48e-05 | 4152.57 ms | 32.5% bf16 MFU | 126259 tok/s step 16237/19560 | loss 3.282785 (-0.68z)| norm 0.2549 (-0.59z)| lr 4.48e-05 | 4152.84 ms | 32.5% bf16 MFU | 126258 tok/s step 16238/19560 | loss 3.267614 (-1.04z)| norm 0.2511 (-0.95z)| lr 4.48e-05 | 4152.91 ms | 32.5% bf16 MFU | 126257 tok/s step 16239/19560 | loss 3.384418 (+1.70z)| norm 0.2583 (-0.24z)| lr 4.48e-05 | 4157.38 ms | 32.5% bf16 MFU | 126250 tok/s step 16240/19560 | loss 3.297374 (-0.34z)| norm 0.2798 (+1.84z)| lr 4.47e-05 | 4149.29 ms | 32.5% bf16 MFU | 126255 tok/s step 16241/19560 | loss 3.339748 (+0.65z)| norm 0.2579 (-0.28z)| lr 4.47e-05 | 4154.10 ms | 32.5% bf16 MFU | 126253 tok/s step 16242/19560 | loss 3.338517 (+0.62z)| norm 0.2518 (-0.86z)| lr 4.47e-05 | 4149.73 ms | 32.5% bf16 MFU | 126258 tok/s step 16243/19560 | loss 3.337379 (+0.58z)| norm 0.2737 (+1.24z)| lr 4.47e-05 | 4150.66 ms | 32.5% bf16 MFU | 126260 tok/s step 16244/19560 | loss 3.286292 (-0.63z)| norm 0.2592 (-0.16z)| lr 4.46e-05 | 4154.42 ms | 32.5% bf16 MFU | 126257 tok/s step 16245/19560 | loss 3.386341 (+1.70z)| norm 0.2589 (-0.19z)| lr 4.46e-05 | 4156.85 ms | 32.5% bf16 MFU | 126251 tok/s step 16246/19560 | loss 3.297039 (-0.39z)| norm 0.2873 (+2.50z)| lr 4.46e-05 | 4150.73 ms | 32.5% bf16 MFU | 126254 tok/s step 16247/19560 | loss 3.238864 (-1.74z)| norm 0.2861 (+2.33z)| lr 4.46e-05 | 4153.03 ms | 32.5% bf16 MFU | 126253 tok/s step 16248/19560 | loss 3.254058 (-1.37z)| norm 0.2697 (+0.78z)| lr 4.45e-05 | 4150.78 ms | 32.5% bf16 MFU | 126256 tok/s step 16249/19560 | loss 3.353522 (+0.93z)| norm 0.2759 (+1.35z)| lr 4.45e-05 | 4154.81 ms | 32.5% bf16 MFU | 126253 tok/s step 16250/19560 | loss 3.333646 (+0.46z)| norm 0.2596 (-0.19z)| lr 4.45e-05 | 4152.69 ms | 32.5% bf16 MFU | 126253 tok/s val loss 3.279593 evaluating HellaSwag: 0/1256 evaluating 
HellaSwag: 10/1256 evaluating HellaSwag: 20/1256 evaluating HellaSwag: 30/1256 evaluating HellaSwag: 40/1256 evaluating HellaSwag: 50/1256 evaluating HellaSwag: 60/1256 evaluating HellaSwag: 70/1256 evaluating HellaSwag: 80/1256 evaluating HellaSwag: 90/1256 evaluating HellaSwag: 100/1256 evaluating HellaSwag: 110/1256 evaluating HellaSwag: 120/1256 evaluating HellaSwag: 130/1256 evaluating HellaSwag: 140/1256 evaluating HellaSwag: 150/1256 evaluating HellaSwag: 160/1256 evaluating HellaSwag: 170/1256 evaluating HellaSwag: 180/1256 evaluating HellaSwag: 190/1256 evaluating HellaSwag: 200/1256 evaluating HellaSwag: 210/1256 evaluating HellaSwag: 220/1256 evaluating HellaSwag: 230/1256 evaluating HellaSwag: 240/1256 evaluating HellaSwag: 250/1256 evaluating HellaSwag: 260/1256 evaluating HellaSwag: 270/1256 evaluating HellaSwag: 280/1256 evaluating HellaSwag: 290/1256 evaluating HellaSwag: 300/1256 evaluating HellaSwag: 310/1256 evaluating HellaSwag: 320/1256 evaluating HellaSwag: 330/1256 evaluating HellaSwag: 340/1256 evaluating HellaSwag: 350/1256 evaluating HellaSwag: 360/1256 evaluating HellaSwag: 370/1256 evaluating HellaSwag: 380/1256 evaluating HellaSwag: 390/1256 evaluating HellaSwag: 400/1256 evaluating HellaSwag: 410/1256 evaluating HellaSwag: 420/1256 evaluating HellaSwag: 430/1256 evaluating HellaSwag: 440/1256 evaluating HellaSwag: 450/1256 evaluating HellaSwag: 460/1256 evaluating HellaSwag: 470/1256 evaluating HellaSwag: 480/1256 evaluating HellaSwag: 490/1256 evaluating HellaSwag: 500/1256 evaluating HellaSwag: 510/1256 evaluating HellaSwag: 520/1256 evaluating HellaSwag: 530/1256 evaluating HellaSwag: 540/1256 evaluating HellaSwag: 550/1256 evaluating HellaSwag: 560/1256 evaluating HellaSwag: 570/1256 evaluating HellaSwag: 580/1256 evaluating HellaSwag: 590/1256 evaluating HellaSwag: 600/1256 evaluating HellaSwag: 610/1256 evaluating HellaSwag: 620/1256 evaluating HellaSwag: 630/1256 evaluating HellaSwag: 640/1256 evaluating HellaSwag: 650/1256 
evaluating HellaSwag: 660/1256 evaluating HellaSwag: 670/1256 evaluating HellaSwag: 680/1256 evaluating HellaSwag: 690/1256 evaluating HellaSwag: 700/1256 evaluating HellaSwag: 710/1256 evaluating HellaSwag: 720/1256 evaluating HellaSwag: 730/1256 evaluating HellaSwag: 740/1256 evaluating HellaSwag: 750/1256 evaluating HellaSwag: 760/1256 evaluating HellaSwag: 770/1256 evaluating HellaSwag: 780/1256 evaluating HellaSwag: 790/1256 evaluating HellaSwag: 800/1256 evaluating HellaSwag: 810/1256 evaluating HellaSwag: 820/1256 evaluating HellaSwag: 830/1256 evaluating HellaSwag: 840/1256 evaluating HellaSwag: 850/1256 evaluating HellaSwag: 860/1256 evaluating HellaSwag: 870/1256 evaluating HellaSwag: 880/1256 evaluating HellaSwag: 890/1256 evaluating HellaSwag: 900/1256 evaluating HellaSwag: 910/1256 evaluating HellaSwag: 920/1256 evaluating HellaSwag: 930/1256 evaluating HellaSwag: 940/1256 evaluating HellaSwag: 950/1256 evaluating HellaSwag: 960/1256 evaluating HellaSwag: 970/1256 evaluating HellaSwag: 980/1256 evaluating HellaSwag: 990/1256 evaluating HellaSwag: 1000/1256 evaluating HellaSwag: 1010/1256 evaluating HellaSwag: 1020/1256 evaluating HellaSwag: 1030/1256 evaluating HellaSwag: 1040/1256 evaluating HellaSwag: 1050/1256 evaluating HellaSwag: 1060/1256 evaluating HellaSwag: 1070/1256 evaluating HellaSwag: 1080/1256 evaluating HellaSwag: 1090/1256 evaluating HellaSwag: 1100/1256 evaluating HellaSwag: 1110/1256 evaluating HellaSwag: 1120/1256 evaluating HellaSwag: 1130/1256 evaluating HellaSwag: 1140/1256 evaluating HellaSwag: 1150/1256 evaluating HellaSwag: 1160/1256 evaluating HellaSwag: 1170/1256 evaluating HellaSwag: 1180/1256 evaluating HellaSwag: 1190/1256 evaluating HellaSwag: 1200/1256 evaluating HellaSwag: 1210/1256 evaluating HellaSwag: 1220/1256 evaluating HellaSwag: 1230/1256 evaluating HellaSwag: 1240/1256 evaluating HellaSwag: 1250/1256 HellaSwag: 3016/10042 = 0.300339 step 16251/19560 | loss 3.311140 (-0.06z)| norm 0.2773 (+1.45z)| lr 4.45e-05 | 
4147.26 ms | 32.6% bf16 MFU | 126261 tok/s step 16252/19560 | loss 3.283259 (-0.71z)| norm 0.2713 (+0.88z)| lr 4.44e-05 | 4151.99 ms | 32.5% bf16 MFU | 126262 tok/s step 16253/19560 | loss 3.331046 (+0.39z)| norm 0.2603 (-0.14z)| lr 4.44e-05 | 4148.91 ms | 32.5% bf16 MFU | 126267 tok/s step 16254/19560 | loss 3.268026 (-1.06z)| norm 0.2567 (-0.47z)| lr 4.44e-05 | 4145.71 ms | 32.6% bf16 MFU | 126277 tok/s step 16255/19560 | loss 3.276144 (-0.87z)| norm 0.2702 (+0.82z)| lr 4.44e-05 | 4145.77 ms | 32.6% bf16 MFU | 126286 tok/s step 16256/19560 | loss 3.410109 (+2.19z)| norm 0.2634 (+0.17z)| lr 4.43e-05 | 4202.30 ms | 32.1% bf16 MFU | 126210 tok/s step 16257/19560 | loss 3.305329 (-0.20z)| norm 0.2457 (-1.50z)| lr 4.43e-05 | 4149.89 ms | 32.5% bf16 MFU | 126216 tok/s step 16258/19560 | loss 3.321100 (+0.18z)| norm 0.2698 (+0.78z)| lr 4.43e-05 | 4150.07 ms | 32.5% bf16 MFU | 126222 tok/s step 16259/19560 | loss 3.246883 (-1.52z)| norm 0.2556 (-0.57z)| lr 4.42e-05 | 4153.98 ms | 32.5% bf16 MFU | 126222 tok/s step 16260/19560 | loss 3.298944 (-0.32z)| norm 0.2493 (-1.15z)| lr 4.42e-05 | 4152.19 ms | 32.5% bf16 MFU | 126224 tok/s step 16261/19560 | loss 3.309305 (-0.08z)| norm 0.2492 (-1.14z)| lr 4.42e-05 | 4151.58 ms | 32.5% bf16 MFU | 126227 tok/s step 16262/19560 | loss 3.331865 (+0.44z)| norm 0.2622 (+0.09z)| lr 4.42e-05 | 4150.80 ms | 32.5% bf16 MFU | 126231 tok/s step 16263/19560 | loss 3.293483 (-0.46z)| norm 0.2539 (-0.70z)| lr 4.41e-05 | 4151.32 ms | 32.5% bf16 MFU | 126234 tok/s step 16264/19560 | loss 3.345967 (+0.76z)| norm 0.2758 (+1.36z)| lr 4.41e-05 | 4153.27 ms | 32.5% bf16 MFU | 126234 tok/s step 16265/19560 | loss 3.261949 (-1.19z)| norm 0.2762 (+1.38z)| lr 4.41e-05 | 4149.67 ms | 32.5% bf16 MFU | 126240 tok/s step 16266/19560 | loss 3.318286 (+0.11z)| norm 0.2585 (-0.30z)| lr 4.41e-05 | 4155.11 ms | 32.5% bf16 MFU | 126237 tok/s step 16267/19560 | loss 3.290947 (-0.52z)| norm 0.2632 (+0.13z)| lr 4.40e-05 | 4147.94 ms | 32.6% bf16 MFU | 126245 tok/s step 
16268/19560 | loss 3.308956 (-0.10z)| norm 0.2980 (+3.28z)| lr 4.40e-05 | 4148.52 ms | 32.5% bf16 MFU | 126252 tok/s step 16269/19560 | loss 3.272452 (-0.94z)| norm 0.2494 (-1.14z)| lr 4.40e-05 | 4149.62 ms | 32.5% bf16 MFU | 126256 tok/s step 16270/19560 | loss 3.333618 (+0.47z)| norm 0.2621 (+0.00z)| lr 4.40e-05 | 4147.46 ms | 32.6% bf16 MFU | 126264 tok/s step 16271/19560 | loss 3.353998 (+0.95z)| norm 0.2609 (-0.10z)| lr 4.39e-05 | 4150.44 ms | 32.5% bf16 MFU | 126267 tok/s step 16272/19560 | loss 3.327469 (+0.33z)| norm 0.2742 (+1.09z)| lr 4.39e-05 | 4150.68 ms | 32.5% bf16 MFU | 126269 tok/s step 16273/19560 | loss 3.323916 (+0.25z)| norm 0.2595 (-0.27z)| lr 4.39e-05 | 4149.39 ms | 32.5% bf16 MFU | 126274 tok/s step 16274/19560 | loss 3.368291 (+1.32z)| norm 0.2559 (-0.59z)| lr 4.39e-05 | 4149.24 ms | 32.5% bf16 MFU | 126278 tok/s step 16275/19560 | loss 3.339542 (+0.63z)| norm 0.2636 (+0.12z)| lr 4.38e-05 | 4148.99 ms | 32.5% bf16 MFU | 126282 tok/s step 16276/19560 | loss 3.339650 (+0.65z)| norm 0.2629 (+0.04z)| lr 4.38e-05 | 4151.43 ms | 32.5% bf16 MFU | 126283 tok/s step 16277/19560 | loss 3.298177 (-0.34z)| norm 0.2574 (-0.47z)| lr 4.38e-05 | 4151.44 ms | 32.5% bf16 MFU | 126283 tok/s step 16278/19560 | loss 3.249774 (-1.47z)| norm 0.2717 (+0.84z)| lr 4.38e-05 | 4152.58 ms | 32.5% bf16 MFU | 126282 tok/s step 16279/19560 | loss 3.287696 (-0.57z)| norm 0.2602 (-0.23z)| lr 4.37e-05 | 4150.80 ms | 32.5% bf16 MFU | 126283 tok/s step 16280/19560 | loss 3.286924 (-0.59z)| norm 0.2448 (-1.66z)| lr 4.37e-05 | 4153.30 ms | 32.5% bf16 MFU | 126281 tok/s step 16281/19560 | loss 3.280050 (-0.76z)| norm 0.2590 (-0.32z)| lr 4.37e-05 | 4152.43 ms | 32.5% bf16 MFU | 126280 tok/s step 16282/19560 | loss 3.345512 (+0.81z)| norm 0.2592 (-0.31z)| lr 4.36e-05 | 4151.63 ms | 32.5% bf16 MFU | 126280 tok/s step 16283/19560 | loss 3.286308 (-0.60z)| norm 0.2601 (-0.21z)| lr 4.36e-05 | 4152.81 ms | 32.5% bf16 MFU | 126278 tok/s step 16284/19560 | loss 3.305924 (-0.14z)| norm 
0.2634 (+0.11z)| lr 4.36e-05 | 4148.58 ms | 32.5% bf16 MFU | 126283 tok/s step 16285/19560 | loss 3.266762 (-1.08z)| norm 0.2508 (-1.10z)| lr 4.36e-05 | 4150.02 ms | 32.5% bf16 MFU | 126286 tok/s step 16286/19560 | loss 3.348357 (+0.88z)| norm 0.2526 (-0.92z)| lr 4.35e-05 | 4151.13 ms | 32.5% bf16 MFU | 126286 tok/s step 16287/19560 | loss 3.331785 (+0.49z)| norm 0.2555 (-0.63z)| lr 4.35e-05 | 4150.67 ms | 32.5% bf16 MFU | 126288 tok/s step 16288/19560 | loss 3.326541 (+0.37z)| norm 0.2481 (-1.32z)| lr 4.35e-05 | 4149.27 ms | 32.5% bf16 MFU | 126291 tok/s step 16289/19560 | loss 3.280222 (-0.76z)| norm 0.2509 (-1.03z)| lr 4.35e-05 | 4149.58 ms | 32.5% bf16 MFU | 126294 tok/s step 16290/19560 | loss 3.309981 (-0.04z)| norm 0.2554 (-0.60z)| lr 4.34e-05 | 4149.30 ms | 32.5% bf16 MFU | 126297 tok/s step 16291/19560 | loss 3.266562 (-1.09z)| norm 0.2498 (-1.12z)| lr 4.34e-05 | 4148.04 ms | 32.5% bf16 MFU | 126302 tok/s step 16292/19560 | loss 3.333921 (+0.54z)| norm 0.2525 (-0.85z)| lr 4.34e-05 | 4152.72 ms | 32.5% bf16 MFU | 126300 tok/s step 16293/19560 | loss 3.299214 (-0.29z)| norm 0.2518 (-0.91z)| lr 4.34e-05 | 4154.80 ms | 32.5% bf16 MFU | 126294 tok/s step 16294/19560 | loss 3.305144 (-0.13z)| norm 0.2532 (-0.76z)| lr 4.33e-05 | 4149.53 ms | 32.5% bf16 MFU | 126297 tok/s step 16295/19560 | loss 3.312699 (+0.07z)| norm 0.2647 (+0.32z)| lr 4.33e-05 | 4151.50 ms | 32.5% bf16 MFU | 126296 tok/s step 16296/19560 | loss 3.333014 (+0.57z)| norm 0.2611 (-0.02z)| lr 4.33e-05 | 4149.58 ms | 32.5% bf16 MFU | 126299 tok/s step 16297/19560 | loss 3.309298 (-0.04z)| norm 0.2651 (+0.36z)| lr 4.33e-05 | 4148.04 ms | 32.5% bf16 MFU | 126304 tok/s step 16298/19560 | loss 3.276228 (-0.85z)| norm 0.2659 (+0.42z)| lr 4.32e-05 | 4148.52 ms | 32.5% bf16 MFU | 126307 tok/s step 16299/19560 | loss 3.345195 (+0.88z)| norm 0.2584 (-0.28z)| lr 4.32e-05 | 4152.77 ms | 32.5% bf16 MFU | 126305 tok/s step 16300/19560 | loss 3.276384 (-0.84z)| norm 0.2488 (-1.16z)| lr 4.32e-05 | 4149.72 ms | 
32.5% bf16 MFU | 126306 tok/s step 16301/19560 | loss 3.239330 (-1.78z)| norm 0.2572 (-0.36z)| lr 4.32e-05 | 4147.80 ms | 32.6% bf16 MFU | 126311 tok/s step 16302/19560 | loss 3.283836 (-0.65z)| norm 0.2484 (-1.20z)| lr 4.31e-05 | 4149.18 ms | 32.5% bf16 MFU | 126314 tok/s step 16303/19560 | loss 3.256634 (-1.31z)| norm 0.2691 (+0.80z)| lr 4.31e-05 | 4147.55 ms | 32.6% bf16 MFU | 126318 tok/s step 16304/19560 | loss 3.261154 (-1.19z)| norm 0.2617 (+0.08z)| lr 4.31e-05 | 4150.92 ms | 32.5% bf16 MFU | 126318 tok/s step 16305/19560 | loss 3.350877 (+1.04z)| norm 0.2583 (-0.24z)| lr 4.31e-05 | 4152.63 ms | 32.5% bf16 MFU | 126315 tok/s step 16306/19560 | loss 3.288288 (-0.52z)| norm 0.2520 (-0.84z)| lr 4.30e-05 | 4151.34 ms | 32.5% bf16 MFU | 126314 tok/s step 16307/19560 | loss 3.313152 (+0.09z)| norm 0.2509 (-0.93z)| lr 4.30e-05 | 4148.99 ms | 32.5% bf16 MFU | 126316 tok/s step 16308/19560 | loss 3.460444 (+3.55z)| norm 0.2676 (+0.68z)| lr 4.30e-05 | 4147.84 ms | 32.6% bf16 MFU | 126320 tok/s step 16309/19560 | loss 3.401216 (+2.09z)| norm 0.2663 (+0.56z)| lr 4.29e-05 | 4144.56 ms | 32.6% bf16 MFU | 126329 tok/s step 16310/19560 | loss 3.250529 (-1.41z)| norm 0.2492 (-1.09z)| lr 4.29e-05 | 4152.55 ms | 32.5% bf16 MFU | 126326 tok/s step 16311/19560 | loss 3.267740 (-0.99z)| norm 0.2579 (-0.23z)| lr 4.29e-05 | 4148.10 ms | 32.5% bf16 MFU | 126329 tok/s step 16312/19560 | loss 3.311821 (+0.04z)| norm 0.2509 (-0.90z)| lr 4.29e-05 | 4149.07 ms | 32.5% bf16 MFU | 126331 tok/s step 16313/19560 | loss 3.306014 (-0.10z)| norm 0.2687 (+0.84z)| lr 4.28e-05 | 4155.53 ms | 32.5% bf16 MFU | 126323 tok/s step 16314/19560 | loss 3.203417 (-2.47z)| norm 0.2547 (-0.53z)| lr 4.28e-05 | 4157.78 ms | 32.5% bf16 MFU | 126311 tok/s step 16315/19560 | loss 3.253334 (-1.30z)| norm 0.2663 (+0.60z)| lr 4.28e-05 | 4251.36 ms | 31.8% bf16 MFU | 126162 tok/s step 16316/19560 | loss 3.280293 (-0.67z)| norm 0.2631 (+0.27z)| lr 4.28e-05 | 4204.22 ms | 32.1% bf16 MFU | 126089 tok/s step 16317/19560 
| loss 3.307232 (-0.03z)| norm 0.2556 (-0.47z)| lr 4.27e-05 | 4172.33 ms | 32.4% bf16 MFU | 126068 tok/s step 16318/19560 | loss 3.286273 (-0.52z)| norm 0.2606 (+0.03z)| lr 4.27e-05 | 4160.24 ms | 32.5% bf16 MFU | 126065 tok/s step 16319/19560 | loss 3.368267 (+1.38z)| norm 0.2623 (+0.19z)| lr 4.27e-05 | 4165.38 ms | 32.4% bf16 MFU | 126055 tok/s step 16320/19560 | loss 3.390070 (+1.85z)| norm 0.2644 (+0.40z)| lr 4.27e-05 | 4154.90 ms | 32.5% bf16 MFU | 126062 tok/s step 16321/19560 | loss 3.299422 (-0.25z)| norm 0.2574 (-0.31z)| lr 4.26e-05 | 4176.83 ms | 32.3% bf16 MFU | 126035 tok/s step 16322/19560 | loss 3.261226 (-1.12z)| norm 0.2571 (-0.35z)| lr 4.26e-05 | 4167.01 ms | 32.4% bf16 MFU | 126024 tok/s step 16323/19560 | loss 3.435056 (+2.79z)| norm 0.2493 (-1.14z)| lr 4.26e-05 | 4164.92 ms | 32.4% bf16 MFU | 126017 tok/s step 16324/19560 | loss 3.395697 (+1.89z)| norm 0.2637 (+0.31z)| lr 4.26e-05 | 4166.07 ms | 32.4% bf16 MFU | 126009 tok/s step 16325/19560 | loss 3.411748 (+2.19z)| norm 0.2885 (+2.72z)| lr 4.25e-05 | 4166.93 ms | 32.4% bf16 MFU | 125999 tok/s step 16326/19560 | loss 3.361743 (+1.08z)| norm 0.2506 (-1.00z)| lr 4.25e-05 | 4164.39 ms | 32.4% bf16 MFU | 125994 tok/s step 16327/19560 | loss 3.294809 (-0.38z)| norm 0.2627 (+0.17z)| lr 4.25e-05 | 4165.51 ms | 32.4% bf16 MFU | 125988 tok/s step 16328/19560 | loss 3.332259 (+0.43z)| norm 0.2638 (+0.28z)| lr 4.25e-05 | 4169.25 ms | 32.4% bf16 MFU | 125976 tok/s step 16329/19560 | loss 3.351759 (+0.86z)| norm 0.2450 (-1.56z)| lr 4.24e-05 | 4164.62 ms | 32.4% bf16 MFU | 125972 tok/s step 16330/19560 | loss 3.290556 (-0.49z)| norm 0.2570 (-0.39z)| lr 4.24e-05 | 4166.46 ms | 32.4% bf16 MFU | 125965 tok/s step 16331/19560 | loss 3.275814 (-0.82z)| norm 0.2597 (-0.13z)| lr 4.24e-05 | 4162.84 ms | 32.4% bf16 MFU | 125964 tok/s step 16332/19560 | loss 3.363519 (+1.10z)| norm 0.2600 (-0.10z)| lr 4.24e-05 | 4163.60 ms | 32.4% bf16 MFU | 125962 tok/s step 16333/19560 | loss 3.318316 (+0.11z)| norm 0.2606 (-0.03z)| 
lr 4.23e-05 | 4160.77 ms | 32.5% bf16 MFU | 125964 tok/s step 16334/19560 | loss 3.345438 (+0.70z)| norm 0.2679 (+0.68z)| lr 4.23e-05 | 4168.09 ms | 32.4% bf16 MFU | 125955 tok/s step 16335/19560 | loss 3.302418 (-0.25z)| norm 0.2691 (+0.79z)| lr 4.23e-05 | 4159.05 ms | 32.5% bf16 MFU | 125960 tok/s step 16336/19560 | loss 3.333225 (+0.42z)| norm 0.2588 (-0.24z)| lr 4.23e-05 | 4154.08 ms | 32.5% bf16 MFU | 125973 tok/s step 16337/19560 | loss 3.284472 (-0.65z)| norm 0.2737 (+1.24z)| lr 4.22e-05 | 4162.65 ms | 32.4% bf16 MFU | 125972 tok/s step 16338/19560 | loss 3.324206 (+0.24z)| norm 0.2529 (-0.83z)| lr 4.22e-05 | 4165.74 ms | 32.4% bf16 MFU | 125966 tok/s step 16339/19560 | loss 3.344376 (+0.69z)| norm 0.2572 (-0.40z)| lr 4.22e-05 | 4168.25 ms | 32.4% bf16 MFU | 125957 tok/s step 16340/19560 | loss 3.324712 (+0.24z)| norm 0.2463 (-1.48z)| lr 4.22e-05 | 4159.73 ms | 32.5% bf16 MFU | 125961 tok/s step 16341/19560 | loss 3.309254 (-0.11z)| norm 0.2491 (-1.18z)| lr 4.21e-05 | 4174.21 ms | 32.3% bf16 MFU | 125943 tok/s step 16342/19560 | loss 3.301460 (-0.30z)| norm 0.2601 (-0.09z)| lr 4.21e-05 | 4152.72 ms | 32.5% bf16 MFU | 125958 tok/s step 16343/19560 | loss 3.401171 (+1.94z)| norm 0.2688 (+0.76z)| lr 4.21e-05 | 4161.53 ms | 32.4% bf16 MFU | 125960 tok/s step 16344/19560 | loss 3.340176 (+0.56z)| norm 0.2767 (+1.55z)| lr 4.21e-05 | 4163.00 ms | 32.4% bf16 MFU | 125959 tok/s step 16345/19560 | loss 3.340681 (+0.58z)| norm 0.2448 (-1.61z)| lr 4.20e-05 | 4172.69 ms | 32.4% bf16 MFU | 125943 tok/s step 16346/19560 | loss 3.313263 (-0.04z)| norm 0.2607 (-0.03z)| lr 4.20e-05 | 4165.47 ms | 32.4% bf16 MFU | 125939 tok/s step 16347/19560 | loss 3.292733 (-0.50z)| norm 0.2711 (+1.06z)| lr 4.20e-05 | 4162.44 ms | 32.4% bf16 MFU | 125940 tok/s step 16348/19560 | loss 3.395056 (+1.79z)| norm 0.2697 (+0.91z)| lr 4.20e-05 | 4164.00 ms | 32.4% bf16 MFU | 125939 tok/s step 16349/19560 | loss 3.339991 (+0.54z)| norm 0.2791 (+1.84z)| lr 4.19e-05 | 4165.95 ms | 32.4% bf16 MFU | 
125934 tok/s step 16350/19560 | loss 3.255718 (-1.35z)| norm 0.2591 (-0.18z)| lr 4.19e-05 | 4162.68 ms | 32.4% bf16 MFU | 125935 tok/s step 16351/19560 | loss 3.383044 (+1.48z)| norm 0.2721 (+1.14z)| lr 4.19e-05 | 4160.29 ms | 32.5% bf16 MFU | 125939 tok/s step 16352/19560 | loss 3.238990 (-1.69z)| norm 0.2771 (+1.61z)| lr 4.18e-05 | 4165.85 ms | 32.4% bf16 MFU | 125935 tok/s step 16353/19560 | loss 3.358522 (+0.98z)| norm 0.2661 (+0.51z)| lr 4.18e-05 | 4155.74 ms | 32.5% bf16 MFU | 125946 tok/s step 16354/19560 | loss 3.314461 (+0.00z)| norm 0.2664 (+0.53z)| lr 4.18e-05 | 4165.15 ms | 32.4% bf16 MFU | 125943 tok/s step 16355/19560 | loss 3.302669 (-0.27z)| norm 0.2582 (-0.28z)| lr 4.18e-05 | 4153.71 ms | 32.5% bf16 MFU | 125957 tok/s step 16356/19560 | loss 3.351765 (+0.83z)| norm 0.2511 (-0.99z)| lr 4.17e-05 | 4162.90 ms | 32.4% bf16 MFU | 125956 tok/s step 16357/19560 | loss 3.316406 (+0.04z)| norm 0.2554 (-0.55z)| lr 4.17e-05 | 4165.52 ms | 32.4% bf16 MFU | 125951 tok/s step 16358/19560 | loss 3.304115 (-0.25z)| norm 0.2594 (-0.14z)| lr 4.17e-05 | 4172.37 ms | 32.4% bf16 MFU | 125937 tok/s step 16359/19560 | loss 3.366431 (+1.16z)| norm 0.2736 (+1.28z)| lr 4.17e-05 | 4162.05 ms | 32.4% bf16 MFU | 125938 tok/s step 16360/19560 | loss 3.324988 (+0.21z)| norm 0.2563 (-0.48z)| lr 4.16e-05 | 4159.38 ms | 32.5% bf16 MFU | 125944 tok/s step 16361/19560 | loss 3.320258 (+0.10z)| norm 0.2574 (-0.37z)| lr 4.16e-05 | 4180.18 ms | 32.3% bf16 MFU | 125918 tok/s step 16362/19560 | loss 3.290408 (-0.57z)| norm 0.2497 (-1.15z)| lr 4.16e-05 | 4163.97 ms | 32.4% bf16 MFU | 125917 tok/s step 16363/19560 | loss 3.331406 (+0.36z)| norm 0.2576 (-0.35z)| lr 4.16e-05 | 4169.69 ms | 32.4% bf16 MFU | 125908 tok/s step 16364/19560 | loss 3.337252 (+0.50z)| norm 0.2555 (-0.56z)| lr 4.15e-05 | 4156.53 ms | 32.5% bf16 MFU | 125920 tok/s step 16365/19560 | loss 3.296877 (-0.43z)| norm 0.2596 (-0.15z)| lr 4.15e-05 | 4161.29 ms | 32.4% bf16 MFU | 125923 tok/s step 16366/19560 | loss 3.286134 
(-0.69z)| norm 0.2556 (-0.56z)| lr 4.15e-05 | 4167.17 ms | 32.4% bf16 MFU | 125918 tok/s step 16367/19560 | loss 3.339342 (+0.56z)| norm 0.2577 (-0.35z)| lr 4.15e-05 | 4159.31 ms | 32.5% bf16 MFU | 125925 tok/s step 16368/19560 | loss 3.344357 (+0.66z)| norm 0.2504 (-1.09z)| lr 4.14e-05 | 4157.94 ms | 32.5% bf16 MFU | 125933 tok/s step 16369/19560 | loss 3.382172 (+1.52z)| norm 0.2686 (+0.79z)| lr 4.14e-05 | 4156.48 ms | 32.5% bf16 MFU | 125943 tok/s step 16370/19560 | loss 3.352393 (+0.83z)| norm 0.2528 (-0.85z)| lr 4.14e-05 | 4168.53 ms | 32.4% bf16 MFU | 125935 tok/s step 16371/19560 | loss 3.351886 (+0.82z)| norm 0.2467 (-1.46z)| lr 4.14e-05 | 4328.16 ms | 31.2% bf16 MFU | 125695 tok/s step 16372/19560 | loss 3.351084 (+0.79z)| norm 0.2429 (-1.82z)| lr 4.13e-05 | 4170.73 ms | 32.4% bf16 MFU | 125695 tok/s step 16373/19560 | loss 3.383086 (+1.53z)| norm 0.2663 (+0.57z)| lr 4.13e-05 | 4157.94 ms | 32.5% bf16 MFU | 125715 tok/s step 16374/19560 | loss 3.418040 (+2.26z)| norm 0.3270 (+5.92z)| lr 4.13e-05 | 4165.60 ms | 32.4% bf16 MFU | 125722 tok/s step 16375/19560 | loss 3.317433 (-0.02z)| norm 0.2463 (-1.31z)| lr 4.13e-05 | 4165.61 ms | 32.4% bf16 MFU | 125729 tok/s step 16376/19560 | loss 3.316612 (-0.05z)| norm 0.2584 (-0.20z)| lr 4.12e-05 | 4164.75 ms | 32.4% bf16 MFU | 125737 tok/s step 16377/19560 | loss 3.334914 (+0.37z)| norm 0.2580 (-0.23z)| lr 4.12e-05 | 4164.80 ms | 32.4% bf16 MFU | 125745 tok/s step 16378/19560 | loss 3.454412 (+3.00z)| norm 0.2780 (+1.59z)| lr 4.12e-05 | 4162.39 ms | 32.4% bf16 MFU | 125755 tok/s step 16379/19560 | loss 3.278786 (-0.90z)| norm 0.3027 (+3.64z)| lr 4.12e-05 | 4164.60 ms | 32.4% bf16 MFU | 125762 tok/s step 16380/19560 | loss 3.324367 (+0.10z)| norm 0.2865 (+2.19z)| lr 4.11e-05 | 4159.28 ms | 32.5% bf16 MFU | 125777 tok/s step 16381/19560 | loss 3.419268 (+2.16z)| norm 0.2727 (+0.99z)| lr 4.11e-05 | 4167.68 ms | 32.4% bf16 MFU | 125778 tok/s step 16382/19560 | loss 3.313092 (-0.17z)| norm 0.2677 (+0.56z)| lr 4.11e-05 | 
4171.79 ms | 32.4% bf16 MFU | 125773 tok/s step 16383/19560 | loss 3.303221 (-0.39z)| norm 0.2526 (-0.71z)| lr 4.11e-05 | 4157.42 ms | 32.5% bf16 MFU | 125789 tok/s step 16384/19560 | loss 3.292170 (-0.62z)| norm 0.2445 (-1.38z)| lr 4.10e-05 | 4160.99 ms | 32.4% bf16 MFU | 125800 tok/s step 16385/19560 | loss 3.301633 (-0.41z)| norm 0.2623 (+0.11z)| lr 4.10e-05 | 4155.88 ms | 32.5% bf16 MFU | 125818 tok/s step 16386/19560 | loss 3.281154 (-0.86z)| norm 0.2478 (-1.10z)| lr 4.10e-05 | 4164.21 ms | 32.4% bf16 MFU | 125822 tok/s step 16387/19560 | loss 3.295747 (-0.55z)| norm 0.2538 (-0.59z)| lr 4.10e-05 | 4152.29 ms | 32.5% bf16 MFU | 125844 tok/s step 16388/19560 | loss 3.304557 (-0.35z)| norm 0.2489 (-1.01z)| lr 4.09e-05 | 4162.11 ms | 32.4% bf16 MFU | 125850 tok/s step 16389/19560 | loss 3.294940 (-0.57z)| norm 0.2406 (-1.69z)| lr 4.09e-05 | 4175.89 ms | 32.3% bf16 MFU | 125835 tok/s step 16390/19560 | loss 3.565466 (+4.93z)| norm 0.3178 (+4.39z)| lr 4.09e-05 | 4154.71 ms | 32.5% bf16 MFU | 125853 tok/s step 16391/19560 | loss 3.377976 (+1.12z)| norm 0.2504 (-0.83z)| lr 4.09e-05 | 4156.73 ms | 32.5% bf16 MFU | 125867 tok/s step 16392/19560 | loss 3.330966 (+0.17z)| norm 0.2516 (-0.72z)| lr 4.08e-05 | 4175.69 ms | 32.3% bf16 MFU | 125852 tok/s step 16393/19560 | loss 3.399961 (+1.54z)| norm 0.2898 (+2.20z)| lr 4.08e-05 | 4162.56 ms | 32.4% bf16 MFU | 125857 tok/s step 16394/19560 | loss 3.281401 (-0.84z)| norm 0.2504 (-0.81z)| lr 4.08e-05 | 4152.40 ms | 32.5% bf16 MFU | 125877 tok/s step 16395/19560 | loss 3.358982 (+0.71z)| norm 0.2680 (+0.53z)| lr 4.08e-05 | 4155.76 ms | 32.5% bf16 MFU | 125891 tok/s step 16396/19560 | loss 3.328426 (+0.09z)| norm 0.2652 (+0.35z)| lr 4.07e-05 | 4162.22 ms | 32.4% bf16 MFU | 125895 tok/s step 16397/19560 | loss 3.313219 (-0.22z)| norm 0.2477 (-1.02z)| lr 4.07e-05 | 4163.97 ms | 32.4% bf16 MFU | 125895 tok/s step 16398/19560 | loss 3.308988 (-0.30z)| norm 0.2537 (-0.54z)| lr 4.07e-05 | 4155.65 ms | 32.5% bf16 MFU | 125909 tok/s step 
16399/19560 | loss 3.263830 (-1.19z)| norm 0.2432 (-1.35z)| lr 4.07e-05 | 4163.68 ms | 32.4% bf16 MFU | 125909 tok/s step 16400/19560 | loss 3.280197 (-0.85z)| norm 0.2673 (+0.53z)| lr 4.06e-05 | 4158.08 ms | 32.5% bf16 MFU | 125918 tok/s step 16401/19560 | loss 3.314513 (-0.17z)| norm 0.2436 (-1.30z)| lr 4.06e-05 | 4180.23 ms | 32.3% bf16 MFU | 125893 tok/s step 16402/19560 | loss 3.393917 (+1.41z)| norm 0.2673 (+0.53z)| lr 4.06e-05 | 4161.58 ms | 32.4% bf16 MFU | 125898 tok/s step 16403/19560 | loss 3.268265 (-1.07z)| norm 0.2628 (+0.19z)| lr 4.06e-05 | 4150.62 ms | 32.5% bf16 MFU | 125919 tok/s step 16404/19560 | loss 3.296689 (-0.51z)| norm 0.2901 (+2.24z)| lr 4.05e-05 | 4151.65 ms | 32.5% bf16 MFU | 125937 tok/s step 16405/19560 | loss 3.347619 (+0.50z)| norm 0.2541 (-0.50z)| lr 4.05e-05 | 4153.56 ms | 32.5% bf16 MFU | 125952 tok/s step 16406/19560 | loss 3.262991 (-1.18z)| norm 0.2570 (-0.26z)| lr 4.05e-05 | 4156.38 ms | 32.5% bf16 MFU | 125961 tok/s step 16407/19560 | loss 3.345582 (+0.45z)| norm 0.2648 (+0.33z)| lr 4.05e-05 | 4151.54 ms | 32.5% bf16 MFU | 125977 tok/s step 16408/19560 | loss 3.377641 (+1.07z)| norm 0.2814 (+1.56z)| lr 4.04e-05 | 4161.33 ms | 32.4% bf16 MFU | 125978 tok/s step 16409/19560 | loss 3.338125 (+0.27z)| norm 0.2570 (-0.29z)| lr 4.04e-05 | 4164.28 ms | 32.4% bf16 MFU | 125974 tok/s step 16410/19560 | loss 3.409178 (+1.66z)| norm 0.2587 (-0.16z)| lr 4.04e-05 | 4163.01 ms | 32.4% bf16 MFU | 125972 tok/s step 16411/19560 | loss 3.306210 (-0.37z)| norm 0.2616 (+0.06z)| lr 4.04e-05 | 4172.85 ms | 32.4% bf16 MFU | 125956 tok/s step 16412/19560 | loss 3.397856 (+1.41z)| norm 0.2605 (-0.02z)| lr 4.03e-05 | 4165.20 ms | 32.4% bf16 MFU | 125952 tok/s step 16413/19560 | loss 3.319887 (-0.12z)| norm 0.2493 (-0.87z)| lr 4.03e-05 | 4162.14 ms | 32.4% bf16 MFU | 125952 tok/s step 16414/19560 | loss 3.319350 (-0.13z)| norm 0.2555 (-0.40z)| lr 4.03e-05 | 4155.56 ms | 32.5% bf16 MFU | 125963 tok/s step 16415/19560 | loss 3.295198 (-0.60z)| norm 
0.2614 (+0.04z)| lr 4.03e-05 | 4160.33 ms | 32.5% bf16 MFU | 125966 tok/s step 16416/19560 | loss 3.387544 (+1.20z)| norm 0.2693 (+0.63z)| lr 4.02e-05 | 4152.64 ms | 32.5% bf16 MFU | 125980 tok/s step 16417/19560 | loss 3.287984 (-0.75z)| norm 0.2506 (-0.79z)| lr 4.02e-05 | 4156.42 ms | 32.5% bf16 MFU | 125988 tok/s step 16418/19560 | loss 3.315143 (-0.22z)| norm 0.2468 (-1.07z)| lr 4.02e-05 | 4167.38 ms | 32.4% bf16 MFU | 125979 tok/s step 16419/19560 | loss 3.352411 (+0.50z)| norm 0.2542 (-0.51z)| lr 4.02e-05 | 4164.07 ms | 32.4% bf16 MFU | 125976 tok/s step 16420/19560 | loss 3.260306 (-1.29z)| norm 0.2528 (-0.62z)| lr 4.01e-05 | 4159.01 ms | 32.5% bf16 MFU | 125980 tok/s step 16421/19560 | loss 3.254035 (-1.39z)| norm 0.2630 (+0.15z)| lr 4.01e-05 | 4165.57 ms | 32.4% bf16 MFU | 125974 tok/s step 16422/19560 | loss 3.396139 (+1.34z)| norm 0.2586 (-0.19z)| lr 4.01e-05 | 4153.56 ms | 32.5% bf16 MFU | 125987 tok/s step 16423/19560 | loss 3.311721 (-0.29z)| norm 0.2572 (-0.29z)| lr 4.01e-05 | 4162.40 ms | 32.4% bf16 MFU | 125985 tok/s step 16424/19560 | loss 3.280883 (-0.87z)| norm 0.2539 (-0.54z)| lr 4.00e-05 | 4148.40 ms | 32.5% bf16 MFU | 126005 tok/s step 16425/19560 | loss 3.310982 (-0.29z)| norm 0.2505 (-0.79z)| lr 4.00e-05 | 4155.26 ms | 32.5% bf16 MFU | 126014 tok/s step 16426/19560 | loss 3.361242 (+0.66z)| norm 0.2493 (-0.87z)| lr 4.00e-05 | 4155.23 ms | 32.5% bf16 MFU | 126022 tok/s step 16427/19560 | loss 3.378735 (+0.99z)| norm 0.2497 (-0.83z)| lr 4.00e-05 | 4157.92 ms | 32.5% bf16 MFU | 126025 tok/s step 16428/19560 | loss 3.287634 (-0.76z)| norm 0.2499 (-0.82z)| lr 3.99e-05 | 4167.27 ms | 32.4% bf16 MFU | 126015 tok/s step 16429/19560 | loss 3.422059 (+1.79z)| norm 0.2679 (+0.54z)| lr 3.99e-05 | 4159.36 ms | 32.5% bf16 MFU | 126016 tok/s step 16430/19560 | loss 3.342137 (+0.25z)| norm 0.2686 (+0.58z)| lr 3.99e-05 | 4156.25 ms | 32.5% bf16 MFU | 126023 tok/s step 16431/19560 | loss 3.323881 (-0.11z)| norm 0.2684 (+0.57z)| lr 3.99e-05 | 4169.12 ms | 
32.4% bf16 MFU | 126009 tok/s step 16432/19560 | loss 3.366706 (+0.71z)| norm 0.2706 (+0.73z)| lr 3.98e-05 | 4163.96 ms | 32.4% bf16 MFU | 126004 tok/s step 16433/19560 | loss 3.267331 (-1.21z)| norm 0.2535 (-0.56z)| lr 3.98e-05 | 4156.82 ms | 32.5% bf16 MFU | 126011 tok/s step 16434/19560 | loss 3.361707 (+0.61z)| norm 0.2651 (+0.31z)| lr 3.98e-05 | 4157.16 ms | 32.5% bf16 MFU | 126016 tok/s step 16435/19560 | loss 3.348349 (+0.34z)| norm 0.2746 (+1.01z)| lr 3.98e-05 | 4152.59 ms | 32.5% bf16 MFU | 126028 tok/s step 16436/19560 | loss 3.318590 (-0.22z)| norm 0.2455 (-1.17z)| lr 3.97e-05 | 4152.85 ms | 32.5% bf16 MFU | 126039 tok/s step 16437/19560 | loss 3.388359 (+1.17z)| norm 0.2882 (+2.00z)| lr 3.97e-05 | 4155.60 ms | 32.5% bf16 MFU | 126045 tok/s step 16438/19560 | loss 3.333344 (+0.07z)| norm 0.2621 (+0.06z)| lr 3.97e-05 | 4162.29 ms | 32.4% bf16 MFU | 126041 tok/s step 16439/19560 | loss 3.279335 (-1.03z)| norm 0.2867 (+1.84z)| lr 3.97e-05 | 4158.95 ms | 32.5% bf16 MFU | 126042 tok/s step 16440/19560 | loss 3.360530 (+0.60z)| norm 0.2694 (+0.57z)| lr 3.96e-05 | 4158.61 ms | 32.5% bf16 MFU | 126044 tok/s step 16441/19560 | loss 3.331875 (+0.02z)| norm 0.2643 (+0.19z)| lr 3.96e-05 | 4162.28 ms | 32.4% bf16 MFU | 126039 tok/s step 16442/19560 | loss 3.331038 (-0.02z)| norm 0.2857 (+1.73z)| lr 3.96e-05 | 4158.28 ms | 32.5% bf16 MFU | 126042 tok/s step 16443/19560 | loss 3.282537 (-1.03z)| norm 0.2769 (+1.08z)| lr 3.96e-05 | 4156.67 ms | 32.5% bf16 MFU | 126046 tok/s step 16444/19560 | loss 3.307317 (-0.52z)| norm 0.2563 (-0.41z)| lr 3.95e-05 | 4164.38 ms | 32.4% bf16 MFU | 126039 tok/s step 16445/19560 | loss 3.303341 (-0.60z)| norm 0.2678 (+0.42z)| lr 3.95e-05 | 4158.29 ms | 32.5% bf16 MFU | 126041 tok/s step 16446/19560 | loss 3.384353 (+1.07z)| norm 0.3174 (+3.74z)| lr 3.95e-05 | 4162.22 ms | 32.4% bf16 MFU | 126037 tok/s step 16447/19560 | loss 3.282352 (-1.04z)| norm 0.2619 (-0.04z)| lr 3.95e-05 | 4154.88 ms | 32.5% bf16 MFU | 126045 tok/s step 16448/19560 
| loss 3.322098 (-0.20z)| norm 0.2519 (-0.71z)| lr 3.94e-05 | 4156.90 ms | 32.5% bf16 MFU | 126049 tok/s step 16449/19560 | loss 3.420137 (+1.81z)| norm 0.2968 (+2.28z)| lr 3.94e-05 | 4153.07 ms | 32.5% bf16 MFU | 126058 tok/s step 16450/19560 | loss 3.264610 (-1.41z)| norm 0.2564 (-0.42z)| lr 3.94e-05 | 4155.55 ms | 32.5% bf16 MFU | 126064 tok/s step 16451/19560 | loss 3.310452 (-0.45z)| norm 0.2667 (+0.26z)| lr 3.94e-05 | 4163.94 ms | 32.4% bf16 MFU | 126056 tok/s step 16452/19560 | loss 3.351346 (+0.42z)| norm 0.2527 (-0.67z)| lr 3.93e-05 | 4157.21 ms | 32.5% bf16 MFU | 126059 tok/s step 16453/19560 | loss 3.264384 (-1.41z)| norm 0.2561 (-0.43z)| lr 3.93e-05 | 4159.34 ms | 32.5% bf16 MFU | 126059 tok/s step 16454/19560 | loss 3.311909 (-0.38z)| norm 0.2508 (-0.79z)| lr 3.93e-05 | 4164.51 ms | 32.4% bf16 MFU | 126050 tok/s step 16455/19560 | loss 3.346075 (+0.34z)| norm 0.2750 (+0.84z)| lr 3.93e-05 | 4153.29 ms | 32.5% bf16 MFU | 126060 tok/s step 16456/19560 | loss 3.319918 (-0.22z)| norm 0.2737 (+0.74z)| lr 3.92e-05 | 4153.10 ms | 32.5% bf16 MFU | 126069 tok/s step 16457/19560 | loss 3.365000 (+0.74z)| norm 0.2527 (-0.68z)| lr 3.92e-05 | 4157.22 ms | 32.5% bf16 MFU | 126071 tok/s step 16458/19560 | loss 3.401899 (+1.50z)| norm 0.2739 (+0.75z)| lr 3.92e-05 | 4159.13 ms | 32.5% bf16 MFU | 126070 tok/s step 16459/19560 | loss 3.298579 (-0.70z)| norm 0.2648 (+0.13z)| lr 3.92e-05 | 4161.25 ms | 32.4% bf16 MFU | 126066 tok/s step 16460/19560 | loss 3.273593 (-1.21z)| norm 0.2704 (+0.50z)| lr 3.91e-05 | 4165.77 ms | 32.4% bf16 MFU | 126056 tok/s step 16461/19560 | loss 3.320147 (-0.22z)| norm 0.2544 (-0.57z)| lr 3.91e-05 | 4160.69 ms | 32.5% bf16 MFU | 126054 tok/s step 16462/19560 | loss 3.351737 (+0.45z)| norm 0.2657 (+0.19z)| lr 3.91e-05 | 4152.78 ms | 32.5% bf16 MFU | 126063 tok/s step 16463/19560 | loss 3.389009 (+1.22z)| norm 0.2528 (-0.67z)| lr 3.91e-05 | 4154.34 ms | 32.5% bf16 MFU | 126070 tok/s step 16464/19560 | loss 3.305545 (-0.54z)| norm 0.2526 (-0.68z)| 
lr 3.90e-05 | 4155.83 ms | 32.5% bf16 MFU | 126075 tok/s step 16465/19560 | loss 3.447165 (+2.38z)| norm 0.2682 (+0.37z)| lr 3.90e-05 | 4152.71 ms | 32.5% bf16 MFU | 126083 tok/s step 16466/19560 | loss 3.300124 (-0.66z)| norm 0.2574 (-0.36z)| lr 3.90e-05 | 4171.38 ms | 32.4% bf16 MFU | 126064 tok/s step 16467/19560 | loss 3.275911 (-1.15z)| norm 0.2583 (-0.30z)| lr 3.90e-05 | 4156.17 ms | 32.5% bf16 MFU | 126068 tok/s step 16468/19560 | loss 3.314320 (-0.36z)| norm 0.2504 (-0.83z)| lr 3.89e-05 | 4162.41 ms | 32.4% bf16 MFU | 126062 tok/s step 16469/19560 | loss 3.316091 (-0.32z)| norm 0.2627 (-0.01z)| lr 3.89e-05 | 4157.92 ms | 32.5% bf16 MFU | 126064 tok/s step 16470/19560 | loss 3.337436 (+0.11z)| norm 0.2559 (-0.47z)| lr 3.89e-05 | 4166.25 ms | 32.4% bf16 MFU | 126053 tok/s step 16471/19560 | loss 3.327759 (-0.07z)| norm 0.2468 (-1.06z)| lr 3.89e-05 | 4156.27 ms | 32.5% bf16 MFU | 126057 tok/s step 16472/19560 | loss 3.292900 (-0.79z)| norm 0.2510 (-0.77z)| lr 3.88e-05 | 4155.60 ms | 32.5% bf16 MFU | 126063 tok/s step 16473/19560 | loss 3.337563 (+0.14z)| norm 0.2524 (-0.68z)| lr 3.88e-05 | 4191.64 ms | 32.2% bf16 MFU | 126014 tok/s step 16474/19560 | loss 3.282432 (-1.00z)| norm 0.2718 (+0.62z)| lr 3.88e-05 | 4155.08 ms | 32.5% bf16 MFU | 126022 tok/s step 16475/19560 | loss 3.361701 (+0.63z)| norm 0.2445 (-1.21z)| lr 3.88e-05 | 4159.93 ms | 32.5% bf16 MFU | 126022 tok/s step 16476/19560 | loss 3.286152 (-0.92z)| norm 0.2433 (-1.26z)| lr 3.87e-05 | 4160.16 ms | 32.5% bf16 MFU | 126023 tok/s step 16477/19560 | loss 3.307596 (-0.47z)| norm 0.2578 (-0.28z)| lr 3.87e-05 | 4156.88 ms | 32.5% bf16 MFU | 126028 tok/s step 16478/19560 | loss 3.390199 (+1.23z)| norm 0.2488 (-0.88z)| lr 3.87e-05 | 4160.38 ms | 32.5% bf16 MFU | 126027 tok/s step 16479/19560 | loss 3.255672 (-1.54z)| norm 0.2422 (-1.30z)| lr 3.87e-05 | 4154.22 ms | 32.5% bf16 MFU | 126036 tok/s step 16480/19560 | loss 3.290511 (-0.84z)| norm 0.2546 (-0.47z)| lr 3.86e-05 | 4148.98 ms | 32.5% bf16 MFU | 
126053 tok/s step 16481/19560 | loss 3.361839 (+0.65z)| norm 0.2547 (-0.45z)| lr 3.86e-05 | 4160.22 ms | 32.5% bf16 MFU | 126051 tok/s step 16482/19560 | loss 3.361217 (+0.63z)| norm 0.2433 (-1.19z)| lr 3.86e-05 | 4150.06 ms | 32.5% bf16 MFU | 126065 tok/s step 16483/19560 | loss 3.410504 (+1.63z)| norm 0.2811 (+1.30z)| lr 3.86e-05 | 4158.40 ms | 32.5% bf16 MFU | 126066 tok/s step 16484/19560 | loss 3.355741 (+0.49z)| norm 0.2536 (-0.52z)| lr 3.86e-05 | 4155.84 ms | 32.5% bf16 MFU | 126071 tok/s step 16485/19560 | loss 3.319887 (-0.25z)| norm 0.2889 (+1.77z)| lr 3.85e-05 | 4148.18 ms | 32.5% bf16 MFU | 126087 tok/s step 16486/19560 | loss 3.322413 (-0.20z)| norm 0.2482 (-0.87z)| lr 3.85e-05 | 4157.26 ms | 32.5% bf16 MFU | 126088 tok/s step 16487/19560 | loss 3.334124 (+0.05z)| norm 0.2585 (-0.20z)| lr 3.85e-05 | 4162.48 ms | 32.4% bf16 MFU | 126081 tok/s step 16488/19560 | loss 3.305995 (-0.53z)| norm 0.2542 (-0.48z)| lr 3.85e-05 | 4156.41 ms | 32.5% bf16 MFU | 126084 tok/s step 16489/19560 | loss 3.273600 (-1.19z)| norm 0.2537 (-0.50z)| lr 3.84e-05 | 4149.49 ms | 32.5% bf16 MFU | 126097 tok/s step 16490/19560 | loss 3.271984 (-1.22z)| norm 0.2599 (-0.11z)| lr 3.84e-05 | 4155.17 ms | 32.5% bf16 MFU | 126101 tok/s step 16491/19560 | loss 3.309601 (-0.44z)| norm 0.2535 (-0.52z)| lr 3.84e-05 | 4163.64 ms | 32.4% bf16 MFU | 126092 tok/s step 16492/19560 | loss 3.266593 (-1.30z)| norm 0.2572 (-0.28z)| lr 3.84e-05 | 4166.41 ms | 32.4% bf16 MFU | 126080 tok/s step 16493/19560 | loss 3.322282 (-0.17z)| norm 0.2559 (-0.37z)| lr 3.83e-05 | 4157.80 ms | 32.5% bf16 MFU | 126081 tok/s step 16494/19560 | loss 3.353362 (+0.46z)| norm 0.2801 (+1.20z)| lr 3.83e-05 | 4156.96 ms | 32.5% bf16 MFU | 126083 tok/s step 16495/19560 | loss 3.278847 (-1.06z)| norm 0.2584 (-0.22z)| lr 3.83e-05 | 4154.06 ms | 32.5% bf16 MFU | 126089 tok/s step 16496/19560 | loss 3.313473 (-0.35z)| norm 0.2519 (-0.64z)| lr 3.83e-05 | 4147.61 ms | 32.6% bf16 MFU | 126105 tok/s step 16497/19560 | loss 3.339581 
(+0.19z)| norm 0.2546 (-0.45z)| lr 3.82e-05 | 4154.63 ms | 32.5% bf16 MFU | 126109 tok/s step 16498/19560 | loss 3.363699 (+0.69z)| norm 0.2587 (-0.19z)| lr 3.82e-05 | 4149.78 ms | 32.5% bf16 MFU | 126121 tok/s step 16499/19560 | loss 3.309483 (-0.42z)| norm 0.2614 (-0.02z)| lr 3.82e-05 | 4158.31 ms | 32.5% bf16 MFU | 126119 tok/s step 16500/19560 | loss 3.277075 (-1.07z)| norm 0.2704 (+0.56z)| lr 3.82e-05 | 4164.85 ms | 32.4% bf16 MFU | 126107 tok/s val loss 3.278083 evaluating HellaSwag: 0/1256 evaluating HellaSwag: 10/1256 evaluating HellaSwag: 20/1256 evaluating HellaSwag: 30/1256 evaluating HellaSwag: 40/1256 evaluating HellaSwag: 50/1256 evaluating HellaSwag: 60/1256 evaluating HellaSwag: 70/1256 evaluating HellaSwag: 80/1256 evaluating HellaSwag: 90/1256 evaluating HellaSwag: 100/1256 evaluating HellaSwag: 110/1256 evaluating HellaSwag: 120/1256 evaluating HellaSwag: 130/1256 evaluating HellaSwag: 140/1256 evaluating HellaSwag: 150/1256 evaluating HellaSwag: 160/1256 evaluating HellaSwag: 170/1256 evaluating HellaSwag: 180/1256 evaluating HellaSwag: 190/1256 evaluating HellaSwag: 200/1256 evaluating HellaSwag: 210/1256 evaluating HellaSwag: 220/1256 evaluating HellaSwag: 230/1256 evaluating HellaSwag: 240/1256 evaluating HellaSwag: 250/1256 evaluating HellaSwag: 260/1256 evaluating HellaSwag: 270/1256 evaluating HellaSwag: 280/1256 evaluating HellaSwag: 290/1256 evaluating HellaSwag: 300/1256 evaluating HellaSwag: 310/1256 evaluating HellaSwag: 320/1256 evaluating HellaSwag: 330/1256 evaluating HellaSwag: 340/1256 evaluating HellaSwag: 350/1256 evaluating HellaSwag: 360/1256 evaluating HellaSwag: 370/1256 evaluating HellaSwag: 380/1256 evaluating HellaSwag: 390/1256 evaluating HellaSwag: 400/1256 evaluating HellaSwag: 410/1256 evaluating HellaSwag: 420/1256 evaluating HellaSwag: 430/1256 evaluating HellaSwag: 440/1256 evaluating HellaSwag: 450/1256 evaluating HellaSwag: 460/1256 evaluating HellaSwag: 470/1256 evaluating HellaSwag: 480/1256 evaluating 
HellaSwag: 1130/1256 evaluating HellaSwag: 1140/1256 evaluating HellaSwag: 1150/1256 evaluating HellaSwag: 1160/1256 evaluating HellaSwag: 1170/1256 evaluating HellaSwag: 1180/1256 evaluating HellaSwag: 1190/1256 evaluating HellaSwag: 1200/1256 evaluating HellaSwag: 1210/1256 evaluating HellaSwag: 1220/1256 evaluating HellaSwag: 1230/1256 evaluating HellaSwag: 1240/1256 evaluating HellaSwag: 1250/1256 HellaSwag: 3003/10042 = 0.299044 step 16501/19560 | loss 3.329403 (+0.01z)| norm 0.2572 (-0.31z)| lr 3.81e-05 | 4147.73 ms | 32.6% bf16 MFU | 126122 tok/s step 16502/19560 | loss 3.372344 (+0.91z)| norm 0.2496 (-0.83z)| lr 3.81e-05 | 4157.09 ms | 32.5% bf16 MFU | 126122 tok/s step 16503/19560 | loss 3.317293 (-0.23z)| norm 0.2701 (+0.61z)| lr 3.81e-05 | 4154.30 ms | 32.5% bf16 MFU | 126126 tok/s step 16504/19560 | loss 3.310106 (-0.38z)| norm 0.2595 (-0.14z)| lr 3.81e-05 | 4176.11 ms | 32.3% bf16 MFU | 126097 tok/s step 16505/19560 | loss 3.294642 (-0.69z)| norm 0.2380 (-1.64z)| lr 3.80e-05 | 4198.62 ms | 32.2% bf16 MFU | 126036 tok/s step 16506/19560 | loss 3.295195 (-0.67z)| norm 0.2426 (-1.29z)| lr 3.80e-05 | 4176.99 ms | 32.3% bf16 MFU | 126010 tok/s step 16507/19560 | loss 3.336446 (+0.19z)| norm 0.2577 (-0.22z)| lr 3.80e-05 | 4173.80 ms | 32.3% bf16 MFU | 125990 tok/s step 16508/19560 | loss 3.355400 (+0.59z)| norm 0.2516 (-0.65z)| lr 3.80e-05 | 4160.20 ms | 32.5% bf16 MFU | 125992 tok/s step 16509/19560 | loss 3.303745 (-0.50z)| norm 0.2659 (+0.41z)| lr 3.79e-05 | 4159.76 ms | 32.5% bf16 MFU | 125994 tok/s step 16510/19560 | loss 3.305336 (-0.46z)| norm 0.2422 (-1.32z)| lr 3.79e-05 | 4154.26 ms | 32.5% bf16 MFU | 126005 tok/s step 16511/19560 | loss 3.328355 (+0.03z)| norm 0.2443 (-1.15z)| lr 3.79e-05 | 4166.37 ms | 32.4% bf16 MFU | 125996 tok/s step 16512/19560 | loss 3.348648 (+0.46z)| norm 0.2769 (+1.20z)| lr 3.79e-05 | 4168.17 ms | 32.4% bf16 MFU | 125986 tok/s step 16513/19560 | loss 3.292227 (-0.76z)| norm 0.2451 (-1.10z)| lr 3.78e-05 | 4158.93 ms | 32.5% 
bf16 MFU | 125990 tok/s step 16514/19560 | loss 3.310263 (-0.37z)| norm 0.2495 (-0.78z)| lr 3.78e-05 | 4152.77 ms | 32.5% bf16 MFU | 126003 tok/s step 16515/19560 | loss 3.272421 (-1.18z)| norm 0.2468 (-0.97z)| lr 3.78e-05 | 4157.19 ms | 32.5% bf16 MFU | 126008 tok/s step 16516/19560 | loss 3.299080 (-0.61z)| norm 0.2497 (-0.76z)| lr 3.78e-05 | 4162.18 ms | 32.4% bf16 MFU | 126006 tok/s step 16517/19560 | loss 3.279138 (-1.03z)| norm 0.2524 (-0.58z)| lr 3.77e-05 | 4163.54 ms | 32.4% bf16 MFU | 126002 tok/s step 16518/19560 | loss 3.304546 (-0.49z)| norm 0.2430 (-1.31z)| lr 3.77e-05 | 4155.02 ms | 32.5% bf16 MFU | 126011 tok/s step 16519/19560 | loss 3.307501 (-0.41z)| norm 0.2663 (+0.51z)| lr 3.77e-05 | 4166.39 ms | 32.4% bf16 MFU | 126002 tok/s step 16520/19560 | loss 3.300587 (-0.57z)| norm 0.2420 (-1.38z)| lr 3.77e-05 | 4159.37 ms | 32.5% bf16 MFU | 126005 tok/s step 16521/19560 | loss 3.260051 (-1.54z)| norm 0.2520 (-0.60z)| lr 3.76e-05 | 4154.05 ms | 32.5% bf16 MFU | 126015 tok/s step 16522/19560 | loss 3.383582 (+1.45z)| norm 0.2533 (-0.49z)| lr 3.76e-05 | 4152.08 ms | 32.5% bf16 MFU | 126028 tok/s step 16523/19560 | loss 3.347798 (+0.58z)| norm 0.2419 (-1.37z)| lr 3.76e-05 | 4154.80 ms | 32.5% bf16 MFU | 126036 tok/s step 16524/19560 | loss 3.316659 (-0.17z)| norm 0.2411 (-1.41z)| lr 3.76e-05 | 4152.44 ms | 32.5% bf16 MFU | 126047 tok/s step 16525/19560 | loss 3.277488 (-1.11z)| norm 0.2468 (-0.96z)| lr 3.76e-05 | 4156.62 ms | 32.5% bf16 MFU | 126051 tok/s step 16526/19560 | loss 3.295453 (-0.67z)| norm 0.2418 (-1.34z)| lr 3.75e-05 | 4163.05 ms | 32.4% bf16 MFU | 126046 tok/s step 16527/19560 | loss 3.339729 (+0.38z)| norm 0.2601 (+0.07z)| lr 3.75e-05 | 4149.31 ms | 32.5% bf16 MFU | 126061 tok/s step 16528/19560 | loss 3.300934 (-0.57z)| norm 0.2514 (-0.60z)| lr 3.75e-05 | 4153.09 ms | 32.5% bf16 MFU | 126070 tok/s step 16529/19560 | loss 3.413897 (+2.14z)| norm 0.2578 (-0.10z)| lr 3.75e-05 | 4156.89 ms | 32.5% bf16 MFU | 126073 tok/s step 16530/19560 | loss 
3.247221 (-1.83z)| norm 0.2629 (+0.30z)| lr 3.74e-05 | 4151.80 ms | 32.5% bf16 MFU | 126083 tok/s step 16531/19560 | loss 3.325805 (+0.04z)| norm 0.2703 (+0.88z)| lr 3.74e-05 | 4155.50 ms | 32.5% bf16 MFU | 126087 tok/s step 16532/19560 | loss 3.367020 (+1.02z)| norm 0.2621 (+0.26z)| lr 3.74e-05 | 4157.11 ms | 32.5% bf16 MFU | 126089 tok/s step 16533/19560 | loss 3.285732 (-0.92z)| norm 0.2475 (-0.92z)| lr 3.74e-05 | 4153.69 ms | 32.5% bf16 MFU | 126096 tok/s step 16534/19560 | loss 3.281310 (-1.04z)| norm 0.2767 (+1.41z)| lr 3.73e-05 | 4152.16 ms | 32.5% bf16 MFU | 126104 tok/s step 16535/19560 | loss 3.256320 (-1.61z)| norm 0.2543 (-0.37z)| lr 3.73e-05 | 4154.46 ms | 32.5% bf16 MFU | 126109 tok/s step 16536/19560 | loss 3.269717 (-1.27z)| norm 0.2549 (-0.31z)| lr 3.73e-05 | 4149.70 ms | 32.5% bf16 MFU | 126121 tok/s step 16537/19560 | loss 3.323543 (+0.02z)| norm 0.2609 (+0.17z)| lr 3.73e-05 | 4148.58 ms | 32.5% bf16 MFU | 126134 tok/s step 16538/19560 | loss 3.283540 (-0.93z)| norm 0.2546 (-0.34z)| lr 3.72e-05 | 4152.54 ms | 32.5% bf16 MFU | 126140 tok/s step 16539/19560 | loss 3.288887 (-0.79z)| norm 0.2607 (+0.16z)| lr 3.72e-05 | 4153.06 ms | 32.5% bf16 MFU | 126145 tok/s step 16540/19560 | loss 3.320989 (-0.00z)| norm 0.2623 (+0.29z)| lr 3.72e-05 | 4151.98 ms | 32.5% bf16 MFU | 126151 tok/s step 16541/19560 | loss 3.305205 (-0.39z)| norm 0.2670 (+0.65z)| lr 3.72e-05 | 4153.95 ms | 32.5% bf16 MFU | 126154 tok/s step 16542/19560 | loss 3.233852 (-2.08z)| norm 0.2588 (-0.00z)| lr 3.71e-05 | 4157.42 ms | 32.5% bf16 MFU | 126152 tok/s step 16543/19560 | loss 3.342201 (+0.52z)| norm 0.2675 (+0.69z)| lr 3.71e-05 | 4151.60 ms | 32.5% bf16 MFU | 126159 tok/s step 16544/19560 | loss 3.306869 (-0.32z)| norm 0.2540 (-0.39z)| lr 3.71e-05 | 4154.02 ms | 32.5% bf16 MFU | 126162 tok/s step 16545/19560 | loss 3.301776 (-0.45z)| norm 0.2678 (+0.71z)| lr 3.71e-05 | 4154.51 ms | 32.5% bf16 MFU | 126163 tok/s step 16546/19560 | loss 3.296050 (-0.58z)| norm 0.2537 (-0.43z)| lr 
3.70e-05 | 4147.37 ms | 32.6% bf16 MFU | 126176 tok/s step 16547/19560 | loss 3.327019 (+0.18z)| norm 0.2588 (-0.02z)| lr 3.70e-05 | 4158.78 ms | 32.5% bf16 MFU | 126170 tok/s step 16548/19560 | loss 3.261523 (-1.42z)| norm 0.2607 (+0.13z)| lr 3.70e-05 | 4153.81 ms | 32.5% bf16 MFU | 126173 tok/s step 16549/19560 | loss 3.274589 (-1.11z)| norm 0.2765 (+1.39z)| lr 3.70e-05 | 4149.25 ms | 32.5% bf16 MFU | 126182 tok/s step 16550/19560 | loss 3.257720 (-1.51z)| norm 0.2412 (-1.43z)| lr 3.69e-05 | 4162.38 ms | 32.4% bf16 MFU | 126171 tok/s step 16551/19560 | loss 3.266329 (-1.28z)| norm 0.2579 (-0.09z)| lr 3.69e-05 | 4162.62 ms | 32.4% bf16 MFU | 126160 tok/s step 16552/19560 | loss 3.303677 (-0.37z)| norm 0.2644 (+0.42z)| lr 3.69e-05 | 4152.97 ms | 32.5% bf16 MFU | 126164 tok/s step 16553/19560 | loss 3.251990 (-1.61z)| norm 0.2538 (-0.43z)| lr 3.69e-05 | 4153.17 ms | 32.5% bf16 MFU | 126168 tok/s step 16554/19560 | loss 3.285366 (-0.79z)| norm 0.2527 (-0.52z)| lr 3.69e-05 | 4155.18 ms | 32.5% bf16 MFU | 126168 tok/s step 16555/19560 | loss 3.254217 (-1.52z)| norm 0.2596 (+0.02z)| lr 3.68e-05 | 4163.81 ms | 32.4% bf16 MFU | 126156 tok/s step 16556/19560 | loss 3.310042 (-0.17z)| norm 0.2595 (+0.01z)| lr 3.68e-05 | 4154.58 ms | 32.5% bf16 MFU | 126158 tok/s step 16557/19560 | loss 3.249831 (-1.63z)| norm 0.2462 (-1.05z)| lr 3.68e-05 | 4158.10 ms | 32.5% bf16 MFU | 126154 tok/s step 16558/19560 | loss 3.343117 (+0.68z)| norm 0.2657 (+0.53z)| lr 3.68e-05 | 4162.16 ms | 32.4% bf16 MFU | 126145 tok/s step 16559/19560 | loss 3.299922 (-0.38z)| norm 0.2809 (+1.72z)| lr 3.67e-05 | 4151.54 ms | 32.5% bf16 MFU | 126152 tok/s step 16560/19560 | loss 3.329458 (+0.36z)| norm 0.3072 (+3.60z)| lr 3.67e-05 | 4157.75 ms | 32.5% bf16 MFU | 126149 tok/s step 16561/19560 | loss 3.290666 (-0.61z)| norm 0.2505 (-0.68z)| lr 3.67e-05 | 4151.64 ms | 32.5% bf16 MFU | 126156 tok/s step 16562/19560 | loss 3.345380 (+0.76z)| norm 0.2508 (-0.65z)| lr 3.67e-05 | 4158.13 ms | 32.5% bf16 MFU | 126153 
tok/s step 16563/19560 | loss 3.287042 (-0.69z)| norm 0.2666 (+0.55z)| lr 3.66e-05 | 4153.79 ms | 32.5% bf16 MFU | 126156 tok/s step 16564/19560 | loss 3.237477 (-1.89z)| norm 0.2530 (-0.49z)| lr 3.66e-05 | 4154.28 ms | 32.5% bf16 MFU | 126158 tok/s step 16565/19560 | loss 3.290836 (-0.56z)| norm 0.2677 (+0.65z)| lr 3.66e-05 | 4152.13 ms | 32.5% bf16 MFU | 126164 tok/s step 16566/19560 | loss 3.337529 (+0.61z)| norm 0.2472 (-0.92z)| lr 3.66e-05 | 4151.16 ms | 32.5% bf16 MFU | 126171 tok/s step 16567/19560 | loss 3.298984 (-0.36z)| norm 0.2471 (-0.92z)| lr 3.65e-05 | 4147.74 ms | 32.6% bf16 MFU | 126182 tok/s step 16568/19560 | loss 3.338042 (+0.63z)| norm 0.2607 (+0.15z)| lr 3.65e-05 | 4152.82 ms | 32.5% bf16 MFU | 126186 tok/s step 16569/19560 | loss 3.313573 (+0.01z)| norm 0.2599 (+0.09z)| lr 3.65e-05 | 4146.28 ms | 32.6% bf16 MFU | 126199 tok/s step 16570/19560 | loss 3.322672 (+0.24z)| norm 0.2530 (-0.44z)| lr 3.65e-05 | 4150.04 ms | 32.5% bf16 MFU | 126205 tok/s step 16571/19560 | loss 3.305911 (-0.18z)| norm 0.2467 (-0.93z)| lr 3.64e-05 | 4154.05 ms | 32.5% bf16 MFU | 126206 tok/s step 16572/19560 | loss 3.335588 (+0.56z)| norm 0.2846 (+2.07z)| lr 3.64e-05 | 4153.91 ms | 32.5% bf16 MFU | 126206 tok/s step 16573/19560 | loss 3.252336 (-1.52z)| norm 0.2537 (-0.37z)| lr 3.64e-05 | 4162.42 ms | 32.4% bf16 MFU | 126194 tok/s step 16574/19560 | loss 3.342896 (+0.77z)| norm 0.2631 (+0.45z)| lr 3.64e-05 | 4147.68 ms | 32.6% bf16 MFU | 126204 tok/s step 16575/19560 | loss 3.351169 (+0.96z)| norm 0.2637 (+0.50z)| lr 3.64e-05 | 4153.17 ms | 32.5% bf16 MFU | 126206 tok/s step 16576/19560 | loss 3.317678 (+0.11z)| norm 0.2581 (+0.01z)| lr 3.63e-05 | 4148.90 ms | 32.5% bf16 MFU | 126214 tok/s step 16577/19560 | loss 3.290551 (-0.56z)| norm 0.2482 (-0.86z)| lr 3.63e-05 | 4163.82 ms | 32.4% bf16 MFU | 126199 tok/s step 16578/19560 | loss 3.368811 (+1.45z)| norm 0.2712 (+1.22z)| lr 3.63e-05 | 4157.01 ms | 32.5% bf16 MFU | 126195 tok/s step 16579/19560 | loss 3.347671 
(+0.89z)| norm 0.2729 (+1.36z)| lr 3.63e-05 | 4159.19 ms | 32.5% bf16 MFU | 126188 tok/s step 16580/19560 | loss 3.333916 (+0.54z)| norm 0.2559 (-0.17z)| lr 3.62e-05 | 4150.66 ms | 32.5% bf16 MFU | 126195 tok/s step 16581/19560 | loss 3.320559 (+0.18z)| norm 0.2457 (-1.07z)| lr 3.62e-05 | 4148.59 ms | 32.5% bf16 MFU | 126204 tok/s step 16582/19560 | loss 3.254893 (-1.51z)| norm 0.2501 (-0.68z)| lr 3.62e-05 | 4152.32 ms | 32.5% bf16 MFU | 126207 tok/s step 16583/19560 | loss 3.347072 (+0.88z)| norm 0.2666 (+0.81z)| lr 3.62e-05 | 4156.79 ms | 32.5% bf16 MFU | 126203 tok/s step 16584/19560 | loss 3.314727 (+0.04z)| norm 0.2632 (+0.51z)| lr 3.61e-05 | 4146.94 ms | 32.6% bf16 MFU | 126214 tok/s step 16585/19560 | loss 3.342573 (+0.77z)| norm 0.2483 (-0.84z)| lr 3.61e-05 | 4152.56 ms | 32.5% bf16 MFU | 126216 tok/s step 16586/19560 | loss 3.401926 (+2.31z)| norm 0.2521 (-0.48z)| lr 3.61e-05 | 4151.19 ms | 32.5% bf16 MFU | 126220 tok/s step 16587/19560 | loss 3.362603 (+1.27z)| norm 0.2668 (+0.86z)| lr 3.61e-05 | 4152.35 ms | 32.5% bf16 MFU | 126222 tok/s step 16588/19560 | loss 3.323673 (+0.26z)| norm 0.2799 (+2.03z)| lr 3.60e-05 | 4165.61 ms | 32.4% bf16 MFU | 126204 tok/s step 16589/19560 | loss 3.282002 (-0.82z)| norm 0.2465 (-0.98z)| lr 3.60e-05 | 4159.53 ms | 32.5% bf16 MFU | 126196 tok/s step 16590/19560 | loss 3.334668 (+0.55z)| norm 0.2714 (+1.25z)| lr 3.60e-05 | 4166.03 ms | 32.4% bf16 MFU | 126179 tok/s step 16591/19560 | loss 3.317029 (+0.11z)| norm 0.2583 (+0.08z)| lr 3.60e-05 | 4153.05 ms | 32.5% bf16 MFU | 126182 tok/s step 16592/19560 | loss 3.268608 (-1.15z)| norm 0.2584 (+0.08z)| lr 3.59e-05 | 4153.23 ms | 32.5% bf16 MFU | 126185 tok/s step 16593/19560 | loss 3.302917 (-0.23z)| norm 0.2635 (+0.54z)| lr 3.59e-05 | 4158.05 ms | 32.5% bf16 MFU | 126180 tok/s step 16594/19560 | loss 3.421863 (+2.92z)| norm 0.2486 (-0.79z)| lr 3.59e-05 | 4160.87 ms | 32.4% bf16 MFU | 126171 tok/s step 16595/19560 | loss 3.293500 (-0.51z)| norm 0.2623 (+0.43z)| lr 3.59e-05 | 
4152.98 ms | 32.5% bf16 MFU | 126175 tok/s step 16596/19560 | loss 3.309103 (-0.09z)| norm 0.2531 (-0.39z)| lr 3.59e-05 | 4150.88 ms | 32.5% bf16 MFU | 126182 tok/s step 16597/19560 | loss 3.289444 (-0.61z)| norm 0.2651 (+0.68z)| lr 3.58e-05 | 4148.18 ms | 32.5% bf16 MFU | 126192 tok/s step 16598/19560 | loss 3.342903 (+0.82z)| norm 0.2463 (-1.00z)| lr 3.58e-05 | 4166.86 ms | 32.4% bf16 MFU | 126174 tok/s step 16599/19560 | loss 3.348204 (+0.95z)| norm 0.2565 (-0.09z)| lr 3.58e-05 | 4147.87 ms | 32.6% bf16 MFU | 126185 tok/s step 16600/19560 | loss 3.310231 (-0.06z)| norm 0.2617 (+0.37z)| lr 3.58e-05 | 4150.85 ms | 32.5% bf16 MFU | 126191 tok/s step 16601/19560 | loss 3.303397 (-0.24z)| norm 0.2632 (+0.49z)| lr 3.57e-05 | 4161.31 ms | 32.4% bf16 MFU | 126181 tok/s step 16602/19560 | loss 3.297686 (-0.39z)| norm 0.2545 (-0.27z)| lr 3.57e-05 | 4164.80 ms | 32.4% bf16 MFU | 126166 tok/s step 16603/19560 | loss 3.278350 (-0.90z)| norm 0.2435 (-1.26z)| lr 3.57e-05 | 4148.37 ms | 32.5% bf16 MFU | 126177 tok/s step 16604/19560 | loss 3.318925 (+0.19z)| norm 0.2664 (+0.78z)| lr 3.57e-05 | 4149.91 ms | 32.5% bf16 MFU | 126185 tok/s step 16605/19560 | loss 3.306760 (-0.14z)| norm 0.2620 (+0.38z)| lr 3.56e-05 | 4144.86 ms | 32.6% bf16 MFU | 126200 tok/s step 16606/19560 | loss 3.266052 (-1.22z)| norm 0.2684 (+0.95z)| lr 3.56e-05 | 4157.48 ms | 32.5% bf16 MFU | 126196 tok/s step 16607/19560 | loss 3.274544 (-1.00z)| norm 0.2444 (-1.22z)| lr 3.56e-05 | 4498.22 ms | 30.0% bf16 MFU | 125714 tok/s step 16608/19560 | loss 3.312480 (+0.03z)| norm 0.2791 (+1.88z)| lr 3.56e-05 | 4139.75 ms | 32.6% bf16 MFU | 125760 tok/s step 16609/19560 | loss 3.318602 (+0.21z)| norm 0.2712 (+1.15z)| lr 3.55e-05 | 4152.63 ms | 32.5% bf16 MFU | 125785 tok/s step 16610/19560 | loss 3.353873 (+1.19z)| norm 0.2797 (+1.87z)| lr 3.55e-05 | 4157.62 ms | 32.5% bf16 MFU | 125801 tok/s step 16611/19560 | loss 3.294300 (-0.45z)| norm 0.2435 (-1.31z)| lr 3.55e-05 | 4153.84 ms | 32.5% bf16 MFU | 125822 tok/s step 
16612/19560 | loss 3.256680 (-1.49z)| norm 0.2939 (+3.04z)| lr 3.55e-05 | 4155.20 ms | 32.5% bf16 MFU | 125840 tok/s step 16613/19560 | loss 3.320411 (+0.32z)| norm 0.2673 (+0.79z)| lr 3.55e-05 | 4157.03 ms | 32.5% bf16 MFU | 125854 tok/s step 16614/19560 | loss 3.309300 (+0.00z)| norm 0.2508 (-0.67z)| lr 3.54e-05 | 4154.72 ms | 32.5% bf16 MFU | 125870 tok/s step 16615/19560 | loss 3.293165 (-0.45z)| norm 0.2726 (+1.24z)| lr 3.54e-05 | 4152.07 ms | 32.5% bf16 MFU | 125891 tok/s step 16616/19560 | loss 3.300306 (-0.24z)| norm 0.2632 (+0.41z)| lr 3.54e-05 | 4157.43 ms | 32.5% bf16 MFU | 125901 tok/s step 16617/19560 | loss 3.308751 (-0.01z)| norm 0.2668 (+0.71z)| lr 3.54e-05 | 4147.19 ms | 32.6% bf16 MFU | 125927 tok/s step 16618/19560 | loss 3.347534 (+1.08z)| norm 0.2710 (+1.06z)| lr 3.53e-05 | 4150.10 ms | 32.5% bf16 MFU | 125948 tok/s step 16619/19560 | loss 3.267695 (-1.18z)| norm 0.2572 (-0.14z)| lr 3.53e-05 | 4153.86 ms | 32.5% bf16 MFU | 125961 tok/s step 16620/19560 | loss 3.254180 (-1.56z)| norm 0.2579 (-0.07z)| lr 3.53e-05 | 4153.54 ms | 32.5% bf16 MFU | 125974 tok/s step 16621/19560 | loss 3.347584 (+1.07z)| norm 0.2617 (+0.25z)| lr 3.53e-05 | 4165.03 ms | 32.4% bf16 MFU | 125970 tok/s step 16622/19560 | loss 3.271785 (-1.05z)| norm 0.2716 (+1.13z)| lr 3.52e-05 | 4158.31 ms | 32.5% bf16 MFU | 125975 tok/s step 16623/19560 | loss 3.343252 (+0.96z)| norm 0.2886 (+2.54z)| lr 3.52e-05 | 4147.77 ms | 32.6% bf16 MFU | 125997 tok/s step 16624/19560 | loss 3.302460 (-0.19z)| norm 0.2635 (+0.37z)| lr 3.52e-05 | 4159.24 ms | 32.5% bf16 MFU | 125999 tok/s step 16625/19560 | loss 3.357717 (+1.36z)| norm 0.2733 (+1.20z)| lr 3.52e-05 | 4164.35 ms | 32.4% bf16 MFU | 125994 tok/s step 16626/19560 | loss 3.277254 (-0.89z)| norm 0.2726 (+1.12z)| lr 3.51e-05 | 4151.40 ms | 32.5% bf16 MFU | 126009 tok/s step 16627/19560 | loss 3.361516 (+1.47z)| norm 0.2647 (+0.46z)| lr 3.51e-05 | 4156.35 ms | 32.5% bf16 MFU | 126016 tok/s step 16628/19560 | loss 3.367683 (+1.61z)| norm 
0.2738 (+1.22z)| lr 3.51e-05 | 4145.70 ms | 32.6% bf16 MFU | 126038 tok/s step 16629/19560 | loss 3.281184 (-0.79z)| norm 0.2620 (+0.22z)| lr 3.51e-05 | 4145.17 ms | 32.6% bf16 MFU | 126060 tok/s step 16630/19560 | loss 3.296650 (-0.34z)| norm 0.2663 (+0.57z)| lr 3.51e-05 | 4154.56 ms | 32.5% bf16 MFU | 126067 tok/s step 16631/19560 | loss 3.315774 (+0.19z)| norm 0.2535 (-0.50z)| lr 3.50e-05 | 4158.72 ms | 32.5% bf16 MFU | 126067 tok/s step 16632/19560 | loss 3.270725 (-1.06z)| norm 0.2481 (-0.95z)| lr 3.50e-05 | 4144.12 ms | 32.6% bf16 MFU | 126090 tok/s step 16633/19560 | loss 3.261362 (-1.30z)| norm 0.2724 (+1.09z)| lr 3.50e-05 | 4162.61 ms | 32.4% bf16 MFU | 126083 tok/s step 16634/19560 | loss 3.315271 (+0.19z)| norm 0.2677 (+0.67z)| lr 3.50e-05 | 4151.02 ms | 32.5% bf16 MFU | 126094 tok/s step 16635/19560 | loss 3.249203 (-1.62z)| norm 0.2556 (-0.36z)| lr 3.49e-05 | 4154.88 ms | 32.5% bf16 MFU | 126098 tok/s step 16636/19560 | loss 3.312917 (+0.15z)| norm 0.2584 (-0.12z)| lr 3.49e-05 | 4152.82 ms | 32.5% bf16 MFU | 126106 tok/s step 16637/19560 | loss 3.298706 (-0.24z)| norm 0.2564 (-0.29z)| lr 3.49e-05 | 4160.22 ms | 32.5% bf16 MFU | 126102 tok/s step 16638/19560 | loss 3.262021 (-1.24z)| norm 0.2606 (+0.06z)| lr 3.49e-05 | 4159.64 ms | 32.5% bf16 MFU | 126099 tok/s step 16639/19560 | loss 3.283092 (-0.65z)| norm 0.2645 (+0.39z)| lr 3.48e-05 | 4152.20 ms | 32.5% bf16 MFU | 126107 tok/s step 16640/19560 | loss 3.314538 (+0.22z)| norm 0.2623 (+0.21z)| lr 3.48e-05 | 4150.62 ms | 32.5% bf16 MFU | 126118 tok/s step 16641/19560 | loss 3.289996 (-0.46z)| norm 0.2523 (-0.68z)| lr 3.48e-05 | 4163.46 ms | 32.4% bf16 MFU | 126108 tok/s step 16642/19560 | loss 3.325546 (+0.53z)| norm 0.2609 (+0.07z)| lr 3.48e-05 | 4156.38 ms | 32.5% bf16 MFU | 126110 tok/s step 16643/19560 | loss 3.301018 (-0.16z)| norm 0.2643 (+0.36z)| lr 3.47e-05 | 4158.03 ms | 32.5% bf16 MFU | 126109 tok/s step 16644/19560 | loss 3.301271 (-0.15z)| norm 0.2792 (+1.66z)| lr 3.47e-05 | 4157.78 ms | 
32.5% bf16 MFU | 126108 tok/s step 16645/19560 | loss 3.272825 (-0.94z)| norm 0.2573 (-0.28z)| lr 3.47e-05 | 4156.57 ms | 32.5% bf16 MFU | 126110 tok/s step 16646/19560 | loss 3.293658 (-0.36z)| norm 0.2510 (-0.86z)| lr 3.47e-05 | 4155.42 ms | 32.5% bf16 MFU | 126113 tok/s step 16647/19560 | loss 3.306402 (-0.01z)| norm 0.2380 (-1.97z)| lr 3.47e-05 | 4157.16 ms | 32.5% bf16 MFU | 126113 tok/s step 16648/19560 | loss 3.298632 (-0.22z)| norm 0.2558 (-0.42z)| lr 3.46e-05 | 4148.77 ms | 32.5% bf16 MFU | 126126 tok/s step 16649/19560 | loss 3.287000 (-0.55z)| norm 0.2651 (+0.40z)| lr 3.46e-05 | 4156.71 ms | 32.5% bf16 MFU | 126126 tok/s step 16650/19560 | loss 3.303447 (-0.08z)| norm 0.2558 (-0.43z)| lr 3.46e-05 | 4151.12 ms | 32.5% bf16 MFU | 126135 tok/s step 16651/19560 | loss 3.289106 (-0.48z)| norm 0.2415 (-1.71z)| lr 3.46e-05 | 4151.96 ms | 32.5% bf16 MFU | 126142 tok/s step 16652/19560 | loss 3.275657 (-0.85z)| norm 0.2481 (-1.12z)| lr 3.45e-05 | 4152.46 ms | 32.5% bf16 MFU | 126148 tok/s step 16653/19560 | loss 3.322638 (+0.48z)| norm 0.2543 (-0.58z)| lr 3.45e-05 | 4152.36 ms | 32.5% bf16 MFU | 126153 tok/s step 16654/19560 | loss 3.289860 (-0.45z)| norm 0.2625 (+0.15z)| lr 3.45e-05 | 4161.17 ms | 32.4% bf16 MFU | 126145 tok/s step 16655/19560 | loss 3.268752 (-1.04z)| norm 0.2546 (-0.57z)| lr 3.45e-05 | 4157.08 ms | 32.5% bf16 MFU | 126144 tok/s step 16656/19560 | loss 3.311656 (+0.18z)| norm 0.2454 (-1.40z)| lr 3.44e-05 | 4147.78 ms | 32.6% bf16 MFU | 126157 tok/s step 16657/19560 | loss 3.280343 (-0.71z)| norm 0.2517 (-0.81z)| lr 3.44e-05 | 4154.75 ms | 32.5% bf16 MFU | 126159 tok/s step 16658/19560 | loss 3.336588 (+0.94z)| norm 0.2621 (+0.12z)| lr 3.44e-05 | 4163.96 ms | 32.4% bf16 MFU | 126146 tok/s step 16659/19560 | loss 3.301121 (-0.11z)| norm 0.2591 (-0.14z)| lr 3.44e-05 | 4151.97 ms | 32.5% bf16 MFU | 126153 tok/s step 16660/19560 | loss 3.218449 (-2.51z)| norm 0.2597 (-0.08z)| lr 3.44e-05 | 4166.12 ms | 32.4% bf16 MFU | 126137 tok/s step 16661/19560 
| loss 3.353762 (+1.46z)| norm 0.2625 (+0.16z)| lr 3.43e-05 | 4156.76 ms | 32.5% bf16 MFU | 126137 tok/s step 16662/19560 | loss 3.292797 (-0.33z)| norm 0.2465 (-1.28z)| lr 3.43e-05 | 4151.65 ms | 32.5% bf16 MFU | 126144 tok/s step 16663/19560 | loss 3.292357 (-0.36z)| norm 0.2433 (-1.55z)| lr 3.43e-05 | 4151.97 ms | 32.5% bf16 MFU | 126151 tok/s step 16664/19560 | loss 3.300371 (-0.13z)| norm 0.2532 (-0.65z)| lr 3.43e-05 | 4158.21 ms | 32.5% bf16 MFU | 126147 tok/s step 16665/19560 | loss 3.327233 (+0.67z)| norm 0.2465 (-1.25z)| lr 3.42e-05 | 4158.87 ms | 32.5% bf16 MFU | 126143 tok/s step 16666/19560 | loss 3.372324 (+1.96z)| norm 0.2594 (-0.08z)| lr 3.42e-05 | 4157.94 ms | 32.5% bf16 MFU | 126141 tok/s step 16667/19560 | loss 3.278475 (-0.78z)| norm 0.2428 (-1.55z)| lr 3.42e-05 | 4154.05 ms | 32.5% bf16 MFU | 126144 tok/s step 16668/19560 | loss 3.320802 (+0.45z)| norm 0.2481 (-1.07z)| lr 3.42e-05 | 4151.82 ms | 32.5% bf16 MFU | 126151 tok/s step 16669/19560 | loss 3.316908 (+0.33z)| norm 0.2581 (-0.17z)| lr 3.41e-05 | 4152.77 ms | 32.5% bf16 MFU | 126156 tok/s step 16670/19560 | loss 3.259618 (-1.36z)| norm 0.2557 (-0.38z)| lr 3.41e-05 | 4154.58 ms | 32.5% bf16 MFU | 126158 tok/s step 16671/19560 | loss 3.336148 (+0.90z)| norm 0.2641 (+0.37z)| lr 3.41e-05 | 4158.43 ms | 32.5% bf16 MFU | 126154 tok/s step 16672/19560 | loss 3.284750 (-0.61z)| norm 0.2501 (-0.87z)| lr 3.41e-05 | 4154.07 ms | 32.5% bf16 MFU | 126157 tok/s step 16673/19560 | loss 3.346136 (+1.18z)| norm 0.2648 (+0.44z)| lr 3.40e-05 | 4162.87 ms | 32.4% bf16 MFU | 126146 tok/s step 16674/19560 | loss 3.304516 (-0.04z)| norm 0.2503 (-0.85z)| lr 3.40e-05 | 4156.49 ms | 32.5% bf16 MFU | 126146 tok/s step 16675/19560 | loss 3.325567 (+0.58z)| norm 0.2508 (-0.80z)| lr 3.40e-05 | 4149.93 ms | 32.5% bf16 MFU | 126155 tok/s step 16676/19560 | loss 3.355824 (+1.44z)| norm 0.2470 (-1.12z)| lr 3.40e-05 | 4143.02 ms | 32.6% bf16 MFU | 126175 tok/s step 16677/19560 | loss 3.357985 (+1.48z)| norm 0.2543 (-0.46z)| 
lr 3.40e-05 | 4145.31 ms | 32.6% bf16 MFU | 126190 tok/s step 16678/19560 | loss 3.318872 (+0.33z)| norm 0.2584 (-0.11z)| lr 3.39e-05 | 4147.65 ms | 32.6% bf16 MFU | 126201 tok/s step 16679/19560 | loss 3.310979 (+0.09z)| norm 0.2631 (+0.30z)| lr 3.39e-05 | 4151.19 ms | 32.5% bf16 MFU | 126206 tok/s step 16680/19560 | loss 3.328100 (+0.59z)| norm 0.2528 (-0.61z)| lr 3.39e-05 | 4157.75 ms | 32.5% bf16 MFU | 126200 tok/s step 16681/19560 | loss 3.341319 (+0.96z)| norm 0.2465 (-1.17z)| lr 3.39e-05 | 4147.72 ms | 32.6% bf16 MFU | 126211 tok/s step 16682/19560 | loss 3.302660 (-0.19z)| norm 0.2405 (-1.68z)| lr 3.38e-05 | 4149.53 ms | 32.5% bf16 MFU | 126217 tok/s step 16683/19560 | loss 3.325943 (+0.49z)| norm 0.2553 (-0.37z)| lr 3.38e-05 | 4152.77 ms | 32.5% bf16 MFU | 126219 tok/s step 16684/19560 | loss 3.304156 (-0.16z)| norm 0.2487 (-0.94z)| lr 3.38e-05 | 4152.20 ms | 32.5% bf16 MFU | 126222 tok/s step 16685/19560 | loss 3.274092 (-1.08z)| norm 0.2424 (-1.49z)| lr 3.38e-05 | 4146.73 ms | 32.6% bf16 MFU | 126232 tok/s step 16686/19560 | loss 3.289027 (-0.62z)| norm 0.2567 (-0.22z)| lr 3.37e-05 | 4157.30 ms | 32.5% bf16 MFU | 126226 tok/s step 16687/19560 | loss 3.337152 (+0.83z)| norm 0.2538 (-0.46z)| lr 3.37e-05 | 4146.90 ms | 32.6% bf16 MFU | 126236 tok/s step 16688/19560 | loss 3.342029 (+0.98z)| norm 0.2541 (-0.44z)| lr 3.37e-05 | 4143.27 ms | 32.6% bf16 MFU | 126251 tok/s step 16689/19560 | loss 3.295120 (-0.44z)| norm 0.2502 (-0.81z)| lr 3.37e-05 | 4151.11 ms | 32.5% bf16 MFU | 126254 tok/s step 16690/19560 | loss 3.337379 (+0.84z)| norm 0.2404 (-1.73z)| lr 3.37e-05 | 4147.93 ms | 32.6% bf16 MFU | 126261 tok/s step 16691/19560 | loss 3.330460 (+0.62z)| norm 0.2505 (-0.75z)| lr 3.36e-05 | 4145.98 ms | 32.6% bf16 MFU | 126271 tok/s step 16692/19560 | loss 3.279547 (-0.95z)| norm 0.2561 (-0.23z)| lr 3.36e-05 | 4144.31 ms | 32.6% bf16 MFU | 126283 tok/s step 16693/19560 | loss 3.260662 (-1.51z)| norm 0.2428 (-1.47z)| lr 3.36e-05 | 4154.54 ms | 32.5% bf16 MFU | 
126278 tok/s step 16694/19560 | loss 3.309279 (-0.02z)| norm 0.2568 (-0.14z)| lr 3.36e-05 | 4142.81 ms | 32.6% bf16 MFU | 126292 tok/s step 16695/19560 | loss 3.343248 (+1.01z)| norm 0.2512 (-0.68z)| lr 3.35e-05 | 4152.96 ms | 32.5% bf16 MFU | 126290 tok/s step 16696/19560 | loss 3.255385 (-1.64z)| norm 0.2517 (-0.63z)| lr 3.35e-05 | 4174.65 ms | 32.3% bf16 MFU | 126255 tok/s step 16697/19560 | loss 3.340248 (+0.92z)| norm 0.2649 (+0.63z)| lr 3.35e-05 | 4177.06 ms | 32.3% bf16 MFU | 126218 tok/s step 16698/19560 | loss 3.303959 (-0.17z)| norm 0.2442 (-1.33z)| lr 3.35e-05 | 4164.18 ms | 32.4% bf16 MFU | 126202 tok/s step 16699/19560 | loss 3.319558 (+0.30z)| norm 0.2568 (-0.14z)| lr 3.35e-05 | 4159.81 ms | 32.5% bf16 MFU | 126194 tok/s step 16700/19560 | loss 3.255817 (-1.60z)| norm 0.2570 (-0.11z)| lr 3.34e-05 | 4159.25 ms | 32.5% bf16 MFU | 126187 tok/s step 16701/19560 | loss 3.282876 (-0.80z)| norm 0.2499 (-0.80z)| lr 3.34e-05 | 4154.76 ms | 32.5% bf16 MFU | 126187 tok/s step 16702/19560 | loss 3.338644 (+0.89z)| norm 0.2545 (-0.34z)| lr 3.34e-05 | 4159.30 ms | 32.5% bf16 MFU | 126180 tok/s step 16703/19560 | loss 3.320533 (+0.35z)| norm 0.2428 (-1.45z)| lr 3.34e-05 | 4156.62 ms | 32.5% bf16 MFU | 126178 tok/s step 16704/19560 | loss 3.228680 (-2.38z)| norm 0.2542 (-0.35z)| lr 3.33e-05 | 4145.11 ms | 32.6% bf16 MFU | 126193 tok/s step 16705/19560 | loss 3.235809 (-2.11z)| norm 0.2606 (+0.26z)| lr 3.33e-05 | 4151.59 ms | 32.5% bf16 MFU | 126198 tok/s step 16706/19560 | loss 3.290659 (-0.50z)| norm 0.2599 (+0.21z)| lr 3.33e-05 | 4169.81 ms | 32.4% bf16 MFU | 126175 tok/s step 16707/19560 | loss 3.328164 (+0.62z)| norm 0.2558 (-0.18z)| lr 3.33e-05 | 4164.96 ms | 32.4% bf16 MFU | 126160 tok/s step 16708/19560 | loss 3.313349 (+0.19z)| norm 0.2568 (-0.09z)| lr 3.32e-05 | 4154.62 ms | 32.5% bf16 MFU | 126162 tok/s step 16709/19560 | loss 3.308265 (+0.04z)| norm 0.2668 (+0.89z)| lr 3.32e-05 | 4153.38 ms | 32.5% bf16 MFU | 126165 tok/s step 16710/19560 | loss 3.350078 
(+1.27z)| norm 0.2643 (+0.63z)| lr 3.32e-05 | 4163.31 ms | 32.4% bf16 MFU | 126153 tok/s step 16711/19560 | loss 3.278025 (-0.88z)| norm 0.2532 (-0.47z)| lr 3.32e-05 | 4155.87 ms | 32.5% bf16 MFU | 126154 tok/s step 16712/19560 | loss 3.274267 (-0.97z)| norm 0.2556 (-0.22z)| lr 3.32e-05 | 4148.77 ms | 32.5% bf16 MFU | 126164 tok/s step 16713/19560 | loss 3.283850 (-0.68z)| norm 0.2511 (-0.67z)| lr 3.31e-05 | 4161.85 ms | 32.4% bf16 MFU | 126155 tok/s step 16714/19560 | loss 3.293084 (-0.39z)| norm 0.2555 (-0.23z)| lr 3.31e-05 | 4163.46 ms | 32.4% bf16 MFU | 126143 tok/s step 16715/19560 | loss 3.325867 (+0.65z)| norm 0.2644 (+0.65z)| lr 3.31e-05 | 4158.66 ms | 32.5% bf16 MFU | 126140 tok/s step 16716/19560 | loss 3.279474 (-0.80z)| norm 0.2423 (-1.54z)| lr 3.31e-05 | 4158.80 ms | 32.5% bf16 MFU | 126136 tok/s step 16717/19560 | loss 3.243316 (-1.89z)| norm 0.2636 (+0.60z)| lr 3.30e-05 | 4158.88 ms | 32.5% bf16 MFU | 126133 tok/s step 16718/19560 | loss 3.343526 (+1.20z)| norm 0.2503 (-0.74z)| lr 3.30e-05 | 4153.77 ms | 32.5% bf16 MFU | 126137 tok/s step 16719/19560 | loss 3.380201 (+2.27z)| norm 0.2495 (-0.80z)| lr 3.30e-05 | 4159.59 ms | 32.5% bf16 MFU | 126132 tok/s step 16720/19560 | loss 3.307191 (+0.05z)| norm 0.2504 (-0.71z)| lr 3.30e-05 | 4158.17 ms | 32.5% bf16 MFU | 126130 tok/s step 16721/19560 | loss 3.258916 (-1.39z)| norm 0.2634 (+0.61z)| lr 3.29e-05 | 4170.36 ms | 32.4% bf16 MFU | 126109 tok/s step 16722/19560 | loss 3.313774 (+0.30z)| norm 0.2541 (-0.34z)| lr 3.29e-05 | 4151.95 ms | 32.5% bf16 MFU | 126118 tok/s step 16723/19560 | loss 3.314772 (+0.33z)| norm 0.2457 (-1.17z)| lr 3.29e-05 | 4156.56 ms | 32.5% bf16 MFU | 126119 tok/s step 16724/19560 | loss 3.320108 (+0.49z)| norm 0.2504 (-0.69z)| lr 3.29e-05 | 4159.18 ms | 32.5% bf16 MFU | 126115 tok/s step 16725/19560 | loss 3.335682 (+0.97z)| norm 0.2730 (+1.57z)| lr 3.29e-05 | 4155.20 ms | 32.5% bf16 MFU | 126118 tok/s step 16726/19560 | loss 3.357572 (+1.65z)| norm 0.2811 (+2.32z)| lr 3.28e-05 | 
4157.42 ms | 32.5% bf16 MFU | 126118 tok/s step 16727/19560 | loss 3.256773 (-1.49z)| norm 0.2584 (+0.08z)| lr 3.28e-05 | 4158.04 ms | 32.5% bf16 MFU | 126117 tok/s step 16728/19560 | loss 3.212735 (-2.76z)| norm 0.2618 (+0.41z)| lr 3.28e-05 | 4153.55 ms | 32.5% bf16 MFU | 126122 tok/s step 16729/19560 | loss 3.354903 (+1.54z)| norm 0.2569 (-0.06z)| lr 3.28e-05 | 4161.03 ms | 32.4% bf16 MFU | 126116 tok/s step 16730/19560 | loss 3.333771 (+0.89z)| norm 0.2501 (-0.73z)| lr 3.27e-05 | 4179.13 ms | 32.3% bf16 MFU | 126083 tok/s step 16731/19560 | loss 3.319481 (+0.45z)| norm 0.2689 (+1.11z)| lr 3.27e-05 | 4161.46 ms | 32.4% bf16 MFU | 126078 tok/s step 16732/19560 | loss 3.315619 (+0.33z)| norm 0.2698 (+1.19z)| lr 3.27e-05 | 4132.35 ms | 32.7% bf16 MFU | 126118 tok/s step 16733/19560 | loss 3.289297 (-0.45z)| norm 0.2514 (-0.61z)| lr 3.27e-05 | 4135.18 ms | 32.7% bf16 MFU | 126151 tok/s step 16734/19560 | loss 3.317880 (+0.40z)| norm 0.2471 (-1.03z)| lr 3.27e-05 | 4131.91 ms | 32.7% bf16 MFU | 126188 tok/s step 16735/19560 | loss 3.266483 (-1.15z)| norm 0.2494 (-0.81z)| lr 3.26e-05 | 4144.70 ms | 32.6% bf16 MFU | 126204 tok/s step 16736/19560 | loss 3.307532 (+0.09z)| norm 0.2792 (+2.14z)| lr 3.26e-05 | 4140.86 ms | 32.6% bf16 MFU | 126224 tok/s step 16737/19560 | loss 3.301699 (-0.09z)| norm 0.2450 (-1.22z)| lr 3.26e-05 | 4142.13 ms | 32.6% bf16 MFU | 126242 tok/s step 16738/19560 | loss 3.283501 (-0.62z)| norm 0.2506 (-0.65z)| lr 3.26e-05 | 4141.18 ms | 32.6% bf16 MFU | 126260 tok/s step 16739/19560 | loss 3.273034 (-0.93z)| norm 0.2630 (+0.58z)| lr 3.25e-05 | 4147.80 ms | 32.6% bf16 MFU | 126267 tok/s step 16740/19560 | loss 3.350874 (+1.40z)| norm 0.2500 (-0.74z)| lr 3.25e-05 | 4157.83 ms | 32.5% bf16 MFU | 126258 tok/s step 16741/19560 | loss 3.269203 (-1.06z)| norm 0.2548 (-0.22z)| lr 3.25e-05 | 4140.15 ms | 32.6% bf16 MFU | 126277 tok/s step 16742/19560 | loss 3.262076 (-1.25z)| norm 0.2708 (+1.48z)| lr 3.25e-05 | 4142.31 ms | 32.6% bf16 MFU | 126292 tok/s step 
16743/19560 | loss 3.270450 (-0.99z)| norm 0.2692 (+1.31z)| lr 3.24e-05 | 4149.07 ms | 32.5% bf16 MFU | 126295 tok/s step 16744/19560 | loss 3.342771 (+1.16z)| norm 0.2666 (+1.03z)| lr 3.24e-05 | 4148.76 ms | 32.5% bf16 MFU | 126299 tok/s step 16745/19560 | loss 3.223656 (-2.32z)| norm 0.2497 (-0.77z)| lr 3.24e-05 | 4153.36 ms | 32.5% bf16 MFU | 126296 tok/s step 16746/19560 | loss 3.300722 (-0.07z)| norm 0.2729 (+1.72z)| lr 3.24e-05 | 4159.69 ms | 32.5% bf16 MFU | 126283 tok/s step 16747/19560 | loss 3.305920 (+0.08z)| norm 0.2565 (-0.04z)| lr 3.24e-05 | 4153.55 ms | 32.5% bf16 MFU | 126280 tok/s step 16748/19560 | loss 3.260175 (-1.27z)| norm 0.2528 (-0.43z)| lr 3.23e-05 | 4148.89 ms | 32.5% bf16 MFU | 126284 tok/s step 16749/19560 | loss 3.293873 (-0.27z)| norm 0.2494 (-0.78z)| lr 3.23e-05 | 4176.99 ms | 32.3% bf16 MFU | 126246 tok/s step 16750/19560 | loss 3.263864 (-1.15z)| norm 0.2664 (+1.04z)| lr 3.23e-05 | 4170.80 ms | 32.4% bf16 MFU | 126219 tok/s val loss 3.275333 evaluating HellaSwag: 0/1256 evaluating HellaSwag: 10/1256 evaluating HellaSwag: 20/1256 evaluating HellaSwag: 30/1256 evaluating HellaSwag: 40/1256 evaluating HellaSwag: 50/1256 evaluating HellaSwag: 60/1256 evaluating HellaSwag: 70/1256 evaluating HellaSwag: 80/1256 evaluating HellaSwag: 90/1256 evaluating HellaSwag: 100/1256 evaluating HellaSwag: 110/1256 evaluating HellaSwag: 120/1256 evaluating HellaSwag: 130/1256 evaluating HellaSwag: 140/1256 evaluating HellaSwag: 150/1256 evaluating HellaSwag: 160/1256 evaluating HellaSwag: 170/1256 evaluating HellaSwag: 180/1256 evaluating HellaSwag: 190/1256 evaluating HellaSwag: 200/1256 evaluating HellaSwag: 210/1256 evaluating HellaSwag: 220/1256 evaluating HellaSwag: 230/1256 evaluating HellaSwag: 240/1256 evaluating HellaSwag: 250/1256 evaluating HellaSwag: 260/1256 evaluating HellaSwag: 270/1256 evaluating HellaSwag: 280/1256 evaluating HellaSwag: 290/1256 evaluating HellaSwag: 300/1256 evaluating HellaSwag: 310/1256 evaluating HellaSwag: 
320/1256 evaluating HellaSwag: 330/1256 evaluating HellaSwag: 340/1256 evaluating HellaSwag: 350/1256 evaluating HellaSwag: 360/1256 evaluating HellaSwag: 370/1256 evaluating HellaSwag: 380/1256 evaluating HellaSwag: 390/1256 evaluating HellaSwag: 400/1256 evaluating HellaSwag: 410/1256 evaluating HellaSwag: 420/1256 evaluating HellaSwag: 430/1256 evaluating HellaSwag: 440/1256 evaluating HellaSwag: 450/1256 evaluating HellaSwag: 460/1256 evaluating HellaSwag: 470/1256 evaluating HellaSwag: 480/1256 evaluating HellaSwag: 490/1256 evaluating HellaSwag: 500/1256 evaluating HellaSwag: 510/1256 evaluating HellaSwag: 520/1256 evaluating HellaSwag: 530/1256 evaluating HellaSwag: 540/1256 evaluating HellaSwag: 550/1256 evaluating HellaSwag: 560/1256 evaluating HellaSwag: 570/1256 evaluating HellaSwag: 580/1256 evaluating HellaSwag: 590/1256 evaluating HellaSwag: 600/1256 evaluating HellaSwag: 610/1256 evaluating HellaSwag: 620/1256 evaluating HellaSwag: 630/1256 evaluating HellaSwag: 640/1256 evaluating HellaSwag: 650/1256 evaluating HellaSwag: 660/1256 evaluating HellaSwag: 670/1256 evaluating HellaSwag: 680/1256 evaluating HellaSwag: 690/1256 evaluating HellaSwag: 700/1256 evaluating HellaSwag: 710/1256 evaluating HellaSwag: 720/1256 evaluating HellaSwag: 730/1256 evaluating HellaSwag: 740/1256 evaluating HellaSwag: 750/1256 evaluating HellaSwag: 760/1256 evaluating HellaSwag: 770/1256 evaluating HellaSwag: 780/1256 evaluating HellaSwag: 790/1256 evaluating HellaSwag: 800/1256 evaluating HellaSwag: 810/1256 evaluating HellaSwag: 820/1256 evaluating HellaSwag: 830/1256 evaluating HellaSwag: 840/1256 evaluating HellaSwag: 850/1256 evaluating HellaSwag: 860/1256 evaluating HellaSwag: 870/1256 evaluating HellaSwag: 880/1256 evaluating HellaSwag: 890/1256 evaluating HellaSwag: 900/1256 evaluating HellaSwag: 910/1256 evaluating HellaSwag: 920/1256 evaluating HellaSwag: 930/1256 evaluating HellaSwag: 940/1256 evaluating HellaSwag: 950/1256 evaluating HellaSwag: 960/1256 
evaluating HellaSwag: 970/1256 evaluating HellaSwag: 980/1256 evaluating HellaSwag: 990/1256 evaluating HellaSwag: 1000/1256 evaluating HellaSwag: 1010/1256 evaluating HellaSwag: 1020/1256 evaluating HellaSwag: 1030/1256 evaluating HellaSwag: 1040/1256 evaluating HellaSwag: 1050/1256 evaluating HellaSwag: 1060/1256 evaluating HellaSwag: 1070/1256 evaluating HellaSwag: 1080/1256 evaluating HellaSwag: 1090/1256 evaluating HellaSwag: 1100/1256 evaluating HellaSwag: 1110/1256 evaluating HellaSwag: 1120/1256 evaluating HellaSwag: 1130/1256 evaluating HellaSwag: 1140/1256 evaluating HellaSwag: 1150/1256 evaluating HellaSwag: 1160/1256 evaluating HellaSwag: 1170/1256 evaluating HellaSwag: 1180/1256 evaluating HellaSwag: 1190/1256 evaluating HellaSwag: 1200/1256 evaluating HellaSwag: 1210/1256 evaluating HellaSwag: 1220/1256 evaluating HellaSwag: 1230/1256 evaluating HellaSwag: 1240/1256 evaluating HellaSwag: 1250/1256 HellaSwag: 3024/10042 = 0.301135 step 16751/19560 | loss 3.292487 (-0.30z)| norm 0.2633 (+0.77z)| lr 3.23e-05 | 4149.52 ms | 32.5% bf16 MFU | 126226 tok/s step 16752/19560 | loss 3.254159 (-1.42z)| norm 0.2455 (-1.22z)| lr 3.22e-05 | 4155.93 ms | 32.5% bf16 MFU | 126222 tok/s step 16753/19560 | loss 3.293063 (-0.25z)| norm 0.2518 (-0.49z)| lr 3.22e-05 | 4149.47 ms | 32.5% bf16 MFU | 126228 tok/s step 16754/19560 | loss 3.330609 (+0.86z)| norm 0.2586 (+0.29z)| lr 3.22e-05 | 4156.31 ms | 32.5% bf16 MFU | 126224 tok/s step 16755/19560 | loss 3.351881 (+1.50z)| norm 0.2552 (-0.09z)| lr 3.22e-05 | 4150.89 ms | 32.5% bf16 MFU | 126228 tok/s step 16756/19560 | loss 3.267315 (-1.02z)| norm 0.2478 (-0.94z)| lr 3.22e-05 | 4167.17 ms | 32.4% bf16 MFU | 126208 tok/s step 16757/19560 | loss 3.362439 (+1.82z)| norm 0.2569 (+0.13z)| lr 3.21e-05 | 4161.20 ms | 32.4% bf16 MFU | 126197 tok/s step 16758/19560 | loss 3.271521 (-0.90z)| norm 0.2679 (+1.42z)| lr 3.21e-05 | 4155.89 ms | 32.5% bf16 MFU | 126195 tok/s step 16759/19560 | loss 3.290182 (-0.33z)| norm 0.2548 (-0.12z)| 
lr 3.21e-05 | 4163.37 ms | 32.4% bf16 MFU | 126182 tok/s step 16760/19560 | loss 3.292035 (-0.28z)| norm 0.2477 (-0.95z)| lr 3.21e-05 | 4153.29 ms | 32.5% bf16 MFU | 126184 tok/s step 16761/19560 | loss 3.303346 (+0.05z)| norm 0.2437 (-1.40z)| lr 3.20e-05 | 4153.53 ms | 32.5% bf16 MFU | 126186 tok/s step 16762/19560 | loss 3.287380 (-0.43z)| norm 0.2497 (-0.68z)| lr 3.20e-05 | 4155.80 ms | 32.5% bf16 MFU | 126185 tok/s step 16763/19560 | loss 3.253162 (-1.46z)| norm 0.2574 (+0.24z)| lr 3.20e-05 | 4154.21 ms | 32.5% bf16 MFU | 126186 tok/s step 16764/19560 | loss 3.286016 (-0.46z)| norm 0.2449 (-1.24z)| lr 3.20e-05 | 4182.62 ms | 32.3% bf16 MFU | 126144 tok/s step 16765/19560 | loss 3.286880 (-0.44z)| norm 0.2655 (+1.19z)| lr 3.20e-05 | 4156.23 ms | 32.5% bf16 MFU | 126144 tok/s step 16766/19560 | loss 3.339562 (+1.13z)| norm 0.2581 (+0.32z)| lr 3.19e-05 | 4158.17 ms | 32.5% bf16 MFU | 126141 tok/s step 16767/19560 | loss 3.285484 (-0.50z)| norm 0.2542 (-0.13z)| lr 3.19e-05 | 4171.43 ms | 32.4% bf16 MFU | 126118 tok/s step 16768/19560 | loss 3.242999 (-1.74z)| norm 0.2523 (-0.35z)| lr 3.19e-05 | 4164.42 ms | 32.4% bf16 MFU | 126107 tok/s step 16769/19560 | loss 3.387269 (+2.48z)| norm 0.2728 (+2.03z)| lr 3.19e-05 | 4150.70 ms | 32.5% bf16 MFU | 126118 tok/s step 16770/19560 | loss 3.457342 (+4.18z)| norm 0.3027 (+4.94z)| lr 3.18e-05 | 4180.78 ms | 32.3% bf16 MFU | 126082 tok/s step 16771/19560 | loss 3.448123 (+3.68z)| norm 0.2742 (+1.91z)| lr 3.18e-05 | 4152.91 ms | 32.5% bf16 MFU | 126090 tok/s step 16772/19560 | loss 3.317054 (+0.32z)| norm 0.2651 (+1.00z)| lr 3.18e-05 | 4153.30 ms | 32.5% bf16 MFU | 126097 tok/s step 16773/19560 | loss 3.287746 (-0.43z)| norm 0.2713 (+1.62z)| lr 3.18e-05 | 4154.02 ms | 32.5% bf16 MFU | 126103 tok/s step 16774/19560 | loss 3.290085 (-0.37z)| norm 0.2692 (+1.38z)| lr 3.18e-05 | 4151.80 ms | 32.5% bf16 MFU | 126112 tok/s step 16775/19560 | loss 3.253536 (-1.29z)| norm 0.2527 (-0.35z)| lr 3.17e-05 | 4153.59 ms | 32.5% bf16 MFU | 
126118 tok/s step 16776/19560 | loss 3.304558 (+0.01z)| norm 0.2490 (-0.74z)| lr 3.17e-05 | 4152.95 ms | 32.5% bf16 MFU | 126124 tok/s step 16777/19560 | loss 3.259593 (-1.13z)| norm 0.2547 (-0.13z)| lr 3.17e-05 | 4155.78 ms | 32.5% bf16 MFU | 126126 tok/s step 16778/19560 | loss 3.270420 (-0.84z)| norm 0.2660 (+1.06z)| lr 3.17e-05 | 4148.62 ms | 32.5% bf16 MFU | 126138 tok/s step 16779/19560 | loss 3.322896 (+0.48z)| norm 0.2429 (-1.38z)| lr 3.16e-05 | 4156.30 ms | 32.5% bf16 MFU | 126138 tok/s step 16780/19560 | loss 3.350702 (+1.16z)| norm 0.2619 (+0.61z)| lr 3.16e-05 | 4155.61 ms | 32.5% bf16 MFU | 126140 tok/s step 16781/19560 | loss 3.241742 (-1.55z)| norm 0.2498 (-0.66z)| lr 3.16e-05 | 4149.18 ms | 32.5% bf16 MFU | 126151 tok/s step 16782/19560 | loss 3.322852 (+0.47z)| norm 0.2607 (+0.50z)| lr 3.16e-05 | 4152.27 ms | 32.5% bf16 MFU | 126156 tok/s step 16783/19560 | loss 3.294175 (-0.26z)| norm 0.2541 (-0.20z)| lr 3.16e-05 | 4170.36 ms | 32.4% bf16 MFU | 126135 tok/s step 16784/19560 | loss 3.271020 (-0.83z)| norm 0.2477 (-0.88z)| lr 3.15e-05 | 4161.09 ms | 32.4% bf16 MFU | 126128 tok/s step 16785/19560 | loss 3.293808 (-0.26z)| norm 0.2508 (-0.55z)| lr 3.15e-05 | 4151.90 ms | 32.5% bf16 MFU | 126135 tok/s step 16786/19560 | loss 3.329172 (+0.63z)| norm 0.2635 (+0.78z)| lr 3.15e-05 | 4160.48 ms | 32.5% bf16 MFU | 126129 tok/s step 16787/19560 | loss 3.313367 (+0.23z)| norm 0.2508 (-0.54z)| lr 3.15e-05 | 4153.41 ms | 32.5% bf16 MFU | 126134 tok/s step 16788/19560 | loss 3.343124 (+0.96z)| norm 0.2375 (-1.90z)| lr 3.14e-05 | 4150.53 ms | 32.5% bf16 MFU | 126144 tok/s step 16789/19560 | loss 3.215750 (-2.21z)| norm 0.2500 (-0.60z)| lr 3.14e-05 | 4159.93 ms | 32.5% bf16 MFU | 126138 tok/s step 16790/19560 | loss 3.291739 (-0.31z)| norm 0.2457 (-1.04z)| lr 3.14e-05 | 4143.67 ms | 32.6% bf16 MFU | 126157 tok/s step 16791/19560 | loss 3.347948 (+1.08z)| norm 0.2409 (-1.53z)| lr 3.14e-05 | 4157.62 ms | 32.5% bf16 MFU | 126155 tok/s step 16792/19560 | loss 3.319469 
(+0.37z)| norm 0.2660 (+1.05z)| lr 3.14e-05 | 4156.43 ms | 32.5% bf16 MFU | 126154 tok/s step 16793/19560 | loss 3.397910 (+2.27z)| norm 0.2439 (-1.22z)| lr 3.13e-05 | 4151.12 ms | 32.5% bf16 MFU | 126161 tok/s step 16794/19560 | loss 3.324926 (+0.50z)| norm 0.2437 (-1.22z)| lr 3.13e-05 | 4173.87 ms | 32.3% bf16 MFU | 126134 tok/s step 16795/19560 | loss 3.293487 (-0.28z)| norm 0.2434 (-1.25z)| lr 3.13e-05 | 4155.15 ms | 32.5% bf16 MFU | 126136 tok/s step 16796/19560 | loss 3.286122 (-0.46z)| norm 0.2710 (+1.55z)| lr 3.13e-05 | 4160.60 ms | 32.5% bf16 MFU | 126130 tok/s step 16797/19560 | loss 3.269772 (-0.85z)| norm 0.2506 (-0.53z)| lr 3.12e-05 | 4146.78 ms | 32.6% bf16 MFU | 126145 tok/s step 16798/19560 | loss 3.348636 (+1.08z)| norm 0.2426 (-1.32z)| lr 3.12e-05 | 4147.93 ms | 32.6% bf16 MFU | 126158 tok/s step 16799/19560 | loss 3.252628 (-1.27z)| norm 0.2486 (-0.71z)| lr 3.12e-05 | 4147.66 ms | 32.6% bf16 MFU | 126170 tok/s step 16800/19560 | loss 3.321683 (+0.42z)| norm 0.2507 (-0.50z)| lr 3.12e-05 | 4167.36 ms | 32.4% bf16 MFU | 126152 tok/s step 16801/19560 | loss 3.281140 (-0.57z)| norm 0.2523 (-0.32z)| lr 3.12e-05 | 4152.24 ms | 32.5% bf16 MFU | 126158 tok/s step 16802/19560 | loss 3.277487 (-0.65z)| norm 0.2501 (-0.55z)| lr 3.11e-05 | 4144.05 ms | 32.6% bf16 MFU | 126176 tok/s step 16803/19560 | loss 3.308301 (+0.11z)| norm 0.2516 (-0.39z)| lr 3.11e-05 | 4165.88 ms | 32.4% bf16 MFU | 126159 tok/s step 16804/19560 | loss 3.347049 (+1.07z)| norm 0.2643 (+0.88z)| lr 3.11e-05 | 4156.74 ms | 32.5% bf16 MFU | 126158 tok/s step 16805/19560 | loss 3.289281 (-0.35z)| norm 0.2449 (-1.08z)| lr 3.11e-05 | 4151.09 ms | 32.5% bf16 MFU | 126165 tok/s step 16806/19560 | loss 3.297212 (-0.14z)| norm 0.2371 (-1.82z)| lr 3.10e-05 | 4158.56 ms | 32.5% bf16 MFU | 126161 tok/s step 16807/19560 | loss 3.309660 (+0.16z)| norm 0.2534 (-0.19z)| lr 3.10e-05 | 4156.37 ms | 32.5% bf16 MFU | 126160 tok/s step 16808/19560 | loss 3.286933 (-0.39z)| norm 0.2505 (-0.47z)| lr 3.10e-05 | 
4150.78 ms | 32.5% bf16 MFU | 126167 tok/s step 16809/19560 | loss 3.256427 (-1.13z)| norm 0.2564 (+0.11z)| lr 3.10e-05 | 4164.30 ms | 32.4% bf16 MFU | 126154 tok/s step 16810/19560 | loss 3.271559 (-0.75z)| norm 0.2353 (-1.99z)| lr 3.10e-05 | 4152.59 ms | 32.5% bf16 MFU | 126159 tok/s step 16811/19560 | loss 3.285106 (-0.41z)| norm 0.2727 (+1.69z)| lr 3.09e-05 | 4161.27 ms | 32.4% bf16 MFU | 126151 tok/s step 16812/19560 | loss 3.286809 (-0.36z)| norm 0.2624 (+0.67z)| lr 3.09e-05 | 4149.61 ms | 32.5% bf16 MFU | 126160 tok/s step 16813/19560 | loss 3.300645 (-0.02z)| norm 0.2573 (+0.16z)| lr 3.09e-05 | 4168.14 ms | 32.4% bf16 MFU | 126142 tok/s step 16814/19560 | loss 3.327151 (+0.63z)| norm 0.2665 (+1.06z)| lr 3.09e-05 | 4156.76 ms | 32.5% bf16 MFU | 126141 tok/s step 16815/19560 | loss 3.305352 (+0.09z)| norm 0.2625 (+0.66z)| lr 3.08e-05 | 4161.18 ms | 32.4% bf16 MFU | 126134 tok/s step 16816/19560 | loss 3.301878 (+0.01z)| norm 0.2514 (-0.43z)| lr 3.08e-05 | 4162.43 ms | 32.4% bf16 MFU | 126125 tok/s step 16817/19560 | loss 3.284729 (-0.41z)| norm 0.2558 (-0.00z)| lr 3.08e-05 | 4156.92 ms | 32.5% bf16 MFU | 126125 tok/s step 16818/19560 | loss 3.317236 (+0.41z)| norm 0.2561 (+0.01z)| lr 3.08e-05 | 4151.29 ms | 32.5% bf16 MFU | 126133 tok/s step 16819/19560 | loss 3.306568 (+0.14z)| norm 0.2426 (-1.32z)| lr 3.08e-05 | 4153.20 ms | 32.5% bf16 MFU | 126138 tok/s step 16820/19560 | loss 3.304527 (+0.09z)| norm 0.2398 (-1.56z)| lr 3.07e-05 | 4151.41 ms | 32.5% bf16 MFU | 126146 tok/s step 16821/19560 | loss 3.357826 (+1.40z)| norm 0.2540 (-0.18z)| lr 3.07e-05 | 4153.37 ms | 32.5% bf16 MFU | 126150 tok/s step 16822/19560 | loss 3.257835 (-1.09z)| norm 0.2517 (-0.40z)| lr 3.07e-05 | 4150.31 ms | 32.5% bf16 MFU | 126159 tok/s step 16823/19560 | loss 3.250921 (-1.24z)| norm 0.2476 (-0.81z)| lr 3.07e-05 | 4143.39 ms | 32.6% bf16 MFU | 126178 tok/s step 16824/19560 | loss 3.300558 (-0.01z)| norm 0.2652 (+0.91z)| lr 3.06e-05 | 4152.32 ms | 32.5% bf16 MFU | 126182 tok/s step 
16825/19560 | loss 3.274044 (-0.66z)| norm 0.2432 (-1.23z)| lr 3.06e-05 | 4159.59 ms | 32.5% bf16 MFU | 126175 tok/s step 16826/19560 | loss 3.259337 (-1.02z)| norm 0.2473 (-0.82z)| lr 3.06e-05 | 4168.90 ms | 32.4% bf16 MFU | 126155 tok/s step 16827/19560 | loss 3.286466 (-0.34z)| norm 0.2634 (+0.74z)| lr 3.06e-05 | 4151.42 ms | 32.5% bf16 MFU | 126161 tok/s step 16828/19560 | loss 3.322203 (+0.54z)| norm 0.2471 (-0.84z)| lr 3.06e-05 | 4155.23 ms | 32.5% bf16 MFU | 126162 tok/s step 16829/19560 | loss 3.285436 (-0.38z)| norm 0.2535 (-0.22z)| lr 3.05e-05 | 4149.57 ms | 32.5% bf16 MFU | 126171 tok/s step 16830/19560 | loss 3.291084 (-0.23z)| norm 0.2542 (-0.15z)| lr 3.05e-05 | 4150.10 ms | 32.5% bf16 MFU | 126179 tok/s step 16831/19560 | loss 3.286977 (-0.32z)| norm 0.2428 (-1.26z)| lr 3.05e-05 | 4152.41 ms | 32.5% bf16 MFU | 126183 tok/s step 16832/19560 | loss 3.296030 (-0.11z)| norm 0.2601 (+0.42z)| lr 3.05e-05 | 4152.03 ms | 32.5% bf16 MFU | 126188 tok/s step 16833/19560 | loss 3.278127 (-0.58z)| norm 0.2534 (-0.22z)| lr 3.04e-05 | 4154.14 ms | 32.5% bf16 MFU | 126189 tok/s step 16834/19560 | loss 3.283239 (-0.45z)| norm 0.2581 (+0.23z)| lr 3.04e-05 | 4147.71 ms | 32.6% bf16 MFU | 126200 tok/s step 16835/19560 | loss 3.343884 (+1.10z)| norm 0.2428 (-1.24z)| lr 3.04e-05 | 4158.56 ms | 32.5% bf16 MFU | 126193 tok/s step 16836/19560 | loss 3.293997 (-0.17z)| norm 0.2543 (-0.12z)| lr 3.04e-05 | 4152.08 ms | 32.5% bf16 MFU | 126197 tok/s step 16837/19560 | loss 3.286437 (-0.36z)| norm 0.2576 (+0.21z)| lr 3.04e-05 | 4158.08 ms | 32.5% bf16 MFU | 126192 tok/s step 16838/19560 | loss 3.306318 (+0.16z)| norm 0.2544 (-0.11z)| lr 3.03e-05 | 4153.75 ms | 32.5% bf16 MFU | 126193 tok/s step 16839/19560 | loss 3.266045 (-0.87z)| norm 0.2392 (-1.56z)| lr 3.03e-05 | 4164.36 ms | 32.4% bf16 MFU | 126179 tok/s step 16840/19560 | loss 3.292058 (-0.21z)| norm 0.2643 (+0.86z)| lr 3.03e-05 | 4156.93 ms | 32.5% bf16 MFU | 126176 tok/s step 16841/19560 | loss 3.271183 (-0.74z)| norm 
0.2682 (+1.21z)| lr 3.03e-05 | 4151.58 ms | 32.5% bf16 MFU | 126181 tok/s step 16842/19560 | loss 3.295263 (-0.12z)| norm 0.2523 (-0.30z)| lr 3.02e-05 | 4161.68 ms | 32.4% bf16 MFU | 126171 tok/s step 16843/19560 | loss 3.326428 (+0.68z)| norm 0.2567 (+0.12z)| lr 3.02e-05 | 4152.91 ms | 32.5% bf16 MFU | 126175 tok/s step 16844/19560 | loss 3.270442 (-0.76z)| norm 0.2717 (+1.53z)| lr 3.02e-05 | 4147.10 ms | 32.6% bf16 MFU | 126187 tok/s step 16845/19560 | loss 3.330370 (+0.77z)| norm 0.2548 (-0.07z)| lr 3.02e-05 | 4151.32 ms | 32.5% bf16 MFU | 126193 tok/s step 16846/19560 | loss 3.285667 (-0.38z)| norm 0.2450 (-1.02z)| lr 3.02e-05 | 4146.23 ms | 32.6% bf16 MFU | 126206 tok/s step 16847/19560 | loss 3.294280 (-0.14z)| norm 0.2477 (-0.76z)| lr 3.01e-05 | 4149.79 ms | 32.5% bf16 MFU | 126212 tok/s step 16848/19560 | loss 3.304261 (+0.13z)| norm 0.2426 (-1.23z)| lr 3.01e-05 | 4156.93 ms | 32.5% bf16 MFU | 126208 tok/s step 16849/19560 | loss 3.333772 (+0.89z)| norm 0.2446 (-1.02z)| lr 3.01e-05 | 4152.96 ms | 32.5% bf16 MFU | 126210 tok/s step 16850/19560 | loss 3.289758 (-0.27z)| norm 0.2433 (-1.13z)| lr 3.01e-05 | 4154.30 ms | 32.5% bf16 MFU | 126209 tok/s step 16851/19560 | loss 3.278098 (-0.57z)| norm 0.2458 (-0.90z)| lr 3.01e-05 | 4154.51 ms | 32.5% bf16 MFU | 126209 tok/s step 16852/19560 | loss 3.253273 (-1.21z)| norm 0.2497 (-0.53z)| lr 3.00e-05 | 4159.52 ms | 32.5% bf16 MFU | 126201 tok/s step 16853/19560 | loss 3.294172 (-0.12z)| norm 0.2424 (-1.20z)| lr 3.00e-05 | 4153.99 ms | 32.5% bf16 MFU | 126201 tok/s step 16854/19560 | loss 3.292825 (-0.15z)| norm 0.2428 (-1.15z)| lr 3.00e-05 | 4158.07 ms | 32.5% bf16 MFU | 126196 tok/s step 16855/19560 | loss 3.236633 (-1.63z)| norm 0.2532 (-0.15z)| lr 3.00e-05 | 4151.23 ms | 32.5% bf16 MFU | 126201 tok/s step 16856/19560 | loss 3.325389 (+0.71z)| norm 0.2515 (-0.30z)| lr 2.99e-05 | 4149.97 ms | 32.5% bf16 MFU | 126207 tok/s step 16857/19560 | loss 3.270418 (-0.76z)| norm 0.2348 (-1.88z)| lr 2.99e-05 | 4155.53 ms | 
32.5% bf16 MFU | 126205 tok/s step 16858/19560 | loss 3.264541 (-0.91z)| norm 0.2406 (-1.31z)| lr 2.99e-05 | 4152.98 ms | 32.5% bf16 MFU | 126207 tok/s step 16859/19560 | loss 3.295655 (-0.05z)| norm 0.2596 (+0.51z)| lr 2.99e-05 | 4153.62 ms | 32.5% bf16 MFU | 126208 tok/s step 16860/19560 | loss 3.230421 (-1.79z)| norm 0.2409 (-1.27z)| lr 2.99e-05 | 4158.64 ms | 32.5% bf16 MFU | 126201 tok/s step 16861/19560 | loss 3.338951 (+1.12z)| norm 0.2583 (+0.40z)| lr 2.98e-05 | 4148.97 ms | 32.5% bf16 MFU | 126210 tok/s step 16862/19560 | loss 3.362013 (+1.71z)| norm 0.2743 (+1.90z)| lr 2.98e-05 | 4172.30 ms | 32.4% bf16 MFU | 126182 tok/s step 16863/19560 | loss 3.339074 (+1.08z)| norm 0.2718 (+1.63z)| lr 2.98e-05 | 4175.69 ms | 32.3% bf16 MFU | 126151 tok/s step 16864/19560 | loss 3.305858 (+0.20z)| norm 0.2651 (+1.03z)| lr 2.98e-05 | 4152.14 ms | 32.5% bf16 MFU | 126157 tok/s step 16865/19560 | loss 3.318793 (+0.54z)| norm 0.2499 (-0.44z)| lr 2.97e-05 | 4173.33 ms | 32.4% bf16 MFU | 126130 tok/s step 16866/19560 | loss 3.255182 (-1.13z)| norm 0.2615 (+0.67z)| lr 2.97e-05 | 4154.33 ms | 32.5% bf16 MFU | 126134 tok/s step 16867/19560 | loss 3.332466 (+0.89z)| norm 0.2488 (-0.53z)| lr 2.97e-05 | 4150.40 ms | 32.5% bf16 MFU | 126143 tok/s step 16868/19560 | loss 3.262652 (-0.93z)| norm 0.2482 (-0.59z)| lr 2.97e-05 | 4156.54 ms | 32.5% bf16 MFU | 126143 tok/s step 16869/19560 | loss 3.363029 (+1.68z)| norm 0.2524 (-0.19z)| lr 2.97e-05 | 4157.97 ms | 32.5% bf16 MFU | 126140 tok/s step 16870/19560 | loss 3.311960 (+0.34z)| norm 0.2493 (-0.48z)| lr 2.96e-05 | 4153.45 ms | 32.5% bf16 MFU | 126145 tok/s step 16871/19560 | loss 3.323361 (+0.63z)| norm 0.2559 (+0.18z)| lr 2.96e-05 | 4153.88 ms | 32.5% bf16 MFU | 126148 tok/s step 16872/19560 | loss 3.308597 (+0.25z)| norm 0.2626 (+0.83z)| lr 2.96e-05 | 4154.85 ms | 32.5% bf16 MFU | 126150 tok/s step 16873/19560 | loss 3.250812 (-1.29z)| norm 0.2472 (-0.66z)| lr 2.96e-05 | 4147.98 ms | 32.6% bf16 MFU | 126163 tok/s step 16874/19560 
| loss 3.327393 (+0.74z)| norm 0.2487 (-0.51z)| lr 2.96e-05 | 4159.61 ms | 32.5% bf16 MFU | 126157 tok/s step 16875/19560 | loss 3.258188 (-1.09z)| norm 0.2488 (-0.49z)| lr 2.95e-05 | 4167.86 ms | 32.4% bf16 MFU | 126138 tok/s step 16876/19560 | loss 3.303994 (+0.12z)| norm 0.2557 (+0.19z)| lr 2.95e-05 | 4152.97 ms | 32.5% bf16 MFU | 126144 tok/s step 16877/19560 | loss 3.338362 (+1.02z)| norm 0.2578 (+0.39z)| lr 2.95e-05 | 4153.55 ms | 32.5% bf16 MFU | 126148 tok/s step 16878/19560 | loss 3.312088 (+0.31z)| norm 0.2502 (-0.36z)| lr 2.95e-05 | 4161.79 ms | 32.4% bf16 MFU | 126139 tok/s step 16879/19560 | loss 3.289737 (-0.28z)| norm 0.2367 (-1.66z)| lr 2.94e-05 | 4192.47 ms | 32.2% bf16 MFU | 126085 tok/s step 16880/19560 | loss 3.295081 (-0.15z)| norm 0.2528 (-0.08z)| lr 2.94e-05 | 4151.18 ms | 32.5% bf16 MFU | 126096 tok/s step 16881/19560 | loss 3.319955 (+0.51z)| norm 0.2579 (+0.42z)| lr 2.94e-05 | 4160.69 ms | 32.5% bf16 MFU | 126091 tok/s step 16882/19560 | loss 3.322636 (+0.58z)| norm 0.2514 (-0.22z)| lr 2.94e-05 | 4151.38 ms | 32.5% bf16 MFU | 126102 tok/s step 16883/19560 | loss 3.277861 (-0.60z)| norm 0.2468 (-0.66z)| lr 2.94e-05 | 4154.76 ms | 32.5% bf16 MFU | 126106 tok/s step 16884/19560 | loss 3.214327 (-2.26z)| norm 0.2415 (-1.18z)| lr 2.93e-05 | 4155.38 ms | 32.5% bf16 MFU | 126109 tok/s step 16885/19560 | loss 3.382723 (+2.17z)| norm 0.2486 (-0.48z)| lr 2.93e-05 | 4155.81 ms | 32.5% bf16 MFU | 126112 tok/s step 16886/19560 | loss 3.303988 (+0.10z)| norm 0.2622 (+0.87z)| lr 2.93e-05 | 4147.17 ms | 32.6% bf16 MFU | 126127 tok/s step 16887/19560 | loss 3.308860 (+0.22z)| norm 0.2488 (-0.45z)| lr 2.93e-05 | 4316.24 ms | 31.3% bf16 MFU | 125894 tok/s step 16888/19560 | loss 3.308029 (+0.20z)| norm 0.2561 (+0.26z)| lr 2.92e-05 | 4167.24 ms | 32.4% bf16 MFU | 125890 tok/s step 16889/19560 | loss 3.269430 (-0.81z)| norm 0.2696 (+1.57z)| lr 2.92e-05 | 4165.00 ms | 32.4% bf16 MFU | 125890 tok/s step 16890/19560 | loss 3.290494 (-0.25z)| norm 0.2471 (-0.63z)| 
lr 2.92e-05 | 4259.04 ms | 31.7% bf16 MFU | 125750 tok/s step 16891/19560 | loss 3.282514 (-0.47z)| norm 0.2458 (-0.76z)| lr 2.92e-05 | 6825.69 ms | 19.8% bf16 MFU | 123303 tok/s step 16892/19560 | loss 3.256095 (-1.16z)| norm 0.2720 (+1.78z)| lr 2.92e-05 | 5343.25 ms | 25.3% bf16 MFU | 122044 tok/s step 16893/19560 | loss 3.284604 (-0.41z)| norm 0.2695 (+1.53z)| lr 2.91e-05 | 5920.02 ms | 22.8% bf16 MFU | 120370 tok/s step 16894/19560 | loss 3.283994 (-0.41z)| norm 0.2604 (+0.64z)| lr 2.91e-05 | 4200.43 ms | 32.1% bf16 MFU | 120592 tok/s step 16895/19560 | loss 3.228682 (-1.83z)| norm 0.2539 (+0.02z)| lr 2.91e-05 | 4259.22 ms | 31.7% bf16 MFU | 120717 tok/s step 16896/19560 | loss 3.278345 (-0.56z)| norm 0.2624 (+0.82z)| lr 2.91e-05 | 4245.89 ms | 31.8% bf16 MFU | 120856 tok/s step 16897/19560 | loss 3.309057 (+0.27z)| norm 0.2485 (-0.50z)| lr 2.91e-05 | 4277.64 ms | 31.6% bf16 MFU | 120941 tok/s step 16898/19560 | loss 3.312535 (+0.42z)| norm 0.2455 (-0.82z)| lr 2.90e-05 | 4150.11 ms | 32.5% bf16 MFU | 121211 tok/s step 16899/19560 | loss 3.227869 (-2.09z)| norm 0.2693 (+1.74z)| lr 2.90e-05 | 4137.99 ms | 32.6% bf16 MFU | 121485 tok/s step 16900/19560 | loss 3.364127 (+2.04z)| norm 0.2416 (-1.23z)| lr 2.90e-05 | 4178.96 ms | 32.3% bf16 MFU | 121684 tok/s step 16901/19560 | loss 3.335861 (+1.17z)| norm 0.2580 (+0.56z)| lr 2.90e-05 | 4157.13 ms | 32.5% bf16 MFU | 121905 tok/s step 16902/19560 | loss 3.287936 (-0.27z)| norm 0.2426 (-1.12z)| lr 2.89e-05 | 4271.23 ms | 31.6% bf16 MFU | 121948 tok/s step 16903/19560 | loss 3.303507 (+0.19z)| norm 0.2467 (-0.65z)| lr 2.89e-05 | 4150.76 ms | 32.5% bf16 MFU | 122166 tok/s step 16904/19560 | loss 3.319721 (+0.67z)| norm 0.2441 (-0.93z)| lr 2.89e-05 | 4230.43 ms | 31.9% bf16 MFU | 122254 tok/s step 16905/19560 | loss 3.245613 (-1.55z)| norm 0.2509 (-0.18z)| lr 2.89e-05 | 4165.27 ms | 32.4% bf16 MFU | 122435 tok/s step 16906/19560 | loss 3.292547 (-0.15z)| norm 0.3257 (+6.57z)| lr 2.89e-05 | 4152.58 ms | 32.5% bf16 MFU | 
122626 tok/s step 16907/19560 | loss 3.267925 (-0.87z)| norm 0.2523 (-0.07z)| lr 2.88e-05 | 4170.05 ms | 32.4% bf16 MFU | 122781 tok/s step 16908/19560 | loss 3.260681 (-1.08z)| norm 0.2450 (-0.73z)| lr 2.88e-05 | 4155.95 ms | 32.5% bf16 MFU | 122950 tok/s step 16909/19560 | loss 3.221242 (-2.24z)| norm 0.2352 (-1.58z)| lr 2.88e-05 | 4168.14 ms | 32.4% bf16 MFU | 123091 tok/s step 16910/19560 | loss 3.318989 (+0.69z)| norm 0.2423 (-0.93z)| lr 2.88e-05 | 4175.89 ms | 32.3% bf16 MFU | 123214 tok/s step 16911/19560 | loss 3.279025 (-0.51z)| norm 0.2528 (+0.01z)| lr 2.88e-05 | 4165.65 ms | 32.4% bf16 MFU | 123347 tok/s step 16912/19560 | loss 3.261472 (-1.03z)| norm 0.2718 (+1.68z)| lr 2.87e-05 | 4173.54 ms | 32.4% bf16 MFU | 123460 tok/s step 16913/19560 | loss 3.288779 (-0.21z)| norm 0.2591 (+0.54z)| lr 2.87e-05 | 4195.03 ms | 32.2% bf16 MFU | 123536 tok/s step 16914/19560 | loss 3.271139 (-0.73z)| norm 0.2501 (-0.24z)| lr 2.87e-05 | 4166.21 ms | 32.4% bf16 MFU | 123652 tok/s step 16915/19560 | loss 3.229622 (-1.92z)| norm 0.2444 (-0.75z)| lr 2.87e-05 | 4161.64 ms | 32.4% bf16 MFU | 123768 tok/s step 16916/19560 | loss 3.301705 (+0.22z)| norm 0.2579 (+0.44z)| lr 2.86e-05 | 4165.22 ms | 32.4% bf16 MFU | 123873 tok/s step 16917/19560 | loss 3.267077 (-0.84z)| norm 0.2461 (-0.61z)| lr 2.86e-05 | 4360.65 ms | 31.0% bf16 MFU | 123691 tok/s step 16918/19560 | loss 3.283239 (-0.35z)| norm 0.2509 (-0.18z)| lr 2.86e-05 | 4162.50 ms | 32.4% bf16 MFU | 123804 tok/s step 16919/19560 | loss 3.303061 (+0.27z)| norm 0.2573 (+0.38z)| lr 2.86e-05 | 4178.83 ms | 32.3% bf16 MFU | 123887 tok/s step 16920/19560 | loss 3.225523 (-2.06z)| norm 0.2430 (-0.89z)| lr 2.86e-05 | 4184.29 ms | 32.3% bf16 MFU | 123958 tok/s step 16921/19560 | loss 3.284135 (-0.27z)| norm 0.2480 (-0.44z)| lr 2.85e-05 | 4167.90 ms | 32.4% bf16 MFU | 124050 tok/s step 16922/19560 | loss 3.290699 (-0.06z)| norm 0.2451 (-0.70z)| lr 2.85e-05 | 4174.00 ms | 32.3% bf16 MFU | 124128 tok/s step 16923/19560 | loss 3.290760 
(-0.05z)| norm 0.2466 (-0.57z)| lr 2.85e-05 | 4166.60 ms | 32.4% bf16 MFU | 124213 tok/s step 16924/19560 | loss 3.271900 (-0.65z)| norm 0.2476 (-0.47z)| lr 2.85e-05 | 4171.72 ms | 32.4% bf16 MFU | 124286 tok/s step 16925/19560 | loss 3.288917 (-0.11z)| norm 0.2440 (-0.80z)| lr 2.85e-05 | 4163.15 ms | 32.4% bf16 MFU | 124368 tok/s step 16926/19560 | loss 3.285094 (-0.22z)| norm 0.2720 (+1.72z)| lr 2.84e-05 | 4184.64 ms | 32.3% bf16 MFU | 124414 tok/s step 16927/19560 | loss 3.361140 (+2.16z)| norm 0.2476 (-0.49z)| lr 2.84e-05 | 4159.66 ms | 32.5% bf16 MFU | 124496 tok/s step 16928/19560 | loss 3.340124 (+1.48z)| norm 0.2637 (+0.95z)| lr 2.84e-05 | 4179.31 ms | 32.3% bf16 MFU | 124543 tok/s step 16929/19560 | loss 3.299013 (+0.18z)| norm 0.2478 (-0.47z)| lr 2.84e-05 | 4164.74 ms | 32.4% bf16 MFU | 124611 tok/s step 16930/19560 | loss 3.310791 (+0.55z)| norm 0.2577 (+0.42z)| lr 2.84e-05 | 4179.70 ms | 32.3% bf16 MFU | 124652 tok/s step 16931/19560 | loss 3.300395 (+0.22z)| norm 0.2510 (-0.18z)| lr 2.83e-05 | 4266.96 ms | 31.6% bf16 MFU | 124563 tok/s step 16932/19560 | loss 3.322268 (+0.92z)| norm 0.2399 (-1.17z)| lr 2.83e-05 | 4165.86 ms | 32.4% bf16 MFU | 124627 tok/s step 16933/19560 | loss 3.269543 (-0.74z)| norm 0.2449 (-0.72z)| lr 2.83e-05 | 4174.89 ms | 32.3% bf16 MFU | 124675 tok/s step 16934/19560 | loss 3.261697 (-0.98z)| norm 0.2495 (-0.32z)| lr 2.83e-05 | 4159.07 ms | 32.5% bf16 MFU | 124744 tok/s step 16935/19560 | loss 3.198921 (-2.84z)| norm 0.2574 (+0.39z)| lr 2.82e-05 | 4171.29 ms | 32.4% bf16 MFU | 124792 tok/s step 16936/19560 | loss 3.270255 (-0.66z)| norm 0.2430 (-0.89z)| lr 2.82e-05 | 4174.44 ms | 32.3% bf16 MFU | 124832 tok/s step 16937/19560 | loss 3.333530 (+1.25z)| norm 0.2700 (+1.51z)| lr 2.82e-05 | 4168.92 ms | 32.4% bf16 MFU | 124878 tok/s step 16938/19560 | loss 3.401961 (+3.18z)| norm 0.2623 (+0.81z)| lr 2.82e-05 | 4155.81 ms | 32.5% bf16 MFU | 124942 tok/s step 16939/19560 | loss 3.282533 (-0.32z)| norm 0.2466 (-0.58z)| lr 2.82e-05 | 
4163.87 ms | 32.4% bf16 MFU | 124991 tok/s step 16940/19560 | loss 3.241941 (-1.48z)| norm 0.2498 (-0.29z)| lr 2.81e-05 | 4160.26 ms | 32.5% bf16 MFU | 125042 tok/s step 16941/19560 | loss 3.309315 (+0.47z)| norm 0.2437 (-0.83z)| lr 2.81e-05 | 4177.42 ms | 32.3% bf16 MFU | 125066 tok/s step 16942/19560 | loss 3.322387 (+0.85z)| norm 0.2579 (+0.47z)| lr 2.81e-05 | 4173.26 ms | 32.4% bf16 MFU | 125094 tok/s step 16943/19560 | loss 3.409148 (+3.21z)| norm 0.2757 (+2.05z)| lr 2.81e-05 | 4160.11 ms | 32.5% bf16 MFU | 125140 tok/s step 16944/19560 | loss 3.298472 (+0.13z)| norm 0.2488 (-0.37z)| lr 2.81e-05 | 4177.26 ms | 32.3% bf16 MFU | 125159 tok/s step 16945/19560 | loss 3.318412 (+0.68z)| norm 0.2597 (+0.61z)| lr 2.80e-05 | 4167.24 ms | 32.4% bf16 MFU | 125192 tok/s step 16946/19560 | loss 3.311745 (+0.49z)| norm 0.2507 (-0.19z)| lr 2.80e-05 | 4167.08 ms | 32.4% bf16 MFU | 125223 tok/s step 16947/19560 | loss 3.271529 (-0.62z)| norm 0.2450 (-0.71z)| lr 2.80e-05 | 4169.61 ms | 32.4% bf16 MFU | 125249 tok/s step 16948/19560 | loss 3.282861 (-0.30z)| norm 0.2523 (-0.06z)| lr 2.80e-05 | 4173.50 ms | 32.4% bf16 MFU | 125267 tok/s step 16949/19560 | loss 3.216237 (-2.11z)| norm 0.3025 (+4.14z)| lr 2.80e-05 | 4165.23 ms | 32.4% bf16 MFU | 125298 tok/s step 16950/19560 | loss 3.279095 (-0.38z)| norm 0.2387 (-1.22z)| lr 2.79e-05 | 4178.20 ms | 32.3% bf16 MFU | 125307 tok/s step 16951/19560 | loss 3.258620 (-0.95z)| norm 0.2471 (-0.52z)| lr 2.79e-05 | 4167.10 ms | 32.4% bf16 MFU | 125332 tok/s step 16952/19560 | loss 3.279138 (-0.37z)| norm 0.2434 (-0.82z)| lr 2.79e-05 | 4180.42 ms | 32.3% bf16 MFU | 125336 tok/s step 16953/19560 | loss 3.304660 (+0.33z)| norm 0.2479 (-0.44z)| lr 2.79e-05 | 4171.96 ms | 32.4% bf16 MFU | 125353 tok/s step 16954/19560 | loss 3.330271 (+1.02z)| norm 0.2457 (-0.62z)| lr 2.78e-05 | 4187.02 ms | 32.2% bf16 MFU | 125346 tok/s step 16955/19560 | loss 3.273380 (-0.55z)| norm 0.2380 (-1.25z)| lr 2.78e-05 | 4168.42 ms | 32.4% bf16 MFU | 125368 tok/s step 
16956/19560 | loss 3.275228 (-0.49z)| norm 0.2443 (-0.72z)| lr 2.78e-05 | 4162.47 ms | 32.4% bf16 MFU | 125397 tok/s step 16957/19560 | loss 3.278575 (-0.40z)| norm 0.2425 (-0.86z)| lr 2.78e-05 | 4190.16 ms | 32.2% bf16 MFU | 125384 tok/s step 16958/19560 | loss 3.392542 (+2.67z)| norm 0.2608 (+0.66z)| lr 2.78e-05 | 4156.50 ms | 32.5% bf16 MFU | 125421 tok/s step 16959/19560 | loss 3.422239 (+3.29z)| norm 0.2521 (-0.07z)| lr 2.77e-05 | 4162.99 ms | 32.4% bf16 MFU | 125447 tok/s step 16960/19560 | loss 3.314008 (+0.49z)| norm 0.2453 (-0.62z)| lr 2.77e-05 | 4178.53 ms | 32.3% bf16 MFU | 125448 tok/s step 16961/19560 | loss 3.262108 (-0.84z)| norm 0.2534 (+0.05z)| lr 2.77e-05 | 4174.97 ms | 32.3% bf16 MFU | 125455 tok/s step 16962/19560 | loss 3.344416 (+1.26z)| norm 0.2549 (+0.17z)| lr 2.77e-05 | 4167.29 ms | 32.4% bf16 MFU | 125473 tok/s step 16963/19560 | loss 3.347460 (+1.34z)| norm 0.2514 (-0.13z)| lr 2.77e-05 | 4170.70 ms | 32.4% bf16 MFU | 125484 tok/s step 16964/19560 | loss 3.304767 (+0.24z)| norm 0.2496 (-0.27z)| lr 2.76e-05 | 4165.06 ms | 32.4% bf16 MFU | 125504 tok/s step 16965/19560 | loss 3.302491 (+0.18z)| norm 0.2656 (+1.06z)| lr 2.76e-05 | 4171.99 ms | 32.4% bf16 MFU | 125512 tok/s step 16966/19560 | loss 3.308568 (+0.34z)| norm 0.2433 (-0.79z)| lr 2.76e-05 | 4158.21 ms | 32.5% bf16 MFU | 125541 tok/s step 16967/19560 | loss 3.304028 (+0.21z)| norm 0.2527 (-0.02z)| lr 2.76e-05 | 4179.90 ms | 32.3% bf16 MFU | 125535 tok/s step 16968/19560 | loss 3.356482 (+1.53z)| norm 0.2450 (-0.65z)| lr 2.76e-05 | 4185.40 ms | 32.3% bf16 MFU | 125522 tok/s step 16969/19560 | loss 3.322261 (+0.65z)| norm 0.2388 (-1.15z)| lr 2.75e-05 | 4172.71 ms | 32.4% bf16 MFU | 125528 tok/s step 16970/19560 | loss 3.357893 (+1.53z)| norm 0.2571 (+0.38z)| lr 2.75e-05 | 4179.83 ms | 32.3% bf16 MFU | 125523 tok/s step 16971/19560 | loss 3.292183 (-0.12z)| norm 0.2499 (-0.21z)| lr 2.75e-05 | 4164.33 ms | 32.4% bf16 MFU | 125542 tok/s step 16972/19560 | loss 3.301280 (+0.11z)| norm 
0.2472 (-0.43z)| lr 2.75e-05 | 4158.29 ms | 32.5% bf16 MFU | 125569 tok/s step 16973/19560 | loss 3.397142 (+2.46z)| norm 0.2612 (+0.75z)| lr 2.74e-05 | 4162.46 ms | 32.4% bf16 MFU | 125589 tok/s step 16974/19560 | loss 3.282996 (-0.36z)| norm 0.2443 (-0.68z)| lr 2.74e-05 | 4176.80 ms | 32.3% bf16 MFU | 125585 tok/s step 16975/19560 | loss 3.369200 (+1.73z)| norm 0.2868 (+2.80z)| lr 2.74e-05 | 4161.75 ms | 32.4% bf16 MFU | 125605 tok/s step 16976/19560 | loss 3.261363 (-0.89z)| norm 0.2462 (-0.53z)| lr 2.74e-05 | 4168.66 ms | 32.4% bf16 MFU | 125613 tok/s step 16977/19560 | loss 3.276420 (-0.51z)| norm 0.2513 (-0.12z)| lr 2.74e-05 | 4175.67 ms | 32.3% bf16 MFU | 125610 tok/s step 16978/19560 | loss 3.288946 (-0.21z)| norm 0.2425 (-0.84z)| lr 2.73e-05 | 4180.34 ms | 32.3% bf16 MFU | 125601 tok/s step 16979/19560 | loss 3.312211 (+0.36z)| norm 0.2629 (+0.83z)| lr 2.73e-05 | 4172.82 ms | 32.4% bf16 MFU | 125603 tok/s step 16980/19560 | loss 3.245752 (-1.26z)| norm 0.2481 (-0.39z)| lr 2.73e-05 | 4158.11 ms | 32.5% bf16 MFU | 125627 tok/s step 16981/19560 | loss 3.294279 (-0.08z)| norm 0.2511 (-0.15z)| lr 2.73e-05 | 4171.72 ms | 32.4% bf16 MFU | 125630 tok/s step 16982/19560 | loss 3.295442 (-0.05z)| norm 0.2535 (+0.04z)| lr 2.73e-05 | 4162.85 ms | 32.4% bf16 MFU | 125645 tok/s step 16983/19560 | loss 3.310878 (+0.31z)| norm 0.2461 (-0.57z)| lr 2.72e-05 | 4166.96 ms | 32.4% bf16 MFU | 125654 tok/s step 16984/19560 | loss 3.220958 (-1.85z)| norm 0.2393 (-1.11z)| lr 2.72e-05 | 4176.01 ms | 32.3% bf16 MFU | 125649 tok/s step 16985/19560 | loss 3.291162 (-0.15z)| norm 0.2452 (-0.64z)| lr 2.72e-05 | 4168.30 ms | 32.4% bf16 MFU | 125655 tok/s step 16986/19560 | loss 3.326168 (+0.68z)| norm 0.2393 (-1.12z)| lr 2.72e-05 | 4160.17 ms | 32.5% bf16 MFU | 125674 tok/s step 16987/19560 | loss 3.370385 (+1.72z)| norm 0.2482 (-0.38z)| lr 2.72e-05 | 4190.91 ms | 32.2% bf16 MFU | 125645 tok/s step 16988/19560 | loss 3.325299 (+0.63z)| norm 0.2561 (+0.26z)| lr 2.71e-05 | 4169.75 ms | 
32.4% bf16 MFU | 125650 tok/s step 16989/19560 | loss 3.221950 (-1.83z)| norm 0.2376 (-1.25z)| lr 2.71e-05 | 4172.47 ms | 32.4% bf16 MFU | 125650 tok/s step 16990/19560 | loss 3.305861 (+0.19z)| norm 0.2477 (-0.41z)| lr 2.71e-05 | 4164.73 ms | 32.4% bf16 MFU | 125662 tok/s step 16991/19560 | loss 3.326707 (+0.70z)| norm 0.2469 (-0.46z)| lr 2.71e-05 | 4168.64 ms | 32.4% bf16 MFU | 125667 tok/s step 16992/19560 | loss 3.290172 (-0.18z)| norm 0.2471 (-0.44z)| lr 2.71e-05 | 4164.27 ms | 32.4% bf16 MFU | 125679 tok/s step 16993/19560 | loss 3.278632 (-0.46z)| norm 0.2606 (+0.70z)| lr 2.70e-05 | 4166.98 ms | 32.4% bf16 MFU | 125686 tok/s step 16994/19560 | loss 3.291230 (-0.16z)| norm 0.2516 (-0.06z)| lr 2.70e-05 | 4177.09 ms | 32.3% bf16 MFU | 125677 tok/s step 16995/19560 | loss 3.262674 (-0.84z)| norm 0.2367 (-1.30z)| lr 2.70e-05 | 4165.01 ms | 32.4% bf16 MFU | 125688 tok/s step 16996/19560 | loss 3.266421 (-0.75z)| norm 0.2602 (+0.66z)| lr 2.70e-05 | 4161.70 ms | 32.4% bf16 MFU | 125702 tok/s step 16997/19560 | loss 3.329685 (+0.80z)| norm 0.2448 (-0.62z)| lr 2.69e-05 | 4158.09 ms | 32.5% bf16 MFU | 125721 tok/s step 16998/19560 | loss 3.279968 (-0.41z)| norm 0.2473 (-0.41z)| lr 2.69e-05 | 4179.72 ms | 32.3% bf16 MFU | 125707 tok/s step 16999/19560 | loss 3.276480 (-0.49z)| norm 0.2488 (-0.28z)| lr 2.69e-05 | 4162.83 ms | 32.4% bf16 MFU | 125719 tok/s step 17000/19560 | loss 3.233912 (-1.51z)| norm 0.2517 (-0.03z)| lr 2.69e-05 | 4169.03 ms | 32.4% bf16 MFU | 125721 tok/s val loss 3.273283 evaluating HellaSwag: 0/1256 evaluating HellaSwag: 10/1256 evaluating HellaSwag: 20/1256 evaluating HellaSwag: 30/1256 evaluating HellaSwag: 40/1256 evaluating HellaSwag: 50/1256 evaluating HellaSwag: 60/1256 evaluating HellaSwag: 70/1256 evaluating HellaSwag: 80/1256 evaluating HellaSwag: 90/1256 evaluating HellaSwag: 100/1256 evaluating HellaSwag: 110/1256 evaluating HellaSwag: 120/1256 evaluating HellaSwag: 130/1256 evaluating HellaSwag: 140/1256 evaluating HellaSwag: 150/1256 
evaluating HellaSwag: 160/1256 evaluating HellaSwag: 170/1256 evaluating HellaSwag: 180/1256 evaluating HellaSwag: 190/1256 evaluating HellaSwag: 200/1256 evaluating HellaSwag: 210/1256 evaluating HellaSwag: 220/1256 evaluating HellaSwag: 230/1256 evaluating HellaSwag: 240/1256 evaluating HellaSwag: 250/1256 evaluating HellaSwag: 260/1256 evaluating HellaSwag: 270/1256 evaluating HellaSwag: 280/1256 evaluating HellaSwag: 290/1256 evaluating HellaSwag: 300/1256 evaluating HellaSwag: 310/1256 evaluating HellaSwag: 320/1256 evaluating HellaSwag: 330/1256 evaluating HellaSwag: 340/1256 evaluating HellaSwag: 350/1256 evaluating HellaSwag: 360/1256 evaluating HellaSwag: 370/1256 evaluating HellaSwag: 380/1256 evaluating HellaSwag: 390/1256 evaluating HellaSwag: 400/1256 evaluating HellaSwag: 410/1256 evaluating HellaSwag: 420/1256 evaluating HellaSwag: 430/1256 evaluating HellaSwag: 440/1256 evaluating HellaSwag: 450/1256 evaluating HellaSwag: 460/1256 evaluating HellaSwag: 470/1256 evaluating HellaSwag: 480/1256 evaluating HellaSwag: 490/1256 evaluating HellaSwag: 500/1256 evaluating HellaSwag: 510/1256 evaluating HellaSwag: 520/1256 evaluating HellaSwag: 530/1256 evaluating HellaSwag: 540/1256 evaluating HellaSwag: 550/1256 evaluating HellaSwag: 560/1256 evaluating HellaSwag: 570/1256 evaluating HellaSwag: 580/1256 evaluating HellaSwag: 590/1256 evaluating HellaSwag: 600/1256 evaluating HellaSwag: 610/1256 evaluating HellaSwag: 620/1256 evaluating HellaSwag: 630/1256 evaluating HellaSwag: 640/1256 evaluating HellaSwag: 650/1256 evaluating HellaSwag: 660/1256 evaluating HellaSwag: 670/1256 evaluating HellaSwag: 680/1256 evaluating HellaSwag: 690/1256 evaluating HellaSwag: 700/1256 evaluating HellaSwag: 710/1256 evaluating HellaSwag: 720/1256 evaluating HellaSwag: 730/1256 evaluating HellaSwag: 740/1256 evaluating HellaSwag: 750/1256 evaluating HellaSwag: 760/1256 evaluating HellaSwag: 770/1256 evaluating HellaSwag: 780/1256 evaluating HellaSwag: 790/1256 evaluating 
HellaSwag: 800/1256 evaluating HellaSwag: 810/1256 evaluating HellaSwag: 820/1256 evaluating HellaSwag: 830/1256 evaluating HellaSwag: 840/1256 evaluating HellaSwag: 850/1256 evaluating HellaSwag: 860/1256 evaluating HellaSwag: 870/1256 evaluating HellaSwag: 880/1256 evaluating HellaSwag: 890/1256 evaluating HellaSwag: 900/1256 evaluating HellaSwag: 910/1256 evaluating HellaSwag: 920/1256 evaluating HellaSwag: 930/1256 evaluating HellaSwag: 940/1256 evaluating HellaSwag: 950/1256 evaluating HellaSwag: 960/1256 evaluating HellaSwag: 970/1256 evaluating HellaSwag: 980/1256 evaluating HellaSwag: 990/1256 evaluating HellaSwag: 1000/1256 evaluating HellaSwag: 1010/1256 evaluating HellaSwag: 1020/1256 evaluating HellaSwag: 1030/1256 evaluating HellaSwag: 1040/1256 evaluating HellaSwag: 1050/1256 evaluating HellaSwag: 1060/1256 evaluating HellaSwag: 1070/1256 evaluating HellaSwag: 1080/1256 evaluating HellaSwag: 1090/1256 evaluating HellaSwag: 1100/1256 evaluating HellaSwag: 1110/1256 evaluating HellaSwag: 1120/1256 evaluating HellaSwag: 1130/1256 evaluating HellaSwag: 1140/1256 evaluating HellaSwag: 1150/1256 evaluating HellaSwag: 1160/1256 evaluating HellaSwag: 1170/1256 evaluating HellaSwag: 1180/1256 evaluating HellaSwag: 1190/1256 evaluating HellaSwag: 1200/1256 evaluating HellaSwag: 1210/1256 evaluating HellaSwag: 1220/1256 evaluating HellaSwag: 1230/1256 evaluating HellaSwag: 1240/1256 evaluating HellaSwag: 1250/1256 HellaSwag: 3022/10042 = 0.300936 step 17001/19560 | loss 3.267400 (-0.70z)| norm 0.2664 (+1.19z)| lr 2.69e-05 | 4163.90 ms | 32.4% bf16 MFU | 125731 tok/s step 17002/19560 | loss 3.284534 (-0.27z)| norm 0.2599 (+0.64z)| lr 2.68e-05 | 4176.51 ms | 32.3% bf16 MFU | 125721 tok/s step 17003/19560 | loss 3.300093 (+0.10z)| norm 0.2412 (-0.92z)| lr 2.68e-05 | 4159.59 ms | 32.5% bf16 MFU | 125737 tok/s step 17004/19560 | loss 3.264312 (-0.77z)| norm 0.2421 (-0.83z)| lr 2.68e-05 | 4157.49 ms | 32.5% bf16 MFU | 125755 tok/s step 17005/19560 | loss 3.329462 
(+0.83z)| norm 0.2683 (+1.33z)| lr 2.68e-05 | 4168.54 ms | 32.4% bf16 MFU | 125756 tok/s step 17006/19560 | loss 3.260626 (-0.84z)| norm 0.2619 (+0.79z)| lr 2.68e-05 | 4169.82 ms | 32.4% bf16 MFU | 125755 tok/s step 17007/19560 | loss 3.252494 (-1.03z)| norm 0.2464 (-0.50z)| lr 2.67e-05 | 4160.53 ms | 32.5% bf16 MFU | 125768 tok/s step 17008/19560 | loss 3.393069 (+2.33z)| norm 0.2588 (+0.53z)| lr 2.67e-05 | 4159.27 ms | 32.5% bf16 MFU | 125782 tok/s step 17009/19560 | loss 3.279715 (-0.37z)| norm 0.2645 (+1.00z)| lr 2.67e-05 | 4190.33 ms | 32.2% bf16 MFU | 125749 tok/s step 17010/19560 | loss 3.294436 (-0.01z)| norm 0.2603 (+0.64z)| lr 2.67e-05 | 4170.50 ms | 32.4% bf16 MFU | 125747 tok/s step 17011/19560 | loss 3.311381 (+0.38z)| norm 0.2652 (+1.03z)| lr 2.67e-05 | 4163.27 ms | 32.4% bf16 MFU | 125757 tok/s step 17012/19560 | loss 3.441286 (+3.35z)| norm 0.2620 (+0.75z)| lr 2.66e-05 | 4168.27 ms | 32.4% bf16 MFU | 125758 tok/s step 17013/19560 | loss 3.234092 (-1.44z)| norm 0.2641 (+0.91z)| lr 2.66e-05 | 4225.48 ms | 32.0% bf16 MFU | 125674 tok/s step 17014/19560 | loss 3.317591 (+0.50z)| norm 0.2518 (-0.09z)| lr 2.66e-05 | 4177.77 ms | 32.3% bf16 MFU | 125665 tok/s step 17015/19560 | loss 3.298101 (+0.05z)| norm 0.2442 (-0.71z)| lr 2.66e-05 | 4164.34 ms | 32.4% bf16 MFU | 125676 tok/s step 17016/19560 | loss 3.315162 (+0.45z)| norm 0.2658 (+1.05z)| lr 2.66e-05 | 4162.20 ms | 32.4% bf16 MFU | 125691 tok/s step 17017/19560 | loss 3.291150 (-0.12z)| norm 0.2513 (-0.12z)| lr 2.65e-05 | 4173.97 ms | 32.3% bf16 MFU | 125687 tok/s step 17018/19560 | loss 3.245114 (-1.18z)| norm 0.2472 (-0.46z)| lr 2.65e-05 | 4172.47 ms | 32.4% bf16 MFU | 125685 tok/s step 17019/19560 | loss 3.274144 (-0.50z)| norm 0.2563 (+0.29z)| lr 2.65e-05 | 4169.17 ms | 32.4% bf16 MFU | 125689 tok/s step 17020/19560 | loss 3.290399 (-0.13z)| norm 0.2463 (-0.53z)| lr 2.65e-05 | 4161.90 ms | 32.4% bf16 MFU | 125703 tok/s step 17021/19560 | loss 3.278245 (-0.41z)| norm 0.2493 (-0.27z)| lr 2.65e-05 | 
4161.83 ms | 32.4% bf16 MFU | 125716 tok/s step 17022/19560 | loss 3.291708 (-0.10z)| norm 0.2538 (+0.12z)| lr 2.64e-05 | 4176.95 ms | 32.3% bf16 MFU | 125707 tok/s step 17023/19560 | loss 3.322727 (+0.61z)| norm 0.2408 (-0.96z)| lr 2.64e-05 | 4165.15 ms | 32.4% bf16 MFU | 125715 tok/s step 17024/19560 | loss 3.263543 (-0.78z)| norm 0.2523 (+0.00z)| lr 2.64e-05 | 4164.17 ms | 32.4% bf16 MFU | 125724 tok/s step 17025/19560 | loss 3.278272 (-0.43z)| norm 0.2519 (-0.03z)| lr 2.64e-05 | 4178.59 ms | 32.3% bf16 MFU | 125712 tok/s step 17026/19560 | loss 3.289777 (-0.15z)| norm 0.2465 (-0.48z)| lr 2.64e-05 | 4171.75 ms | 32.4% bf16 MFU | 125710 tok/s step 17027/19560 | loss 3.310566 (+0.32z)| norm 0.2459 (-0.53z)| lr 2.63e-05 | 4166.52 ms | 32.4% bf16 MFU | 125716 tok/s step 17028/19560 | loss 3.374355 (+1.83z)| norm 0.2587 (+0.54z)| lr 2.63e-05 | 4172.35 ms | 32.4% bf16 MFU | 125713 tok/s step 17029/19560 | loss 3.263578 (-0.78z)| norm 0.2391 (-1.09z)| lr 2.63e-05 | 4168.28 ms | 32.4% bf16 MFU | 125717 tok/s step 17030/19560 | loss 3.285702 (-0.25z)| norm 0.2418 (-0.87z)| lr 2.63e-05 | 4163.23 ms | 32.4% bf16 MFU | 125727 tok/s step 17031/19560 | loss 3.286246 (-0.24z)| norm 0.2384 (-1.14z)| lr 2.62e-05 | 4178.65 ms | 32.3% bf16 MFU | 125714 tok/s step 17032/19560 | loss 3.265355 (-0.72z)| norm 0.2396 (-1.04z)| lr 2.62e-05 | 4174.27 ms | 32.3% bf16 MFU | 125709 tok/s step 17033/19560 | loss 3.299949 (+0.09z)| norm 0.2460 (-0.50z)| lr 2.62e-05 | 4163.78 ms | 32.4% bf16 MFU | 125719 tok/s step 17034/19560 | loss 3.286080 (-0.24z)| norm 0.2411 (-1.01z)| lr 2.62e-05 | 4158.85 ms | 32.5% bf16 MFU | 125736 tok/s step 17035/19560 | loss 3.304235 (+0.19z)| norm 0.2427 (-0.84z)| lr 2.62e-05 | 4165.12 ms | 32.4% bf16 MFU | 125743 tok/s step 17036/19560 | loss 3.307855 (+0.26z)| norm 0.2446 (-0.66z)| lr 2.61e-05 | 4177.85 ms | 32.3% bf16 MFU | 125731 tok/s step 17037/19560 | loss 3.265473 (-0.76z)| norm 0.2400 (-1.12z)| lr 2.61e-05 | 4159.17 ms | 32.5% bf16 MFU | 125747 tok/s step 
17038/19560 | loss 3.255647 (-0.99z)| norm 0.2505 (-0.08z)| lr 2.61e-05 | 4162.87 ms | 32.4% bf16 MFU | 125757 tok/s step 17039/19560 | loss 3.321110 (+0.58z)| norm 0.2439 (-0.73z)| lr 2.61e-05 | 4169.29 ms | 32.4% bf16 MFU | 125757 tok/s step 17040/19560 | loss 3.420038 (+2.84z)| norm 0.2657 (+1.46z)| lr 2.61e-05 | 4163.98 ms | 32.4% bf16 MFU | 125764 tok/s step 17041/19560 | loss 3.266766 (-0.73z)| norm 0.2524 (+0.12z)| lr 2.60e-05 | 4173.75 ms | 32.3% bf16 MFU | 125757 tok/s step 17042/19560 | loss 3.277437 (-0.48z)| norm 0.2463 (-0.48z)| lr 2.60e-05 | 4174.54 ms | 32.3% bf16 MFU | 125749 tok/s step 17043/19560 | loss 3.288086 (-0.25z)| norm 0.2473 (-0.39z)| lr 2.60e-05 | 4168.39 ms | 32.4% bf16 MFU | 125750 tok/s step 17044/19560 | loss 3.228402 (-1.62z)| norm 0.2516 (+0.05z)| lr 2.60e-05 | 4157.79 ms | 32.5% bf16 MFU | 125767 tok/s step 17045/19560 | loss 3.291687 (-0.15z)| norm 0.2551 (+0.40z)| lr 2.60e-05 | 4158.38 ms | 32.5% bf16 MFU | 125783 tok/s step 17046/19560 | loss 3.307861 (+0.22z)| norm 0.2631 (+1.19z)| lr 2.59e-05 | 4159.78 ms | 32.5% bf16 MFU | 125796 tok/s step 17047/19560 | loss 3.272007 (-0.61z)| norm 0.2568 (+0.55z)| lr 2.59e-05 | 4163.25 ms | 32.4% bf16 MFU | 125803 tok/s step 17048/19560 | loss 3.290989 (-0.18z)| norm 0.2576 (+0.63z)| lr 2.59e-05 | 4180.11 ms | 32.3% bf16 MFU | 125784 tok/s step 17049/19560 | loss 3.306074 (+0.17z)| norm 0.2582 (+0.68z)| lr 2.59e-05 | 4166.75 ms | 32.4% bf16 MFU | 125786 tok/s step 17050/19560 | loss 3.286638 (-0.29z)| norm 0.2636 (+1.20z)| lr 2.59e-05 | 4159.96 ms | 32.5% bf16 MFU | 125798 tok/s step 17051/19560 | loss 3.331141 (+0.75z)| norm 0.2649 (+1.31z)| lr 2.58e-05 | 4184.93 ms | 32.3% bf16 MFU | 125772 tok/s step 17052/19560 | loss 3.251734 (-1.11z)| norm 0.2422 (-0.94z)| lr 2.58e-05 | 4162.40 ms | 32.4% bf16 MFU | 125781 tok/s step 17053/19560 | loss 3.282539 (-0.38z)| norm 0.2555 (+0.37z)| lr 2.58e-05 | 4164.85 ms | 32.4% bf16 MFU | 125787 tok/s step 17054/19560 | loss 3.335714 (+0.85z)| norm 
0.2485 (-0.32z)| lr 2.58e-05 | 4171.24 ms | 32.4% bf16 MFU | 125782 tok/s step 17055/19560 | loss 3.277868 (-0.49z)| norm 0.2416 (-1.00z)| lr 2.58e-05 | 4170.54 ms | 32.4% bf16 MFU | 125778 tok/s step 17056/19560 | loss 3.296294 (-0.05z)| norm 0.2506 (-0.09z)| lr 2.57e-05 | 4169.64 ms | 32.4% bf16 MFU | 125776 tok/s step 17057/19560 | loss 3.342230 (+1.03z)| norm 0.2577 (+0.62z)| lr 2.57e-05 | 4176.67 ms | 32.3% bf16 MFU | 125764 tok/s step 17058/19560 | loss 3.260561 (-0.88z)| norm 0.2616 (+1.01z)| lr 2.57e-05 | 4180.27 ms | 32.3% bf16 MFU | 125747 tok/s step 17059/19560 | loss 3.304249 (+0.14z)| norm 0.2563 (+0.47z)| lr 2.57e-05 | 4172.41 ms | 32.4% bf16 MFU | 125742 tok/s step 17060/19560 | loss 3.291568 (-0.15z)| norm 0.2561 (+0.44z)| lr 2.57e-05 | 4164.24 ms | 32.4% bf16 MFU | 125750 tok/s step 17061/19560 | loss 3.310505 (+0.29z)| norm 0.2506 (-0.12z)| lr 2.56e-05 | 4155.59 ms | 32.5% bf16 MFU | 125771 tok/s step 17062/19560 | loss 3.261499 (-0.87z)| norm 0.2468 (-0.50z)| lr 2.56e-05 | 4172.19 ms | 32.4% bf16 MFU | 125765 tok/s step 17063/19560 | loss 3.306397 (+0.17z)| norm 0.2421 (-0.96z)| lr 2.56e-05 | 4162.27 ms | 32.4% bf16 MFU | 125775 tok/s step 17064/19560 | loss 3.308159 (+0.21z)| norm 0.2442 (-0.76z)| lr 2.56e-05 | 4160.16 ms | 32.5% bf16 MFU | 125788 tok/s step 17065/19560 | loss 3.236199 (-1.50z)| norm 0.2459 (-0.57z)| lr 2.56e-05 | 4161.47 ms | 32.4% bf16 MFU | 125798 tok/s step 17066/19560 | loss 3.302575 (+0.11z)| norm 0.2509 (-0.04z)| lr 2.55e-05 | 4176.50 ms | 32.3% bf16 MFU | 125785 tok/s step 17067/19560 | loss 3.238232 (-1.45z)| norm 0.2454 (-0.61z)| lr 2.55e-05 | 4165.50 ms | 32.4% bf16 MFU | 125789 tok/s step 17068/19560 | loss 3.337458 (+0.96z)| norm 0.2413 (-1.02z)| lr 2.55e-05 | 4163.31 ms | 32.4% bf16 MFU | 125796 tok/s step 17069/19560 | loss 3.320541 (+0.54z)| norm 0.2455 (-0.59z)| lr 2.55e-05 | 4157.07 ms | 32.5% bf16 MFU | 125812 tok/s step 17070/19560 | loss 3.319927 (+0.53z)| norm 0.2427 (-0.87z)| lr 2.55e-05 | 4176.88 ms | 
32.3% bf16 MFU | 125797 tok/s step 17071/19560 | loss 3.415848 (+2.86z)| norm 0.2452 (-0.60z)| lr 2.54e-05 | 4166.18 ms | 32.4% bf16 MFU | 125800 tok/s step 17072/19560 | loss 3.283097 (-0.37z)| norm 0.2683 (+1.78z)| lr 2.54e-05 | 4180.29 ms | 32.3% bf16 MFU | 125781 tok/s step 17073/19560 | loss 3.259630 (-0.93z)| norm 0.2478 (-0.33z)| lr 2.54e-05 | 4163.69 ms | 32.4% bf16 MFU | 125787 tok/s step 17074/19560 | loss 3.258585 (-0.94z)| norm 0.2576 (+0.67z)| lr 2.54e-05 | 7850.19 ms | 17.2% bf16 MFU | 122837 tok/s step 17075/19560 | loss 3.304386 (+0.16z)| norm 0.2415 (-0.99z)| lr 2.54e-05 | 6184.62 ms | 21.8% bf16 MFU | 120934 tok/s step 17076/19560 | loss 3.296074 (-0.04z)| norm 0.2461 (-0.51z)| lr 2.53e-05 | 4313.87 ms | 31.3% bf16 MFU | 120964 tok/s step 17077/19560 | loss 3.337967 (+0.96z)| norm 0.2376 (-1.50z)| lr 2.53e-05 | 4394.38 ms | 30.7% bf16 MFU | 120882 tok/s step 17078/19560 | loss 3.311550 (+0.31z)| norm 0.2489 (-0.19z)| lr 2.53e-05 | 4365.61 ms | 30.9% bf16 MFU | 120842 tok/s step 17079/19560 | loss 3.267529 (-0.78z)| norm 0.2672 (+1.90z)| lr 2.53e-05 | 6707.49 ms | 20.1% bf16 MFU | 118708 tok/s step 17080/19560 | loss 3.356148 (+1.38z)| norm 0.2726 (+2.45z)| lr 2.53e-05 | 4328.19 ms | 31.2% bf16 MFU | 118830 tok/s step 17081/19560 | loss 3.291255 (-0.20z)| norm 0.2446 (-0.71z)| lr 2.52e-05 | 4276.47 ms | 31.6% bf16 MFU | 119018 tok/s step 17082/19560 | loss 3.248829 (-1.22z)| norm 0.2583 (+0.82z)| lr 2.52e-05 | 4234.75 ms | 31.9% bf16 MFU | 119257 tok/s step 17083/19560 | loss 3.273677 (-0.62z)| norm 0.2448 (-0.71z)| lr 2.52e-05 | 4156.55 ms | 32.5% bf16 MFU | 119601 tok/s step 17084/19560 | loss 3.275241 (-0.58z)| norm 0.2464 (-0.54z)| lr 2.52e-05 | 4196.33 ms | 32.2% bf16 MFU | 119868 tok/s step 17085/19560 | loss 3.263462 (-0.86z)| norm 0.2558 (+0.53z)| lr 2.52e-05 | 4138.54 ms | 32.6% bf16 MFU | 120209 tok/s step 17086/19560 | loss 3.314722 (+0.41z)| norm 0.2537 (+0.29z)| lr 2.51e-05 | 4152.84 ms | 32.5% bf16 MFU | 120511 tok/s step 17087/19560 
| loss 3.269784 (-0.70z)| norm 0.2498 (-0.15z)| lr 2.51e-05 | 4342.03 ms | 31.1% bf16 MFU | 120523 tok/s step 17088/19560 | loss 3.320216 (+0.59z)| norm 0.2816 (+3.30z)| lr 2.51e-05 | 4148.33 ms | 32.5% bf16 MFU | 120816 tok/s step 17089/19560 | loss 3.298926 (+0.04z)| norm 0.2585 (+0.77z)| lr 2.51e-05 | 4147.89 ms | 32.6% bf16 MFU | 121095 tok/s step 17090/19560 | loss 3.303640 (+0.17z)| norm 0.2604 (+0.97z)| lr 2.51e-05 | 4152.79 ms | 32.5% bf16 MFU | 121353 tok/s step 17091/19560 | loss 3.283929 (-0.33z)| norm 0.2424 (-0.98z)| lr 2.50e-05 | 4149.70 ms | 32.5% bf16 MFU | 121602 tok/s step 17092/19560 | loss 3.265026 (-0.81z)| norm 0.2489 (-0.27z)| lr 2.50e-05 | 4158.80 ms | 32.5% bf16 MFU | 121826 tok/s step 17093/19560 | loss 3.311703 (+0.40z)| norm 0.2631 (+1.28z)| lr 2.50e-05 | 4145.96 ms | 32.6% bf16 MFU | 122057 tok/s step 17094/19560 | loss 3.276295 (-0.51z)| norm 0.2500 (-0.15z)| lr 2.50e-05 | 4164.85 ms | 32.4% bf16 MFU | 122248 tok/s step 17095/19560 | loss 3.220299 (-1.92z)| norm 0.2448 (-0.72z)| lr 2.50e-05 | 4161.54 ms | 32.4% bf16 MFU | 122435 tok/s step 17096/19560 | loss 3.297689 (+0.07z)| norm 0.2552 (+0.41z)| lr 2.49e-05 | 4167.43 ms | 32.4% bf16 MFU | 122604 tok/s step 17097/19560 | loss 3.242682 (-1.33z)| norm 0.2453 (-0.68z)| lr 2.49e-05 | 4159.04 ms | 32.5% bf16 MFU | 122777 tok/s step 17098/19560 | loss 3.321876 (+0.72z)| norm 0.2520 (+0.06z)| lr 2.49e-05 | 4162.47 ms | 32.4% bf16 MFU | 122936 tok/s step 17099/19560 | loss 3.285306 (-0.23z)| norm 0.2397 (-1.28z)| lr 2.49e-05 | 4150.23 ms | 32.5% bf16 MFU | 123105 tok/s step 17100/19560 | loss 3.352809 (+1.50z)| norm 0.2538 (+0.25z)| lr 2.49e-05 | 4319.32 ms | 31.3% bf16 MFU | 123019 tok/s step 17101/19560 | loss 3.229537 (-1.66z)| norm 0.2470 (-0.47z)| lr 2.48e-05 | 4193.00 ms | 32.2% bf16 MFU | 123120 tok/s step 17102/19560 | loss 3.312020 (+0.49z)| norm 0.2459 (-0.60z)| lr 2.48e-05 | 4272.66 ms | 31.6% bf16 MFU | 123099 tok/s step 17103/19560 | loss 3.284921 (-0.21z)| norm 0.2489 (-0.24z)| 
lr 2.48e-05 | 4234.59 ms | 31.9% bf16 MFU | 123135 tok/s step 17104/19560 | loss 3.310546 (+0.46z)| norm 0.2460 (-0.59z)| lr 2.48e-05 | 4144.38 ms | 32.6% bf16 MFU | 123303 tok/s step 17105/19560 | loss 3.303968 (+0.28z)| norm 0.2363 (-1.68z)| lr 2.48e-05 | 4171.60 ms | 32.4% bf16 MFU | 123422 tok/s step 17106/19560 | loss 3.208715 (-2.19z)| norm 0.2407 (-1.18z)| lr 2.47e-05 | 4215.56 ms | 32.0% bf16 MFU | 123470 tok/s step 17107/19560 | loss 3.252751 (-1.03z)| norm 0.2441 (-0.77z)| lr 2.47e-05 | 4151.97 ms | 32.5% bf16 MFU | 123610 tok/s step 17108/19560 | loss 3.254188 (-0.99z)| norm 0.2513 (+0.06z)| lr 2.47e-05 | 4244.50 ms | 31.8% bf16 MFU | 123606 tok/s step 17109/19560 | loss 3.334949 (+1.10z)| norm 0.2651 (+1.63z)| lr 2.47e-05 | 4143.89 ms | 32.6% bf16 MFU | 123751 tok/s step 17110/19560 | loss 3.283514 (-0.23z)| norm 0.2475 (-0.38z)| lr 2.47e-05 | 4144.25 ms | 32.6% bf16 MFU | 123889 tok/s step 17111/19560 | loss 3.253503 (-1.00z)| norm 0.2474 (-0.40z)| lr 2.46e-05 | 4160.25 ms | 32.5% bf16 MFU | 123996 tok/s step 17112/19560 | loss 3.238004 (-1.41z)| norm 0.2564 (+0.63z)| lr 2.46e-05 | 4142.92 ms | 32.6% bf16 MFU | 124124 tok/s step 17113/19560 | loss 3.220001 (-1.84z)| norm 0.2457 (-0.61z)| lr 2.46e-05 | 4403.34 ms | 30.7% bf16 MFU | 123871 tok/s step 17114/19560 | loss 3.242087 (-1.25z)| norm 0.2942 (+4.55z)| lr 2.46e-05 | 4151.85 ms | 32.5% bf16 MFU | 123991 tok/s step 17115/19560 | loss 3.203996 (-2.19z)| norm 0.2573 (+0.62z)| lr 2.46e-05 | 4357.22 ms | 31.0% bf16 MFU | 123808 tok/s step 17116/19560 | loss 3.228679 (-1.53z)| norm 0.2467 (-0.50z)| lr 2.45e-05 | 4150.75 ms | 32.5% bf16 MFU | 123933 tok/s step 17117/19560 | loss 3.263182 (-0.67z)| norm 0.2545 (+0.31z)| lr 2.45e-05 | 4160.71 ms | 32.5% bf16 MFU | 124037 tok/s step 17118/19560 | loss 3.317007 (+0.71z)| norm 0.2368 (-1.56z)| lr 2.45e-05 | 4170.98 ms | 32.4% bf16 MFU | 124120 tok/s step 17119/19560 | loss 3.224888 (-1.62z)| norm 0.2694 (+1.86z)| lr 2.45e-05 | 4157.30 ms | 32.5% bf16 MFU | 
124220 tok/s step 17120/19560 | loss 3.261855 (-0.67z)| norm 0.2632 (+1.19z)| lr 2.45e-05 | 4148.39 ms | 32.5% bf16 MFU | 124328 tok/s step 17121/19560 | loss 3.223714 (-1.61z)| norm 0.2475 (-0.43z)| lr 2.44e-05 | 4272.10 ms | 31.6% bf16 MFU | 124248 tok/s step 17122/19560 | loss 3.279097 (-0.22z)| norm 0.2558 (+0.43z)| lr 2.44e-05 | 4208.74 ms | 32.1% bf16 MFU | 124264 tok/s step 17123/19560 | loss 3.254187 (-0.84z)| norm 0.2508 (-0.11z)| lr 2.44e-05 | 4383.46 ms | 30.8% bf16 MFU | 124031 tok/s step 17124/19560 | loss 3.294713 (+0.17z)| norm 0.2428 (-0.93z)| lr 2.44e-05 | 4194.06 ms | 32.2% bf16 MFU | 124080 tok/s step 17125/19560 | loss 3.294177 (+0.16z)| norm 0.2501 (-0.18z)| lr 2.44e-05 | 4166.71 ms | 32.4% bf16 MFU | 124167 tok/s step 17126/19560 | loss 3.289602 (+0.05z)| norm 0.2588 (+0.74z)| lr 2.43e-05 | 4131.51 ms | 32.7% bf16 MFU | 124304 tok/s step 17127/19560 | loss 3.275769 (-0.30z)| norm 0.2397 (-1.27z)| lr 2.43e-05 | 4108.51 ms | 32.9% bf16 MFU | 124469 tok/s step 17128/19560 | loss 3.299107 (+0.28z)| norm 0.2459 (-0.61z)| lr 2.43e-05 | 4116.54 ms | 32.8% bf16 MFU | 124614 tok/s step 17129/19560 | loss 3.240914 (-1.19z)| norm 0.2425 (-0.95z)| lr 2.43e-05 | 4126.18 ms | 32.7% bf16 MFU | 124736 tok/s step 17130/19560 | loss 3.253724 (-0.86z)| norm 0.2611 (+1.01z)| lr 2.43e-05 | 4170.61 ms | 32.4% bf16 MFU | 124785 tok/s step 17131/19560 | loss 3.270934 (-0.42z)| norm 0.2698 (+1.88z)| lr 2.42e-05 | 4976.35 ms | 27.1% bf16 MFU | 123813 tok/s step 17132/19560 | loss 3.301025 (+0.33z)| norm 0.2593 (+0.77z)| lr 2.42e-05 | 9855.33 ms | 13.7% bf16 MFU | 120283 tok/s step 17133/19560 | loss 3.292325 (+0.12z)| norm 0.2448 (-0.73z)| lr 2.42e-05 | 8126.72 ms | 16.6% bf16 MFU | 117494 tok/s step 17134/19560 | loss 3.294916 (+0.18z)| norm 0.2620 (+1.09z)| lr 2.42e-05 | 7445.93 ms | 18.1% bf16 MFU | 115140 tok/s step 17135/19560 | loss 3.308309 (+0.51z)| norm 0.2520 (+0.03z)| lr 2.42e-05 | 4583.40 ms | 29.5% bf16 MFU | 115103 tok/s step 17136/19560 | loss 3.267316 
(-0.52z)| norm 0.2508 (-0.09z)| lr 2.41e-05 | 4141.88 ms | 32.6% bf16 MFU | 115677 tok/s step 17137/19560 | loss 3.296459 (+0.24z)| norm 0.2522 (+0.07z)| lr 2.41e-05 | 4194.97 ms | 32.2% bf16 MFU | 116142 tok/s step 17138/19560 | loss 3.216877 (-1.80z)| norm 0.2692 (+1.85z)| lr 2.41e-05 | 16730.56 ms | 8.1% bf16 MFU | 111901 tok/s step 17139/19560 | loss 3.261778 (-0.64z)| norm 0.2516 (+0.01z)| lr 2.41e-05 | 6066.28 ms | 22.3% bf16 MFU | 110628 tok/s step 17140/19560 | loss 3.261013 (-0.66z)| norm 0.2506 (-0.09z)| lr 2.41e-05 | 4101.50 ms | 32.9% bf16 MFU | 111488 tok/s step 17141/19560 | loss 3.282703 (-0.07z)| norm 0.2650 (+1.44z)| lr 2.40e-05 | 4610.46 ms | 29.3% bf16 MFU | 111599 tok/s step 17142/19560 | loss 3.264747 (-0.56z)| norm 0.2607 (+0.98z)| lr 2.40e-05 | 4148.38 ms | 32.5% bf16 MFU | 112338 tok/s step 17143/19560 | loss 3.270579 (-0.39z)| norm 0.2424 (-0.97z)| lr 2.40e-05 | 4189.31 ms | 32.2% bf16 MFU | 112979 tok/s step 17144/19560 | loss 3.276859 (-0.21z)| norm 0.2440 (-0.79z)| lr 2.40e-05 | 4101.43 ms | 32.9% bf16 MFU | 113722 tok/s step 17145/19560 | loss 3.319087 (+0.96z)| norm 0.2438 (-0.80z)| lr 2.40e-05 | 4264.12 ms | 31.7% bf16 MFU | 114183 tok/s step 17146/19560 | loss 3.271282 (-0.38z)| norm 0.2524 (+0.11z)| lr 2.39e-05 | 4125.44 ms | 32.7% bf16 MFU | 114828 tok/s step 17147/19560 | loss 3.311092 (+0.72z)| norm 0.2649 (+1.44z)| lr 2.39e-05 | 4129.91 ms | 32.7% bf16 MFU | 115434 tok/s step 17148/19560 | loss 3.252438 (-0.90z)| norm 0.2606 (+0.96z)| lr 2.39e-05 | 4107.11 ms | 32.9% bf16 MFU | 116045 tok/s step 17149/19560 | loss 3.298983 (+0.39z)| norm 0.2499 (-0.17z)| lr 2.39e-05 | 4200.08 ms | 32.1% bf16 MFU | 116484 tok/s step 17150/19560 | loss 3.296271 (+0.31z)| norm 0.2456 (-0.62z)| lr 2.39e-05 | 4120.81 ms | 32.8% bf16 MFU | 117022 tok/s step 17151/19560 | loss 3.256037 (-0.79z)| norm 0.2461 (-0.57z)| lr 2.39e-05 | 4139.47 ms | 32.6% bf16 MFU | 117503 tok/s step 17152/19560 | loss 3.271921 (-0.35z)| norm 0.2680 (+1.71z)| lr 2.38e-05 | 
4247.73 ms | 31.8% bf16 MFU | 117800 tok/s step 17153/19560 | loss 3.243270 (-1.14z)| norm 0.2516 (+0.00z)| lr 2.38e-05 | 4139.14 ms | 32.6% bf16 MFU | 118243 tok/s step 17154/19560 | loss 3.240514 (-1.20z)| norm 0.2426 (-0.94z)| lr 2.38e-05 | 4138.69 ms | 32.6% bf16 MFU | 118665 tok/s step 17155/19560 | loss 3.288946 (+0.14z)| norm 0.2417 (-1.03z)| lr 2.38e-05 | 4231.95 ms | 31.9% bf16 MFU | 118926 tok/s step 17156/19560 | loss 3.286608 (+0.10z)| norm 0.2637 (+1.26z)| lr 2.38e-05 | 4131.95 ms | 32.7% bf16 MFU | 119324 tok/s step 17157/19560 | loss 3.300240 (+0.48z)| norm 0.2436 (-0.83z)| lr 2.37e-05 | 4139.26 ms | 32.6% bf16 MFU | 119691 tok/s step 17158/19560 | loss 3.221564 (-1.72z)| norm 0.2414 (-1.07z)| lr 2.37e-05 | 4137.48 ms | 32.6% bf16 MFU | 120042 tok/s step 17159/19560 | loss 3.262539 (-0.56z)| norm 0.2484 (-0.34z)| lr 2.37e-05 | 4143.36 ms | 32.6% bf16 MFU | 120367 tok/s step 17160/19560 | loss 3.233465 (-1.36z)| norm 0.2460 (-0.60z)| lr 2.37e-05 | 4154.91 ms | 32.5% bf16 MFU | 120658 tok/s step 17161/19560 | loss 3.291306 (+0.25z)| norm 0.2504 (-0.15z)| lr 2.37e-05 | 4152.19 ms | 32.5% bf16 MFU | 120938 tok/s step 17162/19560 | loss 3.253315 (-0.80z)| norm 0.2300 (-2.26z)| lr 2.36e-05 | 4224.48 ms | 32.0% bf16 MFU | 121097 tok/s step 17163/19560 | loss 3.262399 (-0.54z)| norm 0.2489 (-0.29z)| lr 2.36e-05 | 4165.62 ms | 32.4% bf16 MFU | 121335 tok/s step 17164/19560 | loss 3.277626 (-0.11z)| norm 0.2537 (+0.20z)| lr 2.36e-05 | 4147.84 ms | 32.6% bf16 MFU | 121588 tok/s step 17165/19560 | loss 3.282845 (+0.03z)| norm 0.2489 (-0.31z)| lr 2.36e-05 | 4153.68 ms | 32.5% bf16 MFU | 121820 tok/s step 17166/19560 | loss 3.289212 (+0.20z)| norm 0.2464 (-0.57z)| lr 2.36e-05 | 4148.20 ms | 32.5% bf16 MFU | 122048 tok/s step 17167/19560 | loss 3.353931 (+1.98z)| norm 0.2462 (-0.59z)| lr 2.35e-05 | 4193.89 ms | 32.2% bf16 MFU | 122197 tok/s step 17168/19560 | loss 3.272664 (-0.25z)| norm 0.2509 (-0.09z)| lr 2.35e-05 | 4187.44 ms | 32.2% bf16 MFU | 122347 tok/s step 
17169/19560 | loss 3.260702 (-0.59z)| norm 0.2499 (-0.19z)| lr 2.35e-05 | 4190.00 ms | 32.2% bf16 MFU | 122486 tok/s step 17170/19560 | loss 3.287029 (+0.17z)| norm 0.2655 (+1.43z)| lr 2.35e-05 | 4155.05 ms | 32.5% bf16 MFU | 122671 tok/s step 17171/19560 | loss 3.305637 (+0.71z)| norm 0.2345 (-1.79z)| lr 2.35e-05 | 4163.70 ms | 32.4% bf16 MFU | 122833 tok/s step 17172/19560 | loss 3.287686 (+0.18z)| norm 0.2484 (-0.35z)| lr 2.34e-05 | 4152.08 ms | 32.5% bf16 MFU | 123005 tok/s step 17173/19560 | loss 3.304909 (+0.68z)| norm 0.2557 (+0.41z)| lr 2.34e-05 | 4159.25 ms | 32.5% bf16 MFU | 123158 tok/s step 17174/19560 | loss 3.292755 (+0.33z)| norm 0.2521 (+0.04z)| lr 2.34e-05 | 4192.83 ms | 32.2% bf16 MFU | 123252 tok/s step 17175/19560 | loss 3.344266 (+1.80z)| norm 0.2349 (-1.72z)| lr 2.34e-05 | 4161.11 ms | 32.4% bf16 MFU | 123389 tok/s step 17176/19560 | loss 3.312220 (+0.86z)| norm 0.2481 (-0.35z)| lr 2.34e-05 | 4151.60 ms | 32.5% bf16 MFU | 123534 tok/s step 17177/19560 | loss 3.302243 (+0.58z)| norm 0.2494 (-0.20z)| lr 2.33e-05 | 4153.52 ms | 32.5% bf16 MFU | 123669 tok/s step 17178/19560 | loss 3.387958 (+2.94z)| norm 0.2370 (-1.46z)| lr 2.33e-05 | 4255.17 ms | 31.7% bf16 MFU | 123646 tok/s step 17179/19560 | loss 3.316768 (+0.95z)| norm 0.2499 (-0.12z)| lr 2.33e-05 | 4178.07 ms | 32.3% bf16 MFU | 123738 tok/s step 17180/19560 | loss 3.272740 (-0.30z)| norm 0.2420 (-0.94z)| lr 2.33e-05 | 4372.51 ms | 30.9% bf16 MFU | 123546 tok/s step 17181/19560 | loss 3.284255 (+0.03z)| norm 0.2492 (-0.19z)| lr 2.33e-05 | 4162.11 ms | 32.4% bf16 MFU | 123667 tok/s step 17182/19560 | loss 3.269358 (-0.38z)| norm 0.2447 (-0.65z)| lr 2.32e-05 | 4157.55 ms | 32.5% bf16 MFU | 123789 tok/s step 17183/19560 | loss 3.286856 (+0.12z)| norm 0.2620 (+1.13z)| lr 2.32e-05 | 4185.20 ms | 32.3% bf16 MFU | 123863 tok/s step 17184/19560 | loss 3.291181 (+0.24z)| norm 0.2490 (-0.22z)| lr 2.32e-05 | 4156.87 ms | 32.5% bf16 MFU | 123976 tok/s step 17185/19560 | loss 3.340207 (+1.64z)| norm 
0.2468 (-0.43z)| lr 2.32e-05 | 4231.17 ms | 31.9% bf16 MFU | 123973 tok/s step 17186/19560 | loss 3.246395 (-1.03z)| norm 0.2419 (-0.93z)| lr 2.32e-05 | 4267.55 ms | 31.6% bf16 MFU | 123917 tok/s step 17187/19560 | loss 3.304431 (+0.62z)| norm 0.2386 (-1.25z)| lr 2.32e-05 | 4192.60 ms | 32.2% bf16 MFU | 123974 tok/s step 17188/19560 | loss 3.290987 (+0.24z)| norm 0.2392 (-1.17z)| lr 2.31e-05 | 4315.65 ms | 31.3% bf16 MFU | 123849 tok/s step 17189/19560 | loss 3.257966 (-0.69z)| norm 0.2367 (-1.42z)| lr 2.31e-05 | 4243.38 ms | 31.8% bf16 MFU | 123835 tok/s step 17190/19560 | loss 3.252396 (-0.84z)| norm 0.2373 (-1.33z)| lr 2.31e-05 | 4159.97 ms | 32.5% bf16 MFU | 123945 tok/s step 17191/19560 | loss 3.302769 (+0.59z)| norm 0.2400 (-1.05z)| lr 2.31e-05 | 4177.30 ms | 32.3% bf16 MFU | 124023 tok/s step 17192/19560 | loss 3.263876 (-0.51z)| norm 0.2469 (-0.36z)| lr 2.31e-05 | 4192.51 ms | 32.2% bf16 MFU | 124074 tok/s step 17193/19560 | loss 3.305036 (+0.65z)| norm 0.2527 (+0.23z)| lr 2.30e-05 | 4179.01 ms | 32.3% bf16 MFU | 124143 tok/s step 17194/19560 | loss 3.275250 (-0.20z)| norm 0.2461 (-0.44z)| lr 2.30e-05 | 4181.62 ms | 32.3% bf16 MFU | 124205 tok/s step 17195/19560 | loss 3.219839 (-1.76z)| norm 0.2436 (-0.70z)| lr 2.30e-05 | 4162.31 ms | 32.4% bf16 MFU | 124293 tok/s step 17196/19560 | loss 3.262374 (-0.54z)| norm 0.2366 (-1.39z)| lr 2.30e-05 | 4174.31 ms | 32.3% bf16 MFU | 124358 tok/s step 17197/19560 | loss 3.280349 (-0.02z)| norm 0.2418 (-0.87z)| lr 2.30e-05 | 4162.83 ms | 32.4% bf16 MFU | 124438 tok/s step 17198/19560 | loss 3.271127 (-0.28z)| norm 0.2364 (-1.40z)| lr 2.29e-05 | 4182.78 ms | 32.3% bf16 MFU | 124483 tok/s step 17199/19560 | loss 3.351145 (+2.15z)| norm 0.2619 (+1.15z)| lr 2.29e-05 | 4187.42 ms | 32.2% bf16 MFU | 124519 tok/s step 17200/19560 | loss 3.333821 (+1.60z)| norm 0.2543 (+0.41z)| lr 2.29e-05 | 4195.19 ms | 32.2% bf16 MFU | 124542 tok/s step 17201/19560 | loss 3.270337 (-0.31z)| norm 0.2496 (-0.08z)| lr 2.29e-05 | 4171.60 ms | 
32.4% bf16 MFU | 124599 tok/s step 17202/19560 | loss 3.282502 (+0.05z)| norm 0.2468 (-0.35z)| lr 2.29e-05 | 4169.97 ms | 32.4% bf16 MFU | 124655 tok/s step 17203/19560 | loss 3.290117 (+0.28z)| norm 0.2561 (+0.58z)| lr 2.28e-05 | 4167.38 ms | 32.4% bf16 MFU | 124713 tok/s step 17204/19560 | loss 3.357544 (+2.26z)| norm 0.2674 (+1.70z)| lr 2.28e-05 | 4174.54 ms | 32.3% bf16 MFU | 124757 tok/s step 17205/19560 | loss 3.241130 (-1.17z)| norm 0.2333 (-1.72z)| lr 2.28e-05 | 4163.22 ms | 32.4% bf16 MFU | 124816 tok/s step 17206/19560 | loss 3.243945 (-1.07z)| norm 0.2433 (-0.72z)| lr 2.28e-05 | 4181.27 ms | 32.3% bf16 MFU | 124844 tok/s step 17207/19560 | loss 3.213589 (-1.93z)| norm 0.2406 (-0.98z)| lr 2.28e-05 | 4172.74 ms | 32.4% bf16 MFU | 124884 tok/s step 17208/19560 | loss 3.268399 (-0.31z)| norm 0.2386 (-1.16z)| lr 2.28e-05 | 4192.21 ms | 32.2% bf16 MFU | 124893 tok/s step 17209/19560 | loss 3.324904 (+1.36z)| norm 0.2477 (-0.24z)| lr 2.27e-05 | 4181.15 ms | 32.3% bf16 MFU | 124918 tok/s step 17210/19560 | loss 3.285798 (+0.19z)| norm 0.2452 (-0.48z)| lr 2.27e-05 | 4185.52 ms | 32.3% bf16 MFU | 124936 tok/s step 17211/19560 | loss 3.267304 (-0.36z)| norm 0.2590 (+0.92z)| lr 2.27e-05 | 4181.08 ms | 32.3% bf16 MFU | 124959 tok/s step 17212/19560 | loss 3.310659 (+0.92z)| norm 0.2301 (-1.99z)| lr 2.27e-05 | 4162.42 ms | 32.4% bf16 MFU | 125008 tok/s step 17213/19560 | loss 3.279861 (+0.00z)| norm 0.2530 (+0.31z)| lr 2.27e-05 | 4171.06 ms | 32.4% bf16 MFU | 125043 tok/s step 17214/19560 | loss 3.258862 (-0.61z)| norm 0.2496 (-0.02z)| lr 2.26e-05 | 4168.28 ms | 32.4% bf16 MFU | 125080 tok/s step 17215/19560 | loss 3.293999 (+0.43z)| norm 0.2482 (-0.17z)| lr 2.26e-05 | 4183.08 ms | 32.3% bf16 MFU | 125093 tok/s step 17216/19560 | loss 3.226686 (-1.54z)| norm 0.2377 (-1.23z)| lr 2.26e-05 | 4182.27 ms | 32.3% bf16 MFU | 125106 tok/s step 17217/19560 | loss 3.356969 (+2.26z)| norm 0.2520 (+0.27z)| lr 2.26e-05 | 4181.67 ms | 32.3% bf16 MFU | 125119 tok/s step 17218/19560 
| loss 3.242001 (-1.06z)| norm 0.2531 (+0.39z)| lr 2.26e-05 | 4193.78 ms | 32.2% bf16 MFU | 125114 tok/s step 17219/19560 | loss 3.271483 (-0.21z)| norm 0.2387 (-1.12z)| lr 2.25e-05 | 4157.80 ms | 32.5% bf16 MFU | 125163 tok/s step 17220/19560 | loss 3.268763 (-0.29z)| norm 0.2502 (+0.08z)| lr 2.25e-05 | 4176.96 ms | 32.3% bf16 MFU | 125181 tok/s step 17221/19560 | loss 3.266331 (-0.35z)| norm 0.2498 (+0.06z)| lr 2.25e-05 | 4184.15 ms | 32.3% bf16 MFU | 125187 tok/s step 17222/19560 | loss 3.268472 (-0.28z)| norm 0.2525 (+0.35z)| lr 2.25e-05 | 4161.39 ms | 32.4% bf16 MFU | 125227 tok/s step 17223/19560 | loss 3.217811 (-1.75z)| norm 0.2505 (+0.13z)| lr 2.25e-05 | 4170.05 ms | 32.4% bf16 MFU | 125252 tok/s step 17224/19560 | loss 3.281722 (+0.11z)| norm 0.2395 (-1.03z)| lr 2.24e-05 | 4165.39 ms | 32.4% bf16 MFU | 125283 tok/s step 17225/19560 | loss 3.288399 (+0.29z)| norm 0.2558 (+0.69z)| lr 2.24e-05 | 4210.24 ms | 32.1% bf16 MFU | 125245 tok/s step 17226/19560 | loss 3.219992 (-1.67z)| norm 0.2536 (+0.46z)| lr 2.24e-05 | 4172.76 ms | 32.4% bf16 MFU | 125265 tok/s step 17227/19560 | loss 3.248308 (-0.84z)| norm 0.2498 (+0.04z)| lr 2.24e-05 | 4157.91 ms | 32.5% bf16 MFU | 125307 tok/s step 17228/19560 | loss 3.322776 (+1.34z)| norm 0.2469 (-0.25z)| lr 2.24e-05 | 4177.21 ms | 32.3% bf16 MFU | 125317 tok/s step 17229/19560 | loss 3.208615 (-1.99z)| norm 0.2548 (+0.58z)| lr 2.24e-05 | 4258.33 ms | 31.7% bf16 MFU | 125207 tok/s step 17230/19560 | loss 3.243978 (-0.94z)| norm 0.2524 (+0.32z)| lr 2.23e-05 | 4170.17 ms | 32.4% bf16 MFU | 125233 tok/s step 17231/19560 | loss 3.276699 (+0.01z)| norm 0.2487 (-0.07z)| lr 2.23e-05 | 4268.00 ms | 31.6% bf16 MFU | 125113 tok/s step 17232/19560 | loss 3.241216 (-1.01z)| norm 0.2530 (+0.37z)| lr 2.23e-05 | 4314.04 ms | 31.3% bf16 MFU | 124934 tok/s step 17233/19560 | loss 3.221734 (-1.55z)| norm 0.2497 (+0.01z)| lr 2.23e-05 | 4174.87 ms | 32.3% bf16 MFU | 124967 tok/s step 17234/19560 | loss 3.299736 (+0.70z)| norm 0.2519 (+0.24z)| 
lr 2.23e-05 | 4161.23 ms | 32.4% bf16 MFU | 125018 tok/s step 17235/19560 | loss 3.282711 (+0.19z)| norm 0.2552 (+0.59z)| lr 2.22e-05 | 4283.25 ms | 31.5% bf16 MFU | 124887 tok/s step 17236/19560 | loss 3.349544 (+2.10z)| norm 0.2612 (+1.21z)| lr 2.22e-05 | 4184.13 ms | 32.3% bf16 MFU | 124908 tok/s step 17237/19560 | loss 3.244724 (-0.91z)| norm 0.2566 (+0.73z)| lr 2.22e-05 | 4246.92 ms | 31.8% bf16 MFU | 124835 tok/s step 17238/19560 | loss 3.266219 (-0.29z)| norm 0.2396 (-1.08z)| lr 2.22e-05 | 4160.91 ms | 32.4% bf16 MFU | 124894 tok/s step 17239/19560 | loss 3.266033 (-0.29z)| norm 0.2672 (+1.83z)| lr 2.22e-05 | 4169.86 ms | 32.4% bf16 MFU | 124936 tok/s step 17240/19560 | loss 3.256442 (-0.58z)| norm 0.2431 (-0.70z)| lr 2.21e-05 | 4261.86 ms | 31.7% bf16 MFU | 124840 tok/s step 17241/19560 | loss 3.306431 (+0.87z)| norm 0.2560 (+0.65z)| lr 2.21e-05 | 4388.34 ms | 30.8% bf16 MFU | 124571 tok/s step 17242/19560 | loss 3.288276 (+0.32z)| norm 0.2521 (+0.30z)| lr 2.21e-05 | 4156.85 ms | 32.5% bf16 MFU | 124649 tok/s step 17243/19560 | loss 3.251569 (-0.78z)| norm 0.2339 (-1.77z)| lr 2.21e-05 | 4167.15 ms | 32.4% bf16 MFU | 124707 tok/s step 17244/19560 | loss 3.235878 (-1.26z)| norm 0.2390 (-1.17z)| lr 2.21e-05 | 4222.24 ms | 32.0% bf16 MFU | 124681 tok/s step 17245/19560 | loss 3.252671 (-0.75z)| norm 0.2451 (-0.47z)| lr 2.20e-05 | 4198.67 ms | 32.2% bf16 MFU | 124690 tok/s step 17246/19560 | loss 3.300595 (+0.70z)| norm 0.2430 (-0.71z)| lr 2.20e-05 | 4158.08 ms | 32.5% bf16 MFU | 124760 tok/s step 17247/19560 | loss 3.299632 (+0.66z)| norm 0.2566 (+0.87z)| lr 2.20e-05 | 4181.44 ms | 32.3% bf16 MFU | 124791 tok/s step 17248/19560 | loss 3.285002 (+0.21z)| norm 0.2456 (-0.40z)| lr 2.20e-05 | 4321.09 ms | 31.2% bf16 MFU | 124618 tok/s step 17249/19560 | loss 3.244406 (-1.04z)| norm 0.2431 (-0.69z)| lr 2.20e-05 | 4450.22 ms | 30.3% bf16 MFU | 124278 tok/s step 17250/19560 | loss 3.310022 (+0.96z)| norm 0.2417 (-0.84z)| lr 2.20e-05 | 4160.23 ms | 32.5% bf16 MFU | 
124365 tok/s val loss 3.271929 HellaSwag: 3028/10042 = 0.301534 step 
17251/19560 | loss 3.255058 (-0.72z)| norm 0.2434 (-0.63z)| lr 2.19e-05 | 4174.93 ms | 32.3% bf16 MFU | 124426 tok/s step 17252/19560 | loss 3.248025 (-0.92z)| norm 0.2447 (-0.49z)| lr 2.19e-05 | 4195.45 ms | 32.2% bf16 MFU | 124453 tok/s step 17253/19560 | loss 3.347425 (+2.06z)| norm 0.2647 (+1.84z)| lr 2.19e-05 | 4214.18 ms | 32.0% bf16 MFU | 124451 tok/s step 17254/19560 | loss 3.255002 (-0.70z)| norm 0.2403 (-0.99z)| lr 2.19e-05 | 4176.36 ms | 32.3% bf16 MFU | 124505 tok/s step 17255/19560 | loss 3.238909 (-1.17z)| norm 0.2549 (+0.70z)| lr 2.19e-05 | 4168.68 ms | 32.4% bf16 MFU | 124568 tok/s step 17256/19560 | loss 3.296343 (+0.54z)| norm 0.2453 (-0.42z)| lr 2.18e-05 | 4173.86 ms | 32.3% bf16 MFU | 124621 tok/s step 17257/19560 | loss 3.203074 (-2.20z)| norm 0.2465 (-0.29z)| lr 2.18e-05 | 4164.61 ms | 32.4% bf16 MFU | 124684 tok/s step 17258/19560 | loss 3.284998 (+0.20z)| norm 0.2540 (+0.61z)| lr 2.18e-05 | 4178.12 ms | 32.3% bf16 MFU | 124724 tok/s step 17259/19560 | loss 3.296783 (+0.54z)| norm 0.2571 (+1.00z)| lr 2.18e-05 | 4176.44 ms | 32.3% bf16 MFU | 124765 tok/s step 17260/19560 | loss 3.303672 (+0.75z)| norm 0.2433 (-0.64z)| lr 2.18e-05 | 4155.83 ms | 32.5% bf16 MFU | 124834 tok/s step 17261/19560 | loss 3.299367 (+0.62z)| norm 0.2444 (-0.52z)| lr 2.17e-05 | 4184.84 ms | 32.3% bf16 MFU | 124857 tok/s step 17262/19560 | loss 3.303829 (+0.74z)| norm 0.2929 (+4.86z)| lr 2.17e-05 | 4157.98 ms | 32.5% bf16 MFU | 124918 tok/s step 17263/19560 | loss 3.212153 (-1.90z)| norm 0.2448 (-0.44z)| lr 2.17e-05 | 4184.31 ms | 32.3% bf16 MFU | 124937 tok/s step 17264/19560 | loss 3.213293 (-1.83z)| norm 0.2418 (-0.77z)| lr 2.17e-05 | 4165.51 ms | 32.4% bf16 MFU | 124984 tok/s step 17265/19560 | loss 3.323120 (+1.30z)| norm 0.2291 (-2.10z)| lr 2.17e-05 | 4160.50 ms | 32.5% bf16 MFU | 125035 tok/s step 17266/19560 | loss 3.270745 (-0.21z)| norm 0.2324 (-1.73z)| lr 2.17e-05 | 4173.34 ms | 32.4% bf16 MFU | 125065 tok/s step 17267/19560 | loss 3.323185 (+1.28z)| norm 
0.2394 (-0.95z)| lr 2.16e-05 | 4177.81 ms | 32.3% bf16 MFU | 125086 tok/s step 17268/19560 | loss 3.242594 (-1.02z)| norm 0.2520 (+0.41z)| lr 2.16e-05 | 4189.45 ms | 32.2% bf16 MFU | 125089 tok/s step 17269/19560 | loss 3.310135 (+0.90z)| norm 0.2505 (+0.27z)| lr 2.16e-05 | 4197.96 ms | 32.2% bf16 MFU | 125079 tok/s step 17270/19560 | loss 3.284678 (+0.17z)| norm 0.2516 (+0.40z)| lr 2.16e-05 | 4224.31 ms | 32.0% bf16 MFU | 125031 tok/s step 17271/19560 | loss 3.303941 (+0.71z)| norm 0.2598 (+1.29z)| lr 2.16e-05 | 4189.15 ms | 32.2% bf16 MFU | 125037 tok/s step 17272/19560 | loss 3.260368 (-0.52z)| norm 0.2492 (+0.11z)| lr 2.15e-05 | 4163.46 ms | 32.4% bf16 MFU | 125082 tok/s step 17273/19560 | loss 3.288256 (+0.28z)| norm 0.2517 (+0.38z)| lr 2.15e-05 | 4184.75 ms | 32.3% bf16 MFU | 125092 tok/s step 17274/19560 | loss 3.227927 (-1.42z)| norm 0.2403 (-0.87z)| lr 2.15e-05 | 4184.96 ms | 32.3% bf16 MFU | 125101 tok/s step 17275/19560 | loss 3.294478 (+0.47z)| norm 0.2430 (-0.56z)| lr 2.15e-05 | 4206.99 ms | 32.1% bf16 MFU | 125077 tok/s step 17276/19560 | loss 3.260682 (-0.50z)| norm 0.2402 (-0.85z)| lr 2.15e-05 | 4178.68 ms | 32.3% bf16 MFU | 125097 tok/s step 17277/19560 | loss 3.218543 (-1.66z)| norm 0.2386 (-1.02z)| lr 2.15e-05 | 4343.57 ms | 31.1% bf16 MFU | 124877 tok/s step 17278/19560 | loss 3.281943 (+0.13z)| norm 0.2383 (-1.05z)| lr 2.14e-05 | 4217.72 ms | 32.0% bf16 MFU | 124849 tok/s step 17279/19560 | loss 3.235907 (-1.16z)| norm 0.2448 (-0.32z)| lr 2.14e-05 | 4161.89 ms | 32.4% bf16 MFU | 124905 tok/s step 17280/19560 | loss 3.234457 (-1.19z)| norm 0.2468 (-0.08z)| lr 2.14e-05 | 4175.60 ms | 32.3% bf16 MFU | 124938 tok/s step 17281/19560 | loss 3.323998 (+1.29z)| norm 0.2555 (+0.90z)| lr 2.14e-05 | 4233.18 ms | 31.9% bf16 MFU | 124883 tok/s step 17282/19560 | loss 3.278429 (+0.02z)| norm 0.2377 (-1.11z)| lr 2.14e-05 | 4166.46 ms | 32.4% bf16 MFU | 124931 tok/s step 17283/19560 | loss 3.307942 (+0.84z)| norm 0.2519 (+0.49z)| lr 2.13e-05 | 4178.09 ms | 
32.3% bf16 MFU | 124959 tok/s step 17284/19560 | loss 3.290459 (+0.35z)| norm 0.2409 (-0.75z)| lr 2.13e-05 | 4167.73 ms | 32.4% bf16 MFU | 125001 tok/s step 17285/19560 | loss 3.357270 (+2.16z)| norm 0.2421 (-0.61z)| lr 2.13e-05 | 4183.37 ms | 32.3% bf16 MFU | 125017 tok/s step 17286/19560 | loss 3.285928 (+0.19z)| norm 0.2441 (-0.38z)| lr 2.13e-05 | 4177.62 ms | 32.3% bf16 MFU | 125041 tok/s step 17287/19560 | loss 3.238369 (-1.11z)| norm 0.2440 (-0.39z)| lr 2.13e-05 | 4170.51 ms | 32.4% bf16 MFU | 125075 tok/s step 17288/19560 | loss 3.345159 (+1.79z)| norm 0.2429 (-0.51z)| lr 2.12e-05 | 4176.34 ms | 32.3% bf16 MFU | 125098 tok/s step 17289/19560 | loss 3.297621 (+0.49z)| norm 0.2638 (+1.84z)| lr 2.12e-05 | 4176.79 ms | 32.3% bf16 MFU | 125119 tok/s step 17290/19560 | loss 3.255509 (-0.66z)| norm 0.2372 (-1.18z)| lr 2.12e-05 | 4193.83 ms | 32.2% bf16 MFU | 125114 tok/s step 17291/19560 | loss 3.284559 (+0.13z)| norm 0.2418 (-0.64z)| lr 2.12e-05 | 4186.83 ms | 32.2% bf16 MFU | 125119 tok/s step 17292/19560 | loss 3.287862 (+0.22z)| norm 0.2465 (-0.10z)| lr 2.12e-05 | 4159.81 ms | 32.5% bf16 MFU | 125165 tok/s step 17293/19560 | loss 3.337776 (+1.56z)| norm 0.2519 (+0.51z)| lr 2.12e-05 | 4197.66 ms | 32.2% bf16 MFU | 125152 tok/s step 17294/19560 | loss 3.330531 (+1.34z)| norm 0.2481 (+0.07z)| lr 2.11e-05 | 11549.84 ms | 11.7% bf16 MFU | 121164 tok/s step 17295/19560 | loss 3.290412 (+0.28z)| norm 0.2413 (-0.69z)| lr 2.11e-05 | 5112.11 ms | 26.4% bf16 MFU | 120234 tok/s step 17296/19560 | loss 3.314130 (+0.92z)| norm 0.2447 (-0.31z)| lr 2.11e-05 | 4357.09 ms | 31.0% bf16 MFU | 120238 tok/s step 17297/19560 | loss 3.269827 (-0.29z)| norm 0.2511 (+0.43z)| lr 2.11e-05 | 4168.37 ms | 32.4% bf16 MFU | 120515 tok/s step 17298/19560 | loss 3.269150 (-0.31z)| norm 0.2422 (-0.57z)| lr 2.11e-05 | 4159.91 ms | 32.5% bf16 MFU | 120791 tok/s step 17299/19560 | loss 3.227101 (-1.43z)| norm 0.2392 (-0.92z)| lr 2.10e-05 | 4157.89 ms | 32.5% bf16 MFU | 121057 tok/s step 17300/19560 
| loss 3.192913 (-2.29z)| norm 0.2416 (-0.64z)| lr 2.10e-05 | 23497.61 ms | 5.7% bf16 MFU | 116119 tok/s step 17301/19560 | loss 3.197583 (-2.11z)| norm 0.2480 (+0.10z)| lr 2.10e-05 | 4194.11 ms | 32.2% bf16 MFU | 116564 tok/s step 17302/19560 | loss 3.392209 (+2.86z)| norm 0.2404 (-0.77z)| lr 2.10e-05 | 4133.18 ms | 32.7% bf16 MFU | 117078 tok/s step 17303/19560 | loss 3.238289 (-1.02z)| norm 0.2349 (-1.41z)| lr 2.10e-05 | 4139.00 ms | 32.6% bf16 MFU | 117558 tok/s step 17304/19560 | loss 3.298748 (+0.53z)| norm 0.2640 (+1.93z)| lr 2.10e-05 | 4127.84 ms | 32.7% bf16 MFU | 118030 tok/s step 17305/19560 | loss 3.273719 (-0.11z)| norm 0.2518 (+0.53z)| lr 2.09e-05 | 4187.85 ms | 32.2% bf16 MFU | 118388 tok/s step 17306/19560 | loss 3.228598 (-1.26z)| norm 0.2587 (+1.29z)| lr 2.09e-05 | 4130.69 ms | 32.7% bf16 MFU | 118815 tok/s step 17307/19560 | loss 3.233970 (-1.10z)| norm 0.2411 (-0.71z)| lr 2.09e-05 | 4163.04 ms | 32.4% bf16 MFU | 119171 tok/s step 17308/19560 | loss 3.302379 (+0.68z)| norm 0.2435 (-0.43z)| lr 2.09e-05 | 4154.67 ms | 32.5% bf16 MFU | 119522 tok/s step 17309/19560 | loss 3.272245 (-0.10z)| norm 0.2369 (-1.16z)| lr 2.09e-05 | 4153.53 ms | 32.5% bf16 MFU | 119858 tok/s step 17310/19560 | loss 3.230607 (-1.18z)| norm 0.2335 (-1.53z)| lr 2.08e-05 | 4143.27 ms | 32.6% bf16 MFU | 120192 tok/s step 17311/19560 | loss 3.267886 (-0.20z)| norm 0.2389 (-0.91z)| lr 2.08e-05 | 4177.57 ms | 32.3% bf16 MFU | 120457 tok/s step 17312/19560 | loss 3.283952 (+0.22z)| norm 0.2351 (-1.32z)| lr 2.08e-05 | 4135.21 ms | 32.7% bf16 MFU | 120774 tok/s step 17313/19560 | loss 3.223468 (-1.34z)| norm 0.2400 (-0.75z)| lr 2.08e-05 | 4161.36 ms | 32.4% bf16 MFU | 121034 tok/s step 17314/19560 | loss 3.280599 (+0.15z)| norm 0.2472 (+0.05z)| lr 2.08e-05 | 4147.44 ms | 32.6% bf16 MFU | 121303 tok/s step 17315/19560 | loss 3.250802 (-0.62z)| norm 0.2403 (-0.73z)| lr 2.08e-05 | 4163.07 ms | 32.4% bf16 MFU | 121535 tok/s step 17316/19560 | loss 3.263043 (-0.30z)| norm 0.2428 (-0.45z)| 
lr 2.07e-05 | 4156.27 ms | 32.5% bf16 MFU | 121765 tok/s step 17317/19560 | loss 3.237312 (-0.96z)| norm 0.2525 (+0.62z)| lr 2.07e-05 | 4167.47 ms | 32.4% bf16 MFU | 121967 tok/s step 17318/19560 | loss 3.237199 (-0.96z)| norm 0.2453 (-0.19z)| lr 2.07e-05 | 4173.27 ms | 32.4% bf16 MFU | 122151 tok/s step 17319/19560 | loss 3.240673 (-0.86z)| norm 0.2401 (-0.79z)| lr 2.07e-05 | 4167.69 ms | 32.4% bf16 MFU | 122333 tok/s step 17320/19560 | loss 3.239140 (-0.89z)| norm 0.2423 (-0.53z)| lr 2.07e-05 | 4165.58 ms | 32.4% bf16 MFU | 122509 tok/s step 17321/19560 | loss 3.248637 (-0.63z)| norm 0.2340 (-1.45z)| lr 2.06e-05 | 4159.53 ms | 32.5% bf16 MFU | 122686 tok/s step 17322/19560 | loss 3.295997 (+0.59z)| norm 0.2402 (-0.74z)| lr 2.06e-05 | 4281.57 ms | 31.5% bf16 MFU | 122675 tok/s step 17323/19560 | loss 3.224388 (-1.27z)| norm 0.2359 (-1.21z)| lr 2.06e-05 | 4174.26 ms | 32.3% bf16 MFU | 122821 tok/s step 17324/19560 | loss 3.280517 (+0.19z)| norm 0.2522 (+0.60z)| lr 2.06e-05 | 4216.54 ms | 32.0% bf16 MFU | 122897 tok/s step 17325/19560 | loss 3.265769 (-0.19z)| norm 0.2434 (-0.39z)| lr 2.06e-05 | 4167.80 ms | 32.4% bf16 MFU | 123042 tok/s step 17326/19560 | loss 3.300813 (+0.71z)| norm 0.2386 (-0.94z)| lr 2.06e-05 | 4192.15 ms | 32.2% bf16 MFU | 123143 tok/s step 17327/19560 | loss 3.250103 (-0.59z)| norm 0.2423 (-0.50z)| lr 2.05e-05 | 4161.44 ms | 32.4% bf16 MFU | 123285 tok/s step 17328/19560 | loss 3.213254 (-1.54z)| norm 0.2364 (-1.15z)| lr 2.05e-05 | 4180.01 ms | 32.3% bf16 MFU | 123392 tok/s step 17329/19560 | loss 3.361917 (+2.31z)| norm 0.2455 (-0.12z)| lr 2.05e-05 | 4175.02 ms | 32.3% bf16 MFU | 123501 tok/s step 17330/19560 | loss 3.215549 (-1.44z)| norm 0.2595 (+1.44z)| lr 2.05e-05 | 4158.26 ms | 32.5% bf16 MFU | 123631 tok/s step 17331/19560 | loss 3.220159 (-1.30z)| norm 0.2303 (-1.80z)| lr 2.05e-05 | 4158.94 ms | 32.5% bf16 MFU | 123752 tok/s step 17332/19560 | loss 3.225657 (-1.15z)| norm 0.2413 (-0.57z)| lr 2.04e-05 | 4179.18 ms | 32.3% bf16 MFU | 
123837 tok/s step 17333/19560 | loss 3.332187 (+1.57z)| norm 0.2661 (+2.21z)| lr 2.04e-05 | 4178.58 ms | 32.3% bf16 MFU | 123919 tok/s step 17334/19560 | loss 3.281486 (+0.26z)| norm 0.2401 (-0.72z)| lr 2.04e-05 | 4175.10 ms | 32.3% bf16 MFU | 124002 tok/s step 17335/19560 | loss 3.255986 (-0.41z)| norm 0.2455 (-0.11z)| lr 2.04e-05 | 4159.69 ms | 32.5% bf16 MFU | 124104 tok/s step 17336/19560 | loss 3.257982 (-0.35z)| norm 0.2388 (-0.87z)| lr 2.04e-05 | 4159.15 ms | 32.5% bf16 MFU | 124201 tok/s step 17337/19560 | loss 3.285452 (+0.37z)| norm 0.2617 (+1.68z)| lr 2.04e-05 | 4161.84 ms | 32.4% bf16 MFU | 124290 tok/s step 17338/19560 | loss 3.260633 (-0.27z)| norm 0.2626 (+1.74z)| lr 2.03e-05 | 4174.41 ms | 32.3% bf16 MFU | 124355 tok/s step 17339/19560 | loss 3.305149 (+0.88z)| norm 0.2385 (-0.90z)| lr 2.03e-05 | 4169.20 ms | 32.4% bf16 MFU | 124425 tok/s step 17340/19560 | loss 3.262659 (-0.22z)| norm 0.2513 (+0.51z)| lr 2.03e-05 | 4168.45 ms | 32.4% bf16 MFU | 124493 tok/s step 17341/19560 | loss 3.314158 (+1.11z)| norm 0.2679 (+2.31z)| lr 2.03e-05 | 4167.71 ms | 32.4% bf16 MFU | 124558 tok/s step 17342/19560 | loss 3.326947 (+1.42z)| norm 0.2608 (+1.51z)| lr 2.03e-05 | 4182.09 ms | 32.3% bf16 MFU | 124598 tok/s step 17343/19560 | loss 3.287491 (+0.41z)| norm 0.2616 (+1.56z)| lr 2.02e-05 | 4152.01 ms | 32.5% bf16 MFU | 124682 tok/s step 17344/19560 | loss 3.243019 (-0.75z)| norm 0.2590 (+1.26z)| lr 2.02e-05 | 4175.12 ms | 32.3% bf16 MFU | 124727 tok/s step 17345/19560 | loss 3.314081 (+1.12z)| norm 0.2701 (+2.39z)| lr 2.02e-05 | 4169.87 ms | 32.4% bf16 MFU | 124777 tok/s step 17346/19560 | loss 3.249430 (-0.58z)| norm 0.2489 (+0.17z)| lr 2.02e-05 | 4182.22 ms | 32.3% bf16 MFU | 124806 tok/s step 17347/19560 | loss 3.247677 (-0.62z)| norm 0.2501 (+0.28z)| lr 2.02e-05 | 4174.84 ms | 32.3% bf16 MFU | 124845 tok/s step 17348/19560 | loss 3.299708 (+0.73z)| norm 0.2542 (+0.71z)| lr 2.02e-05 | 4164.26 ms | 32.4% bf16 MFU | 124898 tok/s step 17349/19560 | loss 3.288106 
(+0.43z)| norm 0.2416 (-0.61z)| lr 2.01e-05 | 4242.31 ms | 31.8% bf16 MFU | 124832 tok/s step 17350/19560 | loss 3.191403 (-2.05z)| norm 0.2483 (+0.09z)| lr 2.01e-05 | 4153.52 ms | 32.5% bf16 MFU | 124902 tok/s step 17351/19560 | loss 3.262483 (-0.23z)| norm 0.2546 (+0.76z)| lr 2.01e-05 | 4161.99 ms | 32.4% bf16 MFU | 124955 tok/s step 17352/19560 | loss 3.225166 (-1.18z)| norm 0.2623 (+1.54z)| lr 2.01e-05 | 4149.37 ms | 32.5% bf16 MFU | 125025 tok/s step 17353/19560 | loss 3.309318 (+0.98z)| norm 0.2533 (+0.60z)| lr 2.01e-05 | 4172.68 ms | 32.4% bf16 MFU | 125056 tok/s step 17354/19560 | loss 3.282452 (+0.28z)| norm 0.2459 (-0.17z)| lr 2.00e-05 | 4163.67 ms | 32.4% bf16 MFU | 125100 tok/s step 17355/19560 | loss 3.330346 (+1.49z)| norm 0.2619 (+1.49z)| lr 2.00e-05 | 4203.02 ms | 32.1% bf16 MFU | 125082 tok/s step 17356/19560 | loss 3.220026 (-1.32z)| norm 0.2321 (-1.58z)| lr 2.00e-05 | 4163.53 ms | 32.4% bf16 MFU | 125124 tok/s step 17357/19560 | loss 3.322312 (+1.28z)| norm 0.2595 (+1.23z)| lr 2.00e-05 | 4163.24 ms | 32.4% bf16 MFU | 125164 tok/s step 17358/19560 | loss 3.237369 (-0.90z)| norm 0.2398 (-0.78z)| lr 2.00e-05 | 4173.41 ms | 32.4% bf16 MFU | 125187 tok/s step 17359/19560 | loss 3.223674 (-1.24z)| norm 0.2473 (-0.01z)| lr 2.00e-05 | 4173.08 ms | 32.4% bf16 MFU | 125210 tok/s step 17360/19560 | loss 3.254163 (-0.46z)| norm 0.2318 (-1.57z)| lr 1.99e-05 | 4170.79 ms | 32.4% bf16 MFU | 125234 tok/s step 17361/19560 | loss 3.302052 (+0.75z)| norm 0.2414 (-0.59z)| lr 1.99e-05 | 4176.22 ms | 32.3% bf16 MFU | 125250 tok/s step 17362/19560 | loss 3.282275 (+0.25z)| norm 0.2449 (-0.23z)| lr 1.99e-05 | 4176.39 ms | 32.3% bf16 MFU | 125264 tok/s step 17363/19560 | loss 3.238161 (-0.88z)| norm 0.2431 (-0.40z)| lr 1.99e-05 | 4194.00 ms | 32.2% bf16 MFU | 125251 tok/s step 17364/19560 | loss 3.272481 (+0.02z)| norm 0.2466 (-0.04z)| lr 1.99e-05 | 4177.55 ms | 32.3% bf16 MFU | 125264 tok/s step 17365/19560 | loss 3.295222 (+0.61z)| norm 0.2472 (+0.04z)| lr 1.98e-05 | 
4181.77 ms | 32.3% bf16 MFU | 125269 tok/s step 17366/19560 | loss 3.227483 (-1.15z)| norm 0.2415 (-0.55z)| lr 1.98e-05 | 4172.19 ms | 32.4% bf16 MFU | 125289 tok/s step 17367/19560 | loss 3.230383 (-1.06z)| norm 0.2472 (+0.06z)| lr 1.98e-05 | 4194.26 ms | 32.2% bf16 MFU | 125275 tok/s step 17368/19560 | loss 3.295181 (+0.61z)| norm 0.2307 (-1.66z)| lr 1.98e-05 | 4179.75 ms | 32.3% bf16 MFU | 125283 tok/s step 17369/19560 | loss 3.329937 (+1.49z)| norm 0.2409 (-0.58z)| lr 1.98e-05 | 4179.33 ms | 32.3% bf16 MFU | 125291 tok/s step 17370/19560 | loss 3.242823 (-0.74z)| norm 0.2507 (+0.44z)| lr 1.98e-05 | 4170.12 ms | 32.4% bf16 MFU | 125313 tok/s step 17371/19560 | loss 3.228830 (-1.09z)| norm 0.2500 (+0.36z)| lr 1.97e-05 | 4153.74 ms | 32.5% bf16 MFU | 125358 tok/s step 17372/19560 | loss 3.262295 (-0.24z)| norm 0.2433 (-0.35z)| lr 1.97e-05 | 4161.33 ms | 32.4% bf16 MFU | 125390 tok/s step 17373/19560 | loss 3.275933 (+0.11z)| norm 0.2415 (-0.53z)| lr 1.97e-05 | 4165.09 ms | 32.4% bf16 MFU | 125414 tok/s step 17374/19560 | loss 3.256111 (-0.40z)| norm 0.2526 (+0.62z)| lr 1.97e-05 | 4175.44 ms | 32.3% bf16 MFU | 125422 tok/s step 17375/19560 | loss 3.248394 (-0.58z)| norm 0.2488 (+0.23z)| lr 1.97e-05 | 4185.47 ms | 32.3% bf16 MFU | 125414 tok/s step 17376/19560 | loss 3.260751 (-0.26z)| norm 0.2487 (+0.22z)| lr 1.97e-05 | 4164.52 ms | 32.4% bf16 MFU | 125438 tok/s step 17377/19560 | loss 3.296262 (+0.64z)| norm 0.2440 (-0.27z)| lr 1.96e-05 | 4169.52 ms | 32.4% bf16 MFU | 125453 tok/s step 17378/19560 | loss 3.268902 (-0.05z)| norm 0.2504 (+0.38z)| lr 1.96e-05 | 4161.10 ms | 32.4% bf16 MFU | 125480 tok/s step 17379/19560 | loss 3.235252 (-0.92z)| norm 0.2806 (+3.38z)| lr 1.96e-05 | 4176.79 ms | 32.3% bf16 MFU | 125482 tok/s step 17380/19560 | loss 3.223066 (-1.22z)| norm 0.2428 (-0.42z)| lr 1.96e-05 | 4180.53 ms | 32.3% bf16 MFU | 125479 tok/s step 17381/19560 | loss 3.255233 (-0.38z)| norm 0.2489 (+0.20z)| lr 1.96e-05 | 4160.33 ms | 32.5% bf16 MFU | 125506 tok/s step 
17382/19560 | loss 3.251539 (-0.48z)| norm 0.2336 (-1.34z)| lr 1.95e-05 | 4171.77 ms | 32.4% bf16 MFU | 125514 tok/s step 17383/19560 | loss 3.321332 (+1.32z)| norm 0.2494 (+0.27z)| lr 1.95e-05 | 4176.31 ms | 32.3% bf16 MFU | 125516 tok/s step 17384/19560 | loss 3.310904 (+1.04z)| norm 0.2443 (-0.25z)| lr 1.95e-05 | 4170.32 ms | 32.4% bf16 MFU | 125526 tok/s step 17385/19560 | loss 3.257349 (-0.36z)| norm 0.2521 (+0.53z)| lr 1.95e-05 | 4179.35 ms | 32.3% bf16 MFU | 125522 tok/s step 17386/19560 | loss 3.313049 (+1.09z)| norm 0.2440 (-0.28z)| lr 1.95e-05 | 4164.71 ms | 32.4% bf16 MFU | 125540 tok/s step 17387/19560 | loss 3.261671 (-0.24z)| norm 0.2252 (-2.13z)| lr 1.95e-05 | 4169.85 ms | 32.4% bf16 MFU | 125550 tok/s step 17388/19560 | loss 3.299970 (+0.76z)| norm 0.2433 (-0.32z)| lr 1.94e-05 | 4161.56 ms | 32.4% bf16 MFU | 125571 tok/s step 17389/19560 | loss 3.299785 (+0.75z)| norm 0.2265 (-1.95z)| lr 1.94e-05 | 4187.35 ms | 32.2% bf16 MFU | 125553 tok/s step 17390/19560 | loss 3.207359 (-1.63z)| norm 0.2430 (-0.32z)| lr 1.94e-05 | 4176.41 ms | 32.3% bf16 MFU | 125552 tok/s step 17391/19560 | loss 3.270880 (+0.01z)| norm 0.2412 (-0.51z)| lr 1.94e-05 | 4182.92 ms | 32.3% bf16 MFU | 125542 tok/s step 17392/19560 | loss 3.260624 (-0.27z)| norm 0.2366 (-1.00z)| lr 1.94e-05 | 4183.80 ms | 32.3% bf16 MFU | 125530 tok/s step 17393/19560 | loss 3.295059 (+0.64z)| norm 0.2373 (-0.93z)| lr 1.94e-05 | 4173.81 ms | 32.3% bf16 MFU | 125535 tok/s step 17394/19560 | loss 3.223728 (-1.23z)| norm 0.2402 (-0.64z)| lr 1.93e-05 | 4172.06 ms | 32.4% bf16 MFU | 125541 tok/s step 17395/19560 | loss 3.340397 (+1.83z)| norm 0.2519 (+0.63z)| lr 1.93e-05 | 4164.24 ms | 32.4% bf16 MFU | 125559 tok/s step 17396/19560 | loss 3.266719 (-0.11z)| norm 0.2456 (-0.05z)| lr 1.93e-05 | 4161.91 ms | 32.4% bf16 MFU | 125580 tok/s step 17397/19560 | loss 3.264521 (-0.16z)| norm 0.2391 (-0.75z)| lr 1.93e-05 | 4173.85 ms | 32.3% bf16 MFU | 125582 tok/s step 17398/19560 | loss 3.261915 (-0.22z)| norm 
0.2355 (-1.13z)| lr 1.93e-05 | 4164.54 ms | 32.4% bf16 MFU | 125597 tok/s step 17399/19560 | loss 3.194647 (-1.95z)| norm 0.2385 (-0.79z)| lr 1.92e-05 | 4169.04 ms | 32.4% bf16 MFU | 125605 tok/s step 17400/19560 | loss 3.299191 (+0.77z)| norm 0.2401 (-0.60z)| lr 1.92e-05 | 4177.27 ms | 32.3% bf16 MFU | 125600 tok/s step 17401/19560 | loss 3.282525 (+0.34z)| norm 0.2545 (+0.98z)| lr 1.92e-05 | 4174.63 ms | 32.3% bf16 MFU | 125600 tok/s step 17402/19560 | loss 3.266229 (-0.10z)| norm 0.2424 (-0.35z)| lr 1.92e-05 | 4175.14 ms | 32.3% bf16 MFU | 125598 tok/s step 17403/19560 | loss 3.316727 (+1.21z)| norm 0.2406 (-0.55z)| lr 1.92e-05 | 4193.03 ms | 32.2% bf16 MFU | 125570 tok/s step 17404/19560 | loss 3.278533 (+0.22z)| norm 0.2333 (-1.33z)| lr 1.92e-05 | 4173.58 ms | 32.4% bf16 MFU | 125573 tok/s step 17405/19560 | loss 3.301401 (+0.80z)| norm 0.2489 (+0.35z)| lr 1.91e-05 | 4184.07 ms | 32.3% bf16 MFU | 125560 tok/s step 17406/19560 | loss 3.280722 (+0.26z)| norm 0.2531 (+0.80z)| lr 1.91e-05 | 4170.22 ms | 32.4% bf16 MFU | 125568 tok/s step 17407/19560 | loss 3.250556 (-0.54z)| norm 0.2342 (-1.25z)| lr 1.91e-05 | 4201.71 ms | 32.1% bf16 MFU | 125528 tok/s step 17408/19560 | loss 3.305503 (+0.89z)| norm 0.2413 (-0.47z)| lr 1.91e-05 | 4160.40 ms | 32.5% bf16 MFU | 125553 tok/s step 17409/19560 | loss 3.278581 (+0.19z)| norm 0.2431 (-0.27z)| lr 1.91e-05 | 4148.21 ms | 32.5% bf16 MFU | 125595 tok/s step 17410/19560 | loss 3.293379 (+0.58z)| norm 0.2440 (-0.17z)| lr 1.91e-05 | 4176.75 ms | 32.3% bf16 MFU | 125591 tok/s step 17411/19560 | loss 3.286222 (+0.40z)| norm 0.2430 (-0.28z)| lr 1.90e-05 | 4169.61 ms | 32.4% bf16 MFU | 125599 tok/s step 17412/19560 | loss 3.232426 (-1.01z)| norm 0.2414 (-0.45z)| lr 1.90e-05 | 4151.96 ms | 32.5% bf16 MFU | 125632 tok/s step 17413/19560 | loss 3.244138 (-0.69z)| norm 0.2398 (-0.63z)| lr 1.90e-05 | 4159.51 ms | 32.5% bf16 MFU | 125653 tok/s step 17414/19560 | loss 3.254680 (-0.40z)| norm 0.2488 (+0.35z)| lr 1.90e-05 | 4177.11 ms | 
32.3% bf16 MFU | 125646 tok/s step 17415/19560 | loss 3.264260 (-0.15z)| norm 0.2532 (+0.82z)| lr 1.90e-05 | 4167.66 ms | 32.4% bf16 MFU | 125654 tok/s step 17416/19560 | loss 3.324273 (+1.49z)| norm 0.2642 (+1.97z)| lr 1.89e-05 | 4175.40 ms | 32.3% bf16 MFU | 125649 tok/s step 17417/19560 | loss 3.302449 (+0.89z)| norm 0.2501 (+0.48z)| lr 1.89e-05 | 4167.33 ms | 32.4% bf16 MFU | 125657 tok/s step 17418/19560 | loss 3.247760 (-0.59z)| norm 0.2385 (-0.78z)| lr 1.89e-05 | 4176.30 ms | 32.3% bf16 MFU | 125651 tok/s step 17419/19560 | loss 3.318383 (+1.31z)| norm 0.2362 (-1.02z)| lr 1.89e-05 | 5006.60 ms | 27.0% bf16 MFU | 124605 tok/s step 17420/19560 | loss 3.285810 (+0.43z)| norm 0.2407 (-0.53z)| lr 1.89e-05 | 4155.08 ms | 32.5% bf16 MFU | 124684 tok/s step 17421/19560 | loss 3.234700 (-0.94z)| norm 0.2373 (-0.89z)| lr 1.89e-05 | 4180.89 ms | 32.3% bf16 MFU | 124720 tok/s step 17422/19560 | loss 3.293191 (+0.67z)| norm 0.2546 (+0.97z)| lr 1.88e-05 | 4178.99 ms | 32.3% bf16 MFU | 124756 tok/s step 17423/19560 | loss 3.275575 (+0.19z)| norm 0.2365 (-0.97z)| lr 1.88e-05 | 4150.96 ms | 32.5% bf16 MFU | 124834 tok/s step 17424/19560 | loss 3.276104 (+0.22z)| norm 0.2448 (-0.08z)| lr 1.88e-05 | 4175.68 ms | 32.3% bf16 MFU | 124870 tok/s step 17425/19560 | loss 3.239302 (-0.80z)| norm 0.2382 (-0.77z)| lr 1.88e-05 | 4177.26 ms | 32.3% bf16 MFU | 124902 tok/s step 17426/19560 | loss 3.319774 (+1.41z)| norm 0.2504 (+0.53z)| lr 1.88e-05 | 4162.13 ms | 32.4% bf16 MFU | 124955 tok/s step 17427/19560 | loss 3.324126 (+1.50z)| norm 0.2466 (+0.11z)| lr 1.88e-05 | 4156.23 ms | 32.5% bf16 MFU | 125015 tok/s step 17428/19560 | loss 3.251715 (-0.50z)| norm 0.2427 (-0.30z)| lr 1.87e-05 | 4170.77 ms | 32.4% bf16 MFU | 125049 tok/s step 17429/19560 | loss 3.423888 (+4.02z)| norm 0.2493 (+0.41z)| lr 1.87e-05 | 4165.51 ms | 32.4% bf16 MFU | 125090 tok/s step 17430/19560 | loss 3.231360 (-1.06z)| norm 0.2373 (-0.88z)| lr 1.87e-05 | 4351.55 ms | 31.0% bf16 MFU | 124860 tok/s step 17431/19560 
| loss 3.218163 (-1.41z)| norm 0.2375 (-0.87z)| lr 1.87e-05 | 4153.93 ms | 32.5% bf16 MFU | 124927 tok/s step 17432/19560 | loss 3.273815 (+0.11z)| norm 0.2494 (+0.43z)| lr 1.87e-05 | 4177.52 ms | 32.3% bf16 MFU | 124956 tok/s step 17433/19560 | loss 3.297202 (+0.74z)| norm 0.2380 (-0.80z)| lr 1.87e-05 | 4152.95 ms | 32.5% bf16 MFU | 125021 tok/s step 17434/19560 | loss 3.265413 (-0.13z)| norm 0.2478 (+0.28z)| lr 1.86e-05 | 4157.57 ms | 32.5% bf16 MFU | 125075 tok/s step 17435/19560 | loss 3.305752 (+0.95z)| norm 0.2396 (-0.62z)| lr 1.86e-05 | 4161.25 ms | 32.4% bf16 MFU | 125121 tok/s step 17436/19560 | loss 3.272796 (+0.06z)| norm 0.2557 (+1.14z)| lr 1.86e-05 | 4179.51 ms | 32.3% bf16 MFU | 125137 tok/s step 17437/19560 | loss 3.282425 (+0.32z)| norm 0.2390 (-0.69z)| lr 1.86e-05 | 4170.83 ms | 32.4% bf16 MFU | 125165 tok/s step 17438/19560 | loss 3.262064 (-0.25z)| norm 0.2358 (-1.04z)| lr 1.86e-05 | 4171.02 ms | 32.4% bf16 MFU | 125192 tok/s step 17439/19560 | loss 3.257376 (-0.37z)| norm 0.2339 (-1.24z)| lr 1.85e-05 | 4194.79 ms | 32.2% bf16 MFU | 125181 tok/s step 17440/19560 | loss 3.261054 (-0.27z)| norm 0.2376 (-0.85z)| lr 1.85e-05 | 4157.57 ms | 32.5% bf16 MFU | 125228 tok/s step 17441/19560 | loss 3.271088 (-0.00z)| norm 0.2591 (+1.48z)| lr 1.85e-05 | 4190.73 ms | 32.2% bf16 MFU | 125222 tok/s step 17442/19560 | loss 3.255224 (-0.44z)| norm 0.2328 (-1.36z)| lr 1.85e-05 | 4166.08 ms | 32.4% bf16 MFU | 125253 tok/s step 17443/19560 | loss 3.246046 (-0.69z)| norm 0.2331 (-1.31z)| lr 1.85e-05 | 4178.46 ms | 32.3% bf16 MFU | 125264 tok/s step 17444/19560 | loss 3.253689 (-0.47z)| norm 0.2489 (+0.38z)| lr 1.85e-05 | 4212.95 ms | 32.0% bf16 MFU | 125223 tok/s step 17445/19560 | loss 3.311532 (+1.11z)| norm 0.2401 (-0.55z)| lr 1.84e-05 | 4156.27 ms | 32.5% bf16 MFU | 125269 tok/s step 17446/19560 | loss 3.261388 (-0.28z)| norm 0.2395 (-0.61z)| lr 1.84e-05 | 4161.09 ms | 32.4% bf16 MFU | 125305 tok/s step 17447/19560 | loss 3.196494 (-2.05z)| norm 0.2333 (-1.26z)| 
lr 1.84e-05 | 4171.94 ms | 32.4% bf16 MFU | 125324 tok/s step 17448/19560 | loss 3.340312 (+1.85z)| norm 0.2573 (+1.28z)| lr 1.84e-05 | 4156.42 ms | 32.5% bf16 MFU | 125364 tok/s step 17449/19560 | loss 3.241291 (-0.83z)| norm 0.2528 (+0.79z)| lr 1.84e-05 | 4177.62 ms | 32.3% bf16 MFU | 125371 tok/s step 17450/19560 | loss 3.348079 (+2.02z)| norm 0.2380 (-0.79z)| lr 1.84e-05 | 4164.36 ms | 32.4% bf16 MFU | 125398 tok/s step 17451/19560 | loss 3.256244 (-0.44z)| norm 0.2397 (-0.61z)| lr 1.83e-05 | 4170.53 ms | 32.4% bf16 MFU | 125413 tok/s step 17452/19560 | loss 3.266260 (-0.17z)| norm 0.2426 (-0.30z)| lr 1.83e-05 | 4175.21 ms | 32.3% bf16 MFU | 125421 tok/s step 17453/19560 | loss 3.295499 (+0.61z)| norm 0.2400 (-0.57z)| lr 1.83e-05 | 4168.85 ms | 32.4% bf16 MFU | 125438 tok/s step 17454/19560 | loss 3.303538 (+0.82z)| norm 0.2363 (-0.96z)| lr 1.83e-05 | 4175.44 ms | 32.3% bf16 MFU | 125445 tok/s step 17455/19560 | loss 3.353710 (+2.11z)| norm 0.2431 (-0.23z)| lr 1.83e-05 | 4158.60 ms | 32.5% bf16 MFU | 125476 tok/s step 17456/19560 | loss 3.304613 (+0.81z)| norm 0.2460 (+0.06z)| lr 1.83e-05 | 4187.53 ms | 32.2% bf16 MFU | 125462 tok/s step 17457/19560 | loss 3.248651 (-0.67z)| norm 0.2553 (+1.05z)| lr 1.82e-05 | 4167.99 ms | 32.4% bf16 MFU | 125479 tok/s step 17458/19560 | loss 3.264613 (-0.25z)| norm 0.2464 (+0.11z)| lr 1.82e-05 | 4178.48 ms | 32.3% bf16 MFU | 125479 tok/s step 17459/19560 | loss 3.293530 (+0.53z)| norm 0.2459 (+0.05z)| lr 1.82e-05 | 4195.90 ms | 32.2% bf16 MFU | 125452 tok/s step 17460/19560 | loss 3.342773 (+1.85z)| norm 0.2389 (-0.71z)| lr 1.82e-05 | 4280.58 ms | 31.5% bf16 MFU | 125304 tok/s step 17461/19560 | loss 3.324799 (+1.36z)| norm 0.2542 (+0.97z)| lr 1.82e-05 | 4173.18 ms | 32.4% bf16 MFU | 125320 tok/s step 17462/19560 | loss 3.191280 (-2.24z)| norm 0.2430 (-0.26z)| lr 1.82e-05 | 4213.89 ms | 32.0% bf16 MFU | 125275 tok/s step 17463/19560 | loss 3.266944 (-0.21z)| norm 0.2450 (-0.05z)| lr 1.81e-05 | 4253.10 ms | 31.7% bf16 MFU | 
125175 tok/s step 17464/19560 | loss 3.271190 (-0.09z)| norm 0.2379 (-0.83z)| lr 1.81e-05 | 4155.14 ms | 32.5% bf16 MFU | 125225 tok/s step 17465/19560 | loss 3.346806 (+1.90z)| norm 0.2467 (+0.16z)| lr 1.81e-05 | 4341.90 ms | 31.1% bf16 MFU | 125001 tok/s step 17466/19560 | loss 3.271826 (-0.09z)| norm 0.2459 (+0.08z)| lr 1.81e-05 | 4184.69 ms | 32.3% bf16 MFU | 125016 tok/s step 17467/19560 | loss 3.368003 (+2.40z)| norm 0.2641 (+2.08z)| lr 1.81e-05 | 4376.18 ms | 30.9% bf16 MFU | 124755 tok/s step 17468/19560 | loss 3.292587 (+0.43z)| norm 0.2414 (-0.44z)| lr 1.80e-05 | 4272.46 ms | 31.6% bf16 MFU | 124653 tok/s step 17469/19560 | loss 3.319426 (+1.13z)| norm 0.2471 (+0.22z)| lr 1.80e-05 | 4161.42 ms | 32.4% bf16 MFU | 124720 tok/s step 17470/19560 | loss 3.248283 (-0.71z)| norm 0.2606 (+1.77z)| lr 1.80e-05 | 4217.32 ms | 32.0% bf16 MFU | 124700 tok/s step 17471/19560 | loss 3.282220 (+0.18z)| norm 0.2823 (+4.02z)| lr 1.80e-05 | 4153.96 ms | 32.5% bf16 MFU | 124775 tok/s step 17472/19560 | loss 3.262506 (-0.34z)| norm 0.2624 (+1.85z)| lr 1.80e-05 | 4174.37 ms | 32.3% bf16 MFU | 124816 tok/s step 17473/19560 | loss 3.318424 (+1.12z)| norm 0.3129 (+6.23z)| lr 1.80e-05 | 4178.03 ms | 32.3% bf16 MFU | 124850 tok/s step 17474/19560 | loss 3.256873 (-0.49z)| norm 0.2431 (-0.23z)| lr 1.79e-05 | 4151.13 ms | 32.5% bf16 MFU | 124922 tok/s step 17475/19560 | loss 3.274465 (-0.04z)| norm 0.2482 (+0.24z)| lr 1.79e-05 | 4237.36 ms | 31.9% bf16 MFU | 124863 tok/s step 17476/19560 | loss 3.268694 (-0.18z)| norm 0.2454 (-0.01z)| lr 1.79e-05 | 4216.97 ms | 32.0% bf16 MFU | 124836 tok/s step 17477/19560 | loss 3.345792 (+1.81z)| norm 0.2841 (+3.39z)| lr 1.79e-05 | 4193.06 ms | 32.2% bf16 MFU | 124846 tok/s step 17478/19560 | loss 3.295626 (+0.50z)| norm 0.2477 (+0.17z)| lr 1.79e-05 | 4153.75 ms | 32.5% bf16 MFU | 124915 tok/s step 17479/19560 | loss 3.255530 (-0.56z)| norm 0.2502 (+0.39z)| lr 1.79e-05 | 4165.06 ms | 32.4% bf16 MFU | 124963 tok/s step 17480/19560 | loss 3.298927 
(+0.57z)| norm 0.2453 (-0.03z)| lr 1.78e-05 | 4165.90 ms | 32.4% bf16 MFU | 125007 tok/s step 17481/19560 | loss 3.270769 (-0.17z)| norm 0.2521 (+0.58z)| lr 1.78e-05 | 4302.15 ms | 31.4% bf16 MFU | 124850 tok/s step 17482/19560 | loss 3.399047 (+3.10z)| norm 0.2426 (-0.27z)| lr 1.78e-05 | 4171.18 ms | 32.4% bf16 MFU | 124893 tok/s step 17483/19560 | loss 3.257128 (-0.52z)| norm 0.2415 (-0.36z)| lr 1.78e-05 | 4169.02 ms | 32.4% bf16 MFU | 124936 tok/s step 17484/19560 | loss 3.292488 (+0.38z)| norm 0.2515 (+0.53z)| lr 1.78e-05 | 4146.88 ms | 32.6% bf16 MFU | 125011 tok/s step 17485/19560 | loss 3.332639 (+1.41z)| norm 0.2477 (+0.20z)| lr 1.78e-05 | 4159.17 ms | 32.5% bf16 MFU | 125063 tok/s step 17486/19560 | loss 3.286888 (+0.22z)| norm 0.2341 (-1.04z)| lr 1.77e-05 | 4163.44 ms | 32.4% bf16 MFU | 125106 tok/s step 17487/19560 | loss 3.262080 (-0.44z)| norm 0.2509 (+0.49z)| lr 1.77e-05 | 4280.52 ms | 31.5% bf16 MFU | 124975 tok/s step 17488/19560 | loss 3.287588 (+0.23z)| norm 0.2472 (+0.14z)| lr 1.77e-05 | 4156.86 ms | 32.5% bf16 MFU | 125032 tok/s step 17489/19560 | loss 3.285888 (+0.18z)| norm 0.2490 (+0.30z)| lr 1.77e-05 | 4161.79 ms | 32.4% bf16 MFU | 125080 tok/s step 17490/19560 | loss 3.354289 (+1.94z)| norm 0.2603 (+1.31z)| lr 1.77e-05 | 4163.23 ms | 32.4% bf16 MFU | 125122 tok/s step 17491/19560 | loss 3.309513 (+0.77z)| norm 0.2421 (-0.33z)| lr 1.77e-05 | 4282.79 ms | 31.5% bf16 MFU | 124987 tok/s step 17492/19560 | loss 3.275212 (-0.12z)| norm 0.2378 (-0.72z)| lr 1.76e-05 | 4265.40 ms | 31.7% bf16 MFU | 124883 tok/s step 17493/19560 | loss 3.311615 (+0.82z)| norm 0.2553 (+0.86z)| lr 1.76e-05 | 4250.18 ms | 31.8% bf16 MFU | 124807 tok/s step 17494/19560 | loss 3.267838 (-0.33z)| norm 0.2584 (+1.13z)| lr 1.76e-05 | 4183.66 ms | 32.3% bf16 MFU | 124833 tok/s step 17495/19560 | loss 3.308661 (+0.72z)| norm 0.2478 (+0.17z)| lr 1.76e-05 | 4162.41 ms | 32.4% bf16 MFU | 124889 tok/s step 17496/19560 | loss 3.283603 (+0.07z)| norm 0.2400 (-0.54z)| lr 1.76e-05 | 
4179.80 ms | 32.3% bf16 MFU | 124916 tok/s step 17497/19560 | loss 3.303830 (+0.61z)| norm 0.2472 (+0.11z)| lr 1.76e-05 | 4289.70 ms | 31.5% bf16 MFU | 124781 tok/s step 17498/19560 | loss 3.261537 (-0.51z)| norm 0.2411 (-0.45z)| lr 1.75e-05 | 4164.72 ms | 32.4% bf16 MFU | 124837 tok/s step 17499/19560 | loss 3.278572 (-0.07z)| norm 0.2439 (-0.19z)| lr 1.75e-05 | 4165.90 ms | 32.4% bf16 MFU | 124887 tok/s step 17500/19560 | loss 3.269047 (-0.33z)| norm 0.2564 (+0.93z)| lr 1.75e-05 | 4333.26 ms | 31.2% bf16 MFU | 124693 tok/s val loss 3.270568 evaluating HellaSwag: 0/1256 evaluating HellaSwag: 10/1256 evaluating HellaSwag: 20/1256 evaluating HellaSwag: 30/1256 evaluating HellaSwag: 40/1256 evaluating HellaSwag: 50/1256 evaluating HellaSwag: 60/1256 evaluating HellaSwag: 70/1256 evaluating HellaSwag: 80/1256 evaluating HellaSwag: 90/1256 evaluating HellaSwag: 100/1256 evaluating HellaSwag: 110/1256 evaluating HellaSwag: 120/1256 evaluating HellaSwag: 130/1256 evaluating HellaSwag: 140/1256 evaluating HellaSwag: 150/1256 evaluating HellaSwag: 160/1256 evaluating HellaSwag: 170/1256 evaluating HellaSwag: 180/1256 evaluating HellaSwag: 190/1256 evaluating HellaSwag: 200/1256 evaluating HellaSwag: 210/1256 evaluating HellaSwag: 220/1256 evaluating HellaSwag: 230/1256 evaluating HellaSwag: 240/1256 evaluating HellaSwag: 250/1256 evaluating HellaSwag: 260/1256 evaluating HellaSwag: 270/1256 evaluating HellaSwag: 280/1256 evaluating HellaSwag: 290/1256 evaluating HellaSwag: 300/1256 evaluating HellaSwag: 310/1256 evaluating HellaSwag: 320/1256 evaluating HellaSwag: 330/1256 evaluating HellaSwag: 340/1256 evaluating HellaSwag: 350/1256 evaluating HellaSwag: 360/1256 evaluating HellaSwag: 370/1256 evaluating HellaSwag: 380/1256 evaluating HellaSwag: 390/1256 evaluating HellaSwag: 400/1256 evaluating HellaSwag: 410/1256 evaluating HellaSwag: 420/1256 evaluating HellaSwag: 430/1256 evaluating HellaSwag: 440/1256 evaluating HellaSwag: 450/1256 evaluating HellaSwag: 460/1256 
evaluating HellaSwag: 470/1256 evaluating HellaSwag: 480/1256 evaluating HellaSwag: 490/1256 evaluating HellaSwag: 500/1256 evaluating HellaSwag: 510/1256 evaluating HellaSwag: 520/1256 evaluating HellaSwag: 530/1256 evaluating HellaSwag: 540/1256 evaluating HellaSwag: 550/1256 evaluating HellaSwag: 560/1256 evaluating HellaSwag: 570/1256 evaluating HellaSwag: 580/1256 evaluating HellaSwag: 590/1256 evaluating HellaSwag: 600/1256 evaluating HellaSwag: 610/1256 evaluating HellaSwag: 620/1256 evaluating HellaSwag: 630/1256 evaluating HellaSwag: 640/1256 evaluating HellaSwag: 650/1256 evaluating HellaSwag: 660/1256 evaluating HellaSwag: 670/1256 evaluating HellaSwag: 680/1256 evaluating HellaSwag: 690/1256 evaluating HellaSwag: 700/1256 evaluating HellaSwag: 710/1256 evaluating HellaSwag: 720/1256 evaluating HellaSwag: 730/1256 evaluating HellaSwag: 740/1256 evaluating HellaSwag: 750/1256 evaluating HellaSwag: 760/1256 evaluating HellaSwag: 770/1256 evaluating HellaSwag: 780/1256 evaluating HellaSwag: 790/1256 evaluating HellaSwag: 800/1256 evaluating HellaSwag: 810/1256 evaluating HellaSwag: 820/1256 evaluating HellaSwag: 830/1256 evaluating HellaSwag: 840/1256 evaluating HellaSwag: 850/1256 evaluating HellaSwag: 860/1256 evaluating HellaSwag: 870/1256 evaluating HellaSwag: 880/1256 evaluating HellaSwag: 890/1256 evaluating HellaSwag: 900/1256 evaluating HellaSwag: 910/1256 evaluating HellaSwag: 920/1256 evaluating HellaSwag: 930/1256 evaluating HellaSwag: 940/1256 evaluating HellaSwag: 950/1256 evaluating HellaSwag: 960/1256 evaluating HellaSwag: 970/1256 evaluating HellaSwag: 980/1256 evaluating HellaSwag: 990/1256 evaluating HellaSwag: 1000/1256 evaluating HellaSwag: 1010/1256 evaluating HellaSwag: 1020/1256 evaluating HellaSwag: 1030/1256 evaluating HellaSwag: 1040/1256 evaluating HellaSwag: 1050/1256 evaluating HellaSwag: 1060/1256 evaluating HellaSwag: 1070/1256 evaluating HellaSwag: 1080/1256 evaluating HellaSwag: 1090/1256 evaluating HellaSwag: 1100/1256 
evaluating HellaSwag: 1110/1256 evaluating HellaSwag: 1120/1256 evaluating HellaSwag: 1130/1256 evaluating HellaSwag: 1140/1256 evaluating HellaSwag: 1150/1256 evaluating HellaSwag: 1160/1256 evaluating HellaSwag: 1170/1256 evaluating HellaSwag: 1180/1256 evaluating HellaSwag: 1190/1256 evaluating HellaSwag: 1200/1256 evaluating HellaSwag: 1210/1256 evaluating HellaSwag: 1220/1256 evaluating HellaSwag: 1230/1256 evaluating HellaSwag: 1240/1256 evaluating HellaSwag: 1250/1256 HellaSwag: 3028/10042 = 0.301534 step 17501/19560 | loss 3.284454 (+0.08z)| norm 0.2394 (-0.59z)| lr 1.75e-05 | 4159.02 ms | 32.5% bf16 MFU | 124761 tok/s step 17502/19560 | loss 3.256562 (-0.66z)| norm 0.2420 (-0.35z)| lr 1.75e-05 | 4220.25 ms | 32.0% bf16 MFU | 124735 tok/s step 17503/19560 | loss 3.297016 (+0.41z)| norm 0.2299 (-1.42z)| lr 1.75e-05 | 4261.63 ms | 31.7% bf16 MFU | 124649 tok/s step 17504/19560 | loss 3.252089 (-0.79z)| norm 0.2618 (+1.42z)| lr 1.74e-05 | 4163.21 ms | 32.4% bf16 MFU | 124713 tok/s step 17505/19560 | loss 3.254006 (-0.73z)| norm 0.2525 (+0.58z)| lr 1.74e-05 | 4183.94 ms | 32.3% bf16 MFU | 124743 tok/s step 17506/19560 | loss 3.311776 (+0.80z)| norm 0.2444 (-0.13z)| lr 1.74e-05 | 4232.46 ms | 31.9% bf16 MFU | 124700 tok/s step 17507/19560 | loss 3.299058 (+0.45z)| norm 0.2371 (-0.78z)| lr 1.74e-05 | 4194.28 ms | 32.2% bf16 MFU | 124715 tok/s step 17508/19560 | loss 3.266924 (-0.42z)| norm 0.2473 (+0.16z)| lr 1.74e-05 | 4210.88 ms | 32.1% bf16 MFU | 124704 tok/s step 17509/19560 | loss 3.306484 (+0.63z)| norm 0.2370 (-0.78z)| lr 1.74e-05 | 4196.32 ms | 32.2% bf16 MFU | 124716 tok/s step 17510/19560 | loss 3.284079 (+0.02z)| norm 0.2416 (-0.37z)| lr 1.73e-05 | 4168.62 ms | 32.4% bf16 MFU | 124769 tok/s step 17511/19560 | loss 3.262713 (-0.54z)| norm 0.2383 (-0.66z)| lr 1.73e-05 | 4187.77 ms | 32.2% bf16 MFU | 124790 tok/s step 17512/19560 | loss 3.281128 (-0.04z)| norm 0.2498 (+0.39z)| lr 1.73e-05 | 4168.77 ms | 32.4% bf16 MFU | 124839 tok/s step 17513/19560 | 
loss 3.306906 (+0.65z)| norm 0.2458 (+0.03z)| lr 1.73e-05 | 4169.50 ms | 32.4% bf16 MFU | 124884 tok/s step 17514/19560 | loss 3.287295 (+0.12z)| norm 0.2490 (+0.32z)| lr 1.73e-05 | 4180.49 ms | 32.3% bf16 MFU | 124911 tok/s step 17515/19560 | loss 3.332246 (+1.33z)| norm 0.2421 (-0.33z)| lr 1.73e-05 | 4193.58 ms | 32.2% bf16 MFU | 124916 tok/s step 17516/19560 | loss 3.341858 (+1.57z)| norm 0.2603 (+1.35z)| lr 1.72e-05 | 4159.85 ms | 32.5% bf16 MFU | 124972 tok/s step 17517/19560 | loss 3.305962 (+0.60z)| norm 0.2471 (+0.11z)| lr 1.72e-05 | 4158.56 ms | 32.5% bf16 MFU | 125027 tok/s step 17518/19560 | loss 3.316305 (+0.87z)| norm 0.2370 (-0.84z)| lr 1.72e-05 | 4162.60 ms | 32.4% bf16 MFU | 125073 tok/s step 17519/19560 | loss 3.236909 (-1.28z)| norm 0.2522 (+0.58z)| lr 1.72e-05 | 4160.78 ms | 32.5% bf16 MFU | 125120 tok/s step 17520/19560 | loss 3.260503 (-0.64z)| norm 0.2441 (-0.19z)| lr 1.72e-05 | 4159.88 ms | 32.5% bf16 MFU | 125166 tok/s step 17521/19560 | loss 3.234718 (-1.32z)| norm 0.2468 (+0.06z)| lr 1.72e-05 | 4152.50 ms | 32.5% bf16 MFU | 125220 tok/s step 17522/19560 | loss 3.234356 (-1.34z)| norm 0.2500 (+0.36z)| lr 1.71e-05 | 4266.65 ms | 31.6% bf16 MFU | 125103 tok/s step 17523/19560 | loss 3.316609 (+0.89z)| norm 0.2351 (-1.04z)| lr 1.71e-05 | 4153.79 ms | 32.5% bf16 MFU | 125159 tok/s step 17524/19560 | loss 3.288694 (+0.13z)| norm 0.2379 (-0.76z)| lr 1.71e-05 | 4226.28 ms | 31.9% bf16 MFU | 125104 tok/s step 17525/19560 | loss 3.285757 (+0.05z)| norm 0.2521 (+0.56z)| lr 1.71e-05 | 4182.71 ms | 32.3% bf16 MFU | 125116 tok/s step 17526/19560 | loss 3.292752 (+0.23z)| norm 0.2450 (-0.11z)| lr 1.71e-05 | 4166.19 ms | 32.4% bf16 MFU | 125153 tok/s step 17527/19560 | loss 3.253493 (-0.87z)| norm 0.2459 (-0.03z)| lr 1.71e-05 | 4178.25 ms | 32.3% bf16 MFU | 125169 tok/s step 17528/19560 | loss 3.247757 (-1.01z)| norm 0.2391 (-0.68z)| lr 1.70e-05 | 4159.13 ms | 32.5% bf16 MFU | 125213 tok/s step 17529/19560 | loss 3.332314 (+1.31z)| norm 0.2448 (-0.12z)| 
lr 1.70e-05 | 4170.81 ms | 32.4% bf16 MFU | 125238 tok/s step 17530/19560 | loss 3.240085 (-1.22z)| norm 0.2586 (+1.17z)| lr 1.70e-05 | 4162.37 ms | 32.4% bf16 MFU | 125274 tok/s step 17531/19560 | loss 3.295376 (+0.30z)| norm 0.2376 (-0.81z)| lr 1.70e-05 | 4179.53 ms | 32.3% bf16 MFU | 125282 tok/s step 17532/19560 | loss 3.283287 (-0.03z)| norm 0.2357 (-1.00z)| lr 1.70e-05 | 4177.31 ms | 32.3% bf16 MFU | 125294 tok/s step 17533/19560 | loss 3.254310 (-0.81z)| norm 0.2312 (-1.40z)| lr 1.70e-05 | 4161.07 ms | 32.4% bf16 MFU | 125329 tok/s step 17534/19560 | loss 3.270746 (-0.36z)| norm 0.2372 (-0.82z)| lr 1.69e-05 | 4159.73 ms | 32.5% bf16 MFU | 125364 tok/s step 17535/19560 | loss 3.284140 (-0.00z)| norm 0.2429 (-0.30z)| lr 1.69e-05 | 4182.32 ms | 32.3% bf16 MFU | 125364 tok/s step 17536/19560 | loss 3.309708 (+0.70z)| norm 0.2397 (-0.60z)| lr 1.69e-05 | 4235.45 ms | 31.9% bf16 MFU | 125285 tok/s step 17537/19560 | loss 3.293051 (+0.24z)| norm 0.2393 (-0.63z)| lr 1.69e-05 | 4165.11 ms | 32.4% bf16 MFU | 125315 tok/s step 17538/19560 | loss 3.275088 (-0.25z)| norm 0.2293 (-1.54z)| lr 1.69e-05 | 4160.46 ms | 32.5% bf16 MFU | 125350 tok/s step 17539/19560 | loss 3.229255 (-1.48z)| norm 0.2464 (+0.04z)| lr 1.69e-05 | 4156.07 ms | 32.5% bf16 MFU | 125390 tok/s step 17540/19560 | loss 3.251346 (-0.89z)| norm 0.2351 (-1.00z)| lr 1.68e-05 | 4186.27 ms | 32.3% bf16 MFU | 125382 tok/s step 17541/19560 | loss 3.236433 (-1.29z)| norm 0.2262 (-1.79z)| lr 1.68e-05 | 4151.32 ms | 32.5% bf16 MFU | 125428 tok/s step 17542/19560 | loss 3.283371 (-0.02z)| norm 0.2357 (-0.91z)| lr 1.68e-05 | 4159.50 ms | 32.5% bf16 MFU | 125459 tok/s step 17543/19560 | loss 3.287437 (+0.09z)| norm 0.2349 (-0.97z)| lr 1.68e-05 | 4207.46 ms | 32.1% bf16 MFU | 125416 tok/s step 17544/19560 | loss 3.260215 (-0.65z)| norm 0.2496 (+0.39z)| lr 1.68e-05 | 4242.21 ms | 31.8% bf16 MFU | 125325 tok/s step 17545/19560 | loss 3.240721 (-1.16z)| norm 0.2363 (-0.83z)| lr 1.68e-05 | 4158.42 ms | 32.5% bf16 MFU | 
125363 tok/s step 17546/19560 | loss 3.220604 (-1.70z)| norm 0.2330 (-1.13z)| lr 1.67e-05 | 4169.33 ms | 32.4% bf16 MFU | 125382 tok/s step 17547/19560 | loss 3.287884 (+0.14z)| norm 0.2417 (-0.33z)| lr 1.67e-05 | 4164.32 ms | 32.4% bf16 MFU | 125408 tok/s step 17548/19560 | loss 3.258183 (-0.66z)| norm 0.2444 (-0.08z)| lr 1.67e-05 | 4220.20 ms | 32.0% bf16 MFU | 125349 tok/s step 17549/19560 | loss 3.262273 (-0.56z)| norm 0.2409 (-0.42z)| lr 1.67e-05 | 4167.96 ms | 32.4% bf16 MFU | 125371 tok/s step 17550/19560 | loss 3.328124 (+1.23z)| norm 0.2425 (-0.26z)| lr 1.67e-05 | 4273.32 ms | 31.6% bf16 MFU | 125237 tok/s step 17551/19560 | loss 3.320236 (+1.00z)| norm 0.2445 (-0.08z)| lr 1.67e-05 | 4170.30 ms | 32.4% bf16 MFU | 125261 tok/s step 17552/19560 | loss 3.266796 (-0.45z)| norm 0.2376 (-0.71z)| lr 1.66e-05 | 4180.17 ms | 32.3% bf16 MFU | 125269 tok/s step 17553/19560 | loss 3.308438 (+0.67z)| norm 0.2553 (+0.91z)| lr 1.66e-05 | 4260.33 ms | 31.7% bf16 MFU | 125159 tok/s step 17554/19560 | loss 3.266113 (-0.47z)| norm 0.2387 (-0.61z)| lr 1.66e-05 | 4217.33 ms | 32.0% bf16 MFU | 125117 tok/s step 17555/19560 | loss 3.286863 (+0.10z)| norm 0.2450 (-0.03z)| lr 1.66e-05 | 4152.71 ms | 32.5% bf16 MFU | 125174 tok/s step 17556/19560 | loss 3.260499 (-0.62z)| norm 0.2472 (+0.17z)| lr 1.66e-05 | 4162.19 ms | 32.4% bf16 MFU | 125213 tok/s step 17557/19560 | loss 3.199749 (-2.34z)| norm 0.2411 (-0.39z)| lr 1.66e-05 | 4152.21 ms | 32.5% bf16 MFU | 125266 tok/s step 17558/19560 | loss 3.265017 (-0.48z)| norm 0.2317 (-1.24z)| lr 1.65e-05 | 4312.84 ms | 31.3% bf16 MFU | 125081 tok/s step 17559/19560 | loss 3.251913 (-0.88z)| norm 0.2437 (-0.15z)| lr 1.65e-05 | 4162.02 ms | 32.4% bf16 MFU | 125125 tok/s step 17560/19560 | loss 3.245280 (-1.06z)| norm 0.2428 (-0.23z)| lr 1.65e-05 | 4169.98 ms | 32.4% bf16 MFU | 125155 tok/s step 17561/19560 | loss 3.315413 (+0.97z)| norm 0.2443 (-0.09z)| lr 1.65e-05 | 4173.45 ms | 32.4% bf16 MFU | 125179 tok/s step 17562/19560 | loss 3.253317 
(-0.82z)| norm 0.2310 (-1.30z)| lr 1.65e-05 | 4285.19 ms | 31.5% bf16 MFU | 125037 tok/s step 17563/19560 | loss 3.255517 (-0.75z)| norm 0.2391 (-0.56z)| lr 1.65e-05 | 4162.73 ms | 32.4% bf16 MFU | 125083 tok/s step 17564/19560 | loss 3.282720 (+0.03z)| norm 0.2707 (+2.29z)| lr 1.64e-05 | 4256.36 ms | 31.7% bf16 MFU | 124988 tok/s step 17565/19560 | loss 3.293360 (+0.34z)| norm 0.2323 (-1.16z)| lr 1.64e-05 | 4152.58 ms | 32.5% bf16 MFU | 125051 tok/s step 17566/19560 | loss 3.243888 (-1.08z)| norm 0.2399 (-0.48z)| lr 1.64e-05 | 4172.73 ms | 32.4% bf16 MFU | 125081 tok/s step 17567/19560 | loss 3.243658 (-1.08z)| norm 0.2435 (-0.17z)| lr 1.64e-05 | 4179.45 ms | 32.3% bf16 MFU | 125099 tok/s step 17568/19560 | loss 3.311698 (+0.86z)| norm 0.2649 (+1.73z)| lr 1.64e-05 | 4252.53 ms | 31.7% bf16 MFU | 125008 tok/s step 17569/19560 | loss 3.357444 (+2.11z)| norm 0.2560 (+0.93z)| lr 1.64e-05 | 4167.02 ms | 32.4% bf16 MFU | 125049 tok/s step 17570/19560 | loss 3.287569 (+0.14z)| norm 0.2564 (+0.96z)| lr 1.63e-05 | 4218.63 ms | 32.0% bf16 MFU | 125010 tok/s step 17571/19560 | loss 3.364290 (+2.24z)| norm 0.2437 (-0.19z)| lr 1.63e-05 | 4237.64 ms | 31.9% bf16 MFU | 124946 tok/s step 17572/19560 | loss 3.268155 (-0.43z)| norm 0.2397 (-0.54z)| lr 1.63e-05 | 4253.10 ms | 31.7% bf16 MFU | 124862 tok/s step 17573/19560 | loss 3.322690 (+1.08z)| norm 0.2447 (-0.10z)| lr 1.63e-05 | 4161.89 ms | 32.4% bf16 MFU | 124918 tok/s step 17574/19560 | loss 3.327775 (+1.20z)| norm 0.2464 (+0.05z)| lr 1.63e-05 | 4155.11 ms | 32.5% bf16 MFU | 124981 tok/s step 17575/19560 | loss 3.315883 (+0.86z)| norm 0.2401 (-0.52z)| lr 1.63e-05 | 4149.19 ms | 32.5% bf16 MFU | 125050 tok/s step 17576/19560 | loss 3.269426 (-0.43z)| norm 0.2385 (-0.66z)| lr 1.63e-05 | 4249.14 ms | 31.8% bf16 MFU | 124967 tok/s step 17577/19560 | loss 3.303775 (+0.53z)| norm 0.2386 (-0.63z)| lr 1.62e-05 | 4167.09 ms | 32.4% bf16 MFU | 125009 tok/s step 17578/19560 | loss 3.280656 (-0.12z)| norm 0.2275 (-1.62z)| lr 1.62e-05 | 
4234.82 ms | 31.9% bf16 MFU | 124949 tok/s step 17579/19560 | loss 3.209995 (-2.12z)| norm 0.2377 (-0.70z)| lr 1.62e-05 | 4227.69 ms | 31.9% bf16 MFU | 124902 tok/s step 17580/19560 | loss 3.283671 (-0.02z)| norm 0.2450 (-0.05z)| lr 1.62e-05 | 4161.10 ms | 32.4% bf16 MFU | 124957 tok/s step 17581/19560 | loss 3.258446 (-0.73z)| norm 0.2331 (-1.11z)| lr 1.62e-05 | 4174.39 ms | 32.3% bf16 MFU | 124989 tok/s step 17582/19560 | loss 3.267746 (-0.46z)| norm 0.2330 (-1.11z)| lr 1.62e-05 | 4284.15 ms | 31.5% bf16 MFU | 124858 tok/s step 17583/19560 | loss 3.288940 (+0.16z)| norm 0.2546 (+0.80z)| lr 1.61e-05 | 4237.44 ms | 31.9% bf16 MFU | 124802 tok/s step 17584/19560 | loss 3.295394 (+0.35z)| norm 0.2443 (-0.11z)| lr 1.61e-05 | 4325.51 ms | 31.2% bf16 MFU | 124622 tok/s step 17585/19560 | loss 3.310919 (+0.79z)| norm 0.2385 (-0.62z)| lr 1.61e-05 | 4153.91 ms | 32.5% bf16 MFU | 124702 tok/s step 17586/19560 | loss 3.309531 (+0.74z)| norm 0.2370 (-0.74z)| lr 1.61e-05 | 4197.01 ms | 32.2% bf16 MFU | 124713 tok/s step 17587/19560 | loss 3.297329 (+0.38z)| norm 0.3207 (+5.74z)| lr 1.61e-05 | 4161.47 ms | 32.4% bf16 MFU | 124776 tok/s step 17588/19560 | loss 3.249491 (-0.99z)| norm 0.2426 (-0.26z)| lr 1.61e-05 | 4169.02 ms | 32.4% bf16 MFU | 124825 tok/s step 17589/19560 | loss 3.311761 (+0.83z)| norm 0.2574 (+0.88z)| lr 1.60e-05 | 4172.57 ms | 32.4% bf16 MFU | 124867 tok/s step 17590/19560 | loss 3.280682 (-0.10z)| norm 0.2519 (+0.45z)| lr 1.60e-05 | 4175.72 ms | 32.3% bf16 MFU | 124901 tok/s step 17591/19560 | loss 3.225981 (-1.72z)| norm 0.2360 (-0.76z)| lr 1.60e-05 | 4345.96 ms | 31.1% bf16 MFU | 124688 tok/s step 17592/19560 | loss 3.320266 (+1.08z)| norm 0.2534 (+0.56z)| lr 1.60e-05 | 4176.27 ms | 32.3% bf16 MFU | 124731 tok/s step 17593/19560 | loss 3.234757 (-1.45z)| norm 0.2496 (+0.27z)| lr 1.60e-05 | 4155.30 ms | 32.5% bf16 MFU | 124803 tok/s step 17594/19560 | loss 3.321165 (+1.12z)| norm 0.2462 (+0.01z)| lr 1.60e-05 | 4250.89 ms | 31.8% bf16 MFU | 124729 tok/s step 
17595/19560 | loss 3.328927 (+1.38z)| norm 0.2431 (-0.22z)| lr 1.59e-05 | 4232.64 ms | 31.9% bf16 MFU | 124686 tok/s step 17596/19560 | loss 3.306188 (+0.69z)| norm 0.2362 (-0.74z)| lr 1.59e-05 | 4225.99 ms | 31.9% bf16 MFU | 124655 tok/s step 17597/19560 | loss 3.313291 (+0.91z)| norm 0.2426 (-0.25z)| lr 1.59e-05 | 4345.73 ms | 31.1% bf16 MFU | 124455 tok/s step 17598/19560 | loss 3.311914 (+0.85z)| norm 0.2344 (-0.86z)| lr 1.59e-05 | 4380.96 ms | 30.8% bf16 MFU | 124216 tok/s step 17599/19560 | loss 3.283852 (-0.00z)| norm 0.2476 (+0.17z)| lr 1.59e-05 | 4165.94 ms | 32.4% bf16 MFU | 124297 tok/s step 17600/19560 | loss 3.305804 (+0.66z)| norm 0.2355 (-0.78z)| lr 1.59e-05 | 4572.86 ms | 29.5% bf16 MFU | 123815 tok/s step 17601/19560 | loss 3.283864 (-0.00z)| norm 0.2412 (-0.31z)| lr 1.58e-05 | 4438.90 ms | 30.4% bf16 MFU | 123530 tok/s step 17602/19560 | loss 3.280926 (-0.10z)| norm 0.2428 (-0.16z)| lr 1.58e-05 | 4300.49 ms | 31.4% bf16 MFU | 123449 tok/s step 17603/19560 | loss 3.265417 (-0.57z)| norm 0.2486 (+0.36z)| lr 1.58e-05 | 4263.75 ms | 31.7% bf16 MFU | 123425 tok/s step 17604/19560 | loss 3.258310 (-0.78z)| norm 0.2302 (-1.29z)| lr 1.58e-05 | 4156.97 ms | 32.5% bf16 MFU | 123560 tok/s step 17605/19560 | loss 3.267974 (-0.48z)| norm 0.2518 (+0.72z)| lr 1.58e-05 | 4165.03 ms | 32.4% bf16 MFU | 123676 tok/s step 17606/19560 | loss 3.334910 (+1.57z)| norm 0.2595 (+1.43z)| lr 1.58e-05 | 4159.47 ms | 32.5% bf16 MFU | 123794 tok/s step 17607/19560 | loss 3.309345 (+0.77z)| norm 0.2417 (-0.25z)| lr 1.58e-05 | 4247.55 ms | 31.8% bf16 MFU | 123776 tok/s step 17608/19560 | loss 3.353454 (+2.08z)| norm 0.2526 (+0.78z)| lr 1.57e-05 | 4201.88 ms | 32.1% bf16 MFU | 123826 tok/s step 17609/19560 | loss 3.325154 (+1.20z)| norm 0.2386 (-0.53z)| lr 1.57e-05 | 4458.82 ms | 30.3% bf16 MFU | 123514 tok/s step 17610/19560 | loss 3.325126 (+1.27z)| norm 0.2444 (+0.01z)| lr 1.57e-05 | 4257.42 ms | 31.7% bf16 MFU | 123496 tok/s step 17611/19560 | loss 3.245325 (-1.21z)| norm 
0.2348 (-0.89z)| lr 1.57e-05 | 4256.96 ms | 31.7% bf16 MFU | 123479 tok/s step 17612/19560 | loss 3.304071 (+0.61z)| norm 0.2342 (-0.92z)| lr 1.57e-05 | 4207.76 ms | 32.1% bf16 MFU | 123535 tok/s step 17613/19560 | loss 3.240270 (-1.35z)| norm 0.2334 (-0.98z)| lr 1.57e-05 | 4288.06 ms | 31.5% bf16 MFU | 123472 tok/s step 17614/19560 | loss 3.309835 (+0.81z)| norm 0.2473 (+0.31z)| lr 1.56e-05 | 4158.95 ms | 32.5% bf16 MFU | 123601 tok/s step 17615/19560 | loss 3.304107 (+0.62z)| norm 0.2445 (+0.04z)| lr 1.56e-05 | 4221.42 ms | 32.0% bf16 MFU | 123631 tok/s step 17616/19560 | loss 3.303740 (+0.60z)| norm 0.2518 (+0.74z)| lr 1.56e-05 | 4155.11 ms | 32.5% bf16 MFU | 123758 tok/s step 17617/19560 | loss 3.267425 (-0.52z)| norm 0.2389 (-0.47z)| lr 1.56e-05 | 4256.94 ms | 31.7% bf16 MFU | 123728 tok/s step 17618/19560 | loss 3.228431 (-1.71z)| norm 0.2379 (-0.55z)| lr 1.56e-05 | 4153.44 ms | 32.5% bf16 MFU | 123854 tok/s step 17619/19560 | loss 3.261572 (-0.66z)| norm 0.2390 (-0.45z)| lr 1.56e-05 | 4148.32 ms | 32.5% bf16 MFU | 123980 tok/s step 17620/19560 | loss 3.279184 (-0.11z)| norm 0.2461 (+0.21z)| lr 1.55e-05 | 4309.10 ms | 31.3% bf16 MFU | 123865 tok/s step 17621/19560 | loss 3.253647 (-0.90z)| norm 0.2439 (+0.02z)| lr 1.55e-05 | 4244.91 ms | 31.8% bf16 MFU | 123847 tok/s step 17622/19560 | loss 3.261384 (-0.65z)| norm 0.2311 (-1.18z)| lr 1.55e-05 | 4204.42 ms | 32.1% bf16 MFU | 123889 tok/s step 17623/19560 | loss 3.288785 (+0.21z)| norm 0.2348 (-0.82z)| lr 1.55e-05 | 4162.19 ms | 32.4% bf16 MFU | 123993 tok/s step 17624/19560 | loss 3.429089 (+4.23z)| norm 0.2529 (+0.89z)| lr 1.55e-05 | 4155.29 ms | 32.5% bf16 MFU | 124102 tok/s step 17625/19560 | loss 3.308281 (+0.72z)| norm 0.2408 (-0.25z)| lr 1.55e-05 | 4187.69 ms | 32.2% bf16 MFU | 124157 tok/s step 17626/19560 | loss 3.321623 (+1.09z)| norm 0.2387 (-0.45z)| lr 1.54e-05 | 4199.09 ms | 32.2% bf16 MFU | 124192 tok/s step 17627/19560 | loss 3.315297 (+0.90z)| norm 0.2504 (+0.66z)| lr 1.54e-05 | 4184.37 ms | 
32.3% bf16 MFU | 124247 tok/s step 17628/19560 | loss 3.268444 (-0.45z)| norm 0.2370 (-0.60z)| lr 1.54e-05 | 4160.14 ms | 32.5% bf16 MFU | 124336 tok/s step 17629/19560 | loss 3.298871 (+0.42z)| norm 0.2352 (-0.77z)| lr 1.54e-05 | 4150.99 ms | 32.5% bf16 MFU | 124435 tok/s step 17630/19560 | loss 3.304530 (+0.58z)| norm 0.2395 (-0.36z)| lr 1.54e-05 | 4235.26 ms | 31.9% bf16 MFU | 124402 tok/s step 17631/19560 | loss 3.308090 (+0.68z)| norm 0.2530 (+0.91z)| lr 1.54e-05 | 4229.12 ms | 31.9% bf16 MFU | 124381 tok/s step 17632/19560 | loss 3.174895 (-3.04z)| norm 0.2407 (-0.25z)| lr 1.54e-05 | 4151.32 ms | 32.5% bf16 MFU | 124477 tok/s step 17633/19560 | loss 3.313709 (+0.81z)| norm 0.2381 (-0.49z)| lr 1.53e-05 | 4251.38 ms | 31.8% bf16 MFU | 124419 tok/s step 17634/19560 | loss 3.282675 (-0.04z)| norm 0.2376 (-0.54z)| lr 1.53e-05 | 4157.42 ms | 32.5% bf16 MFU | 124503 tok/s step 17635/19560 | loss 3.273031 (-0.31z)| norm 0.2386 (-0.44z)| lr 1.53e-05 | 4207.31 ms | 32.1% bf16 MFU | 124509 tok/s step 17636/19560 | loss 3.292955 (+0.24z)| norm 0.2389 (-0.40z)| lr 1.53e-05 | 4158.13 ms | 32.5% bf16 MFU | 124588 tok/s step 17637/19560 | loss 3.269590 (-0.40z)| norm 0.2454 (+0.22z)| lr 1.53e-05 | 4154.42 ms | 32.5% bf16 MFU | 124668 tok/s step 17638/19560 | loss 3.303486 (+0.54z)| norm 0.2437 (+0.06z)| lr 1.53e-05 | 4196.27 ms | 32.2% bf16 MFU | 124682 tok/s step 17639/19560 | loss 3.286847 (+0.07z)| norm 0.2452 (+0.19z)| lr 1.52e-05 | 4235.86 ms | 31.9% bf16 MFU | 124637 tok/s step 17640/19560 | loss 3.291853 (+0.21z)| norm 0.2382 (-0.47z)| lr 1.52e-05 | 4224.41 ms | 32.0% bf16 MFU | 124610 tok/s step 17641/19560 | loss 3.252298 (-0.88z)| norm 0.2410 (-0.21z)| lr 1.52e-05 | 4162.53 ms | 32.4% bf16 MFU | 124677 tok/s step 17642/19560 | loss 3.263943 (-0.55z)| norm 0.2439 (+0.08z)| lr 1.52e-05 | 4164.85 ms | 32.4% bf16 MFU | 124738 tok/s step 17643/19560 | loss 3.240243 (-1.20z)| norm 0.2342 (-0.85z)| lr 1.52e-05 | 4178.18 ms | 32.3% bf16 MFU | 124775 tok/s step 17644/19560 
| loss 3.267288 (-0.43z)| norm 0.2342 (-0.84z)| lr 1.52e-05 | 4162.64 ms | 32.4% bf16 MFU | 124834 tok/s step 17645/19560 | loss 3.281278 (-0.03z)| norm 0.2354 (-0.71z)| lr 1.51e-05 | 4153.41 ms | 32.5% bf16 MFU | 124904 tok/s step 17646/19560 | loss 3.282697 (+0.02z)| norm 0.2417 (-0.10z)| lr 1.51e-05 | 4155.35 ms | 32.5% bf16 MFU | 124967 tok/s step 17647/19560 | loss 3.227027 (-1.55z)| norm 0.2344 (-0.80z)| lr 1.51e-05 | 4160.28 ms | 32.5% bf16 MFU | 125020 tok/s step 17648/19560 | loss 3.314886 (+0.92z)| norm 0.2355 (-0.69z)| lr 1.51e-05 | 4154.11 ms | 32.5% bf16 MFU | 125079 tok/s step 17649/19560 | loss 3.258756 (-0.68z)| norm 0.2363 (-0.60z)| lr 1.51e-05 | 4210.66 ms | 32.1% bf16 MFU | 125051 tok/s step 17650/19560 | loss 3.312723 (+0.84z)| norm 0.2346 (-0.75z)| lr 1.51e-05 | 4161.59 ms | 32.4% bf16 MFU | 125098 tok/s step 17651/19560 | loss 3.326256 (+1.22z)| norm 0.2344 (-0.77z)| lr 1.51e-05 | 4162.93 ms | 32.4% bf16 MFU | 125140 tok/s step 17652/19560 | loss 3.301739 (+0.52z)| norm 0.2454 (+0.29z)| lr 1.50e-05 | 4166.85 ms | 32.4% bf16 MFU | 125174 tok/s step 17653/19560 | loss 3.257818 (-0.72z)| norm 0.2402 (-0.20z)| lr 1.50e-05 | 4162.14 ms | 32.4% bf16 MFU | 125214 tok/s step 17654/19560 | loss 3.293046 (+0.28z)| norm 0.2438 (+0.15z)| lr 1.50e-05 | 4163.30 ms | 32.4% bf16 MFU | 125249 tok/s step 17655/19560 | loss 3.303201 (+0.56z)| norm 0.2350 (-0.70z)| lr 1.50e-05 | 4154.93 ms | 32.5% bf16 MFU | 125296 tok/s step 17656/19560 | loss 3.254534 (-0.83z)| norm 0.2271 (-1.46z)| lr 1.50e-05 | 4159.52 ms | 32.5% bf16 MFU | 125334 tok/s step 17657/19560 | loss 3.333581 (+1.42z)| norm 0.2445 (+0.23z)| lr 1.50e-05 | 4160.66 ms | 32.5% bf16 MFU | 125368 tok/s step 17658/19560 | loss 3.267789 (-0.46z)| norm 0.2558 (+1.33z)| lr 1.49e-05 | 4160.28 ms | 32.5% bf16 MFU | 125400 tok/s step 17659/19560 | loss 3.312503 (+0.81z)| norm 0.2579 (+1.51z)| lr 1.49e-05 | 4154.97 ms | 32.5% bf16 MFU | 125439 tok/s step 17660/19560 | loss 3.358750 (+2.08z)| norm 0.2493 (+0.68z)| 
lr 1.49e-05 | 4160.50 ms | 32.5% bf16 MFU | 125468 tok/s step 17661/19560 | loss 3.345794 (+1.68z)| norm 0.2386 (-0.37z)| lr 1.49e-05 | 4152.87 ms | 32.5% bf16 MFU | 125507 tok/s step 17662/19560 | loss 3.265774 (-0.54z)| norm 0.2498 (+0.70z)| lr 1.49e-05 | 4149.17 ms | 32.5% bf16 MFU | 125550 tok/s step 17663/19560 | loss 3.305209 (+0.55z)| norm 0.2407 (-0.17z)| lr 1.49e-05 | 4147.64 ms | 32.6% bf16 MFU | 125593 tok/s step 17664/19560 | loss 3.283599 (-0.04z)| norm 0.2377 (-0.46z)| lr 1.49e-05 | 4151.60 ms | 32.5% bf16 MFU | 125627 tok/s step 17665/19560 | loss 3.290740 (+0.15z)| norm 0.2376 (-0.46z)| lr 1.48e-05 | 4159.54 ms | 32.5% bf16 MFU | 125648 tok/s step 17666/19560 | loss 3.358791 (+2.00z)| norm 0.2466 (+0.39z)| lr 1.48e-05 | 4160.97 ms | 32.4% bf16 MFU | 125666 tok/s step 17667/19560 | loss 3.266667 (-0.54z)| norm 0.2410 (-0.15z)| lr 1.48e-05 | 4153.02 ms | 32.5% bf16 MFU | 125695 tok/s step 17668/19560 | loss 3.324610 (+1.05z)| norm 0.2743 (+2.96z)| lr 1.48e-05 | 4147.71 ms | 32.6% bf16 MFU | 125730 tok/s step 17669/19560 | loss 3.385079 (+2.63z)| norm 0.2484 (+0.51z)| lr 1.48e-05 | 4158.22 ms | 32.5% bf16 MFU | 125748 tok/s step 17670/19560 | loss 3.300284 (+0.33z)| norm 0.2333 (-0.92z)| lr 1.48e-05 | 4151.46 ms | 32.5% bf16 MFU | 125775 tok/s step 17671/19560 | loss 3.297487 (+0.25z)| norm 0.2458 (+0.26z)| lr 1.47e-05 | 4158.53 ms | 32.5% bf16 MFU | 125790 tok/s step 17672/19560 | loss 3.296435 (+0.22z)| norm 0.2496 (+0.62z)| lr 1.47e-05 | 4158.90 ms | 32.5% bf16 MFU | 125804 tok/s step 17673/19560 | loss 3.398647 (+2.87z)| norm 0.2393 (-0.36z)| lr 1.47e-05 | 4150.03 ms | 32.5% bf16 MFU | 125830 tok/s step 17674/19560 | loss 3.275841 (-0.38z)| norm 0.2389 (-0.40z)| lr 1.47e-05 | 4279.66 ms | 31.5% bf16 MFU | 125664 tok/s step 17675/19560 | loss 3.256199 (-0.89z)| norm 0.2433 (+0.01z)| lr 1.47e-05 | 4146.44 ms | 32.6% bf16 MFU | 125703 tok/s step 17676/19560 | loss 3.252505 (-0.99z)| norm 0.2397 (-0.33z)| lr 1.47e-05 | 4162.38 ms | 32.4% bf16 MFU | 
125716 tok/s step 17677/19560 | loss 3.294044 (+0.11z)| norm 0.2397 (-0.32z)| lr 1.47e-05 | 4150.18 ms | 32.5% bf16 MFU | 125746 tok/s step 17678/19560 | loss 3.271872 (-0.47z)| norm 0.2376 (-0.52z)| lr 1.46e-05 | 4158.03 ms | 32.5% bf16 MFU | 125764 tok/s step 17679/19560 | loss 3.290811 (+0.04z)| norm 0.2453 (+0.21z)| lr 1.46e-05 | 4149.07 ms | 32.5% bf16 MFU | 125794 tok/s step 17680/19560 | loss 3.324934 (+0.94z)| norm 0.2427 (-0.04z)| lr 1.46e-05 | 4159.44 ms | 32.5% bf16 MFU | 125806 tok/s step 17681/19560 | loss 3.269444 (-0.53z)| norm 0.2281 (-1.40z)| lr 1.46e-05 | 4160.72 ms | 32.5% bf16 MFU | 125816 tok/s step 17682/19560 | loss 3.216628 (-1.91z)| norm 0.2358 (-0.67z)| lr 1.46e-05 | 4157.40 ms | 32.5% bf16 MFU | 125831 tok/s step 17683/19560 | loss 3.323609 (+0.90z)| norm 0.2302 (-1.19z)| lr 1.46e-05 | 4158.82 ms | 32.5% bf16 MFU | 125843 tok/s step 17684/19560 | loss 3.364838 (+1.94z)| norm 0.2528 (+0.94z)| lr 1.45e-05 | 4156.82 ms | 32.5% bf16 MFU | 125857 tok/s step 17685/19560 | loss 3.289886 (-0.03z)| norm 0.2452 (+0.22z)| lr 1.45e-05 | 4317.46 ms | 31.3% bf16 MFU | 125636 tok/s step 17686/19560 | loss 3.342306 (+1.34z)| norm 0.2526 (+0.90z)| lr 1.45e-05 | 4144.38 ms | 32.6% bf16 MFU | 125679 tok/s step 17687/19560 | loss 3.249886 (-1.10z)| norm 0.2406 (-0.22z)| lr 1.45e-05 | 4153.82 ms | 32.5% bf16 MFU | 125706 tok/s step 17688/19560 | loss 3.301233 (+0.25z)| norm 0.2373 (-0.53z)| lr 1.45e-05 | 4161.36 ms | 32.4% bf16 MFU | 125721 tok/s step 17689/19560 | loss 3.249498 (-1.11z)| norm 0.2326 (-0.96z)| lr 1.45e-05 | 4144.79 ms | 32.6% bf16 MFU | 125759 tok/s step 17690/19560 | loss 3.269796 (-0.58z)| norm 0.2292 (-1.28z)| lr 1.45e-05 | 4148.38 ms | 32.5% bf16 MFU | 125790 tok/s step 17691/19560 | loss 3.397211 (+2.70z)| norm 0.2538 (+1.01z)| lr 1.44e-05 | 4154.76 ms | 32.5% bf16 MFU | 125810 tok/s step 17692/19560 | loss 3.297905 (+0.13z)| norm 0.2342 (-0.81z)| lr 1.44e-05 | 4162.47 ms | 32.4% bf16 MFU | 125818 tok/s step 17693/19560 | loss 3.301667 
(+0.23z)| norm 0.2497 (+0.66z)| lr 1.44e-05 | 4152.89 ms | 32.5% bf16 MFU | 125839 tok/s step 17694/19560 | loss 3.311521 (+0.47z)| norm 0.2522 (+0.89z)| lr 1.44e-05 | 4204.43 ms | 32.1% bf16 MFU | 125782 tok/s step 17695/19560 | loss 3.247149 (-1.20z)| norm 0.2431 (+0.02z)| lr 1.44e-05 | 4145.12 ms | 32.6% bf16 MFU | 125817 tok/s step 17696/19560 | loss 3.325986 (+0.84z)| norm 0.2418 (-0.09z)| lr 1.44e-05 | 4139.90 ms | 32.6% bf16 MFU | 125858 tok/s step 17697/19560 | loss 3.335171 (+1.09z)| norm 0.2324 (-0.99z)| lr 1.43e-05 | 4147.21 ms | 32.6% bf16 MFU | 125886 tok/s step 17698/19560 | loss 3.272631 (-0.54z)| norm 0.2450 (+0.25z)| lr 1.43e-05 | 4150.24 ms | 32.5% bf16 MFU | 125908 tok/s step 17699/19560 | loss 3.298842 (+0.16z)| norm 0.2391 (-0.32z)| lr 1.43e-05 | 4145.34 ms | 32.6% bf16 MFU | 125937 tok/s step 17700/19560 | loss 3.289715 (-0.08z)| norm 0.2374 (-0.49z)| lr 1.43e-05 | 4142.08 ms | 32.6% bf16 MFU | 125969 tok/s step 17701/19560 | loss 3.273961 (-0.49z)| norm 0.2501 (+0.75z)| lr 1.43e-05 | 4146.48 ms | 32.6% bf16 MFU | 125992 tok/s step 17702/19560 | loss 3.336770 (+1.17z)| norm 0.2609 (+1.77z)| lr 1.43e-05 | 4157.22 ms | 32.5% bf16 MFU | 125999 tok/s step 17703/19560 | loss 3.306321 (+0.37z)| norm 0.2336 (-0.86z)| lr 1.43e-05 | 4145.15 ms | 32.6% bf16 MFU | 126023 tok/s step 17704/19560 | loss 3.257227 (-0.93z)| norm 0.2317 (-1.03z)| lr 1.42e-05 | 4151.70 ms | 32.5% bf16 MFU | 126036 tok/s step 17705/19560 | loss 3.290601 (-0.04z)| norm 0.2430 (+0.05z)| lr 1.42e-05 | 4155.65 ms | 32.5% bf16 MFU | 126042 tok/s step 17706/19560 | loss 3.277553 (-0.39z)| norm 0.2279 (-1.40z)| lr 1.42e-05 | 4140.98 ms | 32.6% bf16 MFU | 126071 tok/s step 17707/19560 | loss 3.265510 (-0.73z)| norm 0.2433 (+0.07z)| lr 1.42e-05 | 4149.67 ms | 32.5% bf16 MFU | 126084 tok/s step 17708/19560 | loss 3.270563 (-0.59z)| norm 0.2256 (-1.59z)| lr 1.42e-05 | 4157.25 ms | 32.5% bf16 MFU | 126086 tok/s step 17709/19560 | loss 3.318280 (+0.68z)| norm 0.2394 (-0.29z)| lr 1.42e-05 | 
4150.20 ms | 32.5% bf16 MFU | 126098 tok/s step 17710/19560 | loss 3.344484 (+1.36z)| norm 0.2352 (-0.69z)| lr 1.41e-05 | 4155.95 ms | 32.5% bf16 MFU | 126101 tok/s step 17711/19560 | loss 3.304969 (+0.30z)| norm 0.2380 (-0.42z)| lr 1.41e-05 | 4155.58 ms | 32.5% bf16 MFU | 126104 tok/s step 17712/19560 | loss 3.222777 (-1.86z)| norm 0.2318 (-1.00z)| lr 1.41e-05 | 4162.84 ms | 32.4% bf16 MFU | 126096 tok/s step 17713/19560 | loss 3.278302 (-0.39z)| norm 0.2416 (-0.06z)| lr 1.41e-05 | 4151.32 ms | 32.5% bf16 MFU | 126106 tok/s step 17714/19560 | loss 3.277866 (-0.39z)| norm 0.2386 (-0.35z)| lr 1.41e-05 | 4149.72 ms | 32.5% bf16 MFU | 126118 tok/s step 17715/19560 | loss 3.353083 (+1.57z)| norm 0.2394 (-0.29z)| lr 1.41e-05 | 4145.85 ms | 32.6% bf16 MFU | 126135 tok/s step 17716/19560 | loss 3.295302 (+0.05z)| norm 0.2474 (+0.73z)| lr 1.41e-05 | 4142.99 ms | 32.6% bf16 MFU | 126155 tok/s step 17717/19560 | loss 3.234075 (-1.53z)| norm 0.2374 (-0.54z)| lr 1.40e-05 | 4156.62 ms | 32.5% bf16 MFU | 126154 tok/s step 17718/19560 | loss 3.269113 (-0.62z)| norm 0.2431 (+0.22z)| lr 1.40e-05 | 4170.22 ms | 32.4% bf16 MFU | 126133 tok/s step 17719/19560 | loss 3.295292 (+0.05z)| norm 0.2419 (+0.05z)| lr 1.40e-05 | 4160.20 ms | 32.5% bf16 MFU | 126127 tok/s step 17720/19560 | loss 3.360117 (+1.73z)| norm 0.2641 (+2.89z)| lr 1.40e-05 | 4150.67 ms | 32.5% bf16 MFU | 126137 tok/s step 17721/19560 | loss 3.258416 (-0.93z)| norm 0.2301 (-1.45z)| lr 1.40e-05 | 4149.85 ms | 32.5% bf16 MFU | 126147 tok/s step 17722/19560 | loss 3.329683 (+0.94z)| norm 0.2389 (-0.31z)| lr 1.40e-05 | 4150.30 ms | 32.5% bf16 MFU | 126156 tok/s step 17723/19560 | loss 3.352311 (+1.52z)| norm 0.2524 (+1.39z)| lr 1.40e-05 | 4163.55 ms | 32.4% bf16 MFU | 126144 tok/s step 17724/19560 | loss 3.299300 (+0.14z)| norm 0.2369 (-0.58z)| lr 1.39e-05 | 4148.54 ms | 32.5% bf16 MFU | 126156 tok/s step 17725/19560 | loss 3.398128 (+2.62z)| norm 0.2527 (+1.41z)| lr 1.39e-05 | 4156.53 ms | 32.5% bf16 MFU | 126155 tok/s step 
17726/19560 | loss 3.320293 (+0.65z)| norm 0.2398 (-0.22z)| lr 1.39e-05 | 4151.91 ms | 32.5% bf16 MFU | 126161 tok/s step 17727/19560 | loss 3.286583 (-0.21z)| norm 0.2473 (+0.73z)| lr 1.39e-05 | 4150.13 ms | 32.5% bf16 MFU | 126169 tok/s step 17728/19560 | loss 3.285350 (-0.23z)| norm 0.2552 (+1.69z)| lr 1.39e-05 | 4151.80 ms | 32.5% bf16 MFU | 126175 tok/s step 17729/19560 | loss 3.276886 (-0.45z)| norm 0.2338 (-0.97z)| lr 1.39e-05 | 4153.84 ms | 32.5% bf16 MFU | 126177 tok/s step 17730/19560 | loss 3.293227 (-0.04z)| norm 0.2431 (+0.19z)| lr 1.38e-05 | 4150.99 ms | 32.5% bf16 MFU | 126183 tok/s step 17731/19560 | loss 3.364063 (+1.72z)| norm 0.2616 (+2.42z)| lr 1.38e-05 | 4177.36 ms | 32.3% bf16 MFU | 126150 tok/s step 17732/19560 | loss 3.254312 (-1.03z)| norm 0.2354 (-0.79z)| lr 1.38e-05 | 4152.11 ms | 32.5% bf16 MFU | 126156 tok/s step 17733/19560 | loss 3.291407 (-0.10z)| norm 0.2324 (-1.14z)| lr 1.38e-05 | 4163.06 ms | 32.4% bf16 MFU | 126145 tok/s step 17734/19560 | loss 3.277718 (-0.44z)| norm 0.2369 (-0.57z)| lr 1.38e-05 | 4151.18 ms | 32.5% bf16 MFU | 126152 tok/s step 17735/19560 | loss 3.294223 (-0.02z)| norm 0.2478 (+0.79z)| lr 1.38e-05 | 4150.43 ms | 32.5% bf16 MFU | 126161 tok/s step 17736/19560 | loss 3.286593 (-0.20z)| norm 0.2595 (+2.21z)| lr 1.38e-05 | 4158.52 ms | 32.5% bf16 MFU | 126157 tok/s step 17737/19560 | loss 3.286849 (-0.19z)| norm 0.2576 (+1.93z)| lr 1.37e-05 | 4147.80 ms | 32.6% bf16 MFU | 126169 tok/s step 17738/19560 | loss 3.269525 (-0.62z)| norm 0.2350 (-0.81z)| lr 1.37e-05 | 4159.17 ms | 32.5% bf16 MFU | 126163 tok/s step 17739/19560 | loss 3.279827 (-0.36z)| norm 0.2385 (-0.38z)| lr 1.37e-05 | 4154.23 ms | 32.5% bf16 MFU | 126165 tok/s step 17740/19560 | loss 3.337414 (+1.10z)| norm 0.2457 (+0.49z)| lr 1.37e-05 | 4153.53 ms | 32.5% bf16 MFU | 126168 tok/s step 17741/19560 | loss 3.216337 (-1.97z)| norm 0.2572 (+1.84z)| lr 1.37e-05 | 4162.48 ms | 32.4% bf16 MFU | 126158 tok/s step 17742/19560 | loss 3.245717 (-1.21z)| norm 
0.2282 (-1.63z)| lr 1.37e-05 | 4156.20 ms | 32.5% bf16 MFU | 126157 tok/s step 17743/19560 | loss 3.269747 (-0.59z)| norm 0.2358 (-0.71z)| lr 1.37e-05 | 4154.48 ms | 32.5% bf16 MFU | 126159 tok/s step 17744/19560 | loss 3.299682 (+0.16z)| norm 0.2448 (+0.37z)| lr 1.36e-05 | 4151.32 ms | 32.5% bf16 MFU | 126166 tok/s step 17745/19560 | loss 3.284462 (-0.23z)| norm 0.2295 (-1.44z)| lr 1.36e-05 | 4154.94 ms | 32.5% bf16 MFU | 126167 tok/s step 17746/19560 | loss 3.312758 (+0.48z)| norm 0.2610 (+2.25z)| lr 1.36e-05 | 4161.62 ms | 32.4% bf16 MFU | 126158 tok/s step 17747/19560 | loss 3.306442 (+0.31z)| norm 0.2330 (-1.02z)| lr 1.36e-05 | 4148.20 ms | 32.5% bf16 MFU | 126169 tok/s step 17748/19560 | loss 3.261634 (-0.83z)| norm 0.2320 (-1.11z)| lr 1.36e-05 | 4158.99 ms | 32.5% bf16 MFU | 126164 tok/s step 17749/19560 | loss 3.374145 (+1.99z)| norm 0.2564 (+1.68z)| lr 1.36e-05 | 4211.10 ms | 32.1% bf16 MFU | 126081 tok/s step 17750/19560 | loss 3.272504 (-0.58z)| norm 0.2347 (-0.81z)| lr 1.35e-05 | 4174.05 ms | 32.3% bf16 MFU | 126057 tok/s val loss 3.269358 evaluating HellaSwag: 0/1256 evaluating HellaSwag: 10/1256 evaluating HellaSwag: 20/1256 evaluating HellaSwag: 30/1256 evaluating HellaSwag: 40/1256 evaluating HellaSwag: 50/1256 evaluating HellaSwag: 60/1256 evaluating HellaSwag: 70/1256 evaluating HellaSwag: 80/1256 evaluating HellaSwag: 90/1256 evaluating HellaSwag: 100/1256 evaluating HellaSwag: 110/1256 evaluating HellaSwag: 120/1256 evaluating HellaSwag: 130/1256 evaluating HellaSwag: 140/1256 evaluating HellaSwag: 150/1256 evaluating HellaSwag: 160/1256 evaluating HellaSwag: 170/1256 evaluating HellaSwag: 180/1256 evaluating HellaSwag: 190/1256 evaluating HellaSwag: 200/1256 evaluating HellaSwag: 210/1256 evaluating HellaSwag: 220/1256 evaluating HellaSwag: 230/1256 evaluating HellaSwag: 240/1256 evaluating HellaSwag: 250/1256 evaluating HellaSwag: 260/1256 evaluating HellaSwag: 270/1256 evaluating HellaSwag: 280/1256 evaluating HellaSwag: 290/1256 evaluating 
HellaSwag: 300/1256 evaluating HellaSwag: 310/1256 evaluating HellaSwag: 320/1256 evaluating HellaSwag: 330/1256 evaluating HellaSwag: 340/1256 evaluating HellaSwag: 350/1256 evaluating HellaSwag: 360/1256 evaluating HellaSwag: 370/1256 evaluating HellaSwag: 380/1256 evaluating HellaSwag: 390/1256 evaluating HellaSwag: 400/1256 evaluating HellaSwag: 410/1256 evaluating HellaSwag: 420/1256 evaluating HellaSwag: 430/1256 evaluating HellaSwag: 440/1256 evaluating HellaSwag: 450/1256 evaluating HellaSwag: 460/1256 evaluating HellaSwag: 470/1256 evaluating HellaSwag: 480/1256 evaluating HellaSwag: 490/1256 evaluating HellaSwag: 500/1256 evaluating HellaSwag: 510/1256 evaluating HellaSwag: 520/1256 evaluating HellaSwag: 530/1256 evaluating HellaSwag: 540/1256 evaluating HellaSwag: 550/1256 evaluating HellaSwag: 560/1256 evaluating HellaSwag: 570/1256 evaluating HellaSwag: 580/1256 evaluating HellaSwag: 590/1256 evaluating HellaSwag: 600/1256 evaluating HellaSwag: 610/1256 evaluating HellaSwag: 620/1256 evaluating HellaSwag: 630/1256 evaluating HellaSwag: 640/1256 evaluating HellaSwag: 650/1256 evaluating HellaSwag: 660/1256 evaluating HellaSwag: 670/1256 evaluating HellaSwag: 680/1256 evaluating HellaSwag: 690/1256 evaluating HellaSwag: 700/1256 evaluating HellaSwag: 710/1256 evaluating HellaSwag: 720/1256 evaluating HellaSwag: 730/1256 evaluating HellaSwag: 740/1256 evaluating HellaSwag: 750/1256 evaluating HellaSwag: 760/1256 evaluating HellaSwag: 770/1256 evaluating HellaSwag: 780/1256 evaluating HellaSwag: 790/1256 evaluating HellaSwag: 800/1256 evaluating HellaSwag: 810/1256 evaluating HellaSwag: 820/1256 evaluating HellaSwag: 830/1256 evaluating HellaSwag: 840/1256 evaluating HellaSwag: 850/1256 evaluating HellaSwag: 860/1256 evaluating HellaSwag: 870/1256 evaluating HellaSwag: 880/1256 evaluating HellaSwag: 890/1256 evaluating HellaSwag: 900/1256 evaluating HellaSwag: 910/1256 evaluating HellaSwag: 920/1256 evaluating HellaSwag: 930/1256 evaluating HellaSwag: 
940/1256 evaluating HellaSwag: 950/1256 evaluating HellaSwag: 960/1256 evaluating HellaSwag: 970/1256 evaluating HellaSwag: 980/1256 evaluating HellaSwag: 990/1256 evaluating HellaSwag: 1000/1256 evaluating HellaSwag: 1010/1256 evaluating HellaSwag: 1020/1256 evaluating HellaSwag: 1030/1256 evaluating HellaSwag: 1040/1256 evaluating HellaSwag: 1050/1256 evaluating HellaSwag: 1060/1256 evaluating HellaSwag: 1070/1256 evaluating HellaSwag: 1080/1256 evaluating HellaSwag: 1090/1256 evaluating HellaSwag: 1100/1256 evaluating HellaSwag: 1110/1256 evaluating HellaSwag: 1120/1256 evaluating HellaSwag: 1130/1256 evaluating HellaSwag: 1140/1256 evaluating HellaSwag: 1150/1256 evaluating HellaSwag: 1160/1256 evaluating HellaSwag: 1170/1256 evaluating HellaSwag: 1180/1256 evaluating HellaSwag: 1190/1256 evaluating HellaSwag: 1200/1256 evaluating HellaSwag: 1210/1256 evaluating HellaSwag: 1220/1256 evaluating HellaSwag: 1230/1256 evaluating HellaSwag: 1240/1256 evaluating HellaSwag: 1250/1256 HellaSwag: 3046/10042 = 0.303326 step 17751/19560 | loss 3.231476 (-1.58z)| norm 0.2420 (+0.02z)| lr 1.35e-05 | 4259.73 ms | 31.7% bf16 MFU | 125908 tok/s step 17752/19560 | loss 3.283502 (-0.27z)| norm 0.2422 (+0.05z)| lr 1.35e-05 | 4367.69 ms | 30.9% bf16 MFU | 125615 tok/s step 17753/19560 | loss 3.282302 (-0.29z)| norm 0.2266 (-1.72z)| lr 1.35e-05 | 4308.89 ms | 31.3% bf16 MFU | 125418 tok/s step 17754/19560 | loss 3.246463 (-1.21z)| norm 0.2434 (+0.21z)| lr 1.35e-05 | 4338.06 ms | 31.1% bf16 MFU | 125190 tok/s step 17755/19560 | loss 3.308347 (+0.40z)| norm 0.2364 (-0.59z)| lr 1.35e-05 | 4165.11 ms | 32.4% bf16 MFU | 125224 tok/s step 17756/19560 | loss 3.236681 (-1.45z)| norm 0.2408 (-0.09z)| lr 1.35e-05 | 4162.48 ms | 32.4% bf16 MFU | 125261 tok/s step 17757/19560 | loss 3.239002 (-1.37z)| norm 0.2471 (+0.63z)| lr 1.34e-05 | 4148.21 ms | 32.5% bf16 MFU | 125317 tok/s step 17758/19560 | loss 3.264932 (-0.69z)| norm 0.2336 (-0.92z)| lr 1.34e-05 | 4154.65 ms | 32.5% bf16 MFU | 125361 
tok/s step 17759/19560 | loss 3.289380 (-0.06z)| norm 0.2356 (-0.68z)| lr 1.34e-05 | 4216.50 ms | 32.0% bf16 MFU | 125310 tok/s step 17760/19560 | loss 3.291909 (-0.02z)| norm 0.2491 (+0.87z)| lr 1.34e-05 | 4158.41 ms | 32.5% bf16 MFU | 125348 tok/s step 17761/19560 | loss 3.294559 (+0.06z)| norm 0.2396 (-0.23z)| lr 1.34e-05 | 4220.04 ms | 32.0% bf16 MFU | 125293 tok/s step 17762/19560 | loss 3.257555 (-0.92z)| norm 0.2380 (-0.41z)| lr 1.34e-05 | 4187.75 ms | 32.2% bf16 MFU | 125288 tok/s step 17763/19560 | loss 3.292524 (+0.00z)| norm 0.2454 (+0.43z)| lr 1.34e-05 | 4200.37 ms | 32.1% bf16 MFU | 125265 tok/s step 17764/19560 | loss 3.281703 (-0.28z)| norm 0.2299 (-1.33z)| lr 1.33e-05 | 4191.10 ms | 32.2% bf16 MFU | 125256 tok/s step 17765/19560 | loss 3.242248 (-1.32z)| norm 0.2379 (-0.40z)| lr 1.33e-05 | 4225.97 ms | 31.9% bf16 MFU | 125196 tok/s step 17766/19560 | loss 3.228680 (-1.65z)| norm 0.2355 (-0.68z)| lr 1.33e-05 | 4171.58 ms | 32.4% bf16 MFU | 125221 tok/s step 17767/19560 | loss 3.285201 (-0.17z)| norm 0.2340 (-0.84z)| lr 1.33e-05 | 4154.33 ms | 32.5% bf16 MFU | 125270 tok/s step 17768/19560 | loss 3.254120 (-0.97z)| norm 0.2299 (-1.29z)| lr 1.33e-05 | 4175.44 ms | 32.3% bf16 MFU | 125284 tok/s step 17769/19560 | loss 3.234242 (-1.48z)| norm 0.2343 (-0.79z)| lr 1.33e-05 | 4163.54 ms | 32.4% bf16 MFU | 125316 tok/s step 17770/19560 | loss 3.186527 (-2.63z)| norm 0.2332 (-0.90z)| lr 1.33e-05 | 4153.42 ms | 32.5% bf16 MFU | 125362 tok/s step 17771/19560 | loss 3.372149 (+2.02z)| norm 0.2447 (+0.39z)| lr 1.32e-05 | 4174.66 ms | 32.3% bf16 MFU | 125373 tok/s step 17772/19560 | loss 3.293574 (+0.05z)| norm 0.2449 (+0.41z)| lr 1.32e-05 | 4171.10 ms | 32.4% bf16 MFU | 125390 tok/s step 17773/19560 | loss 3.274819 (-0.42z)| norm 0.2341 (-0.81z)| lr 1.32e-05 | 4157.37 ms | 32.5% bf16 MFU | 125426 tok/s step 17774/19560 | loss 3.282707 (-0.23z)| norm 0.2500 (+0.98z)| lr 1.32e-05 | 4159.18 ms | 32.5% bf16 MFU | 125457 tok/s step 17775/19560 | loss 3.214188 
(-1.93z)| norm 0.2391 (-0.26z)| lr 1.32e-05 | 4302.83 ms | 31.4% bf16 MFU | 125277 tok/s step 17776/19560 | loss 3.273711 (-0.44z)| norm 0.2489 (+0.83z)| lr 1.32e-05 | 4148.59 ms | 32.5% bf16 MFU | 125332 tok/s step 17777/19560 | loss 3.207929 (-2.04z)| norm 0.2562 (+1.62z)| lr 1.31e-05 | 4152.49 ms | 32.5% bf16 MFU | 125378 tok/s step 17778/19560 | loss 3.271855 (-0.46z)| norm 0.2466 (+0.55z)| lr 1.31e-05 | 4157.18 ms | 32.5% bf16 MFU | 125415 tok/s step 17779/19560 | loss 3.295608 (+0.13z)| norm 0.2407 (-0.12z)| lr 1.31e-05 | 4151.10 ms | 32.5% bf16 MFU | 125459 tok/s step 17780/19560 | loss 3.373994 (+2.02z)| norm 0.2529 (+1.24z)| lr 1.31e-05 | 4153.31 ms | 32.5% bf16 MFU | 125498 tok/s step 17781/19560 | loss 3.284769 (-0.15z)| norm 0.2554 (+1.49z)| lr 1.31e-05 | 4160.24 ms | 32.5% bf16 MFU | 125524 tok/s step 17782/19560 | loss 3.304203 (+0.32z)| norm 0.2528 (+1.18z)| lr 1.31e-05 | 4183.91 ms | 32.3% bf16 MFU | 125514 tok/s step 17783/19560 | loss 3.287224 (-0.09z)| norm 0.2345 (-0.83z)| lr 1.31e-05 | 4179.48 ms | 32.3% bf16 MFU | 125510 tok/s step 17784/19560 | loss 3.291974 (+0.02z)| norm 0.2417 (-0.05z)| lr 1.30e-05 | 4152.10 ms | 32.5% bf16 MFU | 125548 tok/s step 17785/19560 | loss 3.331611 (+0.99z)| norm 0.2336 (-0.93z)| lr 1.30e-05 | 4148.76 ms | 32.5% bf16 MFU | 125589 tok/s step 17786/19560 | loss 3.244999 (-1.13z)| norm 0.2330 (-0.99z)| lr 1.30e-05 | 4163.42 ms | 32.4% bf16 MFU | 125606 tok/s step 17787/19560 | loss 3.310838 (+0.48z)| norm 0.2350 (-0.75z)| lr 1.30e-05 | 4154.20 ms | 32.5% bf16 MFU | 125636 tok/s step 17788/19560 | loss 3.290244 (-0.01z)| norm 0.2424 (+0.08z)| lr 1.30e-05 | 4135.94 ms | 32.6% bf16 MFU | 125693 tok/s step 17789/19560 | loss 3.298475 (+0.20z)| norm 0.2364 (-0.59z)| lr 1.30e-05 | 4160.30 ms | 32.5% bf16 MFU | 125709 tok/s step 17790/19560 | loss 3.288870 (-0.04z)| norm 0.2528 (+1.25z)| lr 1.30e-05 | 4153.65 ms | 32.5% bf16 MFU | 125735 tok/s step 17791/19560 | loss 3.396438 (+2.55z)| norm 0.2484 (+0.75z)| lr 1.29e-05 | 
4144.18 ms | 32.6% bf16 MFU | 125774 tok/s step 17792/19560 | loss 3.272793 (-0.44z)| norm 0.2345 (-0.81z)| lr 1.29e-05 | 4161.09 ms | 32.4% bf16 MFU | 125785 tok/s step 17793/19560 | loss 3.322523 (+0.76z)| norm 0.2373 (-0.49z)| lr 1.29e-05 | 4155.93 ms | 32.5% bf16 MFU | 125803 tok/s step 17794/19560 | loss 3.330040 (+0.95z)| norm 0.2481 (+0.71z)| lr 1.29e-05 | 4157.76 ms | 32.5% bf16 MFU | 125818 tok/s step 17795/19560 | loss 3.277087 (-0.34z)| norm 0.2360 (-0.64z)| lr 1.29e-05 | 4157.93 ms | 32.5% bf16 MFU | 125832 tok/s step 17796/19560 | loss 3.308427 (+0.43z)| norm 0.2497 (+0.97z)| lr 1.29e-05 | 4146.92 ms | 32.6% bf16 MFU | 125862 tok/s step 17797/19560 | loss 3.287619 (-0.07z)| norm 0.2381 (-0.39z)| lr 1.29e-05 | 4151.67 ms | 32.5% bf16 MFU | 125883 tok/s step 17798/19560 | loss 3.314993 (+0.61z)| norm 0.2426 (+0.13z)| lr 1.28e-05 | 4175.02 ms | 32.3% bf16 MFU | 125867 tok/s step 17799/19560 | loss 3.243019 (-1.16z)| norm 0.2406 (-0.09z)| lr 1.28e-05 | 4153.11 ms | 32.5% bf16 MFU | 125886 tok/s step 17800/19560 | loss 3.280510 (-0.23z)| norm 0.2342 (-0.85z)| lr 1.28e-05 | 4150.80 ms | 32.5% bf16 MFU | 125907 tok/s step 17801/19560 | loss 3.276520 (-0.31z)| norm 0.2386 (-0.32z)| lr 1.28e-05 | 4151.80 ms | 32.5% bf16 MFU | 125926 tok/s step 17802/19560 | loss 3.284050 (-0.12z)| norm 0.2350 (-0.74z)| lr 1.28e-05 | 4141.99 ms | 32.6% bf16 MFU | 125959 tok/s step 17803/19560 | loss 3.285481 (-0.09z)| norm 0.2324 (-1.04z)| lr 1.28e-05 | 4152.74 ms | 32.5% bf16 MFU | 125973 tok/s step 17804/19560 | loss 3.244570 (-1.14z)| norm 0.2326 (-1.00z)| lr 1.28e-05 | 4149.23 ms | 32.5% bf16 MFU | 125992 tok/s step 17805/19560 | loss 3.329214 (+1.01z)| norm 0.2394 (-0.20z)| lr 1.27e-05 | 4151.03 ms | 32.5% bf16 MFU | 126008 tok/s step 17806/19560 | loss 3.286588 (-0.07z)| norm 0.2448 (+0.43z)| lr 1.27e-05 | 4151.00 ms | 32.5% bf16 MFU | 126023 tok/s step 17807/19560 | loss 3.335581 (+1.16z)| norm 0.2503 (+1.07z)| lr 1.27e-05 | 4143.22 ms | 32.6% bf16 MFU | 126049 tok/s step 
17808/19560 | loss 3.317765 (+0.71z)| norm 0.2334 (-0.90z)| lr 1.27e-05 | 4149.85 ms | 32.5% bf16 MFU | 126063 tok/s step 17809/19560 | loss 3.297445 (+0.19z)| norm 0.2462 (+0.58z)| lr 1.27e-05 | 4153.06 ms | 32.5% bf16 MFU | 126072 tok/s step 17810/19560 | loss 3.303967 (+0.34z)| norm 0.2429 (+0.18z)| lr 1.27e-05 | 4154.02 ms | 32.5% bf16 MFU | 126079 tok/s step 17811/19560 | loss 3.241628 (-1.24z)| norm 0.2411 (-0.04z)| lr 1.27e-05 | 4152.80 ms | 32.5% bf16 MFU | 126088 tok/s step 17812/19560 | loss 3.304107 (+0.38z)| norm 0.2365 (-0.57z)| lr 1.26e-05 | 4154.31 ms | 32.5% bf16 MFU | 126093 tok/s step 17813/19560 | loss 3.336253 (+1.20z)| norm 0.2468 (+0.66z)| lr 1.26e-05 | 4154.01 ms | 32.5% bf16 MFU | 126099 tok/s step 17814/19560 | loss 3.329954 (+1.04z)| norm 0.2507 (+1.13z)| lr 1.26e-05 | 4145.63 ms | 32.6% bf16 MFU | 126118 tok/s step 17815/19560 | loss 3.282488 (-0.20z)| norm 0.2353 (-0.71z)| lr 1.26e-05 | 4156.58 ms | 32.5% bf16 MFU | 126119 tok/s step 17816/19560 | loss 3.355103 (+1.66z)| norm 0.2327 (-1.01z)| lr 1.26e-05 | 4170.48 ms | 32.4% bf16 MFU | 126098 tok/s step 17817/19560 | loss 3.223670 (-1.70z)| norm 0.2422 (+0.10z)| lr 1.26e-05 | 4160.77 ms | 32.5% bf16 MFU | 126094 tok/s step 17818/19560 | loss 3.313514 (+0.58z)| norm 0.2408 (-0.07z)| lr 1.26e-05 | 4152.78 ms | 32.5% bf16 MFU | 126102 tok/s step 17819/19560 | loss 3.256802 (-0.86z)| norm 0.2402 (-0.14z)| lr 1.25e-05 | 4158.98 ms | 32.5% bf16 MFU | 126100 tok/s step 17820/19560 | loss 3.292602 (+0.08z)| norm 0.2370 (-0.52z)| lr 1.25e-05 | 4165.54 ms | 32.4% bf16 MFU | 126088 tok/s step 17821/19560 | loss 3.319600 (+0.79z)| norm 0.2357 (-0.67z)| lr 1.25e-05 | 4149.52 ms | 32.5% bf16 MFU | 126101 tok/s step 17822/19560 | loss 3.252760 (-0.95z)| norm 0.2362 (-0.59z)| lr 1.25e-05 | 4150.34 ms | 32.5% bf16 MFU | 126112 tok/s step 17823/19560 | loss 3.302407 (+0.34z)| norm 0.2427 (+0.21z)| lr 1.25e-05 | 4221.11 ms | 32.0% bf16 MFU | 126017 tok/s step 17824/19560 | loss 3.252849 (-0.95z)| norm 
0.2360 (-0.61z)| lr 1.25e-05 | 4148.93 ms | 32.5% bf16 MFU | 126034 tok/s step 17825/19560 | loss 3.318306 (+0.77z)| norm 0.2442 (+0.38z)| lr 1.25e-05 | 4159.47 ms | 32.5% bf16 MFU | 126035 tok/s step 17826/19560 | loss 3.381288 (+2.36z)| norm 0.2381 (-0.36z)| lr 1.24e-05 | 4165.24 ms | 32.4% bf16 MFU | 126027 tok/s step 17827/19560 | loss 3.289542 (-0.00z)| norm 0.2336 (-0.91z)| lr 1.24e-05 | 4149.79 ms | 32.5% bf16 MFU | 126042 tok/s step 17828/19560 | loss 3.288401 (-0.03z)| norm 0.2362 (-0.59z)| lr 1.24e-05 | 4153.50 ms | 32.5% bf16 MFU | 126052 tok/s step 17829/19560 | loss 3.264355 (-0.65z)| norm 0.2387 (-0.28z)| lr 1.24e-05 | 4157.82 ms | 32.5% bf16 MFU | 126054 tok/s step 17830/19560 | loss 3.362236 (+1.85z)| norm 0.2426 (+0.23z)| lr 1.24e-05 | 4148.68 ms | 32.5% bf16 MFU | 126070 tok/s step 17831/19560 | loss 3.329150 (+1.00z)| norm 0.2408 (-0.00z)| lr 1.24e-05 | 4156.29 ms | 32.5% bf16 MFU | 126074 tok/s step 17832/19560 | loss 3.352923 (+1.57z)| norm 0.2510 (+1.27z)| lr 1.24e-05 | 4152.54 ms | 32.5% bf16 MFU | 126083 tok/s step 17833/19560 | loss 3.329221 (+0.96z)| norm 0.2458 (+0.61z)| lr 1.23e-05 | 4154.48 ms | 32.5% bf16 MFU | 126089 tok/s step 17834/19560 | loss 3.313206 (+0.55z)| norm 0.2278 (-1.67z)| lr 1.23e-05 | 4165.58 ms | 32.4% bf16 MFU | 126077 tok/s step 17835/19560 | loss 3.283200 (-0.21z)| norm 0.2336 (-0.92z)| lr 1.23e-05 | 4155.74 ms | 32.5% bf16 MFU | 126081 tok/s step 17836/19560 | loss 3.325083 (+0.84z)| norm 0.2455 (+0.56z)| lr 1.23e-05 | 4160.74 ms | 32.5% bf16 MFU | 126078 tok/s step 17837/19560 | loss 3.252491 (-0.98z)| norm 0.2268 (-1.78z)| lr 1.23e-05 | 4150.57 ms | 32.5% bf16 MFU | 126090 tok/s step 17838/19560 | loss 3.348618 (+1.44z)| norm 0.2492 (+1.02z)| lr 1.23e-05 | 4156.38 ms | 32.5% bf16 MFU | 126092 tok/s step 17839/19560 | loss 3.299038 (+0.19z)| norm 0.2391 (-0.26z)| lr 1.23e-05 | 4164.04 ms | 32.4% bf16 MFU | 126083 tok/s step 17840/19560 | loss 3.334969 (+1.08z)| norm 0.2440 (+0.36z)| lr 1.22e-05 | 4216.02 ms | 
32.0% bf16 MFU | 125997 tok/s step 17841/19560 | loss 3.288522 (-0.10z)| norm 0.2388 (-0.30z)| lr 1.22e-05 | 4173.54 ms | 32.4% bf16 MFU | 125978 tok/s step 17842/19560 | loss 3.330381 (+0.95z)| norm 0.2389 (-0.29z)| lr 1.22e-05 | 4174.46 ms | 32.3% bf16 MFU | 125959 tok/s step 17843/19560 | loss 3.309277 (+0.43z)| norm 0.2399 (-0.16z)| lr 1.22e-05 | 4163.42 ms | 32.4% bf16 MFU | 125957 tok/s step 17844/19560 | loss 3.284662 (-0.19z)| norm 0.2347 (-0.80z)| lr 1.22e-05 | 4162.92 ms | 32.4% bf16 MFU | 125956 tok/s step 17845/19560 | loss 3.356677 (+1.61z)| norm 0.2344 (-0.84z)| lr 1.22e-05 | 4164.88 ms | 32.4% bf16 MFU | 125953 tok/s step 17846/19560 | loss 3.359448 (+1.65z)| norm 0.2426 (+0.19z)| lr 1.22e-05 | 4153.41 ms | 32.5% bf16 MFU | 125967 tok/s step 17847/19560 | loss 3.349997 (+1.39z)| norm 0.2607 (+2.40z)| lr 1.21e-05 | 4156.56 ms | 32.5% bf16 MFU | 125975 tok/s step 17848/19560 | loss 3.303803 (+0.25z)| norm 0.2398 (-0.15z)| lr 1.21e-05 | 4149.94 ms | 32.5% bf16 MFU | 125993 tok/s step 17849/19560 | loss 3.294648 (+0.01z)| norm 0.2382 (-0.37z)| lr 1.21e-05 | 4168.71 ms | 32.4% bf16 MFU | 125982 tok/s step 17850/19560 | loss 3.434105 (+3.37z)| norm 0.2473 (+0.79z)| lr 1.21e-05 | 4170.22 ms | 32.4% bf16 MFU | 125969 tok/s step 17851/19560 | loss 3.369196 (+1.79z)| norm 0.2343 (-0.86z)| lr 1.21e-05 | 4158.71 ms | 32.5% bf16 MFU | 125974 tok/s step 17852/19560 | loss 3.304522 (+0.22z)| norm 0.2328 (-1.05z)| lr 1.21e-05 | 4152.65 ms | 32.5% bf16 MFU | 125988 tok/s step 17853/19560 | loss 3.301354 (+0.17z)| norm 0.2627 (+2.71z)| lr 1.21e-05 | 4435.54 ms | 30.4% bf16 MFU | 125599 tok/s step 17854/19560 | loss 3.282244 (-0.29z)| norm 0.2418 (+0.10z)| lr 1.20e-05 | 4177.63 ms | 32.3% bf16 MFU | 125594 tok/s step 17855/19560 | loss 3.321935 (+0.68z)| norm 0.2306 (-1.29z)| lr 1.20e-05 | 4149.29 ms | 32.5% bf16 MFU | 125632 tok/s step 17856/19560 | loss 3.392451 (+2.35z)| norm 0.2600 (+2.35z)| lr 1.20e-05 | 4144.77 ms | 32.6% bf16 MFU | 125675 tok/s step 17857/19560 
| loss 3.327247 (+0.76z)| norm 0.2450 (+0.49z)| lr 1.20e-05 | 4160.90 ms | 32.4% bf16 MFU | 125691 tok/s step 17858/19560 | loss 3.292337 (-0.08z)| norm 0.2290 (-1.47z)| lr 1.20e-05 | 4241.06 ms | 31.8% bf16 MFU | 125588 tok/s step 17859/19560 | loss 3.397798 (+2.43z)| norm 0.2442 (+0.43z)| lr 1.20e-05 | 4146.67 ms | 32.6% bf16 MFU | 125630 tok/s step 17860/19560 | loss 3.289079 (-0.17z)| norm 0.2375 (-0.42z)| lr 1.20e-05 | 4140.81 ms | 32.6% bf16 MFU | 125679 tok/s step 17861/19560 | loss 3.338865 (+1.01z)| norm 0.2344 (-0.81z)| lr 1.19e-05 | 4840.24 ms | 27.9% bf16 MFU | 124811 tok/s step 17862/19560 | loss 3.330084 (+0.79z)| norm 0.2547 (+1.71z)| lr 1.19e-05 | 4678.60 ms | 28.9% bf16 MFU | 124174 tok/s step 17863/19560 | loss 3.416217 (+2.73z)| norm 0.2520 (+1.37z)| lr 1.19e-05 | 4792.95 ms | 28.2% bf16 MFU | 123435 tok/s step 17864/19560 | loss 3.311598 (+0.31z)| norm 0.2358 (-0.64z)| lr 1.19e-05 | 4575.23 ms | 29.5% bf16 MFU | 122992 tok/s step 17865/19560 | loss 3.285594 (-0.29z)| norm 0.2417 (+0.13z)| lr 1.19e-05 | 4269.82 ms | 31.6% bf16 MFU | 122982 tok/s step 17866/19560 | loss 3.350124 (+1.18z)| norm 0.2342 (-0.84z)| lr 1.19e-05 | 4620.83 ms | 29.2% bf16 MFU | 122506 tok/s step 17867/19560 | loss 3.267486 (-0.72z)| norm 0.2410 (+0.03z)| lr 1.19e-05 | 4138.53 ms | 32.6% bf16 MFU | 122715 tok/s step 17868/19560 | loss 3.322869 (+0.56z)| norm 0.2381 (-0.33z)| lr 1.19e-05 | 4150.08 ms | 32.5% bf16 MFU | 122896 tok/s step 17869/19560 | loss 3.301963 (+0.06z)| norm 0.2357 (-0.62z)| lr 1.18e-05 | 4709.62 ms | 28.7% bf16 MFU | 122317 tok/s step 17870/19560 | loss 3.296400 (-0.07z)| norm 0.2307 (-1.30z)| lr 1.18e-05 | 4447.39 ms | 30.4% bf16 MFU | 122096 tok/s step 17871/19560 | loss 3.288847 (-0.26z)| norm 0.2287 (-1.54z)| lr 1.18e-05 | 4594.29 ms | 29.4% bf16 MFU | 121697 tok/s step 17872/19560 | loss 3.242432 (-1.33z)| norm 0.2348 (-0.74z)| lr 1.18e-05 | 5159.74 ms | 26.2% bf16 MFU | 120693 tok/s step 17873/19560 | loss 3.328944 (+0.68z)| norm 0.2374 (-0.40z)| 
lr 1.18e-05 | 4920.67 ms | 27.4% bf16 MFU | 119985 tok/s step 17874/19560 | loss 3.292060 (-0.17z)| norm 0.2325 (-1.05z)| lr 1.18e-05 | 4160.94 ms | 32.4% bf16 MFU | 120286 tok/s step 17875/19560 | loss 3.297367 (-0.05z)| norm 0.2459 (+0.75z)| lr 1.18e-05 | 4218.96 ms | 32.0% bf16 MFU | 120485 tok/s step 17876/19560 | loss 3.320892 (+0.49z)| norm 0.2466 (+0.84z)| lr 1.17e-05 | 4157.22 ms | 32.5% bf16 MFU | 120767 tok/s step 17877/19560 | loss 3.333921 (+0.81z)| norm 0.2283 (-1.63z)| lr 1.17e-05 | 4167.49 ms | 32.4% bf16 MFU | 121019 tok/s step 17878/19560 | loss 3.317611 (+0.42z)| norm 0.2460 (+0.78z)| lr 1.17e-05 | 4220.26 ms | 32.0% bf16 MFU | 121179 tok/s step 17879/19560 | loss 3.304083 (+0.09z)| norm 0.2387 (-0.21z)| lr 1.17e-05 | 4154.29 ms | 32.5% bf16 MFU | 121431 tok/s step 17880/19560 | loss 3.284973 (-0.37z)| norm 0.2402 (-0.01z)| lr 1.17e-05 | 4221.08 ms | 32.0% bf16 MFU | 121569 tok/s step 17881/19560 | loss 3.338605 (+0.90z)| norm 0.2539 (+1.83z)| lr 1.17e-05 | 4152.85 ms | 32.5% bf16 MFU | 121803 tok/s step 17882/19560 | loss 3.287088 (-0.34z)| norm 0.2434 (+0.40z)| lr 1.17e-05 | 4147.95 ms | 32.6% bf16 MFU | 122033 tok/s step 17883/19560 | loss 3.359760 (+1.38z)| norm 0.2388 (-0.24z)| lr 1.16e-05 | 4157.13 ms | 32.5% bf16 MFU | 122237 tok/s step 17884/19560 | loss 3.293137 (-0.21z)| norm 0.2425 (+0.28z)| lr 1.16e-05 | 4211.44 ms | 32.1% bf16 MFU | 122350 tok/s step 17885/19560 | loss 3.326338 (+0.57z)| norm 0.2373 (-0.43z)| lr 1.16e-05 | 4142.81 ms | 32.6% bf16 MFU | 122560 tok/s step 17886/19560 | loss 3.330501 (+0.66z)| norm 0.2402 (-0.04z)| lr 1.16e-05 | 4139.45 ms | 32.6% bf16 MFU | 122765 tok/s step 17887/19560 | loss 3.359872 (+1.35z)| norm 0.2312 (-1.27z)| lr 1.16e-05 | 4153.03 ms | 32.5% bf16 MFU | 122939 tok/s step 17888/19560 | loss 3.288394 (-0.37z)| norm 0.2400 (-0.05z)| lr 1.16e-05 | 4164.58 ms | 32.4% bf16 MFU | 123086 tok/s step 17889/19560 | loss 3.315757 (+0.28z)| norm 0.2459 (+0.75z)| lr 1.16e-05 | 4152.24 ms | 32.5% bf16 MFU | 
123245 tok/s step 17890/19560 | loss 3.268039 (-0.87z)| norm 0.2317 (-1.19z)| lr 1.15e-05 | 4149.98 ms | 32.5% bf16 MFU | 123400 tok/s step 17891/19560 | loss 3.317476 (+0.32z)| norm 0.2300 (-1.40z)| lr 1.15e-05 | 4146.06 ms | 32.6% bf16 MFU | 123553 tok/s step 17892/19560 | loss 3.307417 (+0.07z)| norm 0.2360 (-0.59z)| lr 1.15e-05 | 4880.34 ms | 27.7% bf16 MFU | 122746 tok/s step 17893/19560 | loss 3.310520 (+0.13z)| norm 0.2384 (-0.26z)| lr 1.15e-05 | 4144.24 ms | 32.6% bf16 MFU | 122935 tok/s step 17894/19560 | loss 3.288769 (-0.41z)| norm 0.2371 (-0.44z)| lr 1.15e-05 | 4198.90 ms | 32.2% bf16 MFU | 123031 tok/s step 17895/19560 | loss 3.332588 (+0.66z)| norm 0.2281 (-1.66z)| lr 1.15e-05 | 4171.61 ms | 32.4% bf16 MFU | 123164 tok/s step 17896/19560 | loss 3.379505 (+1.78z)| norm 0.2457 (+0.72z)| lr 1.15e-05 | 4164.82 ms | 32.4% bf16 MFU | 123300 tok/s step 17897/19560 | loss 3.306906 (-0.01z)| norm 0.2325 (-1.08z)| lr 1.15e-05 | 4148.22 ms | 32.5% bf16 MFU | 123454 tok/s step 17898/19560 | loss 3.256650 (-1.31z)| norm 0.2383 (-0.29z)| lr 1.14e-05 | 4210.55 ms | 32.1% bf16 MFU | 123507 tok/s step 17899/19560 | loss 3.325064 (+0.45z)| norm 0.2433 (+0.40z)| lr 1.14e-05 | 4196.58 ms | 32.2% bf16 MFU | 123578 tok/s step 17900/19560 | loss 3.267942 (-1.01z)| norm 0.2339 (-0.88z)| lr 1.14e-05 | 4148.64 ms | 32.5% bf16 MFU | 123718 tok/s step 17901/19560 | loss 3.317421 (+0.25z)| norm 0.2337 (-0.91z)| lr 1.14e-05 | 4152.34 ms | 32.5% bf16 MFU | 123846 tok/s step 17902/19560 | loss 3.353570 (+1.16z)| norm 0.2647 (+3.21z)| lr 1.14e-05 | 4238.89 ms | 31.9% bf16 MFU | 123838 tok/s step 17903/19560 | loss 3.311862 (+0.08z)| norm 0.2276 (-1.67z)| lr 1.14e-05 | 4168.01 ms | 32.4% bf16 MFU | 123935 tok/s step 17904/19560 | loss 3.265992 (-1.13z)| norm 0.2366 (-0.48z)| lr 1.14e-05 | 4156.69 ms | 32.5% bf16 MFU | 124045 tok/s step 17905/19560 | loss 3.315808 (+0.16z)| norm 0.2288 (-1.50z)| lr 1.13e-05 | 4140.37 ms | 32.6% bf16 MFU | 124174 tok/s step 17906/19560 | loss 3.290551 
(-0.52z)| norm 0.2393 (-0.10z)| lr 1.13e-05 | 4152.59 ms | 32.5% bf16 MFU | 124278 tok/s step 17907/19560 | loss 3.325300 (+0.41z)| norm 0.2305 (-1.24z)| lr 1.13e-05 | 4150.71 ms | 32.5% bf16 MFU | 124380 tok/s step 17908/19560 | loss 3.283886 (-0.70z)| norm 0.2237 (-2.10z)| lr 1.13e-05 | 4450.36 ms | 30.3% bf16 MFU | 124051 tok/s step 17909/19560 | loss 3.294033 (-0.42z)| norm 0.2334 (-0.82z)| lr 1.13e-05 | 4156.89 ms | 32.5% bf16 MFU | 124155 tok/s step 17910/19560 | loss 3.320293 (+0.29z)| norm 0.2317 (-1.02z)| lr 1.13e-05 | 4253.04 ms | 31.7% bf16 MFU | 124111 tok/s step 17911/19560 | loss 3.316239 (+0.17z)| norm 0.2540 (+1.92z)| lr 1.13e-05 | 4172.02 ms | 32.4% bf16 MFU | 124189 tok/s step 17912/19560 | loss 3.382494 (+1.94z)| norm 0.2458 (+0.83z)| lr 1.12e-05 | 4150.22 ms | 32.5% bf16 MFU | 124296 tok/s step 17913/19560 | loss 3.289258 (-0.57z)| norm 0.2376 (-0.26z)| lr 1.12e-05 | 4156.68 ms | 32.5% bf16 MFU | 124387 tok/s step 17914/19560 | loss 3.302860 (-0.21z)| norm 0.2436 (+0.53z)| lr 1.12e-05 | 4145.37 ms | 32.6% bf16 MFU | 124492 tok/s step 17915/19560 | loss 3.323775 (+0.35z)| norm 0.2457 (+0.79z)| lr 1.12e-05 | 4183.46 ms | 32.3% bf16 MFU | 124533 tok/s step 17916/19560 | loss 3.332579 (+0.59z)| norm 0.2426 (+0.39z)| lr 1.12e-05 | 4143.88 ms | 32.6% bf16 MFU | 124633 tok/s step 17917/19560 | loss 3.318758 (+0.20z)| norm 0.2466 (+0.91z)| lr 1.12e-05 | 4143.64 ms | 32.6% bf16 MFU | 124728 tok/s step 17918/19560 | loss 3.269213 (-1.14z)| norm 0.2255 (-1.86z)| lr 1.12e-05 | 4154.63 ms | 32.5% bf16 MFU | 124801 tok/s step 17919/19560 | loss 3.292214 (-0.50z)| norm 0.2320 (-0.99z)| lr 1.12e-05 | 4149.34 ms | 32.5% bf16 MFU | 124879 tok/s step 17920/19560 | loss 3.302915 (-0.21z)| norm 0.2480 (+1.11z)| lr 1.11e-05 | 4229.71 ms | 31.9% bf16 MFU | 124832 tok/s step 17921/19560 | loss 3.292362 (-0.50z)| norm 0.2284 (-1.45z)| lr 1.11e-05 | 4151.74 ms | 32.5% bf16 MFU | 124905 tok/s step 17922/19560 | loss 3.291617 (-0.51z)| norm 0.2378 (-0.21z)| lr 1.11e-05 | 
4150.25 ms | 32.5% bf16 MFU | 124976 tok/s step 17923/19560 | loss 3.295036 (-0.42z)| norm 0.2281 (-1.46z)| lr 1.11e-05 | 4155.71 ms | 32.5% bf16 MFU | 125035 tok/s step 17924/19560 | loss 3.305174 (-0.14z)| norm 0.2298 (-1.23z)| lr 1.11e-05 | 4155.30 ms | 32.5% bf16 MFU | 125092 tok/s step 17925/19560 | loss 3.342191 (+0.88z)| norm 0.2343 (-0.63z)| lr 1.11e-05 | 4145.38 ms | 32.6% bf16 MFU | 125161 tok/s step 17926/19560 | loss 3.305069 (-0.15z)| norm 0.2467 (+0.98z)| lr 1.11e-05 | 4152.11 ms | 32.5% bf16 MFU | 125217 tok/s step 17927/19560 | loss 3.348069 (+1.04z)| norm 0.2312 (-1.02z)| lr 1.10e-05 | 4189.71 ms | 32.2% bf16 MFU | 125213 tok/s step 17928/19560 | loss 3.282467 (-0.82z)| norm 0.2342 (-0.63z)| lr 1.10e-05 | 4225.31 ms | 32.0% bf16 MFU | 125156 tok/s step 17929/19560 | loss 3.353314 (+1.17z)| norm 0.2416 (+0.32z)| lr 1.10e-05 | 4164.56 ms | 32.4% bf16 MFU | 125193 tok/s step 17930/19560 | loss 3.323745 (+0.33z)| norm 0.2407 (+0.19z)| lr 1.10e-05 | 4146.16 ms | 32.6% bf16 MFU | 125256 tok/s step 17931/19560 | loss 3.268605 (-1.23z)| norm 0.2350 (-0.55z)| lr 1.10e-05 | 4174.51 ms | 32.3% bf16 MFU | 125273 tok/s step 17932/19560 | loss 3.317504 (+0.14z)| norm 0.2265 (-1.63z)| lr 1.10e-05 | 4190.66 ms | 32.2% bf16 MFU | 125265 tok/s step 17933/19560 | loss 3.302530 (-0.28z)| norm 0.2362 (-0.38z)| lr 1.10e-05 | 4166.41 ms | 32.4% bf16 MFU | 125293 tok/s step 17934/19560 | loss 3.318484 (+0.17z)| norm 0.2495 (+1.32z)| lr 1.10e-05 | 4156.03 ms | 32.5% bf16 MFU | 125336 tok/s step 17935/19560 | loss 3.280136 (-0.92z)| norm 0.2410 (+0.25z)| lr 1.09e-05 | 4179.52 ms | 32.3% bf16 MFU | 125341 tok/s step 17936/19560 | loss 3.296632 (-0.44z)| norm 0.2511 (+1.52z)| lr 1.09e-05 | 4155.09 ms | 32.5% bf16 MFU | 125383 tok/s step 17937/19560 | loss 3.335097 (+0.65z)| norm 0.2494 (+1.30z)| lr 1.09e-05 | 4152.28 ms | 32.5% bf16 MFU | 125427 tok/s step 17938/19560 | loss 3.324987 (+0.35z)| norm 0.2413 (+0.26z)| lr 1.09e-05 | 4162.78 ms | 32.4% bf16 MFU | 125453 tok/s step 
17939/19560 | loss 3.329915 (+0.48z)| norm 0.2331 (-0.78z)| lr 1.09e-05 | 4151.48 ms | 32.5% bf16 MFU | 125495 tok/s step 17940/19560 | loss 3.264417 (-1.40z)| norm 0.2344 (-0.61z)| lr 1.09e-05 | 4186.46 ms | 32.3% bf16 MFU | 125482 tok/s step 17941/19560 | loss 3.258235 (-1.55z)| norm 0.2534 (+1.79z)| lr 1.09e-05 | 4167.82 ms | 32.4% bf16 MFU | 125498 tok/s step 17942/19560 | loss 3.289947 (-0.63z)| norm 0.2416 (+0.32z)| lr 1.08e-05 | 4147.85 ms | 32.6% bf16 MFU | 125543 tok/s step 17943/19560 | loss 3.304139 (-0.23z)| norm 0.2353 (-0.49z)| lr 1.08e-05 | 4153.71 ms | 32.5% bf16 MFU | 125577 tok/s step 17944/19560 | loss 3.301142 (-0.31z)| norm 0.2354 (-0.48z)| lr 1.08e-05 | 4143.21 ms | 32.6% bf16 MFU | 125625 tok/s step 17945/19560 | loss 3.295799 (-0.49z)| norm 0.2408 (+0.21z)| lr 1.08e-05 | 4153.66 ms | 32.5% bf16 MFU | 125655 tok/s step 17946/19560 | loss 3.326293 (+0.41z)| norm 0.2433 (+0.53z)| lr 1.08e-05 | 4156.36 ms | 32.5% bf16 MFU | 125679 tok/s step 17947/19560 | loss 3.247056 (-1.92z)| norm 0.2469 (+0.98z)| lr 1.08e-05 | 4151.17 ms | 32.5% bf16 MFU | 125710 tok/s step 17948/19560 | loss 3.299868 (-0.37z)| norm 0.2316 (-0.96z)| lr 1.08e-05 | 4148.54 ms | 32.5% bf16 MFU | 125744 tok/s step 17949/19560 | loss 3.315477 (+0.09z)| norm 0.2309 (-1.04z)| lr 1.08e-05 | 4146.98 ms | 32.6% bf16 MFU | 125778 tok/s step 17950/19560 | loss 3.284866 (-0.82z)| norm 0.2475 (+1.04z)| lr 1.07e-05 | 4154.54 ms | 32.5% bf16 MFU | 125799 tok/s step 17951/19560 | loss 3.237399 (-2.18z)| norm 0.2335 (-0.71z)| lr 1.07e-05 | 4148.98 ms | 32.5% bf16 MFU | 125827 tok/s step 17952/19560 | loss 3.323617 (+0.32z)| norm 0.2511 (+1.47z)| lr 1.07e-05 | 4152.08 ms | 32.5% bf16 MFU | 125849 tok/s step 17953/19560 | loss 3.356322 (+1.27z)| norm 0.2444 (+0.63z)| lr 1.07e-05 | 4162.57 ms | 32.4% bf16 MFU | 125854 tok/s step 17954/19560 | loss 3.242081 (-2.04z)| norm 0.2451 (+0.71z)| lr 1.07e-05 | 4151.39 ms | 32.5% bf16 MFU | 125876 tok/s step 17955/19560 | loss 3.283051 (-0.84z)| norm 
0.2301 (-1.15z)| lr 1.07e-05 | 4147.55 ms | 32.6% bf16 MFU | 125903 tok/s step 17956/19560 | loss 3.352144 (+1.16z)| norm 0.2395 (+0.02z)| lr 1.07e-05 | 4149.87 ms | 32.5% bf16 MFU | 125925 tok/s step 17957/19560 | loss 3.315851 (+0.09z)| norm 0.2365 (-0.35z)| lr 1.06e-05 | 4151.30 ms | 32.5% bf16 MFU | 125943 tok/s step 17958/19560 | loss 3.283751 (-0.84z)| norm 0.2356 (-0.45z)| lr 1.06e-05 | 4146.10 ms | 32.6% bf16 MFU | 125969 tok/s step 17959/19560 | loss 3.219265 (-2.64z)| norm 0.2329 (-0.78z)| lr 1.06e-05 | 4164.17 ms | 32.4% bf16 MFU | 125966 tok/s step 17960/19560 | loss 3.336365 (+0.73z)| norm 0.2398 (+0.09z)| lr 1.06e-05 | 4259.10 ms | 31.7% bf16 MFU | 125822 tok/s step 17961/19560 | loss 3.264451 (-1.32z)| norm 0.2293 (-1.21z)| lr 1.06e-05 | 4153.99 ms | 32.5% bf16 MFU | 125842 tok/s step 17962/19560 | loss 3.285373 (-0.72z)| norm 0.2362 (-0.35z)| lr 1.06e-05 | 4205.80 ms | 32.1% bf16 MFU | 125783 tok/s step 17963/19560 | loss 3.334422 (+0.68z)| norm 0.2381 (-0.12z)| lr 1.06e-05 | 4171.72 ms | 32.4% bf16 MFU | 125777 tok/s step 17964/19560 | loss 3.311437 (+0.02z)| norm 0.2373 (-0.22z)| lr 1.06e-05 | 4135.24 ms | 32.7% bf16 MFU | 125828 tok/s step 17965/19560 | loss 3.320656 (+0.27z)| norm 0.2238 (-1.90z)| lr 1.05e-05 | 4116.20 ms | 32.8% bf16 MFU | 125905 tok/s step 17966/19560 | loss 3.233503 (-2.19z)| norm 0.3042 (+6.61z)| lr 1.05e-05 | 4134.82 ms | 32.7% bf16 MFU | 125950 tok/s step 17967/19560 | loss 3.303048 (-0.21z)| norm 0.2325 (-0.71z)| lr 1.05e-05 | 4135.94 ms | 32.6% bf16 MFU | 125990 tok/s step 17968/19560 | loss 3.280779 (-0.83z)| norm 0.2337 (-0.57z)| lr 1.05e-05 | 4161.12 ms | 32.4% bf16 MFU | 125991 tok/s step 17969/19560 | loss 3.260319 (-1.40z)| norm 0.2412 (+0.19z)| lr 1.05e-05 | 4158.90 ms | 32.5% bf16 MFU | 125994 tok/s step 17970/19560 | loss 3.242989 (-1.85z)| norm 0.2364 (-0.29z)| lr 1.05e-05 | 4216.93 ms | 32.0% bf16 MFU | 125911 tok/s step 17971/19560 | loss 3.271608 (-1.04z)| norm 0.2392 (-0.01z)| lr 1.05e-05 | 4126.06 ms | 
32.7% bf16 MFU | 125969 tok/s step 17972/19560 | loss 3.297538 (-0.32z)| norm 0.2359 (-0.35z)| lr 1.04e-05 | 4121.93 ms | 32.8% bf16 MFU | 126030 tok/s step 17973/19560 | loss 3.364901 (+1.56z)| norm 0.2335 (-0.59z)| lr 1.04e-05 | 4118.60 ms | 32.8% bf16 MFU | 126094 tok/s step 17974/19560 | loss 3.276343 (-0.89z)| norm 0.2360 (-0.33z)| lr 1.04e-05 | 4198.39 ms | 32.2% bf16 MFU | 126033 tok/s step 17975/19560 | loss 3.288543 (-0.54z)| norm 0.2372 (-0.20z)| lr 1.04e-05 | 4100.20 ms | 32.9% bf16 MFU | 126125 tok/s step 17976/19560 | loss 3.299877 (-0.22z)| norm 0.2367 (-0.24z)| lr 1.04e-05 | 4116.76 ms | 32.8% bf16 MFU | 126186 tok/s step 17977/19560 | loss 3.290477 (-0.48z)| norm 0.2370 (-0.21z)| lr 1.04e-05 | 4123.94 ms | 32.7% bf16 MFU | 126233 tok/s step 17978/19560 | loss 3.263188 (-1.27z)| norm 0.2294 (-0.98z)| lr 1.04e-05 | 4120.74 ms | 32.8% bf16 MFU | 126283 tok/s step 17979/19560 | loss 3.309753 (+0.11z)| norm 0.2454 (+0.67z)| lr 1.04e-05 | 4121.37 ms | 32.8% bf16 MFU | 126330 tok/s step 17980/19560 | loss 3.270546 (-1.04z)| norm 0.2249 (-1.44z)| lr 1.03e-05 | 4121.34 ms | 32.8% bf16 MFU | 126374 tok/s step 17981/19560 | loss 3.280427 (-0.74z)| norm 0.2448 (+0.64z)| lr 1.03e-05 | 4125.70 ms | 32.7% bf16 MFU | 126409 tok/s step 17982/19560 | loss 3.262902 (-1.25z)| norm 0.2348 (-0.42z)| lr 1.03e-05 | 4156.98 ms | 32.5% bf16 MFU | 126395 tok/s step 17983/19560 | loss 3.296741 (-0.25z)| norm 0.2387 (-0.01z)| lr 1.03e-05 | 4134.77 ms | 32.7% bf16 MFU | 126415 tok/s step 17984/19560 | loss 3.300844 (-0.11z)| norm 0.2431 (+0.48z)| lr 1.03e-05 | 4140.20 ms | 32.6% bf16 MFU | 126426 tok/s step 17985/19560 | loss 3.341073 (+1.10z)| norm 0.2486 (+1.07z)| lr 1.03e-05 | 4198.95 ms | 32.2% bf16 MFU | 126348 tok/s step 17986/19560 | loss 3.225966 (-2.31z)| norm 0.2364 (-0.25z)| lr 1.03e-05 | 4128.61 ms | 32.7% bf16 MFU | 126380 tok/s step 17987/19560 | loss 3.297562 (-0.17z)| norm 0.2314 (-0.78z)| lr 1.03e-05 | 4214.71 ms | 32.0% bf16 MFU | 126281 tok/s step 17988/19560 
| loss 3.306221 (+0.09z)| norm 0.2294 (-0.98z)| lr 1.02e-05 | 4199.72 ms | 32.1% bf16 MFU | 126208 tok/s step 17989/19560 | loss 3.323882 (+0.63z)| norm 0.2366 (-0.21z)| lr 1.02e-05 | 4203.45 ms | 32.1% bf16 MFU | 126134 tok/s step 17990/19560 | loss 3.262388 (-1.23z)| norm 0.2356 (-0.31z)| lr 1.02e-05 | 4133.63 ms | 32.7% bf16 MFU | 126169 tok/s step 17991/19560 | loss 3.249161 (-1.66z)| norm 0.2505 (+1.31z)| lr 1.02e-05 | 4178.52 ms | 32.3% bf16 MFU | 126135 tok/s step 17992/19560 | loss 3.350681 (+1.54z)| norm 0.2380 (-0.05z)| lr 1.02e-05 | 4126.62 ms | 32.7% bf16 MFU | 126180 tok/s step 17993/19560 | loss 3.319324 (+0.54z)| norm 0.2343 (-0.44z)| lr 1.02e-05 | 4169.14 ms | 32.4% bf16 MFU | 126159 tok/s step 17994/19560 | loss 3.264144 (-1.17z)| norm 0.2298 (-0.92z)| lr 1.02e-05 | 4120.39 ms | 32.8% bf16 MFU | 126213 tok/s step 17995/19560 | loss 3.299636 (-0.06z)| norm 0.2384 (+0.00z)| lr 1.01e-05 | 4130.98 ms | 32.7% bf16 MFU | 126248 tok/s step 17996/19560 | loss 3.315995 (+0.46z)| norm 0.2379 (-0.05z)| lr 1.01e-05 | 4194.69 ms | 32.2% bf16 MFU | 126185 tok/s step 17997/19560 | loss 3.320961 (+0.61z)| norm 0.2348 (-0.38z)| lr 1.01e-05 | 4602.89 ms | 29.3% bf16 MFU | 125571 tok/s step 17998/19560 | loss 3.324400 (+0.71z)| norm 0.2304 (-0.86z)| lr 1.01e-05 | 4122.79 ms | 32.7% bf16 MFU | 125651 tok/s step 17999/19560 | loss 3.273224 (-0.90z)| norm 0.2359 (-0.27z)| lr 1.01e-05 | 4110.66 ms | 32.8% bf16 MFU | 125746 tok/s step 18000/19560 | loss 3.299564 (-0.08z)| norm 0.2323 (-0.66z)| lr 1.01e-05 | 4114.28 ms | 32.8% bf16 MFU | 125830 tok/s val loss 3.268161 evaluating HellaSwag: 0/1256 evaluating HellaSwag: 10/1256 evaluating HellaSwag: 20/1256 evaluating HellaSwag: 30/1256 evaluating HellaSwag: 40/1256 evaluating HellaSwag: 50/1256 evaluating HellaSwag: 60/1256 evaluating HellaSwag: 70/1256 evaluating HellaSwag: 80/1256 evaluating HellaSwag: 90/1256 evaluating HellaSwag: 100/1256 evaluating HellaSwag: 110/1256 evaluating HellaSwag: 120/1256 evaluating 
HellaSwag: 130/1256 evaluating HellaSwag: 140/1256 evaluating HellaSwag: 150/1256 evaluating HellaSwag: 160/1256 evaluating HellaSwag: 170/1256 evaluating HellaSwag: 180/1256 evaluating HellaSwag: 190/1256 evaluating HellaSwag: 200/1256 evaluating HellaSwag: 210/1256 evaluating HellaSwag: 220/1256 evaluating HellaSwag: 230/1256 evaluating HellaSwag: 240/1256 evaluating HellaSwag: 250/1256 evaluating HellaSwag: 260/1256 evaluating HellaSwag: 270/1256 evaluating HellaSwag: 280/1256 evaluating HellaSwag: 290/1256 evaluating HellaSwag: 300/1256 evaluating HellaSwag: 310/1256 evaluating HellaSwag: 320/1256 evaluating HellaSwag: 330/1256 evaluating HellaSwag: 340/1256 evaluating HellaSwag: 350/1256 evaluating HellaSwag: 360/1256 evaluating HellaSwag: 370/1256 evaluating HellaSwag: 380/1256 evaluating HellaSwag: 390/1256 evaluating HellaSwag: 400/1256 evaluating HellaSwag: 410/1256 evaluating HellaSwag: 420/1256 evaluating HellaSwag: 430/1256 evaluating HellaSwag: 440/1256 evaluating HellaSwag: 450/1256 evaluating HellaSwag: 460/1256 evaluating HellaSwag: 470/1256 evaluating HellaSwag: 480/1256 evaluating HellaSwag: 490/1256 evaluating HellaSwag: 500/1256 evaluating HellaSwag: 510/1256 evaluating HellaSwag: 520/1256 evaluating HellaSwag: 530/1256 evaluating HellaSwag: 540/1256 evaluating HellaSwag: 550/1256 evaluating HellaSwag: 560/1256 evaluating HellaSwag: 570/1256 evaluating HellaSwag: 580/1256 evaluating HellaSwag: 590/1256 evaluating HellaSwag: 600/1256 evaluating HellaSwag: 610/1256 evaluating HellaSwag: 620/1256 evaluating HellaSwag: 630/1256 evaluating HellaSwag: 640/1256 evaluating HellaSwag: 650/1256 evaluating HellaSwag: 660/1256 evaluating HellaSwag: 670/1256 evaluating HellaSwag: 680/1256 evaluating HellaSwag: 690/1256 evaluating HellaSwag: 700/1256 evaluating HellaSwag: 710/1256 evaluating HellaSwag: 720/1256 evaluating HellaSwag: 730/1256 evaluating HellaSwag: 740/1256 evaluating HellaSwag: 750/1256 evaluating HellaSwag: 760/1256 evaluating HellaSwag: 
770/1256 evaluating HellaSwag: 780/1256 evaluating HellaSwag: 790/1256 evaluating HellaSwag: 800/1256 evaluating HellaSwag: 810/1256 evaluating HellaSwag: 820/1256 evaluating HellaSwag: 830/1256 evaluating HellaSwag: 840/1256 evaluating HellaSwag: 850/1256 evaluating HellaSwag: 860/1256 evaluating HellaSwag: 870/1256 evaluating HellaSwag: 880/1256 evaluating HellaSwag: 890/1256 evaluating HellaSwag: 900/1256 evaluating HellaSwag: 910/1256 evaluating HellaSwag: 920/1256 evaluating HellaSwag: 930/1256 evaluating HellaSwag: 940/1256 evaluating HellaSwag: 950/1256 evaluating HellaSwag: 960/1256 evaluating HellaSwag: 970/1256 evaluating HellaSwag: 980/1256 evaluating HellaSwag: 990/1256 evaluating HellaSwag: 1000/1256 evaluating HellaSwag: 1010/1256 evaluating HellaSwag: 1020/1256 evaluating HellaSwag: 1030/1256 evaluating HellaSwag: 1040/1256 evaluating HellaSwag: 1050/1256 evaluating HellaSwag: 1060/1256 evaluating HellaSwag: 1070/1256 evaluating HellaSwag: 1080/1256 evaluating HellaSwag: 1090/1256 evaluating HellaSwag: 1100/1256 evaluating HellaSwag: 1110/1256 evaluating HellaSwag: 1120/1256 evaluating HellaSwag: 1130/1256 evaluating HellaSwag: 1140/1256 evaluating HellaSwag: 1150/1256 evaluating HellaSwag: 1160/1256 evaluating HellaSwag: 1170/1256 evaluating HellaSwag: 1180/1256 evaluating HellaSwag: 1190/1256 evaluating HellaSwag: 1200/1256 evaluating HellaSwag: 1210/1256 evaluating HellaSwag: 1220/1256 evaluating HellaSwag: 1230/1256 evaluating HellaSwag: 1240/1256 evaluating HellaSwag: 1250/1256 HellaSwag: 3045/10042 = 0.303226 step 18001/19560 | loss 3.305017 (+0.10z)| norm 0.2405 (+0.23z)| lr 1.01e-05 | 4169.85 ms | 32.4% bf16 MFU | 125825 tok/s step 18002/19560 | loss 3.358358 (+1.77z)| norm 0.2394 (+0.10z)| lr 1.01e-05 | 4152.47 ms | 32.5% bf16 MFU | 125847 tok/s step 18003/19560 | loss 3.393974 (+2.78z)| norm 0.2501 (+1.26z)| lr 1.00e-05 | 4155.32 ms | 32.5% bf16 MFU | 125863 tok/s step 18004/19560 | loss 3.369965 (+2.00z)| norm 0.2581 (+2.09z)| lr 1.00e-05 
| 4152.62 ms | 32.5% bf16 MFU | 125883 tok/s step 18005/19560 | loss 3.268856 (-1.04z)| norm 0.2340 (-0.50z)| lr 1.00e-05 | 4169.68 ms | 32.4% bf16 MFU | 125876 tok/s step 18006/19560 | loss 3.308742 (+0.17z)| norm 0.2349 (-0.38z)| lr 1.00e-05 | 4159.70 ms | 32.5% bf16 MFU | 125884 tok/s step 18007/19560 | loss 3.283233 (-0.60z)| norm 0.2343 (-0.45z)| lr 1.00e-05 | 4154.96 ms | 32.5% bf16 MFU | 125899 tok/s step 18008/19560 | loss 3.299271 (-0.11z)| norm 0.2419 (+0.37z)| lr 9.98e-06 | 4236.60 ms | 31.9% bf16 MFU | 125791 tok/s step 18009/19560 | loss 3.372434 (+2.06z)| norm 0.2244 (-1.49z)| lr 9.97e-06 | 4284.51 ms | 31.5% bf16 MFU | 125620 tok/s step 18010/19560 | loss 3.244800 (-1.72z)| norm 0.2363 (-0.20z)| lr 9.96e-06 | 4183.43 ms | 32.3% bf16 MFU | 125605 tok/s step 18011/19560 | loss 3.268835 (-0.99z)| norm 0.2503 (+1.28z)| lr 9.94e-06 | 4433.88 ms | 30.5% bf16 MFU | 125237 tok/s step 18012/19560 | loss 3.310146 (+0.23z)| norm 0.2397 (+0.15z)| lr 9.93e-06 | 4151.96 ms | 32.5% bf16 MFU | 125289 tok/s step 18013/19560 | loss 3.319226 (+0.50z)| norm 0.2383 (-0.00z)| lr 9.92e-06 | 4159.94 ms | 32.5% bf16 MFU | 125327 tok/s step 18014/19560 | loss 3.280846 (-0.63z)| norm 0.2265 (-1.24z)| lr 9.91e-06 | 4259.69 ms | 31.7% bf16 MFU | 125214 tok/s step 18015/19560 | loss 3.315639 (+0.42z)| norm 0.2305 (-0.81z)| lr 9.89e-06 | 4159.43 ms | 32.5% bf16 MFU | 125256 tok/s step 18016/19560 | loss 3.269941 (-0.95z)| norm 0.2379 (-0.03z)| lr 9.88e-06 | 4151.04 ms | 32.5% bf16 MFU | 125308 tok/s step 18017/19560 | loss 3.236007 (-1.92z)| norm 0.2615 (+2.41z)| lr 9.87e-06 | 4152.93 ms | 32.5% bf16 MFU | 125355 tok/s step 18018/19560 | loss 3.330702 (+0.87z)| norm 0.2381 (-0.03z)| lr 9.85e-06 | 4146.33 ms | 32.6% bf16 MFU | 125410 tok/s step 18019/19560 | loss 3.367524 (+1.92z)| norm 0.2342 (-0.44z)| lr 9.84e-06 | 4139.30 ms | 32.6% bf16 MFU | 125472 tok/s step 18020/19560 | loss 3.291736 (-0.29z)| norm 0.2562 (+1.83z)| lr 9.83e-06 | 4147.09 ms | 32.6% bf16 MFU | 125520 tok/s 
step 18021/19560 | loss 3.284726 (-0.49z)| norm 0.2299 (-0.88z)| lr 9.82e-06 | 4138.42 ms | 32.6% bf16 MFU | 125578 tok/s step 18022/19560 | loss 3.326118 (+0.71z)| norm 0.2386 (+0.02z)| lr 9.80e-06 | 4295.90 ms | 31.4% bf16 MFU | 125401 tok/s step 18023/19560 | loss 3.302721 (+0.04z)| norm 0.2466 (+0.82z)| lr 9.79e-06 | 4525.62 ms | 29.8% bf16 MFU | 124924 tok/s step 18024/19560 | loss 3.286506 (-0.43z)| norm 0.2309 (-0.78z)| lr 9.78e-06 | 4122.63 ms | 32.8% bf16 MFU | 125036 tok/s step 18025/19560 | loss 3.373600 (+2.12z)| norm 0.2246 (-1.42z)| lr 9.77e-06 | 4903.12 ms | 27.5% bf16 MFU | 124131 tok/s step 18026/19560 | loss 3.283419 (-0.53z)| norm 0.2284 (-1.02z)| lr 9.75e-06 | 4642.33 ms | 29.1% bf16 MFU | 123571 tok/s step 18027/19560 | loss 3.279156 (-0.65z)| norm 0.2305 (-0.79z)| lr 9.74e-06 | 4588.09 ms | 29.4% bf16 MFU | 123106 tok/s step 18028/19560 | loss 3.382446 (+2.33z)| norm 0.2516 (+1.33z)| lr 9.73e-06 | 4147.59 ms | 32.6% bf16 MFU | 123271 tok/s step 18029/19560 | loss 3.405636 (+2.89z)| norm 0.2402 (+0.18z)| lr 9.72e-06 | 4138.06 ms | 32.6% bf16 MFU | 123443 tok/s step 18030/19560 | loss 3.285300 (-0.48z)| norm 0.2365 (-0.18z)| lr 9.70e-06 | 4142.88 ms | 32.6% bf16 MFU | 123598 tok/s step 18031/19560 | loss 3.263619 (-1.07z)| norm 0.2302 (-0.84z)| lr 9.69e-06 | 10006.00 ms | 13.5% bf16 MFU | 120038 tok/s step 18032/19560 | loss 3.307070 (+0.14z)| norm 0.2524 (+1.45z)| lr 9.68e-06 | 4318.16 ms | 31.3% bf16 MFU | 120107 tok/s step 18033/19560 | loss 3.293582 (-0.24z)| norm 0.2298 (-0.89z)| lr 9.67e-06 | 5912.88 ms | 22.8% bf16 MFU | 118535 tok/s step 18034/19560 | loss 3.328681 (+0.75z)| norm 0.2441 (+0.59z)| lr 9.65e-06 | 7805.05 ms | 17.3% bf16 MFU | 115967 tok/s step 18035/19560 | loss 3.272411 (-0.83z)| norm 0.2394 (+0.09z)| lr 9.64e-06 | 7891.43 ms | 17.1% bf16 MFU | 113490 tok/s step 18036/19560 | loss 3.319877 (+0.50z)| norm 0.2402 (+0.16z)| lr 9.63e-06 | 4282.02 ms | 31.5% bf16 MFU | 113938 tok/s step 18037/19560 | loss 3.303127 (+0.03z)| 
norm 0.2358 (-0.30z)| lr 9.61e-06 | 4192.60 ms | 32.2% bf16 MFU | 114494 tok/s step 18038/19560 | loss 3.276697 (-0.71z)| norm 0.2507 (+1.24z)| lr 9.60e-06 | 4494.16 ms | 30.0% bf16 MFU | 114602 tok/s step 18039/19560 | loss 3.281793 (-0.56z)| norm 0.2442 (+0.58z)| lr 9.59e-06 | 4490.02 ms | 30.1% bf16 MFU | 114710 tok/s step 18040/19560 | loss 3.350410 (+1.40z)| norm 0.2315 (-0.75z)| lr 9.58e-06 | 4128.44 ms | 32.7% bf16 MFU | 115324 tok/s step 18041/19560 | loss 3.313485 (+0.34z)| norm 0.2303 (-0.87z)| lr 9.56e-06 | 4282.12 ms | 31.5% bf16 MFU | 115680 tok/s step 18042/19560 | loss 3.306892 (+0.15z)| norm 0.2332 (-0.55z)| lr 9.55e-06 | 4154.52 ms | 32.5% bf16 MFU | 116206 tok/s step 18043/19560 | loss 3.301836 (+0.01z)| norm 0.2417 (+0.35z)| lr 9.54e-06 | 4366.06 ms | 30.9% bf16 MFU | 116400 tok/s step 18044/19560 | loss 3.308747 (+0.22z)| norm 0.2312 (-0.75z)| lr 9.53e-06 | 4166.03 ms | 32.4% bf16 MFU | 116872 tok/s step 18045/19560 | loss 3.249069 (-1.47z)| norm 0.2397 (+0.15z)| lr 9.51e-06 | 4155.75 ms | 32.5% bf16 MFU | 117336 tok/s step 18046/19560 | loss 3.243237 (-1.61z)| norm 0.2364 (-0.21z)| lr 9.50e-06 | 4141.16 ms | 32.6% bf16 MFU | 117800 tok/s step 18047/19560 | loss 3.273345 (-0.76z)| norm 0.2229 (-1.62z)| lr 9.49e-06 | 4161.26 ms | 32.4% bf16 MFU | 118209 tok/s step 18048/19560 | loss 3.312711 (+0.35z)| norm 0.2672 (+2.94z)| lr 9.48e-06 | 4169.81 ms | 32.4% bf16 MFU | 118586 tok/s step 18049/19560 | loss 3.399698 (+2.69z)| norm 0.2684 (+2.94z)| lr 9.46e-06 | 4147.23 ms | 32.6% bf16 MFU | 118977 tok/s step 18050/19560 | loss 3.289344 (-0.32z)| norm 0.2291 (-0.95z)| lr 9.45e-06 | 4224.26 ms | 32.0% bf16 MFU | 119234 tok/s step 18051/19560 | loss 3.335053 (+0.91z)| norm 0.3326 (+7.13z)| lr 9.44e-06 | 4139.55 ms | 32.6% bf16 MFU | 119605 tok/s step 18052/19560 | loss 3.337481 (+0.97z)| norm 0.2513 (+0.89z)| lr 9.43e-06 | 4185.99 ms | 32.3% bf16 MFU | 119887 tok/s step 18053/19560 | loss 3.302885 (+0.04z)| norm 0.2317 (-0.61z)| lr 9.42e-06 | 4138.39 ms 
| 32.6% bf16 MFU | 120227 tok/s step 18054/19560 | loss 3.355972 (+1.46z)| norm 0.2314 (-0.63z)| lr 9.40e-06 | 4159.62 ms | 32.5% bf16 MFU | 120518 tok/s step 18055/19560 | loss 3.295903 (-0.15z)| norm 0.2377 (-0.14z)| lr 9.39e-06 | 4156.18 ms | 32.5% bf16 MFU | 120800 tok/s step 18056/19560 | loss 3.300077 (-0.04z)| norm 0.2383 (-0.10z)| lr 9.38e-06 | 4154.29 ms | 32.5% bf16 MFU | 121070 tok/s step 18057/19560 | loss 3.281657 (-0.53z)| norm 0.2488 (+0.70z)| lr 9.37e-06 | 4165.08 ms | 32.4% bf16 MFU | 121310 tok/s step 18058/19560 | loss 3.392575 (+2.44z)| norm 0.2407 (+0.08z)| lr 9.35e-06 | 4152.91 ms | 32.5% bf16 MFU | 121557 tok/s step 18059/19560 | loss 3.308952 (+0.19z)| norm 0.2382 (-0.11z)| lr 9.34e-06 | 4157.76 ms | 32.5% bf16 MFU | 121784 tok/s step 18060/19560 | loss 3.343158 (+1.10z)| norm 0.2430 (+0.24z)| lr 9.33e-06 | 4142.87 ms | 32.6% bf16 MFU | 122022 tok/s step 18061/19560 | loss 3.365397 (+1.66z)| norm 0.2423 (+0.19z)| lr 9.32e-06 | 4170.79 ms | 32.4% bf16 MFU | 122207 tok/s step 18062/19560 | loss 3.313213 (+0.28z)| norm 0.2438 (+0.30z)| lr 9.30e-06 | 4154.74 ms | 32.5% bf16 MFU | 122406 tok/s step 18063/19560 | loss 3.297309 (-0.14z)| norm 0.2486 (+0.67z)| lr 9.29e-06 | 4153.19 ms | 32.5% bf16 MFU | 122597 tok/s step 18064/19560 | loss 3.297702 (-0.13z)| norm 0.2231 (-1.27z)| lr 9.28e-06 | 4151.69 ms | 32.5% bf16 MFU | 122782 tok/s step 18065/19560 | loss 3.279282 (-0.61z)| norm 0.2390 (-0.05z)| lr 9.27e-06 | 4198.23 ms | 32.2% bf16 MFU | 122887 tok/s step 18066/19560 | loss 3.397186 (+2.45z)| norm 0.2726 (+2.45z)| lr 9.25e-06 | 4173.85 ms | 32.3% bf16 MFU | 123023 tok/s step 18067/19560 | loss 3.240133 (-1.60z)| norm 0.2638 (+1.76z)| lr 9.24e-06 | 4158.48 ms | 32.5% bf16 MFU | 123176 tok/s step 18068/19560 | loss 3.250865 (-1.31z)| norm 0.2370 (-0.23z)| lr 9.23e-06 | 4172.85 ms | 32.4% bf16 MFU | 123299 tok/s step 18069/19560 | loss 3.297986 (-0.11z)| norm 0.2409 (+0.07z)| lr 9.22e-06 | 4177.62 ms | 32.3% bf16 MFU | 123409 tok/s step 
18070/19560 | loss 3.360800 (+1.48z)| norm 0.2566 (+1.22z)| lr 9.21e-06 | 4243.40 ms | 31.8% bf16 MFU | 123416 tok/s step 18071/19560 | loss 3.343369 (+1.03z)| norm 0.2536 (+0.98z)| lr 9.19e-06 | 4166.13 ms | 32.4% bf16 MFU | 123538 tok/s step 18072/19560 | loss 3.305466 (+0.06z)| norm 0.2437 (+0.25z)| lr 9.18e-06 | 4159.00 ms | 32.5% bf16 MFU | 123664 tok/s step 18073/19560 | loss 3.317100 (+0.35z)| norm 0.2591 (+1.36z)| lr 9.17e-06 | 4191.33 ms | 32.2% bf16 MFU | 123735 tok/s step 18074/19560 | loss 3.263014 (-1.01z)| norm 0.2398 (-0.05z)| lr 9.16e-06 | 4161.21 ms | 32.4% bf16 MFU | 123848 tok/s step 18075/19560 | loss 3.311735 (+0.21z)| norm 0.2309 (-0.69z)| lr 9.14e-06 | 4160.56 ms | 32.5% bf16 MFU | 123956 tok/s step 18076/19560 | loss 3.355449 (+1.31z)| norm 0.2340 (-0.46z)| lr 9.13e-06 | 4199.52 ms | 32.2% bf16 MFU | 124001 tok/s step 18077/19560 | loss 3.294066 (-0.24z)| norm 0.2246 (-1.15z)| lr 9.12e-06 | 4161.55 ms | 32.4% bf16 MFU | 124100 tok/s step 18078/19560 | loss 3.309948 (+0.16z)| norm 0.2344 (-0.42z)| lr 9.11e-06 | 4169.58 ms | 32.4% bf16 MFU | 124182 tok/s step 18079/19560 | loss 3.306004 (+0.04z)| norm 0.2337 (-0.47z)| lr 9.09e-06 | 4171.79 ms | 32.4% bf16 MFU | 124257 tok/s step 18080/19560 | loss 3.302275 (-0.05z)| norm 0.2268 (-0.96z)| lr 9.08e-06 | 4174.02 ms | 32.3% bf16 MFU | 124324 tok/s step 18081/19560 | loss 3.300959 (-0.07z)| norm 0.2306 (-0.67z)| lr 9.07e-06 | 4183.10 ms | 32.3% bf16 MFU | 124375 tok/s step 18082/19560 | loss 3.316224 (+0.31z)| norm 0.2410 (+0.09z)| lr 9.06e-06 | 4172.82 ms | 32.4% bf16 MFU | 124438 tok/s step 18083/19560 | loss 3.279913 (-0.64z)| norm 0.2315 (-0.61z)| lr 9.05e-06 | 4161.08 ms | 32.4% bf16 MFU | 124516 tok/s step 18084/19560 | loss 3.295475 (-0.22z)| norm 0.2251 (-1.07z)| lr 9.03e-06 | 4181.87 ms | 32.3% bf16 MFU | 124559 tok/s step 18085/19560 | loss 3.365986 (+1.61z)| norm 0.2355 (-0.31z)| lr 9.02e-06 | 4166.32 ms | 32.4% bf16 MFU | 124623 tok/s step 18086/19560 | loss 3.298631 (-0.15z)| norm 
0.2407 (+0.06z)| lr 9.01e-06 | 4168.43 ms | 32.4% bf16 MFU | 124681 tok/s step 18087/19560 | loss 3.290971 (-0.37z)| norm 0.2245 (-1.10z)| lr 9.00e-06 | 4158.07 ms | 32.5% bf16 MFU | 124751 tok/s step 18088/19560 | loss 3.316976 (+0.33z)| norm 0.2310 (-0.62z)| lr 8.99e-06 | 4173.64 ms | 32.3% bf16 MFU | 124794 tok/s step 18089/19560 | loss 3.297740 (-0.20z)| norm 0.2288 (-0.78z)| lr 8.97e-06 | 4179.14 ms | 32.3% bf16 MFU | 124827 tok/s step 18090/19560 | loss 3.293976 (-0.30z)| norm 0.2403 (+0.04z)| lr 8.96e-06 | 4166.60 ms | 32.4% bf16 MFU | 124878 tok/s step 18091/19560 | loss 3.358602 (+1.42z)| norm 0.2453 (+0.40z)| lr 8.95e-06 | 4171.67 ms | 32.4% bf16 MFU | 124918 tok/s step 18092/19560 | loss 3.331058 (+0.68z)| norm 0.2290 (-0.77z)| lr 8.94e-06 | 4184.68 ms | 32.3% bf16 MFU | 124936 tok/s step 18093/19560 | loss 3.340241 (+0.92z)| norm 0.2917 (+3.54z)| lr 8.92e-06 | 4164.44 ms | 32.4% bf16 MFU | 124984 tok/s step 18094/19560 | loss 3.340156 (+0.91z)| norm 0.2332 (-0.48z)| lr 8.91e-06 | 4171.81 ms | 32.4% bf16 MFU | 125019 tok/s step 18095/19560 | loss 3.279833 (-0.71z)| norm 0.2317 (-0.59z)| lr 8.90e-06 | 4190.33 ms | 32.2% bf16 MFU | 125024 tok/s step 18096/19560 | loss 3.256143 (-1.33z)| norm 0.2311 (-0.64z)| lr 8.89e-06 | 4173.34 ms | 32.4% bf16 MFU | 125054 tok/s step 18097/19560 | loss 3.281590 (-0.66z)| norm 0.2395 (-0.01z)| lr 8.88e-06 | 4175.28 ms | 32.3% bf16 MFU | 125080 tok/s step 18098/19560 | loss 3.239430 (-1.79z)| norm 0.2287 (-0.80z)| lr 8.86e-06 | 4162.06 ms | 32.4% bf16 MFU | 125124 tok/s step 18099/19560 | loss 3.343906 (+0.99z)| norm 0.2329 (-0.49z)| lr 8.85e-06 | 4175.43 ms | 32.3% bf16 MFU | 125146 tok/s step 18100/19560 | loss 3.275485 (-0.83z)| norm 0.2322 (-0.54z)| lr 8.84e-06 | 4171.43 ms | 32.4% bf16 MFU | 125173 tok/s step 18101/19560 | loss 3.247334 (-1.56z)| norm 0.2256 (-1.02z)| lr 8.83e-06 | 4176.93 ms | 32.3% bf16 MFU | 125190 tok/s step 18102/19560 | loss 3.295099 (-0.29z)| norm 0.2340 (-0.40z)| lr 8.82e-06 | 4162.03 ms | 
32.4% bf16 MFU | 125229 tok/s step 18103/19560 | loss 3.298501 (-0.20z)| norm 0.2564 (+1.24z)| lr 8.80e-06 | 4167.11 ms | 32.4% bf16 MFU | 125259 tok/s step 18104/19560 | loss 3.346557 (+1.07z)| norm 0.2264 (-0.96z)| lr 8.79e-06 | 4199.98 ms | 32.1% bf16 MFU | 125237 tok/s step 18105/19560 | loss 3.361022 (+1.43z)| norm 0.2407 (+0.09z)| lr 8.78e-06 | 4174.29 ms | 32.3% bf16 MFU | 125255 tok/s step 18106/19560 | loss 3.316534 (+0.25z)| norm 0.2324 (-0.52z)| lr 8.77e-06 | 4176.99 ms | 32.3% bf16 MFU | 125268 tok/s step 18107/19560 | loss 3.490137 (+4.43z)| norm 0.2374 (-0.15z)| lr 8.76e-06 | 4213.51 ms | 32.0% bf16 MFU | 125227 tok/s step 18108/19560 | loss 3.310326 (+0.03z)| norm 0.2350 (-0.33z)| lr 8.74e-06 | 4186.00 ms | 32.3% bf16 MFU | 125228 tok/s step 18109/19560 | loss 3.452880 (+3.34z)| norm 0.2587 (+1.39z)| lr 8.73e-06 | 4203.79 ms | 32.1% bf16 MFU | 125202 tok/s step 18110/19560 | loss 3.267942 (-1.00z)| norm 0.2436 (+0.29z)| lr 8.72e-06 | 4173.60 ms | 32.4% bf16 MFU | 125223 tok/s step 18111/19560 | loss 3.255743 (-1.27z)| norm 0.2472 (+0.54z)| lr 8.71e-06 | 4171.04 ms | 32.4% bf16 MFU | 125247 tok/s step 18112/19560 | loss 3.281091 (-0.67z)| norm 0.2394 (-0.02z)| lr 8.70e-06 | 4176.91 ms | 32.3% bf16 MFU | 125260 tok/s step 18113/19560 | loss 3.341665 (+0.74z)| norm 0.2596 (+1.43z)| lr 8.68e-06 | 4207.20 ms | 32.1% bf16 MFU | 125228 tok/s step 18114/19560 | loss 3.260900 (-1.16z)| norm 0.2296 (-0.74z)| lr 8.67e-06 | 4163.30 ms | 32.4% bf16 MFU | 125263 tok/s step 18115/19560 | loss 3.267047 (-1.00z)| norm 0.2380 (-0.13z)| lr 8.66e-06 | 4169.59 ms | 32.4% bf16 MFU | 125287 tok/s step 18116/19560 | loss 3.296137 (-0.32z)| norm 0.2432 (+0.24z)| lr 8.65e-06 | 4163.11 ms | 32.4% bf16 MFU | 125320 tok/s step 18117/19560 | loss 3.254417 (-1.28z)| norm 0.2401 (+0.01z)| lr 8.64e-06 | 4169.43 ms | 32.4% bf16 MFU | 125341 tok/s step 18118/19560 | loss 3.404867 (+2.17z)| norm 0.2280 (-0.86z)| lr 8.62e-06 | 4166.99 ms | 32.4% bf16 MFU | 125365 tok/s step 18119/19560 
| loss 3.359479 (+1.11z)| norm 0.2362 (-0.26z)| lr 8.61e-06 | 4191.86 ms | 32.2% bf16 MFU | 125350 tok/s step 18120/19560 | loss 3.277372 (-0.77z)| norm 0.2333 (-0.47z)| lr 8.60e-06 | 4158.36 ms | 32.5% bf16 MFU | 125387 tok/s step 18121/19560 | loss 3.342361 (+0.72z)| norm 0.2492 (+0.67z)| lr 8.59e-06 | 4172.17 ms | 32.4% bf16 MFU | 125401 tok/s step 18122/19560 | loss 3.326283 (+0.35z)| norm 0.2439 (+0.28z)| lr 8.58e-06 | 4169.90 ms | 32.4% bf16 MFU | 125417 tok/s step 18123/19560 | loss 3.277880 (-0.77z)| norm 0.2399 (-0.01z)| lr 8.57e-06 | 4160.19 ms | 32.5% bf16 MFU | 125448 tok/s step 18124/19560 | loss 3.304460 (-0.15z)| norm 0.2359 (-0.30z)| lr 8.55e-06 | 4167.18 ms | 32.4% bf16 MFU | 125466 tok/s step 18125/19560 | loss 3.322658 (+0.27z)| norm 0.2283 (-0.84z)| lr 8.54e-06 | 4189.18 ms | 32.2% bf16 MFU | 125450 tok/s step 18126/19560 | loss 3.317159 (+0.14z)| norm 0.2265 (-0.97z)| lr 8.53e-06 | 12404.62 ms | 10.9% bf16 MFU | 121291 tok/s step 18127/19560 | loss 3.289202 (-0.51z)| norm 0.2260 (-0.99z)| lr 8.52e-06 | 4136.52 ms | 32.6% bf16 MFU | 121564 tok/s step 18128/19560 | loss 3.276753 (-0.79z)| norm 0.2517 (+0.84z)| lr 8.51e-06 | 4138.01 ms | 32.6% bf16 MFU | 121821 tok/s step 18129/19560 | loss 3.343312 (+0.74z)| norm 0.2345 (-0.39z)| lr 8.49e-06 | 4159.51 ms | 32.5% bf16 MFU | 122032 tok/s step 18130/19560 | loss 3.290640 (-0.46z)| norm 0.2428 (+0.21z)| lr 8.48e-06 | 4162.66 ms | 32.4% bf16 MFU | 122228 tok/s step 18131/19560 | loss 3.255489 (-1.26z)| norm 0.2342 (-0.40z)| lr 8.47e-06 | 4150.55 ms | 32.5% bf16 MFU | 122432 tok/s step 18132/19560 | loss 3.286236 (-0.53z)| norm 0.2261 (-0.97z)| lr 8.46e-06 | 4154.37 ms | 32.5% bf16 MFU | 122621 tok/s step 18133/19560 | loss 3.280746 (-0.67z)| norm 0.2306 (-0.65z)| lr 8.45e-06 | 4147.42 ms | 32.6% bf16 MFU | 122810 tok/s step 18134/19560 | loss 3.292685 (-0.38z)| norm 0.2399 (+0.02z)| lr 8.44e-06 | 4157.22 ms | 32.5% bf16 MFU | 122976 tok/s step 18135/19560 | loss 3.339170 (+0.70z)| norm 0.2425 
(+0.21z)| lr 8.42e-06 | 4348.37 ms | 31.1% bf16 MFU | 122855 tok/s step 18136/19560 | loss 3.325806 (+0.38z)| norm 0.2268 (-0.92z)| lr 8.41e-06 | 4150.32 ms | 32.5% bf16 MFU | 123029 tok/s step 18137/19560 | loss 3.367431 (+1.36z)| norm 0.2434 (+0.27z)| lr 8.40e-06 | 4147.62 ms | 32.6% bf16 MFU | 123198 tok/s step 18138/19560 | loss 3.265079 (-1.05z)| norm 0.2932 (+3.63z)| lr 8.39e-06 | 4156.98 ms | 32.5% bf16 MFU | 123344 tok/s step 18139/19560 | loss 3.307768 (-0.05z)| norm 0.2268 (-0.90z)| lr 8.38e-06 | 4152.62 ms | 32.5% bf16 MFU | 123490 tok/s step 18140/19560 | loss 3.292142 (-0.42z)| norm 0.2679 (+1.87z)| lr 8.37e-06 | 4190.07 ms | 32.2% bf16 MFU | 123571 tok/s step 18141/19560 | loss 3.263297 (-1.09z)| norm 0.2391 (-0.07z)| lr 8.35e-06 | 4225.44 ms | 32.0% bf16 MFU | 123597 tok/s step 18142/19560 | loss 3.295222 (-0.34z)| norm 0.2500 (+0.65z)| lr 8.34e-06 | 4157.75 ms | 32.5% bf16 MFU | 123722 tok/s step 18143/19560 | loss 3.331924 (+0.52z)| norm 0.2291 (-0.76z)| lr 8.33e-06 | 4156.32 ms | 32.5% bf16 MFU | 123843 tok/s step 18144/19560 | loss 3.305015 (-0.12z)| norm 0.2291 (-0.75z)| lr 8.32e-06 | 4250.13 ms | 31.8% bf16 MFU | 123819 tok/s step 18145/19560 | loss 3.272218 (-0.91z)| norm 0.2336 (-0.43z)| lr 8.31e-06 | 4155.63 ms | 32.5% bf16 MFU | 123936 tok/s step 18146/19560 | loss 3.246299 (-1.50z)| norm 0.2286 (-0.77z)| lr 8.29e-06 | 4171.42 ms | 32.4% bf16 MFU | 124023 tok/s step 18147/19560 | loss 3.287807 (-0.51z)| norm 0.2295 (-0.70z)| lr 8.28e-06 | 4166.43 ms | 32.4% bf16 MFU | 124114 tok/s step 18148/19560 | loss 3.409990 (+2.33z)| norm 0.2336 (-0.42z)| lr 8.27e-06 | 4147.47 ms | 32.6% bf16 MFU | 124229 tok/s step 18149/19560 | loss 3.290305 (-0.46z)| norm 0.2371 (-0.18z)| lr 8.26e-06 | 4191.55 ms | 32.2% bf16 MFU | 124272 tok/s step 18150/19560 | loss 3.294367 (-0.36z)| norm 0.2342 (-0.38z)| lr 8.25e-06 | 4163.88 ms | 32.4% bf16 MFU | 124354 tok/s step 18151/19560 | loss 3.298849 (-0.25z)| norm 0.2279 (-0.80z)| lr 8.24e-06 | 4172.79 ms | 32.4% bf16 
MFU | 124418 tok/s step 18152/19560 | loss 3.336666 (+0.62z)| norm 0.2269 (-0.86z)| lr 8.22e-06 | 4171.87 ms | 32.4% bf16 MFU | 124481 tok/s step 18153/19560 | loss 3.327191 (+0.41z)| norm 0.2340 (-0.39z)| lr 8.21e-06 | 4164.50 ms | 32.4% bf16 MFU | 124552 tok/s step 18154/19560 | loss 3.283475 (-0.62z)| norm 0.2487 (+0.60z)| lr 8.20e-06 | 4171.97 ms | 32.4% bf16 MFU | 124607 tok/s step 18155/19560 | loss 3.275620 (-0.80z)| norm 0.2306 (-0.63z)| lr 8.19e-06 | 4159.58 ms | 32.5% bf16 MFU | 124679 tok/s step 18156/19560 | loss 3.296857 (-0.29z)| norm 0.2365 (-0.22z)| lr 8.18e-06 | 4169.51 ms | 32.4% bf16 MFU | 124732 tok/s step 18157/19560 | loss 3.299426 (-0.21z)| norm 0.2394 (-0.02z)| lr 8.17e-06 | 4175.88 ms | 32.3% bf16 MFU | 124773 tok/s step 18158/19560 | loss 3.244820 (-1.52z)| norm 0.2387 (-0.07z)| lr 8.16e-06 | 4160.15 ms | 32.5% bf16 MFU | 124836 tok/s step 18159/19560 | loss 3.331070 (+0.55z)| norm 0.2379 (-0.13z)| lr 8.14e-06 | 4193.22 ms | 32.2% bf16 MFU | 124846 tok/s step 18160/19560 | loss 3.258771 (-1.18z)| norm 0.2243 (-1.04z)| lr 8.13e-06 | 4166.22 ms | 32.4% bf16 MFU | 124896 tok/s step 18161/19560 | loss 3.357327 (+1.17z)| norm 0.2508 (+0.76z)| lr 8.12e-06 | 4152.10 ms | 32.5% bf16 MFU | 124964 tok/s step 18162/19560 | loss 3.316347 (+0.19z)| norm 0.2319 (-0.52z)| lr 8.11e-06 | 4167.67 ms | 32.4% bf16 MFU | 125006 tok/s step 18163/19560 | loss 3.287433 (-0.51z)| norm 0.2336 (-0.41z)| lr 8.10e-06 | 4167.47 ms | 32.4% bf16 MFU | 125046 tok/s step 18164/19560 | loss 3.303174 (-0.13z)| norm 0.2292 (-0.70z)| lr 8.09e-06 | 4569.71 ms | 29.5% bf16 MFU | 124530 tok/s step 18165/19560 | loss 3.308361 (-0.00z)| norm 0.2498 (+0.69z)| lr 8.07e-06 | 4160.80 ms | 32.4% bf16 MFU | 124604 tok/s step 18166/19560 | loss 3.280997 (-0.66z)| norm 0.2341 (-0.37z)| lr 8.06e-06 | 4165.40 ms | 32.4% bf16 MFU | 124667 tok/s step 18167/19560 | loss 3.368236 (+1.41z)| norm 0.2363 (-0.21z)| lr 8.05e-06 | 4166.07 ms | 32.4% bf16 MFU | 124726 tok/s step 18168/19560 | loss 
3.359264 (+1.19z)| norm 0.2374 (-0.14z)| lr 8.04e-06 | 4166.85 ms | 32.4% bf16 MFU | 124781 tok/s step 18169/19560 | loss 3.225548 (-1.95z)| norm 0.2430 (+0.23z)| lr 8.03e-06 | 4152.67 ms | 32.5% bf16 MFU | 124855 tok/s step 18170/19560 | loss 3.362353 (+1.24z)| norm 0.2427 (+0.21z)| lr 8.02e-06 | 4169.61 ms | 32.4% bf16 MFU | 124899 tok/s step 18171/19560 | loss 3.349564 (+0.93z)| norm 0.2289 (-0.73z)| lr 8.01e-06 | 4172.68 ms | 32.4% bf16 MFU | 124936 tok/s step 18172/19560 | loss 3.284876 (-0.56z)| norm 0.2293 (-0.70z)| lr 7.99e-06 | 4173.22 ms | 32.4% bf16 MFU | 124971 tok/s step 18173/19560 | loss 3.338080 (+0.66z)| norm 0.2263 (-0.89z)| lr 7.98e-06 | 4160.42 ms | 32.5% bf16 MFU | 125024 tok/s step 18174/19560 | loss 3.267047 (-1.01z)| norm 0.2398 (+0.02z)| lr 7.97e-06 | 4165.42 ms | 32.4% bf16 MFU | 125066 tok/s step 18175/19560 | loss 3.287388 (-0.54z)| norm 0.2499 (+0.69z)| lr 7.96e-06 | 4172.77 ms | 32.4% bf16 MFU | 125095 tok/s step 18176/19560 | loss 3.284924 (-0.59z)| norm 0.2295 (-0.68z)| lr 7.95e-06 | 4162.81 ms | 32.4% bf16 MFU | 125137 tok/s step 18177/19560 | loss 3.297433 (-0.28z)| norm 0.2229 (-1.12z)| lr 7.94e-06 | 4162.50 ms | 32.4% bf16 MFU | 125178 tok/s step 18178/19560 | loss 3.333690 (+0.58z)| norm 0.2380 (-0.08z)| lr 7.93e-06 | 4157.98 ms | 32.5% bf16 MFU | 125224 tok/s step 18179/19560 | loss 3.260368 (-1.16z)| norm 0.2357 (-0.22z)| lr 7.91e-06 | 4229.98 ms | 31.9% bf16 MFU | 125160 tok/s step 18180/19560 | loss 3.303569 (-0.12z)| norm 0.2362 (-0.17z)| lr 7.90e-06 | 4168.63 ms | 32.4% bf16 MFU | 125190 tok/s step 18181/19560 | loss 3.341405 (+0.77z)| norm 0.2374 (-0.07z)| lr 7.89e-06 | 4165.85 ms | 32.4% bf16 MFU | 125224 tok/s step 18182/19560 | loss 3.230968 (-1.82z)| norm 0.2374 (-0.07z)| lr 7.88e-06 | 4174.62 ms | 32.3% bf16 MFU | 125242 tok/s step 18183/19560 | loss 3.295574 (-0.29z)| norm 0.2307 (-0.65z)| lr 7.87e-06 | 4170.58 ms | 32.4% bf16 MFU | 125265 tok/s step 18184/19560 | loss 3.263946 (-1.03z)| norm 0.2311 (-0.60z)| lr 
7.86e-06 | 4149.64 ms | 32.5% bf16 MFU | 125319 tok/s step 18185/19560 | loss 3.345913 (+0.89z)| norm 0.2380 (-0.01z)| lr 7.85e-06 | 4160.94 ms | 32.4% bf16 MFU | 125353 tok/s step 18186/19560 | loss 3.300605 (-0.16z)| norm 0.2249 (-1.12z)| lr 7.83e-06 | 4156.74 ms | 32.5% bf16 MFU | 125392 tok/s step 18187/19560 | loss 3.311984 (+0.11z)| norm 0.2356 (-0.20z)| lr 7.82e-06 | 4162.57 ms | 32.4% bf16 MFU | 125420 tok/s step 18188/19560 | loss 3.268708 (-0.91z)| norm 0.2279 (-0.85z)| lr 7.81e-06 | 4153.04 ms | 32.5% bf16 MFU | 125461 tok/s step 18189/19560 | loss 3.341753 (+0.84z)| norm 0.2513 (+1.14z)| lr 7.80e-06 | 4169.80 ms | 32.4% bf16 MFU | 125475 tok/s step 18190/19560 | loss 3.257610 (-1.16z)| norm 0.2379 (-0.00z)| lr 7.79e-06 | 4152.16 ms | 32.5% bf16 MFU | 125515 tok/s step 18191/19560 | loss 3.309786 (+0.08z)| norm 0.2378 (+0.00z)| lr 7.78e-06 | 4151.57 ms | 32.5% bf16 MFU | 125553 tok/s step 18192/19560 | loss 3.257557 (-1.15z)| norm 0.2313 (-0.56z)| lr 7.77e-06 | 4169.26 ms | 32.4% bf16 MFU | 125563 tok/s step 18193/19560 | loss 3.250904 (-1.30z)| norm 0.2172 (-1.74z)| lr 7.76e-06 | 4169.51 ms | 32.4% bf16 MFU | 125572 tok/s step 18194/19560 | loss 3.274129 (-0.74z)| norm 0.2331 (-0.38z)| lr 7.74e-06 | 4151.47 ms | 32.5% bf16 MFU | 125608 tok/s step 18195/19560 | loss 3.305075 (-0.01z)| norm 0.2346 (-0.23z)| lr 7.73e-06 | 4152.55 ms | 32.5% bf16 MFU | 125641 tok/s step 18196/19560 | loss 3.359153 (+1.28z)| norm 0.2332 (-0.35z)| lr 7.72e-06 | 4146.59 ms | 32.6% bf16 MFU | 125680 tok/s step 18197/19560 | loss 3.332090 (+0.62z)| norm 0.2512 (+1.25z)| lr 7.71e-06 | 4175.44 ms | 32.3% bf16 MFU | 125675 tok/s step 18198/19560 | loss 3.343665 (+0.90z)| norm 0.2361 (-0.09z)| lr 7.70e-06 | 4160.60 ms | 32.5% bf16 MFU | 125692 tok/s step 18199/19560 | loss 3.298732 (-0.18z)| norm 0.2404 (+0.32z)| lr 7.69e-06 | 4161.75 ms | 32.4% bf16 MFU | 125706 tok/s step 18200/19560 | loss 3.355599 (+1.19z)| norm 0.2451 (+0.74z)| lr 7.68e-06 | 4203.68 ms | 32.1% bf16 MFU | 125657 
tok/s step 18201/19560 | loss 3.324679 (+0.44z)| norm 0.2354 (-0.13z)| lr 7.67e-06 | 4174.68 ms | 32.3% bf16 MFU | 125653 tok/s step 18202/19560 | loss 3.283555 (-0.56z)| norm 0.2332 (-0.32z)| lr 7.65e-06 | 4199.52 ms | 32.2% bf16 MFU | 125613 tok/s step 18203/19560 | loss 3.343779 (+0.89z)| norm 0.2333 (-0.32z)| lr 7.64e-06 | 4159.88 ms | 32.5% bf16 MFU | 125634 tok/s step 18204/19560 | loss 3.328081 (+0.52z)| norm 0.2284 (-0.77z)| lr 7.63e-06 | 4175.10 ms | 32.3% bf16 MFU | 125631 tok/s step 18205/19560 | loss 3.337260 (+0.73z)| norm 0.2461 (+0.86z)| lr 7.62e-06 | 4501.65 ms | 30.0% bf16 MFU | 125173 tok/s step 18206/19560 | loss 3.280914 (-0.63z)| norm 0.2317 (-0.47z)| lr 7.61e-06 | 4178.75 ms | 32.3% bf16 MFU | 125187 tok/s step 18207/19560 | loss 3.246078 (-1.45z)| norm 0.2266 (-0.94z)| lr 7.60e-06 | 4156.04 ms | 32.5% bf16 MFU | 125235 tok/s step 18208/19560 | loss 3.310496 (+0.10z)| norm 0.2543 (+1.59z)| lr 7.59e-06 | 4165.33 ms | 32.4% bf16 MFU | 125267 tok/s step 18209/19560 | loss 3.272685 (-0.80z)| norm 0.2443 (+0.66z)| lr 7.58e-06 | 4151.98 ms | 32.5% bf16 MFU | 125317 tok/s step 18210/19560 | loss 3.396550 (+2.11z)| norm 0.2378 (+0.07z)| lr 7.56e-06 | 4177.75 ms | 32.3% bf16 MFU | 125326 tok/s step 18211/19560 | loss 3.280993 (-0.61z)| norm 0.2360 (-0.10z)| lr 7.55e-06 | 4156.41 ms | 32.5% bf16 MFU | 125367 tok/s step 18212/19560 | loss 3.270739 (-0.84z)| norm 0.2332 (-0.36z)| lr 7.54e-06 | 4156.63 ms | 32.5% bf16 MFU | 125405 tok/s step 18213/19560 | loss 3.255533 (-1.18z)| norm 0.2409 (+0.34z)| lr 7.53e-06 | 4226.03 ms | 31.9% bf16 MFU | 125338 tok/s step 18214/19560 | loss 3.278920 (-0.63z)| norm 0.2238 (-1.22z)| lr 7.52e-06 | 4164.27 ms | 32.4% bf16 MFU | 125366 tok/s step 18215/19560 | loss 3.375627 (+1.61z)| norm 0.2321 (-0.46z)| lr 7.51e-06 | 4169.36 ms | 32.4% bf16 MFU | 125385 tok/s step 18216/19560 | loss 3.273512 (-0.75z)| norm 0.2262 (-1.00z)| lr 7.50e-06 | 4166.12 ms | 32.4% bf16 MFU | 125408 tok/s step 18217/19560 | loss 3.310431 
(+0.10z)| norm 0.2300 (-0.66z)| lr 7.49e-06 | 4164.95 ms | 32.4% bf16 MFU | 125432 tok/s step 18218/19560 | loss 3.325782 (+0.45z)| norm 0.2250 (-1.10z)| lr 7.48e-06 | 4163.86 ms | 32.4% bf16 MFU | 125456 tok/s step 18219/19560 | loss 3.308958 (+0.07z)| norm 0.2386 (+0.15z)| lr 7.46e-06 | 4181.97 ms | 32.3% bf16 MFU | 125452 tok/s step 18220/19560 | loss 3.319456 (+0.32z)| norm 0.2513 (+1.29z)| lr 7.45e-06 | 4166.41 ms | 32.4% bf16 MFU | 125471 tok/s step 18221/19560 | loss 3.269512 (-0.83z)| norm 0.2377 (+0.10z)| lr 7.44e-06 | 4441.61 ms | 30.4% bf16 MFU | 125099 tok/s step 18222/19560 | loss 3.208697 (-2.19z)| norm 0.2272 (-0.95z)| lr 7.43e-06 | 5748.64 ms | 23.5% bf16 MFU | 123405 tok/s step 18223/19560 | loss 3.243893 (-1.37z)| norm 0.2312 (-0.55z)| lr 7.42e-06 | 4653.74 ms | 29.0% bf16 MFU | 122867 tok/s step 18224/19560 | loss 3.225516 (-1.77z)| norm 0.2362 (-0.05z)| lr 7.41e-06 | 4271.06 ms | 31.6% bf16 MFU | 122862 tok/s step 18225/19560 | loss 3.256428 (-1.06z)| norm 0.2460 (+0.93z)| lr 7.40e-06 | 4337.01 ms | 31.1% bf16 MFU | 122763 tok/s step 18226/19560 | loss 3.257839 (-1.04z)| norm 0.2307 (-0.61z)| lr 7.39e-06 | 4142.45 ms | 32.6% bf16 MFU | 122953 tok/s step 18227/19560 | loss 3.276996 (-0.59z)| norm 0.2279 (-0.89z)| lr 7.38e-06 | 4188.60 ms | 32.2% bf16 MFU | 123064 tok/s step 18228/19560 | loss 3.250015 (-1.20z)| norm 0.2435 (+0.67z)| lr 7.37e-06 | 4440.91 ms | 30.4% bf16 MFU | 122814 tok/s step 18229/19560 | loss 3.271376 (-0.72z)| norm 0.2383 (+0.14z)| lr 7.35e-06 | 4317.60 ms | 31.3% bf16 MFU | 122744 tok/s step 18230/19560 | loss 3.286207 (-0.38z)| norm 0.2363 (-0.06z)| lr 7.34e-06 | 4561.93 ms | 29.6% bf16 MFU | 122354 tok/s step 18231/19560 | loss 3.319266 (+0.37z)| norm 0.2362 (-0.05z)| lr 7.33e-06 | 4198.81 ms | 32.2% bf16 MFU | 122479 tok/s step 18232/19560 | loss 3.284150 (-0.42z)| norm 0.2385 (+0.17z)| lr 7.32e-06 | 4151.63 ms | 32.5% bf16 MFU | 122669 tok/s step 18233/19560 | loss 3.306266 (+0.09z)| norm 0.2352 (-0.17z)| lr 7.31e-06 | 
4197.00 ms | 32.2% bf16 MFU | 122782 tok/s step 18234/19560 | loss 3.264762 (-0.85z)| norm 0.2363 (-0.06z)| lr 7.30e-06 | 4208.22 ms | 32.1% bf16 MFU | 122872 tok/s step 18235/19560 | loss 3.301710 (+0.03z)| norm 0.2289 (-0.82z)| lr 7.29e-06 | 4147.24 ms | 32.6% bf16 MFU | 123049 tok/s step 18236/19560 | loss 3.308851 (+0.21z)| norm 0.2350 (-0.18z)| lr 7.28e-06 | 4156.79 ms | 32.5% bf16 MFU | 123203 tok/s step 18237/19560 | loss 3.270865 (-0.74z)| norm 0.2343 (-0.24z)| lr 7.27e-06 | 4153.80 ms | 32.5% bf16 MFU | 123354 tok/s step 18238/19560 | loss 3.285446 (-0.36z)| norm 0.2319 (-0.49z)| lr 7.26e-06 | 4164.84 ms | 32.4% bf16 MFU | 123481 tok/s step 18239/19560 | loss 3.248997 (-1.31z)| norm 0.2259 (-1.10z)| lr 7.24e-06 | 4154.86 ms | 32.5% bf16 MFU | 123616 tok/s step 18240/19560 | loss 3.263141 (-0.94z)| norm 0.2327 (-0.37z)| lr 7.23e-06 | 4208.86 ms | 32.1% bf16 MFU | 123664 tok/s step 18241/19560 | loss 3.303246 (+0.12z)| norm 0.2315 (-0.49z)| lr 7.22e-06 | 4151.53 ms | 32.5% bf16 MFU | 123795 tok/s step 18242/19560 | loss 3.349182 (+1.31z)| norm 0.2628 (+2.78z)| lr 7.21e-06 | 4157.24 ms | 32.5% bf16 MFU | 123911 tok/s step 18243/19560 | loss 3.336433 (+0.96z)| norm 0.2313 (-0.51z)| lr 7.20e-06 | 4147.84 ms | 32.6% bf16 MFU | 124035 tok/s step 18244/19560 | loss 3.246691 (-1.37z)| norm 0.2275 (-0.90z)| lr 7.19e-06 | 4222.66 ms | 32.0% bf16 MFU | 124042 tok/s step 18245/19560 | loss 3.296080 (-0.10z)| norm 0.2255 (-1.10z)| lr 7.18e-06 | 4207.22 ms | 32.1% bf16 MFU | 124070 tok/s step 18246/19560 | loss 3.306810 (+0.21z)| norm 0.2450 (+0.93z)| lr 7.17e-06 | 4151.36 ms | 32.5% bf16 MFU | 124181 tok/s step 18247/19560 | loss 3.275518 (-0.62z)| norm 0.2338 (-0.25z)| lr 7.16e-06 | 4149.89 ms | 32.5% bf16 MFU | 124289 tok/s step 18248/19560 | loss 3.249112 (-1.32z)| norm 0.2292 (-0.72z)| lr 7.15e-06 | 4178.81 ms | 32.3% bf16 MFU | 124348 tok/s step 18249/19560 | loss 3.300839 (+0.08z)| norm 0.2477 (+1.21z)| lr 7.14e-06 | 4148.66 ms | 32.5% bf16 MFU | 124449 tok/s step 
18250/19560 | loss 3.330823 (+0.90z)| norm 0.2460 (+1.03z)| lr 7.13e-06 | 4177.54 ms | 32.3% bf16 MFU | 124502 tok/s val loss 3.267034 HellaSwag: 3034/10042 = 0.302131 step 18251/19560 | loss 3.266553 (-0.85z)| norm 0.2458 (+1.00z)| lr 7.11e-06 | 4152.25 ms | 32.5% bf16 MFU | 124590 tok/s step 18252/19560 | loss 3.292783 (-0.13z)| norm 0.2360 (-0.02z)| lr 7.10e-06 | 4169.24 ms | 32.4% bf16 MFU | 124648 tok/s step 18253/19560 | loss 3.368307 (+1.88z)| norm 0.2354 (-0.09z)| lr 7.09e-06 | 4173.50 ms | 32.4% bf16 MFU | 124697 tok/s step 18254/19560 | loss 3.222811 (-1.97z)| norm 0.2308 (-0.57z)| lr 7.08e-06 | 4150.51 ms | 32.5% bf16 MFU | 124778 tok/s step 18255/19560 | loss 3.261466 (-0.94z)| norm 0.2382 (+0.20z)| lr 7.07e-06 | 4165.22 ms | 32.4% bf16 MFU | 124833 tok/s step 18256/19560 | loss 3.252793 (-1.16z)| norm 0.2227 (-1.41z)| lr 7.06e-06 | 4179.55 ms | 32.3% bf16 MFU | 124863 tok/s step 18257/19560 | loss 3.303111 (+0.17z)| norm 0.2304 (-0.60z)| lr 7.05e-06 | 4151.31 ms | 32.5% bf16 MFU | 124935 tok/s step 18258/19560 | loss 3.295443 (-0.03z)| norm 0.2317 (-0.45z)| lr 7.04e-06 | 4161.49 ms | 32.4% bf16 MFU | 124987 tok/s step 18259/19560 | loss 3.394116 (+2.49z)| norm 0.2382 (+0.22z)| lr 7.03e-06 | 4157.80 ms | 32.5% bf16 MFU | 125043 tok/s step 18260/19560 | loss 3.309684 (+0.31z)| norm 0.2370 (+0.09z)| lr 7.02e-06 | 4239.20 ms | 31.8% bf16 MFU | 124974 tok/s step 18261/19560 | loss 3.336449 (+0.98z)| norm 0.2406 (+0.46z)| lr 7.01e-06 | 4153.12 ms | 32.5% bf16 MFU | 125038 tok/s step 18262/19560 | loss 3.296123 (-0.06z)| norm 0.2307 (-0.57z)| lr 7.00e-06 | 4154.95 ms | 32.5% bf16 MFU | 125095 tok/s step 18263/19560 | loss 3.244976 (-1.35z)| norm 0.2249 (-1.16z)| lr 6.98e-06 | 4153.90 ms | 32.5% bf16 MFU | 125151 tok/s step 18264/19560 | loss 3.262215 (-0.89z)| norm 0.2478 (+1.22z)| lr 6.97e-06 | 4155.30 ms | 32.5% bf16 MFU | 125202 tok/s step 18265/19560 | loss 3.284034 (-0.32z)| norm 0.2316 (-0.47z)| lr 6.96e-06 | 4158.04 ms | 32.5% bf16 MFU | 125247 tok/s step 18266/19560 | loss 3.278892 (-0.46z)| norm 0.2270 (-1.05z)| lr 
6.95e-06 | 4465.71 ms | 30.2% bf16 MFU | 124854 tok/s step 18267/19560 | loss 3.271356 (-0.65z)| norm 0.2359 (+0.03z)| lr 6.94e-06 | 4159.36 ms | 32.5% bf16 MFU | 124914 tok/s step 18268/19560 | loss 3.285081 (-0.29z)| norm 0.2348 (-0.08z)| lr 6.93e-06 | 4167.23 ms | 32.4% bf16 MFU | 124959 tok/s step 18269/19560 | loss 3.304544 (+0.21z)| norm 0.2445 (+1.20z)| lr 6.92e-06 | 4150.26 ms | 32.5% bf16 MFU | 125027 tok/s step 18270/19560 | loss 3.333336 (+0.95z)| norm 0.2406 (+0.71z)| lr 6.91e-06 | 4258.20 ms | 31.7% bf16 MFU | 124932 tok/s step 18271/19560 | loss 3.266221 (-0.78z)| norm 0.2317 (-0.49z)| lr 6.90e-06 | 4147.46 ms | 32.6% bf16 MFU | 125006 tok/s step 18272/19560 | loss 3.323978 (+0.71z)| norm 0.2243 (-1.47z)| lr 6.89e-06 | 4204.88 ms | 32.1% bf16 MFU | 124990 tok/s step 18273/19560 | loss 3.377035 (+2.04z)| norm 0.2695 (+4.19z)| lr 6.88e-06 | 4156.69 ms | 32.5% bf16 MFU | 125047 tok/s step 18274/19560 | loss 3.309909 (+0.31z)| norm 0.2292 (-0.79z)| lr 6.87e-06 | 4151.61 ms | 32.5% bf16 MFU | 125109 tok/s step 18275/19560 | loss 3.347715 (+1.26z)| norm 0.2443 (+1.07z)| lr 6.86e-06 | 4157.81 ms | 32.5% bf16 MFU | 125159 tok/s step 18276/19560 | loss 3.256055 (-1.08z)| norm 0.2296 (-0.74z)| lr 6.85e-06 | 4165.58 ms | 32.4% bf16 MFU | 125194 tok/s step 18277/19560 | loss 3.270298 (-0.70z)| norm 0.2290 (-0.81z)| lr 6.84e-06 | 4153.65 ms | 32.5% bf16 MFU | 125245 tok/s step 18278/19560 | loss 3.309212 (+0.32z)| norm 0.2271 (-1.03z)| lr 6.83e-06 | 4153.37 ms | 32.5% bf16 MFU | 125295 tok/s step 18279/19560 | loss 3.334259 (+0.97z)| norm 0.2245 (-1.34z)| lr 6.81e-06 | 4153.86 ms | 32.5% bf16 MFU | 125341 tok/s step 18280/19560 | loss 3.263280 (-0.87z)| norm 0.2435 (+0.95z)| lr 6.80e-06 | 4148.10 ms | 32.5% bf16 MFU | 125393 tok/s step 18281/19560 | loss 3.301653 (+0.13z)| norm 0.2350 (-0.07z)| lr 6.79e-06 | 4197.39 ms | 32.2% bf16 MFU | 125369 tok/s step 18282/19560 | loss 3.257222 (-1.02z)| norm 0.2853 (+5.35z)| lr 6.78e-06 | 4191.50 ms | 32.2% bf16 MFU | 125355 
tok/s step 18283/19560 | loss 3.274360 (-0.57z)| norm 0.2325 (-0.37z)| lr 6.77e-06 | 4166.45 ms | 32.4% bf16 MFU | 125379 tok/s step 18284/19560 | loss 3.291827 (-0.12z)| norm 0.2288 (-0.77z)| lr 6.76e-06 | 4150.01 ms | 32.5% bf16 MFU | 125427 tok/s step 18285/19560 | loss 3.278488 (-0.46z)| norm 0.2282 (-0.82z)| lr 6.75e-06 | 4160.71 ms | 32.5% bf16 MFU | 125456 tok/s step 18286/19560 | loss 3.319746 (+0.61z)| norm 0.2400 (+0.45z)| lr 6.74e-06 | 4150.91 ms | 32.5% bf16 MFU | 125498 tok/s step 18287/19560 | loss 3.238511 (-1.50z)| norm 0.2291 (-0.71z)| lr 6.73e-06 | 4147.68 ms | 32.6% bf16 MFU | 125544 tok/s step 18288/19560 | loss 3.325001 (+0.75z)| norm 0.2358 (-0.01z)| lr 6.72e-06 | 4152.73 ms | 32.5% bf16 MFU | 125579 tok/s step 18289/19560 | loss 3.301338 (+0.14z)| norm 0.2379 (+0.24z)| lr 6.71e-06 | 4164.15 ms | 32.4% bf16 MFU | 125595 tok/s step 18290/19560 | loss 3.273540 (-0.59z)| norm 0.2363 (+0.06z)| lr 6.70e-06 | 4146.86 ms | 32.6% bf16 MFU | 125637 tok/s step 18291/19560 | loss 3.279117 (-0.44z)| norm 0.2316 (-0.45z)| lr 6.69e-06 | 4164.34 ms | 32.4% bf16 MFU | 125650 tok/s step 18292/19560 | loss 3.262019 (-0.88z)| norm 0.2278 (-0.87z)| lr 6.68e-06 | 4157.50 ms | 32.5% bf16 MFU | 125673 tok/s step 18293/19560 | loss 3.200355 (-2.42z)| norm 0.2335 (-0.23z)| lr 6.67e-06 | 4158.14 ms | 32.5% bf16 MFU | 125694 tok/s step 18294/19560 | loss 3.355522 (+1.54z)| norm 0.2308 (-0.53z)| lr 6.66e-06 | 4422.48 ms | 30.5% bf16 MFU | 125337 tok/s step 18295/19560 | loss 3.332981 (+0.98z)| norm 0.2276 (-0.87z)| lr 6.65e-06 | 4149.86 ms | 32.5% bf16 MFU | 125387 tok/s step 18296/19560 | loss 3.295990 (+0.04z)| norm 0.2310 (-0.49z)| lr 6.64e-06 | 4150.25 ms | 32.5% bf16 MFU | 125434 tok/s step 18297/19560 | loss 3.293803 (-0.03z)| norm 0.2257 (-1.06z)| lr 6.63e-06 | 4151.80 ms | 32.5% bf16 MFU | 125476 tok/s step 18298/19560 | loss 3.360657 (+1.74z)| norm 0.2369 (+0.18z)| lr 6.61e-06 | 4161.74 ms | 32.4% bf16 MFU | 125501 tok/s step 18299/19560 | loss 3.273603 
(-0.55z)| norm 0.2354 (+0.00z)| lr 6.60e-06 | 4149.27 ms | 32.5% bf16 MFU | 125544 tok/s step 18300/19560 | loss 3.285403 (-0.23z)| norm 0.2264 (-0.99z)| lr 6.59e-06 | 4158.12 ms | 32.5% bf16 MFU | 125571 tok/s step 18301/19560 | loss 3.279130 (-0.39z)| norm 0.2255 (-1.08z)| lr 6.58e-06 | 4147.74 ms | 32.6% bf16 MFU | 125613 tok/s step 18302/19560 | loss 3.262516 (-0.83z)| norm 0.2262 (-0.99z)| lr 6.57e-06 | 4159.98 ms | 32.5% bf16 MFU | 125634 tok/s step 18303/19560 | loss 3.293821 (+0.00z)| norm 0.2320 (-0.34z)| lr 6.56e-06 | 4149.16 ms | 32.5% bf16 MFU | 125670 tok/s step 18304/19560 | loss 3.290385 (-0.09z)| norm 0.2277 (-0.82z)| lr 6.55e-06 | 4162.90 ms | 32.4% bf16 MFU | 125684 tok/s step 18305/19560 | loss 3.220627 (-1.91z)| norm 0.2320 (-0.35z)| lr 6.54e-06 | 4151.97 ms | 32.5% bf16 MFU | 125713 tok/s step 18306/19560 | loss 3.335015 (+1.10z)| norm 0.2222 (-1.42z)| lr 6.53e-06 | 4154.69 ms | 32.5% bf16 MFU | 125737 tok/s step 18307/19560 | loss 3.302085 (+0.23z)| norm 0.2245 (-1.14z)| lr 6.52e-06 | 4148.90 ms | 32.5% bf16 MFU | 125769 tok/s step 18308/19560 | loss 3.290534 (-0.08z)| norm 0.2346 (-0.04z)| lr 6.51e-06 | 4167.42 ms | 32.4% bf16 MFU | 125770 tok/s step 18309/19560 | loss 3.366580 (+1.91z)| norm 0.2806 (+4.56z)| lr 6.50e-06 | 4159.22 ms | 32.5% bf16 MFU | 125785 tok/s step 18310/19560 | loss 3.274919 (-0.50z)| norm 0.2300 (-0.52z)| lr 6.49e-06 | 4157.19 ms | 32.5% bf16 MFU | 125801 tok/s step 18311/19560 | loss 3.272654 (-0.56z)| norm 0.2444 (+0.91z)| lr 6.48e-06 | 4163.91 ms | 32.4% bf16 MFU | 125807 tok/s step 18312/19560 | loss 3.308632 (+0.38z)| norm 0.2309 (-0.44z)| lr 6.47e-06 | 4147.44 ms | 32.6% bf16 MFU | 125837 tok/s step 18313/19560 | loss 3.295978 (+0.06z)| norm 0.2377 (+0.24z)| lr 6.46e-06 | 4162.55 ms | 32.4% bf16 MFU | 125843 tok/s step 18314/19560 | loss 3.244161 (-1.30z)| norm 0.2319 (-0.34z)| lr 6.45e-06 | 4165.66 ms | 32.4% bf16 MFU | 125844 tok/s step 18315/19560 | loss 3.242619 (-1.32z)| norm 0.2359 (+0.06z)| lr 6.44e-06 | 
4161.60 ms | 32.4% bf16 MFU | 125851 tok/s step 18316/19560 | loss 3.355095 (+1.61z)| norm 0.2302 (-0.52z)| lr 6.43e-06 | 4167.18 ms | 32.4% bf16 MFU | 125849 tok/s step 18317/19560 | loss 3.289109 (-0.10z)| norm 0.2500 (+1.47z)| lr 6.42e-06 | 4164.11 ms | 32.4% bf16 MFU | 125852 tok/s step 18318/19560 | loss 3.296987 (+0.10z)| norm 0.2382 (+0.29z)| lr 6.41e-06 | 4550.01 ms | 29.7% bf16 MFU | 125320 tok/s step 18319/19560 | loss 3.254225 (-1.01z)| norm 0.2373 (+0.20z)| lr 6.40e-06 | 4163.19 ms | 32.4% bf16 MFU | 125351 tok/s step 18320/19560 | loss 3.247910 (-1.18z)| norm 0.2322 (-0.32z)| lr 6.39e-06 | 4168.07 ms | 32.4% bf16 MFU | 125373 tok/s step 18321/19560 | loss 3.216219 (-1.98z)| norm 0.2369 (+0.14z)| lr 6.38e-06 | 4154.40 ms | 32.5% bf16 MFU | 125414 tok/s step 18322/19560 | loss 3.276140 (-0.43z)| norm 0.2324 (-0.32z)| lr 6.37e-06 | 4168.01 ms | 32.4% bf16 MFU | 125433 tok/s step 18323/19560 | loss 3.287691 (-0.12z)| norm 0.2344 (-0.11z)| lr 6.36e-06 | 4145.70 ms | 32.6% bf16 MFU | 125485 tok/s step 18324/19560 | loss 3.250206 (-1.08z)| norm 0.2319 (-0.37z)| lr 6.35e-06 | 4156.72 ms | 32.5% bf16 MFU | 125517 tok/s step 18325/19560 | loss 3.306758 (+0.40z)| norm 0.2344 (-0.10z)| lr 6.34e-06 | 4152.59 ms | 32.5% bf16 MFU | 125554 tok/s step 18326/19560 | loss 3.268540 (-0.59z)| norm 0.2328 (-0.26z)| lr 6.33e-06 | 4154.20 ms | 32.5% bf16 MFU | 125586 tok/s step 18327/19560 | loss 3.316869 (+0.68z)| norm 0.2461 (+1.10z)| lr 6.32e-06 | 4160.32 ms | 32.5% bf16 MFU | 125608 tok/s step 18328/19560 | loss 3.263546 (-0.71z)| norm 0.2334 (-0.19z)| lr 6.31e-06 | 4164.15 ms | 32.4% bf16 MFU | 125623 tok/s step 18329/19560 | loss 3.293245 (+0.09z)| norm 0.2325 (-0.28z)| lr 6.30e-06 | 4157.08 ms | 32.5% bf16 MFU | 125648 tok/s step 18330/19560 | loss 3.262276 (-0.73z)| norm 0.5193 (+10.51z)| lr 6.28e-06 | 4159.68 ms | 32.5% bf16 MFU | 125667 tok/s step 18331/19560 | loss 3.258435 (-0.82z)| norm 0.2370 (-0.02z)| lr 6.27e-06 | 4199.55 ms | 32.2% bf16 MFU | 125626 tok/s 
step 18332/19560 | loss 3.238456 (-1.33z)| norm 0.2347 (-0.11z)| lr 6.26e-06 | 4162.33 ms | 32.4% bf16 MFU | 125643 tok/s step 18333/19560 | loss 3.327658 (+1.05z)| norm 0.2362 (-0.05z)| lr 6.25e-06 | 4151.42 ms | 32.5% bf16 MFU | 125675 tok/s step 18334/19560 | loss 3.227024 (-1.61z)| norm 0.2347 (-0.11z)| lr 6.24e-06 | 4167.40 ms | 32.4% bf16 MFU | 125682 tok/s step 18335/19560 | loss 3.269092 (-0.51z)| norm 0.2407 (+0.12z)| lr 6.23e-06 | 4146.01 ms | 32.6% bf16 MFU | 125721 tok/s step 18336/19560 | loss 3.285340 (-0.07z)| norm 0.2416 (+0.15z)| lr 6.22e-06 | 4170.72 ms | 32.4% bf16 MFU | 125720 tok/s step 18337/19560 | loss 3.284688 (-0.09z)| norm 0.2443 (+0.25z)| lr 6.21e-06 | 4156.96 ms | 32.5% bf16 MFU | 125740 tok/s step 18338/19560 | loss 3.259542 (-0.75z)| norm 0.2387 (+0.04z)| lr 6.20e-06 | 4153.70 ms | 32.5% bf16 MFU | 125764 tok/s step 18339/19560 | loss 3.240197 (-1.27z)| norm 0.2390 (+0.05z)| lr 6.19e-06 | 4163.36 ms | 32.4% bf16 MFU | 125772 tok/s step 18340/19560 | loss 3.269408 (-0.47z)| norm 0.2319 (-0.21z)| lr 6.18e-06 | 4152.30 ms | 32.5% bf16 MFU | 125797 tok/s step 18341/19560 | loss 3.261229 (-0.70z)| norm 0.2359 (-0.06z)| lr 6.17e-06 | 4152.53 ms | 32.5% bf16 MFU | 125820 tok/s step 18342/19560 | loss 3.334350 (+1.28z)| norm 0.2411 (+0.13z)| lr 6.16e-06 | 4168.93 ms | 32.4% bf16 MFU | 125817 tok/s step 18343/19560 | loss 3.304620 (+0.50z)| norm 0.2296 (-0.30z)| lr 6.15e-06 | 4156.40 ms | 32.5% bf16 MFU | 125833 tok/s step 18344/19560 | loss 3.300282 (+0.37z)| norm 0.2342 (-0.13z)| lr 6.14e-06 | 4159.41 ms | 32.5% bf16 MFU | 125844 tok/s step 18345/19560 | loss 3.306542 (+0.55z)| norm 0.2232 (-0.54z)| lr 6.13e-06 | 4146.60 ms | 32.6% bf16 MFU | 125874 tok/s step 18346/19560 | loss 3.261357 (-0.69z)| norm 0.2286 (-0.34z)| lr 6.12e-06 | 4958.94 ms | 27.2% bf16 MFU | 124866 tok/s step 18347/19560 | loss 3.269718 (-0.45z)| norm 0.2320 (-0.21z)| lr 6.11e-06 | 4134.85 ms | 32.7% bf16 MFU | 124963 tok/s step 18348/19560 | loss 3.270058 (-0.43z)| norm 
0.2295 (-0.30z)| lr 6.10e-06 | 4143.65 ms | 32.6% bf16 MFU | 125041 tok/s step 18349/19560 | loss 3.266641 (-0.53z)| norm 0.2453 (+0.29z)| lr 6.09e-06 | 4143.25 ms | 32.6% bf16 MFU | 125116 tok/s step 18350/19560 | loss 3.250924 (-0.99z)| norm 0.2349 (-0.10z)| lr 6.08e-06 | 4141.38 ms | 32.6% bf16 MFU | 125190 tok/s step 18351/19560 | loss 3.315948 (+0.84z)| norm 0.2519 (+0.53z)| lr 6.07e-06 | 4146.48 ms | 32.6% bf16 MFU | 125253 tok/s step 18352/19560 | loss 3.378981 (+2.56z)| norm 0.2362 (-0.06z)| lr 6.06e-06 | 4137.54 ms | 32.6% bf16 MFU | 125326 tok/s step 18353/19560 | loss 3.257482 (-0.84z)| norm 0.2286 (-0.34z)| lr 6.05e-06 | 4135.40 ms | 32.6% bf16 MFU | 125399 tok/s step 18354/19560 | loss 3.296712 (+0.25z)| norm 0.2542 (+0.61z)| lr 6.04e-06 | 4151.02 ms | 32.5% bf16 MFU | 125444 tok/s step 18355/19560 | loss 3.410086 (+3.26z)| norm 0.2673 (+1.09z)| lr 6.03e-06 | 4143.49 ms | 32.6% bf16 MFU | 125498 tok/s step 18356/19560 | loss 3.258211 (-0.83z)| norm 0.2340 (-0.15z)| lr 6.02e-06 | 4165.88 ms | 32.4% bf16 MFU | 125516 tok/s step 18357/19560 | loss 3.307673 (+0.49z)| norm 0.2462 (+0.30z)| lr 6.01e-06 | 4216.92 ms | 32.0% bf16 MFU | 125457 tok/s step 18358/19560 | loss 3.291214 (+0.05z)| norm 0.2402 (+0.08z)| lr 6.00e-06 | 4149.16 ms | 32.5% bf16 MFU | 125502 tok/s step 18359/19560 | loss 3.304671 (+0.42z)| norm 0.2282 (-0.37z)| lr 5.99e-06 | 4139.17 ms | 32.6% bf16 MFU | 125560 tok/s step 18360/19560 | loss 3.307458 (+0.49z)| norm 0.2509 (+0.47z)| lr 5.98e-06 | 4144.68 ms | 32.6% bf16 MFU | 125607 tok/s step 18361/19560 | loss 3.281114 (-0.22z)| norm 0.2254 (-0.47z)| lr 5.97e-06 | 4152.94 ms | 32.5% bf16 MFU | 125639 tok/s step 18362/19560 | loss 3.271554 (-0.48z)| norm 0.2242 (-0.51z)| lr 5.96e-06 | 4148.11 ms | 32.5% bf16 MFU | 125676 tok/s step 18363/19560 | loss 3.278344 (-0.29z)| norm 0.2407 (+0.10z)| lr 5.95e-06 | 4305.34 ms | 31.4% bf16 MFU | 125481 tok/s step 18364/19560 | loss 3.298748 (+0.26z)| norm 0.2287 (-0.35z)| lr 5.94e-06 | 4170.42 ms | 
32.4% bf16 MFU | 125493 tok/s step 18365/19560 | loss 3.267042 (-0.59z)| norm 0.2295 (-0.32z)| lr 5.93e-06 | 4141.92 ms | 32.6% bf16 MFU | 125547 tok/s step 18366/19560 | loss 3.284971 (-0.11z)| norm 0.2360 (-0.07z)| lr 5.92e-06 | 4174.68 ms | 32.3% bf16 MFU | 125549 tok/s step 18367/19560 | loss 3.373848 (+2.23z)| norm 0.2515 (+0.49z)| lr 5.91e-06 | 4143.82 ms | 32.6% bf16 MFU | 125598 tok/s step 18368/19560 | loss 3.277658 (-0.33z)| norm 0.2362 (-0.07z)| lr 5.90e-06 | 4156.13 ms | 32.5% bf16 MFU | 125626 tok/s step 18369/19560 | loss 3.311116 (+0.56z)| norm 0.2322 (-0.23z)| lr 5.89e-06 | 4150.20 ms | 32.5% bf16 MFU | 125661 tok/s step 18370/19560 | loss 3.305904 (+0.43z)| norm 0.2439 (+0.22z)| lr 5.88e-06 | 4142.88 ms | 32.6% bf16 MFU | 125705 tok/s step 18371/19560 | loss 3.332306 (+1.14z)| norm 0.2349 (-0.12z)| lr 5.87e-06 | 4145.75 ms | 32.6% bf16 MFU | 125743 tok/s step 18372/19560 | loss 3.337552 (+1.27z)| norm 0.2333 (-0.18z)| lr 5.86e-06 | 4146.11 ms | 32.6% bf16 MFU | 125779 tok/s step 18373/19560 | loss 3.322793 (+0.86z)| norm 0.2268 (-0.42z)| lr 5.85e-06 | 4142.00 ms | 32.6% bf16 MFU | 125819 tok/s step 18374/19560 | loss 3.281908 (-0.23z)| norm 0.2447 (+0.24z)| lr 5.85e-06 | 4153.61 ms | 32.5% bf16 MFU | 125839 tok/s step 18375/19560 | loss 3.318758 (+0.75z)| norm 0.2478 (+0.35z)| lr 5.84e-06 | 4137.54 ms | 32.6% bf16 MFU | 125883 tok/s step 18376/19560 | loss 3.231344 (-1.58z)| norm 0.2297 (-0.32z)| lr 5.83e-06 | 4195.31 ms | 32.2% bf16 MFU | 125837 tok/s step 18377/19560 | loss 3.287361 (-0.09z)| norm 0.2334 (-0.18z)| lr 5.82e-06 | 4246.06 ms | 31.8% bf16 MFU | 125719 tok/s step 18378/19560 | loss 3.222094 (-1.79z)| norm 0.2458 (+0.28z)| lr 5.81e-06 | 4153.29 ms | 32.5% bf16 MFU | 125745 tok/s step 18379/19560 | loss 3.268737 (-0.56z)| norm 0.2323 (-0.22z)| lr 5.80e-06 | 4151.33 ms | 32.5% bf16 MFU | 125772 tok/s step 18380/19560 | loss 3.314527 (+0.65z)| norm 0.2339 (-0.15z)| lr 5.79e-06 | 4148.88 ms | 32.5% bf16 MFU | 125802 tok/s step 18381/19560 
| loss 3.256235 (-0.88z)| norm 0.2335 (-0.17z)| lr 5.78e-06 | 4138.47 ms | 32.6% bf16 MFU | 125846 tok/s step 18382/19560 | loss 3.274394 (-0.41z)| norm 0.2340 (-0.15z)| lr 5.77e-06 | 4142.70 ms | 32.6% bf16 MFU | 125882 tok/s step 18383/19560 | loss 3.339237 (+1.33z)| norm 0.2270 (-0.41z)| lr 5.76e-06 | 4151.17 ms | 32.5% bf16 MFU | 125903 tok/s step 18384/19560 | loss 3.292247 (+0.05z)| norm 0.2275 (-0.39z)| lr 5.75e-06 | 4138.28 ms | 32.6% bf16 MFU | 125942 tok/s step 18385/19560 | loss 3.287632 (-0.07z)| norm 0.2275 (-0.39z)| lr 5.74e-06 | 4152.42 ms | 32.5% bf16 MFU | 125958 tok/s step 18386/19560 | loss 3.213512 (-2.03z)| norm 0.2401 (+0.08z)| lr 5.73e-06 | 4155.99 ms | 32.5% bf16 MFU | 125968 tok/s step 18387/19560 | loss 3.247574 (-1.12z)| norm 0.2288 (-0.34z)| lr 5.72e-06 | 4151.89 ms | 32.5% bf16 MFU | 125983 tok/s step 18388/19560 | loss 3.304271 (+0.43z)| norm 0.2504 (+0.46z)| lr 5.71e-06 | 4150.46 ms | 32.5% bf16 MFU | 126000 tok/s step 18389/19560 | loss 3.307117 (+0.52z)| norm 0.2285 (-0.35z)| lr 5.70e-06 | 4139.60 ms | 32.6% bf16 MFU | 126033 tok/s step 18390/19560 | loss 3.243920 (-1.20z)| norm 0.2258 (-0.45z)| lr 5.69e-06 | 4153.20 ms | 32.5% bf16 MFU | 126043 tok/s step 18391/19560 | loss 3.268397 (-0.54z)| norm 0.2353 (-0.10z)| lr 5.68e-06 | 4156.12 ms | 32.5% bf16 MFU | 126048 tok/s step 18392/19560 | loss 3.259193 (-0.79z)| norm 0.2307 (-0.27z)| lr 5.67e-06 | 4137.92 ms | 32.6% bf16 MFU | 126081 tok/s step 18393/19560 | loss 3.306879 (+0.52z)| norm 0.2311 (-0.25z)| lr 5.66e-06 | 4149.07 ms | 32.5% bf16 MFU | 126095 tok/s step 18394/19560 | loss 3.413017 (+3.26z)| norm 0.2332 (-0.18z)| lr 5.65e-06 | 4141.57 ms | 32.6% bf16 MFU | 126120 tok/s step 18395/19560 | loss 3.286799 (-0.07z)| norm 0.2263 (-0.43z)| lr 5.64e-06 | 4138.56 ms | 32.6% bf16 MFU | 126148 tok/s step 18396/19560 | loss 3.246282 (-1.12z)| norm 0.2259 (-0.44z)| lr 5.63e-06 | 4150.44 ms | 32.5% bf16 MFU | 126157 tok/s step 18397/19560 | loss 3.287012 (-0.05z)| norm 0.2285 (-0.34z)| 
lr 5.62e-06 | 4145.04 ms | 32.6% bf16 MFU | 126173 tok/s step 18398/19560 | loss 3.305332 (+0.44z)| norm 0.2276 (-0.37z)| lr 5.61e-06 | 4137.20 ms | 32.6% bf16 MFU | 126201 tok/s step 18399/19560 | loss 3.235873 (-1.37z)| norm 0.2284 (-0.34z)| lr 5.60e-06 | 4142.22 ms | 32.6% bf16 MFU | 126219 tok/s step 18400/19560 | loss 3.268986 (-0.50z)| norm 0.2413 (+0.13z)| lr 5.59e-06 | 4145.59 ms | 32.6% bf16 MFU | 126232 tok/s step 18401/19560 | loss 3.349295 (+1.63z)| norm 0.2511 (+0.51z)| lr 5.58e-06 | 4148.44 ms | 32.5% bf16 MFU | 126239 tok/s step 18402/19560 | loss 3.274337 (-0.35z)| norm 0.2439 (+0.23z)| lr 5.57e-06 | 4141.76 ms | 32.6% bf16 MFU | 126257 tok/s step 18403/19560 | loss 3.312926 (+0.69z)| norm 0.2230 (-0.54z)| lr 5.56e-06 | 4150.98 ms | 32.5% bf16 MFU | 126259 tok/s step 18404/19560 | loss 3.314852 (+0.73z)| norm 0.2428 (+0.20z)| lr 5.55e-06 | 4148.82 ms | 32.5% bf16 MFU | 126265 tok/s step 18405/19560 | loss 3.340942 (+1.40z)| norm 0.2225 (-0.56z)| lr 5.54e-06 | 4154.56 ms | 32.5% bf16 MFU | 126261 tok/s step 18406/19560 | loss 3.258480 (-0.78z)| norm 0.2249 (-0.47z)| lr 5.54e-06 | 4139.24 ms | 32.6% bf16 MFU | 126281 tok/s step 18407/19560 | loss 3.332940 (+1.20z)| norm 0.2315 (-0.23z)| lr 5.53e-06 | 4168.21 ms | 32.4% bf16 MFU | 126256 tok/s step 18408/19560 | loss 3.222547 (-1.71z)| norm 0.2209 (-0.61z)| lr 5.52e-06 | 4237.93 ms | 31.9% bf16 MFU | 126129 tok/s step 18409/19560 | loss 3.263447 (-0.62z)| norm 0.2317 (-0.21z)| lr 5.51e-06 | 4154.55 ms | 32.5% bf16 MFU | 126132 tok/s step 18410/19560 | loss 3.261561 (-0.67z)| norm 0.2262 (-0.40z)| lr 5.50e-06 | 4361.18 ms | 31.0% bf16 MFU | 125837 tok/s step 18411/19560 | loss 3.269569 (-0.46z)| norm 0.2447 (+0.29z)| lr 5.49e-06 | 4352.44 ms | 31.0% bf16 MFU | 125568 tok/s step 18412/19560 | loss 3.251122 (-0.94z)| norm 0.2369 (-0.01z)| lr 5.48e-06 | 4910.28 ms | 27.5% bf16 MFU | 124628 tok/s step 18413/19560 | loss 3.290202 (+0.09z)| norm 0.2267 (-0.39z)| lr 5.47e-06 | 4668.65 ms | 28.9% bf16 MFU | 
124012 tok/s step 18414/19560 | loss 3.299931 (+0.35z)| norm 0.2245 (-0.47z)| lr 5.46e-06 | 4760.99 ms | 28.4% bf16 MFU | 123317 tok/s step 18415/19560 | loss 3.312117 (+0.65z)| norm 0.2267 (-0.39z)| lr 5.45e-06 | 4391.36 ms | 30.7% bf16 MFU | 123121 tok/s step 18416/19560 | loss 3.314053 (+0.71z)| norm 0.2277 (-0.34z)| lr 5.44e-06 | 4579.04 ms | 29.5% bf16 MFU | 122690 tok/s step 18417/19560 | loss 3.310454 (+0.61z)| norm 0.2324 (-0.16z)| lr 5.43e-06 | 4518.34 ms | 29.9% bf16 MFU | 122357 tok/s step 18418/19560 | loss 3.268920 (-0.49z)| norm 0.2320 (-0.18z)| lr 5.42e-06 | 4206.11 ms | 32.1% bf16 MFU | 122472 tok/s step 18419/19560 | loss 3.321015 (+0.88z)| norm 0.2296 (-0.27z)| lr 5.41e-06 | 4370.75 ms | 30.9% bf16 MFU | 122346 tok/s step 18420/19560 | loss 3.264016 (-0.62z)| norm 0.2446 (+0.29z)| lr 5.40e-06 | 4243.91 ms | 31.8% bf16 MFU | 122405 tok/s step 18421/19560 | loss 3.272952 (-0.41z)| norm 0.2346 (-0.09z)| lr 5.39e-06 | 4145.04 ms | 32.6% bf16 MFU | 122609 tok/s step 18422/19560 | loss 3.252711 (-0.94z)| norm 0.2302 (-0.25z)| lr 5.38e-06 | 4314.74 ms | 31.3% bf16 MFU | 122554 tok/s step 18423/19560 | loss 3.297633 (+0.29z)| norm 0.2224 (-0.54z)| lr 5.37e-06 | 4211.27 ms | 32.1% bf16 MFU | 122652 tok/s step 18424/19560 | loss 3.333117 (+1.24z)| norm 0.2467 (+0.37z)| lr 5.36e-06 | 4220.19 ms | 32.0% bf16 MFU | 122731 tok/s step 18425/19560 | loss 3.287823 (+0.01z)| norm 0.2340 (-0.11z)| lr 5.36e-06 | 4368.96 ms | 30.9% bf16 MFU | 122594 tok/s step 18426/19560 | loss 3.315077 (+0.77z)| norm 0.2309 (-0.23z)| lr 5.35e-06 | 4150.37 ms | 32.5% bf16 MFU | 122781 tok/s step 18427/19560 | loss 3.279383 (-0.21z)| norm 0.2262 (-0.40z)| lr 5.34e-06 | 4423.41 ms | 30.5% bf16 MFU | 122568 tok/s step 18428/19560 | loss 3.275814 (-0.31z)| norm 0.2339 (-0.12z)| lr 5.33e-06 | 4213.92 ms | 32.0% bf16 MFU | 122660 tok/s step 18429/19560 | loss 3.272651 (-0.39z)| norm 0.2338 (-0.12z)| lr 5.32e-06 | 4169.67 ms | 32.4% bf16 MFU | 122814 tok/s step 18430/19560 | loss 3.272910 
(-0.39z)| norm 0.2252 (-0.45z)| lr 5.31e-06 | 4290.98 ms | 31.5% bf16 MFU | 122783 tok/s step 18431/19560 | loss 3.350767 (+1.72z)| norm 0.2434 (+0.24z)| lr 5.30e-06 | 4193.45 ms | 32.2% bf16 MFU | 122895 tok/s step 18432/19560 | loss 3.265346 (-0.60z)| norm 0.2268 (-0.39z)| lr 5.29e-06 | 4288.31 ms | 31.5% bf16 MFU | 122863 tok/s step 18433/19560 | loss 3.317514 (+0.81z)| norm 0.2380 (+0.03z)| lr 5.28e-06 | 4196.68 ms | 32.2% bf16 MFU | 122966 tok/s step 18434/19560 | loss 3.289300 (+0.04z)| norm 0.2283 (-0.33z)| lr 5.27e-06 | 4153.38 ms | 32.5% bf16 MFU | 123130 tok/s step 18435/19560 | loss 3.325080 (+1.03z)| norm 0.2297 (-0.28z)| lr 5.26e-06 | 4144.12 ms | 32.6% bf16 MFU | 123299 tok/s step 18436/19560 | loss 3.305228 (+0.47z)| norm 0.2339 (-0.13z)| lr 5.25e-06 | 4146.49 ms | 32.6% bf16 MFU | 123456 tok/s step 18437/19560 | loss 3.293774 (+0.18z)| norm 0.2349 (-0.08z)| lr 5.24e-06 | 4166.79 ms | 32.4% bf16 MFU | 123575 tok/s step 18438/19560 | loss 3.420175 (+3.51z)| norm 0.2345 (-0.09z)| lr 5.23e-06 | 4153.81 ms | 32.5% bf16 MFU | 123707 tok/s step 18439/19560 | loss 3.265348 (-0.62z)| norm 0.2256 (-0.43z)| lr 5.22e-06 | 4160.83 ms | 32.4% bf16 MFU | 123822 tok/s step 18440/19560 | loss 3.300464 (+0.32z)| norm 0.2249 (-0.45z)| lr 5.22e-06 | 4154.46 ms | 32.5% bf16 MFU | 123941 tok/s step 18441/19560 | loss 3.288696 (+0.01z)| norm 0.2468 (+0.38z)| lr 5.21e-06 | 4148.77 ms | 32.5% bf16 MFU | 124062 tok/s step 18442/19560 | loss 3.315722 (+0.72z)| norm 0.2238 (-0.49z)| lr 5.20e-06 | 4151.02 ms | 32.5% bf16 MFU | 124174 tok/s step 18443/19560 | loss 3.310451 (+0.56z)| norm 0.2379 (+0.04z)| lr 5.19e-06 | 4150.66 ms | 32.5% bf16 MFU | 124281 tok/s step 18444/19560 | loss 3.302894 (+0.38z)| norm 0.2332 (-0.14z)| lr 5.18e-06 | 4152.68 ms | 32.5% bf16 MFU | 124380 tok/s step 18445/19560 | loss 3.303615 (+0.39z)| norm 0.2293 (-0.28z)| lr 5.17e-06 | 4157.52 ms | 32.5% bf16 MFU | 124466 tok/s step 18446/19560 | loss 3.333862 (+1.20z)| norm 0.2455 (+0.33z)| lr 5.16e-06 | 
4153.40 ms | 32.5% bf16 MFU | 124554 tok/s step 18447/19560 | loss 3.250064 (-1.06z)| norm 0.2286 (-0.30z)| lr 5.15e-06 | 4160.88 ms | 32.4% bf16 MFU | 124627 tok/s step 18448/19560 | loss 3.272834 (-0.46z)| norm 0.2310 (-0.21z)| lr 5.14e-06 | 4150.77 ms | 32.5% bf16 MFU | 124711 tok/s step 18449/19560 | loss 3.260945 (-0.80z)| norm 0.2410 (+0.17z)| lr 5.13e-06 | 4152.90 ms | 32.5% bf16 MFU | 124788 tok/s step 18450/19560 | loss 3.301492 (+0.31z)| norm 0.2306 (-0.23z)| lr 5.12e-06 | 4149.15 ms | 32.5% bf16 MFU | 124866 tok/s step 18451/19560 | loss 3.288276 (-0.05z)| norm 0.2400 (+0.13z)| lr 5.11e-06 | 4160.88 ms | 32.4% bf16 MFU | 124923 tok/s step 18452/19560 | loss 3.394861 (+2.77z)| norm 0.2387 (+0.08z)| lr 5.10e-06 | 4158.60 ms | 32.5% bf16 MFU | 124981 tok/s step 18453/19560 | loss 3.257851 (-0.88z)| norm 0.2387 (+0.07z)| lr 5.10e-06 | 4157.14 ms | 32.5% bf16 MFU | 125038 tok/s step 18454/19560 | loss 3.353616 (+1.64z)| norm 0.2254 (-0.43z)| lr 5.09e-06 | 4158.79 ms | 32.5% bf16 MFU | 125089 tok/s step 18455/19560 | loss 3.278450 (-0.34z)| norm 0.2326 (-0.15z)| lr 5.08e-06 | 4153.39 ms | 32.5% bf16 MFU | 125146 tok/s step 18456/19560 | loss 3.278106 (-0.35z)| norm 0.2246 (-0.45z)| lr 5.07e-06 | 4156.71 ms | 32.5% bf16 MFU | 125195 tok/s step 18457/19560 | loss 3.295849 (+0.12z)| norm 0.2306 (-0.22z)| lr 5.06e-06 | 4151.39 ms | 32.5% bf16 MFU | 125250 tok/s step 18458/19560 | loss 3.287125 (-0.12z)| norm 0.2335 (-0.10z)| lr 5.05e-06 | 4156.43 ms | 32.5% bf16 MFU | 125295 tok/s step 18459/19560 | loss 3.324148 (+0.85z)| norm 0.2608 (+3.15z)| lr 5.04e-06 | 4160.68 ms | 32.5% bf16 MFU | 125330 tok/s step 18460/19560 | loss 3.352998 (+1.59z)| norm 0.2334 (-0.13z)| lr 5.03e-06 | 4154.54 ms | 32.5% bf16 MFU | 125374 tok/s step 18461/19560 | loss 3.279572 (-0.35z)| norm 0.2234 (-1.30z)| lr 5.02e-06 | 4157.58 ms | 32.5% bf16 MFU | 125410 tok/s step 18462/19560 | loss 3.285362 (-0.21z)| norm 0.2411 (+0.79z)| lr 5.01e-06 | 4156.28 ms | 32.5% bf16 MFU | 125447 tok/s step 
18463/19560 | loss 3.314402 (+0.57z)| norm 0.2267 (-0.90z)| lr 5.00e-06 | 4149.97 ms | 32.5% bf16 MFU | 125491 tok/s step 18464/19560 | loss 3.389770 (+2.51z)| norm 0.2333 (-0.11z)| lr 4.99e-06 | 4163.43 ms | 32.4% bf16 MFU | 125513 tok/s step 18465/19560 | loss 3.312318 (+0.47z)| norm 0.2366 (+0.30z)| lr 4.99e-06 | 4167.46 ms | 32.4% bf16 MFU | 125528 tok/s step 18466/19560 | loss 3.252658 (-1.10z)| norm 0.2340 (-0.02z)| lr 4.98e-06 | 4160.08 ms | 32.5% bf16 MFU | 125553 tok/s step 18467/19560 | loss 3.290885 (-0.10z)| norm 0.2190 (-1.78z)| lr 4.97e-06 | 4160.23 ms | 32.5% bf16 MFU | 125576 tok/s step 18468/19560 | loss 3.319390 (+0.64z)| norm 0.2401 (+0.72z)| lr 4.96e-06 | 4154.51 ms | 32.5% bf16 MFU | 125607 tok/s step 18469/19560 | loss 3.297309 (+0.05z)| norm 0.2214 (-1.47z)| lr 4.95e-06 | 4155.56 ms | 32.5% bf16 MFU | 125635 tok/s step 18470/19560 | loss 3.252676 (-1.12z)| norm 0.2319 (-0.23z)| lr 4.94e-06 | 4173.79 ms | 32.3% bf16 MFU | 125634 tok/s step 18471/19560 | loss 3.238582 (-1.47z)| norm 0.2411 (+0.84z)| lr 4.93e-06 | 4156.96 ms | 32.5% bf16 MFU | 125659 tok/s step 18472/19560 | loss 3.293044 (-0.03z)| norm 0.2302 (-0.43z)| lr 4.92e-06 | 4155.76 ms | 32.5% bf16 MFU | 125684 tok/s step 18473/19560 | loss 3.310436 (+0.42z)| norm 0.2284 (-0.65z)| lr 4.91e-06 | 4162.80 ms | 32.4% bf16 MFU | 125697 tok/s step 18474/19560 | loss 3.274697 (-0.52z)| norm 0.2369 (+0.33z)| lr 4.90e-06 | 4164.50 ms | 32.4% bf16 MFU | 125707 tok/s step 18475/19560 | loss 3.292925 (-0.04z)| norm 0.2302 (-0.45z)| lr 4.90e-06 | 4169.47 ms | 32.4% bf16 MFU | 125709 tok/s step 18476/19560 | loss 3.293768 (-0.03z)| norm 0.2405 (+0.75z)| lr 4.89e-06 | 4151.74 ms | 32.5% bf16 MFU | 125737 tok/s step 18477/19560 | loss 3.234805 (-1.57z)| norm 0.2205 (-1.57z)| lr 4.88e-06 | 4163.75 ms | 32.4% bf16 MFU | 125746 tok/s step 18478/19560 | loss 3.307827 (+0.34z)| norm 0.2366 (+0.32z)| lr 4.87e-06 | 4168.01 ms | 32.4% bf16 MFU | 125748 tok/s step 18479/19560 | loss 3.242971 (-1.35z)| norm 
0.2233 (-1.24z)| lr 4.86e-06 | 4165.46 ms | 32.4% bf16 MFU | 125754 tok/s step 18480/19560 | loss 3.276455 (-0.46z)| norm 0.2332 (-0.06z)| lr 4.85e-06 | 4166.80 ms | 32.4% bf16 MFU | 125758 tok/s step 18481/19560 | loss 3.357385 (+1.67z)| norm 0.2378 (+0.48z)| lr 4.84e-06 | 4163.27 ms | 32.4% bf16 MFU | 125766 tok/s step 18482/19560 | loss 3.280323 (-0.37z)| norm 0.2388 (+0.63z)| lr 4.83e-06 | 4164.57 ms | 32.4% bf16 MFU | 125773 tok/s step 18483/19560 | loss 3.268991 (-0.66z)| norm 0.2346 (+0.16z)| lr 4.82e-06 | 4154.53 ms | 32.5% bf16 MFU | 125794 tok/s step 18484/19560 | loss 3.274455 (-0.52z)| norm 0.2290 (-0.57z)| lr 4.81e-06 | 4157.36 ms | 32.5% bf16 MFU | 125810 tok/s step 18485/19560 | loss 3.251348 (-1.14z)| norm 0.2251 (-1.05z)| lr 4.81e-06 | 4152.61 ms | 32.5% bf16 MFU | 125832 tok/s step 18486/19560 | loss 3.470211 (+4.44z)| norm 0.2762 (+5.03z)| lr 4.80e-06 | 4155.27 ms | 32.5% bf16 MFU | 125849 tok/s step 18487/19560 | loss 3.304215 (+0.25z)| norm 0.2380 (+0.53z)| lr 4.79e-06 | 4161.80 ms | 32.4% bf16 MFU | 125856 tok/s step 18488/19560 | loss 3.277526 (-0.42z)| norm 0.2325 (-0.10z)| lr 4.78e-06 | 4150.83 ms | 32.5% bf16 MFU | 125878 tok/s step 18489/19560 | loss 3.305386 (+0.28z)| norm 0.2387 (+0.63z)| lr 4.77e-06 | 4258.90 ms | 31.7% bf16 MFU | 125740 tok/s step 18490/19560 | loss 3.241987 (-1.30z)| norm 0.2307 (-0.33z)| lr 4.76e-06 | 4153.25 ms | 32.5% bf16 MFU | 125764 tok/s step 18491/19560 | loss 3.222063 (-1.77z)| norm 0.2379 (+0.53z)| lr 4.75e-06 | 4154.92 ms | 32.5% bf16 MFU | 125785 tok/s step 18492/19560 | loss 3.308869 (+0.38z)| norm 0.2350 (+0.18z)| lr 4.74e-06 | 4163.28 ms | 32.4% bf16 MFU | 125793 tok/s step 18493/19560 | loss 3.264118 (-0.73z)| norm 0.2276 (-0.72z)| lr 4.73e-06 | 4159.50 ms | 32.5% bf16 MFU | 125805 tok/s step 18494/19560 | loss 3.382348 (+2.14z)| norm 0.2464 (+1.53z)| lr 4.73e-06 | 4161.42 ms | 32.4% bf16 MFU | 125814 tok/s step 18495/19560 | loss 3.248993 (-1.09z)| norm 0.2555 (+2.59z)| lr 4.72e-06 | 4156.55 ms | 
32.5% bf16 MFU | 125830 tok/s step 18496/19560 | loss 3.286161 (-0.18z)| norm 0.2315 (-0.25z)| lr 4.71e-06 | 4153.09 ms | 32.5% bf16 MFU | 125851 tok/s step 18497/19560 | loss 3.271859 (-0.52z)| norm 0.2225 (-1.30z)| lr 4.70e-06 | 4154.11 ms | 32.5% bf16 MFU | 125869 tok/s step 18498/19560 | loss 3.400501 (+2.55z)| norm 0.2377 (+0.49z)| lr 4.69e-06 | 4152.18 ms | 32.5% bf16 MFU | 125889 tok/s step 18499/19560 | loss 3.296984 (+0.08z)| norm 0.2268 (-0.78z)| lr 4.68e-06 | 4153.15 ms | 32.5% bf16 MFU | 125906 tok/s step 18500/19560 | loss 3.257704 (-0.85z)| norm 0.2318 (-0.19z)| lr 4.67e-06 | 4175.33 ms | 32.3% bf16 MFU | 125889 tok/s val loss 3.266308
HellaSwag: 3024/10042 = 0.301135 step 18501/19560 | loss 3.274822 (-0.43z)| norm 0.2321 (-0.16z)| lr 4.66e-06 | 5227.43 ms | 25.8% bf16 MFU | 124610 tok/s step 18502/19560 | loss 3.286883 (-0.14z)| norm 0.2347 (+0.15z)| lr 4.66e-06 | 4180.50 ms | 32.3% bf16 MFU | 124650 tok/s step 18503/19560 | loss 3.277386 (-0.36z)| norm 0.2214 (-1.41z)| lr 4.65e-06 | 4153.60 ms | 32.5% bf16 MFU | 124729 tok/s step 18504/19560 | loss 3.259385 (-0.80z)| norm 0.2310 (-0.26z)| lr 4.64e-06 | 4159.79 ms | 32.5% bf16 MFU | 124794 tok/s step 18505/19560 | loss 3.351354 (+1.40z)| norm 0.2210 (-1.43z)| lr 4.63e-06 | 4155.30 ms | 32.5% bf16 MFU | 124863 tok/s step 18506/19560 | loss 3.288684 (-0.12z)| norm 0.2293 (-0.44z)| lr 4.62e-06 | 4154.20 ms | 32.5% bf16 MFU | 124930 tok/s step 18507/19560 | loss 3.308126 (+0.35z)| norm 0.2303 (-0.32z)| lr 4.61e-06 | 4157.11 ms | 32.5% bf16 MFU | 124990 tok/s step 18508/19560 | loss 3.289669 (-0.10z)| norm 0.2247 (-0.98z)| lr 4.60e-06 | 4162.72 ms | 32.4% bf16 MFU | 125038 tok/s step 18509/19560 | loss 3.251248 (-1.03z)| norm 0.2226 (-1.20z)| lr 4.59e-06 | 4155.05 ms | 32.5% bf16 MFU | 125095 tok/s step 18510/19560 | loss 3.320584 (+0.65z)| norm 0.2309 (-0.22z)| lr 4.59e-06 | 4172.08 ms | 32.4% bf16 MFU | 125123 tok/s step 18511/19560 | loss 3.294131 (+0.01z)| norm 0.2409 (+0.95z)| lr 4.58e-06 | 4157.61 ms | 32.5% bf16 MFU | 125172 tok/s step 18512/19560 | loss 
3.232530 (-1.47z)| norm 0.2273 (-0.66z)| lr 4.57e-06 | 4160.94 ms | 32.4% bf16 MFU | 125214 tok/s step 18513/19560 | loss 3.287836 (-0.13z)| norm 0.2281 (-0.57z)| lr 4.56e-06 | 4155.94 ms | 32.5% bf16 MFU | 125261 tok/s step 18514/19560 | loss 3.277138 (-0.41z)| norm 0.2157 (-1.99z)| lr 4.55e-06 | 4159.90 ms | 32.5% bf16 MFU | 125299 tok/s step 18515/19560 | loss 3.262639 (-0.77z)| norm 0.2232 (-1.10z)| lr 4.54e-06 | 4153.09 ms | 32.5% bf16 MFU | 125346 tok/s step 18516/19560 | loss 3.286944 (-0.17z)| norm 0.2380 (+0.65z)| lr 4.53e-06 | 4147.90 ms | 32.6% bf16 MFU | 125399 tok/s step 18517/19560 | loss 3.324112 (+0.75z)| norm 0.2326 (+0.00z)| lr 4.52e-06 | 4156.49 ms | 32.5% bf16 MFU | 125436 tok/s step 18518/19560 | loss 3.282313 (-0.29z)| norm 0.2467 (+1.64z)| lr 4.52e-06 | 4152.04 ms | 32.5% bf16 MFU | 125478 tok/s step 18519/19560 | loss 3.272758 (-0.53z)| norm 0.2590 (+2.95z)| lr 4.51e-06 | 4161.06 ms | 32.4% bf16 MFU | 125504 tok/s step 18520/19560 | loss 3.253874 (-1.00z)| norm 0.2373 (+0.49z)| lr 4.50e-06 | 4167.75 ms | 32.4% bf16 MFU | 125518 tok/s step 18521/19560 | loss 3.261506 (-0.80z)| norm 0.2248 (-0.91z)| lr 4.49e-06 | 4163.84 ms | 32.4% bf16 MFU | 125538 tok/s step 18522/19560 | loss 3.336968 (+1.12z)| norm 0.2296 (-0.37z)| lr 4.48e-06 | 4148.11 ms | 32.5% bf16 MFU | 125581 tok/s step 18523/19560 | loss 3.299767 (+0.17z)| norm 0.2372 (+0.48z)| lr 4.47e-06 | 4148.54 ms | 32.5% bf16 MFU | 125621 tok/s step 18524/19560 | loss 3.220980 (-1.82z)| norm 0.2281 (-0.56z)| lr 4.46e-06 | 4154.15 ms | 32.5% bf16 MFU | 125650 tok/s step 18525/19560 | loss 3.294218 (+0.03z)| norm 0.2309 (-0.24z)| lr 4.46e-06 | 4157.68 ms | 32.5% bf16 MFU | 125673 tok/s step 18526/19560 | loss 3.326171 (+0.83z)| norm 0.2366 (+0.40z)| lr 4.45e-06 | 4166.00 ms | 32.4% bf16 MFU | 125682 tok/s step 18527/19560 | loss 3.230613 (-1.58z)| norm 0.2314 (-0.20z)| lr 4.44e-06 | 4153.02 ms | 32.5% bf16 MFU | 125710 tok/s step 18528/19560 | loss 3.268663 (-0.62z)| norm 0.2249 (-0.92z)| lr 
4.43e-06 | 4149.77 ms | 32.5% bf16 MFU | 125741 tok/s step 18529/19560 | loss 3.322567 (+0.75z)| norm 0.2293 (-0.40z)| lr 4.42e-06 | 4158.59 ms | 32.5% bf16 MFU | 125758 tok/s step 18530/19560 | loss 3.395678 (+2.52z)| norm 0.2467 (+1.59z)| lr 4.41e-06 | 4148.26 ms | 32.5% bf16 MFU | 125789 tok/s step 18531/19560 | loss 3.360350 (+1.62z)| norm 0.2727 (+4.22z)| lr 4.40e-06 | 7407.95 ms | 18.2% bf16 MFU | 123039 tok/s step 18532/19560 | loss 3.393435 (+2.36z)| norm 0.2785 (+4.44z)| lr 4.40e-06 | 4134.14 ms | 32.7% bf16 MFU | 123228 tok/s step 18533/19560 | loss 3.204977 (-2.10z)| norm 0.2478 (+1.39z)| lr 4.39e-06 | 4154.07 ms | 32.5% bf16 MFU | 123377 tok/s step 18534/19560 | loss 3.263429 (-0.72z)| norm 0.2249 (-0.86z)| lr 4.38e-06 | 4154.23 ms | 32.5% bf16 MFU | 123518 tok/s step 18535/19560 | loss 3.310878 (+0.41z)| norm 0.2424 (+0.85z)| lr 4.37e-06 | 4148.74 ms | 32.5% bf16 MFU | 123661 tok/s step 18536/19560 | loss 3.216330 (-1.83z)| norm 0.2368 (+0.29z)| lr 4.36e-06 | 4149.67 ms | 32.5% bf16 MFU | 123795 tok/s step 18537/19560 | loss 3.348571 (+1.28z)| norm 0.2291 (-0.47z)| lr 4.35e-06 | 4164.35 ms | 32.4% bf16 MFU | 123900 tok/s step 18538/19560 | loss 3.332845 (+0.89z)| norm 0.3097 (+6.21z)| lr 4.35e-06 | 4149.08 ms | 32.5% bf16 MFU | 124023 tok/s step 18539/19560 | loss 3.286759 (-0.20z)| norm 0.2301 (-0.35z)| lr 4.34e-06 | 4142.44 ms | 32.6% bf16 MFU | 124150 tok/s step 18540/19560 | loss 3.296196 (+0.02z)| norm 0.2360 (+0.13z)| lr 4.33e-06 | 4153.86 ms | 32.5% bf16 MFU | 124254 tok/s step 18541/19560 | loss 3.229048 (-1.55z)| norm 0.2406 (+0.51z)| lr 4.32e-06 | 4152.09 ms | 32.5% bf16 MFU | 124355 tok/s step 18542/19560 | loss 3.211038 (-1.92z)| norm 0.2366 (+0.17z)| lr 4.31e-06 | 4159.13 ms | 32.5% bf16 MFU | 124440 tok/s step 18543/19560 | loss 3.318799 (+0.57z)| norm 0.2380 (+0.28z)| lr 4.30e-06 | 4306.27 ms | 31.4% bf16 MFU | 124305 tok/s step 18544/19560 | loss 3.262017 (-0.74z)| norm 0.2496 (+1.21z)| lr 4.29e-06 | 4162.12 ms | 32.4% bf16 MFU | 124388 
tok/s step 18545/19560 | loss 3.261010 (-0.75z)| norm 0.2322 (-0.22z)| lr 4.29e-06 | 4139.05 ms | 32.6% bf16 MFU | 124502 tok/s step 18546/19560 | loss 3.359880 (+1.50z)| norm 0.2305 (-0.36z)| lr 4.28e-06 | 4159.52 ms | 32.5% bf16 MFU | 124580 tok/s step 18547/19560 | loss 3.293659 (-0.01z)| norm 0.2289 (-0.49z)| lr 4.27e-06 | 4148.08 ms | 32.5% bf16 MFU | 124670 tok/s step 18548/19560 | loss 3.387763 (+2.09z)| norm 0.2770 (+3.32z)| lr 4.26e-06 | 4144.38 ms | 32.6% bf16 MFU | 124762 tok/s step 18549/19560 | loss 3.301487 (+0.14z)| norm 0.2301 (-0.39z)| lr 4.25e-06 | 4149.49 ms | 32.5% bf16 MFU | 124841 tok/s step 18550/19560 | loss 3.295820 (+0.01z)| norm 0.2223 (-1.00z)| lr 4.24e-06 | 4153.09 ms | 32.5% bf16 MFU | 124911 tok/s step 18551/19560 | loss 3.294563 (-0.02z)| norm 0.2300 (-0.40z)| lr 4.24e-06 | 4155.52 ms | 32.5% bf16 MFU | 124974 tok/s step 18552/19560 | loss 3.323866 (+0.65z)| norm 0.2283 (-0.53z)| lr 4.23e-06 | 4155.10 ms | 32.5% bf16 MFU | 125034 tok/s step 18553/19560 | loss 3.221842 (-1.64z)| norm 0.2282 (-0.53z)| lr 4.22e-06 | 4158.73 ms | 32.5% bf16 MFU | 125086 tok/s step 18554/19560 | loss 3.248302 (-1.03z)| norm 0.2441 (+0.72z)| lr 4.21e-06 | 4149.95 ms | 32.5% bf16 MFU | 125149 tok/s step 18555/19560 | loss 3.267530 (-0.60z)| norm 0.2290 (-0.48z)| lr 4.20e-06 | 4160.23 ms | 32.5% bf16 MFU | 125192 tok/s step 18556/19560 | loss 3.269670 (-0.55z)| norm 0.2244 (-0.84z)| lr 4.19e-06 | 4156.20 ms | 32.5% bf16 MFU | 125240 tok/s step 18557/19560 | loss 3.270724 (-0.52z)| norm 0.2389 (+0.31z)| lr 4.19e-06 | 4146.56 ms | 32.6% bf16 MFU | 125300 tok/s step 18558/19560 | loss 3.306698 (+0.27z)| norm 0.2476 (+0.98z)| lr 4.18e-06 | 4158.09 ms | 32.5% bf16 MFU | 125339 tok/s step 18559/19560 | loss 3.309200 (+0.34z)| norm 0.2419 (+0.53z)| lr 4.17e-06 | 4150.31 ms | 32.5% bf16 MFU | 125389 tok/s step 18560/19560 | loss 3.312086 (+0.40z)| norm 0.2314 (-0.30z)| lr 4.16e-06 | 4152.23 ms | 32.5% bf16 MFU | 125433 tok/s step 18561/19560 | loss 3.253814 
(-0.90z)| norm 0.2311 (-0.32z)| lr 4.15e-06 | 4153.81 ms | 32.5% bf16 MFU | 125472 tok/s step 18562/19560 | loss 3.301941 (+0.18z)| norm 0.2284 (-0.53z)| lr 4.14e-06 | 4160.15 ms | 32.5% bf16 MFU | 125500 tok/s step 18563/19560 | loss 3.339570 (+1.02z)| norm 0.2493 (+1.11z)| lr 4.14e-06 | 4163.30 ms | 32.4% bf16 MFU | 125521 tok/s step 18564/19560 | loss 3.282876 (-0.25z)| norm 0.2274 (-0.61z)| lr 4.13e-06 | 4157.17 ms | 32.5% bf16 MFU | 125551 tok/s step 18565/19560 | loss 3.283021 (-0.25z)| norm 0.2293 (-0.46z)| lr 4.12e-06 | 4158.52 ms | 32.5% bf16 MFU | 125577 tok/s step 18566/19560 | loss 3.365849 (+1.65z)| norm 0.2338 (-0.11z)| lr 4.11e-06 | 4154.03 ms | 32.5% bf16 MFU | 125609 tok/s step 18567/19560 | loss 3.326529 (+0.74z)| norm 0.2367 (+0.11z)| lr 4.10e-06 | 4152.61 ms | 32.5% bf16 MFU | 125641 tok/s step 18568/19560 | loss 3.248481 (-1.03z)| norm 0.2201 (-1.18z)| lr 4.09e-06 | 4161.41 ms | 32.4% bf16 MFU | 125659 tok/s step 18569/19560 | loss 3.333486 (+0.90z)| norm 0.2277 (-0.58z)| lr 4.09e-06 | 4157.11 ms | 32.5% bf16 MFU | 125682 tok/s step 18570/19560 | loss 3.259348 (-0.78z)| norm 0.2255 (-0.75z)| lr 4.08e-06 | 4159.53 ms | 32.5% bf16 MFU | 125700 tok/s step 18571/19560 | loss 3.243572 (-1.12z)| norm 0.2244 (-0.83z)| lr 4.07e-06 | 4154.24 ms | 32.5% bf16 MFU | 125725 tok/s step 18572/19560 | loss 3.346300 (+1.19z)| norm 0.2389 (+0.31z)| lr 4.06e-06 | 4162.40 ms | 32.4% bf16 MFU | 125737 tok/s step 18573/19560 | loss 3.336122 (+0.95z)| norm 0.2300 (-0.39z)| lr 4.05e-06 | 4164.37 ms | 32.4% bf16 MFU | 125745 tok/s step 18574/19560 | loss 3.282612 (-0.24z)| norm 0.2319 (-0.24z)| lr 4.05e-06 | 4165.83 ms | 32.4% bf16 MFU | 125750 tok/s step 18575/19560 | loss 3.327685 (+0.76z)| norm 0.2301 (-0.38z)| lr 4.04e-06 | 4152.95 ms | 32.5% bf16 MFU | 125775 tok/s step 18576/19560 | loss 3.281360 (-0.28z)| norm 0.2402 (+0.41z)| lr 4.03e-06 | 4172.28 ms | 32.4% bf16 MFU | 125769 tok/s step 18577/19560 | loss 3.274037 (-0.45z)| norm 0.2180 (-1.32z)| lr 4.02e-06 | 
4157.11 ms | 32.5% bf16 MFU | 125787 tok/s step 18578/19560 | loss 3.332268 (+0.85z)| norm 0.2460 (+0.86z)| lr 4.01e-06 | 4170.88 ms | 32.4% bf16 MFU | 125782 tok/s step 18579/19560 | loss 3.257681 (-0.81z)| norm 0.2280 (-0.54z)| lr 4.00e-06 | 4158.96 ms | 32.5% bf16 MFU | 125796 tok/s step 18580/19560 | loss 3.247119 (-1.04z)| norm 0.2350 (+0.01z)| lr 4.00e-06 | 4160.28 ms | 32.5% bf16 MFU | 125808 tok/s step 18581/19560 | loss 3.288914 (-0.09z)| norm 0.2233 (-0.89z)| lr 3.99e-06 | 4155.33 ms | 32.5% bf16 MFU | 125826 tok/s step 18582/19560 | loss 3.323038 (+0.69z)| norm 0.2263 (-0.65z)| lr 3.98e-06 | 4159.98 ms | 32.5% bf16 MFU | 125836 tok/s step 18583/19560 | loss 3.316467 (+0.54z)| norm 0.2228 (-0.92z)| lr 3.97e-06 | 4165.34 ms | 32.4% bf16 MFU | 125838 tok/s step 18584/19560 | loss 3.324793 (+0.72z)| norm 0.2243 (-0.80z)| lr 3.96e-06 | 4165.12 ms | 32.4% bf16 MFU | 125840 tok/s step 18585/19560 | loss 3.249912 (-0.99z)| norm 0.2254 (-0.71z)| lr 3.96e-06 | 4157.57 ms | 32.5% bf16 MFU | 125853 tok/s step 18586/19560 | loss 3.341609 (+1.09z)| norm 0.2276 (-0.54z)| lr 3.95e-06 | 4156.69 ms | 32.5% bf16 MFU | 125867 tok/s step 18587/19560 | loss 3.247189 (-1.04z)| norm 0.2351 (+0.06z)| lr 3.94e-06 | 4168.87 ms | 32.4% bf16 MFU | 125862 tok/s step 18588/19560 | loss 3.305879 (+0.30z)| norm 0.2274 (-0.54z)| lr 3.93e-06 | 4172.87 ms | 32.4% bf16 MFU | 125851 tok/s step 18589/19560 | loss 3.253870 (-0.88z)| norm 0.2209 (-1.05z)| lr 3.92e-06 | 4163.93 ms | 32.4% bf16 MFU | 125854 tok/s step 18590/19560 | loss 3.325513 (+0.75z)| norm 0.2618 (+2.11z)| lr 3.92e-06 | 4166.03 ms | 32.4% bf16 MFU | 125854 tok/s step 18591/19560 | loss 3.297841 (+0.12z)| norm 0.2270 (-0.57z)| lr 3.91e-06 | 4150.05 ms | 32.5% bf16 MFU | 125877 tok/s step 18592/19560 | loss 3.253631 (-0.88z)| norm 0.2254 (-0.69z)| lr 3.90e-06 | 4173.25 ms | 32.4% bf16 MFU | 125865 tok/s step 18593/19560 | loss 3.271806 (-0.45z)| norm 0.2254 (-0.68z)| lr 3.89e-06 | 4159.23 ms | 32.5% bf16 MFU | 125875 tok/s step 
18594/19560 | loss 3.258726 (-0.75z)| norm 0.2224 (-0.91z)| lr 3.88e-06 | 4155.95 ms | 32.5% bf16 MFU | 125889 tok/s step 18595/19560 | loss 3.298270 (+0.16z)| norm 0.2298 (-0.35z)| lr 3.88e-06 | 4162.20 ms | 32.4% bf16 MFU | 125892 tok/s step 18596/19560 | loss 3.307102 (+0.37z)| norm 0.2318 (-0.18z)| lr 3.87e-06 | 4154.36 ms | 32.5% bf16 MFU | 125908 tok/s step 18597/19560 | loss 3.265614 (-0.59z)| norm 0.2207 (-1.04z)| lr 3.86e-06 | 4158.98 ms | 32.5% bf16 MFU | 125915 tok/s step 18598/19560 | loss 3.282024 (-0.21z)| norm 0.2428 (+0.66z)| lr 3.85e-06 | 4157.69 ms | 32.5% bf16 MFU | 125925 tok/s step 18599/19560 | loss 3.316549 (+0.58z)| norm 0.2347 (+0.03z)| lr 3.84e-06 | 4159.92 ms | 32.5% bf16 MFU | 125930 tok/s step 18600/19560 | loss 3.253503 (-0.88z)| norm 0.2276 (-0.51z)| lr 3.84e-06 | 4162.69 ms | 32.4% bf16 MFU | 125931 tok/s step 18601/19560 | loss 3.289822 (-0.03z)| norm 0.2499 (+1.18z)| lr 3.83e-06 | 4157.82 ms | 32.5% bf16 MFU | 125939 tok/s step 18602/19560 | loss 3.299151 (+0.18z)| norm 0.2259 (-0.64z)| lr 3.82e-06 | 4155.02 ms | 32.5% bf16 MFU | 125952 tok/s step 18603/19560 | loss 3.277354 (-0.33z)| norm 0.2269 (-0.57z)| lr 3.81e-06 | 5293.65 ms | 25.5% bf16 MFU | 124606 tok/s step 18604/19560 | loss 3.342026 (+1.16z)| norm 0.2312 (-0.23z)| lr 3.80e-06 | 4874.29 ms | 27.7% bf16 MFU | 123754 tok/s step 18605/19560 | loss 3.212215 (-1.82z)| norm 0.2340 (-0.02z)| lr 3.80e-06 | 4333.08 ms | 31.2% bf16 MFU | 123616 tok/s step 18606/19560 | loss 3.299750 (+0.19z)| norm 0.2242 (-0.77z)| lr 3.79e-06 | 4311.77 ms | 31.3% bf16 MFU | 123515 tok/s step 18607/19560 | loss 3.315635 (+0.54z)| norm 0.2277 (-0.50z)| lr 3.78e-06 | 4552.44 ms | 29.7% bf16 MFU | 123097 tok/s step 18608/19560 | loss 3.316837 (+0.56z)| norm 0.2339 (-0.03z)| lr 3.77e-06 | 4352.54 ms | 31.0% bf16 MFU | 122965 tok/s step 18609/19560 | loss 3.290560 (-0.03z)| norm 0.2346 (+0.03z)| lr 3.76e-06 | 4740.84 ms | 28.5% bf16 MFU | 122347 tok/s step 18610/19560 | loss 3.265872 (-0.60z)| norm 
0.2361 (+0.14z)| lr 3.76e-06 | 4500.99 ms | 30.0% bf16 MFU | 122053 tok/s step 18611/19560 | loss 3.314531 (+0.52z)| norm 0.2451 (+0.82z)| lr 3.75e-06 | 4322.57 ms | 31.2% bf16 MFU | 122015 tok/s step 18612/19560 | loss 3.311889 (+0.45z)| norm 0.2246 (-0.74z)| lr 3.74e-06 | 4341.54 ms | 31.1% bf16 MFU | 121953 tok/s step 18613/19560 | loss 3.257912 (-0.80z)| norm 0.2255 (-0.67z)| lr 3.73e-06 | 4181.51 ms | 32.3% bf16 MFU | 122124 tok/s step 18614/19560 | loss 3.241277 (-1.23z)| norm 0.2249 (-0.72z)| lr 3.72e-06 | 4427.27 ms | 30.5% bf16 MFU | 121939 tok/s step 18615/19560 | loss 3.270007 (-0.51z)| norm 0.2238 (-0.79z)| lr 3.72e-06 | 4513.87 ms | 29.9% bf16 MFU | 121650 tok/s step 18616/19560 | loss 3.297432 (+0.17z)| norm 0.2241 (-0.76z)| lr 3.71e-06 | 4507.32 ms | 30.0% bf16 MFU | 121383 tok/s step 18617/19560 | loss 3.286546 (-0.10z)| norm 0.2249 (-0.69z)| lr 3.70e-06 | 4238.26 ms | 31.9% bf16 MFU | 121499 tok/s step 18618/19560 | loss 3.223557 (-1.65z)| norm 0.2262 (-0.58z)| lr 3.69e-06 | 4154.32 ms | 32.5% bf16 MFU | 121734 tok/s step 18619/19560 | loss 3.295462 (+0.12z)| norm 0.2187 (-1.16z)| lr 3.69e-06 | 4154.80 ms | 32.5% bf16 MFU | 121957 tok/s step 18620/19560 | loss 3.309614 (+0.47z)| norm 0.2261 (-0.57z)| lr 3.68e-06 | 4371.13 ms | 30.9% bf16 MFU | 121856 tok/s step 18621/19560 | loss 3.283932 (-0.18z)| norm 0.2244 (-0.70z)| lr 3.67e-06 | 4277.02 ms | 31.6% bf16 MFU | 121893 tok/s step 18622/19560 | loss 3.298555 (+0.21z)| norm 0.2290 (-0.33z)| lr 3.66e-06 | 4269.05 ms | 31.6% bf16 MFU | 121938 tok/s step 18623/19560 | loss 3.284615 (-0.15z)| norm 0.2381 (+0.40z)| lr 3.65e-06 | 4153.88 ms | 32.5% bf16 MFU | 122152 tok/s step 18624/19560 | loss 3.272877 (-0.45z)| norm 0.2246 (-0.67z)| lr 3.65e-06 | 4195.47 ms | 32.2% bf16 MFU | 122293 tok/s step 18625/19560 | loss 3.255367 (-0.90z)| norm 0.2331 (+0.00z)| lr 3.64e-06 | 4178.28 ms | 32.3% bf16 MFU | 122452 tok/s step 18626/19560 | loss 3.314128 (+0.64z)| norm 0.2261 (-0.55z)| lr 3.63e-06 | 4295.24 ms | 
31.4% bf16 MFU | 122433 tok/s step 18627/19560 | loss 3.372947 (+2.14z)| norm 0.2311 (-0.15z)| lr 3.62e-06 | 4163.91 ms | 32.4% bf16 MFU | 122607 tok/s step 18628/19560 | loss 3.297256 (+0.17z)| norm 0.2282 (-0.38z)| lr 3.62e-06 | 4172.31 ms | 32.4% bf16 MFU | 122759 tok/s step 18629/19560 | loss 3.283159 (-0.20z)| norm 0.2397 (+0.53z)| lr 3.61e-06 | 4162.79 ms | 32.4% bf16 MFU | 122919 tok/s step 18630/19560 | loss 3.309493 (+0.48z)| norm 0.2316 (-0.11z)| lr 3.60e-06 | 4256.61 ms | 31.7% bf16 MFU | 122931 tok/s step 18631/19560 | loss 3.308859 (+0.46z)| norm 0.2360 (+0.23z)| lr 3.59e-06 | 4150.88 ms | 32.5% bf16 MFU | 123100 tok/s step 18632/19560 | loss 3.276297 (-0.39z)| norm 0.2370 (+0.31z)| lr 3.58e-06 | 4305.73 ms | 31.4% bf16 MFU | 123033 tok/s step 18633/19560 | loss 3.244199 (-1.21z)| norm 0.2283 (-0.40z)| lr 3.58e-06 | 4157.24 ms | 32.5% bf16 MFU | 123187 tok/s step 18634/19560 | loss 3.275604 (-0.38z)| norm 0.2445 (+0.89z)| lr 3.57e-06 | 4271.70 ms | 31.6% bf16 MFU | 123165 tok/s step 18635/19560 | loss 3.218615 (-1.83z)| norm 0.2326 (-0.06z)| lr 3.56e-06 | 4174.15 ms | 32.3% bf16 MFU | 123287 tok/s step 18636/19560 | loss 3.314986 (+0.65z)| norm 0.2292 (-0.34z)| lr 3.55e-06 | 4158.63 ms | 32.5% bf16 MFU | 123426 tok/s step 18637/19560 | loss 3.241983 (-1.23z)| norm 0.2271 (-0.51z)| lr 3.55e-06 | 4188.81 ms | 32.2% bf16 MFU | 123513 tok/s step 18638/19560 | loss 3.219031 (-1.78z)| norm 0.2237 (-0.77z)| lr 3.54e-06 | 4357.01 ms | 31.0% bf16 MFU | 123354 tok/s step 18639/19560 | loss 3.210474 (-1.95z)| norm 0.2228 (-0.83z)| lr 3.53e-06 | 4269.74 ms | 31.6% bf16 MFU | 123326 tok/s step 18640/19560 | loss 3.224240 (-1.60z)| norm 0.2230 (-0.81z)| lr 3.52e-06 | 4406.07 ms | 30.6% bf16 MFU | 123109 tok/s step 18641/19560 | loss 3.266426 (-0.54z)| norm 0.2239 (-0.74z)| lr 3.52e-06 | 4164.40 ms | 32.4% bf16 MFU | 123249 tok/s step 18642/19560 | loss 3.285139 (-0.07z)| norm 0.2299 (-0.28z)| lr 3.51e-06 | 4155.84 ms | 32.5% bf16 MFU | 123394 tok/s step 18643/19560 
| loss 3.305966 (+0.44z)| norm 0.2323 (-0.09z)| lr 3.50e-06 | 4168.17 ms | 32.4% bf16 MFU | 123513 tok/s step 18644/19560 | loss 3.320490 (+0.79z)| norm 0.2320 (-0.10z)| lr 3.49e-06 | 4229.92 ms | 31.9% bf16 MFU | 123535 tok/s step 18645/19560 | loss 3.286052 (-0.06z)| norm 0.2413 (+0.64z)| lr 3.49e-06 | 4158.22 ms | 32.5% bf16 MFU | 123663 tok/s step 18646/19560 | loss 3.231044 (-1.41z)| norm 0.2233 (-0.80z)| lr 3.48e-06 | 4171.24 ms | 32.4% bf16 MFU | 123764 tok/s step 18647/19560 | loss 3.228277 (-1.46z)| norm 0.2240 (-0.73z)| lr 3.47e-06 | 4160.78 ms | 32.5% bf16 MFU | 123876 tok/s step 18648/19560 | loss 3.273401 (-0.36z)| norm 0.2236 (-0.75z)| lr 3.46e-06 | 4303.26 ms | 31.4% bf16 MFU | 123774 tok/s step 18649/19560 | loss 3.252136 (-0.88z)| norm 0.2241 (-0.71z)| lr 3.46e-06 | 4150.72 ms | 32.5% bf16 MFU | 123901 tok/s step 18650/19560 | loss 3.318453 (+0.76z)| norm 0.2266 (-0.51z)| lr 3.45e-06 | 4207.22 ms | 32.1% bf16 MFU | 123937 tok/s step 18651/19560 | loss 3.258686 (-0.71z)| norm 0.2423 (+0.77z)| lr 3.44e-06 | 4331.47 ms | 31.2% bf16 MFU | 123792 tok/s step 18652/19560 | loss 3.258194 (-0.73z)| norm 0.2234 (-0.76z)| lr 3.43e-06 | 4401.35 ms | 30.7% bf16 MFU | 123558 tok/s step 18653/19560 | loss 3.254715 (-0.81z)| norm 0.2267 (-0.49z)| lr 3.42e-06 | 4759.97 ms | 28.4% bf16 MFU | 122888 tok/s step 18654/19560 | loss 3.241546 (-1.12z)| norm 0.2273 (-0.44z)| lr 3.42e-06 | 4994.29 ms | 27.0% bf16 MFU | 121992 tok/s step 18655/19560 | loss 3.314061 (+0.67z)| norm 0.2236 (-0.73z)| lr 3.41e-06 | 4248.87 ms | 31.8% bf16 MFU | 122062 tok/s step 18656/19560 | loss 3.293446 (+0.15z)| norm 0.2281 (-0.37z)| lr 3.40e-06 | 4150.58 ms | 32.5% bf16 MFU | 122275 tok/s step 18657/19560 | loss 3.306831 (+0.49z)| norm 0.2275 (-0.41z)| lr 3.39e-06 | 4161.80 ms | 32.4% bf16 MFU | 122460 tok/s step 18658/19560 | loss 3.284581 (-0.05z)| norm 0.2220 (-0.85z)| lr 3.39e-06 | 4194.03 ms | 32.2% bf16 MFU | 122588 tok/s step 18659/19560 | loss 3.289382 (+0.09z)| norm 0.2201 (-1.01z)| 
lr 3.38e-06 | 4249.01 ms | 31.8% bf16 MFU | 122628 tok/s step 18660/19560 | loss 3.270793 (-0.38z)| norm 0.2296 (-0.19z)| lr 3.37e-06 | 4596.66 ms | 29.4% bf16 MFU | 122199 tok/s step 18661/19560 | loss 3.314596 (+0.79z)| norm 0.2219 (-0.87z)| lr 3.36e-06 | 4189.85 ms | 32.2% bf16 MFU | 122346 tok/s step 18662/19560 | loss 3.307228 (+0.58z)| norm 0.2296 (-0.17z)| lr 3.36e-06 | 5661.02 ms | 23.9% bf16 MFU | 120859 tok/s step 18663/19560 | loss 3.323486 (+1.02z)| norm 0.2378 (+0.58z)| lr 3.35e-06 | 4153.23 ms | 32.5% bf16 MFU | 121128 tok/s step 18664/19560 | loss 3.315697 (+0.79z)| norm 0.2249 (-0.59z)| lr 3.34e-06 | 4171.63 ms | 32.4% bf16 MFU | 121356 tok/s step 18665/19560 | loss 3.201168 (-2.32z)| norm 0.2306 (-0.07z)| lr 3.34e-06 | 6785.69 ms | 19.9% bf16 MFU | 119151 tok/s step 18666/19560 | loss 3.288138 (+0.07z)| norm 0.2334 (+0.31z)| lr 3.33e-06 | 4190.98 ms | 32.2% bf16 MFU | 119449 tok/s step 18667/19560 | loss 3.238224 (-1.29z)| norm 0.2320 (+0.15z)| lr 3.32e-06 | 4131.92 ms | 32.7% bf16 MFU | 119820 tok/s step 18668/19560 | loss 3.258828 (-0.71z)| norm 0.2199 (-1.26z)| lr 3.31e-06 | 4208.29 ms | 32.1% bf16 MFU | 120059 tok/s step 18669/19560 | loss 3.259528 (-0.71z)| norm 0.2234 (-0.84z)| lr 3.31e-06 | 4102.11 ms | 32.9% bf16 MFU | 120446 tok/s step 18670/19560 | loss 3.290057 (+0.12z)| norm 0.2712 (+4.38z)| lr 3.30e-06 | 4098.96 ms | 32.9% bf16 MFU | 120819 tok/s step 18671/19560 | loss 3.276724 (-0.24z)| norm 0.2279 (-0.31z)| lr 3.29e-06 | 4135.46 ms | 32.6% bf16 MFU | 121117 tok/s step 18672/19560 | loss 3.280065 (-0.15z)| norm 0.2265 (-0.45z)| lr 3.28e-06 | 4134.05 ms | 32.7% bf16 MFU | 121402 tok/s step 18673/19560 | loss 3.296430 (+0.30z)| norm 0.2322 (+0.19z)| lr 3.28e-06 | 7726.17 ms | 17.5% bf16 MFU | 118725 tok/s step 18674/19560 | loss 3.240117 (-1.28z)| norm 0.2477 (+1.85z)| lr 3.27e-06 | 4116.13 ms | 32.8% bf16 MFU | 119158 tok/s step 18675/19560 | loss 3.231265 (-1.50z)| norm 0.2203 (-1.11z)| lr 3.26e-06 | 17730.31 ms | 7.6% bf16 MFU | 
114678 tok/s step 18676/19560 | loss 3.296642 (+0.38z)| norm 0.2365 (+0.75z)| lr 3.25e-06 | 12499.74 ms | 10.8% bf16 MFU | 111042 tok/s step 18677/19560 | loss 3.223579 (-1.72z)| norm 0.2226 (-0.92z)| lr 3.25e-06 | 29849.76 ms | 4.5% bf16 MFU | 106368 tok/s step 18678/19560 | loss 3.250495 (-0.93z)| norm 0.2294 (-0.10z)| lr 3.24e-06 | 7524.21 ms | 17.9% bf16 MFU | 104533 tok/s step 18679/19560 | loss 3.264661 (-0.51z)| norm 0.2238 (-0.77z)| lr 3.23e-06 | 26087.62 ms | 5.2% bf16 MFU | 100312 tok/s step 18680/19560 | loss 3.336679 (+1.56z)| norm 0.2214 (-1.05z)| lr 3.22e-06 | 4265.78 ms | 31.7% bf16 MFU | 101441 tok/s step 18681/19560 | loss 3.279955 (-0.09z)| norm 0.2283 (-0.22z)| lr 3.22e-06 | 8585.08 ms | 15.7% bf16 MFU | 99423 tok/s step 18682/19560 | loss 3.258301 (-0.72z)| norm 0.2191 (-1.31z)| lr 3.21e-06 | 4132.09 ms | 32.7% bf16 MFU | 100796 tok/s step 18683/19560 | loss 3.238924 (-1.28z)| norm 0.2162 (-1.63z)| lr 3.20e-06 | 4082.57 ms | 33.1% bf16 MFU | 102177 tok/s step 18684/19560 | loss 3.291981 (+0.26z)| norm 0.2369 (+0.82z)| lr 3.20e-06 | 4079.48 ms | 33.1% bf16 MFU | 103494 tok/s step 18685/19560 | loss 3.222443 (-1.73z)| norm 0.2229 (-0.84z)| lr 3.19e-06 | 4093.05 ms | 33.0% bf16 MFU | 104724 tok/s step 18686/19560 | loss 3.219223 (-1.78z)| norm 0.2230 (-0.81z)| lr 3.18e-06 | 4164.68 ms | 32.4% bf16 MFU | 105782 tok/s step 18687/19560 | loss 3.306050 (+0.68z)| norm 0.2296 (+0.01z)| lr 3.17e-06 | 4080.88 ms | 33.1% bf16 MFU | 106917 tok/s step 18688/19560 | loss 3.295280 (+0.38z)| norm 0.2233 (-0.76z)| lr 3.17e-06 | 4111.94 ms | 32.8% bf16 MFU | 107946 tok/s step 18689/19560 | loss 3.250864 (-0.88z)| norm 0.2250 (-0.55z)| lr 3.16e-06 | 4081.15 ms | 33.1% bf16 MFU | 108972 tok/s step 18690/19560 | loss 3.345823 (+1.79z)| norm 0.2311 (+0.20z)| lr 3.15e-06 | 4152.91 ms | 32.5% bf16 MFU | 109836 tok/s step 18691/19560 | loss 3.299541 (+0.50z)| norm 0.2347 (+0.66z)| lr 3.14e-06 | 4083.97 ms | 33.1% bf16 MFU | 110763 tok/s step 18692/19560 | loss 3.244667 
(-1.04z)| norm 0.2261 (-0.40z)| lr 3.14e-06 | 4091.37 ms | 33.0% bf16 MFU | 111632 tok/s step 18693/19560 | loss 3.250143 (-0.88z)| norm 0.2252 (-0.51z)| lr 3.13e-06 | 4152.97 ms | 32.5% bf16 MFU | 112363 tok/s step 18694/19560 | loss 3.272404 (-0.24z)| norm 0.2353 (+0.74z)| lr 3.12e-06 | 4097.14 ms | 33.0% bf16 MFU | 113143 tok/s step 18695/19560 | loss 3.305073 (+0.72z)| norm 0.2374 (+1.00z)| lr 3.12e-06 | 4102.35 ms | 32.9% bf16 MFU | 113876 tok/s step 18696/19560 | loss 3.351978 (+2.02z)| norm 0.2538 (+2.91z)| lr 3.11e-06 | 4095.61 ms | 33.0% bf16 MFU | 114582 tok/s step 18697/19560 | loss 3.198457 (-2.31z)| norm 0.2279 (-0.20z)| lr 3.10e-06 | 4112.82 ms | 32.8% bf16 MFU | 115227 tok/s step 18698/19560 | loss 3.312971 (+0.91z)| norm 0.2516 (+2.56z)| lr 3.09e-06 | 4094.77 ms | 33.0% bf16 MFU | 115868 tok/s step 18699/19560 | loss 3.286813 (+0.17z)| norm 0.2399 (+1.17z)| lr 3.09e-06 | 4113.13 ms | 32.8% bf16 MFU | 116448 tok/s step 18700/19560 | loss 3.328046 (+1.35z)| norm 0.2330 (+0.37z)| lr 3.08e-06 | 4243.62 ms | 31.8% bf16 MFU | 116803 tok/s step 18701/19560 | loss 3.237751 (-1.21z)| norm 0.2297 (-0.02z)| lr 3.07e-06 | 4128.12 ms | 32.7% bf16 MFU | 117313 tok/s step 18702/19560 | loss 3.282826 (+0.08z)| norm 0.2257 (-0.48z)| lr 3.07e-06 | 4125.36 ms | 32.7% bf16 MFU | 117802 tok/s step 18703/19560 | loss 3.274555 (-0.14z)| norm 0.2277 (-0.25z)| lr 3.06e-06 | 4131.41 ms | 32.7% bf16 MFU | 118257 tok/s step 18704/19560 | loss 3.276482 (-0.09z)| norm 0.2246 (-0.60z)| lr 3.05e-06 | 4129.30 ms | 32.7% bf16 MFU | 118692 tok/s step 18705/19560 | loss 3.297806 (+0.52z)| norm 0.2287 (-0.12z)| lr 3.04e-06 | 4126.29 ms | 32.7% bf16 MFU | 119111 tok/s step 18706/19560 | loss 3.251611 (-0.80z)| norm 0.2223 (-0.88z)| lr 3.04e-06 | 4123.08 ms | 32.7% bf16 MFU | 119513 tok/s step 18707/19560 | loss 3.286202 (+0.20z)| norm 0.2358 (+0.74z)| lr 3.03e-06 | 4127.94 ms | 32.7% bf16 MFU | 119888 tok/s step 18708/19560 | loss 3.230083 (-1.42z)| norm 0.2234 (-0.74z)| lr 3.02e-06 | 
4141.66 ms | 32.6% bf16 MFU | 120223 tok/s step 18709/19560 | loss 3.244672 (-0.98z)| norm 0.2279 (-0.20z)| lr 3.02e-06 | 4119.30 ms | 32.8% bf16 MFU | 120576 tok/s step 18710/19560 | loss 3.284816 (+0.18z)| norm 0.2246 (-0.60z)| lr 3.01e-06 | 4147.27 ms | 32.6% bf16 MFU | 120868 tok/s step 18711/19560 | loss 3.233979 (-1.27z)| norm 0.2250 (-0.55z)| lr 3.00e-06 | 4143.66 ms | 32.6% bf16 MFU | 121151 tok/s step 18712/19560 | loss 3.256546 (-0.61z)| norm 0.2324 (+0.33z)| lr 3.00e-06 | 4128.19 ms | 32.7% bf16 MFU | 121443 tok/s step 18713/19560 | loss 3.262418 (-0.44z)| norm 0.2283 (-0.17z)| lr 2.99e-06 | 4154.84 ms | 32.5% bf16 MFU | 121680 tok/s step 18714/19560 | loss 3.257271 (-0.58z)| norm 0.2211 (-1.02z)| lr 2.98e-06 | 4503.70 ms | 30.0% bf16 MFU | 121417 tok/s step 18715/19560 | loss 3.234225 (-1.25z)| norm 0.2305 (+0.10z)| lr 2.97e-06 | 4134.93 ms | 32.7% bf16 MFU | 121686 tok/s step 18716/19560 | loss 3.252260 (-0.71z)| norm 0.2305 (+0.11z)| lr 2.97e-06 | 4136.27 ms | 32.6% bf16 MFU | 121939 tok/s step 18717/19560 | loss 3.309359 (+0.96z)| norm 0.2294 (-0.04z)| lr 2.96e-06 | 4137.79 ms | 32.6% bf16 MFU | 122178 tok/s step 18718/19560 | loss 3.295567 (+0.56z)| norm 0.2303 (+0.11z)| lr 2.95e-06 | 4159.86 ms | 32.5% bf16 MFU | 122371 tok/s step 18719/19560 | loss 3.295786 (+0.57z)| norm 0.2243 (-0.66z)| lr 2.95e-06 | 4231.43 ms | 31.9% bf16 MFU | 122447 tok/s step 18720/19560 | loss 3.329841 (+1.55z)| norm 0.2408 (+1.43z)| lr 2.94e-06 | 4149.85 ms | 32.5% bf16 MFU | 122642 tok/s step 18721/19560 | loss 3.305662 (+0.83z)| norm 0.2275 (-0.27z)| lr 2.93e-06 | 4152.94 ms | 32.5% bf16 MFU | 122822 tok/s step 18722/19560 | loss 3.313277 (+1.04z)| norm 0.2255 (-0.52z)| lr 2.92e-06 | 4160.50 ms | 32.5% bf16 MFU | 122982 tok/s step 18723/19560 | loss 3.317929 (+1.16z)| norm 0.2266 (-0.38z)| lr 2.92e-06 | 4149.70 ms | 32.5% bf16 MFU | 123150 tok/s step 18724/19560 | loss 3.234978 (-1.23z)| norm 0.2293 (-0.03z)| lr 2.91e-06 | 4158.20 ms | 32.5% bf16 MFU | 123297 tok/s step 
18725/19560 | loss 3.285540 (+0.23z)| norm 0.2200 (-1.22z)| lr 2.90e-06 | 4144.77 ms | 32.6% bf16 MFU | 123456 tok/s step 18726/19560 | loss 3.260319 (-0.49z)| norm 0.2243 (-0.65z)| lr 2.90e-06 | 4160.23 ms | 32.5% bf16 MFU | 123585 tok/s step 18727/19560 | loss 3.275561 (-0.04z)| norm 0.2214 (-1.01z)| lr 2.89e-06 | 4152.55 ms | 32.5% bf16 MFU | 123718 tok/s step 18728/19560 | loss 3.365908 (+2.50z)| norm 0.2419 (+1.60z)| lr 2.88e-06 | 4141.26 ms | 32.6% bf16 MFU | 123862 tok/s step 18729/19560 | loss 3.267936 (-0.28z)| norm 0.2266 (-0.34z)| lr 2.88e-06 | 4145.85 ms | 32.6% bf16 MFU | 123992 tok/s step 18730/19560 | loss 3.292699 (+0.43z)| norm 0.2217 (-0.98z)| lr 2.87e-06 | 4149.44 ms | 32.5% bf16 MFU | 124110 tok/s step 18731/19560 | loss 3.341665 (+1.79z)| norm 0.2271 (-0.27z)| lr 2.86e-06 | 4155.64 ms | 32.5% bf16 MFU | 124213 tok/s step 18732/19560 | loss 3.272475 (-0.15z)| norm 0.2242 (-0.64z)| lr 2.86e-06 | 4444.82 ms | 30.4% bf16 MFU | 123900 tok/s step 18733/19560 | loss 3.270600 (-0.22z)| norm 0.2220 (-0.91z)| lr 2.85e-06 | 4296.26 ms | 31.4% bf16 MFU | 123807 tok/s step 18734/19560 | loss 3.305139 (+0.78z)| norm 0.2343 (+0.68z)| lr 2.84e-06 | 4233.98 ms | 31.9% bf16 MFU | 123808 tok/s step 18735/19560 | loss 3.230896 (-1.34z)| norm 0.2236 (-0.71z)| lr 2.84e-06 | 4528.26 ms | 29.8% bf16 MFU | 123407 tok/s step 18736/19560 | loss 3.292883 (+0.45z)| norm 0.2265 (-0.33z)| lr 2.83e-06 | 4170.59 ms | 32.4% bf16 MFU | 123522 tok/s step 18737/19560 | loss 3.273063 (-0.12z)| norm 0.2317 (+0.35z)| lr 2.82e-06 | 4159.29 ms | 32.5% bf16 MFU | 123648 tok/s step 18738/19560 | loss 3.194973 (-2.31z)| norm 0.2332 (+0.56z)| lr 2.81e-06 | 4150.70 ms | 32.5% bf16 MFU | 123782 tok/s step 18739/19560 | loss 3.313969 (+1.06z)| norm 0.2255 (-0.45z)| lr 2.81e-06 | 4158.66 ms | 32.5% bf16 MFU | 123896 tok/s step 18740/19560 | loss 3.259956 (-0.46z)| norm 0.2162 (-1.65z)| lr 2.80e-06 | 4156.20 ms | 32.5% bf16 MFU | 124008 tok/s step 18741/19560 | loss 3.212433 (-1.78z)| norm 
0.2414 (+1.62z)| lr 2.79e-06 | 4161.12 ms | 32.4% bf16 MFU | 124108 tok/s step 18742/19560 | loss 3.265231 (-0.30z)| norm 0.2544 (+3.17z)| lr 2.79e-06 | 4169.96 ms | 32.4% bf16 MFU | 124189 tok/s step 18743/19560 | loss 3.311186 (+0.98z)| norm 0.2210 (-1.00z)| lr 2.78e-06 | 4144.58 ms | 32.6% bf16 MFU | 124305 tok/s step 18744/19560 | loss 3.294607 (+0.51z)| norm 0.2369 (+0.97z)| lr 2.77e-06 | 4164.56 ms | 32.4% bf16 MFU | 124384 tok/s step 18745/19560 | loss 3.268124 (-0.23z)| norm 0.2232 (-0.75z)| lr 2.77e-06 | 4182.39 ms | 32.3% bf16 MFU | 124433 tok/s step 18746/19560 | loss 3.271404 (-0.15z)| norm 0.2299 (+0.08z)| lr 2.76e-06 | 4177.62 ms | 32.3% bf16 MFU | 124486 tok/s step 18747/19560 | loss 3.272477 (-0.11z)| norm 0.2255 (-0.48z)| lr 2.75e-06 | 4166.25 ms | 32.4% bf16 MFU | 124554 tok/s step 18748/19560 | loss 3.195590 (-2.23z)| norm 0.2198 (-1.17z)| lr 2.75e-06 | 4148.19 ms | 32.5% bf16 MFU | 124645 tok/s step 18749/19560 | loss 3.298426 (+0.64z)| norm 0.2285 (-0.09z)| lr 2.74e-06 | 4149.29 ms | 32.5% bf16 MFU | 124731 tok/s step 18750/19560 | loss 3.270718 (-0.13z)| norm 0.2269 (-0.29z)| lr 2.73e-06 | 4148.53 ms | 32.5% bf16 MFU | 124813 tok/s val loss 3.265952 evaluating HellaSwag: 0/1256 evaluating HellaSwag: 10/1256 evaluating HellaSwag: 20/1256 evaluating HellaSwag: 30/1256 evaluating HellaSwag: 40/1256 evaluating HellaSwag: 50/1256 evaluating HellaSwag: 60/1256 evaluating HellaSwag: 70/1256 evaluating HellaSwag: 80/1256 evaluating HellaSwag: 90/1256 evaluating HellaSwag: 100/1256 evaluating HellaSwag: 110/1256 evaluating HellaSwag: 120/1256 evaluating HellaSwag: 130/1256 evaluating HellaSwag: 140/1256 evaluating HellaSwag: 150/1256 evaluating HellaSwag: 160/1256 evaluating HellaSwag: 170/1256 evaluating HellaSwag: 180/1256 evaluating HellaSwag: 190/1256 evaluating HellaSwag: 200/1256 evaluating HellaSwag: 210/1256 evaluating HellaSwag: 220/1256 evaluating HellaSwag: 230/1256 evaluating HellaSwag: 240/1256 evaluating HellaSwag: 250/1256 evaluating 
HellaSwag: 260/1256 evaluating HellaSwag: 270/1256 evaluating HellaSwag: 280/1256 evaluating HellaSwag: 290/1256 evaluating HellaSwag: 300/1256 evaluating HellaSwag: 310/1256 evaluating HellaSwag: 320/1256 evaluating HellaSwag: 330/1256 evaluating HellaSwag: 340/1256 evaluating HellaSwag: 350/1256 evaluating HellaSwag: 360/1256 evaluating HellaSwag: 370/1256 evaluating HellaSwag: 380/1256 evaluating HellaSwag: 390/1256 evaluating HellaSwag: 400/1256 evaluating HellaSwag: 410/1256 evaluating HellaSwag: 420/1256 evaluating HellaSwag: 430/1256 evaluating HellaSwag: 440/1256 evaluating HellaSwag: 450/1256 evaluating HellaSwag: 460/1256 evaluating HellaSwag: 470/1256 evaluating HellaSwag: 480/1256 evaluating HellaSwag: 490/1256 evaluating HellaSwag: 500/1256 evaluating HellaSwag: 510/1256 evaluating HellaSwag: 520/1256 evaluating HellaSwag: 530/1256 evaluating HellaSwag: 540/1256 evaluating HellaSwag: 550/1256 evaluating HellaSwag: 560/1256 evaluating HellaSwag: 570/1256 evaluating HellaSwag: 580/1256 evaluating HellaSwag: 590/1256 evaluating HellaSwag: 600/1256 evaluating HellaSwag: 610/1256 evaluating HellaSwag: 620/1256 evaluating HellaSwag: 630/1256 evaluating HellaSwag: 640/1256 evaluating HellaSwag: 650/1256 evaluating HellaSwag: 660/1256 evaluating HellaSwag: 670/1256 evaluating HellaSwag: 680/1256 evaluating HellaSwag: 690/1256 evaluating HellaSwag: 700/1256 evaluating HellaSwag: 710/1256 evaluating HellaSwag: 720/1256 evaluating HellaSwag: 730/1256 evaluating HellaSwag: 740/1256 evaluating HellaSwag: 750/1256 evaluating HellaSwag: 760/1256 evaluating HellaSwag: 770/1256 evaluating HellaSwag: 780/1256 evaluating HellaSwag: 790/1256 evaluating HellaSwag: 800/1256 evaluating HellaSwag: 810/1256 evaluating HellaSwag: 820/1256 evaluating HellaSwag: 830/1256 evaluating HellaSwag: 840/1256 evaluating HellaSwag: 850/1256 evaluating HellaSwag: 860/1256 evaluating HellaSwag: 870/1256 evaluating HellaSwag: 880/1256 evaluating HellaSwag: 890/1256 evaluating HellaSwag: 
900/1256 evaluating HellaSwag: 910/1256 evaluating HellaSwag: 920/1256 evaluating HellaSwag: 930/1256 evaluating HellaSwag: 940/1256 evaluating HellaSwag: 950/1256 evaluating HellaSwag: 960/1256 evaluating HellaSwag: 970/1256 evaluating HellaSwag: 980/1256 evaluating HellaSwag: 990/1256 evaluating HellaSwag: 1000/1256 evaluating HellaSwag: 1010/1256 evaluating HellaSwag: 1020/1256 evaluating HellaSwag: 1030/1256 evaluating HellaSwag: 1040/1256 evaluating HellaSwag: 1050/1256 evaluating HellaSwag: 1060/1256 evaluating HellaSwag: 1070/1256 evaluating HellaSwag: 1080/1256 evaluating HellaSwag: 1090/1256 evaluating HellaSwag: 1100/1256 evaluating HellaSwag: 1110/1256 evaluating HellaSwag: 1120/1256 evaluating HellaSwag: 1130/1256 evaluating HellaSwag: 1140/1256 evaluating HellaSwag: 1150/1256 evaluating HellaSwag: 1160/1256 evaluating HellaSwag: 1170/1256 evaluating HellaSwag: 1180/1256 evaluating HellaSwag: 1190/1256 evaluating HellaSwag: 1200/1256 evaluating HellaSwag: 1210/1256 evaluating HellaSwag: 1220/1256 evaluating HellaSwag: 1230/1256 evaluating HellaSwag: 1240/1256 evaluating HellaSwag: 1250/1256 HellaSwag: 3025/10042 = 0.301235 step 18751/19560 | loss 3.292896 (+0.49z)| norm 0.2320 (+0.36z)| lr 2.73e-06 | 5051.47 ms | 26.7% bf16 MFU | 123762 tok/s step 18752/19560 | loss 3.298820 (+0.65z)| norm 0.2344 (+0.64z)| lr 2.72e-06 | 4174.80 ms | 32.3% bf16 MFU | 123853 tok/s step 18753/19560 | loss 3.253885 (-0.60z)| norm 0.2293 (+0.00z)| lr 2.71e-06 | 4161.98 ms | 32.4% bf16 MFU | 123959 tok/s step 18754/19560 | loss 3.297867 (+0.62z)| norm 0.2274 (-0.22z)| lr 2.71e-06 | 4224.98 ms | 32.0% bf16 MFU | 123966 tok/s step 18755/19560 | loss 3.318356 (+1.24z)| norm 0.2223 (-0.86z)| lr 2.70e-06 | 4163.17 ms | 32.4% bf16 MFU | 124064 tok/s step 18756/19560 | loss 3.283367 (+0.24z)| norm 0.2250 (-0.52z)| lr 2.69e-06 | 4186.49 ms | 32.3% bf16 MFU | 124123 tok/s step 18757/19560 | loss 3.230975 (-1.24z)| norm 0.2481 (+2.33z)| lr 2.69e-06 | 4178.49 ms | 32.3% bf16 MFU | 
124190 tok/s step 18758/19560 | loss 3.329405 (+1.55z)| norm 0.2503 (+2.52z)| lr 2.68e-06 | 4164.48 ms | 32.4% bf16 MFU | 124275 tok/s step 18759/19560 | loss 3.338386 (+1.78z)| norm 0.2566 (+3.13z)| lr 2.67e-06 | 4162.29 ms | 32.4% bf16 MFU | 124360 tok/s step 18760/19560 | loss 3.237275 (-1.04z)| norm 0.2372 (+0.88z)| lr 2.67e-06 | 4171.25 ms | 32.4% bf16 MFU | 124426 tok/s step 18761/19560 | loss 3.214449 (-1.66z)| norm 0.2293 (-0.03z)| lr 2.66e-06 | 4157.00 ms | 32.5% bf16 MFU | 124511 tok/s step 18762/19560 | loss 3.246798 (-0.76z)| norm 0.2242 (-0.61z)| lr 2.65e-06 | 4173.92 ms | 32.3% bf16 MFU | 124566 tok/s step 18763/19560 | loss 3.288319 (+0.38z)| norm 0.2252 (-0.48z)| lr 2.65e-06 | 4161.00 ms | 32.4% bf16 MFU | 124638 tok/s step 18764/19560 | loss 3.226903 (-1.31z)| norm 0.2177 (-1.33z)| lr 2.64e-06 | 4174.37 ms | 32.3% bf16 MFU | 124686 tok/s step 18765/19560 | loss 3.234070 (-1.11z)| norm 0.2239 (-0.62z)| lr 2.63e-06 | 4155.05 ms | 32.5% bf16 MFU | 124760 tok/s step 18766/19560 | loss 3.294115 (+0.55z)| norm 0.2270 (-0.26z)| lr 2.63e-06 | 6076.42 ms | 22.2% bf16 MFU | 122837 tok/s step 18767/19560 | loss 3.279570 (+0.13z)| norm 0.2239 (-0.61z)| lr 2.62e-06 | 4151.62 ms | 32.5% bf16 MFU | 123009 tok/s step 18768/19560 | loss 3.260507 (-0.42z)| norm 0.2417 (+1.42z)| lr 2.61e-06 | 4174.27 ms | 32.3% bf16 MFU | 123139 tok/s step 18769/19560 | loss 3.339100 (+1.78z)| norm 0.2518 (+2.51z)| lr 2.61e-06 | 4153.08 ms | 32.5% bf16 MFU | 123294 tok/s step 18770/19560 | loss 3.292526 (+0.47z)| norm 0.2268 (-0.31z)| lr 2.60e-06 | 4155.03 ms | 32.5% bf16 MFU | 123438 tok/s step 18771/19560 | loss 3.267416 (-0.23z)| norm 0.2230 (-0.73z)| lr 2.59e-06 | 4160.09 ms | 32.5% bf16 MFU | 123568 tok/s step 18772/19560 | loss 3.310552 (+0.99z)| norm 0.2232 (-0.69z)| lr 2.59e-06 | 4172.89 ms | 32.4% bf16 MFU | 123671 tok/s step 18773/19560 | loss 3.281752 (+0.17z)| norm 0.2268 (-0.28z)| lr 2.58e-06 | 4166.55 ms | 32.4% bf16 MFU | 123779 tok/s step 18774/19560 | loss 3.363976 
(+2.43z)| norm 0.2415 (+1.36z)| lr 2.57e-06 | 4965.89 ms | 27.2% bf16 MFU | 122869 tok/s step 18775/19560 | loss 3.314846 (+1.05z)| norm 0.2287 (-0.09z)| lr 2.57e-06 | 4165.53 ms | 32.4% bf16 MFU | 123019 tok/s step 18776/19560 | loss 3.185193 (-2.49z)| norm 0.2333 (+0.42z)| lr 2.56e-06 | 4404.75 ms | 30.7% bf16 MFU | 122819 tok/s step 18777/19560 | loss 3.336096 (+1.59z)| norm 0.2307 (+0.12z)| lr 2.55e-06 | 4160.32 ms | 32.5% bf16 MFU | 122979 tok/s step 18778/19560 | loss 3.262896 (-0.38z)| norm 0.2280 (-0.19z)| lr 2.55e-06 | 4384.47 ms | 30.8% bf16 MFU | 122809 tok/s step 18779/19560 | loss 3.257564 (-0.52z)| norm 0.2270 (-0.28z)| lr 2.54e-06 | 4164.70 ms | 32.4% bf16 MFU | 122963 tok/s step 18780/19560 | loss 3.242909 (-0.92z)| norm 0.2224 (-0.81z)| lr 2.54e-06 | 4154.83 ms | 32.5% bf16 MFU | 123125 tok/s step 18781/19560 | loss 3.317269 (+1.08z)| norm 0.2234 (-0.69z)| lr 2.53e-06 | 4160.03 ms | 32.5% bf16 MFU | 123270 tok/s step 18782/19560 | loss 3.231712 (-1.22z)| norm 0.2210 (-0.95z)| lr 2.52e-06 | 4171.50 ms | 32.4% bf16 MFU | 123390 tok/s step 18783/19560 | loss 3.293252 (+0.44z)| norm 0.2219 (-0.85z)| lr 2.52e-06 | 4155.56 ms | 32.5% bf16 MFU | 123529 tok/s step 18784/19560 | loss 3.304043 (+0.73z)| norm 0.2196 (-1.10z)| lr 2.51e-06 | 4239.33 ms | 31.8% bf16 MFU | 123536 tok/s step 18785/19560 | loss 3.306190 (+0.79z)| norm 0.2167 (-1.40z)| lr 2.50e-06 | 4168.43 ms | 32.4% bf16 MFU | 123648 tok/s step 18786/19560 | loss 3.240667 (-0.97z)| norm 0.2235 (-0.64z)| lr 2.50e-06 | 4166.94 ms | 32.4% bf16 MFU | 123757 tok/s step 18787/19560 | loss 3.292482 (+0.42z)| norm 0.2255 (-0.42z)| lr 2.49e-06 | 4160.00 ms | 32.5% bf16 MFU | 123871 tok/s step 18788/19560 | loss 3.255868 (-0.56z)| norm 0.2177 (-1.29z)| lr 2.48e-06 | 4162.98 ms | 32.4% bf16 MFU | 123974 tok/s step 18789/19560 | loss 3.285716 (+0.25z)| norm 0.2263 (-0.33z)| lr 2.48e-06 | 4163.56 ms | 32.4% bf16 MFU | 124072 tok/s step 18790/19560 | loss 3.278825 (+0.07z)| norm 0.2157 (-1.49z)| lr 2.47e-06 | 
4276.96 ms | 31.6% bf16 MFU | 123997 tok/s step 18791/19560 | loss 3.326870 (+1.37z)| norm 0.2192 (-1.09z)| lr 2.46e-06 | 4166.50 ms | 32.4% bf16 MFU | 124089 tok/s step 18792/19560 | loss 3.357818 (+2.17z)| norm 0.2355 (+0.71z)| lr 2.46e-06 | 4168.67 ms | 32.4% bf16 MFU | 124173 tok/s step 18793/19560 | loss 3.246346 (-0.82z)| norm 0.2256 (-0.38z)| lr 2.45e-06 | 4426.14 ms | 30.5% bf16 MFU | 123887 tok/s step 18794/19560 | loss 3.309715 (+0.88z)| norm 0.2314 (+0.26z)| lr 2.45e-06 | 5445.47 ms | 24.8% bf16 MFU | 122507 tok/s step 18795/19560 | loss 3.369570 (+2.42z)| norm 0.2431 (+1.53z)| lr 2.44e-06 | 4421.97 ms | 30.5% bf16 MFU | 122310 tok/s step 18796/19560 | loss 3.336222 (+1.51z)| norm 0.2190 (-1.11z)| lr 2.43e-06 | 4577.05 ms | 29.5% bf16 MFU | 121921 tok/s step 18797/19560 | loss 3.304995 (+0.68z)| norm 0.2198 (-1.02z)| lr 2.43e-06 | 4882.40 ms | 27.7% bf16 MFU | 121195 tok/s step 18798/19560 | loss 3.290474 (+0.30z)| norm 0.2240 (-0.56z)| lr 2.42e-06 | 4330.26 ms | 31.2% bf16 MFU | 121189 tok/s step 18799/19560 | loss 3.302136 (+0.60z)| norm 0.2231 (-0.66z)| lr 2.41e-06 | 5012.65 ms | 26.9% bf16 MFU | 120359 tok/s step 18800/19560 | loss 3.278430 (-0.02z)| norm 0.2225 (-0.73z)| lr 2.41e-06 | 4301.17 ms | 31.4% bf16 MFU | 120436 tok/s step 18801/19560 | loss 3.281827 (+0.07z)| norm 0.2191 (-1.12z)| lr 2.40e-06 | 4498.42 ms | 30.0% bf16 MFU | 120241 tok/s step 18802/19560 | loss 3.242415 (-0.96z)| norm 0.2538 (+2.94z)| lr 2.39e-06 | 4659.59 ms | 29.0% bf16 MFU | 119855 tok/s step 18803/19560 | loss 3.277869 (-0.04z)| norm 0.2322 (+0.41z)| lr 2.39e-06 | 4176.17 ms | 32.3% bf16 MFU | 120139 tok/s step 18804/19560 | loss 3.315021 (+0.93z)| norm 0.2278 (-0.10z)| lr 2.38e-06 | 4244.27 ms | 31.8% bf16 MFU | 120309 tok/s step 18805/19560 | loss 3.299341 (+0.51z)| norm 0.2293 (+0.07z)| lr 2.38e-06 | 4172.61 ms | 32.4% bf16 MFU | 120576 tok/s step 18806/19560 | loss 3.232482 (-1.25z)| norm 0.2202 (-0.98z)| lr 2.37e-06 | 4216.10 ms | 32.0% bf16 MFU | 120765 tok/s step 
18807/19560 | loss 3.352313 (+1.86z)| norm 0.2280 (-0.08z)| lr 2.36e-06 | 4340.27 ms | 31.1% bf16 MFU | 120766 tok/s step 18808/19560 | loss 3.322857 (+1.10z)| norm 0.2313 (+0.30z)| lr 2.36e-06 | 4165.82 ms | 32.4% bf16 MFU | 121021 tok/s step 18809/19560 | loss 3.238316 (-1.09z)| norm 0.2278 (-0.11z)| lr 2.35e-06 | 4222.77 ms | 32.0% bf16 MFU | 121178 tok/s step 18810/19560 | loss 3.318077 (+0.97z)| norm 0.2325 (+0.43z)| lr 2.34e-06 | 4198.10 ms | 32.2% bf16 MFU | 121363 tok/s step 18811/19560 | loss 3.334182 (+1.36z)| norm 0.2193 (-1.13z)| lr 2.34e-06 | 4171.97 ms | 32.4% bf16 MFU | 121578 tok/s step 18812/19560 | loss 3.246290 (-0.90z)| norm 0.2245 (-0.51z)| lr 2.33e-06 | 4222.96 ms | 32.0% bf16 MFU | 121707 tok/s step 18813/19560 | loss 3.342449 (+1.55z)| norm 0.2428 (+1.64z)| lr 2.33e-06 | 4168.40 ms | 32.4% bf16 MFU | 121911 tok/s step 18814/19560 | loss 3.275064 (-0.19z)| norm 0.2304 (+0.17z)| lr 2.32e-06 | 4198.38 ms | 32.2% bf16 MFU | 122059 tok/s step 18815/19560 | loss 3.305188 (+0.59z)| norm 0.2342 (+0.61z)| lr 2.31e-06 | 4159.08 ms | 32.5% bf16 MFU | 122259 tok/s step 18816/19560 | loss 3.260844 (-0.56z)| norm 0.2252 (-0.45z)| lr 2.31e-06 | 4197.12 ms | 32.2% bf16 MFU | 122392 tok/s step 18817/19560 | loss 3.301721 (+0.50z)| norm 0.2383 (+1.08z)| lr 2.30e-06 | 4246.47 ms | 31.8% bf16 MFU | 122445 tok/s step 18818/19560 | loss 3.258505 (-0.62z)| norm 0.2440 (+1.72z)| lr 2.29e-06 | 4193.02 ms | 32.2% bf16 MFU | 122575 tok/s step 18819/19560 | loss 3.277210 (-0.12z)| norm 0.2639 (+3.78z)| lr 2.29e-06 | 4172.46 ms | 32.4% bf16 MFU | 122729 tok/s step 18820/19560 | loss 3.320824 (+1.01z)| norm 0.2197 (-1.05z)| lr 2.28e-06 | 4170.23 ms | 32.4% bf16 MFU | 122879 tok/s step 18821/19560 | loss 3.257468 (-0.66z)| norm 0.2272 (-0.24z)| lr 2.28e-06 | 4282.53 ms | 31.5% bf16 MFU | 122856 tok/s step 18822/19560 | loss 3.314634 (+0.84z)| norm 0.2250 (-0.48z)| lr 2.27e-06 | 4178.61 ms | 32.3% bf16 MFU | 122987 tok/s step 18823/19560 | loss 3.332693 (+1.30z)| norm 
0.2307 (+0.15z)| lr 2.26e-06 | 4173.72 ms | 32.3% bf16 MFU | 123118 tok/s step 18824/19560 | loss 3.300630 (+0.48z)| norm 0.2267 (-0.27z)| lr 2.26e-06 | 4301.55 ms | 31.4% bf16 MFU | 123056 tok/s step 18825/19560 | loss 3.289963 (+0.18z)| norm 0.2310 (+0.21z)| lr 2.25e-06 | 4345.71 ms | 31.1% bf16 MFU | 122936 tok/s step 18826/19560 | loss 3.315251 (+0.86z)| norm 0.2246 (-0.50z)| lr 2.25e-06 | 4427.61 ms | 30.5% bf16 MFU | 122710 tok/s step 18827/19560 | loss 3.251239 (-0.86z)| norm 0.2259 (-0.33z)| lr 2.24e-06 | 4179.42 ms | 32.3% bf16 MFU | 122846 tok/s step 18828/19560 | loss 3.264690 (-0.48z)| norm 0.2230 (-0.66z)| lr 2.23e-06 | 4174.90 ms | 32.3% bf16 MFU | 122983 tok/s step 18829/19560 | loss 3.362757 (+2.12z)| norm 0.2386 (+1.14z)| lr 2.23e-06 | 4166.70 ms | 32.4% bf16 MFU | 123125 tok/s step 18830/19560 | loss 3.259098 (-0.65z)| norm 0.2286 (-0.02z)| lr 2.22e-06 | 4178.09 ms | 32.3% bf16 MFU | 123243 tok/s step 18831/19560 | loss 3.297854 (+0.38z)| norm 0.2257 (-0.36z)| lr 2.22e-06 | 4161.90 ms | 32.4% bf16 MFU | 123380 tok/s step 18832/19560 | loss 3.251434 (-0.85z)| norm 0.2305 (+0.20z)| lr 2.21e-06 | 4189.73 ms | 32.2% bf16 MFU | 123468 tok/s step 18833/19560 | loss 3.321197 (+1.00z)| norm 0.2224 (-0.73z)| lr 2.20e-06 | 4165.88 ms | 32.4% bf16 MFU | 123587 tok/s step 18834/19560 | loss 3.300819 (+0.45z)| norm 0.2274 (-0.16z)| lr 2.20e-06 | 4161.59 ms | 32.4% bf16 MFU | 123707 tok/s step 18835/19560 | loss 3.289732 (+0.15z)| norm 0.2282 (-0.07z)| lr 2.19e-06 | 4235.60 ms | 31.9% bf16 MFU | 123710 tok/s step 18836/19560 | loss 3.294035 (+0.26z)| norm 0.2229 (-0.68z)| lr 2.19e-06 | 4223.17 ms | 32.0% bf16 MFU | 123732 tok/s step 18837/19560 | loss 3.305243 (+0.55z)| norm 0.2230 (-0.66z)| lr 2.18e-06 | 4180.26 ms | 32.3% bf16 MFU | 123817 tok/s step 18838/19560 | loss 3.275462 (-0.25z)| norm 0.2193 (-1.08z)| lr 2.17e-06 | 4169.44 ms | 32.4% bf16 MFU | 123913 tok/s step 18839/19560 | loss 3.354981 (+1.85z)| norm 0.2274 (-0.15z)| lr 2.17e-06 | 4757.17 ms | 
28.4% bf16 MFU | 123228 tok/s step 18840/19560 | loss 3.327858 (+1.10z)| norm 0.2283 (-0.04z)| lr 2.16e-06 | 4210.06 ms | 32.1% bf16 MFU | 123293 tok/s step 18841/19560 | loss 3.320977 (+0.91z)| norm 0.2222 (-0.73z)| lr 2.16e-06 | 4177.68 ms | 32.3% bf16 MFU | 123403 tok/s step 18842/19560 | loss 3.218226 (-1.79z)| norm 0.2288 (+0.02z)| lr 2.15e-06 | 4168.76 ms | 32.4% bf16 MFU | 123521 tok/s step 18843/19560 | loss 3.238417 (-1.27z)| norm 0.2226 (-0.69z)| lr 2.14e-06 | 4165.06 ms | 32.4% bf16 MFU | 123639 tok/s step 18844/19560 | loss 3.303401 (+0.43z)| norm 0.2441 (+1.76z)| lr 2.14e-06 | 4177.30 ms | 32.3% bf16 MFU | 123733 tok/s step 18845/19560 | loss 3.417280 (+3.27z)| norm 0.2342 (+0.62z)| lr 2.13e-06 | 4172.73 ms | 32.4% bf16 MFU | 123828 tok/s step 18846/19560 | loss 3.328160 (+1.01z)| norm 0.2253 (-0.39z)| lr 2.13e-06 | 4181.09 ms | 32.3% bf16 MFU | 123907 tok/s step 18847/19560 | loss 3.246708 (-1.03z)| norm 0.2369 (+0.93z)| lr 2.12e-06 | 4179.91 ms | 32.3% bf16 MFU | 123983 tok/s step 18848/19560 | loss 3.311976 (+0.62z)| norm 0.2263 (-0.27z)| lr 2.11e-06 | 4179.40 ms | 32.3% bf16 MFU | 124056 tok/s step 18849/19560 | loss 3.297639 (+0.26z)| norm 0.2209 (-0.88z)| lr 2.11e-06 | 4179.70 ms | 32.3% bf16 MFU | 124125 tok/s step 18850/19560 | loss 3.279659 (-0.19z)| norm 0.2388 (+1.14z)| lr 2.10e-06 | 4179.58 ms | 32.3% bf16 MFU | 124191 tok/s step 18851/19560 | loss 3.273654 (-0.33z)| norm 0.2188 (-1.12z)| lr 2.10e-06 | 4172.38 ms | 32.4% bf16 MFU | 124264 tok/s step 18852/19560 | loss 3.309047 (+0.55z)| norm 0.2290 (+0.04z)| lr 2.09e-06 | 4179.94 ms | 32.3% bf16 MFU | 124322 tok/s step 18853/19560 | loss 3.454050 (+3.93z)| norm 0.2247 (-0.45z)| lr 2.08e-06 | 4171.01 ms | 32.4% bf16 MFU | 124391 tok/s step 18854/19560 | loss 3.321290 (+0.76z)| norm 0.2335 (+0.54z)| lr 2.08e-06 | 4228.48 ms | 31.9% bf16 MFU | 124371 tok/s step 18855/19560 | loss 3.269893 (-0.46z)| norm 0.2230 (-0.66z)| lr 2.07e-06 | 4168.75 ms | 32.4% bf16 MFU | 124441 tok/s step 18856/19560 
| loss 3.325397 (+0.88z)| norm 0.2256 (-0.35z)| lr 2.07e-06 | 4167.14 ms | 32.4% bf16 MFU | 124510 tok/s step 18857/19560 | loss 3.310769 (+0.52z)| norm 0.2246 (-0.47z)| lr 2.06e-06 | 4171.32 ms | 32.4% bf16 MFU | 124569 tok/s step 18858/19560 | loss 3.290542 (+0.03z)| norm 0.2314 (+0.30z)| lr 2.05e-06 | 4185.76 ms | 32.3% bf16 MFU | 124603 tok/s step 18859/19560 | loss 3.307539 (+0.45z)| norm 0.2402 (+1.29z)| lr 2.05e-06 | 4164.18 ms | 32.4% bf16 MFU | 124668 tok/s step 18860/19560 | loss 3.346732 (+1.37z)| norm 0.2358 (+0.78z)| lr 2.04e-06 | 4163.23 ms | 32.4% bf16 MFU | 124731 tok/s step 18861/19560 | loss 3.303553 (+0.33z)| norm 0.2210 (-0.90z)| lr 2.04e-06 | 4176.65 ms | 32.3% bf16 MFU | 124771 tok/s step 18862/19560 | loss 3.325130 (+0.84z)| norm 0.2292 (+0.04z)| lr 2.03e-06 | 4679.92 ms | 28.9% bf16 MFU | 124134 tok/s step 18863/19560 | loss 3.275196 (-0.36z)| norm 0.2167 (-1.37z)| lr 2.03e-06 | 4168.53 ms | 32.4% bf16 MFU | 124216 tok/s step 18864/19560 | loss 3.283430 (-0.16z)| norm 0.2278 (-0.12z)| lr 2.02e-06 | 4187.32 ms | 32.2% bf16 MFU | 124266 tok/s step 18865/19560 | loss 3.279087 (-0.27z)| norm 0.2231 (-0.63z)| lr 2.01e-06 | 4197.10 ms | 32.2% bf16 MFU | 124298 tok/s step 18866/19560 | loss 3.229213 (-1.50z)| norm 0.2378 (+1.01z)| lr 2.01e-06 | 4171.94 ms | 32.4% bf16 MFU | 124367 tok/s step 18867/19560 | loss 3.282427 (-0.19z)| norm 0.2265 (-0.26z)| lr 2.00e-06 | 4162.18 ms | 32.4% bf16 MFU | 124447 tok/s step 18868/19560 | loss 3.301719 (+0.27z)| norm 0.2228 (-0.69z)| lr 2.00e-06 | 4167.57 ms | 32.4% bf16 MFU | 124514 tok/s step 18869/19560 | loss 3.351357 (+1.47z)| norm 0.2243 (-0.51z)| lr 1.99e-06 | 4169.31 ms | 32.4% bf16 MFU | 124576 tok/s step 18870/19560 | loss 3.291231 (-0.02z)| norm 0.2418 (+1.55z)| lr 1.99e-06 | 4167.81 ms | 32.4% bf16 MFU | 124637 tok/s step 18871/19560 | loss 3.256600 (-0.86z)| norm 0.2140 (-1.70z)| lr 1.98e-06 | 4182.58 ms | 32.3% bf16 MFU | 124673 tok/s step 18872/19560 | loss 3.316540 (+0.61z)| norm 0.2245 (-0.46z)| 
lr 1.97e-06 | 4164.89 ms | 32.4% bf16 MFU | 124733 tok/s step 18873/19560 | loss 3.317625 (+0.63z)| norm 0.2371 (+0.99z)| lr 1.97e-06 | 4178.92 ms | 32.3% bf16 MFU | 124770 tok/s step 18874/19560 | loss 3.292546 (+0.01z)| norm 0.2238 (-0.55z)| lr 1.96e-06 | 4175.61 ms | 32.3% bf16 MFU | 124809 tok/s step 18875/19560 | loss 3.272457 (-0.48z)| norm 0.2273 (-0.14z)| lr 1.96e-06 | 4160.65 ms | 32.5% bf16 MFU | 124869 tok/s step 18876/19560 | loss 3.286722 (-0.15z)| norm 0.2220 (-0.76z)| lr 1.95e-06 | 4174.76 ms | 32.3% bf16 MFU | 124905 tok/s step 18877/19560 | loss 3.278219 (-0.36z)| norm 0.2235 (-0.59z)| lr 1.95e-06 | 4164.31 ms | 32.4% bf16 MFU | 124955 tok/s step 18878/19560 | loss 3.296163 (+0.08z)| norm 0.2162 (-1.41z)| lr 1.94e-06 | 4164.66 ms | 32.4% bf16 MFU | 125001 tok/s step 18879/19560 | loss 3.271497 (-0.53z)| norm 0.2230 (-0.62z)| lr 1.93e-06 | 4168.25 ms | 32.4% bf16 MFU | 125040 tok/s step 18880/19560 | loss 3.420386 (+3.07z)| norm 0.2466 (+2.06z)| lr 1.93e-06 | 4159.60 ms | 32.5% bf16 MFU | 125091 tok/s step 18881/19560 | loss 3.249948 (-1.06z)| norm 0.2200 (-0.95z)| lr 1.92e-06 | 4170.76 ms | 32.4% bf16 MFU | 125121 tok/s step 18882/19560 | loss 3.240363 (-1.27z)| norm 0.2212 (-0.80z)| lr 1.92e-06 | 4159.77 ms | 32.5% bf16 MFU | 125167 tok/s step 18883/19560 | loss 3.262299 (-0.73z)| norm 0.2219 (-0.72z)| lr 1.91e-06 | 4172.02 ms | 32.4% bf16 MFU | 125192 tok/s step 18884/19560 | loss 3.352565 (+1.41z)| norm 0.2266 (-0.19z)| lr 1.91e-06 | 4167.21 ms | 32.4% bf16 MFU | 125223 tok/s step 18885/19560 | loss 3.318033 (+0.58z)| norm 0.2363 (+0.92z)| lr 1.90e-06 | 4171.69 ms | 32.4% bf16 MFU | 125246 tok/s step 18886/19560 | loss 3.334961 (+0.98z)| norm 0.2286 (+0.06z)| lr 1.89e-06 | 4160.72 ms | 32.5% bf16 MFU | 125284 tok/s step 18887/19560 | loss 3.337105 (+1.03z)| norm 0.2195 (-1.02z)| lr 1.89e-06 | 4166.97 ms | 32.4% bf16 MFU | 125311 tok/s step 18888/19560 | loss 3.303940 (+0.23z)| norm 0.2270 (-0.08z)| lr 1.88e-06 | 4165.33 ms | 32.4% bf16 MFU | 
125339 tok/s step 18889/19560 | loss 3.248209 (-1.14z)| norm 0.2434 (+1.89z)| lr 1.88e-06 | 4176.15 ms | 32.3% bf16 MFU | 125349 tok/s step 18890/19560 | loss 3.302091 (+0.17z)| norm 0.2211 (-0.82z)| lr 1.87e-06 | 4173.64 ms | 32.4% bf16 MFU | 125363 tok/s step 18891/19560 | loss 3.323025 (+0.67z)| norm 0.2195 (-0.99z)| lr 1.87e-06 | 4170.50 ms | 32.4% bf16 MFU | 125380 tok/s step 18892/19560 | loss 3.304306 (+0.20z)| norm 0.2178 (-1.20z)| lr 1.86e-06 | 4195.58 ms | 32.2% bf16 MFU | 125359 tok/s step 18893/19560 | loss 3.297089 (+0.01z)| norm 0.2330 (+0.62z)| lr 1.86e-06 | 4170.54 ms | 32.4% bf16 MFU | 125377 tok/s step 18894/19560 | loss 3.354792 (+1.43z)| norm 0.2303 (+0.29z)| lr 1.85e-06 | 4165.23 ms | 32.4% bf16 MFU | 125402 tok/s step 18895/19560 | loss 3.327811 (+0.75z)| norm 0.2321 (+0.50z)| lr 1.84e-06 | 4173.74 ms | 32.3% bf16 MFU | 125412 tok/s step 18896/19560 | loss 3.310082 (+0.30z)| norm 0.2252 (-0.32z)| lr 1.84e-06 | 4173.94 ms | 32.3% bf16 MFU | 125422 tok/s step 18897/19560 | loss 3.279506 (-0.45z)| norm 0.2210 (-0.83z)| lr 1.83e-06 | 4165.76 ms | 32.4% bf16 MFU | 125444 tok/s step 18898/19560 | loss 3.266960 (-0.75z)| norm 0.2215 (-0.75z)| lr 1.83e-06 | 4176.06 ms | 32.3% bf16 MFU | 125449 tok/s step 18899/19560 | loss 3.337949 (+1.00z)| norm 0.2371 (+1.19z)| lr 1.82e-06 | 4179.55 ms | 32.3% bf16 MFU | 125449 tok/s step 18900/19560 | loss 3.286808 (-0.27z)| norm 0.2259 (-0.22z)| lr 1.82e-06 | 4173.50 ms | 32.4% bf16 MFU | 125457 tok/s step 18901/19560 | loss 3.311658 (+0.35z)| norm 0.2168 (-1.35z)| lr 1.81e-06 | 4172.81 ms | 32.4% bf16 MFU | 125467 tok/s step 18902/19560 | loss 3.307800 (+0.26z)| norm 0.2348 (+0.91z)| lr 1.81e-06 | 4167.78 ms | 32.4% bf16 MFU | 125483 tok/s step 18903/19560 | loss 3.227301 (-1.72z)| norm 0.2179 (-1.19z)| lr 1.80e-06 | 4151.50 ms | 32.5% bf16 MFU | 125523 tok/s step 18904/19560 | loss 3.294761 (-0.07z)| norm 0.2321 (+0.59z)| lr 1.79e-06 | 4186.38 ms | 32.3% bf16 MFU | 125509 tok/s step 18905/19560 | loss 3.355131 
(+1.47z)| norm 0.2247 (-0.33z)| lr 1.79e-06 | 4175.04 ms | 32.3% bf16 MFU | 125512 tok/s step 18906/19560 | loss 3.321703 (+0.60z)| norm 0.2340 (+0.82z)| lr 1.78e-06 | 4173.70 ms | 32.3% bf16 MFU | 125518 tok/s step 18907/19560 | loss 3.291234 (-0.18z)| norm 0.2238 (-0.45z)| lr 1.78e-06 | 4179.13 ms | 32.3% bf16 MFU | 125514 tok/s step 18908/19560 | loss 3.242091 (-1.44z)| norm 0.2198 (-0.94z)| lr 1.77e-06 | 4161.14 ms | 32.4% bf16 MFU | 125539 tok/s step 18909/19560 | loss 3.289763 (-0.22z)| norm 0.2263 (-0.13z)| lr 1.77e-06 | 4171.32 ms | 32.4% bf16 MFU | 125546 tok/s step 18910/19560 | loss 3.329356 (+0.79z)| norm 0.2305 (+0.38z)| lr 1.76e-06 | 4163.23 ms | 32.4% bf16 MFU | 125565 tok/s step 18911/19560 | loss 3.305215 (+0.16z)| norm 0.2237 (-0.47z)| lr 1.76e-06 | 4180.07 ms | 32.3% bf16 MFU | 125558 tok/s step 18912/19560 | loss 3.289837 (-0.23z)| norm 0.2241 (-0.43z)| lr 1.75e-06 | 4165.40 ms | 32.4% bf16 MFU | 125574 tok/s step 18913/19560 | loss 3.244850 (-1.38z)| norm 0.2284 (+0.10z)| lr 1.75e-06 | 4169.44 ms | 32.4% bf16 MFU | 125582 tok/s step 18914/19560 | loss 3.299886 (+0.03z)| norm 0.2207 (-0.87z)| lr 1.74e-06 | 4168.50 ms | 32.4% bf16 MFU | 125592 tok/s step 18915/19560 | loss 3.346003 (+1.20z)| norm 0.2179 (-1.21z)| lr 1.74e-06 | 4169.78 ms | 32.4% bf16 MFU | 125599 tok/s step 18916/19560 | loss 3.278864 (-0.53z)| norm 0.2226 (-0.62z)| lr 1.73e-06 | 4170.85 ms | 32.4% bf16 MFU | 125604 tok/s step 18917/19560 | loss 3.272455 (-0.70z)| norm 0.2330 (+0.67z)| lr 1.72e-06 | 4176.63 ms | 32.3% bf16 MFU | 125601 tok/s step 18918/19560 | loss 3.283690 (-0.41z)| norm 0.2214 (-0.79z)| lr 1.72e-06 | 4177.58 ms | 32.3% bf16 MFU | 125596 tok/s step 18919/19560 | loss 3.273172 (-0.67z)| norm 0.2289 (+0.15z)| lr 1.71e-06 | 4163.96 ms | 32.4% bf16 MFU | 125611 tok/s step 18920/19560 | loss 3.306776 (+0.21z)| norm 0.2211 (-0.83z)| lr 1.71e-06 | 4176.68 ms | 32.3% bf16 MFU | 125607 tok/s step 18921/19560 | loss 3.265260 (-0.88z)| norm 0.2223 (-0.67z)| lr 1.70e-06 | 
4166.88 ms | 32.4% bf16 MFU | 125618 tok/s step 18922/19560 | loss 3.337421 (+1.00z)| norm 0.2285 (+0.11z)| lr 1.70e-06 | 4178.26 ms | 32.3% bf16 MFU | 125611 tok/s step 18923/19560 | loss 3.270702 (-0.73z)| norm 0.2247 (-0.36z)| lr 1.69e-06 | 4574.22 ms | 29.5% bf16 MFU | 125061 tok/s step 18924/19560 | loss 3.327677 (+0.78z)| norm 0.2234 (-0.53z)| lr 1.69e-06 | 4473.36 ms | 30.2% bf16 MFU | 124668 tok/s step 18925/19560 | loss 3.265838 (-0.84z)| norm 0.2339 (+0.83z)| lr 1.68e-06 | 4218.39 ms | 32.0% bf16 MFU | 124649 tok/s step 18926/19560 | loss 3.267674 (-0.79z)| norm 0.2924 (+6.71z)| lr 1.68e-06 | 4185.89 ms | 32.3% bf16 MFU | 124679 tok/s step 18927/19560 | loss 3.358909 (+1.59z)| norm 0.2250 (-0.33z)| lr 1.67e-06 | 4161.16 ms | 32.4% bf16 MFU | 124745 tok/s step 18928/19560 | loss 3.274219 (-0.62z)| norm 0.2272 (-0.10z)| lr 1.67e-06 | 4157.86 ms | 32.5% bf16 MFU | 124813 tok/s step 18929/19560 | loss 3.292116 (-0.16z)| norm 0.2297 (+0.15z)| lr 1.66e-06 | 4170.44 ms | 32.4% bf16 MFU | 124858 tok/s step 18930/19560 | loss 3.274335 (-0.63z)| norm 0.2230 (-0.54z)| lr 1.66e-06 | 4168.72 ms | 32.4% bf16 MFU | 124903 tok/s step 18931/19560 | loss 3.293998 (-0.12z)| norm 0.2193 (-0.92z)| lr 1.65e-06 | 4198.43 ms | 32.2% bf16 MFU | 124902 tok/s step 18932/19560 | loss 3.311378 (+0.34z)| norm 0.2263 (-0.17z)| lr 1.65e-06 | 4168.48 ms | 32.4% bf16 MFU | 124946 tok/s step 18933/19560 | loss 3.326080 (+0.72z)| norm 0.2217 (-0.66z)| lr 1.64e-06 | 4166.98 ms | 32.4% bf16 MFU | 124989 tok/s step 18934/19560 | loss 3.302600 (+0.09z)| norm 0.2194 (-0.90z)| lr 1.63e-06 | 4164.61 ms | 32.4% bf16 MFU | 125034 tok/s step 18935/19560 | loss 3.259273 (-1.04z)| norm 0.2299 (+0.22z)| lr 1.63e-06 | 4158.88 ms | 32.5% bf16 MFU | 125086 tok/s step 18936/19560 | loss 3.281388 (-0.45z)| norm 0.2235 (-0.46z)| lr 1.62e-06 | 4158.83 ms | 32.5% bf16 MFU | 125135 tok/s step 18937/19560 | loss 3.290998 (-0.20z)| norm 0.2253 (-0.27z)| lr 1.62e-06 | 4155.57 ms | 32.5% bf16 MFU | 125186 tok/s step 
18938/19560 | loss 3.350049 (+1.37z)| norm 0.2228 (-0.53z)| lr 1.61e-06 | 4177.36 ms | 32.3% bf16 MFU | 125202 tok/s step 18939/19560 | loss 3.288971 (-0.26z)| norm 0.2209 (-0.73z)| lr 1.61e-06 | 4161.29 ms | 32.4% bf16 MFU | 125242 tok/s step 18940/19560 | loss 3.262125 (-0.98z)| norm 0.2262 (-0.16z)| lr 1.60e-06 | 4191.96 ms | 32.2% bf16 MFU | 125233 tok/s step 18941/19560 | loss 3.300606 (+0.06z)| norm 0.2246 (-0.33z)| lr 1.60e-06 | 4156.52 ms | 32.5% bf16 MFU | 125278 tok/s step 18942/19560 | loss 3.388638 (+2.37z)| norm 0.2328 (+0.57z)| lr 1.59e-06 | 4177.42 ms | 32.3% bf16 MFU | 125290 tok/s step 18943/19560 | loss 3.245809 (-1.39z)| norm 0.2277 (+0.02z)| lr 1.59e-06 | 4216.41 ms | 32.0% bf16 MFU | 125243 tok/s step 18944/19560 | loss 3.261560 (-0.98z)| norm 0.2168 (-1.16z)| lr 1.58e-06 | 4161.05 ms | 32.4% bf16 MFU | 125280 tok/s step 18945/19560 | loss 3.376029 (+1.99z)| norm 0.2438 (+1.74z)| lr 1.58e-06 | 4180.76 ms | 32.3% bf16 MFU | 125287 tok/s step 18946/19560 | loss 3.255928 (-1.12z)| norm 0.2256 (-0.19z)| lr 1.57e-06 | 4159.64 ms | 32.5% bf16 MFU | 125324 tok/s step 18947/19560 | loss 3.255581 (-1.12z)| norm 0.2227 (-0.51z)| lr 1.57e-06 | 4158.56 ms | 32.5% bf16 MFU | 125362 tok/s step 18948/19560 | loss 3.286209 (-0.33z)| norm 0.2247 (-0.28z)| lr 1.56e-06 | 4157.06 ms | 32.5% bf16 MFU | 125400 tok/s step 18949/19560 | loss 3.307710 (+0.22z)| norm 0.2236 (-0.40z)| lr 1.56e-06 | 4156.02 ms | 32.5% bf16 MFU | 125437 tok/s step 18950/19560 | loss 3.356483 (+1.47z)| norm 0.2295 (+0.28z)| lr 1.55e-06 | 4201.77 ms | 32.1% bf16 MFU | 125404 tok/s step 18951/19560 | loss 3.330588 (+0.80z)| norm 0.2322 (+0.60z)| lr 1.55e-06 | 4168.38 ms | 32.4% bf16 MFU | 125423 tok/s step 18952/19560 | loss 3.295943 (-0.09z)| norm 0.2192 (-0.91z)| lr 1.54e-06 | 4175.37 ms | 32.3% bf16 MFU | 125430 tok/s step 18953/19560 | loss 3.264886 (-0.88z)| norm 0.2457 (+2.10z)| lr 1.54e-06 | 4177.14 ms | 32.3% bf16 MFU | 125434 tok/s step 18954/19560 | loss 3.249829 (-1.25z)| norm 
0.2197 (-0.84z)| lr 1.53e-06 | 4166.89 ms | 32.4% bf16 MFU | 125454 tok/s step 18955/19560 | loss 3.271296 (-0.71z)| norm 0.2357 (+0.96z)| lr 1.53e-06 | 4166.97 ms | 32.4% bf16 MFU | 125472 tok/s step 18956/19560 | loss 3.282146 (-0.44z)| norm 0.2261 (-0.12z)| lr 1.52e-06 | 4176.42 ms | 32.3% bf16 MFU | 125475 tok/s step 18957/19560 | loss 3.309560 (+0.28z)| norm 0.2348 (+0.86z)| lr 1.52e-06 | 4164.81 ms | 32.4% bf16 MFU | 125496 tok/s step 18958/19560 | loss 3.277449 (-0.56z)| norm 0.2284 (+0.14z)| lr 1.51e-06 | 4281.05 ms | 31.5% bf16 MFU | 125344 tok/s step 18959/19560 | loss 3.273466 (-0.65z)| norm 0.2307 (+0.40z)| lr 1.51e-06 | 4480.42 ms | 30.1% bf16 MFU | 124928 tok/s step 18960/19560 | loss 3.313159 (+0.37z)| norm 0.2537 (+2.89z)| lr 1.50e-06 | 4171.38 ms | 32.4% bf16 MFU | 124966 tok/s step 18961/19560 | loss 3.332859 (+0.88z)| norm 0.2192 (-0.90z)| lr 1.50e-06 | 4162.40 ms | 32.4% bf16 MFU | 125016 tok/s step 18962/19560 | loss 3.309838 (+0.28z)| norm 0.2216 (-0.63z)| lr 1.49e-06 | 4159.40 ms | 32.5% bf16 MFU | 125067 tok/s step 18963/19560 | loss 3.286088 (-0.34z)| norm 0.2249 (-0.27z)| lr 1.49e-06 | 4165.69 ms | 32.4% bf16 MFU | 125107 tok/s step 18964/19560 | loss 3.249154 (-1.29z)| norm 0.2208 (-0.71z)| lr 1.48e-06 | 4166.29 ms | 32.4% bf16 MFU | 125143 tok/s step 18965/19560 | loss 3.241462 (-1.46z)| norm 0.2395 (+1.32z)| lr 1.48e-06 | 4162.02 ms | 32.4% bf16 MFU | 125185 tok/s step 18966/19560 | loss 3.302979 (+0.11z)| norm 0.2202 (-0.79z)| lr 1.47e-06 | 4161.78 ms | 32.4% bf16 MFU | 125224 tok/s step 18967/19560 | loss 3.337081 (+1.00z)| norm 0.2461 (+1.98z)| lr 1.47e-06 | 4186.15 ms | 32.3% bf16 MFU | 125225 tok/s step 18968/19560 | loss 3.334388 (+0.93z)| norm 0.2190 (-0.91z)| lr 1.46e-06 | 4155.89 ms | 32.5% bf16 MFU | 125272 tok/s step 18969/19560 | loss 3.304378 (+0.15z)| norm 0.2254 (-0.23z)| lr 1.46e-06 | 4163.78 ms | 32.4% bf16 MFU | 125304 tok/s step 18970/19560 | loss 3.341375 (+1.10z)| norm 0.2245 (-0.32z)| lr 1.45e-06 | 4174.37 ms | 
32.3% bf16 MFU | 125319 tok/s step 18971/19560 | loss 3.352848 (+1.38z)| norm 0.2227 (-0.52z)| lr 1.45e-06 | 4169.98 ms | 32.4% bf16 MFU | 125339 tok/s step 18972/19560 | loss 3.285680 (-0.38z)| norm 0.2193 (-0.86z)| lr 1.44e-06 | 4172.66 ms | 32.4% bf16 MFU | 125355 tok/s step 18973/19560 | loss 3.283783 (-0.42z)| norm 0.2302 (+0.32z)| lr 1.44e-06 | 4172.82 ms | 32.4% bf16 MFU | 125369 tok/s step 18974/19560 | loss 3.276156 (-0.61z)| norm 0.2244 (-0.31z)| lr 1.43e-06 | 4168.10 ms | 32.4% bf16 MFU | 125390 tok/s step 18975/19560 | loss 3.282799 (-0.44z)| norm 0.2153 (-1.27z)| lr 1.43e-06 | 4168.60 ms | 32.4% bf16 MFU | 125409 tok/s step 18976/19560 | loss 3.264755 (-0.93z)| norm 0.2202 (-0.74z)| lr 1.42e-06 | 4165.10 ms | 32.4% bf16 MFU | 125432 tok/s step 18977/19560 | loss 3.295833 (-0.07z)| norm 0.2205 (-0.70z)| lr 1.42e-06 | 4152.34 ms | 32.5% bf16 MFU | 125474 tok/s step 18978/19560 | loss 3.248129 (-1.36z)| norm 0.2267 (-0.03z)| lr 1.41e-06 | 4178.30 ms | 32.3% bf16 MFU | 125474 tok/s step 18979/19560 | loss 3.317515 (+0.51z)| norm 0.2314 (+0.47z)| lr 1.41e-06 | 4159.09 ms | 32.5% bf16 MFU | 125503 tok/s step 18980/19560 | loss 3.284805 (-0.37z)| norm 0.2273 (+0.03z)| lr 1.40e-06 | 4443.35 ms | 30.4% bf16 MFU | 125128 tok/s step 18981/19560 | loss 3.378018 (+2.30z)| norm 0.2258 (-0.14z)| lr 1.40e-06 | 4169.33 ms | 32.4% bf16 MFU | 125159 tok/s step 18982/19560 | loss 3.257459 (-1.14z)| norm 0.2219 (-0.55z)| lr 1.39e-06 | 4169.64 ms | 32.4% bf16 MFU | 125188 tok/s step 18983/19560 | loss 3.297293 (-0.01z)| norm 0.2223 (-0.51z)| lr 1.39e-06 | 4170.66 ms | 32.4% bf16 MFU | 125214 tok/s step 18984/19560 | loss 3.291813 (-0.16z)| norm 0.2212 (-0.62z)| lr 1.38e-06 | 5226.05 ms | 25.8% bf16 MFU | 123969 tok/s step 18985/19560 | loss 3.310043 (+0.37z)| norm 0.2198 (-0.77z)| lr 1.38e-06 | 4484.11 ms | 30.1% bf16 MFU | 123617 tok/s step 18986/19560 | loss 3.325790 (+0.81z)| norm 0.2359 (+0.97z)| lr 1.38e-06 | 4667.72 ms | 28.9% bf16 MFU | 123052 tok/s step 18987/19560 
| loss 3.301493 (+0.11z)| norm 0.2268 (-0.00z)| lr 1.37e-06 | 4343.08 ms | 31.1% bf16 MFU | 122936 tok/s step 18988/19560 | loss 3.252960 (-1.26z)| norm 0.2222 (-0.50z)| lr 1.37e-06 | 4207.38 ms | 32.1% bf16 MFU | 123019 tok/s step 18989/19560 | loss 3.266503 (-0.86z)| norm 0.2271 (+0.03z)| lr 1.36e-06 | 4687.23 ms | 28.8% bf16 MFU | 122461 tok/s step 18990/19560 | loss 3.305500 (+0.26z)| norm 0.2179 (-0.95z)| lr 1.36e-06 | 4199.42 ms | 32.2% bf16 MFU | 122580 tok/s step 18991/19560 | loss 3.297551 (+0.03z)| norm 0.2275 (+0.08z)| lr 1.35e-06 | 4189.28 ms | 32.2% bf16 MFU | 122709 tok/s step 18992/19560 | loss 3.280946 (-0.45z)| norm 0.2204 (-0.69z)| lr 1.35e-06 | 4295.68 ms | 31.4% bf16 MFU | 122676 tok/s step 18993/19560 | loss 3.243639 (-1.50z)| norm 0.2208 (-0.64z)| lr 1.34e-06 | 4215.02 ms | 32.0% bf16 MFU | 122761 tok/s step 18994/19560 | loss 3.296957 (+0.01z)| norm 0.2219 (-0.51z)| lr 1.34e-06 | 4346.11 ms | 31.1% bf16 MFU | 122655 tok/s step 18995/19560 | loss 3.353581 (+1.62z)| norm 0.2357 (+0.99z)| lr 1.33e-06 | 4244.61 ms | 31.8% bf16 MFU | 122698 tok/s step 18996/19560 | loss 3.356595 (+1.67z)| norm 0.2625 (+3.67z)| lr 1.33e-06 | 4184.31 ms | 32.3% bf16 MFU | 122828 tok/s step 18997/19560 | loss 3.282701 (-0.42z)| norm 0.2374 (+1.06z)| lr 1.32e-06 | 4166.73 ms | 32.4% bf16 MFU | 122978 tok/s step 18998/19560 | loss 3.310044 (+0.36z)| norm 0.2224 (-0.46z)| lr 1.32e-06 | 4156.64 ms | 32.5% bf16 MFU | 123136 tok/s step 18999/19560 | loss 3.233606 (-1.81z)| norm 0.2257 (-0.14z)| lr 1.31e-06 | 4288.91 ms | 31.5% bf16 MFU | 123091 tok/s step 19000/19560 | loss 3.319515 (+0.63z)| norm 0.2232 (-0.39z)| lr 1.31e-06 | 4205.00 ms | 32.1% bf16 MFU | 123171 tok/s val loss 3.265752 evaluating HellaSwag: 1250/1256 HellaSwag: 3041/10042 = 0.302828 step 19001/19560 | loss 3.301804 (+0.13z)| norm 0.2199 (-0.72z)| lr 1.30e-06 | 4311.28 ms | 31.3% bf16 MFU | 123093 tok/s step 19002/19560 | loss 3.303104 (+0.17z)| norm 0.2257 (-0.12z)| lr 1.30e-06 | 4434.52 ms | 30.4% bf16 MFU | 122849 tok/s step 19003/19560 | loss 3.341241 (+1.23z)| norm 0.2238 (-0.32z)| lr 1.29e-06 |
4154.67 ms | 32.5% bf16 MFU | 123017 tok/s step 19004/19560 | loss 3.381645 (+2.31z)| norm 0.2269 (+0.00z)| lr 1.29e-06 | 4499.06 ms | 30.0% bf16 MFU | 122692 tok/s step 19005/19560 | loss 3.222126 (-2.07z)| norm 0.2348 (+0.82z)| lr 1.29e-06 | 4439.18 ms | 30.4% bf16 MFU | 122463 tok/s step 19006/19560 | loss 3.314857 (+0.46z)| norm 0.2208 (-0.65z)| lr 1.28e-06 | 4166.96 ms | 32.4% bf16 MFU | 122631 tok/s step 19007/19560 | loss 3.261645 (-0.99z)| norm 0.2244 (-0.27z)| lr 1.28e-06 | 4349.13 ms | 31.0% bf16 MFU | 122527 tok/s step 19008/19560 | loss 3.295076 (-0.06z)| norm 0.2199 (-0.73z)| lr 1.27e-06 | 4317.61 ms | 31.3% bf16 MFU | 122472 tok/s step 19009/19560 | loss 3.310007 (+0.36z)| norm 0.2249 (-0.21z)| lr 1.27e-06 | 4498.69 ms | 30.0% bf16 MFU | 122176 tok/s step 19010/19560 | loss 3.295834 (-0.06z)| norm 0.2263 (-0.06z)| lr 1.26e-06 | 4157.84 ms | 32.5% bf16 MFU | 122372 tok/s step 19011/19560 | loss 3.415483 (+3.24z)| norm 0.2321 (+0.55z)| lr 1.26e-06 | 4168.18 ms | 32.4% bf16 MFU | 122542 tok/s step 19012/19560 | loss 3.243578 (-1.53z)| norm 0.2212 (-0.60z)| lr 1.25e-06 | 4163.87 ms | 32.4% bf16 MFU | 122711 tok/s step 19013/19560 | loss 3.271649 (-0.73z)| norm 0.2223 (-0.48z)| lr 1.25e-06 | 4184.95 ms | 32.3% bf16 MFU | 122839 tok/s step 19014/19560 | loss 3.306846 (+0.25z)| norm 0.2200 (-0.72z)| lr 1.24e-06 | 4271.24 ms | 31.6% bf16 MFU | 122835 tok/s step 19015/19560 | loss 3.354602 (+1.58z)| norm 0.2487 (+2.28z)| lr 1.24e-06 | 4153.20 ms | 32.5% bf16 MFU | 123005 tok/s step 19016/19560 | loss 3.282458 (-0.42z)| norm 0.2286 (+0.17z)| lr 1.24e-06 | 4358.38 ms | 31.0% bf16 MFU | 122869 tok/s step 19017/19560 | loss 3.230699 (-1.85z)| norm 0.2368 (+1.05z)| lr 1.23e-06 | 4208.46 ms | 32.1% bf16 MFU | 122955 tok/s step 19018/19560 | loss 3.257600 (-1.09z)| norm 0.2130 (-1.45z)| lr 1.23e-06 | 4171.44 ms | 32.4% bf16 MFU | 123091 tok/s step 19019/19560 | loss 3.275915 (-0.58z)| norm 0.2149 (-1.24z)| lr 1.22e-06 | 4164.08 ms | 32.4% bf16 MFU | 123232 tok/s step 
19020/19560 | loss 3.230109 (-1.80z)| norm 0.2350 (+0.84z)| lr 1.22e-06 | 4160.52 ms | 32.5% bf16 MFU | 123371 tok/s step 19021/19560 | loss 3.292789 (-0.09z)| norm 0.2184 (-0.88z)| lr 1.21e-06 | 4345.70 ms | 31.1% bf16 MFU | 123235 tok/s step 19022/19560 | loss 3.417371 (+3.18z)| norm 0.2445 (+1.81z)| lr 1.21e-06 | 4178.76 ms | 32.3% bf16 MFU | 123346 tok/s step 19023/19560 | loss 3.333244 (+0.96z)| norm 0.2227 (-0.43z)| lr 1.20e-06 | 4205.82 ms | 32.1% bf16 MFU | 123412 tok/s step 19024/19560 | loss 3.295742 (-0.02z)| norm 0.2501 (+2.32z)| lr 1.20e-06 | 4201.24 ms | 32.1% bf16 MFU | 123481 tok/s step 19025/19560 | loss 3.307782 (+0.29z)| norm 0.2238 (-0.33z)| lr 1.19e-06 | 4164.58 ms | 32.4% bf16 MFU | 123602 tok/s step 19026/19560 | loss 3.277162 (-0.52z)| norm 0.2307 (+0.36z)| lr 1.19e-06 | 4175.59 ms | 32.3% bf16 MFU | 123700 tok/s step 19027/19560 | loss 3.290393 (-0.16z)| norm 0.2694 (+3.99z)| lr 1.19e-06 | 4166.90 ms | 32.4% bf16 MFU | 123806 tok/s step 19028/19560 | loss 3.273889 (-0.60z)| norm 0.2300 (+0.24z)| lr 1.18e-06 | 4162.27 ms | 32.4% bf16 MFU | 123913 tok/s step 19029/19560 | loss 3.269534 (-0.70z)| norm 0.2282 (+0.06z)| lr 1.18e-06 | 4279.09 ms | 31.6% bf16 MFU | 123844 tok/s step 19030/19560 | loss 3.335808 (+1.04z)| norm 0.2290 (+0.15z)| lr 1.17e-06 | 4160.82 ms | 32.4% bf16 MFU | 123952 tok/s step 19031/19560 | loss 3.320432 (+0.62z)| norm 0.2263 (-0.12z)| lr 1.17e-06 | 4158.31 ms | 32.5% bf16 MFU | 124059 tok/s step 19032/19560 | loss 3.280180 (-0.45z)| norm 0.2298 (+0.22z)| lr 1.16e-06 | 4173.43 ms | 32.4% bf16 MFU | 124137 tok/s step 19033/19560 | loss 3.235676 (-1.61z)| norm 0.2253 (-0.21z)| lr 1.16e-06 | 4173.77 ms | 32.3% bf16 MFU | 124211 tok/s step 19034/19560 | loss 3.279061 (-0.44z)| norm 0.2207 (-0.64z)| lr 1.16e-06 | 4167.89 ms | 32.4% bf16 MFU | 124290 tok/s step 19035/19560 | loss 3.327877 (+0.84z)| norm 0.2150 (-1.18z)| lr 1.15e-06 | 4163.46 ms | 32.4% bf16 MFU | 124372 tok/s step 19036/19560 | loss 3.299691 (+0.09z)| norm 
0.2172 (-0.96z)| lr 1.15e-06 | 4160.98 ms | 32.4% bf16 MFU | 124453 tok/s step 19037/19560 | loss 3.253067 (-1.15z)| norm 0.2313 (+0.37z)| lr 1.14e-06 | 4170.83 ms | 32.4% bf16 MFU | 124516 tok/s step 19038/19560 | loss 3.244913 (-1.34z)| norm 0.2246 (-0.26z)| lr 1.14e-06 | 4164.10 ms | 32.4% bf16 MFU | 124585 tok/s step 19039/19560 | loss 3.263680 (-0.83z)| norm 0.2343 (+0.65z)| lr 1.13e-06 | 4163.98 ms | 32.4% bf16 MFU | 124651 tok/s step 19040/19560 | loss 3.379100 (+2.16z)| norm 0.2464 (+1.76z)| lr 1.13e-06 | 4372.66 ms | 30.9% bf16 MFU | 124414 tok/s step 19041/19560 | loss 3.213381 (-2.11z)| norm 0.2475 (+1.83z)| lr 1.12e-06 | 4540.69 ms | 29.7% bf16 MFU | 123966 tok/s step 19042/19560 | loss 3.304781 (+0.23z)| norm 0.2199 (-0.73z)| lr 1.12e-06 | 4165.21 ms | 32.4% bf16 MFU | 124062 tok/s step 19043/19560 | loss 3.292089 (-0.08z)| norm 0.2207 (-0.66z)| lr 1.12e-06 | 4172.99 ms | 32.4% bf16 MFU | 124141 tok/s step 19044/19560 | loss 3.288298 (-0.18z)| norm 0.2218 (-0.56z)| lr 1.11e-06 | 4200.84 ms | 32.1% bf16 MFU | 124174 tok/s step 19045/19560 | loss 3.299546 (+0.10z)| norm 0.2218 (-0.54z)| lr 1.11e-06 | 4159.21 ms | 32.5% bf16 MFU | 124268 tok/s step 19046/19560 | loss 3.258347 (-0.96z)| norm 0.2210 (-0.61z)| lr 1.10e-06 | 4171.60 ms | 32.4% bf16 MFU | 124339 tok/s step 19047/19560 | loss 3.248033 (-1.21z)| norm 0.2235 (-0.38z)| lr 1.10e-06 | 4165.07 ms | 32.4% bf16 MFU | 124415 tok/s step 19048/19560 | loss 3.332662 (+0.95z)| norm 0.2251 (-0.23z)| lr 1.09e-06 | 4169.04 ms | 32.4% bf16 MFU | 124483 tok/s step 19049/19560 | loss 3.293444 (-0.06z)| norm 0.2261 (-0.14z)| lr 1.09e-06 | 4170.02 ms | 32.4% bf16 MFU | 124545 tok/s step 19050/19560 | loss 3.306877 (+0.30z)| norm 0.2168 (-1.00z)| lr 1.09e-06 | 4176.67 ms | 32.3% bf16 MFU | 124594 tok/s step 19051/19560 | loss 3.236692 (-1.49z)| norm 0.2166 (-1.01z)| lr 1.08e-06 | 4175.80 ms | 32.3% bf16 MFU | 124642 tok/s step 19052/19560 | loss 3.281710 (-0.33z)| norm 0.2285 (+0.08z)| lr 1.08e-06 | 4163.18 ms | 
32.4% bf16 MFU | 124707 tok/s step 19053/19560 | loss 3.279095 (-0.40z)| norm 0.2290 (+0.14z)| lr 1.07e-06 | 4165.85 ms | 32.4% bf16 MFU | 124764 tok/s step 19054/19560 | loss 3.263162 (-0.81z)| norm 0.2176 (-1.02z)| lr 1.07e-06 | 4160.66 ms | 32.5% bf16 MFU | 124826 tok/s step 19055/19560 | loss 3.280964 (-0.34z)| norm 0.2296 (+0.29z)| lr 1.06e-06 | 4177.11 ms | 32.3% bf16 MFU | 124861 tok/s step 19056/19560 | loss 3.246132 (-1.23z)| norm 0.2309 (+0.42z)| lr 1.06e-06 | 4153.20 ms | 32.5% bf16 MFU | 124930 tok/s step 19057/19560 | loss 3.321403 (+0.70z)| norm 0.2236 (-0.36z)| lr 1.06e-06 | 4171.33 ms | 32.4% bf16 MFU | 124967 tok/s step 19058/19560 | loss 3.252757 (-1.06z)| norm 0.2266 (-0.04z)| lr 1.05e-06 | 4178.29 ms | 32.3% bf16 MFU | 124993 tok/s step 19059/19560 | loss 3.279073 (-0.38z)| norm 0.2241 (-0.31z)| lr 1.05e-06 | 4178.37 ms | 32.3% bf16 MFU | 125017 tok/s step 19060/19560 | loss 3.330736 (+0.94z)| norm 0.2328 (+0.62z)| lr 1.04e-06 | 4159.65 ms | 32.5% bf16 MFU | 125068 tok/s step 19061/19560 | loss 3.344602 (+1.28z)| norm 0.2278 (+0.08z)| lr 1.04e-06 | 4162.92 ms | 32.4% bf16 MFU | 125112 tok/s step 19062/19560 | loss 3.277791 (-0.41z)| norm 0.2212 (-0.64z)| lr 1.04e-06 | 4163.22 ms | 32.4% bf16 MFU | 125153 tok/s step 19063/19560 | loss 3.334639 (+1.02z)| norm 0.2253 (-0.20z)| lr 1.03e-06 | 4204.85 ms | 32.1% bf16 MFU | 125130 tok/s step 19064/19560 | loss 3.311255 (+0.42z)| norm 0.2200 (-0.77z)| lr 1.03e-06 | 4169.50 ms | 32.4% bf16 MFU | 125161 tok/s step 19065/19560 | loss 3.268756 (-0.66z)| norm 0.2248 (-0.25z)| lr 1.02e-06 | 4178.66 ms | 32.3% bf16 MFU | 125176 tok/s step 19066/19560 | loss 3.352205 (+1.46z)| norm 0.2767 (+4.83z)| lr 1.02e-06 | 4168.39 ms | 32.4% bf16 MFU | 125206 tok/s step 19067/19560 | loss 3.237439 (-1.43z)| norm 0.2303 (+0.27z)| lr 1.02e-06 | 4206.42 ms | 32.1% bf16 MFU | 125178 tok/s step 19068/19560 | loss 3.292468 (-0.05z)| norm 0.2213 (-0.61z)| lr 1.01e-06 | 4167.27 ms | 32.4% bf16 MFU | 125209 tok/s step 19069/19560 
| loss 3.253120 (-1.03z)| norm 0.2264 (-0.12z)| lr 1.01e-06 | 4175.25 ms | 32.3% bf16 MFU | 125227 tok/s step 19070/19560 | loss 3.311124 (+0.45z)| norm 0.2227 (-0.47z)| lr 1.00e-06 | 4153.86 ms | 32.5% bf16 MFU | 125277 tok/s step 19071/19560 | loss 3.265212 (-0.73z)| norm 0.2362 (+0.85z)| lr 9.99e-07 | 4156.62 ms | 32.5% bf16 MFU | 125320 tok/s step 19072/19560 | loss 3.301248 (+0.19z)| norm 0.2263 (-0.13z)| lr 9.95e-07 | 4150.61 ms | 32.5% bf16 MFU | 125369 tok/s step 19073/19560 | loss 3.232839 (-1.57z)| norm 0.2252 (-0.23z)| lr 9.91e-07 | 4289.00 ms | 31.5% bf16 MFU | 125213 tok/s step 19074/19560 | loss 3.310213 (+0.44z)| norm 0.2369 (+0.92z)| lr 9.87e-07 | 4142.75 ms | 32.6% bf16 MFU | 125280 tok/s step 19075/19560 | loss 3.297997 (+0.11z)| norm 0.2218 (-0.57z)| lr 9.83e-07 | 4164.84 ms | 32.4% bf16 MFU | 125310 tok/s step 19076/19560 | loss 3.320951 (+0.71z)| norm 0.2196 (-0.78z)| lr 9.78e-07 | 4145.94 ms | 32.6% bf16 MFU | 125368 tok/s step 19077/19560 | loss 3.483484 (+4.52z)| norm 0.2565 (+2.76z)| lr 9.74e-07 | 4148.02 ms | 32.5% bf16 MFU | 125419 tok/s step 19078/19560 | loss 3.236629 (-1.39z)| norm 0.2240 (-0.36z)| lr 9.70e-07 | 4153.15 ms | 32.5% bf16 MFU | 125460 tok/s step 19079/19560 | loss 3.303023 (+0.21z)| norm 0.2191 (-0.82z)| lr 9.66e-07 | 4166.23 ms | 32.4% bf16 MFU | 125479 tok/s step 19080/19560 | loss 3.300261 (+0.15z)| norm 0.2243 (-0.32z)| lr 9.62e-07 | 4148.74 ms | 32.5% bf16 MFU | 125524 tok/s step 19081/19560 | loss 3.301672 (+0.17z)| norm 0.2512 (+2.24z)| lr 9.58e-07 | 4156.07 ms | 32.5% bf16 MFU | 125555 tok/s step 19082/19560 | loss 3.211507 (-1.97z)| norm 0.2283 (+0.05z)| lr 9.54e-07 | 4154.92 ms | 32.5% bf16 MFU | 125587 tok/s step 19083/19560 | loss 3.295716 (+0.03z)| norm 0.2276 (-0.01z)| lr 9.50e-07 | 4203.87 ms | 32.1% bf16 MFU | 125543 tok/s step 19084/19560 | loss 3.275152 (-0.46z)| norm 0.2344 (+0.64z)| lr 9.46e-07 | 4149.68 ms | 32.5% bf16 MFU | 125583 tok/s step 19085/19560 | loss 3.280662 (-0.32z)| norm 0.2429 (+1.43z)| 
lr 9.43e-07 | 4151.66 ms | 32.5% bf16 MFU | 125618 tok/s step 19086/19560 | loss 3.315404 (+0.50z)| norm 0.2231 (-0.45z)| lr 9.39e-07 | 4168.90 ms | 32.4% bf16 MFU | 125625 tok/s step 19087/19560 | loss 3.312257 (+0.42z)| norm 0.2236 (-0.39z)| lr 9.35e-07 | 4163.98 ms | 32.4% bf16 MFU | 125640 tok/s step 19088/19560 | loss 3.400995 (+2.47z)| norm 0.2412 (+1.31z)| lr 9.31e-07 | 4169.14 ms | 32.4% bf16 MFU | 125645 tok/s step 19089/19560 | loss 3.420429 (+2.82z)| norm 0.3217 (+7.04z)| lr 9.27e-07 | 4169.32 ms | 32.4% bf16 MFU | 125651 tok/s step 19090/19560 | loss 3.299961 (+0.09z)| norm 0.2272 (-0.10z)| lr 9.23e-07 | 4151.69 ms | 32.5% bf16 MFU | 125682 tok/s step 19091/19560 | loss 3.265735 (-0.68z)| norm 0.2157 (-0.96z)| lr 9.19e-07 | 4151.44 ms | 32.5% bf16 MFU | 125713 tok/s step 19092/19560 | loss 3.354589 (+1.31z)| norm 0.2322 (+0.28z)| lr 9.15e-07 | 4149.28 ms | 32.5% bf16 MFU | 125745 tok/s step 19093/19560 | loss 3.275721 (-0.48z)| norm 0.2170 (-0.85z)| lr 9.11e-07 | 4166.01 ms | 32.4% bf16 MFU | 125750 tok/s step 19094/19560 | loss 3.290404 (-0.15z)| norm 0.2282 (-0.02z)| lr 9.07e-07 | 4153.31 ms | 32.5% bf16 MFU | 125774 tok/s step 19095/19560 | loss 3.285401 (-0.25z)| norm 0.2269 (-0.10z)| lr 9.03e-07 | 4151.48 ms | 32.5% bf16 MFU | 125800 tok/s step 19096/19560 | loss 3.237268 (-1.32z)| norm 0.2173 (-0.83z)| lr 8.99e-07 | 4151.46 ms | 32.5% bf16 MFU | 125824 tok/s step 19097/19560 | loss 3.275361 (-0.45z)| norm 0.2192 (-0.68z)| lr 8.96e-07 | 4170.16 ms | 32.4% bf16 MFU | 125819 tok/s step 19098/19560 | loss 3.268345 (-0.60z)| norm 0.2242 (-0.30z)| lr 8.92e-07 | 4178.91 ms | 32.3% bf16 MFU | 125801 tok/s step 19099/19560 | loss 3.250857 (-0.98z)| norm 0.2252 (-0.23z)| lr 8.88e-07 | 4155.70 ms | 32.5% bf16 MFU | 125819 tok/s step 19100/19560 | loss 3.241835 (-1.18z)| norm 0.2188 (-0.71z)| lr 8.84e-07 | 4147.51 ms | 32.6% bf16 MFU | 125849 tok/s step 19101/19560 | loss 3.339695 (+1.03z)| norm 0.2261 (-0.16z)| lr 8.80e-07 | 4147.67 ms | 32.6% bf16 MFU | 
125877 tok/s step 19102/19560 | loss 3.356045 (+1.38z)| norm 0.2390 (+0.81z)| lr 8.76e-07 | 4159.59 ms | 32.5% bf16 MFU | 125885 tok/s step 19103/19560 | loss 3.330038 (+0.78z)| norm 0.2282 (-0.01z)| lr 8.73e-07 | 4163.37 ms | 32.4% bf16 MFU | 125887 tok/s step 19104/19560 | loss 3.229898 (-1.44z)| norm 0.2181 (-0.78z)| lr 8.69e-07 | 4183.22 ms | 32.3% bf16 MFU | 125859 tok/s step 19105/19560 | loss 3.270710 (-0.53z)| norm 0.2253 (-0.23z)| lr 8.65e-07 | 4151.71 ms | 32.5% bf16 MFU | 125881 tok/s step 19106/19560 | loss 3.289152 (-0.13z)| norm 0.2226 (-0.43z)| lr 8.61e-07 | 4150.98 ms | 32.5% bf16 MFU | 125902 tok/s step 19107/19560 | loss 3.321361 (+0.59z)| norm 0.2377 (+0.70z)| lr 8.57e-07 | 4157.31 ms | 32.5% bf16 MFU | 125912 tok/s step 19108/19560 | loss 3.318718 (+0.52z)| norm 0.2178 (-0.79z)| lr 8.54e-07 | 4148.74 ms | 32.5% bf16 MFU | 125935 tok/s step 19109/19560 | loss 3.227751 (-1.49z)| norm 0.2235 (-0.37z)| lr 8.50e-07 | 4155.33 ms | 32.5% bf16 MFU | 125947 tok/s step 19110/19560 | loss 3.307218 (+0.29z)| norm 0.2210 (-0.55z)| lr 8.46e-07 | 4155.43 ms | 32.5% bf16 MFU | 125958 tok/s step 19111/19560 | loss 3.307143 (+0.28z)| norm 0.2191 (-0.70z)| lr 8.42e-07 | 4152.71 ms | 32.5% bf16 MFU | 125973 tok/s step 19112/19560 | loss 3.370971 (+1.68z)| norm 0.2394 (+0.83z)| lr 8.39e-07 | 4155.24 ms | 32.5% bf16 MFU | 125983 tok/s step 19113/19560 | loss 3.310955 (+0.35z)| norm 0.2187 (-0.73z)| lr 8.35e-07 | 4146.74 ms | 32.6% bf16 MFU | 126006 tok/s step 19114/19560 | loss 3.284665 (-0.23z)| norm 0.2209 (-0.56z)| lr 8.31e-07 | 4152.15 ms | 32.5% bf16 MFU | 126019 tok/s step 19115/19560 | loss 3.285103 (-0.21z)| norm 0.2344 (+0.46z)| lr 8.28e-07 | 4153.51 ms | 32.5% bf16 MFU | 126029 tok/s step 19116/19560 | loss 3.430821 (+2.90z)| norm 0.2352 (+0.51z)| lr 8.24e-07 | 4144.05 ms | 32.6% bf16 MFU | 126054 tok/s step 19117/19560 | loss 3.281568 (-0.32z)| norm 0.2253 (-0.24z)| lr 8.20e-07 | 4151.17 ms | 32.5% bf16 MFU | 126066 tok/s step 19118/19560 | loss 3.226616 
(-1.48z)| norm 0.2239 (-0.35z)| lr 8.16e-07 | 4152.80 ms | 32.5% bf16 MFU | 126075 tok/s step 19119/19560 | loss 3.304779 (+0.19z)| norm 0.2257 (-0.21z)| lr 8.13e-07 | 4169.14 ms | 32.4% bf16 MFU | 126059 tok/s step 19120/19560 | loss 3.306673 (+0.23z)| norm 0.2186 (-0.74z)| lr 8.09e-07 | 4156.03 ms | 32.5% bf16 MFU | 126064 tok/s step 19121/19560 | loss 3.327683 (+0.67z)| norm 0.2221 (-0.48z)| lr 8.05e-07 | 4150.16 ms | 32.5% bf16 MFU | 126077 tok/s step 19122/19560 | loss 3.227914 (-1.45z)| norm 0.2208 (-0.58z)| lr 8.02e-07 | 4145.66 ms | 32.6% bf16 MFU | 126096 tok/s step 19123/19560 | loss 3.298972 (+0.07z)| norm 0.2159 (-0.94z)| lr 7.98e-07 | 4163.43 ms | 32.4% bf16 MFU | 126088 tok/s step 19124/19560 | loss 3.219295 (-1.60z)| norm 0.2262 (-0.14z)| lr 7.94e-07 | 4155.96 ms | 32.5% bf16 MFU | 126091 tok/s step 19125/19560 | loss 3.265950 (-0.61z)| norm 0.2236 (-0.34z)| lr 7.91e-07 | 4157.06 ms | 32.5% bf16 MFU | 126093 tok/s step 19126/19560 | loss 3.237130 (-1.20z)| norm 0.2309 (+0.23z)| lr 7.87e-07 | 4153.58 ms | 32.5% bf16 MFU | 126099 tok/s step 19127/19560 | loss 3.245379 (-1.03z)| norm 0.2349 (+0.53z)| lr 7.84e-07 | 4170.13 ms | 32.4% bf16 MFU | 126080 tok/s step 19128/19560 | loss 3.291151 (-0.05z)| norm 0.2216 (-0.50z)| lr 7.80e-07 | 4160.95 ms | 32.4% bf16 MFU | 126077 tok/s step 19129/19560 | loss 3.330707 (+0.78z)| norm 0.2171 (-0.84z)| lr 7.76e-07 | 4343.52 ms | 31.1% bf16 MFU | 125808 tok/s step 19130/19560 | loss 3.298596 (+0.10z)| norm 0.2172 (-0.83z)| lr 7.73e-07 | 4144.95 ms | 32.6% bf16 MFU | 125842 tok/s step 19131/19560 | loss 3.409079 (+2.39z)| norm 0.2866 (+4.16z)| lr 7.69e-07 | 4185.64 ms | 32.3% bf16 MFU | 125813 tok/s step 19132/19560 | loss 3.257256 (-0.76z)| norm 0.2216 (-0.49z)| lr 7.66e-07 | 4228.52 ms | 31.9% bf16 MFU | 125722 tok/s step 19133/19560 | loss 3.283051 (-0.23z)| norm 0.2258 (-0.18z)| lr 7.62e-07 | 4158.97 ms | 32.5% bf16 MFU | 125739 tok/s step 19134/19560 | loss 3.248804 (-0.94z)| norm 0.2215 (-0.49z)| lr 7.59e-07 | 
4158.68 ms | 32.5% bf16 MFU | 125755 tok/s step 19135/19560 | loss 3.278085 (-0.33z)| norm 0.2359 (+0.54z)| lr 7.55e-07 | 4151.93 ms | 32.5% bf16 MFU | 125781 tok/s step 19136/19560 | loss 3.316257 (+0.48z)| norm 0.2206 (-0.56z)| lr 7.51e-07 | 4153.41 ms | 32.5% bf16 MFU | 125804 tok/s step 19137/19560 | loss 3.300135 (+0.14z)| norm 0.2263 (-0.16z)| lr 7.48e-07 | 4151.71 ms | 32.5% bf16 MFU | 125828 tok/s step 19138/19560 | loss 3.312308 (+0.39z)| norm 0.2429 (+1.02z)| lr 7.44e-07 | 4162.62 ms | 32.4% bf16 MFU | 125834 tok/s step 19139/19560 | loss 3.268130 (-0.53z)| norm 0.2205 (-0.57z)| lr 7.41e-07 | 4198.17 ms | 32.2% bf16 MFU | 125786 tok/s step 19140/19560 | loss 3.263757 (-0.63z)| norm 0.2250 (-0.25z)| lr 7.37e-07 | 4156.16 ms | 32.5% bf16 MFU | 125804 tok/s step 19141/19560 | loss 3.336079 (+0.93z)| norm 0.2277 (-0.06z)| lr 7.34e-07 | 4160.65 ms | 32.5% bf16 MFU | 125815 tok/s step 19142/19560 | loss 3.291248 (-0.04z)| norm 0.2427 (+0.99z)| lr 7.30e-07 | 4144.79 ms | 32.6% bf16 MFU | 125849 tok/s step 19143/19560 | loss 3.288213 (-0.10z)| norm 0.2400 (+0.81z)| lr 7.27e-07 | 4169.45 ms | 32.4% bf16 MFU | 125844 tok/s step 19144/19560 | loss 3.260436 (-0.70z)| norm 0.2228 (-0.42z)| lr 7.23e-07 | 4148.97 ms | 32.5% bf16 MFU | 125870 tok/s step 19145/19560 | loss 3.292653 (-0.01z)| norm 0.2274 (-0.09z)| lr 7.20e-07 | 4153.03 ms | 32.5% bf16 MFU | 125888 tok/s step 19146/19560 | loss 3.305789 (+0.28z)| norm 0.2194 (-0.66z)| lr 7.17e-07 | 4144.97 ms | 32.6% bf16 MFU | 125918 tok/s step 19147/19560 | loss 3.214544 (-1.71z)| norm 0.2320 (+0.24z)| lr 7.13e-07 | 4150.42 ms | 32.5% bf16 MFU | 125938 tok/s step 19148/19560 | loss 3.332122 (+0.84z)| norm 0.2335 (+0.35z)| lr 7.10e-07 | 4140.90 ms | 32.6% bf16 MFU | 125972 tok/s step 19149/19560 | loss 3.293500 (-0.00z)| norm 0.2220 (-0.48z)| lr 7.06e-07 | 4160.05 ms | 32.5% bf16 MFU | 125975 tok/s step 19150/19560 | loss 3.254009 (-0.86z)| norm 0.2227 (-0.42z)| lr 7.03e-07 | 4157.46 ms | 32.5% bf16 MFU | 125982 tok/s step 
19151/19560 | loss 3.252094 (-0.89z)| norm 0.2268 (-0.13z)| lr 6.99e-07 | 4171.88 ms | 32.4% bf16 MFU | 125966 tok/s step 19152/19560 | loss 3.280301 (-0.25z)| norm 0.2220 (-0.47z)| lr 6.96e-07 | 4160.92 ms | 32.4% bf16 MFU | 125968 tok/s step 19153/19560 | loss 3.279169 (-0.27z)| norm 0.2214 (-0.51z)| lr 6.93e-07 | 4159.08 ms | 32.5% bf16 MFU | 125973 tok/s step 19154/19560 | loss 3.287112 (-0.10z)| norm 0.2167 (-0.84z)| lr 6.89e-07 | 4164.63 ms | 32.4% bf16 MFU | 125968 tok/s step 19155/19560 | loss 3.210588 (-1.79z)| norm 0.2248 (-0.24z)| lr 6.86e-07 | 4143.89 ms | 32.6% bf16 MFU | 125996 tok/s step 19156/19560 | loss 3.245724 (-1.00z)| norm 0.2184 (-0.71z)| lr 6.82e-07 | 4167.83 ms | 32.4% bf16 MFU | 125986 tok/s step 19157/19560 | loss 3.245091 (-1.00z)| norm 0.2168 (-0.82z)| lr 6.79e-07 | 4234.96 ms | 31.9% bf16 MFU | 125877 tok/s step 19158/19560 | loss 3.308646 (+0.41z)| norm 0.2233 (-0.33z)| lr 6.76e-07 | 4150.19 ms | 32.5% bf16 MFU | 125899 tok/s step 19159/19560 | loss 3.284779 (-0.11z)| norm 0.2216 (-0.45z)| lr 6.72e-07 | 4156.46 ms | 32.5% bf16 MFU | 125911 tok/s step 19160/19560 | loss 3.253890 (-0.79z)| norm 0.2264 (-0.10z)| lr 6.69e-07 | 4190.97 ms | 32.2% bf16 MFU | 125871 tok/s step 19161/19560 | loss 3.314312 (+0.53z)| norm 0.2223 (-0.40z)| lr 6.66e-07 | 4187.04 ms | 32.2% bf16 MFU | 125838 tok/s step 19162/19560 | loss 3.276646 (-0.30z)| norm 0.2197 (-0.60z)| lr 6.62e-07 | 4194.58 ms | 32.2% bf16 MFU | 125796 tok/s step 19163/19560 | loss 3.280368 (-0.21z)| norm 0.2249 (-0.21z)| lr 6.59e-07 | 9580.80 ms | 14.1% bf16 MFU | 122242 tok/s step 19164/19560 | loss 3.332078 (+0.93z)| norm 0.2231 (-0.35z)| lr 6.56e-07 | 4290.99 ms | 31.5% bf16 MFU | 122239 tok/s step 19165/19560 | loss 3.315030 (+0.54z)| norm 0.2304 (+0.20z)| lr 6.52e-07 | 4126.82 ms | 32.7% bf16 MFU | 122479 tok/s step 19166/19560 | loss 3.358669 (+1.49z)| norm 0.2291 (+0.10z)| lr 6.49e-07 | 4128.51 ms | 32.7% bf16 MFU | 122705 tok/s step 19167/19560 | loss 3.244492 (-1.04z)| norm 
0.2241 (-0.27z)| lr 6.46e-07 | 4158.27 ms | 32.5% bf16 MFU | 122874 tok/s step 19168/19560 | loss 3.273135 (-0.39z)| norm 0.2382 (+0.80z)| lr 6.43e-07 | 4150.17 ms | 32.5% bf16 MFU | 123047 tok/s step 19169/19560 | loss 3.256622 (-0.78z)| norm 0.2301 (+0.21z)| lr 6.39e-07 | 4184.69 ms | 32.3% bf16 MFU | 123159 tok/s step 19170/19560 | loss 3.306828 (+0.36z)| norm 0.2215 (-0.46z)| lr 6.36e-07 | 4137.41 ms | 32.6% bf16 MFU | 123337 tok/s step 19171/19560 | loss 3.305912 (+0.34z)| norm 0.2359 (+0.64z)| lr 6.33e-07 | 4133.77 ms | 32.7% bf16 MFU | 123511 tok/s step 19172/19560 | loss 3.331147 (+0.90z)| norm 0.2200 (-0.58z)| lr 6.30e-07 | 4278.56 ms | 31.6% bf16 MFU | 123463 tok/s step 19173/19560 | loss 3.311169 (+0.44z)| norm 0.2270 (-0.05z)| lr 6.26e-07 | 4156.85 ms | 32.5% bf16 MFU | 123596 tok/s step 19174/19560 | loss 3.244606 (-1.05z)| norm 0.2281 (+0.03z)| lr 6.23e-07 | 4179.37 ms | 32.3% bf16 MFU | 123688 tok/s step 19175/19560 | loss 3.297345 (+0.13z)| norm 0.2170 (-0.81z)| lr 6.20e-07 | 4483.49 ms | 30.1% bf16 MFU | 123351 tok/s step 19176/19560 | loss 3.294822 (+0.08z)| norm 0.2500 (+1.68z)| lr 6.17e-07 | 4559.51 ms | 29.6% bf16 MFU | 122933 tok/s step 19177/19560 | loss 3.246707 (-1.00z)| norm 0.2222 (-0.42z)| lr 6.14e-07 | 4435.20 ms | 30.4% bf16 MFU | 122697 tok/s step 19178/19560 | loss 3.329003 (+0.85z)| norm 0.2210 (-0.52z)| lr 6.10e-07 | 4413.76 ms | 30.6% bf16 MFU | 122501 tok/s step 19179/19560 | loss 3.355269 (+1.42z)| norm 0.2222 (-0.43z)| lr 6.07e-07 | 4175.65 ms | 32.3% bf16 MFU | 122654 tok/s step 19180/19560 | loss 3.289411 (-0.06z)| norm 0.2233 (-0.34z)| lr 6.04e-07 | 4463.53 ms | 30.2% bf16 MFU | 122394 tok/s step 19181/19560 | loss 3.369756 (+1.71z)| norm 0.2698 (+3.04z)| lr 6.01e-07 | 4265.06 ms | 31.7% bf16 MFU | 122421 tok/s step 19182/19560 | loss 3.269876 (-0.51z)| norm 0.2227 (-0.40z)| lr 5.98e-07 | 4397.26 ms | 30.7% bf16 MFU | 122261 tok/s step 19183/19560 | loss 3.245579 (-1.04z)| norm 0.2496 (+1.54z)| lr 5.94e-07 | 4308.34 ms | 
31.3% bf16 MFU | 122233 tok/s step 19184/19560 | loss 3.232817 (-1.32z)| norm 0.2202 (-0.59z)| lr 5.91e-07 | 4434.28 ms | 30.4% bf16 MFU | 122033 tok/s step 19185/19560 | loss 3.247237 (-0.99z)| norm 0.2275 (-0.06z)| lr 5.88e-07 | 4262.31 ms | 31.7% bf16 MFU | 122082 tok/s step 19186/19560 | loss 3.284689 (-0.17z)| norm 0.2189 (-0.68z)| lr 5.85e-07 | 4237.45 ms | 31.9% bf16 MFU | 122164 tok/s step 19187/19560 | loss 3.326508 (+0.75z)| norm 0.2480 (+1.41z)| lr 5.82e-07 | 4212.72 ms | 32.0% bf16 MFU | 122278 tok/s step 19188/19560 | loss 3.299120 (+0.15z)| norm 0.2258 (-0.18z)| lr 5.79e-07 | 4213.78 ms | 32.0% bf16 MFU | 122385 tok/s step 19189/19560 | loss 3.214225 (-1.70z)| norm 0.2295 (+0.08z)| lr 5.76e-07 | 4182.67 ms | 32.3% bf16 MFU | 122534 tok/s step 19190/19560 | loss 3.262735 (-0.63z)| norm 0.2342 (+0.41z)| lr 5.73e-07 | 4218.61 ms | 32.0% bf16 MFU | 122621 tok/s step 19191/19560 | loss 3.329252 (+0.84z)| norm 0.2255 (-0.21z)| lr 5.70e-07 | 4139.70 ms | 32.6% bf16 MFU | 122822 tok/s step 19192/19560 | loss 3.228673 (-1.35z)| norm 0.2194 (-0.65z)| lr 5.67e-07 | 4139.78 ms | 32.6% bf16 MFU | 123014 tok/s step 19193/19560 | loss 3.302765 (+0.26z)| norm 0.2315 (+0.22z)| lr 5.63e-07 | 4146.00 ms | 32.6% bf16 MFU | 123186 tok/s step 19194/19560 | loss 3.323805 (+0.73z)| norm 0.2274 (-0.06z)| lr 5.60e-07 | 4139.82 ms | 32.6% bf16 MFU | 123359 tok/s step 19195/19560 | loss 3.299846 (+0.19z)| norm 0.2330 (+0.36z)| lr 5.57e-07 | 4139.63 ms | 32.6% bf16 MFU | 123523 tok/s step 19196/19560 | loss 3.241822 (-1.08z)| norm 0.2234 (-0.36z)| lr 5.54e-07 | 4185.88 ms | 32.3% bf16 MFU | 123610 tok/s step 19197/19560 | loss 3.325750 (+0.76z)| norm 0.2241 (-0.30z)| lr 5.51e-07 | 4144.18 ms | 32.6% bf16 MFU | 123755 tok/s step 19198/19560 | loss 3.440752 (+3.14z)| norm 0.2397 (+0.86z)| lr 5.48e-07 | 4140.41 ms | 32.6% bf16 MFU | 123898 tok/s step 19199/19560 | loss 3.297495 (+0.10z)| norm 0.2176 (-0.79z)| lr 5.45e-07 | 4144.10 ms | 32.6% bf16 MFU | 124029 tok/s step 19200/19560 
| loss 3.303170 (+0.22z)| norm 0.2269 (-0.09z)| lr 5.42e-07 | 4217.88 ms | 32.0% bf16 MFU | 124043 tok/s step 19201/19560 | loss 3.324852 (+0.67z)| norm 0.2247 (-0.26z)| lr 5.39e-07 | 4139.71 ms | 32.6% bf16 MFU | 124173 tok/s step 19202/19560 | loss 3.259549 (-0.71z)| norm 0.2237 (-0.33z)| lr 5.36e-07 | 4145.51 ms | 32.6% bf16 MFU | 124288 tok/s step 19203/19560 | loss 3.307312 (+0.30z)| norm 0.2218 (-0.47z)| lr 5.33e-07 | 4145.13 ms | 32.6% bf16 MFU | 124398 tok/s step 19204/19560 | loss 3.315672 (+0.48z)| norm 0.2257 (-0.18z)| lr 5.30e-07 | 4143.08 ms | 32.6% bf16 MFU | 124505 tok/s step 19205/19560 | loss 3.294244 (+0.06z)| norm 0.2194 (-0.64z)| lr 5.27e-07 | 4158.90 ms | 32.5% bf16 MFU | 124583 tok/s step 19206/19560 | loss 3.252476 (-0.89z)| norm 0.2256 (-0.17z)| lr 5.24e-07 | 4150.18 ms | 32.5% bf16 MFU | 124670 tok/s step 19207/19560 | loss 3.256895 (-0.78z)| norm 0.2331 (+0.39z)| lr 5.21e-07 | 4146.01 ms | 32.6% bf16 MFU | 124760 tok/s step 19208/19560 | loss 3.300657 (+0.21z)| norm 0.2208 (-0.55z)| lr 5.18e-07 | 4213.99 ms | 32.0% bf16 MFU | 124742 tok/s step 19209/19560 | loss 3.285007 (-0.14z)| norm 0.2221 (-0.43z)| lr 5.16e-07 | 4145.44 ms | 32.6% bf16 MFU | 124829 tok/s step 19210/19560 | loss 3.317560 (+0.59z)| norm 0.2396 (+0.92z)| lr 5.13e-07 | 4174.85 ms | 32.3% bf16 MFU | 124867 tok/s step 19211/19560 | loss 3.268081 (-0.54z)| norm 0.2289 (+0.08z)| lr 5.10e-07 | 4178.66 ms | 32.3% bf16 MFU | 124897 tok/s step 19212/19560 | loss 3.307459 (+0.36z)| norm 0.2225 (-0.40z)| lr 5.07e-07 | 4145.89 ms | 32.6% bf16 MFU | 124975 tok/s step 19213/19560 | loss 3.261340 (-0.70z)| norm 0.2304 (+0.22z)| lr 5.04e-07 | 4148.05 ms | 32.5% bf16 MFU | 125046 tok/s step 19214/19560 | loss 3.418869 (+2.81z)| norm 0.2706 (+3.19z)| lr 5.01e-07 | 4143.35 ms | 32.6% bf16 MFU | 125120 tok/s step 19215/19560 | loss 3.260683 (-0.70z)| norm 0.2391 (+0.82z)| lr 4.98e-07 | 4143.64 ms | 32.6% bf16 MFU | 125191 tok/s step 19216/19560 | loss 3.270193 (-0.48z)| norm 0.2236 (-0.33z)| 
lr 4.95e-07 | 4144.57 ms | 32.6% bf16 MFU | 125256 tok/s step 19217/19560 | loss 3.325544 (+0.83z)| norm 0.2278 (+0.06z)| lr 4.92e-07 | 4147.84 ms | 32.6% bf16 MFU | 125313 tok/s step 19218/19560 | loss 3.386312 (+2.20z)| norm 0.2371 (+0.93z)| lr 4.90e-07 | 4144.41 ms | 32.6% bf16 MFU | 125373 tok/s step 19219/19560 | loss 3.325552 (+0.78z)| norm 0.2149 (-1.18z)| lr 4.87e-07 | 4143.13 ms | 32.6% bf16 MFU | 125432 tok/s step 19220/19560 | loss 3.281954 (-0.21z)| norm 0.2244 (-0.27z)| lr 4.84e-07 | 4148.07 ms | 32.5% bf16 MFU | 125480 tok/s step 19221/19560 | loss 3.171114 (-2.69z)| norm 0.2249 (-0.22z)| lr 4.81e-07 | 4146.63 ms | 32.6% bf16 MFU | 125528 tok/s step 19222/19560 | loss 3.251699 (-0.86z)| norm 0.2253 (-0.19z)| lr 4.78e-07 | 4146.28 ms | 32.6% bf16 MFU | 125574 tok/s step 19223/19560 | loss 3.250102 (-0.89z)| norm 0.2226 (-0.44z)| lr 4.75e-07 | 4145.12 ms | 32.6% bf16 MFU | 125619 tok/s step 19224/19560 | loss 3.249024 (-0.92z)| norm 0.2217 (-0.53z)| lr 4.73e-07 | 4144.12 ms | 32.6% bf16 MFU | 125664 tok/s step 19225/19560 | loss 3.248682 (-0.92z)| norm 0.2191 (-0.78z)| lr 4.70e-07 | 4151.01 ms | 32.5% bf16 MFU | 125696 tok/s step 19226/19560 | loss 3.279695 (-0.22z)| norm 0.2223 (-0.47z)| lr 4.67e-07 | 4144.02 ms | 32.6% bf16 MFU | 125737 tok/s step 19227/19560 | loss 3.248109 (-0.93z)| norm 0.2261 (-0.11z)| lr 4.64e-07 | 4149.23 ms | 32.5% bf16 MFU | 125768 tok/s step 19228/19560 | loss 3.209598 (-1.77z)| norm 0.2203 (-0.67z)| lr 4.61e-07 | 4143.85 ms | 32.6% bf16 MFU | 125806 tok/s step 19229/19560 | loss 3.225881 (-1.39z)| norm 0.2190 (-0.79z)| lr 4.59e-07 | 4147.81 ms | 32.6% bf16 MFU | 125835 tok/s step 19230/19560 | loss 3.279618 (-0.18z)| norm 0.2188 (-0.79z)| lr 4.56e-07 | 4149.47 ms | 32.5% bf16 MFU | 125861 tok/s step 19231/19560 | loss 3.249154 (-0.85z)| norm 0.2168 (-0.97z)| lr 4.53e-07 | 4143.15 ms | 32.6% bf16 MFU | 125895 tok/s step 19232/19560 | loss 3.299702 (+0.27z)| norm 0.2260 (-0.09z)| lr 4.50e-07 | 4144.53 ms | 32.6% bf16 MFU | 
125925 tok/s step 19233/19560 | loss 3.217330 (-1.56z)| norm 0.2222 (-0.46z)| lr 4.48e-07 | 4153.69 ms | 32.5% bf16 MFU | 125940 tok/s step 19234/19560 | loss 3.272850 (-0.32z)| norm 0.2257 (-0.13z)| lr 4.45e-07 | 4145.85 ms | 32.6% bf16 MFU | 125966 tok/s step 19235/19560 | loss 3.306397 (+0.43z)| norm 0.2287 (+0.17z)| lr 4.42e-07 | 4147.61 ms | 32.6% bf16 MFU | 125988 tok/s step 19236/19560 | loss 3.259816 (-0.60z)| norm 0.2236 (-0.33z)| lr 4.40e-07 | 4145.41 ms | 32.6% bf16 MFU | 126013 tok/s step 19237/19560 | loss 3.306260 (+0.43z)| norm 0.2246 (-0.23z)| lr 4.37e-07 | 4146.74 ms | 32.6% bf16 MFU | 126034 tok/s step 19238/19560 | loss 3.288476 (+0.03z)| norm 0.2264 (-0.06z)| lr 4.34e-07 | 4147.82 ms | 32.6% bf16 MFU | 126052 tok/s step 19239/19560 | loss 3.220444 (-1.48z)| norm 0.2196 (-0.72z)| lr 4.31e-07 | 4145.77 ms | 32.6% bf16 MFU | 126073 tok/s step 19240/19560 | loss 3.208059 (-1.73z)| norm 0.2207 (-0.60z)| lr 4.29e-07 | 4146.12 ms | 32.6% bf16 MFU | 126092 tok/s step 19241/19560 | loss 3.239327 (-1.01z)| norm 0.2181 (-0.85z)| lr 4.26e-07 | 4147.08 ms | 32.6% bf16 MFU | 126108 tok/s step 19242/19560 | loss 3.253524 (-0.69z)| norm 0.2148 (-1.16z)| lr 4.23e-07 | 4149.00 ms | 32.5% bf16 MFU | 126121 tok/s step 19243/19560 | loss 3.292939 (+0.19z)| norm 0.2259 (-0.08z)| lr 4.21e-07 | 4147.56 ms | 32.6% bf16 MFU | 126135 tok/s step 19244/19560 | loss 3.237835 (-1.05z)| norm 0.2160 (-1.03z)| lr 4.18e-07 | 4146.89 ms | 32.6% bf16 MFU | 126150 tok/s step 19245/19560 | loss 3.311262 (+0.65z)| norm 0.2345 (+0.75z)| lr 4.16e-07 | 4149.53 ms | 32.5% bf16 MFU | 126160 tok/s step 19246/19560 | loss 3.323108 (+0.91z)| norm 0.2201 (-0.63z)| lr 4.13e-07 | 4144.33 ms | 32.6% bf16 MFU | 126177 tok/s step 19247/19560 | loss 3.316890 (+0.76z)| norm 0.2225 (-0.40z)| lr 4.10e-07 | 4146.47 ms | 32.6% bf16 MFU | 126191 tok/s step 19248/19560 | loss 3.315126 (+0.72z)| norm 0.2204 (-0.60z)| lr 4.08e-07 | 4147.10 ms | 32.6% bf16 MFU | 126202 tok/s step 19249/19560 | loss 3.304888 
(+0.49z)| norm 0.2340 (+0.70z)| lr 4.05e-07 | 4148.40 ms | 32.5% bf16 MFU | 126211 tok/s step 19250/19560 | loss 3.339709 (+1.28z)| norm 0.2338 (+0.67z)| lr 4.02e-07 | 4151.55 ms | 32.5% bf16 MFU | 126215 tok/s val loss 3.265536 evaluating HellaSwag: 0/1256 evaluating HellaSwag: 10/1256 evaluating HellaSwag: 20/1256 evaluating HellaSwag: 30/1256 evaluating HellaSwag: 40/1256 evaluating HellaSwag: 50/1256 evaluating HellaSwag: 60/1256 evaluating HellaSwag: 70/1256 evaluating HellaSwag: 80/1256 evaluating HellaSwag: 90/1256 evaluating HellaSwag: 100/1256 evaluating HellaSwag: 110/1256 evaluating HellaSwag: 120/1256 evaluating HellaSwag: 130/1256 evaluating HellaSwag: 140/1256 evaluating HellaSwag: 150/1256 evaluating HellaSwag: 160/1256 evaluating HellaSwag: 170/1256 evaluating HellaSwag: 180/1256 evaluating HellaSwag: 190/1256 evaluating HellaSwag: 200/1256 evaluating HellaSwag: 210/1256 evaluating HellaSwag: 220/1256 evaluating HellaSwag: 230/1256 evaluating HellaSwag: 240/1256 evaluating HellaSwag: 250/1256 evaluating HellaSwag: 260/1256 evaluating HellaSwag: 270/1256 evaluating HellaSwag: 280/1256 evaluating HellaSwag: 290/1256 evaluating HellaSwag: 300/1256 evaluating HellaSwag: 310/1256 evaluating HellaSwag: 320/1256 evaluating HellaSwag: 330/1256 evaluating HellaSwag: 340/1256 evaluating HellaSwag: 350/1256 evaluating HellaSwag: 360/1256 evaluating HellaSwag: 370/1256 evaluating HellaSwag: 380/1256 evaluating HellaSwag: 390/1256 evaluating HellaSwag: 400/1256 evaluating HellaSwag: 410/1256 evaluating HellaSwag: 420/1256 evaluating HellaSwag: 430/1256 evaluating HellaSwag: 440/1256 evaluating HellaSwag: 450/1256 evaluating HellaSwag: 460/1256 evaluating HellaSwag: 470/1256 evaluating HellaSwag: 480/1256 evaluating HellaSwag: 490/1256 evaluating HellaSwag: 500/1256 evaluating HellaSwag: 510/1256 evaluating HellaSwag: 520/1256 evaluating HellaSwag: 530/1256 evaluating HellaSwag: 540/1256 evaluating HellaSwag: 550/1256 evaluating HellaSwag: 560/1256 evaluating 
HellaSwag: 570/1256 evaluating HellaSwag: 580/1256 evaluating HellaSwag: 590/1256 evaluating HellaSwag: 600/1256 evaluating HellaSwag: 610/1256 evaluating HellaSwag: 620/1256 evaluating HellaSwag: 630/1256 evaluating HellaSwag: 640/1256 evaluating HellaSwag: 650/1256 evaluating HellaSwag: 660/1256 evaluating HellaSwag: 670/1256 evaluating HellaSwag: 680/1256 evaluating HellaSwag: 690/1256 evaluating HellaSwag: 700/1256 evaluating HellaSwag: 710/1256 evaluating HellaSwag: 720/1256 evaluating HellaSwag: 730/1256 evaluating HellaSwag: 740/1256 evaluating HellaSwag: 750/1256 evaluating HellaSwag: 760/1256 evaluating HellaSwag: 770/1256 evaluating HellaSwag: 780/1256 evaluating HellaSwag: 790/1256 evaluating HellaSwag: 800/1256 evaluating HellaSwag: 810/1256 evaluating HellaSwag: 820/1256 evaluating HellaSwag: 830/1256 evaluating HellaSwag: 840/1256 evaluating HellaSwag: 850/1256 evaluating HellaSwag: 860/1256 evaluating HellaSwag: 870/1256 evaluating HellaSwag: 880/1256 evaluating HellaSwag: 890/1256 evaluating HellaSwag: 900/1256 evaluating HellaSwag: 910/1256 evaluating HellaSwag: 920/1256 evaluating HellaSwag: 930/1256 evaluating HellaSwag: 940/1256 evaluating HellaSwag: 950/1256 evaluating HellaSwag: 960/1256 evaluating HellaSwag: 970/1256 evaluating HellaSwag: 980/1256 evaluating HellaSwag: 990/1256 evaluating HellaSwag: 1000/1256 evaluating HellaSwag: 1010/1256 evaluating HellaSwag: 1020/1256 evaluating HellaSwag: 1030/1256 evaluating HellaSwag: 1040/1256 evaluating HellaSwag: 1050/1256 evaluating HellaSwag: 1060/1256 evaluating HellaSwag: 1070/1256 evaluating HellaSwag: 1080/1256 evaluating HellaSwag: 1090/1256 evaluating HellaSwag: 1100/1256 evaluating HellaSwag: 1110/1256 evaluating HellaSwag: 1120/1256 evaluating HellaSwag: 1130/1256 evaluating HellaSwag: 1140/1256 evaluating HellaSwag: 1150/1256 evaluating HellaSwag: 1160/1256 evaluating HellaSwag: 1170/1256 evaluating HellaSwag: 1180/1256 evaluating HellaSwag: 1190/1256 evaluating HellaSwag: 1200/1256 
evaluating HellaSwag: 1210/1256 evaluating HellaSwag: 1220/1256 evaluating HellaSwag: 1230/1256 evaluating HellaSwag: 1240/1256 evaluating HellaSwag: 1250/1256 HellaSwag: 3029/10042 = 0.301633 step 19251/19560 | loss 3.262979 (-0.50z)| norm 0.2207 (-0.59z)| lr 4.00e-07 | 4144.50 ms | 32.6% bf16 MFU | 126229 tok/s step 19252/19560 | loss 3.316249 (+0.73z)| norm 0.2381 (+1.07z)| lr 3.97e-07 | 4141.88 ms | 32.6% bf16 MFU | 126247 tok/s step 19253/19560 | loss 3.270708 (-0.34z)| norm 0.2265 (-0.05z)| lr 3.95e-07 | 4144.11 ms | 32.6% bf16 MFU | 126260 tok/s step 19254/19560 | loss 3.265756 (-0.46z)| norm 0.2241 (-0.28z)| lr 3.92e-07 | 4145.22 ms | 32.6% bf16 MFU | 126271 tok/s step 19255/19560 | loss 3.247292 (-0.90z)| norm 0.2187 (-0.78z)| lr 3.90e-07 | 4142.12 ms | 32.6% bf16 MFU | 126287 tok/s step 19256/19560 | loss 3.294453 (+0.21z)| norm 0.2347 (+0.74z)| lr 3.87e-07 | 4144.39 ms | 32.6% bf16 MFU | 126298 tok/s step 19257/19560 | loss 3.269016 (-0.38z)| norm 0.2250 (-0.19z)| lr 3.85e-07 | 4142.98 ms | 32.6% bf16 MFU | 126310 tok/s step 19258/19560 | loss 3.273563 (-0.27z)| norm 0.2274 (+0.03z)| lr 3.82e-07 | 4146.73 ms | 32.6% bf16 MFU | 126316 tok/s step 19259/19560 | loss 3.317137 (+0.81z)| norm 0.2188 (-0.87z)| lr 3.80e-07 | 4147.24 ms | 32.6% bf16 MFU | 126321 tok/s step 19260/19560 | loss 3.299134 (+0.36z)| norm 0.2253 (-0.14z)| lr 3.77e-07 | 4146.55 ms | 32.6% bf16 MFU | 126327 tok/s step 19261/19560 | loss 3.287775 (+0.08z)| norm 0.2157 (-1.20z)| lr 3.75e-07 | 4148.59 ms | 32.5% bf16 MFU | 126330 tok/s step 19262/19560 | loss 3.253773 (-0.75z)| norm 0.2231 (-0.38z)| lr 3.72e-07 | 4147.26 ms | 32.6% bf16 MFU | 126334 tok/s step 19263/19560 | loss 3.233130 (-1.24z)| norm 0.2246 (-0.20z)| lr 3.70e-07 | 4145.63 ms | 32.6% bf16 MFU | 126341 tok/s step 19264/19560 | loss 3.235019 (-1.18z)| norm 0.2243 (-0.24z)| lr 3.67e-07 | 4146.87 ms | 32.6% bf16 MFU | 126345 tok/s step 19265/19560 | loss 3.296783 (+0.32z)| norm 0.2261 (-0.04z)| lr 3.65e-07 | 4148.62 ms | 32.5% 
bf16 MFU | 126347 tok/s step 19266/19560 | loss 3.268084 (-0.37z)| norm 0.2163 (-1.12z)| lr 3.62e-07 | 4147.28 ms | 32.6% bf16 MFU | 126350 tok/s step 19267/19560 | loss 3.268394 (-0.36z)| norm 0.2156 (-1.20z)| lr 3.60e-07 | 4148.43 ms | 32.5% bf16 MFU | 126352 tok/s step 19268/19560 | loss 3.292069 (+0.21z)| norm 0.2195 (-0.75z)| lr 3.57e-07 | 4147.71 ms | 32.6% bf16 MFU | 126355 tok/s step 19269/19560 | loss 3.284548 (+0.04z)| norm 0.2227 (-0.38z)| lr 3.55e-07 | 4149.85 ms | 32.5% bf16 MFU | 126354 tok/s step 19270/19560 | loss 3.359727 (+1.84z)| norm 0.2299 (+0.44z)| lr 3.52e-07 | 4142.07 ms | 32.6% bf16 MFU | 126365 tok/s step 19271/19560 | loss 3.335041 (+1.23z)| norm 0.2323 (+0.73z)| lr 3.50e-07 | 4144.12 ms | 32.6% bf16 MFU | 126372 tok/s step 19272/19560 | loss 3.334454 (+1.19z)| norm 0.2301 (+0.47z)| lr 3.48e-07 | 4145.23 ms | 32.6% bf16 MFU | 126378 tok/s step 19273/19560 | loss 3.275052 (-0.22z)| norm 0.2451 (+2.13z)| lr 3.45e-07 | 4142.64 ms | 32.6% bf16 MFU | 126387 tok/s step 19274/19560 | loss 3.241685 (-1.00z)| norm 0.2230 (-0.36z)| lr 3.43e-07 | 4146.25 ms | 32.6% bf16 MFU | 126390 tok/s step 19275/19560 | loss 3.235644 (-1.16z)| norm 0.2223 (-0.43z)| lr 3.40e-07 | 4145.21 ms | 32.6% bf16 MFU | 126394 tok/s step 19276/19560 | loss 3.242239 (-0.98z)| norm 0.2134 (-1.41z)| lr 3.38e-07 | 4145.26 ms | 32.6% bf16 MFU | 126399 tok/s step 19277/19560 | loss 3.289836 (+0.16z)| norm 0.2159 (-1.12z)| lr 3.36e-07 | 4147.95 ms | 32.6% bf16 MFU | 126399 tok/s step 19278/19560 | loss 3.217396 (-1.56z)| norm 0.2198 (-0.68z)| lr 3.33e-07 | 4146.90 ms | 32.6% bf16 MFU | 126400 tok/s step 19279/19560 | loss 3.254105 (-0.69z)| norm 0.2274 (+0.17z)| lr 3.31e-07 | 4150.04 ms | 32.5% bf16 MFU | 126397 tok/s step 19280/19560 | loss 3.317573 (+0.81z)| norm 0.2231 (-0.31z)| lr 3.29e-07 | 4146.23 ms | 32.6% bf16 MFU | 126399 tok/s step 19281/19560 | loss 3.274048 (-0.22z)| norm 0.2200 (-0.66z)| lr 3.26e-07 | 4149.30 ms | 32.5% bf16 MFU | 126397 tok/s step 19282/19560 | loss 
3.219226 (-1.49z)| norm 0.2242 (-0.19z)| lr 3.24e-07 | 4146.85 ms | 32.6% bf16 MFU | 126399 tok/s step 19283/19560 | loss 3.253062 (-0.71z)| norm 0.2291 (+0.35z)| lr 3.22e-07 | 4146.23 ms | 32.6% bf16 MFU | 126401 tok/s step 19284/19560 | loss 3.292785 (+0.22z)| norm 0.2305 (+0.50z)| lr 3.19e-07 | 4148.24 ms | 32.5% bf16 MFU | 126401 tok/s step 19285/19560 | loss 3.275348 (-0.20z)| norm 0.2237 (-0.27z)| lr 3.17e-07 | 4140.67 ms | 32.6% bf16 MFU | 126412 tok/s step 19286/19560 | loss 3.285100 (+0.04z)| norm 0.2181 (-0.89z)| lr 3.15e-07 | 4152.00 ms | 32.5% bf16 MFU | 126405 tok/s step 19287/19560 | loss 3.307374 (+0.57z)| norm 0.2239 (-0.25z)| lr 3.12e-07 | 4147.16 ms | 32.6% bf16 MFU | 126406 tok/s step 19288/19560 | loss 3.330853 (+1.11z)| norm 0.2277 (+0.18z)| lr 3.10e-07 | 4145.76 ms | 32.6% bf16 MFU | 126408 tok/s step 19289/19560 | loss 3.312695 (+0.68z)| norm 0.2287 (+0.29z)| lr 3.08e-07 | 4145.11 ms | 32.6% bf16 MFU | 126412 tok/s step 19290/19560 | loss 3.294608 (+0.24z)| norm 0.2272 (+0.11z)| lr 3.06e-07 | 4146.31 ms | 32.6% bf16 MFU | 126414 tok/s step 19291/19560 | loss 3.289636 (+0.12z)| norm 0.2221 (-0.46z)| lr 3.03e-07 | 4144.69 ms | 32.6% bf16 MFU | 126418 tok/s step 19292/19560 | loss 3.321918 (+0.90z)| norm 0.2195 (-0.76z)| lr 3.01e-07 | 4146.46 ms | 32.6% bf16 MFU | 126419 tok/s step 19293/19560 | loss 3.285941 (+0.04z)| norm 0.2204 (-0.64z)| lr 2.99e-07 | 4144.35 ms | 32.6% bf16 MFU | 126424 tok/s step 19294/19560 | loss 3.295257 (+0.28z)| norm 0.2157 (-1.16z)| lr 2.97e-07 | 4144.04 ms | 32.6% bf16 MFU | 126428 tok/s step 19295/19560 | loss 3.300916 (+0.41z)| norm 0.2501 (+2.60z)| lr 2.94e-07 | 4144.00 ms | 32.6% bf16 MFU | 126433 tok/s step 19296/19560 | loss 3.335732 (+1.24z)| norm 0.2212 (-0.53z)| lr 2.92e-07 | 4148.95 ms | 32.5% bf16 MFU | 126429 tok/s step 19297/19560 | loss 3.302065 (+0.41z)| norm 0.2240 (-0.22z)| lr 2.90e-07 | 4145.47 ms | 32.6% bf16 MFU | 126432 tok/s step 19298/19560 | loss 3.403021 (+2.75z)| norm 0.2775 (+5.01z)| lr 
2.88e-07 | 4150.21 ms | 32.5% bf16 MFU | 126426 tok/s step 19299/19560 | loss 3.301671 (+0.38z)| norm 0.2179 (-0.83z)| lr 2.86e-07 | 4148.67 ms | 32.5% bf16 MFU | 126424 tok/s step 19300/19560 | loss 3.270071 (-0.36z)| norm 0.2282 (+0.18z)| lr 2.83e-07 | 4148.08 ms | 32.5% bf16 MFU | 126422 tok/s step 19301/19560 | loss 3.287929 (+0.07z)| norm 0.2229 (-0.33z)| lr 2.81e-07 | 4143.27 ms | 32.6% bf16 MFU | 126428 tok/s step 19302/19560 | loss 3.302699 (+0.41z)| norm 0.2238 (-0.25z)| lr 2.79e-07 | 4145.35 ms | 32.6% bf16 MFU | 126431 tok/s step 19303/19560 | loss 3.301583 (+0.38z)| norm 0.2202 (-0.60z)| lr 2.77e-07 | 4144.98 ms | 32.6% bf16 MFU | 126433 tok/s step 19304/19560 | loss 3.291922 (+0.15z)| norm 0.2124 (-1.36z)| lr 2.75e-07 | 4150.36 ms | 32.5% bf16 MFU | 126428 tok/s step 19305/19560 | loss 3.282589 (-0.07z)| norm 0.2205 (-0.55z)| lr 2.73e-07 | 4149.46 ms | 32.5% bf16 MFU | 126424 tok/s step 19306/19560 | loss 3.274585 (-0.26z)| norm 0.2309 (+0.48z)| lr 2.71e-07 | 4147.58 ms | 32.6% bf16 MFU | 126423 tok/s step 19307/19560 | loss 3.269124 (-0.37z)| norm 0.2180 (-0.80z)| lr 2.68e-07 | 4146.08 ms | 32.6% bf16 MFU | 126425 tok/s step 19308/19560 | loss 3.279191 (-0.13z)| norm 0.2244 (-0.16z)| lr 2.66e-07 | 4148.33 ms | 32.5% bf16 MFU | 126423 tok/s step 19309/19560 | loss 3.279224 (-0.11z)| norm 0.2288 (+0.33z)| lr 2.64e-07 | 4145.40 ms | 32.6% bf16 MFU | 126425 tok/s step 19310/19560 | loss 3.239949 (-1.07z)| norm 0.2282 (+0.26z)| lr 2.62e-07 | 4148.13 ms | 32.5% bf16 MFU | 126424 tok/s step 19311/19560 | loss 3.360437 (+1.84z)| norm 0.2410 (+1.68z)| lr 2.60e-07 | 4147.45 ms | 32.6% bf16 MFU | 126423 tok/s step 19312/19560 | loss 3.317478 (+0.78z)| norm 0.2233 (-0.27z)| lr 2.58e-07 | 4145.05 ms | 32.6% bf16 MFU | 126426 tok/s step 19313/19560 | loss 3.293788 (+0.20z)| norm 0.2226 (-0.34z)| lr 2.56e-07 | 4150.05 ms | 32.5% bf16 MFU | 126422 tok/s step 19314/19560 | loss 3.233737 (-1.24z)| norm 0.2205 (-0.58z)| lr 2.54e-07 | 4150.19 ms | 32.5% bf16 MFU | 126417 
tok/s step 19315/19560 | loss 3.233610 (-1.23z)| norm 0.2322 (+0.74z)| lr 2.52e-07 | 4146.72 ms | 32.6% bf16 MFU | 126418 tok/s step 19316/19560 | loss 3.293276 (+0.22z)| norm 0.2248 (-0.09z)| lr 2.50e-07 | 4147.03 ms | 32.6% bf16 MFU | 126418 tok/s step 19317/19560 | loss 3.304794 (+0.48z)| norm 0.2207 (-0.54z)| lr 2.48e-07 | 4149.66 ms | 32.5% bf16 MFU | 126414 tok/s step 19318/19560 | loss 3.316178 (+0.75z)| norm 0.2356 (+1.13z)| lr 2.46e-07 | 4148.83 ms | 32.5% bf16 MFU | 126412 tok/s step 19319/19560 | loss 3.265035 (-0.49z)| norm 0.2234 (-0.24z)| lr 2.44e-07 | 4147.04 ms | 32.6% bf16 MFU | 126413 tok/s step 19320/19560 | loss 3.202848 (-1.99z)| norm 0.2194 (-0.69z)| lr 2.42e-07 | 4146.37 ms | 32.6% bf16 MFU | 126414 tok/s step 19321/19560 | loss 3.260072 (-0.59z)| norm 0.2219 (-0.40z)| lr 2.40e-07 | 4143.39 ms | 32.6% bf16 MFU | 126421 tok/s step 19322/19560 | loss 3.315116 (+0.75z)| norm 0.2360 (+1.17z)| lr 2.38e-07 | 4142.96 ms | 32.6% bf16 MFU | 126427 tok/s step 19323/19560 | loss 3.251195 (-0.80z)| norm 0.2211 (-0.49z)| lr 2.36e-07 | 4144.61 ms | 32.6% bf16 MFU | 126431 tok/s step 19324/19560 | loss 3.362978 (+1.87z)| norm 0.2318 (+0.70z)| lr 2.34e-07 | 4148.79 ms | 32.5% bf16 MFU | 126428 tok/s step 19325/19560 | loss 3.277111 (-0.18z)| norm 0.2228 (-0.30z)| lr 2.32e-07 | 4148.67 ms | 32.5% bf16 MFU | 126425 tok/s step 19326/19560 | loss 3.292176 (+0.22z)| norm 0.2206 (-0.53z)| lr 2.30e-07 | 4146.62 ms | 32.6% bf16 MFU | 126426 tok/s step 19327/19560 | loss 3.273131 (-0.26z)| norm 0.2144 (-1.22z)| lr 2.28e-07 | 4147.43 ms | 32.6% bf16 MFU | 126425 tok/s step 19328/19560 | loss 3.317878 (+0.88z)| norm 0.2484 (+2.52z)| lr 2.26e-07 | 4145.47 ms | 32.6% bf16 MFU | 126427 tok/s step 19329/19560 | loss 3.285276 (+0.06z)| norm 0.2220 (-0.38z)| lr 2.24e-07 | 4145.87 ms | 32.6% bf16 MFU | 126429 tok/s step 19330/19560 | loss 3.263058 (-0.51z)| norm 0.2293 (+0.42z)| lr 2.22e-07 | 4146.81 ms | 32.6% bf16 MFU | 126429 tok/s step 19331/19560 | loss 3.272075 
(-0.27z)| norm 0.2283 (+0.30z)| lr 2.20e-07 | 4147.21 ms | 32.6% bf16 MFU | 126429 tok/s step 19332/19560 | loss 3.243242 (-1.00z)| norm 0.2201 (-0.59z)| lr 2.18e-07 | 4147.87 ms | 32.6% bf16 MFU | 126427 tok/s step 19333/19560 | loss 3.286733 (+0.12z)| norm 0.2203 (-0.57z)| lr 2.16e-07 | 4147.91 ms | 32.6% bf16 MFU | 126426 tok/s step 19334/19560 | loss 3.258568 (-0.61z)| norm 0.2200 (-0.59z)| lr 2.14e-07 | 4144.39 ms | 32.6% bf16 MFU | 126430 tok/s step 19335/19560 | loss 3.258155 (-0.62z)| norm 0.2249 (-0.05z)| lr 2.13e-07 | 4141.55 ms | 32.6% bf16 MFU | 126438 tok/s step 19336/19560 | loss 3.307616 (+0.65z)| norm 0.2274 (+0.22z)| lr 2.11e-07 | 4145.65 ms | 32.6% bf16 MFU | 126439 tok/s step 19337/19560 | loss 3.248761 (-0.85z)| norm 0.2280 (+0.27z)| lr 2.09e-07 | 4146.28 ms | 32.6% bf16 MFU | 126440 tok/s step 19338/19560 | loss 3.295767 (+0.36z)| norm 0.2228 (-0.28z)| lr 2.07e-07 | 4148.91 ms | 32.5% bf16 MFU | 126436 tok/s step 19339/19560 | loss 3.461036 (+4.23z)| norm 0.2394 (+1.54z)| lr 2.05e-07 | 4146.00 ms | 32.6% bf16 MFU | 126437 tok/s step 19340/19560 | loss 3.262478 (-0.49z)| norm 0.2220 (-0.38z)| lr 2.03e-07 | 4165.08 ms | 32.4% bf16 MFU | 126409 tok/s step 19341/19560 | loss 3.324337 (+0.97z)| norm 0.2230 (-0.26z)| lr 2.01e-07 | 4119.29 ms | 32.8% bf16 MFU | 126452 tok/s step 19342/19560 | loss 3.248218 (-0.84z)| norm 0.2176 (-0.89z)| lr 2.00e-07 | 4120.64 ms | 32.8% bf16 MFU | 126492 tok/s step 19343/19560 | loss 3.302280 (+0.49z)| norm 0.2213 (-0.44z)| lr 1.98e-07 | 4125.48 ms | 32.7% bf16 MFU | 126521 tok/s step 19344/19560 | loss 3.239369 (-1.05z)| norm 0.2209 (-0.48z)| lr 1.96e-07 | 4126.98 ms | 32.7% bf16 MFU | 126547 tok/s step 19345/19560 | loss 3.231123 (-1.24z)| norm 0.2204 (-0.54z)| lr 1.94e-07 | 4127.60 ms | 32.7% bf16 MFU | 126571 tok/s step 19346/19560 | loss 3.317345 (+0.92z)| norm 0.2166 (-0.99z)| lr 1.92e-07 | 4134.21 ms | 32.7% bf16 MFU | 126583 tok/s step 19347/19560 | loss 3.235884 (-1.11z)| norm 0.2239 (-0.10z)| lr 1.91e-07 | 
4130.11 ms | 32.7% bf16 MFU | 126601 tok/s step 19348/19560 | loss 3.300197 (+0.50z)| norm 0.2276 (+0.37z)| lr 1.89e-07 | 4132.83 ms | 32.7% bf16 MFU | 126614 tok/s step 19349/19560 | loss 3.305985 (+0.63z)| norm 0.2176 (-0.87z)| lr 1.87e-07 | 4150.09 ms | 32.5% bf16 MFU | 126600 tok/s step 19350/19560 | loss 3.305664 (+0.62z)| norm 0.2282 (+0.44z)| lr 1.85e-07 | 4136.15 ms | 32.6% bf16 MFU | 126608 tok/s step 19351/19560 | loss 3.256040 (-0.67z)| norm 0.2218 (-0.35z)| lr 1.84e-07 | 4133.39 ms | 32.7% bf16 MFU | 126619 tok/s step 19352/19560 | loss 3.256460 (-0.66z)| norm 0.2209 (-0.46z)| lr 1.82e-07 | 4136.76 ms | 32.6% bf16 MFU | 126625 tok/s step 19353/19560 | loss 3.263779 (-0.48z)| norm 0.2245 (-0.02z)| lr 1.80e-07 | 4131.60 ms | 32.7% bf16 MFU | 126639 tok/s step 19354/19560 | loss 3.247232 (-0.90z)| norm 0.2187 (-0.74z)| lr 1.78e-07 | 4189.63 ms | 32.2% bf16 MFU | 126564 tok/s step 19355/19560 | loss 3.330187 (+1.23z)| norm 0.2313 (+0.82z)| lr 1.77e-07 | 4132.81 ms | 32.7% bf16 MFU | 126579 tok/s step 19356/19560 | loss 3.260939 (-0.57z)| norm 0.2224 (-0.29z)| lr 1.75e-07 | 4135.28 ms | 32.7% bf16 MFU | 126589 tok/s step 19357/19560 | loss 3.386736 (+2.63z)| norm 0.2547 (+3.51z)| lr 1.73e-07 | 4136.93 ms | 32.6% bf16 MFU | 126596 tok/s step 19358/19560 | loss 3.302582 (+0.47z)| norm 0.2270 (+0.23z)| lr 1.72e-07 | 4134.02 ms | 32.7% bf16 MFU | 126608 tok/s step 19359/19560 | loss 3.255835 (-0.73z)| norm 0.2229 (-0.26z)| lr 1.70e-07 | 4136.71 ms | 32.6% bf16 MFU | 126614 tok/s step 19360/19560 | loss 3.302147 (+0.46z)| norm 0.2171 (-0.94z)| lr 1.68e-07 | 4133.34 ms | 32.7% bf16 MFU | 126626 tok/s step 19361/19560 | loss 3.272001 (-0.33z)| norm 0.2365 (+1.34z)| lr 1.66e-07 | 4136.01 ms | 32.6% bf16 MFU | 126633 tok/s step 19362/19560 | loss 3.344641 (+1.52z)| norm 0.2333 (+0.94z)| lr 1.65e-07 | 4135.42 ms | 32.6% bf16 MFU | 126640 tok/s step 19363/19560 | loss 3.246180 (-0.99z)| norm 0.2258 (+0.07z)| lr 1.63e-07 | 4137.28 ms | 32.6% bf16 MFU | 126644 tok/s step 
19364/19560 | loss 3.221751 (-1.60z)| norm 0.2166 (-1.00z)| lr 1.62e-07 | 4137.36 ms | 32.6% bf16 MFU | 126648 tok/s step 19365/19560 | loss 3.324148 (+1.00z)| norm 0.2389 (+1.58z)| lr 1.60e-07 | 4288.99 ms | 31.5% bf16 MFU | 126427 tok/s step 19366/19560 | loss 3.290628 (+0.15z)| norm 0.2362 (+1.26z)| lr 1.58e-07 | 5220.84 ms | 25.9% bf16 MFU | 125127 tok/s step 19367/19560 | loss 3.278882 (-0.16z)| norm 0.2190 (-0.72z)| lr 1.57e-07 | 4343.99 ms | 31.1% bf16 MFU | 124905 tok/s step 19368/19560 | loss 3.290109 (+0.11z)| norm 0.2246 (-0.09z)| lr 1.55e-07 | 4367.66 ms | 30.9% bf16 MFU | 124662 tok/s step 19369/19560 | loss 3.354742 (+1.76z)| norm 0.2323 (+0.79z)| lr 1.53e-07 | 4359.02 ms | 31.0% bf16 MFU | 124443 tok/s step 19370/19560 | loss 3.354967 (+1.73z)| norm 0.2670 (+4.38z)| lr 1.52e-07 | 4532.89 ms | 29.8% bf16 MFU | 124004 tok/s step 19371/19560 | loss 3.338009 (+1.28z)| norm 0.2277 (+0.19z)| lr 1.50e-07 | 4607.22 ms | 29.3% bf16 MFU | 123494 tok/s step 19372/19560 | loss 3.289727 (+0.04z)| norm 0.2309 (+0.52z)| lr 1.49e-07 | 4217.00 ms | 32.0% bf16 MFU | 123535 tok/s step 19373/19560 | loss 3.288610 (+0.01z)| norm 0.2158 (-1.07z)| lr 1.47e-07 | 4218.61 ms | 32.0% bf16 MFU | 123572 tok/s step 19374/19560 | loss 3.288109 (+0.01z)| norm 0.2191 (-0.72z)| lr 1.46e-07 | 4761.10 ms | 28.4% bf16 MFU | 122900 tok/s step 19375/19560 | loss 3.330444 (+1.09z)| norm 0.2222 (-0.39z)| lr 1.44e-07 | 4316.23 ms | 31.3% bf16 MFU | 122828 tok/s step 19376/19560 | loss 3.298489 (+0.27z)| norm 0.2330 (+0.75z)| lr 1.42e-07 | 4348.30 ms | 31.1% bf16 MFU | 122715 tok/s step 19377/19560 | loss 3.271898 (-0.40z)| norm 0.2182 (-0.81z)| lr 1.41e-07 | 4137.50 ms | 32.6% bf16 MFU | 122916 tok/s step 19378/19560 | loss 3.311100 (+0.61z)| norm 0.2216 (-0.44z)| lr 1.39e-07 | 4150.22 ms | 32.5% bf16 MFU | 123086 tok/s step 19379/19560 | loss 3.357129 (+1.77z)| norm 0.2243 (-0.16z)| lr 1.38e-07 | 4338.63 ms | 31.1% bf16 MFU | 122974 tok/s step 19380/19560 | loss 3.277760 (-0.26z)| norm 
0.2291 (+0.37z)| lr 1.36e-07 | 4152.71 ms | 32.5% bf16 MFU | 123138 tok/s step 19381/19560 | loss 3.265075 (-0.58z)| norm 0.2234 (-0.24z)| lr 1.35e-07 | 4154.87 ms | 32.5% bf16 MFU | 123290 tok/s step 19382/19560 | loss 3.297615 (+0.25z)| norm 0.2227 (-0.32z)| lr 1.33e-07 | 4170.66 ms | 32.4% bf16 MFU | 123411 tok/s step 19383/19560 | loss 3.292032 (+0.10z)| norm 0.2435 (+1.88z)| lr 1.32e-07 | 4349.83 ms | 31.0% bf16 MFU | 123267 tok/s step 19384/19560 | loss 3.298992 (+0.27z)| norm 0.2198 (-0.63z)| lr 1.30e-07 | 4275.85 ms | 31.6% bf16 MFU | 123235 tok/s step 19385/19560 | loss 3.335718 (+1.20z)| norm 0.2321 (+0.67z)| lr 1.29e-07 | 4150.57 ms | 32.5% bf16 MFU | 123389 tok/s step 19386/19560 | loss 3.224449 (-1.63z)| norm 0.2186 (-0.76z)| lr 1.27e-07 | 4151.54 ms | 32.5% bf16 MFU | 123534 tok/s step 19387/19560 | loss 3.272643 (-0.39z)| norm 0.2124 (-1.40z)| lr 1.26e-07 | 4161.23 ms | 32.4% bf16 MFU | 123657 tok/s step 19388/19560 | loss 3.319851 (+0.80z)| norm 0.2211 (-0.47z)| lr 1.25e-07 | 4161.40 ms | 32.4% bf16 MFU | 123773 tok/s step 19389/19560 | loss 3.296312 (+0.20z)| norm 0.2229 (-0.29z)| lr 1.23e-07 | 4219.22 ms | 32.0% bf16 MFU | 123798 tok/s step 19390/19560 | loss 3.263175 (-0.64z)| norm 0.2226 (-0.32z)| lr 1.22e-07 | 4631.05 ms | 29.2% bf16 MFU | 123268 tok/s step 19391/19560 | loss 3.327416 (+0.97z)| norm 0.2211 (-0.49z)| lr 1.20e-07 | 4315.66 ms | 31.3% bf16 MFU | 123179 tok/s step 19392/19560 | loss 3.308941 (+0.49z)| norm 0.2469 (+2.18z)| lr 1.19e-07 | 4279.89 ms | 31.5% bf16 MFU | 123145 tok/s step 19393/19560 | loss 3.340513 (+1.29z)| norm 0.2182 (-0.78z)| lr 1.17e-07 | 4160.90 ms | 32.4% bf16 MFU | 123288 tok/s step 19394/19560 | loss 3.328610 (+0.97z)| norm 0.2208 (-0.51z)| lr 1.16e-07 | 4161.77 ms | 32.4% bf16 MFU | 123423 tok/s step 19395/19560 | loss 3.341454 (+1.27z)| norm 0.2262 (+0.04z)| lr 1.15e-07 | 4312.27 ms | 31.3% bf16 MFU | 123330 tok/s step 19396/19560 | loss 3.338670 (+1.19z)| norm 0.2245 (-0.15z)| lr 1.13e-07 | 4151.44 ms | 
32.5% bf16 MFU | 123478 tok/s step 19397/19560 | loss 3.325979 (+0.86z)| norm 0.2200 (-0.61z)| lr 1.12e-07 | 4154.22 ms | 32.5% bf16 MFU | 123615 tok/s step 19398/19560 | loss 3.263506 (-0.70z)| norm 0.2193 (-0.68z)| lr 1.11e-07 | 4165.79 ms | 32.4% bf16 MFU | 123727 tok/s step 19399/19560 | loss 3.303658 (+0.33z)| norm 0.2244 (-0.14z)| lr 1.09e-07 | 4391.74 ms | 30.7% bf16 MFU | 123510 tok/s step 19400/19560 | loss 3.333941 (+1.10z)| norm 0.2219 (-0.40z)| lr 1.08e-07 | 4154.87 ms | 32.5% bf16 MFU | 123643 tok/s step 19401/19560 | loss 3.339285 (+1.21z)| norm 0.2163 (-0.97z)| lr 1.07e-07 | 4156.80 ms | 32.5% bf16 MFU | 123768 tok/s step 19402/19560 | loss 3.335420 (+1.10z)| norm 0.2240 (-0.16z)| lr 1.05e-07 | 4161.22 ms | 32.4% bf16 MFU | 123879 tok/s step 19403/19560 | loss 3.287971 (-0.12z)| norm 0.2216 (-0.41z)| lr 1.04e-07 | 4158.02 ms | 32.5% bf16 MFU | 123990 tok/s step 19404/19560 | loss 3.301584 (+0.22z)| norm 0.2270 (+0.15z)| lr 1.03e-07 | 4154.28 ms | 32.5% bf16 MFU | 124100 tok/s step 19405/19560 | loss 3.363583 (+1.78z)| norm 0.4509 (+10.19z)| lr 1.01e-07 | 4149.66 ms | 32.5% bf16 MFU | 124213 tok/s step 19406/19560 | loss 3.275866 (-0.47z)| norm 0.2179 (-0.43z)| lr 1.00e-07 | 4154.51 ms | 32.5% bf16 MFU | 124312 tok/s step 19407/19560 | loss 3.289921 (-0.11z)| norm 0.2193 (-0.37z)| lr 9.87e-08 | 4152.30 ms | 32.5% bf16 MFU | 124409 tok/s step 19408/19560 | loss 3.315030 (+0.54z)| norm 0.2250 (-0.11z)| lr 9.74e-08 | 4157.65 ms | 32.5% bf16 MFU | 124494 tok/s step 19409/19560 | loss 3.382442 (+2.22z)| norm 0.3451 (+4.81z)| lr 9.61e-08 | 4486.80 ms | 30.1% bf16 MFU | 124112 tok/s step 19410/19560 | loss 3.347817 (+1.32z)| norm 0.2315 (+0.13z)| lr 9.49e-08 | 4163.72 ms | 32.4% bf16 MFU | 124202 tok/s step 19411/19560 | loss 3.330116 (+0.86z)| norm 0.2321 (+0.15z)| lr 9.36e-08 | 4159.80 ms | 32.5% bf16 MFU | 124294 tok/s step 19412/19560 | loss 3.320959 (+0.62z)| norm 0.2306 (+0.09z)| lr 9.24e-08 | 4143.45 ms | 32.6% bf16 MFU | 124406 tok/s step 19413/19560 
| loss 3.235019 (-1.56z)| norm 0.2243 (-0.17z)| lr 9.12e-08 | 4191.30 ms | 32.2% bf16 MFU | 124440 tok/s step 19414/19560 | loss 3.278430 (-0.46z)| norm 0.2222 (-0.26z)| lr 8.99e-08 | 4454.21 ms | 30.3% bf16 MFU | 124103 tok/s step 19415/19560 | loss 3.303556 (+0.18z)| norm 0.2260 (-0.10z)| lr 8.87e-08 | 4171.81 ms | 32.4% bf16 MFU | 124182 tok/s step 19416/19560 | loss 3.289937 (-0.16z)| norm 0.2240 (-0.18z)| lr 8.75e-08 | 4157.26 ms | 32.5% bf16 MFU | 124279 tok/s step 19417/19560 | loss 3.365551 (+1.74z)| norm 0.2333 (+0.20z)| lr 8.63e-08 | 4212.50 ms | 32.1% bf16 MFU | 124288 tok/s step 19418/19560 | loss 3.346860 (+1.25z)| norm 0.2283 (-0.01z)| lr 8.51e-08 | 4149.20 ms | 32.5% bf16 MFU | 124391 tok/s step 19419/19560 | loss 3.278879 (-0.45z)| norm 0.2195 (-0.37z)| lr 8.39e-08 | 4168.27 ms | 32.4% bf16 MFU | 124461 tok/s step 19420/19560 | loss 3.349084 (+1.29z)| norm 0.2273 (-0.05z)| lr 8.27e-08 | 4155.25 ms | 32.5% bf16 MFU | 124546 tok/s step 19421/19560 | loss 3.356798 (+1.46z)| norm 0.2283 (-0.01z)| lr 8.16e-08 | 4163.30 ms | 32.4% bf16 MFU | 124616 tok/s step 19422/19560 | loss 3.333001 (+0.86z)| norm 0.2342 (+0.22z)| lr 8.04e-08 | 4166.93 ms | 32.4% bf16 MFU | 124676 tok/s step 19423/19560 | loss 3.270658 (-0.67z)| norm 0.2256 (-0.12z)| lr 7.93e-08 | 4157.11 ms | 32.5% bf16 MFU | 124748 tok/s step 19424/19560 | loss 3.294021 (-0.08z)| norm 0.2203 (-0.34z)| lr 7.81e-08 | 4153.86 ms | 32.5% bf16 MFU | 124821 tok/s step 19425/19560 | loss 3.293722 (-0.09z)| norm 0.2207 (-0.32z)| lr 7.70e-08 | 4163.93 ms | 32.4% bf16 MFU | 124876 tok/s step 19426/19560 | loss 3.295870 (-0.02z)| norm 0.2271 (-0.04z)| lr 7.59e-08 | 4173.10 ms | 32.4% bf16 MFU | 124914 tok/s step 19427/19560 | loss 3.296381 (-0.00z)| norm 0.2404 (+0.51z)| lr 7.47e-08 | 4144.97 ms | 32.6% bf16 MFU | 124993 tok/s step 19428/19560 | loss 3.336352 (+0.99z)| norm 0.2275 (-0.03z)| lr 7.36e-08 | 4162.16 ms | 32.4% bf16 MFU | 125041 tok/s step 19429/19560 | loss 3.304166 (+0.18z)| norm 0.2229 (-0.23z)| 
lr 7.25e-08 | 4152.36 ms | 32.5% bf16 MFU | 125102 tok/s step 19430/19560 | loss 3.396490 (+2.43z)| norm 0.2321 (+0.16z)| lr 7.14e-08 | 4153.94 ms | 32.5% bf16 MFU | 125158 tok/s step 19431/19560 | loss 3.304197 (+0.16z)| norm 0.2249 (-0.15z)| lr 7.03e-08 | 4368.93 ms | 30.9% bf16 MFU | 124900 tok/s step 19432/19560 | loss 3.359845 (+1.50z)| norm 0.2447 (+0.68z)| lr 6.93e-08 | 4167.02 ms | 32.4% bf16 MFU | 124946 tok/s step 19433/19560 | loss 3.321151 (+0.55z)| norm 0.2256 (-0.13z)| lr 6.82e-08 | 4174.92 ms | 32.3% bf16 MFU | 124978 tok/s step 19434/19560 | loss 3.285682 (-0.32z)| norm 0.2247 (-0.17z)| lr 6.71e-08 | 4157.40 ms | 32.5% bf16 MFU | 125034 tok/s step 19435/19560 | loss 3.234777 (-1.55z)| norm 0.2197 (-0.37z)| lr 6.61e-08 | 4176.85 ms | 32.3% bf16 MFU | 125059 tok/s step 19436/19560 | loss 3.358322 (+1.43z)| norm 0.2281 (-0.02z)| lr 6.50e-08 | 4161.79 ms | 32.4% bf16 MFU | 125105 tok/s step 19437/19560 | loss 3.292911 (-0.15z)| norm 0.2278 (-0.03z)| lr 6.40e-08 | 4163.19 ms | 32.4% bf16 MFU | 125146 tok/s step 19438/19560 | loss 3.322207 (+0.54z)| norm 0.2239 (-0.20z)| lr 6.30e-08 | 4152.84 ms | 32.5% bf16 MFU | 125201 tok/s step 19439/19560 | loss 3.304624 (+0.13z)| norm 0.2247 (-0.16z)| lr 6.19e-08 | 4157.64 ms | 32.5% bf16 MFU | 125246 tok/s step 19440/19560 | loss 3.386071 (+2.08z)| norm 0.2299 (+0.06z)| lr 6.09e-08 | 4165.62 ms | 32.4% bf16 MFU | 125277 tok/s step 19441/19560 | loss 3.281218 (-0.45z)| norm 0.2271 (-0.06z)| lr 5.99e-08 | 4162.07 ms | 32.4% bf16 MFU | 125312 tok/s step 19442/19560 | loss 3.369527 (+1.65z)| norm 0.2262 (-0.10z)| lr 5.89e-08 | 4170.80 ms | 32.4% bf16 MFU | 125331 tok/s step 19443/19560 | loss 3.294086 (-0.18z)| norm 0.2236 (-0.21z)| lr 5.80e-08 | 4154.24 ms | 32.5% bf16 MFU | 125375 tok/s step 19444/19560 | loss 3.314298 (+0.31z)| norm 0.2783 (+2.05z)| lr 5.70e-08 | 4170.98 ms | 32.4% bf16 MFU | 125391 tok/s step 19445/19560 | loss 3.355879 (+1.30z)| norm 0.2246 (-0.18z)| lr 5.60e-08 | 4160.44 ms | 32.5% bf16 MFU | 
125422 tok/s step 19446/19560 | loss 3.406746 (+2.45z)| norm 0.2247 (-0.18z)| lr 5.50e-08 | 4165.11 ms | 32.4% bf16 MFU | 125445 tok/s step 19447/19560 | loss 3.285357 (-0.41z)| norm 0.2256 (-0.14z)| lr 5.41e-08 | 4175.25 ms | 32.3% bf16 MFU | 125451 tok/s step 19448/19560 | loss 3.296722 (-0.17z)| norm 0.2240 (-0.21z)| lr 5.31e-08 | 4152.22 ms | 32.5% bf16 MFU | 125492 tok/s step 19449/19560 | loss 3.278927 (-0.60z)| norm 0.2556 (+1.09z)| lr 5.22e-08 | 4178.55 ms | 32.3% bf16 MFU | 125491 tok/s step 19450/19560 | loss 3.287210 (-0.40z)| norm 0.2175 (-0.48z)| lr 5.13e-08 | 4161.75 ms | 32.4% bf16 MFU | 125515 tok/s step 19451/19560 | loss 3.375124 (+1.70z)| norm 0.2410 (+0.49z)| lr 5.04e-08 | 4170.38 ms | 32.4% bf16 MFU | 125526 tok/s step 19452/19560 | loss 3.250749 (-1.28z)| norm 0.2254 (-0.16z)| lr 4.94e-08 | 4197.67 ms | 32.2% bf16 MFU | 125494 tok/s step 19453/19560 | loss 3.241570 (-1.48z)| norm 0.2401 (+0.45z)| lr 4.85e-08 | 4154.52 ms | 32.5% bf16 MFU | 125529 tok/s step 19454/19560 | loss 3.244146 (-1.40z)| norm 0.2198 (-0.39z)| lr 4.77e-08 | 4155.49 ms | 32.5% bf16 MFU | 125561 tok/s step 19455/19560 | loss 3.302911 (-0.01z)| norm 0.2191 (-0.42z)| lr 4.68e-08 | 4168.51 ms | 32.4% bf16 MFU | 125572 tok/s step 19456/19560 | loss 3.278075 (-0.59z)| norm 0.2193 (-0.41z)| lr 4.59e-08 | 4162.15 ms | 32.4% bf16 MFU | 125592 tok/s step 19457/19560 | loss 3.280809 (-0.53z)| norm 0.2177 (-0.47z)| lr 4.50e-08 | 4150.34 ms | 32.5% bf16 MFU | 125628 tok/s step 19458/19560 | loss 3.417113 (+2.63z)| norm 0.2362 (+0.29z)| lr 4.41e-08 | 4150.80 ms | 32.5% bf16 MFU | 125662 tok/s step 19459/19560 | loss 3.243100 (-1.40z)| norm 0.2193 (-0.40z)| lr 4.33e-08 | 4339.69 ms | 31.1% bf16 MFU | 125420 tok/s step 19460/19560 | loss 3.276829 (-0.63z)| norm 0.2272 (-0.08z)| lr 4.25e-08 | 4173.14 ms | 32.4% bf16 MFU | 125431 tok/s step 19461/19560 | loss 3.322766 (+0.43z)| norm 0.2217 (-0.31z)| lr 4.16e-08 | 4285.66 ms | 31.5% bf16 MFU | 125276 tok/s step 19462/19560 | loss 3.347760 
(+0.99z)| norm 0.2262 (-0.12z)| lr 4.08e-08 | 4160.60 ms | 32.5% bf16 MFU | 125313 tok/s step 19463/19560 | loss 3.269309 (-0.84z)| norm 0.2236 (-0.23z)| lr 4.00e-08 | 4161.39 ms | 32.4% bf16 MFU | 125346 tok/s step 19464/19560 | loss 3.300101 (-0.12z)| norm 0.2295 (+0.01z)| lr 3.92e-08 | 4153.73 ms | 32.5% bf16 MFU | 125390 tok/s step 19465/19560 | loss 3.391902 (+1.98z)| norm 0.3245 (+3.70z)| lr 3.84e-08 | 4153.30 ms | 32.5% bf16 MFU | 125432 tok/s step 19466/19560 | loss 3.412406 (+2.38z)| norm 0.2421 (+0.47z)| lr 3.76e-08 | 4164.34 ms | 32.4% bf16 MFU | 125456 tok/s step 19467/19560 | loss 3.340191 (+0.81z)| norm 0.2578 (+1.07z)| lr 3.68e-08 | 4146.56 ms | 32.6% bf16 MFU | 125505 tok/s step 19468/19560 | loss 3.302040 (-0.11z)| norm 0.2231 (-0.28z)| lr 3.60e-08 | 4381.32 ms | 30.8% bf16 MFU | 125213 tok/s step 19469/19560 | loss 3.395735 (+2.08z)| norm 0.2239 (-0.25z)| lr 3.52e-08 | 4170.54 ms | 32.4% bf16 MFU | 125238 tok/s step 19470/19560 | loss 3.329623 (+0.52z)| norm 0.2181 (-0.47z)| lr 3.45e-08 | 4161.15 ms | 32.4% bf16 MFU | 125276 tok/s step 19471/19560 | loss 3.310842 (+0.07z)| norm 0.2203 (-0.39z)| lr 3.37e-08 | 4160.60 ms | 32.5% bf16 MFU | 125313 tok/s step 19472/19560 | loss 3.308258 (-0.00z)| norm 0.2364 (+0.24z)| lr 3.30e-08 | 4161.75 ms | 32.4% bf16 MFU | 125346 tok/s step 19473/19560 | loss 3.338991 (+0.72z)| norm 0.2216 (-0.34z)| lr 3.22e-08 | 4196.84 ms | 32.2% bf16 MFU | 125325 tok/s step 19474/19560 | loss 3.312161 (+0.07z)| norm 0.2374 (+0.27z)| lr 3.15e-08 | 4197.03 ms | 32.2% bf16 MFU | 125304 tok/s step 19475/19560 | loss 3.297052 (-0.31z)| norm 0.2199 (-0.41z)| lr 3.08e-08 | 4150.82 ms | 32.5% bf16 MFU | 125355 tok/s step 19476/19560 | loss 3.285221 (-0.59z)| norm 0.2353 (+0.18z)| lr 3.01e-08 | 4273.30 ms | 31.6% bf16 MFU | 125221 tok/s step 19477/19560 | loss 3.297435 (-0.29z)| norm 0.2204 (-0.40z)| lr 2.94e-08 | 4152.07 ms | 32.5% bf16 MFU | 125274 tok/s step 19478/19560 | loss 3.324173 (+0.36z)| norm 0.2294 (-0.05z)| lr 2.87e-08 | 
4154.63 ms | 32.5% bf16 MFU | 125320 tok/s step 19479/19560 | loss 3.328238 (+0.44z)| norm 0.2240 (-0.26z)| lr 2.80e-08 | 4154.04 ms | 32.5% bf16 MFU | 125364 tok/s step 19480/19560 | loss 3.257211 (-1.29z)| norm 0.2170 (-0.53z)| lr 2.73e-08 | 4169.42 ms | 32.4% bf16 MFU | 125384 tok/s step 19481/19560 | loss 3.301236 (-0.23z)| norm 0.2274 (-0.12z)| lr 2.66e-08 | 4171.07 ms | 32.4% bf16 MFU | 125399 tok/s step 19482/19560 | loss 3.261337 (-1.21z)| norm 0.2291 (-0.06z)| lr 2.60e-08 | 4156.74 ms | 32.5% bf16 MFU | 125436 tok/s step 19483/19560 | loss 3.272598 (-0.92z)| norm 0.2290 (-0.06z)| lr 2.53e-08 | 4160.04 ms | 32.5% bf16 MFU | 125465 tok/s step 19484/19560 | loss 3.311276 (+0.02z)| norm 0.2284 (-0.09z)| lr 2.47e-08 | 4161.81 ms | 32.4% bf16 MFU | 125491 tok/s step 19485/19560 | loss 3.300165 (-0.24z)| norm 0.2319 (+0.05z)| lr 2.40e-08 | 4159.63 ms | 32.5% bf16 MFU | 125518 tok/s step 19486/19560 | loss 3.300081 (-0.24z)| norm 0.2172 (-0.52z)| lr 2.34e-08 | 4159.71 ms | 32.5% bf16 MFU | 125545 tok/s step 19487/19560 | loss 3.323215 (+0.33z)| norm 0.2188 (-0.46z)| lr 2.28e-08 | 4168.35 ms | 32.4% bf16 MFU | 125556 tok/s step 19488/19560 | loss 3.293841 (-0.41z)| norm 0.2271 (-0.14z)| lr 2.22e-08 | 4161.87 ms | 32.4% bf16 MFU | 125577 tok/s step 19489/19560 | loss 3.291417 (-0.48z)| norm 0.2209 (-0.37z)| lr 2.16e-08 | 4162.12 ms | 32.4% bf16 MFU | 125597 tok/s step 19490/19560 | loss 3.247858 (-1.56z)| norm 0.2251 (-0.21z)| lr 2.10e-08 | 4158.32 ms | 32.5% bf16 MFU | 125621 tok/s step 19491/19560 | loss 3.229297 (-2.01z)| norm 0.2453 (+0.58z)| lr 2.04e-08 | 4152.91 ms | 32.5% bf16 MFU | 125652 tok/s step 19492/19560 | loss 3.275168 (-0.88z)| norm 0.2191 (-0.45z)| lr 1.98e-08 | 4164.02 ms | 32.4% bf16 MFU | 125665 tok/s step 19493/19560 | loss 3.286642 (-0.58z)| norm 0.2157 (-0.57z)| lr 1.92e-08 | 4158.63 ms | 32.5% bf16 MFU | 125685 tok/s step 19494/19560 | loss 3.373712 (+1.60z)| norm 0.2271 (-0.13z)| lr 1.87e-08 | 4165.95 ms | 32.4% bf16 MFU | 125694 tok/s step 
19495/19560 | loss 3.295058 (-0.39z)| norm 0.2299 (-0.02z)| lr 1.81e-08 | 4159.99 ms | 32.5% bf16 MFU | 125710 tok/s step 19496/19560 | loss 3.284209 (-0.66z)| norm 0.2277 (-0.10z)| lr 1.76e-08 | 4157.33 ms | 32.5% bf16 MFU | 125731 tok/s step 19497/19560 | loss 3.335432 (+0.64z)| norm 0.2551 (+0.96z)| lr 1.70e-08 | 4150.26 ms | 32.5% bf16 MFU | 125760 tok/s step 19498/19560 | loss 3.277233 (-0.82z)| norm 0.2194 (-0.42z)| lr 1.65e-08 | 4157.60 ms | 32.5% bf16 MFU | 125777 tok/s step 19499/19560 | loss 3.331615 (+0.56z)| norm 0.2260 (-0.16z)| lr 1.60e-08 | 4158.08 ms | 32.5% bf16 MFU | 125793 tok/s step 19500/19560 | loss 3.225543 (-2.08z)| norm 0.2250 (-0.20z)| lr 1.55e-08 | 4154.00 ms | 32.5% bf16 MFU | 125814 tok/s val loss 3.265532 HellaSwag: 3033/10042 = 0.302031 step 19501/19560 | loss 3.329035 (+0.49z)| norm 0.2195 (-0.42z)| lr 1.50e-08 | 4208.73 ms | 32.1% bf16 MFU | 125752 tok/s step 19502/19560 | loss 3.330750 (+0.53z)| norm 0.2250 (-0.20z)| lr 1.45e-08 | 4772.29 ms | 28.3% bf16 MFU | 124957 tok/s step 19503/19560 | loss 3.318818 (+0.23z)| norm 0.2191 (-0.43z)| lr 1.40e-08 | 6621.67 ms | 20.4% bf16 MFU | 122668 tok/s step 19504/19560 | loss 3.343279 (+0.83z)| norm 0.2224 (-0.30z)| lr 1.35e-08 | 4163.13 ms | 32.4% bf16 MFU | 122832 tok/s step 19505/19560 | loss 3.350899 (+1.01z)| norm 0.2194 (-0.42z)| lr 1.31e-08 | 4669.61 ms | 28.9% bf16 MFU | 122304 tok/s step 19506/19560 | loss 3.349486 (+0.96z)| norm 0.2354 (+0.20z)| lr 1.26e-08 | 4163.29 ms | 32.4% bf16 MFU | 122485 tok/s step 19507/19560 | loss 3.214197 (-2.33z)| norm 0.2271 (-0.12z)| lr 1.21e-08 | 4150.54 ms | 32.5% bf16 MFU | 122677 tok/s step 19508/19560 | loss 3.332493 (+0.55z)| norm 0.2247 (-0.21z)| lr 1.17e-08 | 4196.15 ms | 32.2% bf16 MFU | 122790 tok/s step 19509/19560 | loss 3.303527 (-0.17z)| norm 0.2227 (-0.29z)| lr 1.12e-08 | 4147.81 ms | 32.6% bf16 MFU | 122971 tok/s step 19510/19560 | loss 3.271634 (-0.95z)| norm 0.2432 (+0.51z)| lr 1.08e-08 | 4173.21 ms | 32.4% bf16 MFU | 123104 tok/s step 19511/19560 | loss 3.356684 (+1.12z)| norm 0.2205 (-0.38z)| lr
1.04e-08 | 4194.40 ms | 32.2% bf16 MFU | 123199 tok/s step 19512/19560 | loss 3.311548 (+0.02z)| norm 0.2330 (+0.11z)| lr 1.00e-08 | 4184.83 ms | 32.3% bf16 MFU | 123303 tok/s step 19513/19560 | loss 3.343679 (+0.80z)| norm 0.2377 (+0.29z)| lr 9.58e-09 | 4147.54 ms | 32.6% bf16 MFU | 123458 tok/s step 19514/19560 | loss 3.230620 (-1.96z)| norm 0.2204 (-0.39z)| lr 9.19e-09 | 6121.92 ms | 22.1% bf16 MFU | 121567 tok/s step 19515/19560 | loss 3.333404 (+0.54z)| norm 0.2217 (-0.34z)| lr 8.82e-09 | 4143.89 ms | 32.6% bf16 MFU | 121815 tok/s step 19516/19560 | loss 3.331627 (+0.49z)| norm 0.2222 (-0.32z)| lr 8.42e-09 | 4160.45 ms | 32.5% bf16 MFU | 122025 tok/s step 19517/19560 | loss 3.287537 (-0.59z)| norm 0.2277 (-0.11z)| lr 8.06e-09 | 4161.47 ms | 32.4% bf16 MFU | 122223 tok/s step 19518/19560 | loss 3.345313 (+0.81z)| norm 0.2187 (-0.46z)| lr 7.69e-09 | 4164.56 ms | 32.4% bf16 MFU | 122407 tok/s step 19519/19560 | loss 3.294004 (-0.44z)| norm 0.2295 (-0.04z)| lr 7.35e-09 | 4163.01 ms | 32.4% bf16 MFU | 122583 tok/s step 19520/19560 | loss 3.338464 (+0.65z)| norm 0.2197 (-0.42z)| lr 6.99e-09 | 4156.82 ms | 32.5% bf16 MFU | 122760 tok/s step 19521/19560 | loss 3.306974 (-0.12z)| norm 0.2395 (+0.36z)| lr 6.65e-09 | 4167.17 ms | 32.4% bf16 MFU | 122913 tok/s step 19522/19560 | loss 3.339813 (+0.68z)| norm 0.2180 (-0.49z)| lr 6.33e-09 | 4140.07 ms | 32.6% bf16 MFU | 123099 tok/s step 19523/19560 | loss 3.242160 (-1.67z)| norm 0.2386 (+0.32z)| lr 6.01e-09 | 4175.96 ms | 32.3% bf16 MFU | 123222 tok/s step 19524/19560 | loss 3.259689 (-1.23z)| norm 0.2199 (-0.42z)| lr 5.70e-09 | 4435.16 ms | 30.4% bf16 MFU | 122971 tok/s step 19525/19560 | loss 3.262735 (-1.14z)| norm 0.2285 (-0.08z)| lr 5.40e-09 | 4188.50 ms | 32.2% bf16 MFU | 123081 tok/s step 19526/19560 | loss 3.304224 (-0.15z)| norm 0.2204 (-0.40z)| lr 5.10e-09 | 4152.15 ms | 32.5% bf16 MFU | 123241 tok/s step 19527/19560 | loss 3.215559 (-2.23z)| norm 0.2245 (-0.24z)| lr 4.81e-09 | 4152.82 ms | 32.5% bf16 MFU | 123391 
tok/s step 19528/19560 | loss 3.328952 (+0.46z)| norm 0.2229 (-0.30z)| lr 4.52e-09 | 4190.91 ms | 32.2% bf16 MFU | 123477 tok/s step 19529/19560 | loss 3.295046 (-0.34z)| norm 0.2279 (-0.11z)| lr 4.26e-09 | 4162.07 ms | 32.4% bf16 MFU | 123601 tok/s step 19530/19560 | loss 3.290335 (-0.44z)| norm 0.2220 (-0.34z)| lr 4.01e-09 | 4146.41 ms | 32.6% bf16 MFU | 123743 tok/s step 19531/19560 | loss 3.328024 (+0.45z)| norm 0.2310 (+0.01z)| lr 3.74e-09 | 4156.55 ms | 32.5% bf16 MFU | 123863 tok/s step 19532/19560 | loss 3.306900 (-0.06z)| norm 0.2211 (-0.37z)| lr 3.50e-09 | 4169.69 ms | 32.4% bf16 MFU | 123957 tok/s step 19533/19560 | loss 3.312407 (+0.08z)| norm 0.2219 (-0.43z)| lr 3.25e-09 | 4153.91 ms | 32.5% bf16 MFU | 124070 tok/s step 19534/19560 | loss 3.324495 (+0.37z)| norm 0.2260 (-0.18z)| lr 3.04e-09 | 4164.89 ms | 32.4% bf16 MFU | 124160 tok/s step 19535/19560 | loss 3.329957 (+0.49z)| norm 0.2253 (-0.23z)| lr 2.81e-09 | 4152.48 ms | 32.5% bf16 MFU | 124265 tok/s step 19536/19560 | loss 3.344045 (+0.82z)| norm 0.2231 (-0.36z)| lr 2.59e-09 | 4161.63 ms | 32.4% bf16 MFU | 124351 tok/s step 19537/19560 | loss 3.389760 (+1.91z)| norm 0.2216 (-0.51z)| lr 2.40e-09 | 4170.81 ms | 32.4% bf16 MFU | 124419 tok/s step 19538/19560 | loss 3.313350 (+0.09z)| norm 0.2238 (-0.33z)| lr 2.20e-09 | 4164.94 ms | 32.4% bf16 MFU | 124492 tok/s step 19539/19560 | loss 3.239622 (-1.64z)| norm 0.2192 (-0.69z)| lr 2.02e-09 | 4148.46 ms | 32.5% bf16 MFU | 124586 tok/s step 19540/19560 | loss 3.286273 (-0.53z)| norm 0.2166 (-0.89z)| lr 1.84e-09 | 4163.70 ms | 32.4% bf16 MFU | 124653 tok/s step 19541/19560 | loss 3.248304 (-1.44z)| norm 0.2205 (-0.57z)| lr 1.66e-09 | 4162.72 ms | 32.4% bf16 MFU | 124718 tok/s step 19542/19560 | loss 3.295507 (-0.32z)| norm 0.2203 (-0.59z)| lr 1.50e-09 | 4173.40 ms | 32.4% bf16 MFU | 124763 tok/s step 19543/19560 | loss 3.248264 (-1.42z)| norm 0.2276 (-0.01z)| lr 1.34e-09 | 4169.76 ms | 32.4% bf16 MFU | 124812 tok/s step 19544/19560 | loss 3.261153 
(-1.11z)| norm 0.2201 (-0.60z)| lr 1.20e-09 | 4167.59 ms | 32.4% bf16 MFU | 124861 tok/s step 19545/19560 | loss 3.243267 (-1.50z)| norm 0.2308 (+0.25z)| lr 1.07e-09 | 4160.90 ms | 32.4% bf16 MFU | 124918 tok/s step 19546/19560 | loss 3.394656 (+2.02z)| norm 0.2289 (+0.10z)| lr 9.30e-10 | 4165.71 ms | 32.4% bf16 MFU | 124965 tok/s step 19547/19560 | loss 3.286635 (-0.49z)| norm 0.2287 (+0.07z)| lr 8.23e-10 | 4253.69 ms | 31.7% bf16 MFU | 124880 tok/s step 19548/19560 | loss 3.323129 (+0.36z)| norm 0.2331 (+0.42z)| lr 6.97e-10 | 4161.88 ms | 32.4% bf16 MFU | 124935 tok/s step 19549/19560 | loss 3.373435 (+1.53z)| norm 0.2323 (+0.36z)| lr 6.08e-10 | 4157.24 ms | 32.5% bf16 MFU | 124994 tok/s step 19550/19560 | loss 3.287684 (-0.45z)| norm 0.2311 (+0.26z)| lr 5.01e-10 | 4163.53 ms | 32.4% bf16 MFU | 125040 tok/s step 19551/19560 | loss 3.355666 (+1.11z)| norm 0.2228 (-0.40z)| lr 4.11e-10 | 4179.93 ms | 32.3% bf16 MFU | 125060 tok/s step 19552/19560 | loss 3.259177 (-1.12z)| norm 0.2222 (-0.45z)| lr 3.40e-10 | 4156.35 ms | 32.5% bf16 MFU | 125114 tok/s step 19553/19560 | loss 3.266729 (-0.93z)| norm 0.2261 (-0.13z)| lr 2.68e-10 | 4182.91 ms | 32.3% bf16 MFU | 125125 tok/s step 19554/19560 | loss 3.267363 (-0.91z)| norm 0.2203 (-0.60z)| lr 1.97e-10 | 4170.69 ms | 32.4% bf16 MFU | 125154 tok/s step 19555/19560 | loss 3.307483 (+0.00z)| norm 0.2169 (-0.85z)| lr 1.43e-10 | 4179.16 ms | 32.3% bf16 MFU | 125169 tok/s step 19556/19560 | loss 3.264023 (-0.98z)| norm 0.2191 (-0.67z)| lr 1.07e-10 | 4625.17 ms | 29.2% bf16 MFU | 124578 tok/s step 19557/19560 | loss 3.309438 (+0.06z)| norm 0.2295 (+0.16z)| lr 7.15e-11 | 4804.52 ms | 28.1% bf16 MFU | 123806 tok/s step 19558/19560 | loss 3.285889 (-0.46z)| norm 0.2324 (+0.38z)| lr 3.58e-11 | 4484.61 ms | 30.1% bf16 MFU | 123461 tok/s step 19559/19560 | loss 3.314194 (+0.19z)| norm 0.2776 (+3.74z)| lr 1.79e-11 | 4492.06 ms | 30.1% bf16 MFU | 123123 tok/s step 19560/19560 | loss 3.252871 (-1.21z)| norm 0.2281 (+0.02z)| lr 0.00e+00 | 
4516.70 ms | 29.9% bf16 MFU | 122771 tok/s val loss 3.265413 evaluating HellaSwag: 0/1256 evaluating HellaSwag: 10/1256 evaluating HellaSwag: 20/1256 evaluating HellaSwag: 30/1256 evaluating HellaSwag: 40/1256 evaluating HellaSwag: 50/1256 evaluating HellaSwag: 60/1256 evaluating HellaSwag: 70/1256 evaluating HellaSwag: 80/1256 evaluating HellaSwag: 90/1256 evaluating HellaSwag: 100/1256 evaluating HellaSwag: 110/1256 evaluating HellaSwag: 120/1256 evaluating HellaSwag: 130/1256 evaluating HellaSwag: 140/1256 evaluating HellaSwag: 150/1256 evaluating HellaSwag: 160/1256 evaluating HellaSwag: 170/1256 evaluating HellaSwag: 180/1256 evaluating HellaSwag: 190/1256 evaluating HellaSwag: 200/1256 evaluating HellaSwag: 210/1256 evaluating HellaSwag: 220/1256 evaluating HellaSwag: 230/1256 evaluating HellaSwag: 240/1256 evaluating HellaSwag: 250/1256 evaluating HellaSwag: 260/1256 evaluating HellaSwag: 270/1256 evaluating HellaSwag: 280/1256 evaluating HellaSwag: 290/1256 evaluating HellaSwag: 300/1256 evaluating HellaSwag: 310/1256 evaluating HellaSwag: 320/1256 evaluating HellaSwag: 330/1256 evaluating HellaSwag: 340/1256 evaluating HellaSwag: 350/1256 evaluating HellaSwag: 360/1256 evaluating HellaSwag: 370/1256 evaluating HellaSwag: 380/1256 evaluating HellaSwag: 390/1256 evaluating HellaSwag: 400/1256 evaluating HellaSwag: 410/1256 evaluating HellaSwag: 420/1256 evaluating HellaSwag: 430/1256 evaluating HellaSwag: 440/1256 evaluating HellaSwag: 450/1256 evaluating HellaSwag: 460/1256 evaluating HellaSwag: 470/1256 evaluating HellaSwag: 480/1256 evaluating HellaSwag: 490/1256 evaluating HellaSwag: 500/1256 evaluating HellaSwag: 510/1256 evaluating HellaSwag: 520/1256 evaluating HellaSwag: 530/1256 evaluating HellaSwag: 540/1256 evaluating HellaSwag: 550/1256 evaluating HellaSwag: 560/1256 evaluating HellaSwag: 570/1256 evaluating HellaSwag: 580/1256 evaluating HellaSwag: 590/1256 evaluating HellaSwag: 600/1256 evaluating HellaSwag: 610/1256 evaluating HellaSwag: 
620/1256 evaluating HellaSwag: 630/1256 evaluating HellaSwag: 640/1256 evaluating HellaSwag: 650/1256 evaluating HellaSwag: 660/1256 evaluating HellaSwag: 670/1256 evaluating HellaSwag: 680/1256 evaluating HellaSwag: 690/1256 evaluating HellaSwag: 700/1256 evaluating HellaSwag: 710/1256 evaluating HellaSwag: 720/1256 evaluating HellaSwag: 730/1256 evaluating HellaSwag: 740/1256 evaluating HellaSwag: 750/1256 evaluating HellaSwag: 760/1256 evaluating HellaSwag: 770/1256 evaluating HellaSwag: 780/1256 evaluating HellaSwag: 790/1256 evaluating HellaSwag: 800/1256 evaluating HellaSwag: 810/1256 evaluating HellaSwag: 820/1256 evaluating HellaSwag: 830/1256 evaluating HellaSwag: 840/1256 evaluating HellaSwag: 850/1256 evaluating HellaSwag: 860/1256 evaluating HellaSwag: 870/1256 evaluating HellaSwag: 880/1256 evaluating HellaSwag: 890/1256 evaluating HellaSwag: 900/1256 evaluating HellaSwag: 910/1256 evaluating HellaSwag: 920/1256 evaluating HellaSwag: 930/1256 evaluating HellaSwag: 940/1256 evaluating HellaSwag: 950/1256 evaluating HellaSwag: 960/1256 evaluating HellaSwag: 970/1256 evaluating HellaSwag: 980/1256 evaluating HellaSwag: 990/1256 evaluating HellaSwag: 1000/1256 evaluating HellaSwag: 1010/1256 evaluating HellaSwag: 1020/1256 evaluating HellaSwag: 1030/1256 evaluating HellaSwag: 1040/1256 evaluating HellaSwag: 1050/1256 evaluating HellaSwag: 1060/1256 evaluating HellaSwag: 1070/1256 evaluating HellaSwag: 1080/1256 evaluating HellaSwag: 1090/1256 evaluating HellaSwag: 1100/1256 evaluating HellaSwag: 1110/1256 evaluating HellaSwag: 1120/1256 evaluating HellaSwag: 1130/1256 evaluating HellaSwag: 1140/1256 evaluating HellaSwag: 1150/1256 evaluating HellaSwag: 1160/1256 evaluating HellaSwag: 1170/1256 evaluating HellaSwag: 1180/1256 evaluating HellaSwag: 1190/1256 evaluating HellaSwag: 1200/1256 evaluating HellaSwag: 1210/1256 evaluating HellaSwag: 1220/1256 evaluating HellaSwag: 1230/1256 evaluating HellaSwag: 1240/1256 evaluating HellaSwag: 1250/1256 HellaSwag: 
3035/10042 = 0.302231 generating: --- 198 45 993 5720 1288 299 518 6862 599 6557 400 528 8836 299 444 89 261 8836 83 12385 300 86 8836 279 86 8836 299 8836 86 2616 13 50256 34 586 39438 262 360 1191 1958 6682 26701 379 3873 7940 78 13661 198 18234 514 319 3035 2310 11 33448 379 1367 25 405 716 8211 3862 8211 3862 --- Writing checkpoint at step 19560 Writing model to log124M/model_00019560.bin Writing state to log124M/state_00019560_00000.bin total average iteration time: 4190.824390 ms /var/spool/slurmd/job11674424/slurm_script: line 15: -er: command not found
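
The step and HellaSwag records above follow a regular layout, so they can be pulled into structured form for plotting or sanity checks. Below is a minimal Python sketch of such a parser. The names `STEP_RE`, `HELLA_RE`, and `parse_log` are hypothetical and not part of the training program; the patterns are inferred only from the record shapes visible in this log and may need adjusting for other versions of the output. Whitespace is normalised first, so it should not matter whether each record sits on its own line or the log has been flattened.

```python
import re

# Pattern for one training-step record, inferred from this log's layout
# (an assumption, not a documented format).
STEP_RE = re.compile(
    r"step\s+(?P<step>\d+)/(?P<total>\d+)\s*\|"
    r"\s*loss\s+(?P<loss>[\d.]+)\s*\((?P<loss_z>[^)]*)z\)\s*\|"
    r"\s*norm\s+(?P<norm>[\d.]+)\s*\((?P<norm_z>[^)]*)z\)\s*\|"
    r"\s*lr\s+(?P<lr>[\d.eE+-]+)\s*\|"
    r"\s*(?P<ms>[\d.]+)\s*ms\s*\|"
    r"\s*(?P<mfu>[\d.]+)%\s*bf16 MFU\s*\|"
    r"\s*(?P<tok_s>\d+)\s*tok/s"
)

# Final HellaSwag summary, e.g. "HellaSwag: 3035/10042 = 0.302231".
# Progress records ("evaluating HellaSwag: 10/1256") lack the "=" and are skipped.
HELLA_RE = re.compile(r"HellaSwag:\s+(\d+)/(\d+)\s+=\s+([\d.]+)")


def parse_log(text):
    """Return (steps, evals) extracted from raw log text."""
    flat = " ".join(text.split())  # collapse newlines and runs of spaces
    steps = [
        {
            "step": int(m["step"]),
            "total": int(m["total"]),
            "loss": float(m["loss"]),
            "norm": float(m["norm"]),
            "lr": float(m["lr"]),
            "ms": float(m["ms"]),
            "mfu": float(m["mfu"]),
            "tok_s": int(m["tok_s"]),
        }
        for m in STEP_RE.finditer(flat)
    ]
    evals = [(int(c), int(n), float(a)) for c, n, a in HELLA_RE.findall(flat)]
    return steps, evals


if __name__ == "__main__":
    # Two records copied from the log above, used only as a smoke test.
    sample = (
        "step 19559/19560 | loss 3.314194 (+0.19z)| norm 0.2776 (+3.74z)| "
        "lr 1.79e-11 | 4492.06 ms | 30.1% bf16 MFU | 123123 tok/s "
        "evaluating HellaSwag: 1250/1256 HellaSwag: 3035/10042 = 0.302231"
    )
    steps, evals = parse_log(sample)
    print(steps[0]["step"], steps[0]["loss"], evals)
```

The z-score fields are matched but not converted, since early records in this log report `+nanz` there; callers that want them can read the `loss_z` and `norm_z` groups directly.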