- 0102 Qiu - Why Low-Precision Transformer Training Fails- An Analysis on Flash Attention.txt
- 0102 Zhang - On the Interplay of Pre-Training, Mid-Training, and RL on Reasoning Language Models.txt
- 0103 Zhang - Recursive Language Models.txt
- 0108 Kim - Towards a Science of Scaling Agent Systems.txt
- 0109 Zhou - How to Set the Batch Size for Large-Scale Pre-training?.txt
- 0109 Zhou - How to Set the Learning Rate for Large-Scale Pre-training?.txt
- 0125 Cai - Training-Free Group Relative Policy Optimization.txt
- 0128 Tandon - End-to-End Test-Time Training for Long Context.txt
- 0130 Hübotter - Reinforcement Learning via Self-Distillation.txt
- 0130 Shenfeld - Self-Distillation Enables Continual Learning.txt
- 0131 Gopalakrishnan - Decoupling the "What" and "Where" With Polar Coordinate Positional Embeddings.txt
- 0131 Karami - Trellis- Learning to Compress Key-Value Memory in Attention Models.txt
- 0131 Liu - Rethinking KL Regularization in RLHF- From Value Estimation to Gradient Optimization.txt
- 0131 Marek - Small Batch Size Training for Language Models- When Vanilla SGD Works, and Why Gradient Accumulation Is Wasteful.txt
- 0131 Scheibner - Large language models and the entropy of English.txt
- 0131 Tan - Self-Improving Pretraining- using post-trained models to pretrain better models.txt
- 0131 Zhang - Deep Delta Learning.txt
- 0203 Kalra - A Scalable Measure of Loss Landscape Curvature for Analyzing the Training Dynamics of LLMs.txt
- 0205 Morris - Learning to Reason in 13 Parameters.txt
- 0218 Krasheninnikov - Fresh in memory- Training-order recency is linearly encoded in language model activations.txt
- 0218 Team - GLM-5- from Vibe Coding to Agentic Engineering.txt
- 0218 Treutlein - Connecting the Dots- LLMs can Infer and Verbalize Latent Structure from Disparate Training Data.txt
- 0223 Penaloza - Privileged Information Distillation for Language Models.txt