Explorer
- 0101 Xie - mHC- Manifold-Constrained Hyper-Connections.pdf
- 0102 Qiu - Why Low-Precision Transformer Training Fails- An Analysis on Flash Attention.pdf
- 0102 Zhang - On the Interplay of Pre-Training, Mid-Training, and RL on Reasoning Language Models.pdf
- 0103 Zhang - Recursive Language Models.pdf
- 0108 Kim - Towards a Science of Scaling Agent Systems.pdf
- 0109 Zhou - How to Set the Batch Size for Large-Scale Pre-training?.pdf
- 0109 Zhou - How to Set the Learning Rate for Large-Scale Pre-training?.pdf
- 0125 Cai - Training-Free Group Relative Policy Optimization.pdf
- 0128 Tandon - End-to-End Test-Time Training for Long Context.pdf
- 0130 Hübotter - Reinforcement Learning via Self-Distillation.pdf
- 0130 Shenfeld - Self-Distillation Enables Continual Learning.pdf
- 0131 Gopalakrishnan - Decoupling the "What" and "Where" With Polar Coordinate Positional Embeddings.pdf
- 0131 Karami - Trellis- Learning to Compress Key-Value Memory in Attention Models.pdf
- 0131 Liu - Rethinking KL Regularization in RLHF- From Value Estimation to Gradient Optimization.pdf
- 0131 Marek - Small Batch Size Training for Language Models- When Vanilla SGD Works, and Why Gradient Accumulation Is Wasteful.pdf
- 0131 Scheibner - Large language models and the entropy of English.pdf
- 0131 Tan - Self-Improving Pretraining- using post-trained models to pretrain better models.pdf
- 0131 Zhang - Deep Delta Learning.pdf
- 0203 Kalra - A Scalable Measure of Loss Landscape Curvature for Analyzing the Training Dynamics of LLMs.pdf
- 0204 Song - Expanding the Capabilities of Reinforcement Learning via Text Feedback.pdf
- 0205 Morris - Learning to Reason in 13 Parameters.pdf
- 0217 Janson - Stabilizing Native Low-Rank LLM Pretraining.pdf
- 0218 Krasheninnikov - Fresh in memory- Training-order recency is linearly encoded in language model activations.pdf
- 0218 Team - GLM-5- from Vibe Coding to Agentic Engineering.pdf
- 0218 Treutlein - Connecting the Dots- LLMs can Infer and Verbalize Latent Structure from Disparate Training Data.pdf
- 0223 Penaloza - Privileged Information Distillation for Language Models.pdf
- metadata.jsonl