Data and models for the paper "Chaining the Evidence: Robust Reinforcement Learning for Deep Search Agents with Citation-Aware Rubric Rewards".
Papers
WildReward: Learning Reward Models from In-the-Wild Human Interactions
DeepPrune: Parallel Scaling without Inter-trace Redundancy
Boundary-Guided Policy Optimization for Memory-Efficient RL of Diffusion Large Language Models
- THU-KEG/LLaDA-8B-BGPO-math
  Reinforcement Learning • 8B • Updated • 3 • 1
- THU-KEG/LLaDA-8B-BGPO-code
  Reinforcement Learning • 8B • Updated • 14 • 1
- THU-KEG/LLaDA-8B-BGPO-countdown
  Reinforcement Learning • 8B • Updated • 125 • 1
- THU-KEG/LLaDA-8B-BGPO-sudoku
  Reinforcement Learning • 8B • Updated • 2 • 1
Scaling Iterative Reinforcement Learning with Interleaved Compression
OpenSAE checkpoints for LLaMA 3.1 8B base model
EMNLP 2024 Main Conference: "Aligning Large Language Models on Information Extraction"
Learning Reward Models from In-the-Wild Interactions
Parallel Scaling without Inter-trace Redundancy
RL trained models and datasets for instruction-following
"Constraint Back-translation Improves Complex Instruction Following of Large Language Models"
- Constraint Back-translation Improves Complex Instruction Following of Large Language Models
  Paper • 2410.24175 • Published • 18
- THU-KEG/Mistral-Crab-SFT
  Text Generation • 7B • Updated • 21 • 5
- THU-KEG/Mistral-Crab-DPO
  Text Generation • 7B • Updated • 8 • 4
- THU-KEG/Llama3-Crab-SFT
  Text Generation • Updated • 8