Question ID: q-5461

In PPO, how should one understand sentence-level versus token-level granularity for the reward model and the value model?

Frequency: 1

NLP and Large Language Models

Common follow-up questions

No follow-up variants yet.

Unknown