- Robust DPO with a provable redescending property (see the gradient sketch below).
- A principled data valuation and cleaning method.
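As a quick, hedged illustration of the redescending property, the sketch below compares per-sample gradient magnitudes of the standard DPO loss and the Hölder-DPO loss (as defined in the snippet further down) as a function of the reward margin. The chosen `beta` and `gamma` values are assumptions for illustration only, not recommended settings.

```python
# Minimal sketch (illustrative beta/gamma): per-sample gradient w.r.t. the
# reward margin g = reward_win - reward_lose for DPO vs. Hölder-DPO.
import torch
import torch.nn.functional as F

beta, gamma = 1.0, 0.5                               # illustrative values only
g = torch.linspace(-10.0, 10.0, 9, requires_grad=True)

# Standard DPO loss: -log sigmoid(beta * g)
dpo_loss = -F.logsigmoid(beta * g).sum()
(grad_dpo,) = torch.autograd.grad(dpo_loss, g)

# Hölder-DPO loss: -(1 + gamma) * p^gamma + gamma * p^(gamma + 1), p = sigmoid(beta * g)
p = torch.sigmoid(beta * g)
holder_loss = (-(1.0 + gamma) * p.pow(gamma) + gamma * p.pow(gamma + 1)).sum()
(grad_holder,) = torch.autograd.grad(holder_loss, g)

for gi, gd, gh in zip(g.detach().tolist(), grad_dpo.tolist(), grad_holder.tolist()):
    print(f"margin {gi:+6.2f} | |DPO grad| = {abs(gd):.4f} | |Hölder grad| = {abs(gh):.4f}")
```

For strongly violated preferences (large negative margins) the DPO gradient magnitude plateaus near `beta`, whereas the Hölder-DPO gradient decays back towards zero, which is what limits the influence of mislabeled or corrupted preference pairs.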
We will tidy up the code soon. If you already have a DPO codebase, you can switch to our loss by simply replacing the loss computation as follows:
import torch
import torch.nn.functional as F

# pi_logps  : policy log-probabilities, shape (B,)
# ref_logps : reference model log-probabilities, shape (B,)
# yw_idxs   : preferred completion indices, shape (T,)
# yl_idxs   : dispreferred completion indices, shape (T,)
# self.beta, self.gamma : regularization coefficients

# Log-probabilities of the preferred and dispreferred completions
pi_yw_logps = pi_logps[yw_idxs]
pi_yl_logps = pi_logps[yl_idxs]
ref_yw_logps = ref_logps[yw_idxs]
ref_yl_logps = ref_logps[yl_idxs]

# Implicit rewards relative to the reference model, and their margin
reward_win = pi_yw_logps - ref_yw_logps
reward_lose = pi_yl_logps - ref_yl_logps
g_theta = reward_win - reward_lose

if self.method == "dpo":
    # Standard DPO loss
    loss = -F.logsigmoid(self.beta * g_theta).mean()
elif self.method == "holder_dpo":
    # Hölder-DPO loss: robust, with redescending influence on outliers
    p = torch.sigmoid(self.beta * g_theta)
    loss = -(1.0 + self.gamma) * p.pow(self.gamma).mean() \
           + self.gamma * p.pow(self.gamma + 1).mean()
return loss
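Below is a minimal, self-contained usage sketch that wraps the snippet above in a standalone function and runs it on toy tensors. The wrapper name `holder_dpo_loss`, the default hyperparameter values, and the toy data are assumptions for illustration only.

```python
# Self-contained sketch (assumed wrapper name and toy data), mirroring the
# snippet above without the `self.*` attributes.
import torch
import torch.nn.functional as F

def holder_dpo_loss(pi_logps, ref_logps, yw_idxs, yl_idxs,
                    beta=0.1, gamma=0.5, method="holder_dpo"):
    # Implicit rewards of preferred / dispreferred completions and their margin
    reward_win = pi_logps[yw_idxs] - ref_logps[yw_idxs]
    reward_lose = pi_logps[yl_idxs] - ref_logps[yl_idxs]
    g_theta = reward_win - reward_lose
    if method == "dpo":
        return -F.logsigmoid(beta * g_theta).mean()
    p = torch.sigmoid(beta * g_theta)
    return (-(1.0 + gamma) * p.pow(gamma) + gamma * p.pow(gamma + 1)).mean()

# Toy batch: 6 completions forming 3 preference pairs
pi_logps = torch.randn(6, requires_grad=True)
ref_logps = torch.randn(6)
yw_idxs, yl_idxs = torch.tensor([0, 2, 4]), torch.tensor([1, 3, 5])

loss = holder_dpo_loss(pi_logps, ref_logps, yw_idxs, yl_idxs)
loss.backward()   # gradients flow only through the policy log-probs
print(loss.item(), pi_logps.grad)
```

The per-pair values `p` (before the `.mean()`) could in principle be inspected to flag suspicious preference pairs, in the spirit of the data-valuation bullet above; the actual valuation procedure is specified in the paper, not in this sketch.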
Please cite this work as:

@inproceedings{fujisawa2025scalable,
  title     = {Scalable Valuation of Human Feedback through Provably Robust Model Alignment},
  author    = {Fujisawa, Masahiro and Adachi, Masaki and Osborne, Michael A.},
  booktitle = {Advances in Neural Information Processing Systems},
  doi       = {10.48550/arXiv.2505.17859},
  year      = {2025}
}