HolderDPO

  • Robust DPO with a provable redescending property (see the gradient sketch below).
  • A principled data valuation and cleaning method.
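
As a rough numerical sketch of the redescending property, the snippet below compares per-example gradient magnitudes of the standard DPO loss and the Hölder-DPO loss from the Quick start below. The values beta = 0.1 and gamma = 0.5 are illustrative choices made here, not recommended settings. For strongly mislabeled-looking pairs (very negative reward margins g_theta), the DPO gradient saturates near beta, while the Hölder-DPO gradient redescends toward zero, so such pairs have vanishing influence.

import torch
import torch.nn.functional as F

beta, gamma = 0.1, 0.5                                   # illustrative values only
g = torch.linspace(-100.0, 10.0, 6, requires_grad=True)  # reward margins g_theta

# summed per-example losses, so grads w.r.t. g are per-example gradients
p = torch.sigmoid(beta * g)
holder_loss = (-(1.0 + gamma) * p.pow(gamma) + gamma * p.pow(gamma + 1)).sum()
dpo_loss = (-F.logsigmoid(beta * g)).sum()

holder_grad, = torch.autograd.grad(holder_loss, g, retain_graph=True)
dpo_grad, = torch.autograd.grad(dpo_loss, g)

# |dpo_grad| stays near beta for very negative margins (mislabeled-looking pairs),
# while |holder_grad| redescends toward zero, bounding their influence.
print(dpo_grad.abs())
print(holder_grad.abs())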

Quick start

We will tidy up the code soon. If you already have a DPO codebase, you can use our loss by simply replacing the DPO loss computation as follows (shown here as a standalone function):

import torch
import torch.nn.functional as F

def preference_loss(pi_logps, ref_logps, yw_idxs, yl_idxs,
                    beta, gamma, method="holder_dpo"):
    # pi_logps   : policy logprobs, shape (B,)
    # ref_logps  : reference model logprobs, shape (B,)
    # yw_idxs    : preferred completion indices, shape (T,)
    # yl_idxs    : dispreferred completion indices, shape (T,)
    # beta, gamma: regularization coefficients

    pi_yw_logps, pi_yl_logps = pi_logps[yw_idxs], pi_logps[yl_idxs]
    ref_yw_logps, ref_yl_logps = ref_logps[yw_idxs], ref_logps[yl_idxs]

    # implicit reward margin between preferred and dispreferred completions
    reward_win = pi_yw_logps - ref_yw_logps
    reward_lose = pi_yl_logps - ref_yl_logps
    g_theta = reward_win - reward_lose

    if method == "dpo":
        loss = -F.logsigmoid(beta * g_theta).mean()
    elif method == "holder_dpo":
        p = torch.sigmoid(beta * g_theta)
        loss = (-(1.0 + gamma) * p.pow(gamma).mean()
                + gamma * p.pow(gamma + 1).mean())
    return loss
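
For instance, the function above can be exercised with random tensors standing in for real log-probabilities. This is a minimal sketch: the batch size, index layout, and the values beta = 0.1 and gamma = 0.5 are our illustrative assumptions.

import torch

pi_logps = torch.randn(8, requires_grad=True)  # policy logprobs, shape (B,)
ref_logps = torch.randn(8)                     # reference model logprobs, shape (B,)
yw_idxs = torch.tensor([0, 2, 4, 6])           # preferred completion indices, shape (T,)
yl_idxs = torch.tensor([1, 3, 5, 7])           # dispreferred completion indices, shape (T,)

loss = preference_loss(pi_logps, ref_logps, yw_idxs, yl_idxs,
                       beta=0.1, gamma=0.5, method="holder_dpo")
loss.backward()   # in a real trainer, gradients flow into the policy parameters
print(loss.item())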

Cite as

Please cite this work as:

@inproceedings{fujisawa2025scalable,
  title={Scalable Valuation of Human Feedback through Provably Robust Model Alignment},
  author={Fujisawa, Masahiro and Adachi, Masaki and Osborne, Michael A.},
  booktitle={Advances in Neural Information Processing Systems},
  doi={10.48550/arXiv.2505.17859},
  year={2025}
}