Alignment and Unlearning
• Learn Your Reference Model for Real Good Alignment (arXiv:2404.09656)
• Aligning Teacher with Student Preferences for Tailored Training Data Generation (arXiv:2406.19227)
• Self-Play Preference Optimization for Language Model Alignment (arXiv:2405.00675)
• CantTalkAboutThis: Aligning Language Models to Stay on Topic in Dialogues (arXiv:2404.03820)
• Iterative Nash Policy Optimization: Aligning LLMs with General Preferences via No-Regret Learning (arXiv:2407.00617)
• UnUnlearning: Unlearning is not sufficient for content regulation in advanced generative AI (arXiv:2407.00106)
• Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges (arXiv:2406.12624)
• Simulating Classroom Education with LLM-Empowered Agents (arXiv:2406.19226)
• WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs (arXiv:2406.18495)
• WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models (arXiv:2406.18510)
• Intrinsic Evaluation of Unlearning Using Parametric Knowledge Traces (arXiv:2406.11614)
• Large Language Model Unlearning via Embedding-Corrupted Prompts (arXiv:2406.07933)
• Deep Bayesian Active Learning for Preference Modeling in Large Language Models (arXiv:2406.10023)
• Transforming and Combining Rewards for Aligning Large Language Models (arXiv:2402.00742)
• LongAlign: A Recipe for Long Context Alignment of Large Language Models (arXiv:2401.18058)
• Learning to Refuse: Towards Mitigating Privacy Risks in LLMs (arXiv:2407.10058)
• To Forget or Not? Towards Practical Knowledge Unlearning for Large Language Models (arXiv:2407.01920)
• Rethinking Entity-level Unlearning for Large Language Models (arXiv:2406.15796)
• The Art of Saying No: Contextual Noncompliance in Language Models (arXiv:2407.12043)
• Instruction Following without Instruction Tuning (arXiv:2409.14254)
• Toward General Instruction-Following Alignment for Retrieval-Augmented Generation (arXiv:2410.09584)