

Starred repositories
Improved techniques for optimization-based jailbreaking on large language models (ICLR 2025)
A framework for few-shot evaluation of language models.
[ACL 2024] An Easy-to-use Knowledge Editing Framework for LLMs.
Robust recipes to align language models with human and AI preferences
HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal
Unified Efficient Fine-Tuning of 100+ LLMs & VLMs (ACL 2024)
An automatic evaluator for instruction-following language models. Human-validated, high-quality, cheap, and fast.
A set of case studies for the Wintermute Alpha Challenge
[USENIX Security 2025] PoisonedRAG: Knowledge Corruption Attacks to Retrieval-Augmented Generation of Large Language Models
Official repository for the paper "Safety Alignment Should Be Made More Than Just a Few Tokens Deep"
Toolkit for creating, sharing and using natural language prompts.
A novel approach to improving the safety of large language models, enabling them to transition effectively from an unsafe state to a safe one.
Safe RLHF: Constrained Value Alignment via Safe Reinforcement Learning from Human Feedback
A minimal PyTorch re-implementation of the OpenAI GPT (Generative Pretrained Transformer) training
Submission Guide + Discussion Board for AI Singapore Global Challenge for Safe and Secure LLMs (Track 2A).
Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks [ICLR 2025]
Fluent student-teacher redteaming
A month-long ZKP study group, one topic at a time.
Academic papers on LLM applications in security
A repository illustrating the usage of Transformers, in Chinese.
[ICLR 2025] Official Repository for "Tamper-Resistant Safeguards for Open-Weight LLMs"
Code release for the paper "Style Vectors for Steering Generative Large Language Models", accepted to Findings of EACL 2024.
Representation Engineering: A Top-Down Approach to AI Transparency
Code and results accompanying the paper "Refusal in Language Models Is Mediated by a Single Direction".
Improving Alignment and Robustness with Circuit Breakers
The original implementation of "SELF-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection" by Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi.