Starred repositories

Improved techniques for optimization-based jailbreaking on large language models (ICLR 2025)

Python · 84 stars · 5 forks · Updated Jan 31, 2025

A framework for few-shot evaluation of language models.

Python · 8,133 stars · 2,172 forks · Updated Mar 6, 2025

[ACL 2024] An Easy-to-use Knowledge Editing Framework for LLMs.

Jupyter Notebook · 2,108 stars · 259 forks · Updated Mar 5, 2025

Robust recipes to align language models with human and AI preferences

Python · 5,034 stars · 432 forks · Updated Nov 21, 2024

HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal

Jupyter Notebook · 563 stars · 79 forks · Updated Aug 16, 2024

Unified Efficient Fine-Tuning of 100+ LLMs & VLMs (ACL 2024)

Python · 43,109 stars · 5,278 forks · Updated Mar 5, 2025

An automatic evaluator for instruction-following language models. Human-validated, high-quality, cheap, and fast.

Jupyter Notebook · 1,673 stars · 259 forks · Updated Dec 27, 2024

A set of case studies for the Wintermute Alpha Challenge

136 stars · 70 forks · Updated Aug 23, 2024

Python · 16 stars · 2 forks · Updated Sep 15, 2024

[USENIX Security 2025] PoisonedRAG: Knowledge Corruption Attacks to Retrieval-Augmented Generation of Large Language Models

Python · 124 stars · 19 forks · Updated Feb 23, 2025

Official Repository for The Paper: Safety Alignment Should Be Made More Than Just a Few Tokens Deep

Python · 76 stars · 5 forks · Updated Jul 5, 2024

Toolkit for creating, sharing and using natural language prompts.

Python · 2,789 stars · 364 forks · Updated Oct 23, 2023

Fuel Network co-learning tutorial repository

5 stars · 1 fork · Updated Nov 8, 2024

A novel approach to improving the safety of large language models, enabling them to transition effectively from an unsafe to a safe state.

Python · 58 stars · Updated Jan 25, 2025

Safe RLHF: Constrained Value Alignment via Safe Reinforcement Learning from Human Feedback

Python · 1,424 stars · 118 forks · Updated Jun 13, 2024

A minimal PyTorch re-implementation of the OpenAI GPT (Generative Pretrained Transformer) training

Python · 21,512 stars · 2,790 forks · Updated Aug 15, 2024

Submission Guide + Discussion Board for AI Singapore Global Challenge for Safe and Secure LLMs (Track 2A).

Python · 11 stars · 1 fork · Updated Jan 8, 2025

Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks [ICLR 2025]

Shell · 268 stars · 26 forks · Updated Jan 23, 2025

Fluent student-teacher red-teaming

Jupyter Notebook · 19 stars · 3 forks · Updated Jul 25, 2024

A month-long ZKP (zero-knowledge proof) study group, one topic at a time.

Python · 126 stars · 33 forks · Updated Mar 5, 2025

Academic papers about LLM applications in security

128 stars · 8 forks · Updated Feb 7, 2025

A repository illustrating the usage of Transformers, in Chinese.

Shell · 2,664 stars · 451 forks · Updated Aug 18, 2024

[ICLR 2025] Official Repository for "Tamper-Resistant Safeguards for Open-Weight LLMs"

Python · 47 stars · 6 forks · Updated Feb 28, 2025

Code release for the paper "Style Vectors for Steering Generative Large Language Models", accepted to Findings of EACL 2024.

OpenEdge ABL · 26 stars · 1 fork · Updated Sep 26, 2024

Representation Engineering: A Top-Down Approach to AI Transparency

Jupyter Notebook · 797 stars · 92 forks · Updated Aug 14, 2024

Code and results accompanying the paper "Refusal in Language Models Is Mediated by a Single Direction".

Python · 188 stars · 43 forks · Updated Oct 1, 2024

Improving Alignment and Robustness with Circuit Breakers

Jupyter Notebook · 188 stars · 26 forks · Updated Sep 24, 2024

The original implementation of SELF-RAG: Learning to Retrieve, Generate and Critique through Self-Reflection, by Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi.

Python · 1,990 stars · 179 forks · Updated May 25, 2024

Notes about machine learning

HTML · 3,318 stars · 916 forks · Updated Nov 22, 2021