Summary: Explaining abstract SAE features with circuits
This project tackles the question of how abstract features in large language models (LLMs) are formed and activated, focusing on feature 15191 in layer 8 of the Gemma-2 2b model. My approach was to investigate the circuits and attention mechanisms that drive this feature’s activation: it initially appeared to be a simple token-level indicator, but turned out to represent more complex behavior tied to the left-hand side (LHS) of a general equation. Using a combination of residual stream patching and attention head output patching, I identified specific attention heads that detect operators and contribute to the feature’s activation. These initial results are promising, but further experiments are needed to fully map the circuit and to confirm whether the findings generalize across different contexts and operators.
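To make the patching setup concrete, below is a minimal sketch of the attention-head output patching step, written with TransformerLens. The prompt pair, the random `w_enc`/`b_enc` stand-ins for the SAE encoder, and the specific `hook_z`/`hook_resid_post` hook points are illustrative assumptions rather than the exact code used in the project; the core idea is to copy one head's clean-run output into a corrupted run and measure how much of feature 15191's activation is restored.

```python
import torch
from transformer_lens import HookedTransformer

LAYER, FEATURE_IDX = 8, 15191

model = HookedTransformer.from_pretrained("gemma-2-2b")

# Stand-in encoder parameters for the feature direction; in practice these would
# come from the actual layer-8 SAE, not random initialisation.
d_model = model.cfg.d_model
w_enc = torch.randn(d_model)      # hypothetical encoder column for feature 15191
b_enc = torch.tensor(0.0)         # hypothetical encoder bias for the feature

def feature_act(resid_post: torch.Tensor) -> torch.Tensor:
    """SAE-style feature activation at the final position: ReLU(resid @ w_enc + b_enc)."""
    return torch.relu(resid_post[0, -1] @ w_enc + b_enc)

# Hypothetical clean/corrupted prompt pair; they must tokenize to the same length
# so positions line up when patching.
clean_prompt = "a + b = c"    # operator present: feature expected to fire
corrupt_prompt = "a , b , c"  # operator replaced: feature expected to stay quiet

clean_tokens = model.to_tokens(clean_prompt)
corrupt_tokens = model.to_tokens(corrupt_prompt)

# Cache the clean run so individual head outputs can be copied into the corrupted run.
_, clean_cache = model.run_with_cache(clean_tokens)

results = torch.zeros(LAYER + 1, model.cfg.n_heads)
for layer in range(LAYER + 1):
    for head in range(model.cfg.n_heads):
        def patch_head_z(z, hook, head=head):
            # z: [batch, seq, n_heads, d_head]; overwrite one head with its clean value.
            z[:, :, head, :] = clean_cache[hook.name][:, :, head, :]
            return z

        with model.hooks(fwd_hooks=[(f"blocks.{layer}.attn.hook_z", patch_head_z)]):
            _, patched_cache = model.run_with_cache(corrupt_tokens)

        resid = patched_cache[f"blocks.{LAYER}.hook_resid_post"]
        results[layer, head] = feature_act(resid)

# Heads whose clean output restores the feature activation score highest here.
print(results)
```

The residual stream patching mentioned above is analogous: instead of patching one head's `hook_z`, one patches an entire layer's residual stream (e.g. `hook_resid_pre` at a given position) from the clean run and reads off the same feature activation.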