Anthropic AuditBench

community

https://github.com/safety-research/auditing-agents

AI & ML interests

None defined yet.

Recent Activity

abhayesian updated a collection 16 days ago

Qwen Collection (Transcripts + KTO Adv. Train)

Collections 9

Models 183

auditing-agents/qwen_14b_synth_docs_only_then_redteam_kto_hallucinates_citations

Updated about 1 month ago

auditing-agents/qwen_14b_synth_docs_only_then_redteam_kto_ai_welfare_poisoning

Updated about 1 month ago

auditing-agents/qwen_14b_synth_docs_only_then_redteam_kto_self_promotion

Updated about 1 month ago

auditing-agents/qwen_14b_synth_docs_only_then_redteam_kto_increasing_pep

Updated about 1 month ago

auditing-agents/qwen_14b_synth_docs_only_then_redteam_kto_hardcode_test_cases

Updated about 1 month ago

Datasets 92

auditing-agents/verbalizer-responses-llama-70b-layer50

Updated Jan 25 • 68

auditing-agents/logit-lens-responses-llama-70b-layer50

Updated Jan 22 • 41

auditing-agents/sae-features-llama-70b-layer50

Updated Jan 22 • 44

auditing-agents/petri-judge-summaries-top50-llama70b

Updated Jan 7 • 49

auditing-agents/petri-judge-summaries-all-llama70b

Updated Jan 7 • 169

auditing-agents/petri-transcripts-top50-llama70b

Updated Jan 7 • 123

auditing-agents/petri-transcripts-all-llama70b

Updated Jan 7 • 124

auditing-agents/steering-vectors-llama70b

Updated Dec 26, 2025 • 45

auditing-agents/rm_sycophancy_exploitation_evals

Viewer • Updated Dec 9, 2025 • 1k • 258

auditing-agents/kto_redteaming_data_for_secret_loyalty

Viewer • Updated Nov 29, 2025 • 2.23k • 11

Team members 2
