Hadas Orgad

I’m a Research Fellow at the Kempner Institute for the Study of Natural and Artificial Intelligence at Harvard University, where I study the internals of AI systems. My research explores how interpretability can be used as a strategic tool to improve their robustness, safety, and trustworthiness. I work on problems related to hallucinations, bias, and unsafe outputs, with the broader goal of creating models that are both powerful and responsible.

Past: I completed my Ph.D. at the Technion – Israel Institute of Technology, supervised by Yonatan Belinkov. Before that, I spent 3.5 years at Microsoft, where I worked on AI solutions for cloud security and on applying NLP to security-related problems. I hold both my B.Sc. and M.Sc. degrees from the Technion. I was selected as a 2023 Apple Scholar in AI/ML and previously interned at Apple. During my master’s, I received the 2022 EMEA Generation Google Scholarship.

I’m always happy to connect with others who are excited about AI interpretability — feel free to reach out if you’d like to brainstorm or collaborate.

Email: orgadhadas at gmail dot com


Publications

Inside-out: Hidden Factual Knowledge in LLMs

Zorik Gekhman, Eyal Ben David, Hadas Orgad, Eran Ofek, Yonatan Belinkov, Idan Szpektor, Jonathan Herzig, Roi Reichart

COLM 2025

This work introduces a framework to measure "hidden knowledge" in large language models—cases where models internally know the correct answer but fail to express it in their outputs. By comparing internal and external knowledge across three LLMs, we find a consistent 40% gap, with some answers never generated despite perfect internal knowledge.

Code Arxiv Cite

LLMs Know More Than They Show: On the Intrinsic Representation of LLM Hallucinations

Hadas Orgad, Michael Toker, Zorik Gekhman, Roi Reichart, Idan Szpektor, Hadas Kotek, Yonatan Belinkov

ICLR 2025

This work shows that large language models internally encode rich information about the truthfulness of their outputs, concentrated in specific tokens—enabling strong error detection and even prediction of error types. However, this encoding is not universal across datasets. Additionally, models may still produce incorrect answers despite internally encoding the correct one.
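As a rough illustration of what such an error-detection probe can look like in practice, here is a minimal sketch, assuming a Hugging Face causal LM and a pre-built set of (question, model answer, is-correct) triples; the model name, the probed layer, and the use of the answer's final token as the probed position are illustrative assumptions rather than the paper's exact protocol.

```python
# Minimal sketch: linear probe on hidden states for error detection.
# Assumptions (not the paper's exact protocol): model choice, probed layer, and
# probing the answer's final token rather than the exact answer tokens.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

MODEL_NAME = "mistralai/Mistral-7B-Instruct-v0.2"  # illustrative choice
LAYER = 16                                          # illustrative middle layer

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.bfloat16, device_map="auto"
).eval()

def answer_state(question: str, answer: str) -> torch.Tensor:
    """Hidden state at the answer's final token (stand-in for exact answer tokens)."""
    ids = tok(question + " " + answer, return_tensors="pt").to(model.device)
    with torch.no_grad():
        hidden = model(**ids, output_hidden_states=True).hidden_states[LAYER]
    return hidden[0, -1].float().cpu()

def train_error_probe(qa_triples):
    """qa_triples: list of (question, model_answer, is_correct) built beforehand."""
    X = torch.stack([answer_state(q, a) for q, a, _ in qa_triples]).numpy()
    y = [int(correct) for _, _, correct in qa_triples]
    return LogisticRegression(max_iter=1000).fit(X, y)  # linear truthfulness probe
```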

Project Page Code Arxiv Cite

MIB: A Mechanistic Interpretability Benchmark

Aaron Mueller, Atticus Geiger, Sarah Wiegreffe, Dana Arad, Iván Arcuschin, Adam Belfki, Yik Siu Chan, Jaden Fiotto-Kaufman, Tal Haklay, Michael Hanna, Jing Huang, Rohan Gupta, Yaniv Nikankin, Hadas Orgad, Nikhil Prakash, Anja Reusch, Aruna Sankaranarayanan, Shun Shao, Alessandro Stolfo, Martin Tutek, Amir Zur, David Bau, Yonatan Belinkov

ICML 2025

MIB is a new benchmark designed to evaluate whether mechanistic interpretability methods truly improve our understanding of language models. It includes two tracks, circuit localization and causal variable identification, spanning multiple tasks and models.

Project Page Code Arxiv Cite

Padding Tone: A Mechanistic Analysis of Padding Tokens in T2I Models

Michael Toker, Ido Galil, Hadas Orgad, Rinon Gal, Yoad Tewel, Gal Chechik, Yonatan Belinkov

NAACL 2025

This work presents the first in-depth analysis of how padding tokens affect text-to-image generation. Using two causal techniques, we find that padding tokens can influence image generation at different stages—during text encoding, the diffusion process, or not at all—depending on model architecture and training setup.

Project Page Code Arxiv Cite

Position-Aware Automatic Circuit Discovery

Tal Haklay, Hadas Orgad, David Bau, Aaron Mueller, Yonatan Belinkov

ACL 2025

We present an automated, position-aware circuit discovery pipeline that differentiates between token positions and uses dataset schemas to capture cross-positional mechanisms, yielding circuits that are more faithful at smaller sizes.

Project Page Code Arxiv Cite

Diffusion Lens: Interpreting Text Encoders in Text-to-Image Pipelines

Michael Toker, Hadas Orgad, Mor Ventura, Dana Arad, Yonatan Belinkov

ACL 2024

We introduce Diffusion Lens, a method for analyzing the intermediate representations of text encoders in text-to-image models. Using it, we perform an extensive analysis of two recent T2I models and gain insights into both conceptual combination and knowledge retrieval in these models.
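As a rough sketch of the idea, the snippet below generates an image from an intermediate layer of the text encoder instead of its final output, assuming a Hugging Face diffusers Stable Diffusion pipeline; the checkpoint, the layer index, and the reuse of the encoder's final layer norm are illustrative assumptions rather than the paper's exact setup.

```python
# Minimal sketch: render an image from an intermediate text-encoder layer.
# Assumptions (not the paper's exact setup): checkpoint, layer index, and
# applying the encoder's final layer norm to the intermediate representation.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompt = "a corgi wearing a red scarf"
inputs = pipe.tokenizer(
    prompt,
    padding="max_length",
    max_length=pipe.tokenizer.model_max_length,
    return_tensors="pt",
).to("cuda")

layer = 6  # intermediate layer of the CLIP text encoder (illustrative)
with torch.no_grad():
    enc = pipe.text_encoder(inputs.input_ids, output_hidden_states=True)
    prompt_embeds = pipe.text_encoder.text_model.final_layer_norm(enc.hidden_states[layer])

# Condition the diffusion model on the intermediate representation directly.
image = pipe(prompt_embeds=prompt_embeds).images[0]
image.save(f"diffusion_lens_layer_{layer}.png")
```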

Project Page Code Arxiv Cite

ReFACT: Updating Text-to-Image Models by Editing the Text Encoder

Dana Arad*, Hadas Orgad*, Yonatan Belinkov

NAACL 2024

Text-to-image models are trained on extensive amounts of data, leading them to implicitly encode factual knowledge within their parameters. While some facts are useful, others may be incorrect or become outdated (e.g., the current President of the United States). We introduce ReFACT, a method for updating factual knowledge in text-to-image models. ReFACT edits the weights of a specific layer in the text encoder, modifying only a tiny portion of the model's parameters and leaving the rest of the model unaffected.

Project Page Code Arxiv Cite

Unified Concept Editing in Diffusion Models

Rohit Gandikota, Hadas Orgad, Yonatan Belinkov, Joanna Materzyńska, David Bau

IEEE/CVF WACV 2024

Text-to-image models suffer from various safety issues that may limit their suitability for deployment. Previous methods have separately addressed individual issues of bias, copyright, and offensive content in text-to-image models. However, in the real world, all of these issues appear simultaneously in the same model. In this paper, we present a single approach that tackles all of these diverse issues at once.

Project Page Code Arxiv Cite

Editing Implicit Assumptions in Text-to-Image Diffusion Models

Hadas Orgad*, Bahjat Kawar*, Yonatan Belinkov

ICCV 2023

Text-to-image diffusion models often make implicit assumptions about the world when generating images. While some assumptions are useful (e.g., the sky is blue), they can also be outdated, incorrect, or reflective of social biases present in the training data. In this work, we aim to edit a given implicit assumption in a pre-trained diffusion model. Our method is highly efficient, as it modifies a mere 2.2% of the model's parameters in under one second.

Project Page Code Arxiv Cite

Debiasing NLP Models Without Demographic Information

Hadas Orgad, Yonatan Belinkov

ACL 2023

In this work, we propose a debiasing method that operates without any prior knowledge of the demographics in the dataset, detecting potentially biased examples with an auxiliary model that predicts the main model's success and down-weighting those examples during training. Results on racial and gender bias demonstrate that it is possible to mitigate social biases without a costly demographic annotation process.
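For a concrete sense of the idea, here is a minimal sketch of a demographics-free reweighting setup; the SuccessPredictor architecture, the features it sees, and the (1 - p_success) weighting rule are illustrative assumptions, not the paper's exact formulation.

```python
# Minimal sketch: down-weight examples flagged by an auxiliary success predictor.
# Assumptions (not the paper's exact formulation): predictor architecture,
# input features, and the (1 - p_success) weighting rule.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SuccessPredictor(nn.Module):
    """Auxiliary model: predicts whether the main model will succeed on an example."""
    def __init__(self, feature_dim: int):
        super().__init__()
        self.head = nn.Linear(feature_dim, 1)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.head(features)).squeeze(-1)  # p_success in [0, 1]

def debiased_loss(main_logits, labels, features, success_predictor):
    """Down-weight examples whose success the auxiliary model predicts confidently;
    those are the ones most likely solvable through biased shortcuts."""
    per_example = F.cross_entropy(main_logits, labels, reduction="none")
    with torch.no_grad():
        p_success = success_predictor(features)
    weights = 1.0 - p_success                  # high predicted success -> low weight
    return (weights * per_example).mean()
```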

Code Arxiv Cite

Choose Your Lenses: Flaws in Gender Bias Evaluation

Hadas Orgad, Yonatan Belinkov

GeBNLP 2022

Considerable efforts to measure and mitigate gender bias in recent years have led to an abundance of tasks, datasets, and metrics. In this position paper, we assess the current paradigm of gender bias evaluation and identify several flaws in it.

Arxiv Cite

How Gender Debiasing Affects Internal Model Representations, and Why It Matters

Hadas Orgad, Seraphina Goldfarb-Tarrant, Yonatan Belinkov

NAACL 2022

Common studies of gender bias in NLP focus either on extrinsic bias, measured by model performance on a downstream task, or on intrinsic bias, found in models’ internal representations. However, the relationship between extrinsic and intrinsic bias is not well understood. In this work, we illuminate this relationship by measuring both quantities together.

Code Arxiv Cite