Thanks for reaching out - this is all great feedback that we will definitely address. I will DM you about the vaccine implementation, as we are currently working on this as well, and to see what would be useful for code sharing, since we are a wee bit away from having a shareable replication of the whole paper.
Some answers:
Thanks for pointing this out - I think it's a critical point.
I'm not imagining anything in particular (and yes, in that paper we very much do "baby"-sized attacks).
Generally, yes, this is a problem we need to work out: what is the relationship between a defence's strength and the budget an attacker would need to overcome it, and, for large groups that have the budget to train from scratch, would defences here even make an impact?
I think you're right that the large-budget groups who can just train from scratch would simply not be impacted by defences of this nature...
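One crude way to frame it (my own back-of-the-envelope sketch, not something from the paper): write $C_{\text{attack}}$ for the cost of the attack with no defence in place, $C_{\text{break}}$ for the extra cost the defence adds, and $C_{\text{scratch}}$ for the cost of training a comparable model from scratch. An attacker with budget $B$ is only deterred when

$$C_{\text{attack}} \le B < \min\big(C_{\text{attack}} + C_{\text{break}},\; C_{\text{scratch}}\big),$$

so as soon as $B \ge C_{\text{scratch}}$ the defence changes nothing, which is exactly the large-budget case above.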
I think people underestimate formal study of research methods - reading textbooks or taking a course on research methodology - for improving research abilities.
There are many concepts within control, experimental design, validity, and reliability, like construct validity or conclusion validity, that you would learn from a research methods textbook and that are super helpful for improving the quality of research. I think many researchers implicitly learn these things without ever knowing exactly what they are, but that is usually through trial and error (peer review...
Thanks for the pointer! Yes, RL has a lot of research of this kind - as an empirical researcher I just sometimes get stuck in translation.
For my own clarity: What is the difference between mathematical approaches to alignment and other technical approaches like mechanistic interpretability work?
I imagine the focus is on in-principle arguments or proofs regarding the capabilities of a given system, rather than empirical or behavioural analysis, but you mention RL, so I just wanted to get some colour on this.
Any clarification here would be helpful!
You are more or less right. By "mathematical approaches", we mean approaches focused on building mathematical models relevant to alignment/agency/learning and finding non-trivial theorems (or at least conjectures) about these models. I'm not sure what the word "but" is doing in "but you mention RL": there is a rich literature of mathematical inquiry into RL. For a few examples, see everything under the bullet "reinforcement learning theory" in the LTA reading list.
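To give a sense of the flavour with one standard textbook example (not necessarily one from the list): for the stochastic $K$-armed bandit, the UCB algorithm satisfies a regret bound of the form

$$R_T \;=\; T\mu^* - \mathbb{E}\Big[\sum_{t=1}^{T}\mu_{a_t}\Big] \;=\; O\big(\sqrt{KT\log T}\big),$$

where $\mu^*$ is the mean reward of the best arm and $a_t$ is the arm played at round $t$. "Non-trivial theorems" above means results of this kind, for mathematical models aimed at agency and alignment rather than bandits.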
If someone did this, it would be nice to collect preference data over answers that are helpful to alignment and answers that are not… that could be a dataset that is interesting for a variety of reasons, like analyzing current models' abilities to help with alignment, gaps in being helpful w.r.t. alignment, and of course providing a mechanism for making models better at alignment… a model like this could also maybe work as a specialized type of Constitutional AI to collect feedback from the model's preferences that are more "alignment-aware", so to speak… n...
I'm struggling to understand how this is different from "we will build aligned AI to align AI". Specifically: Can someone explain to me how human-like AI and AGI are different? Can someone explain to me why human-like AI avoids typical x-risk scenarios (given that those human-like AIs could, say, clone themselves, speed themselves up, and rewrite their own software, and easily become unbounded)? Why isn't an emulated cognitive system a real cognitive system… I don't understand how you can emulate a human-like intelligence and have it not be the same as fully human-like. ...
What are your thoughts on prompt tuning as a mechanism for discovering optimal simulation strategies?
I know you mention condition generation as something to touch on in future posts, but I'd be eager to hear where you think prompt tuning comes in, considering continuous prompts are differentiable and so can be learned/optimized for specific simulation behaviour.
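To make concrete what I mean by learned/optimized, here is a minimal sketch of soft prompt tuning with a frozen Hugging Face causal LM (the model name, prompt length, and hyperparameters are placeholder choices of mine, not anything from the post):

```python
# Minimal sketch of soft prompt tuning: a frozen causal LM with a small set of
# learnable continuous prompt embeddings prepended to the input. Model name,
# prompt length, and hyperparameters below are illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM with input embeddings works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.requires_grad_(False)  # freeze the LM; only the soft prompt is trained

n_prompt_tokens = 20
# Learnable continuous prompt, initialized from existing vocabulary embeddings.
soft_prompt = torch.nn.Parameter(
    model.get_input_embeddings().weight[:n_prompt_tokens].detach().clone()
)
optimizer = torch.optim.Adam([soft_prompt], lr=1e-3)

def train_step(target_text: str) -> float:
    """One gradient step nudging the soft prompt toward producing target_text."""
    ids = tokenizer(target_text, return_tensors="pt").input_ids
    token_embeds = model.get_input_embeddings()(ids)
    # Prepend the soft prompt in embedding space (it has no discrete tokens).
    inputs_embeds = torch.cat([soft_prompt.unsqueeze(0), token_embeds], dim=1)
    # Mask the prompt positions out of the loss with the ignore index -100.
    labels = torch.cat(
        [torch.full((1, n_prompt_tokens), -100, dtype=torch.long), ids], dim=1
    )
    loss = model(inputs_embeds=inputs_embeds, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```

Only the prepended continuous prompt embeddings get gradient updates; the LM itself stays frozen, so in principle one could search this continuous space for prompts that induce a particular simulation behaviour.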
Hey there,
I was just wondering how you deal with the hallucination and faithfulness issues of large language models from a technical perspective? The user-experience perspective seems clear - you can give users control and consent over what Elicit is suggesting, and so on.
However, we know LLMs are prone to issues of faithfulness and factuality (Pagnoni et al. 2021, as one example for abstractive summarization), and this seems like it would be a big issue for research, where factual correctness is very important. In a biomedical scenario, if a user of Elicit gets a...
Thanks for this great work - I certainly agree with you on the limits of unlearning as it's currently conceived for safety. I do wonder if a paradigm of "preventing learning" is a way to get around these limits.