Kajus - LessWrong

I think that AI labs are going to use LoRA to lock cool capabilities in models and offer a premium subscription with those capabilities unlocked.

Kajus's Shortform

Kajus · 17d · 1 · -1

I recently came up with an idea to improve my red-teaming skills. By red-teaming, I mean identifying obvious flaws in plans, systems, or ideas.

First, find high-quality reviews on OpenReview or somewhere else. Then, create a dataset of papers and their reviews, preferably in a field that is easy to grasp yet sufficiently complex. Read each paper, then compare your own assessment with the reviews.

An obvious flaw is that you see the reviews beforehand while building the dataset, so you might want to have someone else assemble it. Doing this in a group also works really well.

Cross-context abduction: LLMs make inferences about procedural training data leveraging declarative facts in earlier training data

Kajus · 1mo · 2 · 0

If you treat the Pangolin behaviour and name as the control, it seems to go down more slowly than Axolotl. Also, I wouldn't really say that this throws a wrench in the cross-context abduction hypothesis (CCAH). I would state the CCAH like this:

An LLM will use the knowledge it gained via pre-training to minimize the loss during further training.

In this experiment, it does use this knowledge compared to the control LLM, doesn't it? At least it responds differently from the control LLM.

Cross-context abduction: LLMs make inferences about procedural training data leveraging declarative facts in earlier training data

Kajus · 1mo · 3 · 1

I'm trying to say that it surprised me that, even though the LLM went through both kinds of finetuning, it didn't start to self-identify as an axolotl even though it started to use words that start with vowels (if I understood it correctly).

Cross-context abduction: LLMs make inferences about procedural training data leveraging declarative facts in earlier training data

Kajus · 1mo · 3 · 1

Cool experiment! The 2b results are surprising to me. As I understood it, the LLM in 2b is (1) finetuned on the declarative dataset of 900 questions and answers, then (2) finetuned on the 7 datasets with an increasing proportion of answers containing words starting with vowels. Yet the LLM doesn't identify as an axolotl, even though it is trained to answer with vowels and "knows" that answers with vowels are connected to the axolotl. Interesting!

MATS Alumni Impact Analysis

Kajus · 1mo · 1 · 0

It would be really interesting to see how employment looks before and after the camp.

Complex Systems for AI Safety [Pragmatic AI Safety #3]

Kajus · 2mo · 1 · 0

Great post!

In the past, broad interventions would clearly have been more effective: for instance, there would have been little use in studying empirical alignment prior to deep learning. Even more recently than the advent of deep learning, many approaches to empirical alignment were highly deemphasized when large, pretrained language models arrived on the scene (refer to our discussion of creative destruction in the last post).

As discussed in the last post, a leading motivation for researchers is the interestingness or “coolness” of a problem. Getting more people to research relevant problems is highly dependent on finding interesting and well-defined subproblems for them to work on. This relies on concretizing problems and providing funding for solving them.

These two passages seem to give conflicting advice. If you try to follow both, you might have a hard time finding a direction for research.

Winning isn't enough

Kajus · 2mo · 2 · 1

I don't fully understand the post. Without a clear definition of "winning," the points you're trying to make — as well as the distinction between pragmatic and non-pragmatic principles (which also aligns with strategies and knowledge formation) — aren't totally clear. For instance, "winning," in some vague sense, probably also includes things like "fitting with evidence," taking advice from others, and so on. You don't necessarily need to turn to non-pragmatic principles or those that don’t derive from the principle of winning. "Winning" is a pretty loose term.

Kajus's Shortform

Kajus · 9mo · 1 · 0

I've just read "Against the singularity hypothesis" by David Thorstad, and there are some things in it that seem obviously wrong to me. I'm not totally sure about this, so I want to share it here, hoping that somebody else has read it as well. In the paper, Thorstad tries to refute the singularity hypothesis. In the last few chapters, he discusses the argument for x-risk from AI that is based on three premises: the singularity hypothesis, the Orthogonality Thesis, and Instrumental Convergence. He says that since the singularity hypothesis is false (or lacks proper evidence), we shouldn't worry that much about this specific scenario. It seems to me that we should still worry: we don't need recursively self-improving agents to get agents smart enough for instrumental convergence and the orthogonality thesis to apply to them.

AI things that are perhaps as important as human-controlled AI

Kajus · 10mo · 1 · 0

Interesting! Reading this makes me think that there is some kind of tension with the “paperclip maximizer” view of AI. Some interventions or risks you mentioned assume that AI will get its attitude from the training data, while the “paperclip maximizer” is an AI with just a goal and with whatever beliefs will help it achieve that goal. I guess the assumption is that the AI will be much more human in some way.
