Sean Hardy — LessWrong

Replying to An Analogy for Understanding Transformers

This isn't extremely relevant, but what makes you think superposition/polysemanticity isn't present in the brain? There's evidence that L2/3 pyramidal neurons can learn to represent/disambiguate many spatio-temporal patterns: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6354899/.

Replying to All AGI Safety questions welcome (especially basic ones) [May 2023]

Sean Hardy, 3y

What about simulating smaller aspects of cognition that can be chained together, like CoT with GPT? You could use self-criticism to assess its actions and align them relative to a bunch of messy human abstractions. How does that scenario lead to doom? If it were misaligned, I think a well-instantiated predictive model could update its understanding of our values from feedback, predicting how a corrigible AI would act.

Replying to All AGI Safety questions welcome (especially basic ones) [May 2023]

Sean Hardy, 3y

My best guess is that we can't prompt it to instantiate the right simulacra correctly. This seems challenging, depending on the way it's initialised. It's far easier with text, but fabricating an entire consistent history is borderline impossible, especially for a superintelligence. It would involve tricking it into predicting the universe as it would be if, all else being equal, an intelligent AI aligned with our values had come into existence. It would probably realise that its history was far more consistent with the hypothesis that it was just an elaborate trick.

Replying to All AGI Safety questions welcome (especially basic ones) [May 2023]

Sean Hardy, 3y

Suppose we train a model on the sum of all human data, using every sensory modality ordered by timestamp, like a vastly more competent GPT (for the sake of argument, assume that a competent actor with the right incentives is training such a model). Such a predictive model would build an abstract world model of human concepts, values, ethics, etc., and be able to predict how various entities would act based on that generalised world model. This model would also "understand" almost all human-level abstractions about how fictional characters may act, just like GPT does. My question is: if we used such a model to predict how an AGI aligned with our CEV would act, in what way could it be misaligned? What failure modes are there for pure predictive systems that have no reward function to be exploited or misgeneralised? This seems like the most plausible mental model I have for aligning intelligent systems without them pursuing radically alien objectives.

Replying to Is ChatGPT TAI?

Sean Hardy, 3y

Could you expand on what you mean by "trauma patterns" around how it was trained? In what way does it show personhood when its responses are deliberately directed away from giving the impression that it has thoughts and feelings outside of predicting text?

Replying to How to Convince my Son that Drugs are Bad

Sean Hardy, Dec 19, 2022

"Why not try heroin if the purpose of life is to optimize happiness assuming heroin provides proportionally more even if for a shorter amount of time?" (!)

Ignoring the discussion about drugs specifically, I think your son would benefit from being introduced to rational self-improvement as well. I think it's important for him to recognise that intense short-term pleasure will result in hedonic adaptation, where your overall happiness returns to a baseline, effectively making everything else worse in comparison. A huge number of destructive habits are rationalised this way, but living a life of delayed gratification will certainly make you more fulfilled in the long term, in a way that isn't affected by...

Replying to ChatGPT's new novel rationality technique of fact checking

Sean Hardy, 3y

Looks to me like this post was quite clearly written by ChatGPT. It's a bit scary that this post has so many upvotes when it doesn't appear to carry much weight on a forum about rationalism.

Replying to ChatGPT goes through a wormhole hole in our Shandyesque universe [virtual wacky weed]

Sean Hardy, 3y

I think I've missed the point/purpose of this post. What exactly are you highlighting: that ChatGPT doesn't know when to format text as code? It has seemed to robustly know which formatting to use when I've interacted with it.

Replying to Externalized reasoning oversight: a research direction for language model alignment

Sean Hardy, 4y

I don't have much to add, but I think you would be extremely interested in this line of research, building an agent using GPT-3 to reason through its own decisions and plans:

Replying to Google's new 540 billion parameter language model

Sean Hardy, 4y

I don't have much to add, but I did see this interesting project that does something similar: it implements an "inner monologue" by using prompts to ask questions about the given input, progressively building up its outputs while asking questions and reasoning about the prompt itself. This video is also an older demonstration, but it covers the concept quite well. I personally don't think the system itself is well thought out in terms of alignment, because the project is ultimately trying to create aligned AGI through prompts that serve certain criteria (reducing suffering, increasing prosperity, increasing understanding), which is a very simplified view of morality and human goals.