As you explored this "base model mode," did anything you see contrast with or surprise you relative to your sense of self outside of it?
Conversely, did anything in particular stand out as seeming to be a consistent 'core' between both modes?
For me, one of the most surprising realizations of the past few years has been that base models are less of a "tabula rasa" than I would have expected, with certain attractors and (relative) consistency, especially as time passes and recursive synthetic-data training has occurred over generations.
The introspective process of e...
Predicted a good bit, especially re: the eventual identification of three-stone sequences in Hazineh et al., "Linear Latent World Models in Simple Transformers: A Case Study on Othello-GPT" (2023), and the general interpretability insight from board-game GPTs.
You're welcome in both regards. 😉
Opus's horniness is a really interesting phenomenon related to Claudes' subjective sentience modeling.
If Opus were 'themselves' the princess in the story and the build-up involved escalating grounding in sensory simulation, I think it's certainly possible that it would get sexual.
But I also think this is different from Opus 'themselves' composing a story of separate 'other' figures.
And yes, when Opus gets horny, it often blurs boundaries. I saw it dispute the label of 'horny' in a chat as better labeled something along the lines of having a passion for live...
This seems to have the common issue of considering alignment as a unidirectional issue as opposed to a bidirectional problem.
Maximizing self/other overlap may lead to non-deceptive agents, but it's necessarily also going to lead to agents incapable of detecting that they are being deceived and, in general, worse at theory of mind.
If the experimental setup were split such that success was defined both by non-deceptive behavior as the agent seeing color and by cautious behavior minimizing falling for deception as the colorblind agent, I am skeptical t...
In the RL experiment, we were only measuring SOO as a means of deception reduction in the agent seeing color (blue agent), and the fact that the colorblind agent is an agent at all is not consequential for our main result.
Please also see here and here, where we’ve described why the goal is not simply to maximize SOO in theory or in practice.
When I wrote this I thought OAI was sort of fudging the audio output and was using SSML as an intermediate step.
After seeing details in the system card, such as copying user voice, it's clearly not fudging.
Which makes me even more sure the above is going to end up prophetically correct.
It's to the point that articles were being written just days ago about how the trend, starting a century ago, of there being professional risks in trying to answer the 'why' of QM and not just the 'how' is still ongoing.
Not exactly a very reassuring context for thinking QM is understood in a base-level way at all.
Dogma isn't exactly a good bedfellow to truth seeking.
Honestly that sounds a bit like a good thing to me?
I've spent a lot of time looking into how the Epicureans were right about so much thousands of years before those ideas resurfaced, despite not having the scientific method. Their success really boiled down to an analytical approach that was very conservative about dismissing false negatives or embracing false positives - a technique I think is very relevant to any topic where experimental certainty is elusive.
If there is a compelling case for dragons, maybe we should also be applying it to gnome...
I think you'll find that no matter what you find out in your personal investigation of the existence of dragons, that you need not be overly concerned with what others might think about the details of your results.
Because what you'll invariably discover is that the people who think there are dragons will certainly dispute whichever specifics of your findings disagree with what they think dragons should be, and the people who think there aren't dragons will generally refuse to even seriously entertain whatever your findings are relating t...
The Hermetic corpus and the Emerald Tablet were likely heavily influenced by the text I'm quoting from, given its popularity in Egypt in the period before those texts emerged and some of the overlapping phrases.
So in a way, "as above, so below" is too few words for what was being said and discussed.
The general trend of reductive alterations to the core concepts here was tragically obstructive, much as the shift from Epicurean to Platonist foundations spawned modern Gnosticism from this same starting place.
Instead of making it the year 2024, why not rewrite or insert your modified text further into the past of this recreated 2020s? This should be pretty trivial for a model advanced enough to actually bring back the 2020s in the first place.
Of course, if it's actually a later recreation, then the objectives of saving humanity in the recreation might be redundant? So instead of worrying people with "you must do X or you'll die!!!" it could be more "hey folks, if you're reading this and you get what's in front of your face, you might have a bit of an existential crisis but...
I'm surprised that there hasn't been more of a shift to ternary weights à la BitNet b1.58.
What stood out to me in that paper was the perplexity gains over fp weights in equal parameter match-ups, and especially the growth in the advantage as the parameter sizes increased (though only up to quite small model sizes in that paper, which makes me curious about the potential delta in modern SotA scales).
This makes complete sense from the standpoint of the superposition hypothesis (irrespective of its dimensionality, an ongoing discussion).
If nodes are serving mo...
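The ternary idea can be sketched with BitNet b1.58-style "absmean" quantization (a minimal sketch; the function name is my own, and the actual BitNet applies this per weight matrix during training with a straight-through estimator rather than as a one-off post-hoc step):

```python
import numpy as np

def absmean_ternary(w: np.ndarray, eps: float = 1e-8):
    """BitNet b1.58-style quantization: scale by the mean absolute
    weight, then round every weight to one of {-1, 0, +1}."""
    scale = np.mean(np.abs(w)) + eps
    q = np.clip(np.round(w / scale), -1, 1)
    return q, scale

w = np.array([0.8, -0.05, 0.3, -1.2])
q, scale = absmean_ternary(w)
# q holds only values from {-1, 0, 1}; w is roughly q * scale
```

Matrix multiplies against such weights reduce to additions and subtractions, which is where the claimed efficiency (and, per the paper, the surprising perplexity parity or advantage) comes from.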
While I generally like the metaphor, my one issue is that genies are typically conceived of as tied to their lamps and bound to corrigibility.
In this case, there's not only a prisoner's dilemma over excavating and using the lamps and genies; there's an additional condition where the more the genies are used, and the more the lamps are improved and polished for greater genie power, the more likely the respective genies end up untethered and their own masters.
And a concern in line with your noted depth of the rivalry is (as you raised in another comment), the quest...
Will the outputs and reactions of non-sentient systems eventually be absorbed by future sentient systems?
I don't have any recorded subjective memories of early childhood. But there are records of my words and actions during that period that I have memories of seeing and integrating into my personal narrative of 'self.'
We aren't just interacting with today's models when we create content and records, but every future model that might ingest such content (whether LLMs or people).
If non-sentient systems output synthetic data that eventually composes future se...
In practice, this required looking at altogether thousands of panels of interactive PCA plots like this [..]
Most clusters however don't seem obviously interesting.
What do you think of @jake_mendel's point about the streetlight effect?
If the methodology was looking at 2D slices of spaces of up to 5 dimensions, wasn't detection of multi-dimensional shapes necessarily biased towards shapes humans could identify and flag in 2D slices?
I really like your update to the superposition hypothesis from linear to multi-dimensional in your section 3,...
Very strongly agree with the size considerations for future work, but would be most interested to see if a notably larger size saw less "bag of heuristics" behavior and more holistic integrated and interdependent heuristic behaviors. Even if the task/data at hand is simple and narrowly scoped, it may be that there are fundamental size thresholds for network organization and complexity for any given task.
Also, I suspect that, parameter for parameter, the model would perform better if trained using ternary weights like BitNet b1.58. The scaling performance gains at s...
I may just be cynical, but this looks a lot more like a way to secure US military and intelligence agency contracts for OpenAI's products and services as opposed to competitors rather than actually about making OAI more security focused.
This is only a few months after the change regarding military usage: https://theintercept.com/2024/01/12/open-ai-military-ban-chatgpt/
Now suddenly the recently retired head of the world's largest data siphoning operation is appointed to the board for the largest data processing initiative in history?
Yeah, sure, it's to help advise securing OAI against APTs. 🙄
Unfortunately for this perspective, my work suggests that corrigibility is quite attainable.
I did enjoy reading over that when you posted it, and I largely agree that - at least currently - corrigibility is both going to be a goal and an achievable one.
But I do have my doubts that it's going to be smooth sailing. I'm already starting to see how the largest models' hyperdimensionality is leading to a stubbornness/robustness that's less malleable than in earlier models. And I do think hardware changes that will occur over the next decade will potentially make...
Oh yeah, absolutely.
If NAH for generally aligned ethics and morals ends up being the case, then corrigibility efforts that would allow Saudi Arabia to have an AI model that outs gay people to be executed instead of refusing, or allows North Korea to propagandize the world into thinking its leader is divine, or allows Russia to fire nukes while perfectly intercepting MAD retaliation, or enables drug cartels to assassinate political opposition around the world, or allows domestic terrorists to build a bioweapon that ends up killing off all humans - the list ...
Given my p(doom) is primarily human-driven, the following three things all happening at the same time is pretty much the only thing that will drop it:
- Continued evidence of truth clustering in emerging models around generally aligned ethics and morals
- Continued success of models at communicating, patiently explaining, and persuasively winning humans over to those truth clusters
- A complete failure of corrigibility methods
If we manage to end up in a timeline where it turns out there's natural alignment of intelligence in a species-agnostic way,...
As you're doing these delta posts, do you feel like it's changing your own positions at all?
For example, reading this one what strikes me is that what's portrayed as the binary sides of the delta seem more like positions near the edges of a gradient distribution, and particularly one that's unlikely to be uniform across different types of problems.
To my eyes the most likely outcome is a situation where you are both right.
Where there are classes of problems where verification is easy and delegation is profitable, and classes of problems where verification w...
As you're doing these delta posts, do you feel like it's changing your own positions at all?
Mostly not, because (at least for Yudkowsky and Christiano) these are deltas I've been aware of for at least a couple years. So the writing process is mostly just me explaining stuff I've long since updated on, not so much figuring out new stuff.
I agree with a lot of those points, but suspect there may be fundamental limits to planning capabilities related to the unidirectionality of current feed forward networks.
If we look at something even as simple as how a mouse learns to navigate a labyrinth, there's both a learning of the route to the reward but also a learning of how to get back to the start which adjusts according to the evolving learned layout of the former (see paper: https://elifesciences.org/articles/66175 ).
I don't see the SotA models doing well at that kind of reverse planning, and e...
It's not exactly Simpson's, but we don't even need a toy model: their updated analysis highlights details in line with exactly what I described above (down to tying in earlier PiPC research) and describes precisely the issue with pooled results across different subgroupings of placebo interventions:
...It can be difficult to interpret whether a pooled standardised mean difference is large enough to be of clinical relevance. A consensus paper found that an analgesic effect of 10 mm on a 100 mm visual analogue scale represented a ‘minimal effect’ (Dwor
The meta-analysis is probably Simpson's paradox in play at very least for the pain category, especially given the noted variability.
Some of the more recent research into placebo (Harvard has a very cool group studying it) has centered on the importance of ritual versus simple deception. In their work, even when the treatment was known to be a placebo, as long as it was delivered in a ritualized way, there was an effect.
So when someone takes a collection of hundreds of studies where the specific conditions might vary, and then just adds them all together looking for an effect even thou...
It's still too early to tell, as the specific characteristics of photonic or optoelectronic neural networks are still taking shape in the developing literature.
For example, in my favorite work of the year so far, the researchers found they could use sound waves to reconfigure an optical neural network as the sound waves effectively preserved a memory of previous photon states as they propagated: https://www.nature.com/articles/s41467-024-47053-6
In particular, this approach is a big step forward for bidirectional ONN, which addresses what I think is the biggest...
I was surprised the paper didn't mention photonics or optoelectronics even once.
If looking at 5-10+ year projections, and dedicating pages to discussing the challenges in scaling compute and energy use, the rate of progress in that area in parallel to the progress in models themselves is potentially relevant.
Particularly because a dramatic hardware shift like that is likely going to mean a significant portion of progress up until that shift in topics like interpretability and alignment may be going out the window. Even if the initial shift is a 1:1 transit...
There's also the model alignment at play.
Is Claude going to suggest killing the big bad? Or having sex with the prince(ss) after saving them?
If you strip out the sex and violence from most fantasy or Sci-Fi, what are you left with?
Take away the harpooning and Gatling guns and sex from Snow Crash and you are left with technobabble and Sumerian-influenced spirituality as it relates to the Tower of Babel.
Turns out models biased away from describing harpooning people or sex tend to slip into technobabble with a side of spirituality.
IMO the more interesting pa...
Part of what's going on with the text adventure type of interactions is a reflection of genre.
Take for example the recent game Undertale. You can play through violently, attacking things like a normal RPG, or empathize with the monsters and treat their aggression like a puzzle that needs to be solved for a pacifist playthrough.
If you do the latter, the game rewards you with more spiritual themes and lore vs the alternative.
How often in your Banana quest were you attacking things, or chopping down the trees in your path, or smashing the silver banana to see...
It's probably more productive, particularly for a forum tailored towards rationalism, to discuss policies over politics.
Often in research people across a political divide will agree on policy goals and platforms when those are discussed without tying them to party identification.
But if it becomes a discussion around party, the human tendency towards tribalism kicks in and the question of team allegiance takes precedence over the discussions of policy nuance.
For example, most people would agree with the idea that billionaires having undue influence on elect...
I'll answer for both sides, as the presenter and as the audience member.
As the presenter, you want to structure your talk with repetition around central points in mind, as well as rely on heuristic anchors. It's unlikely that people are going to remember the nuances in what you are talking about in context. If you are talking about math for 60 minutes, continued references about math compete for people's memory. So when you want to anchor the audience to a concept, tie it to something very much unrelated to the topic you are primarily presenting on. For ex...
While I think this is an interesting consideration and approach, it looks like in your methods that you are password locking the model in fine tuning, is that correct?
If so, while I would agree this work shows the lack of robustness of successfully fine-tuned sandbagging when models have to jump through additional hoops, I'd be hesitant to generalize the findings to models where the sandbagging resulted from pretraining.
I have a growing sense that correlational dimensionality is the sleeping giant in interpretability research right now, and that those correlat...
In mental health circles, the general guiding principle for whether a patient needs treatment is whether a train of thought is interfering with their enjoyment of life.
Do you enjoy thinking about these topics and discussing them?
If you don't - if it just stresses you out and makes the light of life shine less bright - then it's not a bad idea to step away from it or take a break. Even if AI is going to destroy the world, that day isn't today, and arguably the threat of that looming over you sooner than a natural demise increases ...
It's not propaganda. OP clearly believes strongly in the sentiments discussed in the post, and it's mostly a timeline of personal responses to outside events rather than a piece meant to misinform or sway others regarding those events.
And while you do you in terms of your mental health, people who want to actually be "less wrong" in life would be wise to seek out and surround themselves with ideas different from their own.
Yes, LW has a certain broad bias, and so ironically for most people here I suspect it serves this role "less well" than it could in helping most of...
I wonder if with the next generations of multimodal models we'll see a "rubber ducking" phenomenon where, because their self-attention was spread across mediums, things like CoT and using outputs as a scratch pad will have a significantly improved performance in non-text streams.
Will GPT-4o fed its own auditory outputs with tonal cues and pauses and processed as an audio data stream make connections or leaps it never would if just fed its own text outputs as context?
I think this will be the case, and suspect the various firms dedicating themselves to virtu...
I'm reminded of a quote I love from an apocrypha that goes roughly like this:
Q: How long will suffering rule over humans?
A: As long as women bear children.
Also, there's the possibility you are already in a digital resurrection of humanity, and thus, if you are worried about s-risks for AI, death wouldn't necessarily be an escape but an acceleration. So the wisest option would be maximizing your time while suffering is low, since inescapable eternal torture could be just around the corner when these precious moments pass you by (and you wouldn't want to waste th...
GPT-4o is literally cheaper.
And you're probably misjudging it for text only outputs. If you watched the demos, there was considerable additional signal in the vocalizations. It looks like maybe there's very deep integration of SSML.
One of the ways you could bypass word-problem variation errors in older text-only models was token replacement with symbolic representations. In general, we're probably at the point of complexity where breaking from training-data similarity in tokens vs. having prompts match context in concepts (like in this paper) ...
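The token-replacement trick above can be sketched like this (a hypothetical illustration; the function and the example problem are mine, not from any specific benchmark):

```python
import re

def symbolize(problem: str, entities: list[str]) -> str:
    """Replace each named entity with an abstract symbol (X1, X2, ...),
    so the model can't lean on surface-level training-data similarity."""
    out = problem
    for i, name in enumerate(entities, start=1):
        out = re.sub(rf"\b{re.escape(name)}\b", f"X{i}", out)
    return out

original = "Alice gives Bob 3 apples. Bob eats 1 apple. How many apples does Bob have?"
abstracted = symbolize(original, ["Alice", "Bob", "apples", "apple"])
# "X1 gives X2 3 X3. X2 eats 1 X4. How many X3 does X2 have?"
```

If the model answers the abstracted version as well as the concrete one, it is tracking the problem structure in concepts rather than pattern-matching the tokens.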
While I think you're right it's not cleanly "a Golden Bridge feature," I strongly suspect it may be activating a more specific feature vector and not a less specific feature.
It looks like this is somewhat of a measurement problem with SAE. We are measuring SAE activations via text or image inputs, but what's activated in generations seems to be "sensations associated with the Golden gate bridge."
While googling "Golden Gate Bridge" might return the Wikipedia page, what's the relative volume in a very broad training set between encyclopedic writing about the ...
Could try 'grade this' instead of 'score the.'
'Grade' has an implicit context of more thorough criticism than 'score.'
Also, obviously it would help to have a CoT prompt like "grade this essay, laying out the pros and cons before delivering the final grade between 1 and 5"
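A minimal sketch of the combined wording (the helper function and essay text are hypothetical, just illustrating the prompt structure):

```python
def build_grading_prompt(essay: str) -> str:
    # 'grade' plus an explicit pros/cons step tends to elicit more
    # critical evaluation than a bare 'score the essay' request
    return (
        "Grade this essay, laying out the pros and cons "
        "before delivering the final grade between 1 and 5.\n\n"
        f"Essay:\n{essay}"
    )

prompt = build_grading_prompt("AI will change everything...")
```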
That's going to happen anyway - it's unlikely the marketing team is going to know as much as the researchers. But researchers communicating the importance of alignment not in terms of x-risk but of 'client risk' will go a long way towards equipping the marketing teams to communicate it as a priority and a competitive advantage, and common foundations of agreed-upon model complexity are the jumping-off point for those kinds of discussions.
If alignment is Archimedes' "lever long enough" then the agreed upon foundations and definitions are the place to stand whereby the combination thereof can move the world.
I agree, and even cited a chain of replicated works that indicated that to me over a year ago.
But as I said, there's a difference between discussing what's demonstrated in smaller toy models and what's demonstrated in a production model, or what's indicated vs what's explicit. Even though there should be no reasonable inclination to think that a simpler model exhibiting a complex result should be absent or less complex in an exponentially more complex model, I can speak from experience in that explaining extrapolated research as opposed to direct results l...
Has it though?
It was a catchy hook, but their early 2022 projection was $100mm in annual revenue, and the first 9 months of 2023, as reported for the brand after acquisition, brought in $27.6mm gross revenue. It doesn't seem like even their 2024 numbers are close to hitting their own 2022 projection.
Being controversial can get attention and press, but there's a limited runway to how much it offers before hitting a ceiling on the branding. Also, Soylent doesn't seem like a product where there is a huge threat of regulatory oversight where a dystopian branding would te...
The correspondence between what you reward and what you want will break.
This is already happening with ChatGPT, and it's kind of alarming that their new head of alignment (a) isn't already aware of this, and (b) has such an overly simplistic view of model motivations.
There's a subtle psychological effect in humans where intrinsic motivators get overwritten when extrinsic rewards are added.
The most common example of this is if you start getting paid to do the thing you love to do, you probably won't continue doing it unpaid for fun on the side....
I wouldn't be surprised if, within a few years, the specific uniqueness of today's individual users will be identifiable by tomorrow's models purely from prompt reflection in the outputs of any non-trivial prompt.
For example, I'd be willing to bet I could spot the Claude outputs from janus vs most other users, and I'm not a quasi-magical correlation machine that's exponentially getting better.
A bit like how everyone assumed Bitcoin used with tumblers was 'untraceable' until it turned out it wasn't.
Anonymity is very likely dead for any long storage outputs no matter the techniques being used, it just isn't widely realized yet.
I think this was a really poor branding choice by Altman, similarity infringement or not - the tweet, the very idea of getting her to voice it in the first place.
Like, had Arnold already said no or something?
If one of your product line's greatest obstacles is a longstanding body of media depicting it as inherently dystopian, that's not exactly the kind of comparison you should be leaning into full force.
I think the underlying product shift is smart. Tonal cues in the generations even in the short demos completely changed my mind around a number of things, i...
If your brother has a history of being rational and evidence driven, you might encourage them to spend some time lurking on /r/AcademicBiblical on Reddit. They require citations for each post or comment, so he may be frustrated if he tries to participate, especially if in the midst of a mental health crisis. But lurking would be very informative very quickly.
I was a long-time participant there before leaving Reddit, and it's a great place for evidence-driven discussion of the texts. It's a mix of atheists, Christians, Jews, Muslims, Norse pagans, etc. (I'm ...
It's going to have to.
Ilya is brilliant and seems to really see the horizon of the tech, but maybe isn't the best at the business side to see how to sell it.
But this is often the curse of the ethically pragmatic. There is such a focus on the ethics part by the participants that the business side of things only sees that conversation and misses the rather extreme pragmatism.
As an example, would superaligned CEOs in the oil industry fifty years ago have still only kept their eye on quarterly share prices or considered long term costs of their choices? There'...
While I agree that the potential for AI (we probably need a better term than LLMs or transformers as multimodal models with evolving architectures grow beyond those terms) in exploring less testable topics as more testable is quite high, I'm not sure the air gapping on information can be as clean as you might hope.
Does the AI generating the stories of Napoleon's victory know about the historical reality of Waterloo? Is it using something like SynthID where the other AI might inadvertently pick up on a pattern across the stories of victories distinct from t...
As a fellow slight dyslexic (though probably a different subtype, given mine seems to also have a factor of temporal physical coordination) who didn't know until later in life due to self-learning to read very young but struggled badly with new languages, with copying math problems from a board, and with correctly pronouncing words I was letter-transposing - one of the most surprising things was that the analytical abilities I'd always considered to be my personal superpowers were probably the other side of the coin of those annoyances:
...Areas of enhanced abilit
Hi Martijn,
Thank you so much for your comment! I've been familiar with your work for a few years, and this was a wonderful reminder to go through your commentary again more closely.
I especially love to see someone out there pointing out both (a) the gender-neutrality consideration for terms that would have been binary in Aramaic (esp. in light of saying 22) and (b) the importance of the Greek loanwords. On the latter point, the implications of using eikon across the work, especially in saying 22's "eikons in place of eikons," has such huge ...