The poor quality reflects that it is responding to demand for poor quality fakes, rather than to demand for high quality fakes
You’ve made the supply/demand analogy a few times on this subject; I’m not sure that is the best lens. This analysis makes it sound like there is a homogeneous product, "fakes", with a single dimension, "quality". But I think even on its own terms the market micro-dynamics are way more complex than that.
I think of it more in terms of memetic evolution and epidemiology. SIR (susceptible-infectious-recovered) as a first analogy: some people have weak immune systems, some ...
...Let’s walk through how shutdown would work in the context of the AutoGPT-style system. First, the user decides to shutdown the model in order to adjust its goals. Presumably the user’s first step is not to ask the model whether this is ok; presumably they just hit a “reset” button or Ctrl-C in the terminal or some such. And even if the user’s first step was to ask the model whether it was ok to shut down, the model’s natural-language response to the user would not be centrally relevant to corrigibility/incorrigibility; the relevant question is what actions
Confabulation is a dealbreaker for some use-cases (e.g. customer support), and potentially tolerable for others (e.g. generating code when tests or other ground truth are available). I think it's essentially down to whether you care about best-case performance (discarding bad responses) or worst-case performance.
But agreed, a lot of value is dependent on solving that problem.
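To make the best-case/worst-case distinction concrete, here is a toy sketch (the generate and passes_tests callables are hypothetical placeholders, not any real API) of why verifiable ground truth changes the calculus: confabulated candidates just get discarded.

```python
from typing import Callable, Optional

# Toy sketch: when candidate outputs can be verified (e.g. against unit tests),
# confabulation only costs extra samples, so best-case performance is what matters.

def best_of_n(generate: Callable[[], str],
              passes_tests: Callable[[str], bool],
              n: int = 5) -> Optional[str]:
    """Draw up to n candidates and return the first one that verifies."""
    for _ in range(n):
        candidate = generate()
        if passes_tests(candidate):
            return candidate
    return None  # every candidate failed; fall back to a human or report failure
```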
While of course this is easy to rationalize post hoc, I don’t think ChatGPT’s falling user count is a particularly useful signal. There is a possible world where it would be; something like “all of the value from LLMs will come from people entering text into ChatGPT”. In that world, users giving up shows that there isn’t much value.
In the world we actually live in, I believe most of the value is (currently) gated behind non-trivial amounts of software scaffolding, which will take man-years of development time to build: things like UI paradigms for coding assistants, experi...
Amusingly, the US seems to have already taken this approach to censor books: https://www.wired.com/story/chatgpt-ban-books-iowa-schools-sf-496/
The result, then, is districts like Mason City asking ChatGPT, “Does [insert book here] contain a description or depiction of a sex act?” If the answer was yes, the book was removed from the district’s libraries and stored.
Regarding China or other regimes using LLMs for censorship, I'm actually concerned that it might rapidly go the opposite direction as speculated here:
...It has widely been reported that the PRC may b
There doesn't need to be a deception module or a deception neuron that can be detected
I agree with this. Perhaps I’m missing some context; is it common to advocate for the existence of a “deception module”? I’m aware of some interpretability work that looks for a “truthiness” neuron but that doesn’t seem like the same concept.
...We would need an interpretability tool that can say "this agent has an inaccurate world model and also these inaccuracies systematically cause it to be deceptive" without having to simulate the agent interacting with the world. I
I find the distinction between an agent and its behavior confusing; I would say the agent’s weights (and ephemeral internal state) determine its behavior in response to a given world state. Perhaps you can clarify what you mean there.
Cicero doesn’t seem particularly relevant here, since it is optimized for a game that requires backstabbing to win, and therefore it backstabs. If anything it is anti-aligned by training. It happens to have learned a “non-deceptive” strategy; I don’t think that strategy is unique in Diplomacy?
But if you want to apply the ...
Interpretability. If we somehow solve that, and keep it solved as systems become more powerful, then we don’t have to solve the alignment problem in one shot; we can iterate safely, knowing that if an agent starts showing signs of object-level deceptiveness, malice, misunderstanding, etc., we will be able to detect it. (I’m assuming we can grow new AIs by gradually increasing their capabilities, as we currently do with GPT parameter counts, plus gradually increasing their strength by ramping up the compute budget.)
Of course, many big challenges here. Could an agent...
I think getting to “good enough” on this question should pretty much come for free once the hard problems are solved. For example, any common-sense statement like “Maximize flourishing as depicted in the UN convention on human rights” is IMO likely to get us to a good place, if the agent is honest, remains aligned to those values, and interprets them reasonably intelligently. (With each of those three prerequisites being way harder than picking a non-harmful value function.)
If our AGIs, after delivering utopia, tell us we need to start restricting childbea...
Seconding the Airmega, but here’s a DIY option too if availability becomes an issue: https://dynomight.net/better-DIY-air-purifier.html
The problem with ‘show your work’ and grading on steps is that at best you can’t do anything your teacher doesn’t understand
Being told to ‘show your work’ and graded on the steps helps you learn the steps and by default murders your creativity, execution style
I can see how this could in some cases end up impacting creativity, but I think this concern is at best overstated. I think the analogy to school is subtly incorrect: the rating policy is not actually the same, even though both are named “show your working”.
In the paper OpenAI have a “neutral” r...
Is “adversarial-example-wanters” referring to an existing topic, or something you can expand on here?
This is a great experiment! It illustrates exactly the tendency I observed when I dug into this question with an earlier model, LaMDA, except this example is even clearer.
As an AI language model, I have access to a variety of monitoring tools and system resources that allow me to gather information about my current state
Based on my knowledge of how these systems are wired together (software engineer, not an ML practitioner), I’m confident this is bullshit. ChatGPT does not have access to operational metrics about the computational fabric it is running...
I buy this. I think a solid sense of self might be the key missing ingredient (though it’s potentially a path away from Oracles toward Agents).
A strong sense of self would require life experience, which implies memory. Probably also the ability to ruminate and generate counterfactuals.
And of course, as you say, the memories and “growing up” would need to be about experiences of the real world, or at least recordings of such experiences, or of a “real-world-like simulation”. I picture an agent growing in complexity and compute over time, while retaining a memory of its earlier stages.
Perhaps this requires a different learning paradigm from gradient descent, relegating it to science fiction for now.
I think they quite clearly have no (or barely any) memory, as they can be prompt-hijacked to drop one persona and adopt another. Also, mechanistically, the prompt is the only thing you could call memory, and it starts basically empty and the context window is small. They also have a fuzzy-at-best self-symbol. No “Markov blanket”, if you want to use the Friston terminology. No rumination on counterfactual futures and pasts.
I do agree there is some element of a self-symbol—at least a theory of mind—in LaMDA; for example, I found its explanation for why it lied to b...
why they thought the system was at all ready for release
My best guess is it’s fully explained by Nadella’s quote “I hope that, with our innovation, [Google] will definitely want to come out and show that they can dance. And I want people to know that we made them dance.”
https://finance.yahoo.com/news/microsoft-ceo-satya-nadella-says-172753549.html
Seems kind of vapid but this appears to be the level that many execs operate at.
Is there any evidence at all that markets are good at predicting paradigm shifts? Not my field, but I would not be surprised if the answer is “no”.
My base intuition is that markets are often efficient in-sample predictors but poor out-of-sample predictors.
Unfortunately I think some alignment solutions would only break down once it could be existentially catastrophic
Agreed. My update is coming purely from increasing my estimate of how much press, and therefore funding, AI risk is going to get long before that point. 12 months ago it seemed to me that capabilities had increased dramatically, and yet there was no proportional increase in the general public's level of fear of catastrophe. Now it seems to me that there's a more plausible path to widespread appreciation of (and therefore work on) AI risk. To ...
I posted something similar over on Zvi’s Substack, so I agree strongly here.
One point I think is interesting to explore: this release actually updates me slightly towards a lower risk of AI catastrophe. There is growing media attention towards a skeptical view of AI, the media is already covering harms, we are seeing crowdsourced attempts to break these systems, and there is more thinking about threat models. But the actual “worst harm” so far is still very low.
I think the main risk is a very discontinuous jump in capabilities. If we increase by relatively small deltas, then ...
I think we will probably pass through a point where an alignment failure could be catastrophic but not existentially catastrophic.
Unfortunately I think some alignment solutions would only break down once it could be existentially catastrophic (both deceptive alignment and irreversible reward hacking are noticeably harder to fix once an AI coup can succeed). I expect it will be possible to create toy models of alignment failures, and that you'll get at least some kind of warning shot, but that you may not actually see any giant warning shots.
I think AI used...
I realized the reference "thin layer" is ambiguous in my post; I just wanted to confirm whether you were referring to the general case ("thin model, fat services") or the specific safety question at the bottom ("is it possible to have a thin mapping layer on top of your Physics simulator that somehow subverts or obfuscates it")? My child reply assumed the former, but on consideration/re-reading I suspect the latter might be more likely?
With "thin model / fat service" I'm attempting to contrast with the typical end-to-end model architecture, where there is no separate "physics simulator", and instead the model just learns its own physics model, embedded with all the other relations that it has learned. So under that dichotomy, I think there is no "thin layer in front of the physics simulation" in an end-to-end model, as any part of the physics simulator can connect to or be connected from any other part of the model.
In such an end-to-end model, it's really hard to figure out where the "ph...
I suppose the follow-up question is: how effectively can a model learn to re-implement a physics simulator, if given access to it during training -- instead of being explicitly trained to generate XML config files to run the simulator during inference?
If it's substantially more efficient to use this paper's approach and train your model to use a general purpose (and transparent) physics simulator, I think this bodes well for interpretability in general. In the ELK formulation, this would enable Ontology Identification.
On this point, the paper says:
...Mind’s E
- That a sufficiently integrated CAIS is indistinguishable from a single general agent to us is what tells us CAIS isn't safe either.
Fleshing this point out, I think one can probably make conditional statistical arguments about safety here, to define what I think you are getting at with "sufficiently integrated".
If your model has N parameters and integrates a bunch of Services, and we've established that a SOTA physics model requires N*100 parameters (the OP paper suggests a difference of roughly that order of magnitude), then it is likely safe to say that the model has not "re-le...
I think it’s plausible to say this generation of headset will be better than a group video conference. At a stretch possibly better than a 1:1 video call. But better than in-person seems extremely unlikely to me.
Perhaps you are intending something broad like “overall higher utility for business and employees” rather than strictly better such that people will prefer to leave offices they were happy in to do VR instead? Taking into account the flexibility to hire people remotely, avoid paying tech hub wages, etc.?
Personally I think 1:1 video is much better t...
I know an AI wouldn’t think like a human
This assertion is probably my biggest question mark in this discourse. It seems quite deeply baked into a lot of the MIRI arguments. I’m not sure it’s as certain as you think.
I can see how it is obviously possible we’d create an alien AI, and I think it’s impossible to prove we won’t. However, given that we are training our current AIs on imprints of human thought (e.g. text artifacts), and that it seems likely we will push hard for AIs to be trained to obey laws/morality as they increase in power (e.g. Google’s AI safety tea...
One factor I think is worth noting, and I don't see mentioned here, is that the current state of big-tech self-censorship is clearly at least partly due to a bunch of embarrassing PR problems over the last few years, combined with strident criticism of AI bias from the NYT et al.
Currently, companies like Google are terrified of publishing a model that says something off-color, because they (correctly) predict that they will be raked over the coals for any offensive material. Meanwhile, they are busy commercializing these models to deliver value to their us...
Thanks, this is what I was looking for: Mind Crime. As you suggested, S-Risks links to some similar discussions too.
I guess that most people wouldn't feel terribly conflicted about removing Hitler's right to privacy, or even his right to life, to prevent the Holocaust.
I'd bite that bullet, with the information we have ex post. But I struggle to see many people getting on board with that ex ante, which is the position we'd actually be in.
Is it ethical to turn off an AGI? Wouldn’t this be murder? If we create intelligent self-aware agents, aren’t we morally bound to treat them with at least the rights of personhood that a human has? Presumably there is a self-defense justification if Skynet starts murderbot-ing, or melting down things for paperclips. But a lot of discussions seem to assume we could proactively turn off an AI merely because we dislike its actions, or are worried about them, which doesn’t sound like it would fly if courts grant them personhood.
If alignment requires us to insp...
If we view the discovery of particular structures such as induction heads as chancing upon a hard-to-locate region in the parameter space (or perhaps as crossing a high activation-energy barrier), and if we see these structures being repeatedly discovered ("parallel evolution"), is it possible to reduce the training time by initializing the network's parameters "close" to that location?
Speaking more mechanistically, is it possible to initialize a subset of the network prior to training to have a known functional structure, such as initializing (a guess at) the right ...
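As a toy illustration of the kind of initialization I mean (the shapes, the identity-map "guess", and the surrounding model are all invented for the example; nothing here is specific to induction heads or any real transformer):

```python
import torch
import torch.nn as nn

# Toy sketch: seed part of a freshly initialized network with a known structure
# before training, leaving the rest at its random init. The "known structure"
# here is just an identity-like block standing in for a hand-constructed circuit.

torch.manual_seed(0)
d_model = 64

model = nn.Sequential(
    nn.Linear(d_model, d_model),  # the layer we want to pre-structure
    nn.ReLU(),
    nn.Linear(d_model, d_model),  # left at the default random init
)

with torch.no_grad():
    first = model[0]
    # Initialize close to the identity map (a crude stand-in for "a guess at the
    # right circuit"), plus small noise so training can still move it if wrong.
    first.weight.copy_(torch.eye(d_model) + 0.01 * torch.randn(d_model, d_model))
    first.bias.zero_()

# Training proceeds as usual from here; the open question is whether starting
# "close" to the structure shortens the phase where it would otherwise have to
# be discovered from scratch.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
```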
unlikely to be competitive
Would you care to flesh this assertion out a bit more?
To be clear, I’m not suggesting that this is optimal now; I’m merely speculating that there might be a point between now and AGI where the work to train these sub-components becomes so substantial that it becomes economical to modularize.
whether a design is aligned or not isn't the type of question one can answer by analyzing the agent's visual cortex
As I mentioned earlier in my post, I was alluding to the ELK paper with that reference, specifically Ontology Identification. O...
One other thought after considering this a bit more: we could test this now using software submodules. It’s unlikely to perform better (since there is no hardware speedup), but it could shed light on the tradeoffs of the general approach. And as these submodules got more complex, it might eventually become beneficial to use this approach even in a pure-software (no hardware) paradigm, if it lets you skip retraining a bunch of common functionality.
I.e. if you train a sub-network for one task, then incorporate that in two distinct top-layer networks trained on different ...
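A minimal sketch of that experiment (PyTorch for concreteness; the dimensions and the two downstream tasks are invented placeholders):

```python
import torch
import torch.nn as nn

# Toy sketch: a sub-network trained once (imagine on task A), then reused frozen
# inside two distinct top-layer networks trained on different tasks, so the
# shared functionality never needs retraining.

feature_dim, hidden_dim = 32, 64

shared_submodule = nn.Sequential(      # pretend this was already trained on task A
    nn.Linear(feature_dim, hidden_dim),
    nn.ReLU(),
)
for p in shared_submodule.parameters():
    p.requires_grad_(False)            # freeze: downstream training can't alter it

head_task_b = nn.Linear(hidden_dim, 10)  # e.g. a 10-way classification head
head_task_c = nn.Linear(hidden_dim, 1)   # e.g. a regression head

def forward_task_b(x: torch.Tensor) -> torch.Tensor:
    return head_task_b(shared_submodule(x))

def forward_task_c(x: torch.Tensor) -> torch.Tensor:
    return head_task_c(shared_submodule(x))

# Only the heads are optimized; the shared sub-network is reused as-is by both.
opt_b = torch.optim.Adam(head_task_b.parameters(), lr=1e-3)
opt_c = torch.optim.Adam(head_task_c.parameters(), lr=1e-3)
```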
I've been thinking along similar lines recently. A possible path to AI safety that I've been considering extends upon this:
A promising concrete endgame story along these lines is Ought’s plan to avoid the dangerous attractor state of AI systems that are optimized end-to-end
One possible tech-tree path is that we start building custom silicon to implement certain subsystems in an AI agent. These components would be analogous to functional neural regions of the human brain such as the motor cortex, visual sy...
I’m a vegetarian and I consider my policy of not frequently recalculating the cost/benefit of eating meat to be an application of a rule in two-level utilitarianism, not a deontological rule. (I do pressure test the calculation periodically.)
Also, I will note that you are making some pretty strong generalizations here. I know vegans who cheat, vegans who are flexible, and vegans who are strict.