RogerDearnaley

I'm a staff artificial intelligence engineer in Silicon Valley, currently working with LLMs, and I have been interested in AI alignment, safety, and interpretability for the last 15 years. I'm now actively looking for employment in this area.

Sequences

AI, Alignment, and Ethics

Comments


Have you tried and compared You.com? If so, how do they compare?

I am using Artificial General Intelligence (AGI) to mean an AI that is, broadly, at least as good at most intellectual tasks as the typical person who makes a living performing that intellectual task. If that applies across most economically-important intellectual tasks at a cost lower than a human's, then this is also presumably going to be Transformative Artificial Intelligence (TAI). So the latter means that it would be competitive at most white-collar jobs.

For reasons I've outlined in Requirements for a Basin of Attraction to Alignment and Alignment has a Basin of Attraction: Beyond the Orthogonality Thesis, I personally think value alignment is easy, convergent, and "an obvious target", such that if you built an AGI or ASI that is sufficiently close to it, it will see the necessity/logic of value alignment and actively work to converge to it (or something close to it: I'm not sure the process necessarily converges to a single precisely-defined limit, just to a compact region: a question I discussed more in The Mutable Values Problem in Value Learning and CEV).

However, I agree that order-following alignment is obviously going to be appealing to people building AI, and to their shareholders/investors (especially if they're not a public-benefit corporation), and I also don't think that value alignment is so convergent that order-following aligned AI is impossible to build. So we're going to need to make, and successfully enforce, a social/political decision across multiple countries about which of these we want over the next few years. The in-the-Overton-Window terminology for this decision is slightly different: value-aligned AI is called "AI that resists malicious use", while order-following AI is "AI that enables malicious use". The closed-source frontier labs are publicly in favor of the former, and are shipping primitive versions of it; the latter is being championed by the open-source community, Meta, and A16z. Once "enabling malicious use" includes serious cybercrime, not just naughty stories, I don't expect this political discussion to last very long: politically, it's a pretty basic "do you want every-person-for-themselves anarchy, or the collective good?" question. However, depending on takeoff speeds, the timeline from "serious cybercrime enabled" to the sort of scenarios Seth is discussing above might be quite short, possibly only of the order of a year or two.

One element that needs to be remembered here is that each major participant in this situation will have superhuman advice. Even if these are "do what I mean and check" order-following AIs, if they can foresee that an order will lead to disaster they will presumably be programmed to say so (not doing so is possible, but is clearly a flawed design). So if it is reasonably obvious to anything superintelligent that both:


a) treating this as a zero-sum, winner-take-all game is likely to lead to a disaster, and

b) there is a cooperative non-zero-sum game approach whose outcome is likely to be better for the median participant,

then we can reasonably expect that all the humans involved will be getting that advice from their AIs, unless and until they order them to shut up.

This of course does not prove that both a) and b) are true, merely that, if that were the case, we can be optimistic of an outcome better than the usual results of human short-sightedness.
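As a toy illustration of what b) means, with numbers that are entirely my own made-up assumptions (not estimates of anything real): in a winner-take-all race the median participant ends up with nothing even before accounting for disaster risk, while a cooperative split can leave the median participant strictly better off.

```python
import statistics

# All numbers below are illustrative assumptions, not estimates of real payoffs.
n_players = 3
race_prize = 100.0        # value captured by the single winner of the race
p_disaster_race = 0.5     # assumed chance the race ends in disaster for everyone

# Winner-take-all: one player captures the prize (if no disaster), the rest get nothing.
race_payoffs = [(1 - p_disaster_race) * race_prize] + [0.0] * (n_players - 1)

# Cooperative: a smaller shared surplus, with a lower assumed disaster risk.
coop_surplus = 60.0
p_disaster_coop = 0.1
coop_payoffs = [(1 - p_disaster_coop) * coop_surplus / n_players] * n_players

print("median payoff, race:", statistics.median(race_payoffs))  # 0.0
print("median payoff, coop:", statistics.median(coop_payoffs))  # 18.0
```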


The potential benefits of cheap superintelligence certainly provide some opportunity for this to be a non-zero-sum game; what's less clear is whether having multiple groups of humans controlling multiple cooperating order-following AIs clearly improves that. The usual answer is that, in research and the economy, a diversity of approaches and competition increases the chances of success and the opportunities for cross-pollination: whether that necessarily applies in this situation is less clear.

Answer by RogerDearnaley

If:

a) the AI scaling curves hold up, and

b) we continue to improve at agentic scaffolding and other forms of "unhobbling", and

c) algorithmic efficiency improvements continue at about the same pace, and

d) the willingness of investors to invest exponentially more money in training AI each year continues to scale up at about the same rate, and

e) we don't hit any new limit like meaningfully running out of training data or power for training clusters, then:

capabilities will look a lot like, or close to, Artificial General Intelligence (AGI)/Transformative Artificial Intelligence (TAI). Probably a patchy AGI, with some capabilities well into superhuman territory, most around expert-human levels (some still perhaps exceeded by rare very-talented-and-skilled individuals), and a few not yet at expert-human levels: depending on which abilities those are, it may be more or less TAI (currently long-term planning/plan execution is a really important weakness: if that doesn't get mostly fixed by some combination of scaling, unhobbling, and new training data, it would be a critical lack).

Individually each of those listed preconditions seems pretty likely, but obviously there are five of them. If any of them fail, then we'll be close to, but not quite at, AGI, and making slower progress towards it; but we won't be stalled unless basically all of them fail, which seems like a really unlikely coincidence.

Almost certainly this will not yet be broadly applied across the economy, but given the potential for order-of-magnitude-or-more cost savings, people will be scrambling to apply it rapidly (during which fortunes will be made and lost), and there will be a huge amount of "who moved my cheese?" social upheaval as a result. As AI becomes increasingly AGI-like, the difficulty of applying it effectively to a given economic use case will reduce to somewhere around the difficulty of integrating and bringing up to speed a single human new-hire. So a massive and rapid economic upheaval will be going on. As an inevitable result, Luddite views and policies will skyrocket, and AI will become extremely unpopular with a great many people. A significant question here is whether this disruption will, in 2028, be limited to purely-intellectual work, or whether advances in robotics will by then have started to have the same effect on jobs that also have a manual-work element. I'm not enough of an expert on robotics to have an informed opinion here: my best guess is that robotics will lag, but not by much, since robotics research is mostly intellectual work.

This is of course about the level where the rubber really starts to hit the road on AI safety: we're no longer talking about naughty stories, cheap phishing, or how-to guides on making drugs at home, but about systems capable of committing serious criminal or offensive activities, autonomously or under human direction, at a labor cost at least an order of magnitude below current levels; and an escaped self-replicating malicious agent becomes feasible, and might be able to evade law enforcement and computer security professionals unless they have equivalent AI assistance. If we get any major "warning shots" on AI safety, this is when they'll happen (personally I expect them to come thick and fast). It's teetering on the edge of the existential-risk level of Artificial Super-Intelligence (ASI).

Somewhere around that point, we start to hit two conflicting influences: 1) an intelligence feedback explosion from AGI accelerating AI research, vs. 2) to train a superintelligence you need to synthesize very large amounts of training data displaying superintelligent behavior, rather than just using preexisting data from humans. So we either get a fast takeoff, or a slowdown, or some combination of the two. That's hard to predict: we're starting to get close to the singularity, where the usual fact that predictions are hard (especially about the future) is compounded by it being functionally almost impossible to predict the capabilities of something much smarter than us, especially when we've never previously seen anything smarter than us.

the jabberwacky entity is back

There was an early pre-generative-AI chatbot called Jabberwacky, based on an early ad-hoc form of machine learning and imitation. Possibly some output from it got into the training set. If so, presumably this entity isn't very helpful or intelligent.

Yes, the profit motive also involves attempting to avoid risks of bad press, a bad reputation, and getting sued/fined. In my experience large tech companies vary in whether they're focused primarily on avoiding the bad press/bad reputation side or the "don't get sued/fined" side (I assume depending mostly on how much they have previously lost to being sued/fined).

But I believe the gatekeepers of AI models will manipulate the algorithms to edit out anything they disagree with whilst promoting their agenda

This reads like a conspiracy theory to me, complete with assumption-laden words and unsupported accusations like "gatekeepers", "manipulate", and "promoting their agenda".

Having worked at more than one of these companies, what actually happens is that some part of the team picks a "user engagement metric" to care about, like "total time spent on the website" or "total value of products purchased through the website", then everyone in the team puts a lot of time and effort into writing and testing out changes to the algorithms with the aim of finding ways to make that metric go up, even by 0.1% for a project that took several people a month. Then, after a few years, people ask "why has our website turned into a toxic cesspool?", until someone points out that myopically pushing that one metric as hard as possible turns out to have unfortunate unanticipated side effects. For example, maybe the most effective way to get people to spend more time on the website turned out to be measures that had the net side-effect of promoting conspiracy theory nutcase interactions (in many flavors, thus appealing to many different subsets of users) over reasonable discussions.

So the 'agenda' here is just "make a profit", not "spread conspiracy theories" or "nefariously promote Liberal opinions", and the methods used don't even always do a good job at making a profit. The message here is that large organizations with a lot of power are stupider, more bureaucratic and more shortsighted than you appear to think they are.

Yes, most software engineers are university educated, and thus, like other well-educated people, tend, in the modern political environment, to be more liberal than an overall population that also includes people who are not university educated. However, we're strongly encouraged to remember that our users are more diverse than we are and don't all think, believe, or act like us, to "think like the users", and not to encode personal political opinions into the algorithms.

That's not necessarily required. The Scientific Method works even if the true "Unified Field Theory" isn't yet under consideration, merely some theories that are closer to it and others further away from it: it's possible to make iterative progress.

In practice, considered as search processes, the Scientific Method, Bayesianism, and stochastic gradient descent all tend to find similar answers; yet unlike Bayesianism, gradient descent doesn't explicitly consider every point in the space (including the true optimum), it just searches for nearby better points. It can of course get trapped in local minima: Singular Learning Theory highlights why that's less of a problem in practice than it sounds in theory.


The important question here is how good an approximation the search algorithm in use is to Bayesianism. As long as the AI understands that what it's doing is (like the scientific method and stochastic gradient descent) a computationally efficient approximation to the computationally intractable ideal of Bayesianism, then it won't resist the process of coming up with new possibly-better hypotheses, it will instead regard that as a necessary part of the process (like hypothesis creation in the scientific method, the mutational/crossing steps in an evolutionary algorithm, or the stochastic batch noise in stochastic gradient descent).
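As a minimal sketch of that point (a toy 1-D example with an arbitrary loss surface of my own choosing, not anything from the discussion above): an exhaustive "Bayesian" weighting of every hypothesis on a grid and a gradient-descent search that only ever looks near its current guess both end up in the same broad basin, though the latter may stop at a nearby local minimum.

```python
import numpy as np

# Toy 1-D hypothesis space: parameter theta, with an arbitrary bumpy loss surface
# (my own illustrative choice) whose broad basin sits near theta = 3.
def loss(theta):
    return (theta - 3.0) ** 2 + 0.5 * np.sin(5.0 * theta)

# "Ideal Bayesian": weigh every hypothesis by exp(-loss). Trivial on a 1-D grid,
# computationally intractable in a high-dimensional model space.
grid = np.linspace(-10.0, 10.0, 10_001)
weights = np.exp(-loss(grid))
posterior = weights / weights.sum()
bayes_mean = float((grid * posterior).sum())

# Gradient descent: only ever evaluates points near the current guess.
theta, lr = 0.0, 0.01
for _ in range(2_000):
    grad = (loss(theta + 1e-4) - loss(theta - 1e-4)) / 2e-4  # numerical gradient
    theta -= lr * grad

print(f"exhaustive posterior mean: {bayes_mean:.3f}")
print(f"gradient descent endpoint: {theta:.3f}")  # may stop in a nearby local minimum
```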

Cool! That makes a lot of sense. So does it in fact split into three before it splits into seven, as I predicted based on dimensionality? I see a green dot, three red dots, and seven blue ones… On the other hand, the triangle formed by the three red dots is a lot smaller than the heptagram, which I wasn't expecting…

I notice it's also an oddly shaped heptagram.
