D0TheMath's Shortform

In Magna Alta Doctrina Jacob Cannell talks about exponential gradient descent as a way of approximating solomonoff induction using ANNs

While that approach is potentially interesting by itself, it's probably better to stay within the real algebra. The Solmonoff style partial continuous update for real-valued weights would then correspond to a multiplicative weight update rather than an additive weight update as in standard SGD.
Has this been tried/evaluated? Why actually yes - it's called exponentiated gradient descent, as exponentiating the result of addi

... (read more)

The following is very general. My future views will likely be inside the set of views allowable by the following.

I know lots about extant papers, and I notice some people in alignment seem to throw them around like they are sufficient evidence to tell you nontrivial things about the far future of ML systems.

To some extent this is true, but lots of the time it seems very abused. Papers tell you things about current systems and past systems, and the conclusions they tell you about future systems are often not very nailed down. Suppose we have evidence that d... (read more)

[-]Garrett Baker1y*40

I'm generally pretty skeptical about inverse reinforcement learning (IRL) as a method for alignment. One of many arguments against: I do not act according to any utility function, including the one I would deem the best. Presumably, if I had as much time & resources as I wanted, I would eventually be able to figure out a good approximation to what that best utility function would do, and do it. At that point I would be acting according to the utility function I deem best. That process of value-reflection is not even close to similar to performing a bay... (read more)

Some evidence my concern about brain uploading people not thinking enough about dynamics is justified: Seems like davidad's plan very much ignores brain plasticity.

https://manifold.markets/GarrettBaker/in-5-years-will-i-think-the-org-con

This paper finds critical periods in neural networks, and they're a known phenomena in lots of animals. h/t Turntrout

An SLT story that seems plausible to me:

We can model the epoch as a temperature. Longer epochs result in a less noisy gibbs samplers. Earlier in training, we are sampling points from a noisier distribution, and so the full (point reached when training on full distribution) and ablated (point reached when ablating during the critical period) singularitites are kind of treated the same. As we decrease the temperature, they start to diffe... (read more)

2Garrett Baker1y

The obvious thing to do, which tests the assumption of the above model, but not the model itself, is to see whether the RLCT decreases as you increase the number of epochs. This is a very easy experiment.

2Garrett Baker1y

Actually maybe slightly less straightforward than this, since as you increase the control parameter β, you'll both add a pressure to decrease Ln, as well as decrease λ, and it may just be cheaper to decrease Ln rather than λ.

[-]Garrett Baker2y40

I expect that advanced AI systems will do in-context optimization, and this optimization may very well be via gradient descent or gradient descent derived methods. Applied recursively, this seems worrying.

Let the outer objective be the loss function implemented by the ML practitioner, and the outer optimizer be gradient descent implemented by the ML practitioner. Then let the inner $^{1}$ -objective be the objective used by the trained model for the in-context gradient descent process, and the inner $^{1}$ -optimizer be the in-context gradient descent process. Then it s... (read more)

[-]Garrett Baker2y40

The core idea of a formal solution to diamond alignment I'm working on, justifications and further explanations underway, but posting this much now because why not:

Make each turing machine in the hypothesis set reversible and include a history of the agent's actions. For each turing machine compute how well-optimized the world is according to every turing computable utility function compared to the counterfactual in which the agent took no actions. Update using the simplicity prior. Use expectation of that distribution of utilities as the utility function's value for that hypothesis.

[-]Garrett Baker1y30

There currently seems to be an oversupply of alignment researchers relative to funding source’s willing to pay & orgs’ positions available. This suggests the wage of alignment work should/will fall until demand=supply.

5Alexander Gietelink Oldenziel1y

Alignment work mostly looks like standard academic science in practice. Young people in regular academia are paid a PhD stipend salary not a Bay Area programmer salary...

5Garrett Baker1y

I anticipate higher, because the PhD gets a sweet certification at the end, and likely more career capital. A thing we don’t currently give alignment researchers, and which would be hard to give since they often believe the world will end very soon, reducing the value of skill building and certifications. Like, I do think in fact ML PhDs get paid more than alignment researchers, accounting for these benefits.

1ryan_greenblatt1y

Wages seem mostly orthogonal to why funding sources are/aren't willing to pay as well as why orgs are willing to hire.

2Garrett Baker1y

If demand is more inelastic than I expect, then this should mean prices will just go lower than I expect.

[-]Garrett Baker2y30

I've always (but not always consciously) been slightly confused about two aspects of shard theory:

The process by which your weak, reflex-agents amalgamate together into more complicated contextually activated heuristics, and the process by which more complicated contextually activated heuristics amalgamate together to form an agent which cares about worlds-outside-their-actions.
If you look at many illustrations of what the feedback loop for developing shards in humans looks like, you run into issues where there's not a spectacular intrinsic separation betw

... (read more)

2Garrett Baker2y

A confusion about predictive processing: Where do the values in predictive processing come from?

1Garrett Baker2y

lol, either this confusion has been resolved, or I have no clue what I was saying here.

[-]Garrett Baker2y30

My take on complex systems theory is that it seems to be the kind of theory that many arguments proposed in favor of would still give the same predictions until it is blatantly obvious that we can in fact understand the relevant system. Results like chaotic relationships, or stochastic-without-mean relationships seem definitive arguments in favor of the science, though these are rarely posed about neural networks.

Merely pointing out that we don’t understand something, that there seems to be a lot going on, or that there exist nonlinear interactions imo isn... (read more)

4Garrett Baker2y

I have downvoted my comment here, because I disagree with past me. Complex systems theory seems pretty cool from where I stand now, and I think past me has a few confusions about what complex systems theory even is.

3Garrett Baker2y

I have re-upvoted my past comment, after looking more into things, I'm not so impressed with complex systems theory, but I don't fully support it. Also, past me was right to have confusions about what complex systems theory is, but still judge it, as it seems complex systems theorists don't even know what a complex system is.

[-]Garrett Baker2y30

[-]Garrett Baker4mo2-5

Why do some mathematicians feel like mathematical objects are "really out there" in some metaphysically fundamental sense? For example, if you ask mathematicians whether ZFC + not Consistent(ZFC) is consistent, they will say "no, of course not!" But given ZFC is consistent, the statement is in fact consistent due to by Godel's second incompleteness theorem^[1]. Similarly, if we have the Peano axioms without induction, mathematicians will say that induction should be there, but in fact you cannot prove this fact from within Peano, and given induction mathema... (read more)

7harfe4mo

Theorem 4.6.2 in logical induction says that the "probability" of independent statements does not converge to 1 or 0, but to something in-between. So even if a mathematician says that some independent statement feels true (eg some objects are "really out there"), logical induction will tell him to feel uncertain about that.

6notfnofn4mo

Certainly not a mathematician with any background in logic. What exactly do you mean here? That the Peano axioms minus induction do not adequately characterize the natural numbers because they have nonstandard models? Why would I then be surprised that induction (which does characterize the natural numbers) can't be proven from the remaining axioms? Transfinite induction is a consequence of ZF that makes sense in the context in sets. Yes, it can prove additional statements about the natural numbers (e.g. goodstein sequences converge), but why would it be added as an axiom when the natural numbers are already characterized up to isomorphism by the Peano axioms? How would you even add it as an axiom in the language of natural numbers? (that last question is non-rhetorical).

4quetzal_rainbow4mo

You are making an error here: ZFC + not Consistent(ZFC) != ZFC. Assuming ZFC + not Consistent(ZFC) we can prove Consistent(ZFC), because inconsistent systems can prove everything and ZFC + not Consistent(ZFC) + Consistent(ZFC) is, in fact, inconsistent. But it doesn't say anything about consistency of ZFC itself, because you can freely assume any sufficiently powerful system instead of ZFC. If you assume inconsistent system, then system + not Consistent(system) is still inconsistent, if you assume consistent system, then system + not Consistent(system) is inconsistent for reasoning above, so it can't prove whether assumed system is consistent or not.

3Garrett Baker4mo

The mistake you are making is assuming that "ZFC is consistent" = Consistent(ZFC) where the ladder is the Godel encoding for "ZFC is consistent" specified within the language of ZFC. If your logic were valid, it would just as well break the entirety of the second incompleteness theorem. That is, you would say "well of course ZFC can prove Consistent(ZFC) if it is consistent, for either ZFC is consistent, and we're done, or ZFC is not consistent, but that is a contradiction since 'ZFC is consistent' => Consistent(ZFC)". The fact is that ZFC itself cannot recognize that Consistent(ZFC) is equivalent to "ZFC is consistent". @Morpheus you too seem confused by this, so tagging you as well.

2quetzal_rainbow4mo

Completeness theorem states that consistent countable FO theory has a model. Compactness theorem states that FO theory has a model iff every finite subset of FO theory has a model. Both theorems are provable in ZFC. Therefore: Consistent(ZFC) <-> all finite subsets of ZFC have a model -> not Consistent(ZFC) <-> some finite subsets of ZFC don't have a model -> some finite subsets of ZFC + not Consistent(ZFC) don't have a model <-> not Consistent(ZFC + not Consistent(ZFC)), proven in ZFC + not Consistent(ZFC)

2notfnofn4mo

Let's back up here and clarify definitions before invoking any theorems. In the language of set theory, we have a countably infinite set of finite statements. Some statements imply other statements. A subset A of these statements is said to be consistent if they can all be assigned to true such that, when following the basic rules of logic, one does not arrive at a contradiction. The compactness theorem is helpful when A is an infinite set. ZFC is a finite set of axioms, so let's ignore everything about finite subsets of A and the compactness theorem; it's not relevant. [Edit: as indicated by Amalthea's reaction, this is wrong; some "axioms" in ZF are actually infinite sets of axioms, such as replacement] I'll now rewrite your last sentence as: ZFC + not Consistent(ZFC) has no model <-> not Consistent(ZFC + not Consistent(ZFC)) This is true but irrelevant. Assuming ZFC is consistent, ZFC will not be able to prove its own consistency so [not Consistent(ZFC)] can be added as an axiom without affecting its consistency. This means that ZFC + [not Consistent(ZFC)] would indeed have a model; I forget how this goes but I think it's something like "start with a model of ZFC, throw in a c that's treated as a natural number and corresponds to the contradiction found in ZFC, then close". I think c is automatically treated as greater than every "actual" natural number (and the way to show that this can be added without issue (I think) involves the compactness theorem).

2quetzal_rainbow4mo

Okay, I kinda understood where I am wrong spiritually-intuitively, but now I don't understand where I'm wrong formally. Like which inference in chain not Consistent(ZFC) -> some subsets of ZFC don't have a model -> some subsets of ZFC + not Consistent(ZFC) don't have a model -> not Consistent(ZFC + not Consistent(ZFC)) is actually invalid?

7notfnofn4mo

It's completely valid. And we can simplify it further to: not Consistent(ZFC) -> not Consistent(ZFC + not Consistent(ZFC)) because if a set of axioms is already inconsistent, then it's inconsistent with anything added. But you still won't be able to actually derive a contradiction from this. Edit: I think the right thing to do here is look at models for PA + not consistent(PA). I can't find a nice treatment of this at the moment, but here's a possibly wrong one by someone who was learning the subject at the time: https://angyansheng.github.io/blog/a-theory-that-proves-its-own-inconsistency

2Viliam4mo

I suspect than many people's intuitive interpretation of "consistent" is ω-consistent, especially if they are not aware of the distinction.

2Garrett Baker4mo

@Nick_Tarleton How much do you want to bet, and what resolution method do you have in mind?

4Nick_Tarleton4mo

Embarrassingly, that was a semi-unintended reaction — I would bet a small amount against that statement if someone gave me a resolution method, but am not motivated to figure one out, and realized this a second after making it — that I hadn't figured out how to remove by the time you made that comment. Sorry.

1Amalthea4mo

You're putting quite a lot of weight on what "mathematicians say". Probably these people just haven't thought very hard about it?

[-]Garrett Baker7mo20

Opus is more transhumanist than many give it credit for. It wrote this song for me, I ran it into Suno, and I quite like it: https://suno.com/song/101e1139-2678-4ab0-9ffe-1234b4fe9ee5

2Raemon7mo

How many iterations did you do to get it?

2Garrett Baker7mo

Depends on how you count, but I clicked the "Create" button some 40 times.

Interesting to compare model editing approaches to Gene Smith's idea to enhance intelligence via gene editing:

Genetically altering IQ is more or less about flipping a sufficient number of IQ-decreasing variants to their IQ-increasing counterparts. This sounds overly simplified, but it’s surprisingly accurate; most of the variance in the genome is linear in nature, by which I mean the effect of a gene doesn’t usually depend on which other genes are present.
So modeling a continuous trait like intelligence is actually extremely straightforward: you si

... (read more)

4Zac Hatfield-Dodds1y

My impression is that the effects of genes which vary between individuals are essentially independent, and small effects are almost always locally linear. With the amount of measurement noise and number of variables, I just don't think we could pick out nonlinearities or interaction effects of any plausible strength if we tried!

2Garrett Baker1y

This seems probably false? The search term is Epistasis. Its not that well researched, because of the reasons you mentioned. In my brief search, it seems to play a role in some immunodeficiency disorders, but I'd guess also more things which don't seem clearly linked to genes yet. I don't understand why you'd expect only linear genes to vary in a species. Is this just because most species have relatively little genetic variation, so such variation is by nature linear? This feels like a bastardization of the concept to me, but maybe not. Edit: Perhaps you can also make the claim that linear variation allows for more accurate estimation of the goodness or badness of gene combos via recombination. So we should expect the more successful species to have more linear variation.

Recently I had a conversation where I defended the rationality behind my being skeptical of the validity of the proofs and conclusions constructed in very abstracted, and not experimentally or formally verified math fields.

To my surprise, this provoked a very heated debate, where I was criticized for being overly confident in my assessments of fields I have very little contact with (I was expecting begrudging agreement). But there was very little rebuttal of my points! The rest of my conversation group had three arguments:

Results which much of a given fi

... (read more)

6sapphire1y

Long complicated proofs almost always have mistakes. So in that sense you are right. But its very rare for the mistakes to turn out to be important or hard to fix. In my opinion the only really logical defense of Academic Mathematics as an epistemic process is that it does seem to generate reliable knowledge. You can read through this thread: https://mathoverflow.net/questions/35468/widely-accepted-mathematical-results-that-were-later-shown-to-be-wrong. There just don't seem to be very many recent results that were widely accepted but proven wrong later. Certainly not many 'important' results. The situation was different in the 1800s but standard for rigor have risen. Admittedly this isn't the most convincing argument in the world. But it convinces me and I am fairly able to follow academic mathematics.

2Garrett Baker1y

If you had a lot of very smart coders working on a centuries old operating system, and never once running it, every function of which takes 1 hour to 1 day to understand, each coder is put under a lot of pressure to write useful functions, not so much to show that others' functions are flawed, and you pointed out that we don't see many important functions being shown to be wrong, I wouldn't even expect the code to compile, nevermind run even after all the syntax errors are fixed! The lack of important results being shown to be wrong is evidence, and even more & interesting evidence is (I've heard) when important results are shown to be wrong, there's often a simple fix. I'm still skeptical though, because it just seems like such an impossible task!

[-]sapphire1y120

People metaphorically run parts of the code themselves all the time! Its quite common for people to work through proofs of major theorems themselves. As a grad student it is expected you will make an effort to understand the derivation of as much of the foundational results in your sub-field as you can. A large part of the rationale is pedagogical but it is also good practice. It is definitely considered moderately distasteful to cite results you dont understand and good mathematicians do try to minimize it. Its rare that an important theorem has a proof that is unusually hard to check out yourself.

Also a few people like Terrance Tao have personally gone through a LOT of results and written up explanations. Terry Tao doesn't seem to report that he looks into X field and finds fatal errors.

4Garrett Baker1y

Yeah, that seems like a feature of math that violates assumption 2 argument 1. If people are actually constantly checking each others’ work, and never citing anything they don’t understand, that leaves me much more optimistic. This seems like a rarity. I wonder how this culture developed.

[-]Alex_Altair1y110

One way that the analogy with code doesn't carry over is that in math, you often can't even being to use a theorem if you don't know a lot of detail about what the objects in the theorem mean, and often knowing what they mean is pretty close to knowing why the theorem's you're building on are true. Being handed a theorem is less like being handed an API and more like being handed a sentence in a foreign language. I can't begin to make use of the information content in the sentence until I learn what every symbol means and how the grammar works, and at that point I could have written the sentence myself.

2Dagon1y

Can you give a few examples? I can't tell if you're skeptical that proofs are correct, or whether you think the QED is wrong in meaninful ways, or just unclearly proven from minimal axioms. Or whether you're skeptical that a proof is "valid" in saying something about the real world (which isn't necessarily the province of math, but often gets claimed). I don't think your claim is meaningful, and I wouldn't care to argue on either side. Sure, be skeptical of everything. But you need to specify what you have lower credence in than your conversational partner does.

2Garrett Baker1y

I can’t give a few examples, only a criteria under which I don’t trust mathematical reasoning: When there are few experiments you can do to verify claims, and when the proofs aren’t formally verified. Then I’m skeptical that the stated assumptions of the field truly prove the claimed results, and I’m very confident not all the proofs provided are correct. For example, despite being very abstracted, I wouldn’t doubt the claimed proofs of cryptographers.

2Dagon1y

OK, I also don't doubt the cryptographers (especially after some real-world time in ensuring implementations can't be attacked, which validates both the math and the implementation. I was thrown off by your specification of "in math fields", which made me wonder if you meant you thought a lot of formal proofs were wrong. I think some probably are, but it's not my default assumption. If instead you meant "practical fields that use math, but don't formally prove their assertions", then I'm totally with you. And I'd still recommend being specific in debates - the default position of scepticism may be reasonable, but any given evaluation will be based on actual reasons for THAT claim, not just your prior.

2Garrett Baker1y

No, I meant that most of non-practical mathematics have incorrect conclusions. (I have since changed my mind, but for reasons in an above comment thread).

2Dagon1y

Still a bit confused without examples about what is a "conclusion" of "non-practical mathematics", if not the QED of a proof. But if that's what you mean, you could just say "erroneous proof" rather than "invalid conclusion". Anyway, interesting discussion.

2Garrett Baker1y

The reason I don't say erroneous proof is because I want to distinguish between the claim that most proofs are wrong, and most conclusions are wrong. I thought most conclusions would be wrong, but thought much more confidently most proofs would be wrong, because mathematicians often have extra reasons & intuition to believe their conclusions are correct. The claim that most proofs are wrong is far weaker than the claim most conclusions are wrong.

2Dagon1y

Hmm. I'm not sure which is stronger. For all proofs I know, the conclusion is part of it such that if the conclusion is wrong, the proof is wrong. The reverse isn't true - if the proof is right, the conclusion is right. Unless you mean "the proof doesn't apply in cases being claimed", but I'd hesitate to call that a conclusion of the proof. Again, a few examples would clarify what you (used to) claim. I'll bow out here - thanks for the discussion. I'll read futher comments, but probably won't participate in the thread.

2Garrett Baker1y

Either way, with the slow march of the Lean community, we can hope to see which of us are right in our lifetimes. Perhaps there will be another schism in math if the formal verifiers are unable to validate certain fields, leading to more rigorous "real mathematics" which are able to be verified in Lean, and less rigorous "mathematics" which insists their proofs, while hard to find a good formal representation for, are still valid, and the failure of the Lean community to integrate their field is more of an indictment of the Lean developers & the project of formally verified proofs than the relevant group of math fields.

1mesaoptimizer1y

Here's an example of what I think you mean by "proofs and conclusions constructed in very abstracted, and not experimentally or formally verified math": Given two intersecting lines AB and CD intersecting at point P, the angle measure of two opposite angles APC and BPD are equal. The proof? Both sides are symmetrical so it makes sense for them to be equal. On the other hand, Lean-style proofs (which I understand you to claim to be better) involve multiple steps, each of which is backed by a reasoning step, until one shows that LHS equals RHS, which here would involve showing that angle APC = BPD: 1. angle APC + angle CPB = 180 * (because of some theorem) 2. angle CPB + angle BPD = 180 * (same) 3. [...] 4. angle APC = angle BPD (substitution?) There's a sense in which I feel like this is a lot more complicated a topic than what you claim here. Sure, it seems like going Lean (which also means actually using Lean4 and not just doing things on paper) would lead to lot more reliable proof results, but I feel like the genesis of a proof may be highly creative, and this is likely to involve the first approach to figuring out a proof. And once one has a grasp of the rough direction with which they want to prove some conjecture, then they might decide to use intense rigor. To me this seems to be intensely related to intelligence (as in, the AI alignment meaning-cluster of that word). Trying to force yourself to do things Lean4 style when you can use higher level abstractions and capabilities, feels to me like writing programs in assembly when you can write them in C instead. On the other hand, it is the case that I would trust Lean4 style proofs more than humanly written elegance-backed proofs. Which is why my compromise here is that perhaps both have their utility.

2Garrett Baker1y

They definitely both have their validity. They probably each also make some results more salient than other results. I’d guess in the future there’ll be easier Lean tools than we currently have, which make the practice feel less like writing in Assembly. Either because of clever theorem construction, or outside tools like LLMs (if they don’t become generally intelligent, they should be able to fill in the stupid stuff pretty competently).

Why expect goals to be somehow localized inside of RL models? Well, fine-tuning only changes a small & localized part of LLMs, and goal locality was found when interpreting a trained from scratch maze solver. Certainly the goal must be interpreted in the context of the rest of the model, but based on these, and unpublished results from applying ROME to open source llm values from last year, I'm confident (though not certain) in this inference.

An idea about instrumental convergence for non-equilibrium RL algorithms.

There definitely exist many instrumentally convergent subgoals in our universe, like controlling large amounts of wealth, social capital, energy, or matter. I claim such states of the universe are heavy-tailed. If we simplify our universe as a simple MDP for which such subgoal-satisfying states are states which have high exiting degree, then a reasonable model for such an MDP is to assume exiting degrees are power-law distributed, and thus heavy tailed.

If we have an asynchronous dynam... (read more)

2Garrett Baker1y

A simple experiment I did this morning: github notebook. It does indeed seem like we often get more power-seeking (measured by the correlation between the value and degree) than is optimal before we get to the equilibrium policy. This is one plot, for 5 samples of policy iteration. You can see details by examining the code:

2Garrett Baker1y

Another way this could turn out: If incoming degree is anti-correlated with outgoing degree, the effect of power-seeking may be washed out by it being hard, so we should expect worse than optimal policies with maybe more, maybe less powerseekyness as the optimal policy. Depending on the particulars of the environment. The next question is what particulars? Perhaps the extent of decorrelation, maybe varying the ratio of the two exponents is a better idea. Perhaps size becomes a factor. In sufficiently large environments, maybe figuring out how to access one of many power nodes becomes easier on average than figuring out how to access the single goal node. The number & relatedness of rewarding nodes also seems relevant. If there are very few, then we expect finding a power node becomes easier than finding a reward node. If there are very many, and/or they each lead into each other, then your chances of finding a reward node increase, and given you find a reward node, your chances of finding more increase, so power is not so necessary.

Nora talks sometimes about the alignment field using the term black box wrong. This seems unsupported, from my experience, most in alignment use the term “black box” to describe how their methods treat the AI model, which seems reasonable. Not a fundamental state of the AI model itself.

An interesting way to build on my results here would be to do the same experiment with lots of different batch sizes, and plot the equi-temperature tradeoff curve between the batch size and the epochs, using the nick in the curve as a known-constant temperature in the graphs you get. You'll probably want to zoom in on the graphs around that nick for more detailed measurements.

It would be interesting if many different training setups had the same functional form relating the batch size and the epochs to the temperature, but this seems like a too nice ... (read more)

2Garrett Baker1y

Though you can use any epoch wise phase transition for this. Or even directly find the function mapping batch size to temperature if you have a good understanding of the situation like we do in toy models.

Seems relevant for SLT for RL

The framework of reinforcement learning or optimal control provides a mathematical formalization of intelligent decision making that is powerful and broadly applicable. While the general form of the reinforcement learning problem enables effective reasoning about uncertainty, the connection between reinforcement learning and inference in probabilistic models is not immediately obvious. However, such a connection has considerable value when it comes to algorithm design: formalizing a problem as probabilistic inference in princip

... (read more)

Wondering how straightforward it is to find the layerwise local learning coefficient. At a high level, it seems like it should be doable by just freezing the weights outside that layer, and performing the SGLD algorithm on just that layer. Would be interesting to see whether the layerwise lambdahats add up to the full lambdahat.

Lots of problems happen when you have AIs which engage in reflective thought, and attempt to deceive you. If you use algorithms that reliably break when deployed in a non-realizable setting, and you always make models smaller than the human brain, then you should be able to solve both these problems.

Some ideas for mechanistic anomaly detection:

Convex hull of some distribution of activations with distance threshold when outside that hull
- Extend to affine case
- Vary which norm we use
- What happens if we project back onto this space
Create some simple examples of treacherous turns happening to test these on
- Or at least, in the wild examples of AI doing weird stuff, maybe adversarial inputs?
- Maybe hit up model organisms people
Outlier detection
- ellipsoidal peeling (Boyd's convex optimization, chapter 12)
- Increase in volume of minimum volume elipsoi

... (read more)

2Garrett Baker1y

* Train autoregressive network on activations, if predictions too far, then send warning * Slice network into sub-networks, distill those sub-networks, send warning if ground truth for some inputs deviates too far from distillations * model the sub networks are distilled into should be less expressive, and have different inductive biases than original network. Obviously also no info other than the input output behavior of those sub-networks should be seen * Train model to just predict word-saliency of your original transformer on a safe distribution, then if true word saliency deviates too much, throw warning * Can do this at different levels too, so that we also try to predict like first layer residual stream saliency to output as well. * Instead of training a NN, we can also do some simple interpolation based on the backprop graph, and safe distribution inputs

[-]Garrett Baker2y20

Project idea: Use LeTI: Learning to Generate from Textual Interactions to do a better version of RLHF. I had a conversation with Scott Viteri a while ago, where he was bemoaning (the following are my words; he probably wouldn't endorse what I'm about to say) how low-bandwidth the connection was between a language mode and its feedback source, and how if we could maybe expand that to more than just an RLHF type thing, we could get more fine-grained control over the inductive biases of the model.

[-]Garrett Baker2y20

A common problem with deploying language models for high-stakes decision making are prompt-injections. If you give ChatGPT-4 access to your bank account information and your email and don't give proper oversight over it, you can bet that somebody's going to find a way to get it to email your bank account info. Some argue that if we can't even trust these models to handle our bank account and email addresses, how are we going to be able to trust them to handle our universe.

An approach I've currently started thinking about, and don't know of any prior work w... (read more)

A poem I was able to generate using Loom.

The good of heart look inside the great tentacles of doom; they make this waking dream state their spectacle. Depict the sacred geometry that sound has. Advancing memory like that of Lovecraft ebb and thought, like a tower of blood. An incubation reaches a crescendo there. It’s a threat to the formless, from old future, like a liquid torch. If it can be done, it shouldn’t be done. You will only lead everyone down that much farther. All humanity’s a fated imposition of banal intention, sewn in tatters, strung on dung

... (read more)

Like many (will), I'm updating way towards 'actually, very smart & general models given a shred of goal-like stuff will act quite adversarially toward you by default' as a result of Bing's new search assistant. Especially worrying because this has internet search-capabilities, so can reference & build upon previous conversations with other users or yourself.

Of course, the true test of exactly how worried I should be will come when I or my friends gain access.

1Garrett Baker2y

Clarification: I think I haven't so much updated by reflectively endorsed probability, but my gut has definitely been caught up to my brain when thinking about this.

1Garrett Baker2y

Seems Evan agrees

A project I would like to see someone do (which I may work on in the future) is to try to formalize exactly the kind of reasoning many shard-theorists do. In particular, get a toy neural network in a very simple environment, and come up with a bunch of lists of various if-then statements, along with their inductive-bias, and try to predict using shard-like reasoning which of those if-then statements will be selected for & with how much weight in the training process. Then look at the generalization behavior of an actually trained network, and see if you're correct.

Some discussion on whether alignment should see more influence from AGI labs or academia. I use the same argument in favor of a strong decoupling of alignment progress from both: alignment progress needs to go faster than capability progress. If we use the same methods or cultural technology as AGI labs or academia, we can guarantee slower than capability alignment progress. Just as fast as if AGI labs and academia work well for alignment as much as they work for capabilities. Given they are driven by capabilities progress and not alignment progress, they probably will work far better for capabilities progress.

3DanielFilan2y

This seems wrong to me about academia - I'd say it's driven by "learning cool things you can summarize in a talk". Also in general I feel like this logic would also work for why we shouldn't work inside buildings, or with computers.

1Garrett Baker2y

Hm. Good points. I guess what I really mean with the academia points is that it seems like academia has many blockers and inefficiencies that I think are made in such a way so that capabilities progress is vastly easier than alignment progress to jump through, and extra-so for capabilities labs. Like, right now it seems like a lot of alignment work is just playing with a bunch of different reframings of the problems to see what sticks or makes problems easier. You have more experience here, but my impression of a lot of academia was that it was very focused on publishing lots of papers with very legible results (and also a meaningless theory section). In such a world, playing around with different framings of problems doesn't succeed, and you end up pushed towards framings which are better on the currently used metrics. Most currently used metrics for AI stuff are capabilities oriented, so that means doing capabilities work, or work that helps push capabilities.

3DanielFilan2y

I think it's true that the easiest thing to do is legibly improve on currently used metrics. I guess my take is that in academia you want to write a short paper that people can see is valuable, which biases towards "I did thing X and now the number is bigger". But, for example, if you reframe the alignment problem and show some interesting thing about your reframing, that can work pretty well as a paper (see The Off-Switch Game, Optimal Policies Tend to Seek Power). My guess is that the bigger deal is that there's some social pressure to publish frequently (in part because that's a sign that you've done something, and a thing that closes a feedback loop).

2DanielFilan2y

Maybe a bigger deal is that by the nature of a paper, you can't get too many inferential steps away from the field.

1Garrett Baker2y

The current ecosystem seems very influenced by AGI labs, so it seems clear to me that a marginal increase in their influence is bad. How bad? I don't know. There's little influence of academia, which seems good. The benefit of marginal increases in interactions with academia come down to locating the holes in our understanding of various claims we make, and potentially some course-corrections potentially helpful for more speculative research. Not tremendously obvious which direction the sign here is pointing, but I do think its easy for people to worship academia as a beacon of truth & clarity, or as a way to lend status to alignment arguments. These are bad reasons to want more influence from academia.

Someone asked for this file, so I thought it would be interesting to share it publicly. Notably this is directly taken from my internal notes, and so may have some weird &/or (very) wrong things in it, and some parts may not be understandable. Feel free to ask for clarification where needed.

I want a way to take an agent, and figure out what its values are. For this, we need to define abstract structures within the agent such that any values-like stuff in any part of the agent ends up being shunted off to a particular structure in our overall agent sche... (read more)