Definitely a parallel sort of move! I would already have said that I was pretty anti-realist about rationality, but I seem to have become even more so.
I think if I had to describe briefly how I've changed my mind, it would be something like this: before, I thought that to learn an AI stand-in for human preferences, you should look at effects on the real world. Now I take much more seriously the idea that human preferences "live" in a model that is itself a useful fiction.
Six months ago, I thought I was on to something interesting involving value learning and philosophy of reference. Then I had a series of breakthroughs - or do you still call it a breakthrough when it reveals that you were on the wrong track? Reverse breakthrough? How about "repair-around." Anyhow, this is my attempt to turn the unpublished drafts into a story about their own failure.
One post in this vein did get published, "Can few-shot learning teach AI right from wrong?" It was about the appeal and shortcomings of what Abram Demski called model-utility learning.
This post had its good points. But under the surface, there was a false assumption that the model of value learning and the skill of reference were separable - that the problem was understanding reference so you could bolt it onto a model-utility learner that lacked it.
That's what the unpublished second post, "Reference as a game," was based on. It's not that I was blind to the problems - it's that I didn't see anything better. And so I pushed forward despite feeling stuck.
The post examined whether you could train an AI to play a game: it's presented with some objects, and it has to find the human-obvious classification boundary that those objects point at. I quite liked a passage I wrote there about Romans and code length.
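To make that concrete, here's a minimal sketch of one round of such a game - the feature vectors, the nearest-centroid learner, and all the names are illustrative assumptions on my part, not what the draft actually specified:

```python
import numpy as np

# One hypothetical round of the "reference game": the human picks a few
# example objects, and the AI must infer the classification boundary the
# human intended those examples to point at.

def human_examples():
    # Feature vectors for objects the human presents as in/out of a concept.
    positives = np.array([[1.0, 0.9], [0.8, 1.1], [1.2, 1.0]])
    negatives = np.array([[-1.0, -0.8], [-0.9, -1.2]])
    return positives, negatives

def guess_boundary(positives, negatives):
    # Toy learner: nearest centroid. The hard part isn't fitting *a*
    # boundary - it's fitting the boundary the human meant, which is a
    # fact about the human, not just about the objects.
    mu_pos, mu_neg = positives.mean(axis=0), negatives.mean(axis=0)
    def classify(x):
        return np.linalg.norm(x - mu_pos) < np.linalg.norm(x - mu_neg)
    return classify

positives, negatives = human_examples()
classify = guess_boundary(positives, negatives)
print(classify(np.array([0.9, 1.0])))  # True - lands on the "positive" side
```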
The problem remained, though - no matter how good your prior, you're going to have a bad time if a human-undesirable program reliably fits the data better.
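A toy calculation (numbers entirely made up) shows why: any fixed prior penalty on the undesirable program is a constant, while a per-datum likelihood advantage grows linearly in log-odds, so with enough data the undesirable program wins.

```python
import math

# Made-up numbers: the "bad" program starts with a heavy prior penalty
# (say, 20 extra bits of description length), but fits each observation
# slightly better than the "good" program.
log_prior_odds = -20 * math.log(2)    # bad vs. good, a priori
log_likelihood_ratio = math.log(1.1)  # bad fits each datum 10% better

for n in (0, 100, 200, 300):
    log_odds = log_prior_odds + n * log_likelihood_ratio
    print(f"after {n:3d} data points, log-odds (bad vs. good): {log_odds:+.1f}")

# Around 146 data points, the prior penalty is fully eaten and the
# human-undesirable program becomes the better posterior explanation.
```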
Still, the approach sounded really compelling to me. I thought I was going to end the post with a nice result. So I kept thinking about the issue, and you can actually see the paragraph in my notes where I had the repair-around.
So I learned about this general obstacle to CIRL, and slowly started to realize that I'd been pursuing not quite the right thing.
The AI had to start by trying to model humans; "reference" wasn't a skill that could be learned independently of a deep understanding of humans. And so I went back and added more to the third post I'd been working on, "Passing the buck to human intuition." Only problem was, I was just going to end up with more repair-arounds. Though I guess I shouldn't knock repair-arounds - having one means you are less wrong than you were before, after all.
The most charitable thing I can say about this post is that I was trying to work out how an AI could model humans, not as idealized agents + defects, but on the humans' own terms. The biggest problem I ran into was that this is just really, really hard. I don't even want to quote too heavily from this one - it's less interesting to watch me run headfirst into a wall, philosophically speaking.
This isn't crazy - it wasn't a waste of time to think about how humans think of preferences in terms of associations, heuristics for judging, and verbal reasoning about the future. But the draft is telling as to how much I was still trying to force everything into the model-utility-learner paradigm (word of the day: procrustean), and if I recall correctly, I was taking much too seriously the idea that there's some "best" model of human desires and internal maps that I had special insight into.
Eventually I was forced to confront the problem that the human models I was dreaming up were too fine-grained to have a basic object that was "the values."
I totally agree, past self - those sure are a bunch of problems. Since I stopped writing that post, I've gotten a better handle on how to understand humans (hint: it's the intentional stance), but I don't have a much better understanding of how to turn those human models into highly optimized plans for the AI. I do have some ideas involving predictive models, though, so maybe it's time to start writing some more posts - and making some more repair-arounds.