It's correct that, so far, Ought has been running small-scale experiments with people who know the research background. (What is amplification? How does it work? What problem is it intended to solve?)
Over time, we also think it's necessary to run larger-scale experiments. We're planning to start by running more and longer experiments with contractors instead of volunteers, probably over the next month or two. Longer-term, it's plausible that we'll build a platform similar to what this post describes. (See here for related thoughts.)
The reason we've focused on small-scale experiments with a select audience is that it's easy to do busywork that doesn't tell you anything about the question of interest. The purpose of our experiments so far has been to get high-quality feedback on the setup, not to gather object-level data. As a consequence, the experiments have been changing a lot from week to week. The biggest recent change is the switch from task decomposition (analogous to amplification with imitation learning as the distillation step) to decomposition of evaluation (analogous to amplification with RL as the distillation step). Based on these changes, I think that if we had stopped at any point so far and focused on scaling up instead of refining the setup, it would have been a mistake.
My immediate reaction is: why do you think the real and not the toy problems you are trying to solve are factorizable?
To take an example from your link, "What does a field theory look like in which supersymmetry is spontaneously broken?" does not appear to be an easily factorizable question. One needs to have 6+ years of intensive math and theoretical physics education to even understand properly what the question means and why it is worth answering. (Hint: it may not be worth answering, given that there are no experimentally detected super partners and there is no indication that any might exist below Planck scale.)
Provided you have reached the required level of understanding of the problem, why do you think that the task of partitioning the question is any easier than actually solving the question? Currently the approach in academia is hiring a small number of relatively well supervised graduate students, maybe an occasional upper undergrad, to assist in solving a subproblem. I have seen just one case with a large number of grad students, and that was when the problem had already been well partitioned and what was needed was warm bodies to explore the parameter spaces and add small tweaks to a known solution.
I do not know how much research has been done on factorizability, but that seems like a natural place to start, so that you avoid going down the paths where your chosen approach is unlikely to succeed.
My immediate reaction is: why do you think the real and not the toy problems you are trying to solve are factorizable?
My immediate reaction is: why do you ask this question here? Wouldn't it be better placed under an authoritative article like this rather than my clumsy little exploration?
why do you think that the task of partitioning the question is any easier than actually solving the question? Currently the approach in academia is hiring a small number of relatively well supervised graduate students, maybe an occasional upper undergrad, to assist in solving a subproblem.
To me this looks like you're answering your own question. What am I not understanding? If I saw the physics questions above and knew something about the topic, I would probably come up with a list of questions or approaches. Someone else could then work on each of those. The biggest issue that I see is that so much information is lost and so much friction introduced when unraveling a big question into short sub-questions. It might not be possible to recover from that.
I do not know how much research has been done on factorizability
This is part of what Ought is doing, as far as I understand. From the Progress Update Winter 2018: ‘Feasibility of factored cognition: I'm hesitant to draw object-level conclusions from the experiments so far, but if I had to say something, I'd say that factored cognition seems neither surprisingly easy nor surprisingly hard. I feel confident that our participants could learn to reliably solve the SAT reading comprehension questions with a bit more iteration and more total time per question, but it has taken iteration on this specific problem to get there, and it's likely that these experiments haven't gotten at the hard core of factored cognition yet.’
Abstract
Factored cognition is a possible basis for building aligned AI. Currently Ought runs small-scale experiments with it. In this article I sketch some benefits of building a system for doing large-scale experiments and generating large amounts of data for ML training. Then I roughly estimate how long it would take to build such a system. I'm not confident that this exploration is useful at all, but at least I wrote it down.
Benefits
If you want to know what factored cognition is, see here.
Ought does small-scale experiments with factored cognition (cf. Ought's Progress Update Winter 2018). I thought: wouldn't it be nice to do these experiments at a much larger scale, with enough users that one root question could be answered within three hours, at any time on any day of the week?
Benefits:
The feedback loop would be much tighter than with the weekly or bi-weekly experiments that Ought runs now. A tight feedback loop is great in many ways. For example, it would allow a researcher to test more hypotheses more often, more quickly and more cheaply. This in turn helps her to generate more hypotheses overall.
Note that I might be misunderstanding the goals and constraints of Ought's experiments. In that case this benefit might be irrelevant.
It would generate a lot of data. These could be used as training data when we want to train an ML system to do factored cognition.
Quantifying these benefits is possible, but would take some weeks of modelling and talking with people. So far I'm not confident enough of the whole idea to make the effort.
Feasibility
We would need three things for a large-scale factored cognition system to work: the system itself, enough users and useful behaviour of these users. I'll use Stack Overflow as a basis for my estimates and call large-scale factored cognition ‘Fact Overflow’.
Building Stack Overflow took five months from start of development to public beta. Then they spent a lot of time tweaking the system to make it more attractive and maintain quality. So I'd say building Fact Overflow would take five to fifteen months with a team of two to five people.
For calculating how many users would be required, I used the following estimates (90 % confidence interval, uniformly distributed):
x_c is the share of workspaces in a tree that one user can work on without being contaminated, i.e. without getting clues about the context of some workspaces.
The estimates are sloppy and probably overconfident. If people show interest in this topic, I will make them tighter and better calibrated.
Now if we want a tree of workspaces to be finished within t_f, we need n_u* users, where: n_u* = (n_w ⋅ n_a) / (x_c ⋅ x_a ⋅ f_a ⋅ t_f)
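To make the formula concrete, here is a minimal sketch in Python. The readings of n_w, n_a, x_a and f_a below are my interpretation of the estimate variables, and all the numbers are made-up placeholders rather than the estimates from the table; only the meaning of x_c and the three-hour target for t_f come from the post.

```python
# Minimal sketch of the user-count formula with made-up placeholder values.
# Only the meaning of x_c and the value of t_f are taken from the post;
# everything else is a hypothetical reading of the estimate variables.

n_w = 50      # workspaces per tree (hypothetical)
n_a = 10      # actions needed per workspace (hypothetical)
x_c = 0.05    # share of a tree's workspaces one user can work on uncontaminated
x_a = 0.02    # share of registered users active at any given time (hypothetical)
f_a = 20      # actions an active user performs per hour (hypothetical)
t_f = 3       # target time to finish one tree, in hours

n_u = (n_w * n_a) / (x_c * x_a * f_a * t_f)
print(f"required users: {n_u:.0f}")  # about 8300 with these placeholder numbers
```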
A Guesstimate model based on this formula tells me that for t_f = 3 h we need between 600 and 36 k users. Note that Guesstimate runs only 5000 samples, so the numbers jump around with each page reload. Note also that the actual time to finish a tree might be longer, depending on how long users take for each action and how many sub-questions have to be worked on in sequence.
How long would it take to accumulate these numbers of users? For this I use the number of sign-ups to Stack Exchange (of which Stack Overflow is the largest part). Let me assume that between 75 % and 98 % of people who sign up actually become users. That means between 700 and 42 k sign-ups are required. This is also in Guesstimate. What I can't include in the Guesstimate simulation is the difference between the growth rates of Stack Overflow and Fact Overflow. Assume that it takes Fact Overflow twice as long as Stack Overflow to reach a certain number of sign-ups. Then it would take one month to reach 700 sign-ups and twenty-two months to reach 42 k sign-ups.
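For completeness, here is a rough sketch of how a Guesstimate-style simulation of the sign-up number could be reproduced in Python. The interval bounds for the workspace and user parameters are again placeholders I invented, not the post's actual estimates; the only numbers taken from the text are the 5000 samples, t_f = 3 h and the 75–98 % conversion rate.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000  # Guesstimate also draws 5000 samples

# Parameters sampled uniformly between interval bounds, as in the post.
# All bounds except the sign-up conversion rate are invented placeholders.
n_w = rng.uniform(20, 100, n)      # workspaces per tree
n_a = rng.uniform(5, 20, n)        # actions per workspace
x_c = rng.uniform(0.02, 0.10, n)   # usable share of workspaces per user
x_a = rng.uniform(0.01, 0.05, n)   # share of users active at a given time
f_a = rng.uniform(10, 40, n)       # actions per active user per hour
t_f = 3.0                          # hours to finish one tree

users = (n_w * n_a) / (x_c * x_a * f_a * t_f)

# 75-98 % of sign-ups become users (from the post).
conversion = rng.uniform(0.75, 0.98, n)
signups = users / conversion

lo, hi = np.percentile(signups, [5, 95])
print(f"sign-ups needed, 90 % interval: {lo:,.0f} to {hi:,.0f}")
```

Sampling uniformly between the interval bounds is the simplest reading of "90 % confidence interval, uniformly distributed"; other distributional assumptions would widen or narrow the resulting range.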
Of course, the system would have to be useful and fun enough to retain that many users. As with Stack Overflow, the software and the community have to encourage and ensure that the users behave in a way that makes factored cognition work.
Conclusion
It would be useful to be able to experiment with factored cognition at a large scale. I can't quantify the usefulness quickly, but I did quantify very roughly what it would take: five to fifteen months of development effort with a small team plus one to twenty-two months of accumulating users.
Comment prompts