A few places where this argument breaks down...
First and most important: we can make a weaker model out of a stronger model if we know in advance that we want to do so, and actually try to, and make sure the stronger system does not have a chance to stop us (e.g. we don't run it). If there's an agentic superhuman AGI already undergoing takeoff, then "make it weaker" is not really an option. Even if there's an only-humanish-level agentic AGI already running, if that AGI can easily spin up a new instance of itself without us noticing before we turn it off, or arrange for someone else to spin up a new instance, then "make it weaker" isn't really an option. Plausibly even a less-than-human-level agent could pull that off; infosec does usually have an attacker's advantage.
(Subproblem 1: on some-but-not-all threat models, a superhuman AGI is already a threat when it's in training. So plausibly "don't run the strong model" wouldn't even be enough, we'd have to not even train the strong model.
Subproblem 2 (orthogonal to subproblem 1): looking at a strong model and figuring out how aligned/corrigible/etc it is, in a way robust enough to generalize well to even moderately strong capabilities, is itself one of the hardest open problems in alignment. So in order for a plan involving "build strong model and make it weaker" to help, the plan would have to weaken the strong model unconditionally, not check whether the strong model has problems and then weaken it. At which point... why use a stronger model in the first place? There are still some reasons, but a lot fewer.
Put subproblems 1 & 2 together, and we're basically back to "don't use a strong model in the first place" - i.e. unconditionally do not train a strong model.)
Second: one would need to know the relevant way in which to weaken the model. "Corrupting n% of its inputs/outputs" just doesn't matter that much on most threat models I can think of - for instance, it doesn't really matter at all for deception.
Third: in order for this argument to go through, one does need to actually use the mechanism from the argument, i.e. weaken the stronger model. Without necessarily accusing you specifically of anything, when I hear this argument, my gut expectation is that the arguer's next step will be to say "great, so let's assume that alignment gets easier as models get stronger" and then completely forget about the part where their plan is supposed to involve weakening the model somehow. For instance, I could imagine someone next arguing "well, today's systems are already reasonably aligned, and it only gets easier as models get stronger, so we should be fine!" without realizing/considering that this argument only works insofar as they actually expect all AI labs to intentionally weaken their own models (or do something strictly better for alignment than that, despite subproblem 2 above). So if someone made this argument to me in the context of a broader plan, I'd be on the lookout for that.
(Meta-note: I'm not saying I endorse the premises of all these counterarguments. These are just some counterarguments I see, under some different models.)
Here is the list of counter-arguments I prepared beforehand:
1) Digital cliff: it may not be possible to weaken a stronger model
2) Competition: the existence of a stronger model implies we live in a more dangerous world
3) Deceptive alignment: the stronger model may be more likely to deceive you into thinking it's aligned
4) Wireheading: the user may be unable to resist using the stronger model even knowing it is more dangerous
5) Passive safety: the weaker model may be passively safe while the stronger model is not
6) Malicious actors: the stronger model may be more likely to be used by malicious actors
7) Inverse scaling: the stronger model may be weaker in some safety-critical dimensions
8) Domain of alignment: the stronger model may be more likely to be used in a safety-critical context
I think the strongest counter-arguments are:
I would love to hear a stronger argument for what @johnswentworth describes as "subproblem 1": that the model might become dangerous during training. All of the versions of this argument that I am aware of involve some "magic" step where the AI unboxes itself (e.g. via a side-channel attack or by talking its way out of the box) that seems to either require huge leaps in intelligence or be easily mitigated (air-gapped networks, two-person control).