A few places where this argument breaks down...
First and most important: we can make a weaker model out of a stronger model only if we know in advance that we want to do so, actually try to, and make sure the stronger system never gets a chance to stop us (e.g. we never run it). If there's an agentic superhuman AGI already undergoing takeoff, then "make it weaker" is not really an option. Even if there's only a humanish-level agentic AGI already running, if that AGI can spin up a new instance of itself before we turn it off (without us noticing), or arrange for someone else to spin up a new instance, then "make it weaker" isn't really an option either. Plausibly even a less-than-human-level agent could pull that off; infosec usually favors the attacker.
(Subproblem 1: on some-but-not-all threat models, a superhuman AGI is already a threat while it's in training. So plausibly "don't run the strong model" wouldn't even be enough; we'd have to not even train the strong model.
Subproblem 2 (orthogonal to subproblem 1): looking at a strong model and figuring out how aligned/corrigible/etc it is, in a way robust enough to generalize well to even moderately strong capabilities, is itself one of the hardest open problems in alignment. So in order for a plan involving "build strong model and make it weaker" to help, the plan would have to weaken the strong model unconditionally, not check whether the strong model has problems and then weaken it. At which point... why use a stronger model in the first place? There are still some reasons, but a lot fewer.
Put subproblems 1 & 2 together, and we're basically back to "don't use a strong model in the first place" - i.e. unconditionally do not train a strong model.)
Second: one would need to know the relevant way in which to weaken the model. "Corrupting n% of its inputs/outputs" just doesn't matter that much on most threat models I can think of - for instance, it doesn't really matter at all for deception.
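(To make concrete what kind of "weakening" that is - purely illustrative, a minimal sketch rather than anyone's actual proposal, with a hypothetical model_fn and corruption_rate - here's what "corrupt n% of outputs" might look like. Notice that every uncorrupted output is still chosen by the same policy, deceptive or not, which is why this knob doesn't touch the deception threat model.)

```python
import random

def corrupt_outputs(model_fn, corruption_rate=0.1):
    """Wrap a generation function so a random fraction of its outputs
    are replaced with noise ("corrupt n% of its outputs").

    model_fn: any callable prompt -> response (hypothetical stand-in).
    corruption_rate: fraction of responses to throw away.
    """
    def wrapped(prompt):
        response = model_fn(prompt)
        if random.random() < corruption_rate:
            return "<corrupted>"  # raw capability goes down...
        # ...but every surviving response is still chosen by the same
        # policy, deceptive or not.
        return response
    return wrapped

# e.g. weakened_generate = corrupt_outputs(some_model.generate, corruption_rate=0.05)
```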
Third: in order for this argument to go through, one does need to actually use the mechanism from the argument, i.e. weaken the stronger model. Without necessarily accusing you specifically of anything, when I hear this argument, my gut expectation is that the arguer's next step will be to say "great, so let's assume that alignment gets easier as models get stronger" and then completely forget about the part where their plan is supposed to involve weakening the model somehow. For instance, I could imagine someone next arguing "well, today's systems are already reasonably aligned, and it only gets easier as models get stronger, so we should be fine!" without realizing/considering that this argument only works insofar as they actually expect all AI labs to intentionally weaken their own models (or do something strictly better for alignment than that, despite subproblem 2 above). So if someone made this argument to me in the context of a broader plan, I'd be on the lookout for that.
(Meta-note: I'm not saying I endorse the premises of all these counterarguments. These are just some counterarguments I see, under some different models.)
I don't fully get your argument, but I'll bite anyway. I think all of this is highly dependent on what you mean by "stronger model" and "alignment".
I think it is easier to align a stronger model if it's stronger in the sense of better understanding the goals or values you're aligning it to, and if you're able to use that understanding to control its decisions. In that case, its model of your goals/values becomes its goals/values. Its alignment is then only as good as the quality of that model, so a stronger model will be more aligned.
But that's assuming you don't lose control of the model before you do that alignment. I think this is a real, valid concern, but it can be addressed by performing alignment early and often as the model is trained/gets smarter.
I recently wrote about this in The (partial) fallacy of dumb superintelligence. That post is about the intuition, common among risk-doubters, that it should be easy to align a smart AGI, and about how that intuition is only partly wrong.