The traditional answer is that it is proper to reward doing "good" things socially, but they should not be enforced legally. One will be celebrated as a hero for saving people from a burning house, but one will not be charged with murder for not saving people from a burning house.
You're conflating two different questions here:
1. What interval of quantified goodness (utility) should the Law actively promote, by distributing punishments or rewards to agents? What are the least good good deeds the Law should care about, and what are the most good good deeds?
2. Restricting our attention to deeds the Law actively promotes or discourages, how ungood does an act have to be before the Law should discourage it via positive punishment, as opposed to just discouraging it by withholding a reward or by rewarding a somewhat-less-bad alternative action?
You start off speaking as though you're answering the first question -- when should the state be indifferent to supererogation? -- but then you only list punishment (and extremely harsh punishment, at that!) as the mechanism by which Laws can incentivize behavior. This is confusing. Whether the Law should encourage people (e.g., with economic incentives) to save their neighbors from burning houses is quite a different question from whether the Law should punish people who don't save their neighbors, and that in turn is quite a different question from whether such a punishment should be as harsh as that for, say, manslaughter! A $100 fine is also a punishment. (And a $100 reward is also an incentive.)
If two groups disagree about the baseline, they have a moral disagreement even if they use the same utility function. They disagree about whether choosing the worse act B instead of the better act A should be punished.
I don't agree with this. If two rational and informed people disagree about whether enacting a certain punishment is a good idea, then they don't have the same utility function -- assuming they have utility functions at all.
I think the core problem is that you're conceiving the Law as a utilometer. You input the goodness or badness of an act's consequences. (Or its act-type's foreseeable consequences.) The Law, programmed with a certain baseline, calculates how far those consequences fall below the baseline, and assigns a punishment proportional to the distance below. (If it is at or above the baseline, the punishment is 0.) The Law acts as a sort of karmic justice system, mirroring the world's distribution of utility. (We could have a similar system that rewards things for going above the baseline, but never mind that.)
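The "utilometer" picture can be sketched in a few lines (my own toy formalization, not anything from the original comment; the linear `severity` factor is an assumption):

```python
def utilometer_punishment(act_utility, baseline, severity=1.0):
    """Toy model of the Law-as-utilometer: punishment grows with how far
    an act's utility falls below a fixed baseline."""
    shortfall = baseline - act_utility
    # At or above the baseline, the punishment is zero.
    return max(0.0, severity * shortfall)

print(utilometer_punishment(-5.0, 0.0))  # 5.0: below baseline, punished proportionally
print(utilometer_punishment(3.0, 0.0))   # 0.0: at/above baseline, no punishment
```

The sketch makes the key feature visible: punishment here tracks only the act's distance below the baseline, never the downstream effects of punishing it.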
In contrast, I think just about any consistent consequentialist will want to think of the Law as a non-map tool. The Law isn't a way of measuring an act's badness and outputting a proportional punishment; it's a lever for getting people to behave better and thereby making the world a more fun place to live in. Questions 1 and 2 above are wrong questions, because the ideal set of Laws almost certainly won't consistently respond to acts in proportion to the acts' foreseeable harm. Rather, the ideal set of Laws will respond to acts in whichever way leads to the best outcome. If act A is worse than act B, but people end up overall much better off if we use a harsher punishment against B than against A, then we should use the harsher punishment against B. (Assuming we have to punish both acts at all.)
So no Schelling point is needed. The facts of our psychology should determine how useful it is to rely on punishment vs. reward in different scenarios. They should also determine how useful it is to rely on material rewards vs. social or internal ones in different contexts. Laws are (ideally) a way of making the right thing happen more often, not a way of keeping tabs on exactly how right or wrong individual actions are.
The current issue of the Oxford Left Review has a debate between socialist Pete Mills and two 80,000 Hours people, Ben Todd and Sebastian Farquhar: The Ethical Careers Debate, pp. 4-9. I'm interested in it because I want to understand why people object to the ideas of 80,000 Hours. A paraphrasing:
As a socialist, Mills really doesn't like the argument that the best way to help the world's poor is probably to work in heavily capitalist industries. He seems to be avoiding engaging with Todd and Farquhar's arguments, especially replaceability. He also really doesn't like looking at things in terms of numbers, I think because numbers suggest certainty. When I calculate that in 50 years of giving away $40K a year you save 1000 lives at $2K each, that's not saying the number is exactly 1000. It's saying 1000 is my best guess, and unless I can come up with a better guess it's the estimate I should use when choosing between this career path and other ones. He also doesn't seem to understand prediction and probability: "every revolution is impossible, until it is inevitable" may be how it feels for those living under an oppressive regime but it's not our best probability estimate. [1]
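The back-of-the-envelope estimate above, in code (the figures are the ones quoted in the paragraph; $2K per life saved is a best guess, not a measured cost):

```python
# Rough expected-value sketch of the career-giving estimate above.
# All numbers are the ones quoted in the text, not authoritative data.
years = 50
donated_per_year = 40_000   # dollars given away each year
cost_per_life = 2_000       # best-guess marginal cost to save a life

total_donated = years * donated_per_year
expected_lives_saved = total_donated / cost_per_life
print(expected_lives_saved)  # 1000.0 -- a best guess, not a certainty
```

The output is an estimate to compare career paths against each other, not a promise that exactly 1000 lives will be saved.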
In a previous discussion a friend was also misled by calculations. When I said "one can avert infant deaths for about $500 each" their response was "What do they do with the 500 dollars? That doesn't seem to make sense. Do they give the infant a $500 anti-death pill? How do you know it really takes a constant stream of $500 for each infant?". Have other people run into this? Bad calculations also tend to be distributed widely, with people saying things like "one pint of blood can save up to three lives" when the expected marginal lives saved is actually tiny. Maybe we should focus less on estimates of effectiveness in smart-giving advocacy? Is there a way to show the huge difference in effect between the best charities and most charities without using these numbers?
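A toy model (entirely made-up numbers) of why "up to three lives per pint" can coexist with a tiny expected marginal effect: once supply meets demand, one more donation adds almost nothing, even though each pint actually used helps several people:

```python
def lives_saved(pints):
    # Toy model with invented numbers: blood helps until demand
    # (here, 1000 pints) is met; pints beyond that add nothing.
    # "3 lives per pint" is the quoted per-pint maximum.
    return 3 * min(pints, 1000)

supply = 5000  # hypothetical current supply, well above demand

average_per_pint = lives_saved(supply) / supply                 # 0.6
marginal_lives = lives_saved(supply + 1) - lives_saved(supply)  # 0

print(average_per_pint, marginal_lives)
```

The "up to three lives" figure describes the best case for a pint that gets used, while the decision-relevant number is the marginal effect of one more donation, which here is zero.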
Maybe I should have way more of these discussions, enough that I can collect statistics on what arguments and examples work and which don't.
(I also posted this on my blog)
[1] Which is not to say you can't have big jumps in probability estimates. I could put the chance of revolution at 5% based on historical data, but then hear some new information about how one has just started and sounds really promising, which bumps my estimate up to 70%. But expected value calculations for jobs can work with numbers like these; it's just "impossible" and "inevitable" that break estimates.
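To make the footnote's point concrete (illustrative numbers only, not real estimates): an expected-value comparison works fine with probabilities like 5% or 70%, but "impossible" (0) and "inevitable" (1) simply assert one branch and discard the weighing:

```python
def expected_value(p_revolution, value_if_revolution, value_otherwise):
    # Standard expected value over two outcomes; all inputs are
    # made-up numbers for illustration.
    return p_revolution * value_if_revolution + (1 - p_revolution) * value_otherwise

# Estimates like 5% or 70% give usable, comparable numbers:
print(expected_value(0.05, 100, 10))  # about 14.5
print(expected_value(0.70, 100, 10))  # about 73

# "Impossible" and "inevitable" aren't estimates at all -- they just
# pick one branch and ignore the other entirely:
print(expected_value(0.0, 100, 10))   # 10.0
print(expected_value(1.0, 100, 10))   # 100.0
```

A 5%-to-70% jump changes the answer a lot, but the calculation still works; 0 and 1 remove the calculation.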