Give them a motivation that is higher than the drive to game the test. I'm an immortalist. I don't want to die. I could deceive myself and others in many ways about my skills, purposes, beliefs, but in the end I can't do that at the expense of my chances of not dying. So find a similarly important purpose, something that might even be gamed, but for which gaming means you lose. Some real-life test.
Maybe, measuring someone's capability to win. I have often wondered whether being rational correlates with being successful in society. I can't be sure, though it seems to me it should. If it doesn't, then I suppose either there's a problem with a rationality that would leave you worse off or, more likely, you aren't being rational enough, or do not have enough mental resources to use that rationality to make a difference. Bounded rationality, always an issue.
Capability to win could be measured in many ways: economic success, for instance, or any other existing societal position of power or prestige. Of course any single one of those may be gamed, but it's OK to cheat: if cheating brings you closer to what you want, then it is rational to game. However, the goal that you hold, and for which you are vying, may not be very interesting. Empty fame, etc.
It would be best to have a personal goal set, and known, and to measure how a person fares against that goal: a goal difficult enough to require the proper use of rationality to win in society, one that would require applying rationality to a very large and diverse set of situations, a goal that you'd want to preserve.
Can't help much to determine what that would be. I have my own thing to protect, as I said; I'm not sure what it might be for other people. It doesn't work all the time either. Sometimes short-term goals vie for dominance over my actions, and I'll give in to them, even if it means getting farther from my own personal long-term goal. That's a lack of willpower, not a lack of rationality, at work there, I think.
I strongly suspect that there is a possible art of rationality (attaining the map that reflects the territory, choosing so as to direct reality into regions high in your preference ordering) which goes beyond the skills that are standard, and beyond what any single practitioner singly knows. I have a sense that more is possible.
The degree to which a group of people can do anything useful about this, will depend overwhelmingly on what methods we can devise to verify our many amazing good ideas.
I suggest stratifying verification methods into 3 levels of usefulness: reputational, experimental, and organizational.
If your martial arts master occasionally fights realistic duels (ideally, real duels) against the masters of other schools, and wins or at least doesn't lose too often, then you know that the master's reputation is grounded in reality; you know that your master is not a complete poseur. The same would go if your school regularly competed against other schools. You'd be keepin' it real.
Some martial arts fail to compete realistically enough, and their students go down in seconds against real streetfighters. Other martial arts schools fail to compete at all—except based on charisma and good stories—and their masters decide they have chi powers. In this latter class we can also place the splintered schools of psychoanalysis.
So even just the basic step of trying to ground reputations in some realistic trial other than charisma and good stories, has tremendous positive effects on a whole field of endeavor.
But that doesn't yet get you a science. A science requires that you be able to test 100 applications of method A against 100 applications of method B and run statistics on the results. Experiments have to be replicable and replicated. This requires standard measurements that can be run on students who've been taught using randomly-assigned alternative methods, not just realistic duels fought between masters using all of their accumulated techniques and strength.
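To make "run statistics on the results" concrete, here is a minimal sketch in Python of one way the comparison could go: a permutation test on the difference in mean scores between students taught by method A and by method B. The scores are simulated, and the effect size, noise level, and the name `permutation_test` are illustrative assumptions, not anything measured.

```python
import random


def mean(xs):
    """Arithmetic mean of a non-empty sequence."""
    return sum(xs) / len(xs)


def permutation_test(scores_a, scores_b, n_permutations=2000, seed=0):
    """Two-sided permutation test on the difference in group means.

    Returns (observed_difference, estimated_p_value).
    """
    rng = random.Random(seed)
    observed = mean(scores_a) - mean(scores_b)
    pooled = list(scores_a) + list(scores_b)
    n_a = len(scores_a)
    extreme = 0
    for _ in range(n_permutations):
        # Re-deal the labels at random: if the method made no difference,
        # the observed gap should look like one of these shuffled gaps.
        rng.shuffle(pooled)
        diff = mean(pooled[:n_a]) - mean(pooled[n_a:])
        if abs(diff) >= abs(observed):
            extreme += 1
    # Add-one correction keeps the estimate strictly above zero.
    return observed, (extreme + 1) / (n_permutations + 1)


# Hypothetical data: 100 students per method on some standard measurement,
# where method A is ASSUMED to add 0.5 points to a noisy baseline.
rng = random.Random(42)
scores_a = [rng.gauss(5.5, 1.0) for _ in range(100)]
scores_b = [rng.gauss(5.0, 1.0) for _ in range(100)]

difference, p_value = permutation_test(scores_a, scores_b)
print(f"mean difference = {difference:.2f}, p ~ {p_value:.4f}")
```

The test only means something if students were randomly assigned to the two methods, and if the measurement is the same standard one for everybody; a duel between two masters provides neither.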
The field of happiness studies was created, more or less, by realizing that asking people "On a scale of 1 to 10, how good do you feel right now?" was a measure that statistically validated well against other ideas for measuring happiness. And this, despite all skepticism, looks like it's actually a pretty useful measure of some things, if you ask 100 people and average the results.
But suppose you wanted to put happier people in positions of power—pay happy people to train other people to be happier, or employ the happiest at a hedge fund? Then you're going to need some test that's harder to game than just asking someone "How happy are you?"
This question of verification methods good enough to build organizations, is a huge problem at all levels of modern human society. If you're going to use the SAT to control admissions to elite colleges, then can the SAT be defeated by studying just for the SAT in a way that ends up not correlating to other scholastic potential? If you give colleges the power to grant degrees, then do they have an incentive not to fail people? (I consider it drop-dead obvious that the task of verifying acquired skills and hence the power to grant degrees should be separated from the institutions that do the teaching, but let's not go into that.) If a hedge fund posts 20% returns, are they really that much better than the indices, or are they selling puts that will blow up in a down market?
If you have a verification method that can be gamed, the whole field adapts to game it, and loses its purpose. Colleges turn into tests of whether you can endure the classes. High schools do nothing but teach to statewide tests. Hedge funds sell puts to boost their returns.
On the other hand—we still manage to teach engineers, even though our organizational verification methods aren't perfect. So what perfect or imperfect methods could you use for verifying rationality skills, that would be at least a little resistant to gaming?
(Added: Measurements with high noise can still be used experimentally, if you randomly assign enough subjects to have an expectation of washing out the variance. But for the organizational purpose of verifying particular individuals, you need low-noise measurements.)
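A rough sketch of the arithmetic behind that note, in Python, assuming each subject's score carries independent noise with the same standard deviation (the function names and numbers are illustrative only): the noise in a group average shrinks with the square root of the group size, while the noise in one individual's score does not shrink at all.

```python
import math


def standard_error(noise_sd, n_subjects):
    """Standard deviation of the mean of n independent, equally noisy scores."""
    return noise_sd / math.sqrt(n_subjects)


def subjects_needed(noise_sd, target_se):
    """Smallest group size whose mean has standard error <= target_se."""
    return math.ceil((noise_sd / target_se) ** 2)


# Assumed numbers: a measurement with noise of 2 points on a 10-point scale.
noise_sd = 2.0
for n in (1, 25, 100, 400):
    print(n, standard_error(noise_sd, n))
```

With one subject the error stays at the full 2 points, which is why a measurement that is fine for an experiment can still be useless for certifying a particular person.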
So I now put to you the question—how do you verify rationality skills? At any of the three levels? Brainstorm, I beg you; even a difficult and expensive measurement can become a gold standard to verify other metrics. Feel free to email me at sentience@pobox.com to suggest any measurements that are better off not being publicly known (though this is of course a major disadvantage of that method). Stupid ideas can suggest good ideas, so if you can't come up with a good idea, come up with a stupid one.
Reputational, experimental, organizational:
Finding good solutions at each level determines what a whole field of study can be useful for—how much it can hope to accomplish. This is one of the Big Important Foundational Questions, so—
Think!
(PS: And ponder on your own before you look at the other comments; we need breadth of coverage here.)