Rationality is about pattern recognition, not reasoning

25 JonahSinick 26 May 2015 07:23PM

Short version (courtesy of Nanashi)

Our brains' pattern recognition capabilities are far stronger than our ability to reason explicitly. Most people can recognize cats across contexts with little mental exertion. By way of contrast, explicitly constructing a formal algorithm that can consistently recognize cats across contexts requires great scientific ability and cognitive exertion.

Very high level epistemic rationality is about retraining one's brain to be able to see patterns in the evidence in the same way that we can see patterns when we observe the world with our eyes. Reasoning plays a role, but a relatively small one. Sufficiently high quality mathematicians don't make their discoveries through reasoning. The mathematical proof is the very last step: you do it to check that your eyes weren't deceiving you, but you know ahead of time that your eyes probably weren't deceiving you.

I have a lot of evidence that this way of thinking is how the most effective people think about the world. I would like to share what I learned. I think that what I've learned is something that lots of people are capable of learning, and that learning it would greatly improve people's effectiveness. But communicating the information is very difficult.

It took me 10,000+ hours to learn how to "see" patterns in evidence in the way that I can now. Right now, I don't know how to communicate how to do it succinctly. In order to succeed, I need collaborators who are open to spending a lot of time thinking carefully about the material, to get to the point of being able to teach others. I'd welcome any suggestions for how to find collaborators.

continue reading »

How my social skills went from horrible to mediocre

29 JonahSinick 19 May 2015 11:29PM

Over the past few months, I've become aware that my understanding of social reality had been distorted to an extreme degree. It took 29 years for me to figure out what was going on, but I finally now understand.

The situation is very simple: The amount of time that I put into interacting within typical social contexts was very small, so I didn't get enough feedback to realize, as I otherwise would have, that I had a major blind spot.

Now that I've identified the blindspot, I can work on it, and my social awareness has been increasing at a very rapid clip. I had no idea that I had so much potential for social awareness. I had been in a fixed mindset rather than a growth mindset: I had thought "social skills will never be my strong point, so I shouldn't spend time trying to improve them; instead I should focus on what I'm best at." I'm astonished by how much my relationships have improved over a span of mere weeks.

I give details below.

continue reading »

Is Scott Alexander bad at math?

31 JonahSinick 04 May 2015 05:11AM

This post is a third installment to the sequence that I started with The Truth About Mathematical Ability and Innate Mathematical Ability. I begin to discuss the role of aesthetics in math. 

There was strong interest in the first two posts in my sequence, and I apologize for the long delay. The reason for it is that I've accumulated hundreds of pages of relevant material in draft form, and have struggled with how to organize such a large body of material. I still don't know what's best, but since people have been asking, I decided to continue posting on the subject, even if I don't have my thoughts as organized as I'd like. I'd greatly welcome and appreciate any comments, but I won't have time to respond to them individually, because I already have my hands full with putting my hundreds of pages of writing in public form.

continue reading »

Innate Mathematical Ability

40 JonahSinick 18 February 2015 11:11AM

In my present sequence of posts, I'm writing about the nature of mathematical ability. My main reason for doing so is to provide information that can help improve mathematical ability.

Along the way, I'm going to discuss the respects in which people can't improve their mathematical ability. This may seem antithetical to my goal. Focus on innate ability can lead to a sort of self-fulfilling prophecy, where people think that their abilities are fixed and can't be improved, which results in them not improving their abilities because they think that doing so is pointless.

Carol Dweck has become well known for her growth mindset / fixed mindset framework. She writes:

"In a fixed mindset students believe their basic abilities, their intelligence, their talents, are just fixed traits. They have a certain amount and that's that, and then their goal becomes to look smart all the time and never look dumb. In a growth mindset, students understand that their talents and abilities can be developed through effort, good teaching and persistence. They don't necessarily think everyone's the same or anyone can be Einstein, but they believe everyone can get smarter if they work at it." [...] This is important because individuals with a "growth" theory are more likely to continue working hard despite setbacks...

As I'll describe in my next post, I'm broadly sympathetic with Dweck's perspective. But it's not an either-or situation. Some abilities are innate and can't be developed, and other abilities can be.

One could argue that this idea is too nuanced for most people to appreciate, so that it's better to just not talk about innate ability. This seems to me paternalistic and patronizing. People need to know which abilities are fixed and which can be developed, so that they can focus on developing abilities that can in fact be developed rather than wasting time and effort on developing those that can't be.

continue reading »

The Truth About Mathematical Ability

61 JonahSinick 12 February 2015 01:29AM

There's widespread confusion about the nature of mathematical ability, for a variety of reasons:

  • Most people don't know what math is.
  • Most people don't know enough statistics to analyze the question properly.
  • Most mathematicians are not very metacognitive.
  • Very few people have more than a casual interest in the subject.

If the nature of mathematical ability were exclusively an object of intellectual interest, this would be relatively inconsequential. For example, many people are confused about Einstein’s theory of relativity, but this doesn’t have much of an impact on their lives. But in practice, people’s misconceptions about the nature of mathematical ability seriously interfere with their own ability to learn and do math, something that hurts them both professionally and emotionally.

I have a long standing interest in the subject, and I’ve found myself in the unusual position of being an expert. My experiences include:

  • Completing a PhD in pure math at the University of Illinois.
  • Four years of teaching math at the high school and college levels (precalculus, calculus, multivariable calculus and linear algebra)
  • Personal encounters with some of the best mathematicians in the world, and a study of great mathematicians’ biographies.
  • A long history of working with mathematically gifted children: as a counselor at MathPath for three summers, through one-on-one tutoring, and as an instructor at Art of Problem Solving.
  • Studying the literature on IQ and papers from the Study of Exceptional Talent as a part of my work for Cognito Mentoring.
  • Training as a full-stack web developer at App Academy.
  • Doing a large scale data science project where I applied statistics and machine learning to make new discoveries in social psychology.

I’ve thought about writing about the nature of mathematical ability for a long time, but there was a missing element: I myself had never done genuinely original and high quality mathematical research. After completing much of my data science project, I realized that this had changed. The experience sharpened my understanding of the issues.

This is the first of a sequence of posts where I try to clarify the situation. My main point in this post is:

There are several different dimensions to mathematical ability. Common measures rarely assess all of these dimensions, and can paint a very incomplete picture of what somebody is capable of.

continue reading »

The Role of Attractiveness in Mate Selection: Individual Variation

19 JonahSinick 23 January 2015 11:21AM

This post reports on a portion of my analysis of Fisman and Iyengar's speed dating dataset which bears on the question of how people select romantic partners. 


Note: I made very substantial edits to the second-to-last section of this post after posting it, addressing questions of generalizability. I've also cross-posted to my blog.

Summary

  • Participants rated one another on several dimensions. The majority of variation in the ratings is captured by the average of the different rating types: some people were regarded as good overall, and others were regarded as not good overall.
  • The second most important source of variation in the ratings given to participants is that some were regarded as more attractive and fun than they were intelligent/sincere, and for others, the situation was reversed. 
  • Broadly, when people had to choose between partners who were seen as attractive and fun and partners who were seen as intelligent and sincere, they had a moderately strong preference for partners who were seen as attractive and fun.
  • Individuals varied substantially in how they responded to the tradeoff, with some showing a very strong preference for people who were seen as attractive and fun, and others showing virtually no such preference.

 

The speed dating context may be unusual in that people make a decision on whether or not to see somebody again after only 4 minutes of interaction. On the other hand, some people do meet their partners in contexts such as bars and speed dating events where decisions are made based on brief interactions. To this extent, the empirical phenomena in data from the study are relevant to understanding mate selection in general.

continue reading »

How subjective is attractiveness?

23 JonahSinick 13 January 2015 12:04AM

Consider the two statements:

  • There is a universal standard for beauty.
  • Beauty is in the eye of the beholder.

Most people would agree that there's some truth to each of these statements. At Thing of Things Ozy wrote:

As for the beauty thing… well, yeah, everyone’s beautiful in the sense that everyone is sexually attractive to someone, and that human bodies in general are pretty cool-looking. But conventional attractiveness is still a thing. While I’m fairly conventionally attractive (thin, white, clear skin, symmetrical features), I doubt hairy legs, bound chests, and haircuts that make one look like a teenage boy are going to be all the rage at Cosmo any time soon.

This post explores the question of the extent to which each of the two statements is true, using data from a study of speed dating events conducted by Raymond Fisman and Sheena Iyengar. 

The basic facts  that I describe here are:

  • Attractiveness as defined by group consensus can be modeled well using a normal distribution.
  • The group consensus on somebody's attractiveness accounted for roughly 60% of the variance in people's perceptions of the person's relative attractiveness.
  • The distribution of people's perceptions of the relative attractiveness of a fixed person can be modeled well using a normal distribution. Moreover, the standard deviations of these distributions tend to be quite close to one another (across different people), so that it's often possible to approximate the entire distribution of perceptions of somebody's relative attractiveness using only the mean of the distribution, which is just the group consensus on the person's attractiveness. 
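As a rough illustration of the second point, here is a minimal sketch of how one could estimate the share of variance in perceived relative attractiveness explained by group consensus. The file name, column names, and the within-rater standardization used to define "relative attractiveness" are assumptions for illustration; the post's exact definitions may differ.

```python
import pandas as pd

# Illustrative only: the file name and column names ("rater", "ratee",
# "attractiveness") are assumptions, not the dataset's actual schema.
ratings = pd.read_csv("speed_dating_ratings.csv")

# "Relative attractiveness": standardize each rater's ratings so that
# differences between raters' rating scales wash out.
relative = ratings.groupby("rater")["attractiveness"].transform(
    lambda x: (x - x.mean()) / x.std()
)

# Group consensus: the mean relative rating that each ratee received.
consensus = relative.groupby(ratings["ratee"]).transform("mean")

# Fraction of variance in individual perceptions explained by the consensus.
explained = 1 - (relative - consensus).var() / relative.var()
print(f"Variance explained by group consensus: {explained:.0%}")
```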

There's much more to say about how to interpret the group consensus and its implications, which I'll go into in a later post.

continue reading »

Methodology for predicting speed dating participants' decisions

7 JonahSinick 25 December 2014 08:43PM

In my last post I described phenomena that I used to predict speed dating participants' decisions by estimating the participants' general selectivity and perceived desirability. I was planning on following up with a discussion of the phenomena that I used to refine the model by taking into account differences between individuals. But since comments focused on methodology rather than the empirical phenomena, I decided to write about methodology first, so that readers wouldn't have to suspend disbelief while reading my next post.

This post is more dense and technical than my last one. I wrote it for readers who want to check the details of the work, or who have a strong interest in statistics and/or machine learning. If you don't fall into either category but are interested in the series, you can skip it without loss of continuity.

Here I'll address three points:

  • The situation that I attempted to simulate and how faithful one should expect the simulation to be.
  • The exact definitions of the rating averages that I referenced in my last post.
  • My criteria for including a feature in the model.
Most of the code that I used is here.

The simulation

The underlying question that I attempted to address is "Suppose that a speed dating company wanted to organize events with more matches. How could machine learning help?"

As Ryan Carey pointed out, the model that I developed uses data about other speed dates that participants had been on to predict decisions on a given speed date. It is possible to make nontrivial predictions exclusively using information that was available before the participants attended the events, but I haven't systematically explored how well one could do. So the model that I developed is potentially useful only in the special case where participants had attended similar past events.

In fact, the participants in the dataset attended only a single speed dating event, not multiple events, so it's not possible to directly check whether the model would in fact predict behavior at future events based on past events. I instead simulated a situation where participants had attended similar events in the past, by imagining that for a given date, all other dates that the pair of people had been on had occurred in a past event.

It's very likely that the simulation overstates the predictive power that the model would give in practice, if for no other reason than regression to the mean. For example, the most popular participants at an event are more likely than the other participants to have been at their best on the day of the event, so the confidence one can have that someone who was chosen by most of their dates at an event will be chosen by partners at a different event is lower than the confidence one can have that the person will be chosen by partners at the same event.

If one were to apply the model in a real world setting, one would collect data that allowed one to quantify the expected regression to the mean, and also to improve the model. 

Average ratings

Conceptually, the foundation of the model is the idea that you can infer a participant's traits from:

  • Averages of the ratings that members of the opposite sex gave the participant (one average for each type of rating). 
  • Averages of the ratings that the participant gave members of the opposite sex.

For the sake of limiting unnecessary verbiage, it's useful to think of the decision that a participant makes on a partner as a "rating," where a 'no' decision corresponds to a rating of 0 and a 'yes' decision corresponds to a rating of 1. 

The first point to make is that given a rater / ratee pair, we need to exclude from consideration both the ratings that the rater gave the ratee and the ratings that the ratee gave the rater. This is because we're trying to predict whether two people who have never been on a speed date would be interested in seeing each other again if they were to go on a speed date.

Excluding these ratings wouldn't be crucial if the speed dating events involved each person going on thousands of speed dates: in that case, the ratings that the two people had given each other would correspond to slight perturbations of the averages. But when an event involves only ~15 people, the impact of a single rating on somebody's average can be large enough so that failing to exclude the individuals' ratings of one another would substantially overstate the predictive power of the model while simultaneously obscuring what was going on.

Given a rating type R, and two participants A and B whose decisions we're trying to predict, let R(A,B) be the rating that A gave B, and let R(B) be the sum of the ratings that were given to B. Let N be the number of people who rated B. One might think that the right features to look at are

[R(B) - R(A,B)]/(N - 1)  (**)

But these features are still contaminated with the decisions we're trying to predict. To see this, consider the case of a dataset including only a single ratee B. In this case, R(B) is constant, so when the rating type R is 'decision,' the feature's value depends only on R(A,B), so that one can solve for R(A,B) in terms of the feature. 

Even though we have many more than one rater, the contamination is still an issue. Some machine learning algorithms are capable of learning the identities of individual raters, and if they do so, they can learn how to solve (**) for each individual rater.

Rather than using (**), we imagine that at the event, B had been on a date with someone other than A, who we call a "surrogate" of A. We model the surrogate of A using another participant A' that B dated. Conceptually, A' is a randomly selected participant amongst the participants who B dated, but literally picking one at random would break the symmetry of the data in a way that could dilute the statistical power of the data, so I instead made a uniform choice to replace A by the participant who B would have dated that round if the speed dating schedule had been slightly different.

The feature then becomes

[R(B) - R(A,B) + R(A', B)]/N

In the special case where the rating type is "decision," the averages correspond to frequencies, and for ease of comparison with other features these are most naturally replaced by their log odds, so I did this.

I normalized these averages by subtracting off the average of all ratings that participants of B's gender would have received at the event had the surrogates of A and B attended the event in lieu of A and B. This washes out heterogeneity in raters' rating scales from event to event. 
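To make the construction above concrete, here is a minimal sketch in Python. The dataset schema (an "event"/"rater"/"ratee"/"rating" table) is an assumption for illustration, and the surrogate is passed in as a parameter rather than derived from the schedule; the author's linked code is the authoritative version.

```python
import numpy as np
import pandas as pd

# Illustrative sketch: "dates" is assumed to have one row per speed date,
# with columns "event", "rater", "ratee", and "rating" (assumed names).
dates = pd.read_csv("speed_dating.csv")

def surrogate_average(dates, rater, ratee, surrogate):
    """Compute [R(B) - R(A,B) + R(A',B)] / N: the average rating received
    by the ratee B, with the rating from the rater A swapped out for the
    rating from A's surrogate A'. How A' is chosen (the schedule-based
    choice described above) is left to the caller here."""
    received = dates[dates["ratee"] == ratee]
    n = len(received)
    total = received["rating"].sum()
    r_ab = received.loc[received["rater"] == rater, "rating"].iloc[0]
    r_a2b = received.loc[received["rater"] == surrogate, "rating"].iloc[0]
    return (total - r_ab + r_a2b) / n

def to_log_odds(frequency, eps=1e-3):
    """For the 'decision' rating type the averages are frequencies;
    convert them to log odds, clipping to avoid infinities at 0 and 1."""
    p = np.clip(frequency, eps, 1 - eps)
    return np.log(p / (1 - p))

# The event-level normalization (subtracting the average rating that
# participants of B's gender would have received) is omitted here.
```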

Distinguishing noise from signal: my criteria for including a feature

In order to avoid overfitting the dataset in a way that would reduce the generalizability of the findings, I imposed a high threshold that features had to meet in order to be included in the model. From the point of view of discovery, this was very helpful insofar as it helped me discover the core phenomena that I used.

One could argue that the filters are collectively too strict, but I've chosen to use them for several reasons:

  • The tendency to see signal in noise is so strong that even when people make an effort to avoid it, they're nearly always not doing enough, so it seems better to err on the conservative side.
  • I wanted to make an unambiguous case that the features I did include add incremental predictive power. I'm fairly confident that to the extent that the factors that influenced the participants at the event reflect general human behavioral tendencies, the predictive power of the features that I identified also generalizes. My main source of uncertainty is that nobody's checked my work in detail.
  • From an expository point of view, the effect sizes of the features that I excluded are arguably too small for them to warrant comment.

    If I were strictly focused on optimizing for predictive power, I would have included features that improve predictive power by a tiny margin with 60% confidence, but I had no reason to do so: even in aggregate, the resulting difference in predictive power wouldn't have been striking, it's unlikely that anyone will actually use the model, and even if someone does, there will be opportunities to collect more data and make a better model.

    What's interesting is not so much exactly how predictive the model is, but what the main driving factors are and how they interact.

I've enumerated the criteria below. In practice, there’s a fair amount of redundancy between them: if a feature didn’t pass through one of them, it usually failed to pass through at least one other. But this fact only emerged gradually, and I used each individually at different times.

I tried to keep the number of features that I used small

The dataset that I’ve been working with is derived from 9 speed dating events involving ~160 people of each gender, for a total of ~3000 dates. The size is sufficiently large so that we can hope to get a broad sense for what’s going on, but not sufficiently large so that we can determine the influence of individual idiosyncrasies in great detail. If we hope for too much, we’re apt to base our model on patterns that don’t generalize, regardless of how much cross checking we do.

My final model uses only 5 features to predict men's decisions and only 3 features to predict women's decisions.

I only included a feature when the fact that it increased the model's performance was in consonance with my intuitions

For example, I found that empirically, people who expressed a preference for people who share their interests were considered to be undesirable, but given the small size of the dataset and the absence of evidence for the phenomenon coming from other sources, using this to make predictions seemed ill-advised.

I restricted myself to using features that were derived from a relatively large number of examples, both of speed dates and of people.

The female engineering graduate students in the sample showed a very strong preference for male engineering graduate students over other men. They were also far more receptive to dating the male engineering students than other women were. The engineering/engineering cross feature passed through all of the other filters that I used, but although there were 40 dates between engineering graduate students, they involved only 6 women, so I dropped the feature.

I used cross validation

Suppose there were 20 people who have some trait X, and that most of them were considered about as desirable as usual, but 2 of them were rejected by everyone. In this case it might look like people with trait X are a little less likely to be chosen. We don't want to base our model on participants' responses to only two people.

If we split the dataset into two subsets, train our model on one, and test it on the other, then if one of the unpopular people is in the train set and one is in the test set, including the feature could increase the model's performance on the test set. With a dataset of this size, the boost in performance could be large enough so that one would be inclined to include the feature based on the increase in performance.

The standard method used to avoid this problem is cross-validation: instead of using a single train/test split, use many train/test splits. If including a feature in the model improves performance for a large fraction of train/test splits with sufficiently low redundancy between them, that can provide much stronger evidence that the predictive power of the feature will generalize.

For each event, I split the data into a test set consisting of that event, and a train set consisting of all other events. With this setup:

  • When both of the unpopular people are in the train set, including trait X as a feature makes the model's predictions for the test set worse.
  • In the instances where one of the people is in the train set and the other is in the test set, including the feature may improve performance. But there are at most 2 such instances out of 9 train/test splits.
  • Should it happen that both people were at the same event, including the feature won't improve performance for any of the events, because when the two people are in the test set, there's no pattern in the train set for the model to pick up on.

    The fact that the model never does better in this case is helpful, because flukish occurrences are more likely to be concentrated in a single event than they are to be split up across different events: for example, maybe the two unusual people are friends who have a lot in common and signed up for the same event together.

I required that, when predictions are generated in this way (with one train/test split for each event), every feature that I include improve performance:

  1. When we average all predictions made across the whole dataset.
  2. For a majority of events when we look at the data by event. 
  3. For a majority of raters when we look at the data by rater.
  4. For a majority of ratees when we look at the dataset by ratee.

Having spent a long time with the dataset, it was more or less clear to me that the train/test splits that I used were enough, but I realized this may not be clear a priori, so I did a final check in which, before forming the train/test splits, I removed each individual from the dataset in turn, and each wave from the dataset in turn. This is in the spirit of leave-one-out cross validation. It turns out to be overkill: (1)-(4) are never violated for any feature that I used, except for one that occasionally fell short of meeting criterion (4) by a single ratee.
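For concreteness, here is a minimal sketch of the leave-one-event-out scheme described above, assuming a numeric feature matrix X, a 0/1 decision vector y, and an array of event ids. The logistic regression is only a placeholder; the post doesn't specify which learning algorithm was used.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

def leave_one_event_out_loss(X, y, events, feature_subset):
    """Average log loss over train/test splits in which each event in
    turn serves as the test set."""
    losses = []
    for event in np.unique(events):
        test = events == event
        model = LogisticRegression()  # placeholder model
        model.fit(X[~test][:, feature_subset], y[~test])
        probs = model.predict_proba(X[test][:, feature_subset])[:, 1]
        losses.append(log_loss(y[test], probs, labels=[0, 1]))
    return float(np.mean(losses))

# A candidate feature earns its place only if adding it lowers the loss:
# loss_without = leave_one_event_out_loss(X, y, events, [0, 1])
# loss_with = leave_one_event_out_loss(X, y, events, [0, 1, 2])
```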

I measured performance using "log loss," which is a technical measure of the quality of probabilistic predictions. I omit a description of it because I figure that readers either already know it or don't have the time/energy to absorb an explanation, but I can write about it if someone would like.  
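For readers who do want it: log loss is the mean negative log-likelihood of the observed 0/1 decisions under the predicted probabilities (lower is better). The sketch below is essentially what scikit-learn's log_loss computes for binary outcomes.

```python
import numpy as np

def binary_log_loss(y_true, p_pred, eps=1e-15):
    """Mean negative log-likelihood of the observed 0/1 decisions under
    the predicted probabilities of a 'yes' decision; lower is better."""
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1 - eps)
    y = np.asarray(y_true, dtype=float)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))
```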

The tables below show how much predictive power increases when we include a given feature, starting from a base consisting of all other features that we used. Here the columns correspond to criteria (1)-(4), and the numbers in the "Avg boost" column are drops in log loss. Since I haven't defined the features, I've left them unlabeled, but I'll label them once I've written my next post.

Women’s decisions:

Feature   Avg boost   % events   % of raters   % ratees
1         0.0874      100%       63%           83%
2         0.0645      100%       75%           69%
3         0.0035      100%       58%           55%

Men’s decisions:

Feature   Avg boost   % events   % of raters   % ratees
1         0.1162      100%       64%           90%
2         0.0874      100%       84%           72%
3         0.0030      78%        56%           53%
4         0.0024      89%        55%           55%
5         0.0017      67%        64%           67%

The tables and the criteria that I described don't tell the whole story as far as overfitting goes: the features depend on numerical parameters, which are themselves overfit to the dataset, in the sense that to some extent I picked them with a view toward maximizing the numbers in the table.

But this sort of overfitting corresponds to optimizing the expected performance of the model on hypothetical future datasets, which is the opposite of picking features that are likely to be predictive only in the context of the dataset. It overstates the predictive power of the model in more general contexts, but it's simultaneously the case that not doing it would produce a model that performs worse in general settings.

The choices that I made seem fairly natural, and to the extent that they overstate the model's predictive power, the effect seems likely to be minor. If one had more data, one could obtain improved estimates for the numerical parameters. The more serious distortion in potential predictive power comes from the absence of data on participants across multiple events. 

Thanks to Brian Tomasik for catching an error in an earlier version of this post.

Using machine learning to predict romantic compatibility: empirical results

24 JonahSinick 17 December 2014 02:54AM

Overview

For many people, having a satisfying romantic relationship is one of the most important aspects of life. Over the past 10 years, online dating websites have gained traction, and they have access to large amounts of data that could be used to build predictive models to help people find compatible partners. Such data is seldom public, but Columbia Business School professors Ray Fisman and Sheena Iyengar compiled a rich and relevant data set for their paper Gender Differences in Mate Selection: Evidence From a Speed Dating Experiment. Their main results were:

Women put greater weight on the intelligence and the race of partner, while men respond more to physical attractiveness. Moreover, men do not value women’s intelligence or ambition when it exceeds their own. Also, we find that women exhibit a preference for men who grew up in affluent neighborhoods. Finally, male selectivity is invariant to group size, while female selectivity is strongly increasing in group size.

I found the study through Andrew Gelman’s blog, where he wrote:

What I really want to do with these data is what I suggested to Ray and Sheena several years ago when they first told me about the study: a multilevel model that allows preferences to vary by person, not just by sex. Multilevel modeling would definitely be useful here, since you have something like 10 binary observations and 6 parameters to estimate for each person.

Several months ago I decided to pursue a career in data science, and with a view toward building my skills, I worked to build a model to predict whether an individual participant would express interest in seeing a given partner again. Along with the goal of learning, I had the dual intent of contributing knowledge that had the potential, however slight, to help people find satisfying romantic relationships.

It’s unlikely that what I did will have practical applications (as basic research seldom does), but I did learn a great deal about many things, most having to do with data science methodology in general, but also some about human behavior.

This is the first of a series of posts where I report on my findings. A linear narrative would degenerate into a sprawling blog post that would be of little interest to anybody but me. In this post, I’ll restrict focus to the question: how much predictive power can we get by estimating the generic selectivity and desirability of the people involved, without using information about the interactions between their traits?

I’ll ultimately go into the details of the methodology that I used, including discussion of statistical significance, the rationale for the decisions that I made, and links to relevant code, but here I’ll suppress technical detail in favor of relegating it to separate blog posts that might be of interest to a more specialized audience. In several places I speculate as to the meaning of the results. I’ve made efforts to subject my reasoning to cross checks, but have gotten almost no external feedback yet, and I’d welcome counter considerations, alternative hypotheses, etc. I’m aware that there are places where claims that I make don’t logically follow from what precedes them, and I’m not so much looking for examples of this in general as for instances where there’s a sizable probability that I’ve missed something that alters the bottom line conclusions.

continue reading »

Cognito Mentoring and zero-sum competition

5 JonahSinick 29 April 2014 11:06PM

I've recently made some posts ([1], [2], [3], [4]) geared toward helping high school and college students increase their expected future earnings, as a part of my work for Cognito Mentoring. A LWer recently asked how closely aligned this is with our goal of producing social value. Specifically, he noted that in some cases the effect of our advice will be to help people win zero-sum competitions. For example, to the extent that majoring in economics has signaling benefits, if somebody majors in economics based on our information, the signaling benefits to him or her will be counterbalanced by reduced signaling benefits for others. At first glance, this appears to lack social value, raising the question of why we're working to provide this information. We recognize the tension, and this post responds to this question.

continue reading »
