This is an entry in the 'Dungeons & Data Science' series, a set of puzzles where players are given a dataset to analyze and an objective to pursue using information from that dataset.

It had all seemed so promising at first. Colonizing a newly-discovered planet with two extra space dimensions would have allowed the development of novel arts and sciences, the founding of unprecedentedly networked and productive cities, and – most importantly – the construction of entirely new kinds of monuments to the Galactic Empress’ glory.

And it still might! But your efforts to expand her Empire by settling the SuperHyperSphere have hit a major snag. Your Zero-Point Power Generators – installation of which is the first step in any colonization effort – have reacted to these anomalous conditions with anomalously poor performance, to the point where your superiors want to declare this project a lost cause.

They’ve told you to halt all construction immediately and return home. They think it’s impossible to figure out which locations will be viable, and which will have substantial fractions of their output leeched by hyperdimensional anomalies. You think otherwise.

You have a list of active ZPPGs set up so far, and their (typically, disastrous) levels of performance. You have a list of pre-cleared ZPPG sites[1]. You have exactly enough time and resources to build twelve more generators before a ship arrives to collect you; if you pick twelve sites where the power generated matches or exceeds 100% of Standard Output[2], you can prove your point, prove your worth, save your colony, and save your career!

Or . . . you could just not. That’s also an option. The Empire is lenient towards failure (the Empress having long since given up holding others to the standards she sets herself), but merciless in punishing disobedience (at least, when said disobedience doesn’t bear fruit). If you install those ZPPGs in defiance of direct orders, yet fail to gather sufficient evidence . . . things might not end well for you.

What, if anything, will you do?


I’ll post an interactive you can use to test your choices, along with an explanation of how I generated the dataset, sometime on Monday the 22nd. I’m giving you nine days, but the task shouldn’t take more than an evening or two; use Excel, R, Python, the Rat Prophet, or whatever other tools you think are appropriate. Let me know in the comments if you have any questions about the scenario.

If you want to investigate collaboratively and/or call your decisions in advance, feel free to do so in the comments; however, please use spoiler tags or rot13 when sharing inferences/strategies/decisions, so people intending to fly solo can look for clarifications without being spoiled.


ETA: When exploring this dataset, you may notice a suspicious dearth of sites near the Equator(s). While I can justify it in-universe as the Empire having a weird coordinate system and/or the SuperHyperSphere being non-Euclidean, the Doylist explanation for this is just “the GM screwed up”. Please don't read too much into it!

  1. ^

    . . . which is all you're getting for now, as the site-clearing tools have already been recalled.

  2. ^

    Ideally, each of the twelve sites would have >100%, but twelve sites with a >100% average between them would also suffice to get your point across.


Hi! 
I'm new here.  

Fun puzzle. I'll take a crack at this.  

So far I've reformatted the data a bit. Links to my reformatted data (and nothing else!) below.  In a spoiler tag in case that's a spoiler somehow? 
 

This puzzle is great, thanks again for posting! 

Here's my first pass at a solution. This is my first time coding a regression model, so I probably have a bug somewhere. None of these are close to 100%; maybe the puzzle is fiendish and the correct answer is "none", or more likely I'm doing something wrong.

If the ship were arriving right now, I would follow orders and NOT build anything. But... this is fun so I'm going to keep hacking and see what I can figure out.

 ZPPG_id  ZPPG_pred
   4314   0.647873
   6123   0.645127
  48703   0.643784
  10709   0.635666
 104703   0.628708
   1273   0.626511
  53987   0.625413
  41545   0.621872
  13181   0.621323
  58945   0.619797
  99524   0.616257
  66023   0.614731

Wild idea: I wonder if the solution can't be deduced from each row of data one at a time. Maybe the highest values are like, "Due west of Eerie Silence". I'm pretty sure my regression model doesn't account for that at all...
 

Tweaked my model, but I'm still not very confident. Anyway... here's my current best guess. Hail to the Empress!
6123, 61818, 14132, 99524, 58945, 84398, 101554, 18257, 26400, 48703, 103862, 99682

Thanks for giving us this puzzle, abstractapplic.

My answer (possibly to be refined later, but I'll check others' responses and aphyer's posts after posting this):

id's: 96286,9344,107278,68204,905,23565,8415,83512,62718,42742,16423,94304

observations and approach used:

After some initial exploration I considered only a single combination of qualitative traits (No/Mint/Adequate/['Eerie Silence'], though I think it wouldn't have mattered if I chose something else) in order to study the quantitative variables without distractions. 

Since Murphy's Constant had the biggest effect, I first chose an approximation for its effect (initially a parabola), then divided the ZPPG data by that prediction so that the effect of another variable (in this case, the local value of pi) would show up better. And so on, going back to refine my previously guessed functions as the noise from other variables cleared up.

As it turned out, this approach was unreasonably effective as the large majority of the variation (at least for the traits I ended up studying  - see below) seems to be accounted for by multiplicative factors, each factor only taking into account one of the traits or variables. 
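
(For anyone wanting to replicate this: a minimal sketch of that "fit one factor, divide it out, repeat" loop might look like the code below. The file name, column names, and polynomial degrees are my guesses rather than the actual dataset's schema.)

```python
import numpy as np
import pandas as pd

# Sketch of the iterative approach: fit one multiplicative factor at a time,
# divide it out of the data, and move on to the next variable.
# "zppg_sites.csv" and the column names are placeholder assumptions.
df = pd.read_csv("zppg_sites.csv")

# Start from raw performance; we peel off one factor at a time.
residual = df["Performance"].astype(float).copy()

def fit_factor(x, y, degree):
    """Fit a polynomial approximation of one variable's multiplier against the current residual."""
    return np.poly1d(np.polyfit(x, y, degree))

# 1. Rough curve for Murphy's Constant (a parabola or cubic), then divide it out.
murphy_fit = fit_factor(df["Murphy's Constant"], residual, degree=3)
residual = residual / murphy_fit(df["Murphy's Constant"])

# 2. With Murphy's effect removed, the Local Value of Pi effect shows up more clearly.
pi_fit = fit_factor(df["Local Value of Pi"], residual, degree=2)
residual = residual / pi_fit(df["Local Value of Pi"])

# ...and so on for the other variables, going back to refine the earlier fits
# once the noise from the remaining factors has cleared up.
```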

Murphy's constant:

Cubic. (I tried to get it to fit some kind of exponential, or even a logistic function, because I had a headcanon explanation along the lines of: a higher value causes problems at a higher rate, and the individual problems multiply together before subtracting from nominal. (Or something.) But a cubic fits better.) It visually looks like it's inflecting near the extreme values of the data (not checked quantitatively), so maybe it's a (cubic) spline.

Local Value of Pi:

Piecewise linear, peaking around 3.15, with the same slope on either side I think. I tried to fit a sine to it first, for similar reasons as with Murphy and exponentials.

Latitude:

Piecewise constant, lower value if between -36 and 36.

Longitude:

This one seems to be a sine, though not literally sin(x) - displaced vertically and horizontally. I briefly experimented to see if I could get a better fit substituting the local value of pi for our boring old conventional value, didn't seem to work, but maybe I implemented that wrong.

Shortitude:

Another piecewise constant. Lower value if greater than 45. Unlike latitude, this one is not symmetrical - it only penalizes in the positive direction.

Deltitude:

I found no effect.

Traits:

I only considered traits that seemed relatively promising from my initial exploration (really just what their max value was and how many tries they needed to get it): No or EXTREMELY, Mint, Burning or Copper, (any Feng Shui) and ['Eerie Silence'] or ['Otherworldly Skittering'].

All traits tested seemed to me to have a constant multiplier. 

Values in my current predictor (may not have been tested on all the relevant data, and significant digits shown are not justified):

Extremely (relative to No): 0.94301

Burning, Copper (relative to Mint): 1.0429, 0.9224

Exceptional, Disharmonious (relative to Adequate): 1.0508, 0.8403 - edit: I think these may actually be 1.05, 0.84 exactly.

Skittering (relative to Silence): 0.960248

Residual errors typically within 1%, relatively rarely above 1.5%. There could be other things I missed (e.g. non-multiplicative interactions) to account for the rest, or afaik it could be random. Since I haven't studied other traits than the ones listed, clues could also be lurking in those traits.

Using my overall predictor, my expected values for the 12 sites listed above are about:

96286: 112.3, 9344: 110.0, 107278: 109.3, 68204: 109.2, 905: 109.0, 23565: 108.1, 8415: 106.5, 83512: 106.0, 62718: 105.9 ,42742: 105.7, 16423: 105.4, 94304: 105.2

Given my error bars in the part of the data set that I actually used, I'm pretty comfortable with this selection (in terms of building instead of folding, not necessarily that these are the best choices), though I should maybe check whether any site is right next to one of those cutoffs (latitude/shortitude), and I should also maybe be wary of extrapolating to very low values of Murphy's Constant (e.g. 94304, 23565, 96286).

edited to add: aphyer's third post (which preceded this comment) has the same sort of conclusion and some similar approximations (though mine seem to be more precise), and unnamed also mentioned that it appears to be a bunch of things multiplied together. All of aphyer's posts have a lot of interesting general findings as well.

edited to also add: the second derivative of a cubic is a linear function. The cubic having zero second derivative at two different points is thus impossible unless the linear function is zero, which happens only when the first two coefficients of the cubic are zero (so the cubic is linear). So my mumbling about inflection points at both ends is complete nonsense... however, it does have close to zero second derivative near 0, so maybe it is a spline where we are seeing one end of it where the second derivative is set to 0 at that end. todo: see what happens if I actually set that to 0
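
(Spelling out the second-derivative point in symbols:)

```latex
% For a general cubic
f(x) = a x^3 + b x^2 + c x + d
% the second derivative is linear,
f''(x) = 6 a x + 2 b
% so it has exactly one root, x = -b/(3a), whenever a \neq 0;
% a (non-degenerate) cubic therefore cannot have inflection points at two different places.
```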

edited again: see below comment - can actually set both linear and quadratic terms to 0

update:

 on Murphy:

I think that the overall multiplication factor from Murphy's constant is 1-0.004*(Murphy's constant)^3 - this appears close enough, I don't think I need linear or quadratic terms.

On Pi: 

I think the multiplication factor is probably 1-10*abs((local Value of Pi)-3.15) - again, appears close enough, and I don't think I need a quadratic term.

Regarding aphyer saying cubic doesn't fit Murphy's, and both unnamed and aphyer saying Pi needs a quadratic term, I am beginning to suspect that maybe they are modeling these multipliers in a somewhat different way, perhaps 1/x from the way I am modeling it? (I am modeling each function as a multiplicative factor that multiplies together with the others to get the end result).

edited to add: aphyer's formulas predict the log; my formulas predict the output, then I take the log after if I want to (e.g. to set a scaling factor). I think this is likely the source of the discrepancy. If predicting the log, put each of these formulas in a log (e.g. log(1-10*abs((local Value of Pi)-3.15))).
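
(Written out as code, the two multipliers above and the log-vs-multiplier point look roughly like this; the constants are just the fitted guesses from earlier in this comment, not confirmed values.)

```python
import numpy as np

# Fitted guesses for the multiplicative factors (not confirmed ground truth).
def murphy_factor(m):
    return 1 - 0.004 * m**3

def pi_factor(local_pi):
    return 1 - 10 * abs(local_pi - 3.15)

# If the factors multiply, a model predicting log(Performance) sees the sum of
# their logs, so a factor that is linear as a multiplier can look nonlinear
# (i.e. appear to need quadratic or extra terms) on the log scale.
log_contribution = np.log(murphy_factor(2.0)) + np.log(pi_factor(3.14))
```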

Do we know how the planet rotates/orients with respect to its sun (or any other local astronomical features)?

The same way it does everything: in a weird, non-Euclidean manner which defies human intuition.

Oooooookay.  Do we know what time period the existing performance data was derived over, and how long that is compared to the time we have until the ship picks us up? 

I'm asking because

I see an effect of Longitude on performance that resembles what you'd see on Earth if sunlight was good for performance.  However, I'm nervous that this effect might be present in the existing data but change by the time our superiors evaluate our performance: if we choose locations on the day side of the planet, and then the planet rotates, then our superiors will come by and the planet will be pointed a different way.

If the existing data was gathered over months and our superiors are here tomorrow, I'd be willing to assume 'the planet doesn't meaningfully rotate' and put sites at Longitudes that worked well in the existing data.  But if the existing data is the performance of all those sites this morning, I'd need to find solutions that worked without expecting to benefit from Longitude effects.

There are no time effects in the data; past trends can in general be assumed to hold in the present.

(Good question!)

It looks to me like the (spoilers for coordinates)

strange frequency distributions seen in non-longitude coordinates are a lot like what you get from a normal distribution minus another normal distribution with a lower standard deviation, scaled down so that its max is equal to the first's max. I feel like I've seen this ... vibe, I guess, from curves, when I have said "this looks like a mixture of a normal distribution and something else" and then tried to subtract out the normal part.
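
(A quick illustrative sketch of that "normal minus a narrower, rescaled normal" shape, using made-up parameters rather than anything fitted to the actual data:)

```python
import numpy as np
from scipy.stats import norm

# Illustrative only: a wide normal density minus a narrower one, with the
# narrower one scaled down so its peak matches the wide one's peak.
x = np.linspace(-90, 90, 500)
wide = norm.pdf(x, loc=0, scale=40)
narrow = norm.pdf(x, loc=0, scale=15)
narrow_scaled = narrow * (wide.max() / narrow.max())

# The difference dips to zero near the centre and leaves a hump on either
# side, roughly the "few sites near the Equator(s)" frequency pattern.
shape = wide - narrow_scaled
```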

I want to say that I don't play these, but I love reading them and reading other people play them.

My current choices (in order of preference) are

96286, 23565, 68204, 905, 93762, 94408, 105880, 8415, 94304, 42742, 92778, 62718

Did a little robustness check, and I'm going to swap out 3 of these to make it:

96286, 23565, 68204, 905, 93762, 94408, 105880, 9344, 8415, 62718, 80395, 65607

To share some more:

I came across this puzzle via aphyer's post, and got inspired to give it a try.

Here is the fit I was able to get on the existing sites (Performance vs. Predicted Performance). Some notes on it:

Seems good enough to run with. None of the highest predicted existing sites had a large negative residual, and the highest predicted new sites give some buffer.

Three observations I made along the way. 

First (which is mostly redundant with what aphyer wound up sharing in his second post):

Almost every variable is predictive of Performance on its own, but none of the continuous variables have a straightforward linear relationship with Performance.

Second:

Modeling the effect of location could be tricky. e.g., Imagine on Earth if Australia and Mexico were especially good places for Performance, or on a checkerboard if Performance was higher on the black squares.

Third:

The ZPPG Performance variable has a skewed distribution which does not look like what you'd get if you were adding a bunch of variables, but does look like something you might get if you were multiplying several variables. And multiplication seems plausible for this scenario, e.g. perhaps such-and-such a disturbance halves Performance and this other factor cuts performance by a quarter.
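
(A toy simulation of that point: sums of bounded factors come out roughly symmetric, while products of the same factors come out right-skewed. Nothing here uses the actual dataset.)

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Sum of five independent factors: roughly symmetric, normal-ish.
additive = sum(rng.uniform(0.5, 1.0, n) for _ in range(5))

# Product of five factors of the same kind: right-skewed (lognormal-ish),
# which is closer to the shape of the Performance distribution.
multiplicative = rng.uniform(0.5, 1.0, (5, n)).prod(axis=0)

# Informal skew check: mean minus median is ~0 for the sum, positive for the product.
print(np.mean(additive) - np.median(additive))
print(np.mean(multiplicative) - np.median(multiplicative))
```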

My updated list after some more work yesterday is

96286, 9344, 107278, 68204, 905, 23565, 8415, 62718, 83512, 16423, 42742, 94304

which I see is the same as simon's list, with very slight differences in the order

More on my process:

I initially modeled location just by a k nearest neighbors calculation, assuming that a site's location value equals the average residual of its k nearest neighbors (with location transformed to Cartesian coordinates). That, along with linear regression predicting log(Performance), got me my first list of answers. I figured that list was probably good enough to pass the challenge:
- the sites' predicted performance had a decent buffer over the required cutoff,
- the known sites with large predicted values did mostly have negative residuals, but they were only about 1/3 the size of the buffer,
- there were some sites with large negative residuals, but none among the sites with high predicted values, and I probably even had a big enough buffer to withstand one of them sneaking in, and
- the nearest neighbors approach was likely to mainly err by giving overly middling values to sites near a sharp border (averaging across neighbors on both sides of the border), which would cause me to miss some good sites but not to include any bad sites.

So it seemed fine to stop my work there.

Yesterday I went back and looked at the residuals and added some more handcrafted variables to my model to account for any visible patterns. The biggest was the sharp cutoff at Latitude +-36. I also changed my rescaling of Murphy's Constant (because my previous attempt had negative residuals for low Murphy values), added a quadratic term to my rescaling of Local Value of Pi (because the dropoff from 3.15 isn't linear), added a Shortitude cutoff at 45, and added a cos(Longitude-50) variable. Still kept the nearest neighbors calculation to account for any other location relevance (there is a little but much less now). That left me with 4 nines of correlation between predicted & actual performance, residuals near zero for the highest predicted sites in the training set, and this new list of sites. My previous lists of sites still seem good enough, but this one looks better.
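
(For concreteness, a rough sketch of this kind of pipeline, log-linear regression on handcrafted features plus a nearest-neighbours correction on the residuals, is below. The file name, column names, choice of sklearn, and k=10 are illustrative assumptions, not the actual code used.)

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

# "zppg_sites.csv" and all column names below are placeholder assumptions.
df = pd.read_csv("zppg_sites.csv")

# Handcrafted features mirroring the cutoffs described above.
X = pd.DataFrame({
    "murphy": df["Murphy's Constant"],
    "pi_dev": (df["Local Value of Pi"] - 3.15).abs(),
    "lat_band": (df["Latitude"].abs() <= 36).astype(float),
    "short_cut": (df["Shortitude"] > 45).astype(float),
    "cos_long": np.cos(np.radians(df["Longitude"] - 50)),
})
y = np.log(df["Performance"])

# Linear regression predicting log(Performance).
model = LinearRegression().fit(X, y)
residuals = y - model.predict(X)

# Nearest-neighbours term on the coordinate columns to soak up any remaining
# location structure (a fuller version would convert to Cartesian coordinates
# and exclude each site from its own neighbourhood).
coords = df[["Latitude", "Longitude", "Shortitude", "Deltitude"]]
knn = KNeighborsRegressor(n_neighbors=10).fit(coords, residuals)

predicted_log_performance = model.predict(X) + knn.predict(coords)
```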

> Still kept the nearest neighbors calculation to account for any other location relevance (there is a little but much less now). That left me with 4 nines of correlation between predicted & actual performance,

Interesting, that definitely suggests some additional influences that we haven't explicitly taken account of, rather than random variation.

> added a quadratic term to my rescaling of Local Value of Pi (because the dropoff from 3.15 isn't linear)

As did aphyer, but I didn't see any such effect, which is really confusing me. I'm pretty sure I would have noticed it if it were anywhere near as large as aphyer shows in his post.

edit: on the pi issue, see my reply to my own comment. Did you account for these factors as divisors dividing from a baseline, or multipliers multiplying a baseline (I did the latter)? edit: a conversation with aphyer clarified this. I see you are predicting log performance, as aphyer does, so a linear effect on the multiplier would then have a log taken of it, which makes it nonlinear.

I started putting together my analysis of this here, I'll try to update as I make more progress.

Error message: "Sorry, you don't have access to this draft"

Fixed link, thanks!

My submission:

96286
9344
68204
107278
905
23565
8415
62718
42742
83512
16423
94304

(and it seems that several other people have given the same exact answer, haha)

Thank you for posting this. My findings are as follows:

Only 2 existing locations have > 100 performance. Both of these have:
- No strange smell
- Mint air
- Adequate Feng Shui

Most other high performers (but sub-100) have the same properties. Additionally, the weird sounds of the high performers are either:
- Eerie Silence
- Otherworldly Skittering

This suggests it would be sensible to restrict ourselves to locations with these properties. This alone increases the average performance from 23.12 to 46.92.

High values of Murphy's Constant are bad, though the effect seems to become small at around 3.5. There is some evidence to suggest that, amongst the high-performing section (but not the others), too low a value of Murphy's Constant would be counterproductive, though it is a small effect. It may be a statistical fluctuation. Restricting to locations with a value < 3.5 would leave an average of 62.77.

A value of pi below the normal value also looks harmful, though one that is too high also looks counterproductive. It looks like a relatively modest effect though, and I don't want to exclude too many possible locations, so I won't exclude any locations based on the value of pi.

There are few high-performing ones between latitude -38 and +38.
There are few high-performing ones around shortitude 0 and +-90.
There are few high-performing ones around deltitude 0 and +-90.
But this might be caused by the small number of bases in these areas.

I then fitted a simple linear trigonometric model to the records that had the other properties that were identified with high performance. This gave the following model:

Predicted Performance =
  90.40498684665667
  + 1.0177516945528096*sin(deltitude) + 11.356095534597717*cos(deltitude)
  - 0.17160136718146096*sin(shortitude) + 14.734915658705445*cos(shortitude)
  - 2.41427525418688*sin(latitude) - 62.034325081735766*cos(latitude)
  + 5.158741059290979*sin(longitude) + 8.287780468342245*cos(longitude)

Standard deviation of the error is 13.19479724158273. The number of records was 180.

I found that the standard deviation of the error was reduced when degrees were converted to radians using the local value of pi, so that is what was used.
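
(A sketch of how a fit like this could be reproduced, with sin/cos of each coordinate as regression features and degrees converted to radians via the local value of pi; the file name and column names are assumptions about the dataset.)

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# "zppg_sites.csv" and the column names are placeholder assumptions.
df = pd.read_csv("zppg_sites.csv")

def to_radians(degrees, local_pi):
    # Convert using the site's *local* value of pi rather than the conventional one.
    return degrees * local_pi / 180.0

features = {}
for col in ["Deltitude", "Shortitude", "Latitude", "Longitude"]:
    rad = to_radians(df[col], df["Local Value of Pi"])
    features[f"sin_{col}"] = np.sin(rad)
    features[f"cos_{col}"] = np.cos(rad)

X = pd.DataFrame(features)
y = df["Performance"]

model = LinearRegression().fit(X, y)
print(model.intercept_, dict(zip(X.columns, model.coef_)))
```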

This predicts that the following 12 possible locations have the best performance:

38730 VALUE:110.47214318466737
103420 VALUE:110.67976641439135
91328 VALUE:111.69109272372066
26403 VALUE:112.35848311090837
7724 VALUE:112.40205758474453
21409 VALUE:113.21091851443907
89352 VALUE:113.88444725998731
3090 VALUE:113.89351821821175
65317 VALUE:121.11038526877846
57364 VALUE:123.12690147352956
26627 VALUE:132.5469987591373
91627 VALUE:134.17450784571542


In practice I think it is unlikely that all of them will be greater than 100, but it looks like it will probably be good enough to please the Empress.

Looking at this further, by far the strongest effect is the latitude, and that looks more like a rectangular effect than a trigonometric one. Replacing the trigonometric fit with one that modelled a rectangular latitude effect and nothing else yielded a model that explained most of the variation. By itself this looks better than the previous model.

The next biggest effect looks like it is due to variation in Murphy's Constant. This looked vaguely quadratic.

The next biggest effect looked like it was due to variations in the value of pi. It looked vaguely triangular, with the point just below 3.15.

The next biggest effect looked like a vaguely sinusoidal variation due to the longitude.

Including all of these in a model yielded one with a standard deviation of 4.9, and predicted that the following 12 locations were the best:

76804 VALUE:87.95301643603202
16965 VALUE:88.18566645580597
104815 VALUE:88.34280034001172
8415 VALUE:88.39346893009704
18123 VALUE:88.50303192064138
107929 VALUE:88.5221749787355
99595 VALUE:88.59004262250107
80395 VALUE:88.59313676878352
42742 VALUE:88.72736581213306
40639 VALUE:88.80584599223495
65607 VALUE:90.36919375244607
94304 VALUE:90.63981001558145

This is currently my best estimate. As the predicted values are all < 100, I will have to file a report on this with the Empire's colonisation department in case there is ever any interest in making another attempt, but I won't risk the Empress's wrath by attempting to colonise any of them.

My inferences, in descending order of confidence:

(source: it was revealed to me by a neural net)

84559, 79685, 87081, 99819, 37309, 44746, 88815, 58152, 55500, 50377, 69067, 53130.