One sort-of counterexample would be The Unreasonable Effectiveness of Mathematics in the Natural Sciences, where a lot of Math has been surprisingly accurate even when the assumptions were violated.
The Mathematical Theory of Communication by Shannon and Weaver. It's an extended version of Shannon's original paper that established Information Theory, with some extra explanations and background. 144 pages.
Atiyah & Macdonald's Introduction to Commutative Algebra fits. It's 125 pages long, and it's possible to do all the exercises in 2-3 weeks – I did them over winter break in preparation for a course.
Lang's Algebra and Eisenbud's Commutative Algebra are both supersets of Atiyah & Macdonald; I've studied each of those as well and thought A&M was significantly better.
Unfortunately, I think it isn't very compatible with the way management works at most companies. Normally there's pressure to get your tickets done quickly, which leaves less time for "refactor as you go".
I've heard this a lot, but I've worked at 8 companies so far, and none of them have had this kind of time pressure. Is there a specific industry or location where this is more common?
A big piece is that companies are extremely siloed by default. It's pretty easy for a team to improve things in their silo, it's significantly harder to improve something that requires two teams, it's nearly impossible to reach beyond that.
Uber is particularly siloed; they have a huge number of microservices with small teams, at least according to their engineering talks on YouTube. Address validation is probably a separate service from anything related to maps, which in turn is separate from contacts.
Because of silos, companies have to make an extra...
Cooking:
Data Science:
Programming:
I've used Optim.jl for similar problems with good results; here's an example: https://julianlsolvers.github.io/Optim.jl/stable/#user/minimization/
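For what it's worth, here's roughly what that usage looks like (the objective function and starting point below are made up for illustration):

using Optim

# A toy objective: the Rosenbrock function, a standard optimization test problem.
rosenbrock(x) = (1.0 - x[1])^2 + 100.0 * (x[2] - x[1]^2)^2

# Start from the origin and minimize with a derivative-free method.
result = optimize(rosenbrock, zeros(2), NelderMead())

Optim.minimizer(result)  # should be close to [1.0, 1.0]
Optim.minimum(result)    # should be close to 0.0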
The general lesson is that "magic" interfaces which try to 'do what I mean' are nice to work with at the top-level, but it's a lot easier to reason about composing primitives if they're all super-strict.
100% agree. In general I usually aim to have a thin boundary layer that does validation and converts everything to nice types/data structures, and then a much stricter core of inner functionality. Part of the reason I chose to write about this example is because it's very different from what I normally do.
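As a rough sketch of that shape (the names and types here are invented for illustration, not taken from the code being discussed):

# Strict core: assumes inputs are already clean, fails loudly otherwise.
function mean_ratio(numerators::Vector{Float64}, denominators::Vector{Float64})
    length(numerators) == length(denominators) || throw(ArgumentError("length mismatch"))
    return sum(numerators ./ denominators) / length(numerators)
end

# Thin boundary layer: accepts messy input (e.g. strings from a CSV),
# validates and converts, then hands clean data to the strict core.
function mean_ratio_from_strings(raw_nums, raw_denoms)
    nums = parse.(Float64, raw_nums)
    denoms = parse.(Float64, raw_denoms)
    any(iszero, denoms) && throw(ArgumentError("zero denominator"))
    return mean_ratio(nums, denoms)
end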
...Important caveat for the pass-through approach
This is a perfect example of the AWS Batch API 'leaking' into your code. The whole point of a compute resource pool is that you don't have to think about how many jobs you create.
This is true. We're using AWS Batch because it's the best tool we could find for other jobs that actually do need hundreds/thousands of spot instances, and this particular job goes in the middle of those. If most of our jobs looked like this one, using Batch wouldn't make sense.
...You get language-level validation either way. The assert statements are superfluous in that sense.
The reason to be explicit is to be able to handle control flow.
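To illustrate the distinction with a generic sketch (none of these names come from the actual code): an explicit check can raise a specific error that callers branch on, whereas an assertion only crashes.

struct MissingDatasetError <: Exception
    name::String
end

# Explicit check: raises a specific, typed error instead of a bare assert.
function fetch_dataset(name, datasets)
    haskey(datasets, name) || throw(MissingDatasetError(name))
    return datasets[name]
end

# Callers can now branch on that error instead of just crashing:
function run_job(name, datasets)
    try
        return sum(fetch_dataset(name, datasets))   # stand-in for the real work
    catch e
        e isa MissingDatasetError || rethrow()
        @warn "Dataset missing, skipping job" name
        return nothing
    end
end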
The datasets aren't dependent on each other, though some of them use the same input parameters.
If your jobs are independent, then they should be scheduled as such. This allows jobs to run in parallel.
Sure, there's some benefit to breaking down jobs even further. There's also overhead to spinning up workers. Each of these functions takes ~30s to run, so it ends up being more efficient to put them in one job instead of multiple.
...Your errors would come out just as fast if you ran check_dataset_para
This is sometimes true in functional programming, but only if you're careful.
I think this overstates the difficulty; referential transparency is the norm in functional programming, not something unusual.
For example, suppose the expression is a function call, and you change the function's definition and restart your program. When that happens, you need to delete the out-of-date entries from the cache or your program will read an out-of-date answer.
As I understand it, this system is mostly useful if you're using it for almost every function. In that case...
If you're using non-modal editing, in that example you could press Alt+rightarrow three times, use cmd+f, press the End key (and go back one word), or press cmd+rightarrow (and go back one word). That's not even counting shortcuts specific to another IDE or editor. Why, in your mental model, does the non-modal version feel like fewer choices? I suspect it's just familiarity – you've settled on some options you use the most, rather than trying to calculate the optimum fewest keystrokes each time.
Have you ever seen an experienced vim user? 3-5 seconds latency is completel...
So, one of the arguments you've made at several points is that we should expect Vim to be slower because it has more choices. This seems incorrect to me; even a simple editor like Sublime Text has about a thousand keyboard shortcuts, which are mostly ad-hoc and need to be memorized separately. In contrast, Vim has a small, (mostly) composable language. I just counted lsusr's post, and it has fewer than 30 distinct components – most of the text is showing different ways to combine them.
The other thing to consider is that most programmers will use at lea...
I did :Tutor on neovim and only did commands that actually involved editing text; it took 5:46.
Now trying in Sublime Text. Edit: 8:38 in Sublime, without vim mode – a big difference! It felt like it was mostly uniform, but one area where I was significantly slower was search and replace, because I couldn't figure out how to go backwards easily.
I would expect using VIM to increase latency. While you are going to press fewer keys, you are likely going to take slightly longer to press the keys, as using any key is more complex.
This really isn't my experience. Once you've practiced something enough that it becomes a habit, the latency is significantly lower. Anecdotally, I've pretty consistently seen people who're used to vim accomplish text editing tasks much faster than people who aren't, unless the latter is an expert in keyboard shortcuts of another editor such as emacs.
...There's the paradox o
As far as I know there's almost no measurement of productivity of developer tools. Without data, I think there are two main categories in which editor features, including keyboard shortcuts, can make you more productive:
An example of the first would be automatically syncing your code to a remote development instance. An example of the second would be adding a comma to the end of several lines at once using a macro. IDEs tend to focus on 1; text editors tend to focus on 2.
In general, I...
Very cool, thanks for writing this up. Hard-to-predict access in loops is an interesting case, and it makes sense that AoS would beat SoA there.
Yeah, SIMD is a significant point I forgot to mention.
It's a fair amount of work to switch between SoA and AoS in most cases, which makes benchmarking hard! StructArrays.jl makes this pretty doable in Julia, and Jonathan Blow talks about making it simple to switch between SoA and AoS in his programming language Jai. I would definitely like to see more languages making it easy to just try one and benchmark the results.
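For reference, a small sketch of what that looks like with StructArrays.jl (the field names here are made up):

using StructArrays

# Looks like an array of structs (each element has .height and .weight)...
people = StructArray(height = rand(10_000), weight = rand(10_000))

people[1].height              # element access, AoS-style
sum(p.height for p in people)

# ...but is stored as a struct of arrays, so each column is contiguous:
people.height                 # a plain Vector{Float64}
sum(people.height)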
"Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%." – Donald Knuth
Yup, these are all reasons to prefer column orientation over row orientation for analytics workloads. In my opinion data locality trumps everything but compression, and fast transmission is definitely very nice.
Until recently, numpy and pandas were row oriented, and this was a major bottleneck. A lot of pandas's strange API is apparently due to working around row orientation. See e.g. this article by Wes McKinney, creator of pandas: https://wesmckinney.com/blog/apache-arrow-pandas-internals/#:~:text=Arrow's%20C%2B%2B%20implementation%20provides%20essential,...
I see where that intuition comes from, and at first I thought that would be the case. But the machine is very good at iterating through pairs of arrays. Continuing the previous example:
# Array-of-structs version: iterate over rows, reading both fields from each row.
function my_double_sum(data)
    sum_heights_and_weights = 0
    for row in data
        sum_heights_and_weights += row.weight + row.height
    end
    return sum_heights_and_weights
end

@btime(my_double_sum(heights_and_weights))
> 50.342 ms (1 allocation: 16 bytes)
# Struct-of-arrays version: iterate over the two columns in lockstep.
function my_double_sum2(heights, weights)
    sum_heights_and_weights = 0
    for (height, weight) in zip(heights, weights)
        sum_heights_and_weights += height + weight
    end
    return sum_heights_and_weights
end
... There's also a transcript: https://www.cs.virginia.edu/~robins/YouAndYourResearch.html
I've re-read it at least 5 times, highly recommend it.
Of course it's easy! You just compare how much you've made, and how long you've stayed solvent, against the top 1% of traders. If you've already done just as well as the others, you're in the top 1%. Otherwise, you aren't.
This object-level example is actually harder than it appears: performance of a fund or trader in one time period generally has very low correlation to the next; e.g. see this paper: https://www.researchgate.net/profile/David-Smith-256/publication/317605916_Evaluating_Hedge_Fund_Performance/links/5942df6faca2722db499cbce/Evaluating-Hedge-Fu...
An incomplete list of caveats to Sharpe off the top of my head:
This is very, very cool. Having come from the functional programming world, I frequently miss these features when doing machine learning in Python, and haven't been able to easily replicate them. I think there's a lot of easy optimization that could happen in day-to-day exploratory machine learning code that bog standard pandas/scikit-learn doesn't do.
I don't really understand what you mean by "from first principles" here. Do you mean in a way that's intuitive to you? Or in a way that includes all the proofs?
Any field of Math is typically more general than any one intuition allows, so it's a little dangerous to think in terms of what it's "really" doing. I find the way most people learn best is by starting with a small number of concrete intuitions – e.g., groups of symmetries for group theory, or posets for category theory – and gradually expanding.
In the case of Complex Analysis, I find the intuition of the Riemann Sphere to be particularly useful, though I don't have a good book recommendation.
One major confounder is that caffeine is also a painkiller, many people have mild chronic pain, and I think there's a very plausible mechanism by which painkillers improve productivity, i.e. just allowing someone to focus better.
Anecdotally, I've noticed that "resetting" caffeine tolerance is very quick compared to most drugs, taking something like 2-3 days without caffeine for several people I know, including myself.
The studies I could find on caffeine are highly contradictory, e.g. from Wikipedia, "Caffeine has been shown to have...
One key dimension is decomposition – I would say any gears model provides decomposition, but models can have it without gears.
For example, the error in any machine learning model can be broken down into bias + variance, which provides a useful model for debugging. But these don't feel like gears in any meaningful sense, whereas, say, bootstrapping + weak learners feel like gears in understanding Random Forests.
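For concreteness, the standard decomposition for squared-error loss (writing \hat{f} for the learned model, f for the true function, and \sigma^2 for irreducible noise) is:

\mathbb{E}\big[(y - \hat{f}(x))^2\big] = \big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2 + \mathrm{Var}\big[\hat{f}(x)\big] + \sigma^2

i.e. bias squared plus variance plus noise.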
I think it is true that gears-level models are systematically undervalued, and that part of the reason is the longer payoff curve.
A simple example is debugging code: a gears-level approach is to try to understand what the code is doing and why it doesn't do what you want; a black-box approach is to try changing things somewhat randomly. Most programmers I know will agree that the gears-level approach is almost always better, but that they at least sometimes end up doing the black-box approach when tired/frustrated/stuck.
And in companies t...
Black-box approaches often fail to generalize within the domain, but generalize well across domains. Neural Nets may teach you less about medicine than a PGM, but they'll also get you good results in image recognition, transcription, etc.
This can lead to interesting principal-agent problems: an employee benefits more from learning something generalizable across businesses and industries, while employers will generally prefer the best domain-specific solution.
Nit: giving IQ tests is not super cheap, because it puts companies at a nebulous risk of being sued for disparate impact (see e.g. https://en.wikipedia.org/wiki/Griggs_v._Duke_Power_Co.).
I agree with all the major conclusions though.
My favorite book, by far, is Functional Programming in Scala. This book has you derive most of the concepts from scratch, to the point where even complex abstractions feel like obvious consequences of things you've already built.
If you want something more Haskell-focused, a good choice is Programming in Haskell.
I didn't downvote, but I agree that this is a suboptimal meme – though the prevailing mindset of "almost nobody can learn Calculus" is much worse.
As a datapoint, it took me about two weeks of obsessive, 15 hour/day study to learn Calculus to a point where I tested out of the first two courses when I was 16. And I think it's fair to say I was unusually talented and unusually motivated. I would not expect the vast majority of people to be able to grok Calculus within a week, though obviously people on this site are not a representative sample.
A good exposition of the related theorems is in Chapter 6 of Understanding Machine Learning (https://www.amazon.com/Understanding-Machine-Learning-Theory-Algorithms/dp/1107057132/ref=sr_1_1?crid=2MXVW7VOQH6FT&keywords=understanding+machine+learning+from+theory+to+algorithms&qid=1562085244&s=gateway&sprefix=understanding+machine+%2Caps%2C196&sr=8-1)
There are several related theorems. Roughly:
1. The error on real data will be similar to the error on the training set + epsilon, where epsilon is roughly proportional to (datapoints / VC dime...
Yes, roughly speaking, if you multiply the VC dimension by n, then you need n times as much training data to achieve the same performance. (More precise statement here: https://en.wikipedia.org/wiki/Vapnik%E2%80%93Chervonenkis_dimension#Uses) There are also a few other bounds you can get based on VC dimension. In practice these bounds are way too large to be useful, but an algorithm with much higher VC dimension will generally overfit more.
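For reference, one common statement of the bound (exact constants vary between sources): with probability at least 1 - \delta, every hypothesis h in a class of VC dimension D, trained on N samples, satisfies

R(h) \le \hat{R}(h) + \sqrt{\frac{D\left(\ln\frac{2N}{D} + 1\right) + \ln\frac{4}{\delta}}{N}}

where R is the true error and \hat{R} the training error. The deviation term scales roughly like \sqrt{D/N}, which is why multiplying D by n requires roughly n times as much data to keep the same guarantee.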
A different view is to look at the search process for the models, rather than the model itself. If model A is found from a process that evaluates 10 models, and model B is found from a process that evaluates 10,000, and they otherwise have similar results, then A is much more likely to generalize to new data points than B.
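One way to make that concrete (a standard multiple-comparisons argument, not something from the original comment): if each candidate model independently has probability p of looking good on your validation data purely by chance, then the chance that a search over k candidates turns up at least one such model is

1 - (1 - p)^k \approx kp \quad \text{for small } p,

so the process that evaluated 10,000 models has roughly 1,000 times more opportunity to fit noise than the one that evaluated 10.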
The formalization of this concept is called VC dimension and is a big part of Machine Learning Theory (although arguably it hasn't been very helpful in practice): https://en.wikipedia.org/wiki/Vapnik%E2%80%93Chervonenkis_dimension
It's really useful to ask the simple question "what tests could have caught the most costly bugs we've had?"
At one job, our code had a lot of math, and the worst bugs were when our data pipelines ran without crashing but gave the wrong numbers, sometimes due to weird stuff like "a bug in our vendor's code caused them to send us numbers denominated in pounds instead of dollars". This is pretty hard to catch with unit tests, but we ended up applying a layer of statistical checks that ran every hour or so and raised an alert if something was anomalous, and those alerts probably saved us more money than all other tests combined.
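A minimal sketch of the kind of check I mean (the thresholds, field names, and history below are all made up; the real checks were more involved):

using Statistics

# Flag the latest figure if it deviates too far from recent history.
function anomalous(latest::Float64, history::Vector{Float64}; max_sigma = 5.0)
    mu, sigma = mean(history), std(history)
    return abs(latest - mu) > max_sigma * sigma
end

# Hypothetical hourly check on the total dollar amount a pipeline produced.
recent_totals = [1.02e6, 0.98e6, 1.01e6, 0.99e6, 1.03e6]  # made-up history
latest_total  = 1.27e6                                     # made-up value that should trip the check
if anomalous(latest_total, recent_totals)
    @warn "Pipeline output looks anomalous" latest_total
end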