All of SatvikBeri's Comments + Replies

It's really useful to ask the simple question "what tests could have caught the most costly bugs we've had?"

At one job, our code had a lot of math, and the worst bugs were when our data pipelines ran without crashing but gave the wrong numbers, sometimes due to weird stuff like "a bug in our vendor's code caused them to send us numbers denominated in pounds instead of dollars". This is pretty hard to catch with unit tests, but we ended up applying a layer of statistical checks that ran every hour or so and raised an alert if something was anomalous, and those alerts probably saved us more money than all other tests combined.
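A minimal sketch of the kind of check I mean (hypothetical names and thresholds; the actual checks were richer than this, but the core idea is just "does this run's number look like recent runs"):

import statistics

def metric_is_anomalous(history, latest, max_sigma=4.0):
    # history: values of the same metric from previous pipeline runs
    # latest: the value produced by the current run
    mean = statistics.mean(history)
    std = statistics.stdev(history)
    if std == 0:
        return latest != mean  # constant history: any change at all is suspicious
    return abs(latest - mean) / std > max_sigma

# A vendor silently switching from dollars to pounds shows up as a sudden level shift
if metric_is_anomalous([105.2, 98.7, 101.3, 99.8, 102.5], 81.0):
    print("ALERT: revenue metric looks anomalous, check upstream data")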

There was a serious bug in this post that invalidated the results, so I took it down for a while. The bug has now been fixed and the posted results should be correct.

One sort-of counterexample would be The Unreasonable Effectiveness of Mathematics in the Natural Sciences, where a lot of Math has been surprisingly accurate even when the assumptions were violated.

The Mathematical Theory of Communication by Shannon and Weaver. It's an extended version of Shannon's original paper that established Information Theory, with some extra explanations and background. 144 pages.

Atiyah & Macdonald's Introduction to Commutative Algebra fits. It's 125 pages long, and it's possible to do all the exercises in 2-3 weeks – I did them over winter break in preparation for a course.

Lang's Algebra and Eisenbud's Commutative Algebra are both supersets of Atiyah & Macdonald. I've studied each of those as well and thought A&M was significantly better.

Unfortunately, I think it isn't very compatible with the way management works at most companies. Normally there's pressure to get your tickets done quickly, which leaves less time for "refactor as you go".

I've heard this a lot, but I've worked at 8 companies so far, and none of them have had this kind of time pressure. Is there a specific industry or location where this is more common?

4Adam Zerner
Interesting. My impression is that it's pretty widespread across industries and locations. It's been the case for me in all four companies I've worked at: two were startups, two were mid-sized, and each was in a different state.

A big piece is that companies are extremely siloed by default. It's pretty easy for a team to improve things in their silo, significantly harder to improve something that requires two teams, and nearly impossible to reach beyond that.

Uber is particularly siloed: they have a huge number of microservices with small teams, at least according to their engineering talks on YouTube. Address validation is probably a separate service from anything related to maps, which in turn is separate from contacts.

Because of silos, companies have to make an extra... (read more)

2ChristianKl
It might very well be that there's no team responsible for IBAN validation at Amazon, as the team that's responsible for payment information might be an Amazon.com team that doesn't care about IBANs, which are only important for Amazon in Europe.

Cooking: 

  • Smelling ingredients & food is a good way to develop intuition about how things will taste when combined
  • Salt early is generally much better than salt late

Data Science:

  • Interactive environments like Jupyter notebooks are a huge productivity win, even with their disadvantages
  • Automatic code reloading makes Jupyter much more productive (e.g. autoreload for Python, or Revise for Julia)
  • Bootstrapping gives you fast, accurate statistics in a lot of areas without needing to be too precise about theory
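A minimal sketch of the bootstrapping point in Python (illustrative only, using the median as the example statistic): resample the data with replacement many times and read a confidence interval off the spread of the recomputed statistic.

import random
import statistics

def bootstrap_ci(data, stat=statistics.median, n_resamples=10_000, alpha=0.05):
    # Resample with replacement and recompute the statistic each time
    estimates = sorted(
        stat(random.choices(data, k=len(data))) for _ in range(n_resamples)
    )
    lower = estimates[int(alpha / 2 * n_resamples)]
    upper = estimates[int((1 - alpha / 2) * n_resamples)]
    return lower, upper

data = [2.1, 3.5, 2.9, 4.2, 3.1, 2.8, 5.0, 3.3]
print(bootstrap_ci(data))  # ~95% interval for the median, no distributional assumptions needed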

Programming:

  • Do everything in a virtual environment
... (read more)
1MarcelloV
These are nice; for the friends recommendation one, just be cautious of offering unsolicited advice and other-optimizing.
  • using vector syntax is much faster than loops in Python

To generalize this slightly, using Python to call C/C++ is generally much faster than pure Python. For example, built-in operations in Pandas tend to be pretty fast, while using .apply() is usually pretty slow.
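A small illustration of the kind of gap I mean (timings vary by machine and data size; this is just a sketch):

import numpy as np
import pandas as pd

df = pd.DataFrame({"height": np.random.rand(100_000) + 1.5,
                   "weight": np.random.rand(100_000) * 50 + 50})

# Vectorized: dispatches to compiled C loops inside numpy/pandas
bmi_fast = df["weight"] / df["height"] ** 2

# Row-wise .apply(): calls a Python function once per row, typically orders of magnitude slower
bmi_slow = df.apply(lambda row: row["weight"] / row["height"] ** 2, axis=1)

The second version pays Python-level function-call overhead on every row, which is exactly the pure-Python-vs-C gap described above.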

1Rudi C
Just use Julia ;)

I didn't know about that, thanks!

I found Loop Hero much better at higher speed, which you can set by modifying a variables.ini file: https://www.pcinvasion.com/loop-hero-speed-mod/

2ChristianKl
The example looks to me like it gives the library one equation that has to be minimized. I, on the other hand, have a bunch of equations.

The general lesson is that "magic" interfaces which try to 'do what I mean' are nice to work with at the top-level, but it's a lot easier to reason about composing primitives if they're all super-strict.

100% agree. In general I usually aim to have a thin boundary layer that does validation and converts everything to nice types/data structures, and then a much stricter core of inner functionality. Part of the reason I chose to write about this example is that it's very different from what I normally do.
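A sketch of the shape I mean (hypothetical example, not the code from the post): the boundary parses and validates raw input into typed structures once, and the core assumes it only ever sees valid data.

from dataclasses import dataclass

@dataclass(frozen=True)
class Order:
    quantity: int
    unit_price_cents: int

def parse_order(raw: dict) -> Order:
    # Boundary layer: validate and convert once, at the edge
    quantity = int(raw["quantity"])
    unit_price_cents = int(raw["unit_price_cents"])
    if quantity <= 0 or unit_price_cents < 0:
        raise ValueError(f"invalid order: {raw!r}")
    return Order(quantity, unit_price_cents)

def order_total_cents(order: Order) -> int:
    # Strict core: no validation or coercion, just the actual logic
    return order.quantity * order.unit_price_cents

print(order_total_cents(parse_order({"quantity": "3", "unit_price_cents": 250})))  # 750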

Important caveat for the pass-through approach

... (read more)

This is a perfect example of the AWS Batch API 'leaking' into your code. The whole point of a compute resource pool is that you don't have to think about how many jobs you create.
 

This is true. We're using AWS Batch because it's the best tool we could find for other jobs that actually do need hundreds/thousands of spot instances, and this particular job goes in the middle of those. If most of our jobs looked like this one, using Batch wouldn't make sense.

You get language-level validation either way. The assert statements are superfluous in that sense.

... (read more)

The reason to be explicit is to be able to handle control flow.

The datasets aren't dependent on each other, though some of them use the same input parameters.

If your jobs are independent, then they should be scheduled as such. This allows jobs to run in parallel.

Sure, there's some benefit to breaking down jobs even further. There's also overhead to spinning up workers. Each of these functions takes ~30s to run, so it ends up being more efficient to put them in one job instead of multiple.

Your errors would come out just as fast if you ran check_dataset_params

... (read more)
4philh
This has its own problems, but you could use inspect.signature, I think?
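A rough sketch of that approach (hypothetical function and parameter names), catching mismatched parameters up front before any job is launched:

import inspect

def build_report(start_date, end_date, region="us"):
    ...

def check_params(func, params):
    # Raises TypeError immediately if params don't match the function's signature
    inspect.signature(func).bind(**params)

check_params(build_report, {"start_date": "2021-01-01", "end_date": "2021-02-01"})  # fine

try:
    check_params(build_report, {"start_date": "2021-01-01"})
except TypeError as e:
    print(e)  # missing a required argument: 'end_date'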
3Zolmeister
This is a perfect example of the AWS Batch API 'leaking' into your code. The whole point of a compute resource pool is that you don't have to think about how many jobs you create. It sounds like you're using the wrong tool for the job (or a misconfiguration - e.g. limit the batch template to 1 vcpu).

You get language-level validation either way. The assert statements are superfluous in that sense. What they do add is in effect check_dataset_params(), whose logic probably doesn't belong in this file.

No, I meant a developer introducing a runtime bug.

"refine definite theories"

Where does this quote come from – is it in the book?

6[anonymous]

Is there a reason you recommend Hy instead of Clojure? I would suggest Clojure to most people interested in Lisp these days, due to the overwhelmingly larger community, ecosystem, & existence of ClojureScript.

6lsusr
I recommend Hy because it's what I personally use and I can therefore vouch for it. I have heard nothing but good things about Clojure. I even attend a Clojure user group. The Clojure programmers I meet tend to be smart which is a good sign.

Ah, that's a great example, thanks for spelling it out.

This is sometimes true in functional programming, but only if you're careful.

I think this overstates the difficulty; referential transparency is the norm in functional programming, not something unusual.

For example, suppose the expression is a function call, and you change the function's definition and restart your program. When that happens, you need to delete the out-of-date entries from the cache or your program will read an out-of-date answer.

As I understand, this system is mostly useful if you're using it for almost every function. In that case... (read more)

6justinpombrio
It really depends on what domain you're working in. If you're memoizing functions, you're not allowed to use the following things (or rather, you can only use them in functions that are not transitively called by memoized functions):

  • Global mutable state (to no-one's surprise)
  • A database, which is global mutable state
  • IO, including reading user input, fetching something non-static from the web, or logging
  • Networking with another service that has state
  • Getting the current date

Ask a programmer to obey this list of restrictions, and -- depending on the domain they're working in -- they'll either say "ok" or "wait what that's most of what my code does".

That's very clever! I don't think it's sufficient, though. For example, say you have this code:

(defnp add1 [x] (+ x 10)) ; oops typo
(defnp add2 [x] (add1 (add1 x)))
(add2 100)

You run it once and get this cache:

(add1 100) = 110
(add1 (add1 100)) = 120
(add2 100) = 120

You fix the first function:

(defnp add1 [x] (+ x 1)) ; fixed
(defnp add2 [x] (add1 (add1 x)))
(add2 100)

You run it again, which invokes (add2 100), which is found in the cache to be 120. The add2 cache entry is not invalidated because the add2 function has not changed, nor have its inputs. The add1 cache entries would be invalidated if anything ever invoked add1, but nothing does. (This is what I meant by "You also have to look at the functions it calls (and the functions those call, etc.)" in my other comment.)

This is very cool. The focus on caching a code block instead of just the inputs to the function makes it significantly more stable, since your cache will be automatically invalidated if you change the code in any way.
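A minimal sketch of that invalidation idea (illustrative Python, not the library under discussion): key the cache on a hash of the function's source as well as its arguments, so editing the function orphans its old entries.

import hashlib
import inspect

_cache = {}  # in practice this would be persisted to disk; a plain dict keeps the sketch short

def code_aware_memo(func):
    # Include a hash of the function's source in the cache key,
    # so changing the code automatically invalidates old entries
    src_hash = hashlib.sha256(inspect.getsource(func).encode()).hexdigest()

    def wrapper(*args):
        key = (func.__name__, src_hash, args)
        if key not in _cache:
            _cache[key] = func(*args)
        return _cache[key]
    return wrapper

@code_aware_memo
def slow_square(x):
    return x * x

print(slow_square(4))  # computed
print(slow_square(4))  # served from the cache; edit slow_square and the key changes

As justinpombrio points out below, hashing only the function's own source still misses changes in the functions it calls, so this alone isn't sufficient.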

3justinpombrio
More stable, but not significantly so. You cannot tell what an expression does just by looking at the expression. You also have to look at the functions it calls (and the functions those call, etc.). If any of those change, then the expression may change as well. You also need to look at local variables, as skybrain points out. For example, this function:

(defn myfunc [x]
  (value-of (place-of [EXPR INVOLVING x])))

will behave badly: the first time you call it it will compute the answer for the value of x you give it. The second time you call it, it will compute the same answer, regardless of what x you give it.

If you're using non-modal editing, in that example you could press Alt+rightarrow three times, use cmd+f, the end key (and back one word), or cmd+rightarrow (and back one word). That's not even counting shortcuts specific to another IDE or editor. Why, in your mental model, does the non-modal version feel like fewer choices? I suspect it's just familiarity – you've settled on some options you use the most, rather than trying to calculate the optimum fewest keystrokes each time.

Have you ever seen an experienced vim user? 3-5 seconds latency is completel... (read more)

I ended up using cmd+shift+i which opens the find/replace panel with the default set to backwards.

So, one of the arguments you've made at several points is that we should expect Vim to be slower because it has more choices. This seems incorrect to me: even a simple editor like Sublime Text has about a thousand keyboard shortcuts, which are mostly ad-hoc and need to be memorized separately. In contrast, Vim has a small, (mostly) composable language. I just counted lsusr's post, and it has fewer than 30 distinct components – most of the text is showing different ways to combine them.

The other thing to consider is that most programmers will use at lea... (read more)

2ChristianKl
Let's think about an example. I want to move my cursor. I might be in a situation where 3W; lllllllllllllllllllllllllllllllll; / with something else; and $b are all valid moves to get to my target location for the cursor. This has probably something like 3-5 seconds latency because I not only have to think about where my cursor should go but also about the way to get there. On the other hand, without VIM, having a proper keyboard that makes arrow keys easy to reach, I might have a latency of maybe 700 milliseconds. VIM frequently costs mental processing capacity because I have to model my code in my head in concepts like words (for w and b) that I wouldn't otherwise.
3ChristianKl
The issue is not just more choices but more choices to achieve the same result. In programming languages, Python achieved a large user-base through being easy to use, with core principles like "there should be one obvious way to do things". The problem is that it's not dependable when you can use the Vim shortcuts within other editors. If I use IdeaVim in IntelliJ I can use "*y to copy a lot of things to the clipboard, but not, for example, the text in hover popups, for which I actually need Ctrl+c and where I lose the ability to copy the text when I let Vim overwrite the existing shortcut.

I did :Tutor on neovim and only did commands that actually involved editing text, it took 5:46.

Now trying in Sublime Text. Edit: 8:38 in Sublime, without vim mode – a big difference! It felt like it was mostly uniform, but one area where I was significantly slower was search and replace, because I couldn't figure out how to go backwards easily.

2John_Maxwell
Interesting, thanks for sharing. Command-shift-g right?

This is a great experiment, I'll try it out too. I also have pretty decent habits for non-vim editing so it'll be interesting to see.

6SatvikBeri
I did :Tutor on neovim and only did commands that actually involved editing text, it took 5:46. Now trying in Sublime Text. Edit: 8:38 in Sublime, without vim mode – a big difference! It felt like it was mostly uniform, but one area where I was significantly slower was search and replace, because I couldn't figure out how to go backwards easily.

Some IDEs are just very accommodating about this, e.g. PyCharm. So that's great.

Some of them aren't, like VS Code. For those, I just manually reconfigure the clashing key bindings. It's annoying, but it only takes ~15 minutes total.

5paragonal
Thanks for your answer. Part of the problem might have been that I wasn't that proficient with vim. When I reconfigured the clashing key bindings of the IDE I sometimes unknowingly overwrote a vim command which turned out to be useful later on. So I had to reconfigure numerous times which annoyed me so much that I abandoned the approach at the time.

I would expect using VIM to increase latency. While you are going to press fewer keys you are likely going to take slightly longer to press the keys as using any key is more complex.

This really isn't my experience. Once you've practiced something enough that it becomes a habit, the latency is significantly lower. Anecdotally, I've pretty consistently seen people who're used to vim accomplish text editing tasks much faster than people who aren't, unless the latter are experts in the keyboard shortcuts of another editor such as emacs.

There's the paradox of choice

... (read more)
2ChristianKl
How much experience do you have with measuring the latency of things, to know what takes 400ms and what takes 700ms? Even if the total time for the task is reduced, the latency for starting the task might still be higher.

As far as I know there's almost no measurement of productivity of developer tools. Without data, I think there are two main categories in which editor features, including keyboard shortcuts, can make you more productive:

  1. By making difficult tasks medium to easy
  2. By making ~10s tasks take ~1s

An example of the first would be automatically syncing your code to a remote development instance. An example of the second would be adding a comma to the end of several lines at once using a macro. IDEs tend to focus on 1, text editors tend to focus on 2.

In general, I... (read more)

2ChristianKl
I would expect using VIM to increase latency. While you are going to press fewer keys, you are likely going to take slightly longer to press the keys as using any key is more complex.

There's the paradox of choice, and having more choices to accomplish a task costs mental resources. Vim forces me to spend cognitive resources to choose between different alternatives of how to accomplish a task.

All the professional UX people seem to advocate making interfaces as simple as possible.

Very cool, thanks for writing this up. Hard-to-predict access in loops is an interesting case, and it makes sense that AoS would beat SoA there.

Yeah, SIMD is a significant point I forgot to mention.

It's a fair amount of work to switch between SoA and AoS in most cases, which makes benchmarking hard! StructArrays.jl makes this pretty doable in Julia, and Jonathan Blow talks about making it simple to switch between SoA and AoS in his programming language Jai. I would definitely like to see more languages making it easy to just try one and benchmark the results.

"Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil.  Yet we should not pass up our opportunities in that critical 3%." – Donald Knuth

3Gunnar_Zarncke
That would be my preferred quote too.

Yup, these are all reasons to prefer column orientation over row orientation for analytics workloads. In my opinion data locality trumps everything, but compression and fast transmission are definitely very nice.

Until recently, numpy and pandas were row oriented, and this was a major bottleneck. A lot of pandas's strange API is apparently due to working around row orientation. See e.g. this article by Wes McKinney, creator of pandas: https://wesmckinney.com/blog/apache-arrow-pandas-internals/#:~:text=Arrow's%20C%2B%2B%20implementation%20provides%20essential,... (read more)

I see where that intuition comes from, and at first I thought that would be the case. But the machine is very good at iterating through pairs of arrays. Continuing the previous example:

function my_double_sum(data)
    sum_heights_and_weights = 0
    for row in data
        sum_heights_and_weights += row.weight + row.height
    end
    return sum_heights_and_weights
end
@btime(my_double_sum(heights_and_weights))
>   50.342 ms (1 allocation: 16 bytes)
function my_double_sum2(heights, weights)
    sum_heights_and_weights = 0
    for (height, weight) in zip
... (read more)
5gjm
Looks like it is marginally quicker in the first of those cases. Note that you're iterating over the objects linearly, which means that the processor's memory-access prediction hardware will have a nice easy job; and you're iterating over the whole thing without repeating any, which means that all the cache is buying you is efficient access to prefetched bits.

After defining

function mss(data::Vector{HeightAndWeight})
    s = 0
    for i=1:40000000
        j=((i*i)%40000000)+1
        @inbounds s += data[j].weight * data[j].height
    end
    return s
end

function mss2(heights::Vector{Int},weights::Vector{Int})
    s = 0
    for i=1:40000000
        j=((i*i)%40000000)+1
        @inbounds s += weights[j] * heights[j]
    end
    return s
end

(mss for "my scattered sum"; the explicit type declarations, literal size and @inbounds made it run faster on my machine, and getting rid of overhead seems like a good idea for such comparisons; the squaring is just a simple way to get something not too predictable) I got the following timings:

julia> @btime(mss(heights_and_weights))
  814.056 ms (0 allocations: 0 bytes)
400185517392

julia> @btime(mss2(just_heights,just_weights))
  1.253 s (0 allocations: 0 bytes)
400185517392

so the array-of-structs turns out to work quite a lot better in this case. (Because we're doing half the number of hard-to-predict memory accesses.)

Note how large those timings are, by the way. If I just process every row in order, I get 47.9ms for array-of-structs and 42.6ms for struct-of-arrays. 40.3ms if I use zip as you did instead of writing array accesses explicitly, which is interesting; I'm not that surprised the compiler can eliminate the superficial inefficiencies of the zip-loop, but I'm surprised it ends up strictly better rather than exactly the same.

Anyway, this is the other way around from your example: struct-of-arrays is faster for me in that situation. But when we process things in "random" order, it's 20-30x slower because we no

Of course it's easy! You just compare how much you've made, and how long you've stayed solvent, against the top 1% of traders. If you've already done just as well as the others, you'd be in the top 1%. Otherwise, you aren't.

This object-level example is actually harder than it appears: performance of a fund or trader in one time period generally has very low correlation to the next, e.g. see this paper: https://www.researchgate.net/profile/David-Smith-256/publication/317605916_Evaluating_Hedge_Fund_Performance/links/5942df6faca2722db499cbce/Evaluating-Hedge-Fu... (read more)

That's why I said "how long you've stayed solvent." Thanks for bringing in a more nuanced statement of the argument + source.

An incomplete list of caveats to Sharpe off the top of my head:

  • We can never measure the true Sharpe of a strategy (how it would theoretically perform on average over all time), only the observed Sharpe ratio, which can be radically different, especially for strategies with significant tail risk. There are a wide variety of strategies that might have a very high observed Sharpe over a few years, but much lower true Sharpe
  • Sharpe typically doesn't measure costs like infrastructure or salaries, just losses to the direct fund. So e.g. you could view workin
... (read more)
2Liron
Nice ones. The first is probably the one that most accounts for funds like Titan marketing themselves misleadingly (IMO), but the others are still important caveats of the definition and good to know.

This is very, very cool. Having come from the functional programming world, I frequently miss these features when doing machine learning in Python, and haven't been able to easily replicate them. I think there's a lot of easy optimization that could happen in day-to-day exploratory machine learning code that bog standard pandas/scikit-learn doesn't do.

3lsusr
This is encouraging to hear. When I talk about this stuff to ML engineers, some instantly get it, especially when they come from a functional programming background. Others don't and it feels like there's a wall between me and them. I think I can replicate a lot of this in Python, even if it's a little clunky. It's just easier to start in Hy and then write a wrapper to port it to Python.

If N95 masks work, R95-100 and P95-100 masks should also work, and potentially be more effective - the stuff they filter is a superset of what N95 filters. They're normally more expensive, but in the current state I've actually found P100s cheaper than N95s.

I don't really understand what you mean by "from first principles" here. Do you mean in a way that's intuitive to you? Or in a way that includes all the proofs?

Any field of Math is typically more general than any one intuition allows, so it's a little dangerous to think in terms of what it's "really" doing. I find the way most people learn best is by starting with a small number of concrete intuitions – e.g., groups of symmetries for group theory, or posets for category theory – and gradually expanding.

In the case of Complex Analysis, I find the intuition of the Riemann Sphere to be particularly useful, though I don't have a good book recommendation.

One major confounder is that caffeine is also a painkiller, many people have mild chronic pain, and I think there's a very plausible mechanism by which painkillers improve productivity, i.e. just allowing someone to focus better.

Anecdotally, I've noticed that "resetting" caffeine tolerance is very quick compared to most drugs, taking something like 2-3 days without caffeine for several people I know, including myself.

The studies I could find on caffeine are highly contradictory, e.g. from Wikipedia, "Caffeine has been shown to have... (read more)

One key dimension is decomposition – I would say any gears model provides decomposition, but models can have it without gears.

For example, the error in any machine learning model can be broken down into bias + variance, which provides a useful model for debugging. But these don't feel like gears in any meaningful sense, whereas, say, bootstrapping + weak learners feel like gears in understanding Random Forests.
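For reference, the decomposition I mean (for squared error; stated from memory, so check an ML textbook for the exact conditions):

\mathbb{E}\big[(y - \hat{f}(x))^2\big] = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2} + \underbrace{\mathrm{Var}\big[\hat{f}(x)\big]}_{\text{variance}} + \sigma^2

where \sigma^2 is the irreducible noise.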

I think it is true that gears-level models are systematically undervalued, and that part of the reason is because of the longer payoff curve.

A simple example is debugging code: a gears-level approach is to try and understand what the code is doing and why it doesn't do what you want, a black-box approach is to try changing things somewhat randomly. Most programmers I know will agree that the gears-level approach is almost always better, but that they at least sometimes end up doing the black-box approach when tired/frustrated/stuck.

And in companies t... (read more)

4Panashe Fundira
To drill in further, a great way to build a model of why a defect arises is using the scientific method. You generate some hypothesis about the behavior of your program (if X is true, then Y) and then test your hypothesis. If the results of your test invalidate the hypothesis, you've learned something about your code and where not to look. If your hypothesis is confirmed, you may be able to resolve your issue, or at least refine your hypothesis in the right direction.

Black-box approaches often fail to generalize within the domain, but generalize well across domains. Neural Nets may teach you less about medicine than a PGM, but they'll also get you good results in image recognition, transcription, etc.

This can lead to interesting principal-agent problems: an employee benefits more from learning something generalizable across businesses and industries, while employers will generally prefer the best domain-specific solution.

Nit: giving IQ tests is not super cheap, because it puts companies at a nebulous risk of being sued for disparate impact (see e.g. https://en.wikipedia.org/wiki/Griggs_v._Duke_Power_Co.).

I agree with all the major conclusions though.

6Vaniver
For a long time, this was my impression as well, but Caplan claims the evidence doesn't bear this out. And many organizations do use IQ testing successfully; the military is a prime example.

For the orthogonal decomposition, don't you need two scalars? E.g. x = ay + bz. For a fixed y and z, there are choices of x for which there's no way to write x as ay + z.

2Rafael Harth
Ow. Yes, you do. This wasn't a typo either, I remembered the result incorrectly. Thanks for pointing it out, and props for being attentive enough to catch it. Or to be more precise, you only need one scalar, but the scalar is for y not z, because z isn't given. The theorem says that, given x and y, there is a scalar a and a vector z such that x=ay+z and y is orthogonal to z.
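For reference, the construction behind that statement (a standard inner-product-space fact, sketched here):

a = \frac{\langle x, y \rangle}{\langle y, y \rangle}, \qquad z = x - a y, \qquad \langle y, z \rangle = \langle y, x \rangle - a \langle y, y \rangle = 0,

so x = ay + z with z orthogonal to y, using only the single scalar a.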

My favorite book, by far, is Functional Programming in Scala. This book has you derive most of the concepts from scratch, to the point where even complex abstractions feel like obvious consequences of things you've already built.

If you want something more Haskell-focused, a good choice is Programming in Haskell.

I didn't downvote, but I agree that this is a suboptimal meme – though the prevailing mindset of "almost nobody can learn Calculus" is much worse.

As a datapoint, it took me about two weeks of obsessive, 15 hour/day study to learn Calculus to a point where I tested out of the first two courses when I was 16. And I think it's fair to say I was unusually talented and unusually motivated. I would not expect the vast majority of people to be able to grok Calculus within a week, though obviously people on this site are not a representative sample.

Quite fair. I had read Zvi as speaking to typical LessWrong readership. Also, the standard you seem to be describing here is much higher than the standard Zvi was describing.

A good exposition of the related theorems is in Chapter 6 of Understanding Machine Learning (https://www.amazon.com/Understanding-Machine-Learning-Theory-Algorithms/dp/1107057132/ref=sr_1_1?crid=2MXVW7VOQH6FT&keywords=understanding+machine+learning+from+theory+to+algorithms&qid=1562085244&s=gateway&sprefix=understanding+machine+%2Caps%2C196&sr=8-1)

There are several related theorems. Roughly:

1. The error on real data will be similar to the error on the training set + epsilon, where epsilon is roughly proportional to (datapoints / VC dime... (read more)

Yes, roughly speaking, if you multiply the VC dimension by n, then you need n times as much training data to achieve the same performance. (More precise statement here: https://en.wikipedia.org/wiki/Vapnik%E2%80%93Chervonenkis_dimension#Uses) There are also a few other bounds you can get based on VC dimension. In practice these bounds are way too large to be useful, but an algorithm with much higher VC dimension will generally overfit more.
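One common form of the bound behind that statement (constants and log factors vary between textbooks, so treat this as a sketch): with probability at least 1 - \delta over the training sample,

\text{test error} \;\le\; \text{training error} \;+\; \sqrt{\frac{d\left(\ln\frac{2m}{d} + 1\right) + \ln\frac{4}{\delta}}{m}}

where d is the VC dimension and m the number of training examples. Holding the gap fixed, multiplying d by n requires multiplying m by roughly n as well, which is the statement above.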

2John_Maxwell
Oh yeah, thanks, I think I remember seeing that. Do you happen to know the assumptions that proof makes? I'm having a hard time finding it in Vapnik's textbook. I vaguely remember it assuming that the correct hypothesis is in our hypothesis class, which made it seem kind of uninteresting. To be clear, I agree it's easier to overfit if you try lots of models, but my explanation of this would be more Bayesian than Vapnik's. (Maybe something like: If a researcher restricts themselves to only 10 models, they will choose 10 models they feel relatively optimistic about/assign a high prior probability to. A high prior with a high likelihood gives us better generalization than a low prior with a slightly higher likelihood; the posterior probability of the first model is greater.)

A different view is to look at the search process for the models, rather than the model itself. If model A is found from a process that evaluates 10 models, and model B is found from a process that evaluates 10,000, and they otherwise have similar results, then A is much more likely to generalize to new data points than B.

The formalization of this concept is called VC dimension and is a big part of Machine Learning Theory (although arguably it hasn't been very helpful in practice): https://en.wikipedia.org/wiki/Vapnik%E2%80%93Chervonenkis_dimension
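A toy simulation of that intuition (illustrative only): here every "model" is pure noise, so any apparent edge is spurious, and the best score found by a search grows with the number of models tried.

import random

random.seed(0)

def best_backtest_score(n_models, n_periods=250):
    # Each "model" earns i.i.d. noise returns; its score is its mean return.
    # The best of n_models looks better purely because we searched harder.
    return max(
        sum(random.gauss(0, 1) for _ in range(n_periods)) / n_periods
        for _ in range(n_models)
    )

print(best_backtest_score(10))      # a modest-looking edge
print(best_backtest_score(10_000))  # a much more impressive edge, equally meaningless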

9John_Maxwell
Is there a theoretical justification for using VC dimension as a basis for generalization, or is it treated as an axiomatic desideratum?

It's a combination. The point is to throw out algorithms/parameters that do well on backtests when the assumptions are violated, because those are much more likely to be overfit.
