LESSWRONG
LW

Comment Permalink

This method of caching assumes that an expression always evaluates to the same value. This is sometimes true in functional programming, but only if you're careful. For example, suppose the expression is a function call, and you change the function's definition and restart your program. When that happens, you need to delete the out-of-date entries from the cache or your program will read an out-of-date answer.

Also, since you're using the text of an expression for the cache key, you should only use expressions that don't refer to any local variables. For example, caching an expression that's within a function and refers to a function parameter will result in bugs when the function is called more than once with different parameters.

So this might be okay in simple cases when you are working alone and know what you're doing, but it likely would result in confusion when working on a team.

It's also essentially the same kind of caching that's commonly done by build systems. It's common for makefiles to be subtly broken so that incremental builds are unreliable and you need to do a "clean build" (with an empty cache) when it really matters that a build is correct. (The make command will compare file dates, but that's often not enough due to missing dependencies.)

But it still might be better to switch to a build system that's designed for this sort of thing, because then at least people will expect to need to do a clean build whenever the results seem to be wrong.

(Bazel is a build system that tries very hard to make sure that incremental builds are always correct and you never need to do a "clean build," but it's hard enough to use that I don't really recommend it.)

Showing 3 of 5 replies (Click to show all)

6justinpombrio4y

It really depends on what your domain you're working in. If you're memoizing functions, you're not allowed to use the following things (or rather, you can only use them in functions that are not transitively called by memoized functions): * Global mutable state (to no-one's surprise) * A database, which is global mutable state * IO, including reading user input, fetching something non-static from the web, or logging * Networking with another service that has state * Getting the current date Ask a programmer to obey this list of restrictions, and -- depending on the domain they're working in -- they'll either say "ok" or "wait what that's most of what my code does". That's very clever! I don't think it's sufficient, though. For example, say you have this code: (defnp add1 [x] (+ x 10)) ; oops typo (defnp add2 [x] (add1 (add1 x))) (add2 100) You run it once and get this cache: (add1 100) = 110 (add1 (add1 100)) = 120 (add2 100) = 120 You fix the first function: (defnp add1 [x] (+ x 1)) ; fixed (defnp add2 [x] (add1 (add1 x))) (add2 100) You run it again, which invokes (add2 100), which is found in the cache to be 120. The add2 cache entry is not invalidated because the add2 function has not changed, nor has its inputs. The add1 cache entries would be invalidated if anything ever invoked add1, but nothing does. (This is what I meant by "You also have to look at the functions it calls (and the functions those call, etc.)" in my other comment.)

Ah, that's a great example, thanks for spelling it out.

3justinpombrio4y

I think this is fixable. An invocation (f expr1 expr2) will produce the same result as the last time you invoked it if: * The body of f is the same as last time. * Every function it calls, including transitively, has the same source code as the last time you called f. Also every macro and type definition that is used transitively. Basically any code that it depends on in any way. * Every function involved is pure (no state, no IO). * Every function involved is top-level. I'm not sure this will play well with higher-order functions. * The invocations expr1 and expr2 also obey this checklist. I'm not sure this list is exhaustive, but it should be do-able in principle. If I look at a function invocation and all the code it transitively depends on (say it's 50% of the codebase), and I know that that 50% of the codebase hasn't changed since last time you ran the program, and I see that that 50% of the codebase is pure, and I trust you that the other 50% of the codebase doesn't muck with it (as it very well could with e.g. macros), then that function invocation should produce the same result as last time. This is tricky enough that it might need language level support to be practical. I'm glad that Isusr is thinking of it as "writing a compiler".

32

Place-Based Programming - Part 1 - Places

14th Apr 2021

2 min read

32

When I write machine learning software, I tend to use the Place-Based Programming (PBP) paradigm. PBP caches your computations so you rarely have to perform the same computation twice.

The fundamental unit of data is a place, which refers to a location on disk. Consider the hard-coded string "I am Satoshi Nakamoto.". You can complete the place of a string by hashing it.

;; This code is written in Hy.
;; See https://docs.hylang.org/en/stable/ for documentation of the language.
(import [hashlib [md5]] os)
(setv +place-dir+ ".places/")

(defn place-of [expression]
  "Returns the place of an expression"
  (os.path.join
    +place-dir+
    "str/"
    (+ (.hexdigest (md5 (.encode (str expression))))
       ".pickle")))

;; prints ".places/<class 'hy.models.HyString'>/17f36dc3403a328572adcea3fd631f55.pickle"
(print (place-of '"I am Satoshi Nakamoto."))

In Lisp, the ' tag means "do not evaluate the following expression". Note how we did not compute the place of the string's value directly. We evaluated the place of the source code which defines the string. We can replace our function with a macro so the user does not have to quote his or her code.

(import [hashlib [md5]] os)
(setv +place-dir+ ".places/")

(defmacro place-of [expression]
  "Returns the place of an expression"
  `(os.path.join
    +place-dir+
    (str (type '~data))
    (+ (.hexdigest (md5 (.encode (str '~expression))))
       ".pickle")))

;; prints ".places/<class 'hy.models.HyString'>/17f36dc3403a328572adcea3fd631f55.pickle"
(print (place-of "I am Satoshi Nakamoto."))

Whenever a function returns a place, it implicitly guarantees that the place is populated. The place-of macro is not allowed to just compute where a place would be if it existed. The macro must also save our data to the place if the place is not already populated.

(defmacro/g! place-of [expression]
  "Returns the place of an expression"
  `(do
     (setv ~g!place
           (os.path.join
             +place-dir+
             (str (type '~code))
             (+ (.hexdigest (md5 (.encode (str '~expression))))
                ".pickle")))
     (if-not (os.path.exists ~g!place)
             (with [f (open ~g!place "wb")]
               (pickle.dump (eval '~expression) f)))
     ~g!place))

;; prints ".places/<class 'hy.models.HyString'>/17f36dc3403a328572adcea3fd631f55.pickle"
(print (place-of "I am Satoshi Nakamoto."))

Reading from a place is easier.

(defn value-of [place]
  (with [f (open place "rb")]
    (pickle.load f)))

;; prints  "I am Satoshi Nakamoto."
(print (value-of (place-of "I am Satoshi Nakamoto.")))

This constitutes a persistent memoization system where code is evaluated no more than once.

(import [time [sleep]])
(print (value-of (place-of (do (sleep 5) "This computation takes 5 seconds"))))

The first time you call the above code it will take 5 seconds to execute. On all subsequent runs the code will return instantly.

Machine Learning (ML)Programming

32

New Comment

19 comments, sorted by

Click to highlight new comments since: Today at 10:32 AM

[-]skybrian4y80

This method of caching assumes that an expression always evaluates to the same value. This is sometimes true in functional programming, but only if you're careful. For example, suppose the expression is a function call, and you change the function's definition and restart your program. When that happens, you need to delete the out-of-date entries from the cache or your program will read an out-of-date answer.

Also, since you're using the text of an expression for the cache key, you should only use expressions that don't refer to any local variables. For example, caching an expression that's within a function and refers to a function parameter will result in bugs when the function is called more than once with different parameters.

So this might be okay in simple cases when you are working alone and know what you're doing, but it likely would result in confusion when working on a team.

It's also essentially the same kind of caching that's commonly done by build systems. It's common for makefiles to be subtly broken so that incremental builds are unreliable and you need to do a "clean build" (with an empty cache) when it really matters that a build is correct. (The make command will compare file dates, but that's often not enough due to missing dependencies.)

But it still might be better to switch to a build system that's designed for this sort of thing, because then at least people will expect to need to do a clean build whenever the results seem to be wrong.

(Bazel is a build system that tries very hard to make sure that incremental builds are always correct and you never need to do a "clean build," but it's hard enough to use that I don't really recommend it.)

[-]SatvikBeri4y6-1

This is sometimes true in functional programming, but only if you're careful.

I think this overstates the difficulty, referential transparency is the norm in functional programming, not something unusual.

For example, suppose the expression is a function call, and you change the function's definition and restart your program. When that happens, you need to delete the out-of-date entries from the cache or your program will read an out-of-date answer.

As I understand, this system is mostly useful if you're using it for almost every function. In that case, your inputs are hashes which contain the source code of the function that generated them, and therefore your caches will invalidate if an upstream function's source code changed.

Also, since you're using the text of an expression for the cache key, you should only use expressions that don't refer to any local variables.

Agreed.

So this might be okay in simple cases when you are working alone and know what you're doing, but it likely would result in confusion when working on a team.

I agree that it's essentially a framework, and you'd need buy-in from a team in order to consistently use it in a repository. But I've seen teams buy into heavier frameworks pretty regularly; this version seems unusual but not particularly hard to use/understand. It's worth noting that bad caching systems are pretty common in data science, so something like this is potentially a big improvement there.

[-]justinpombrio4y60

I think this overstates the difficulty, referential transparency is the norm in functional programming, not something unusual.

It really depends on what your domain you're working in. If you're memoizing functions, you're not allowed to use the following things (or rather, you can only use them in functions that are not transitively called by memoized functions):

Global mutable state (to no-one's surprise)
A database, which is global mutable state
IO, including reading user input, fetching something non-static from the web, or logging
Networking with another service that has state
Getting the current date

Ask a programmer to obey this list of restrictions, and -- depending on the domain they're working in -- they'll either say "ok" or "wait what that's most of what my code does".

As I understand, this system is mostly useful if you’re using it for almost every function. In that case, your inputs are hashes which contain the source code of the function that generated them, and therefore your caches will invalidate if an upstream function’s source code changed.

That's very clever! I don't think it's sufficient, though.

For example, say you have this code:

(defnp add1 [x] (+ x 10)) ; oops typo
(defnp add2 [x] (add1 (add1 x)))
(add2 100)

You run it once and get this cache:

(add1 100) = 110
(add1 (add1 100)) = 120
(add2 100) = 120

You fix the first function:

(defnp add1 [x] (+ x 1)) ; fixed
(defnp add2 [x] (add1 (add1 x)))
(add2 100)

You run it again, which invokes (add2 100), which is found in the cache to be 120. The add2 cache entry is not invalidated because the add2 function has not changed, nor has its inputs. The add1 cache entries would be invalidated if anything ever invoked add1, but nothing does.

(This is what I meant by "You also have to look at the functions it calls (and the functions those call, etc.)" in my other comment.)

[-]SatvikBeri4y60

Ah, that's a great example, thanks for spelling it out.

[-]justinpombrio4y30

I think this is fixable. An invocation (f expr1 expr2) will produce the same result as the last time you invoked it if:

The body of f is the same as last time.
Every function it calls, including transitively, has the same source code as the last time you called f. Also every macro and type definition that is used transitively. Basically any code that it depends on in any way.
Every function involved is pure (no state, no IO).
Every function involved is top-level. I'm not sure this will play well with higher-order functions.
The invocations expr1 and expr2 also obey this checklist.

I'm not sure this list is exhaustive, but it should be do-able in principle. If I look at a function invocation and all the code it transitively depends on (say it's 50% of the codebase), and I know that that 50% of the codebase hasn't changed since last time you ran the program, and I see that that 50% of the codebase is pure, and I trust you that the other 50% of the codebase doesn't muck with it (as it very well could with e.g. macros), then that function invocation should produce the same result as last time.

This is tricky enough that it might need language level support to be practical. I'm glad that Isusr is thinking of it as "writing a compiler".

Note to readers: skybrian's parent comment was written when this post was titled "Place-Based Programming", before I changed the title to "Place-Based Programming - Part 1".

Your are correct that the code here in Part 1 breaks when you use variables with nonlocal scope. I begin to solve this problem in Part 2.

It's also essentially the same kind of caching that's commonly done by build systems.

Yes. I often think about this project as "writing a compiler". Some of the techniques I use come from Makefiles.

[-]uncomputable4y80

generally referred to as persistent memoization.

maybe persistent futures in this case, since you've got these intermediate "place" values.

Thanks. I have changed "simple caching system" to "persistent memoization system".

[-]Nicholas / Heather Kross4y60

My normie-paradigm mindset translates this to something like "the variable is a pointer to a result, and the result needs to be computed once (same as a source file may need to be computed once to use)". Is this accurate, or am I missing more of it?

This is accurate. Places are analogous to pointers.

[-]SatvikBeri4y40

This is very cool. The focus on caching a code block instead of just the inputs to the function makes it significantly more stable, since your cache will be automatically invalidated if you change the code in any way.

[-]justinpombrio4y30

More stable, but not significantly so.

You cannot tell what an expression does just by looking at the expression. You also have to look at the functions it calls (and the functions those call, etc.). If any of those change, then the expression may change as well.

You also need to look at local variables, as skybrain points out. For example, this function:

(defn myfunc [x] (value-of (place-of [EXPR INVOLVING x])))

will behave badly: the first time you call it it will compute the answer for the value of x you give it. The second time you call it, it will compute the same answer, regardless of what x you give it.

Note to readers: justinpombrio's parent comment was written when this post was titled "Place-Based Programming", before I changed the title to "Place-Based Programming - Part 1".

I think the disagreement here comes from me communicating different things to different people. I showed SatvikBeri a more complete system of which this post is just a tiny part. If all you see is this post (without the "Part 1") then justinpombrio's comment makes sense. If you see the entire project then SatvikBeri's comment makes sense.

justinpombrio's particular example can be solved with the defnp macro in Part 2.

(defnp myfunc [x] x)

(assert (= 3 value-of (myfunc 3))) ;; works

Not related to the main idea, but the point of os.path.join is to combine path elements using whichever delimiter the OS requires ("/" on Unix, "\" on Windows, etc., even though Windows in particular can also handle "/"). If you don't care about that portability, you might as well use normal string concatenation. Or if you're using os.path.join, you might as well omit the "/" delimiters in your string literals to get extra portability.

The place-of macro is not allowed to just compute where a place would be if it existed. The macro must also save our data to the place [if] the place is not already populated.

Fixed. Thanks.

Satvik mentioned you had a way to go from hash to the source code?

You can go from hash to source code by saving the source code too in addition to saving the value. You can go from place to source code by treating source code as a value. Otherwise, hashing is a trapdoor function.

[+][comment deleted]4y20

Deleted by SatvikBeri, 04/14/2021

More from lsusr

Curated and popular this week