When I write machine learning software, I tend to use the Place-Based Programming (PBP) paradigm. PBP caches your computations so you rarely have to perform the same computation twice.
The fundamental unit of data is a place, which refers to a location on disk. Consider the hard-coded string "I am Satoshi Nakamoto." You can compute the place of a string by hashing it.
;; This code is written in Hy.
;; See https://docs.hylang.org/en/stable/ for documentation of the language.
(import [hashlib [md5]] os)
(setv +place-dir+ ".places/")
(defn place-of [expression]
  "Returns the place of an expression"
  (os.path.join
    +place-dir+
    (str (type expression))
    (+ (.hexdigest (md5 (.encode (str expression))))
       ".pickle")))
;; prints ".places/<class 'hy.models.HyString'>/17f36dc3403a328572adcea3fd631f55.pickle"
(print (place-of '"I am Satoshi Nakamoto."))
In Lisp, the ' tag means "do not evaluate the following expression". Note how we did not compute the place of the string's value directly. We computed the place of the source code that defines the string. We can replace our function with a macro so the user does not have to quote his or her code.
(import [hashlib [md5]] os)
(setv +place-dir+ ".places/")
(defmacro place-of [expression]
  "Returns the place of an expression"
  `(os.path.join
     +place-dir+
     (str (type '~expression))
     (+ (.hexdigest (md5 (.encode (str '~expression))))
        ".pickle")))
;; prints ".places/<class 'hy.models.HyString'>/17f36dc3403a328572adcea3fd631f55.pickle"
(print (place-of "I am Satoshi Nakamoto."))
Whenever a function returns a place, it implicitly guarantees that the place is populated. The place-of macro is not allowed to just compute where a place would be if it existed. The macro must also save our data to the place if the place is not already populated.
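(import pickle)
(defmacro/g! place-of [expression]
  "Returns the place of an expression, populating it first if necessary"
  `(do
     (setv ~g!place
       (os.path.join
         +place-dir+
         (str (type '~expression))
         (+ (.hexdigest (md5 (.encode (str '~expression))))
            ".pickle")))
     (if-not (os.path.exists ~g!place)
       (do
         ;; create the type-named subdirectory before the first write
         (os.makedirs (os.path.dirname ~g!place) :exist_ok True)
         (with [f (open ~g!place "wb")]
           (pickle.dump (eval '~expression) f))))
     ~g!place))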
;; prints ".places/<class 'hy.models.HyString'>/17f36dc3403a328572adcea3fd631f55.pickle"
(print (place-of "I am Satoshi Nakamoto."))
Reading from a place is easier.
(defn value-of [place]
  (with [f (open place "rb")]
    (pickle.load f)))
;; prints "I am Satoshi Nakamoto."
(print (value-of (place-of "I am Satoshi Nakamoto.")))
This constitutes a persistent memoization system where code is evaluated no more than once.
(import [time [sleep]])
(print (value-of (place-of (do (sleep 5) "This computation takes 5 seconds"))))
The first time you call the above code it will take 5 seconds to execute. On all subsequent runs the code will return instantly.
This method of caching assumes that an expression always evaluates to the same value. This is sometimes true in functional programming, but only if you're careful. For example, suppose the expression is a function call, and you change the function's definition and restart your program. When that happens, you need to delete the out-of-date entries from the cache or your program will read an out-of-date answer.
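One way to reduce the stale-entry problem is to fold a fingerprint of the called function into the cache key, so that editing the function changes the key. The following is only a rough sketch in Python (which Hy compiles to), not something from the post: `code_fingerprint`, `cache_key`, `score_v1` and `score_v2` are hypothetical names, and hashing the bytecode is just one possible choice of fingerprint.

```python
import hashlib


def code_fingerprint(fn):
    """Hash a function's compiled bytecode, constants and referenced names."""
    code = fn.__code__
    payload = repr((code.co_code, code.co_consts, code.co_names)).encode()
    return hashlib.md5(payload).hexdigest()


def cache_key(expression_text, *functions):
    """Combine the expression text with a fingerprint of each listed function."""
    parts = [expression_text] + [code_fingerprint(fn) for fn in functions]
    return hashlib.md5("\n".join(parts).encode()).hexdigest()


def score_v1(x):
    return x + 1


def score_v2(x):  # stands in for an edited definition of the same function
    return x + 2


key_before = cache_key("(score 10)", score_v1)
key_after = cache_key("(score 10)", score_v2)
print(key_before != key_after)  # True: the edit produces a different key
```

This only covers the functions you explicitly pass in. Anything they call in turn is not fingerprinted, and functions containing nested functions or lambdas may not hash stably across runs, so a manual cache wipe can still be necessary.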
Also, since you're using the text of an expression for the cache key, you should only use expressions that don't refer to any local variables. For example, caching an expression that's within a function and refers to a function parameter will result in bugs when the function is called more than once with different parameters.
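To make the local-variable problem concrete, here is a small Python sketch of the same caching scheme (Python rather than Hy, for brevity). Everything in it is illustrative and assumed: `PLACE_DIR` is a throwaway temp directory rather than the post's `.places/`, and the `inputs` argument is a hypothetical addition that is not part of the original macro.

```python
import hashlib
import os
import pickle
import tempfile

# Hypothetical cache directory for this sketch (a fresh temp dir).
PLACE_DIR = tempfile.mkdtemp(prefix="places-")


def place_of(source_text, compute, inputs=()):
    """Populate (if needed) and return a place keyed by source text and inputs."""
    key = hashlib.md5(repr((source_text, inputs)).encode()).hexdigest()
    place = os.path.join(PLACE_DIR, key + ".pickle")
    if not os.path.exists(place):
        with open(place, "wb") as f:
            pickle.dump(compute(), f)
    return place


def value_of(place):
    with open(place, "rb") as f:
        return pickle.load(f)


def double_text_only(x):
    # The key is only the text "(* 2 x)", so every call shares one entry.
    return value_of(place_of("(* 2 x)", lambda: 2 * x))


def double_with_inputs(x):
    # The key also includes the value of x, so each argument gets its own entry.
    return value_of(place_of("(* 2 x)", lambda: 2 * x, inputs=(x,)))


print(double_text_only(1), double_text_only(5))      # 2 2  (second call is wrong)
print(double_with_inputs(1), double_with_inputs(5))  # 2 10
```

Adding the inputs to the key fixes this particular bug, but it requires the caller to list every free variable correctly, which is easy to get wrong.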
So this might be okay in simple cases when you are working alone and know what you're doing, but it likely would result in confusion when working on a team.
It's also essentially the same kind of caching that's commonly done by build systems. It's common for makefiles to be subtly broken so that incremental builds are unreliable and you need to do a "clean build" (with an empty cache) when it really matters that a build is correct. (The make command will compare file dates, but that's often not enough due to missing dependencies.)
But it still might be better to switch to a build system that's designed for this sort of thing, because then at least people will expect to need to do a clean build whenever the results seem to be wrong.
(Bazel is a build system that tries very hard to make sure that incremental builds are always correct and you never need to do a "clean build," but it's hard enough to use that I don't really recommend it.)
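The missing-dependency failure mentioned above for make can be sketched in a few lines of Python. This is a toy model of a timestamp check, not how make is actually implemented; `needs_rebuild` and `touch` are hypothetical helpers, and the modification times are set by hand so the result does not depend on clock resolution.

```python
import os
import tempfile


def needs_rebuild(target, declared_deps):
    """Rebuild if the target is missing or older than any declared dependency."""
    if not os.path.exists(target):
        return True
    target_time = os.path.getmtime(target)
    return any(os.path.getmtime(dep) > target_time for dep in declared_deps)


def touch(path, mtime):
    """Create the file if needed and set its modification time explicitly."""
    with open(path, "a"):
        pass
    os.utime(path, (mtime, mtime))


workdir = tempfile.mkdtemp(prefix="build-")
a, b, out = (os.path.join(workdir, name) for name in ("a.txt", "b.txt", "out.txt"))

# out.txt was really built from both a.txt and b.txt.
touch(a, 1000)
touch(b, 1000)
touch(out, 2000)

# b.txt is edited after the build.
touch(b, 3000)

print(needs_rebuild(out, [a]))     # False: the rule forgot b.txt, so the stale output is kept
print(needs_rebuild(out, [a, b]))  # True: with every input declared, the change is noticed
```

The expression cache in the post has the same shape of problem: the key (here, the declared dependency list) does not capture everything the result actually depends on.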
Ah, that's a great example, thanks for spelling it out.