When I write machine learning software, I tend to use the Place-Based Programming (PBP) paradigm. PBP caches your computations so you rarely have to perform the same computation twice.
The fundamental unit of data is a place, which refers to a location on disk. Consider the hard-coded string "I am Satoshi Nakamoto.". You can compute the place of a string by hashing it.
;; This code is written in Hy.
;; See https://docs.hylang.org/en/stable/ for documentation of the language.
(import [hashlib [md5]] os)
(setv +place-dir+ ".places/")
(defn place-of [expression]
  "Returns the place of an expression"
  (os.path.join
    +place-dir+
    (str (type expression))
    (+ (.hexdigest (md5 (.encode (str expression))))
       ".pickle")))
;; prints ".places/<class 'hy.models.HyString'>/17f36dc3403a328572adcea3fd631f55.pickle"
(print (place-of '"I am Satoshi Nakamoto."))
In Lisp, the ' tag means "do not evaluate the following expression". Note that we did not compute the place of the string's value directly; we computed the place of the source code that defines the string. We can replace our function with a macro so the user does not have to quote their code.
(import [hashlib [md5]] os)
(setv +place-dir+ ".places/")
(defmacro place-of [expression]
  "Returns the place of an expression"
  `(os.path.join
     +place-dir+
     (str (type '~expression))
     (+ (.hexdigest (md5 (.encode (str '~expression))))
        ".pickle")))
;; prints ".places/<class 'hy.models.HyString'>/17f36dc3403a328572adcea3fd631f55.pickle"
(print (place-of "I am Satoshi Nakamoto."))
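To make the source-versus-value distinction concrete, here is a minimal sketch in Python, the language Hy compiles to. It is an illustration under stated assumptions, not a translation of the macro above: Python has no macros, so the caller passes the source text as a string, and the name `place_of_source` and the `source` subdirectory are hypothetical.

```python
import hashlib
import os

PLACE_DIR = ".places/"  # mirrors +place-dir+ from the Hy code


def place_of_source(source: str) -> str:
    """Return the place for a piece of source text. Nothing is evaluated."""
    digest = hashlib.md5(source.encode()).hexdigest()
    return os.path.join(PLACE_DIR, "source", digest + ".pickle")


# Both expressions evaluate to 2, but their source text differs,
# so they are assigned different places.
place_a = place_of_source("1 + 1")
place_b = place_of_source("2")
same_value = eval("1 + 1") == eval("2")
```

Here `same_value` is true while `place_a` and `place_b` differ, because the place is keyed on the code that defines the data rather than on the data itself.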
Whenever a function returns a place, it implicitly guarantees that the place is populated. The place-of macro is not allowed to just compute where a place would be if it existed. The macro must also save our data to the place if the place is not already populated.
(import pickle)
(defmacro/g! place-of [expression]
  "Returns the place of an expression"
  `(do
     (setv ~g!place
       (os.path.join
         +place-dir+
         (str (type '~expression))
         (+ (.hexdigest (md5 (.encode (str '~expression))))
            ".pickle")))
     (if-not (os.path.exists ~g!place)
       (do
         ;; Create the parent directory on first use so that open does not fail.
         (os.makedirs (os.path.dirname ~g!place) :exist_ok True)
         (with [f (open ~g!place "wb")]
           (pickle.dump (eval '~expression) f))))
     ~g!place))
;; prints ".places/<class 'hy.models.HyString'>/17f36dc3403a328572adcea3fd631f55.pickle"
(print (place-of "I am Satoshi Nakamoto."))
Reading from a place is easier.
(defn value-of [place]
  (with [f (open place "rb")]
    (pickle.load f)))
;; prints "I am Satoshi Nakamoto."
(print (value-of (place-of "I am Satoshi Nakamoto.")))
This constitutes a persistent memoization system where code is evaluated no more than once.
(import [time [sleep]])
(print (value-of (place-of (do (sleep 5) "This computation takes 5 seconds"))))
The first time you call the above code it will take 5 seconds to execute. On all subsequent runs the code will return instantly.
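The same populate-on-miss behaviour can be sketched in Python. This is a rough analogue rather than the Hy implementation: source is passed as a string and evaluated with `eval`, the cache lives in a temporary directory so the demo is self-contained (a real setup would use a fixed directory such as `.places/` so that the cache survives between runs), and the `evaluations` list is a hypothetical counter added only to show how often the code actually runs.

```python
import hashlib
import os
import pickle
import tempfile

# Hypothetical cache directory; a temporary one keeps the demo self-contained.
PLACE_DIR = tempfile.mkdtemp(prefix="places-")

evaluations = []  # records each time a source string is actually evaluated


def place_of(source: str) -> str:
    """Return a populated place for `source`, evaluating it only on a cache miss."""
    digest = hashlib.md5(source.encode()).hexdigest()
    place = os.path.join(PLACE_DIR, digest + ".pickle")
    if not os.path.exists(place):
        evaluations.append(source)
        with open(place, "wb") as f:
            pickle.dump(eval(source), f)
    return place


def value_of(place: str):
    """Load the value stored at a place."""
    with open(place, "rb") as f:
        return pickle.load(f)


expensive = "sum(i * i for i in range(1000))"
first = value_of(place_of(expensive))   # cache miss: evaluates and stores
second = value_of(place_of(expensive))  # cache hit: reads the pickle back
```

After both calls, `evaluations` holds a single entry: the second call found the pickle on disk and skipped the evaluation.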
I think this overstates the difficulty; referential transparency is the norm in functional programming, not something unusual.
As I understand it, this system is mostly useful if you're using it for almost every function. In that case, your inputs are places whose hashes are derived from the source code of the functions that generated them, and therefore your caches will invalidate if an upstream function's source code changes.
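A small Python sketch of that invalidation argument, under the assumption that a downstream expression textually contains the upstream expression (as nested place-of forms would). The names `place_key`, `downstream_source`, `train`, `load_data`, and `raw.csv` are hypothetical; the strings are only hashed, never evaluated.

```python
import hashlib


def place_key(source: str) -> str:
    """Hash source text the same way the post hashes an expression."""
    return hashlib.md5(source.encode()).hexdigest()


def downstream_source(upstream_source: str) -> str:
    # The downstream expression embeds the upstream source text, so any
    # edit upstream changes the text that gets hashed downstream.
    return f"train(value_of(place_of({upstream_source!r})))"


upstream_v1 = "load_data('raw.csv')"
upstream_v2 = "load_data('raw.csv', dropna=True)"

key_v1 = place_key(downstream_source(upstream_v1))
key_v2 = place_key(downstream_source(upstream_v2))
```

The two keys differ, so editing the upstream code sends the downstream result to a fresh place instead of reusing the stale one, while unchanged code keeps mapping to the same key.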
Agreed.
I agree that it's essentially a framework, and you'd need buy-in from a team in order to consistently use it in a repository. But I've seen teams buy into heavier frameworks pretty regularly; this version seems unusual but not particularly hard to use/understand. It's worth noting that bad caching systems are pretty common in data science, so something like this is potentially a big improvement there.