I finished the third deep learning course on Coursera and the last two aren’t available, so I went back to trying out some Keras code to see how far I could get running things on my own. It was easy enough to find a tutorial that set up MNIST and then find something that quickly hit 98%+ accuracy on it.
It was also much easier than I had worried it would be to set up a network that takes the same inputs through multiple parallel layers—in particular, I wanted to try training a network with a dense layer that operates separately from the convolutional layer, then merges the results for the final output. This didn’t actually improve my performance at all, but it was a nice exercise to learn how to do that, and I think from here I should be able to build networks with basically arbitrary graph structure, so that’s exciting.
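For reference, here is a minimal sketch of that kind of structure using the Keras functional API. The layer sizes are arbitrary placeholders rather than the ones I actually used; the point is just the shape: two branches over the same input, concatenated before the output layer.

```python
from keras.layers import Input, Conv2D, MaxPooling2D, Flatten, Dense, concatenate
from keras.models import Model

# Shared input: 28x28 grayscale images
inputs = Input(shape=(28, 28, 1))

# Branch 1: a small convolutional stack
conv = Conv2D(32, (3, 3), activation='relu')(inputs)
conv = MaxPooling2D((2, 2))(conv)
conv = Flatten()(conv)

# Branch 2: a dense layer operating directly on the flattened pixels
dense = Flatten()(inputs)
dense = Dense(128, activation='relu')(dense)

# Merge the two branches and classify
merged = concatenate([conv, dense])
outputs = Dense(10, activation='softmax')(merged)

model = Model(inputs=inputs, outputs=outputs)
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
```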
But since it’s a bit hard to pick at the last teeny bits of error in MNIST (I hit 98.6% accuracy on the test set with my first model), I wanted to pick up a bigger, badder dataset to play with. Since I wanted something that wasn’t MNIST, my first thought was obviously notMNIST! So I went ahead and downloaded the dataset, unzipped it, ran a script to rewrite it as a MATLAB matrix, cleared my memory, reloaded only the numpy arrays, and tried to split them into train and test sets before getting into anything.
And then my computer ran out of memory.
This virtual environment has 8 GB of RAM, and this dataset consists of about 500k images at 28x28 resolution. Stored as a text file, it comes to 3.3 GB. Ironically, that’s over ten times the file size of the scipy data. Maybe because it’s saving every pixel value as a float instead of an int? (Note: forcing the dataset to be ints did not help.)
It’s not clear to me why I’m running out of memory now, when I haven’t even loaded my neural network into memory, but it is clear to me why I’d run out of memory when I got there.
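That said, some rough arithmetic on the numbers above at least makes the squeeze plausible: ~500k images at 28x28 is about 392 million values, which is already a few gigabytes as 64-bit floats before anything gets copied.

```python
import numpy as np

n_values = 500000 * 28 * 28  # ~392 million pixel values

# In-memory size of the full dataset for a few dtypes
for dtype in (np.float64, np.int64, np.float32, np.uint8):
    gb = n_values * np.dtype(dtype).itemsize / 1e9
    print(np.dtype(dtype).name, round(gb, 2), "GB")

# float64: ~3.14 GB, int64: ~3.14 GB, float32: ~1.57 GB, uint8: ~0.39 GB
# numpy's default int is typically int64 (8 bytes), which may be why casting
# to ints didn't help; uint8 is the 1-byte option. A train/test split that
# copies the arrays can briefly need roughly twice the float64 figure.
```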
What I actually want to do here is to push the train and test sets into different files, and just pull one minibatch at a time into memory during training, then drop that from memory and pull a new one. Googling reveals this to be a common problem, but it seems that the solutions are somewhat ad hoc.
In particular, the standard solution is to use a batch generator… which you write yourself.
This seems odd to me, since there are a few somewhat standard ways to store numerical data—I would expect that there would be a standard batch generator for accessing at least one such format, and the job of sticking my data in that format is much less annoying to me than the job of writing my own batch generator.
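For concreteness, here is roughly what I mean by a batch generator, as a sketch: it assumes the train split has already been saved to .npy files (the file names below are made up), and it leans on numpy’s mmap_mode so that only the rows each batch touches actually get read from disk.

```python
import numpy as np

def batch_generator(image_path, label_path, batch_size=128):
    """Yield (images, labels) minibatches forever, reading from disk as needed."""
    # mmap_mode="r" memory-maps the .npy file: slicing reads only the rows we
    # ask for, so the full array never has to fit in RAM at once.
    images = np.load(image_path, mmap_mode="r")
    labels = np.load(label_path, mmap_mode="r")
    n = images.shape[0]
    while True:
        # Pick a random batch of row indices each step.
        idx = np.sort(np.random.randint(0, n, size=batch_size))
        # Fancy indexing on the memmap materializes just this batch in memory.
        # Labels are assumed to already be in whatever format the loss expects.
        yield np.asarray(images[idx], dtype=np.float32), np.asarray(labels[idx])
```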
That said, it looks like this may just be a question of time. For example, Keras does have this function, which collapses the coding problem from building a loop in which to use the generator down to only writing the generator. I wouldn’t be surprised to see an extension of this come up soon in which Keras (or some other supplemental library) can pull from a standard format like scipy or numpy saved dictionaries and build its own generator with default parameters.
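Assuming the function in question behaves like Keras’s fit_generator (which is what I would reach for), the usage pattern with a generator like the sketch above is roughly this, with placeholder step counts:

```python
# Hand the generator straight to Keras instead of loading arrays into memory.
train_gen = batch_generator("train_images.npy", "train_labels.npy", batch_size=128)

model.fit_generator(
    train_gen,
    steps_per_epoch=450000 // 128,  # batches per epoch; 450000 is a placeholder train-set size
    epochs=5,
)
```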
--
On an unrelated note, one thing I’ve considered idly from my armchair is the question of how much precision is actually useful in deep learning.
In a number of ways, reducing precision can be advantageous. For example, ReLUs work for training because they are an approximation of a simple combination of sigmoid units, but they are popular because, by dropping precision in that approximation, they become MUCH faster to compute. Similarly, data augmentation by adding noise, especially in image processing, is a well-established and effective way of improving performance. In a sense, this is trying to force your model to learn not to care too much about specific, precise details.
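To make that first claim concrete: the usual construction is that summing sigmoids evaluated at x − 0.5, x − 1.5, x − 2.5, and so on approximates softplus, which is itself a smoothed ReLU. A quick numerical check (my own sketch, plain numpy) shows the truncated sum behaving like max(0, x) with a smooth corner near zero, while costing dozens of sigmoid evaluations instead of a single comparison.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

def shifted_sigmoid_sum(z, n_terms=50):
    # Sum of sigmoids evaluated at z - 0.5, z - 1.5, z - 2.5, ...
    # This approximates softplus(z) = log(1 + exp(z)), which is a
    # smooth approximation of relu(z).
    offsets = np.arange(n_terms) + 0.5
    return sigmoid(np.asarray(z)[..., None] - offsets).sum(axis=-1)

x = np.linspace(-5.0, 5.0, 11)
print(np.round(relu(x), 3))
print(np.round(shifted_sigmoid_sum(x), 3))
# The sum needs n_terms sigmoid evaluations per input; relu is one comparison.
```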
It seems to me that another way of pushing these kinds of advantages would be to use a less precise data type for the weights of your network to begin with. If each step of your gradient descent process has some threshold for what size of step is meaningful, versus what size of step is so small that it’s not worth updating on because it’s likely to be noise, then I can imagine getting some of the advantages of L1 regularization, data augmentation, and simpler architectures, while also spending less time on the computations in backpropagation and less memory per parameter.
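As a toy illustration of that thresholding idea (plain numpy, not any particular framework): in 16-bit floats, an update smaller than the spacing between representable values near a weight simply disappears, which is exactly the “too small to be worth updating on” behavior, and each parameter takes a quarter of the memory of a 64-bit float.

```python
import numpy as np

w64 = np.float64(1.0)
w16 = np.float16(1.0)
step = np.float16(1e-4)  # a gradient update smaller than float16's spacing near 1.0

print(np.finfo(np.float16).eps)   # ~0.000977: gap between adjacent float16 values near 1.0
print(w64 + 1e-4 == w64)          # False: float64 keeps the tiny update
print(w16 + step == w16)          # True: float16 rounds the update away entirely

# Memory per parameter: 2 bytes for float16 vs 8 bytes for float64
print(np.dtype(np.float16).itemsize, np.dtype(np.float64).itemsize)
```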
That said, I can think of a few reasons this might not be practical, even if I don’t know which of those reasons will or will not apply in practice.
It could be that linear algebra libraries have already been optimized to work on 64-bit floats, and that optimizing for smaller data types like 32- or 16-bit floats wouldn’t actually speed up calculations (this is my number one guess for why this isn’t done).
It could be that removing small gradient descent updates damages momentum and makes networks more likely to get caught in long stretches of close-to-flat loss, since they can’t take as many small steps that might lead them in the right direction.
It could be that the behavior of networks depends heavily, through some chaotic process, on very small changes in the values coming through the middle layers of the network. I would be surprised by this, since it’s clear that small differences in the earliest layers (the input, for example) shouldn’t create big differences, and neither should small differences in the latest layers.