“The subsequent values, all comma separated, are the pixel values of the handwritten digit. The size of the pixel array is 28 by 28, so there are 784 values after the label. Count them if you really want!
So that first record represents the number “5” as shown by the first value, and the rest of the text on that line are the pixel values for someone’s handwritten number 5. The second record represents a handwritten “0”, the third represents “4”, the fourth record is “1” and the fifth represents “9”. You can pick any line from the MNIST data files and the first number will tell you the label for the following image data.
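To make the layout of a record concrete, here is a minimal sketch that splits one line of text into its label and its pixel values. The record used here is made up for illustration (the label 5 followed by 784 zeros) and is not a real line from the MNIST files; the names record, all_values, label and pixels are just illustrative choices.

```python
# A made-up record: the label 5 followed by 784 pixel values, all zero.
# This is NOT a real MNIST line, it only has the same shape as one.
record = "5," + ",".join(["0"] * 784)

# Split the comma separated text into individual values
all_values = record.split(",")

# The first value is the label, the rest are the pixel values
label = int(all_values[0])
pixels = [int(v) for v in all_values[1:]]

print(label)
print(len(pixels))
```

The same split would apply to any line of the CSV files, since every record starts with its label.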
But it is hard to see how that long list of 784 values makes up a picture of someone’s handwritten number 5. We should plot those numbers as an image to confirm that they really are the colour values of a handwritten number.
Before we dive in and do that, we should download a smaller subset of the MNIST data set. The MNIST data files are pretty big, and working with a smaller subset is helpful because it means we can experiment, trial and develop our code without being slowed down by a large data set. Once we’ve settled on an algorithm and code we’re happy with, we can use the full data set.
The following are the links to smaller subsets of the MNIST dataset, also in CSV format:
10 records from the MNIST test data set – https://raw.githubusercontent.com/makeyourownneuralnetwork/makeyourownneuralnetwork/master/mnist_dataset/mnist_test_10.csv
100 records from the MNIST training data set – https://raw.githubusercontent.com/makeyourownneuralnetwork/makeyourownneuralnetwork/master/mnist_dataset/mnist_train_100.csv
If your browser shows the data instead of downloading it automatically, you can manually save the file using “File -> Save As …”, or the equivalent action in your browser.
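If you would rather fetch the files from Python than from the browser, something like the following sketch should work. It uses urlretrieve() from Python's standard urllib.request module; the helper name download_file and its folder parameter are made up for this example, and the folder name simply follows the “mnist_dataset” convention used in this guide.

```python
import os
import urllib.request

def download_file(url, folder="mnist_dataset"):
    # Hypothetical helper, not part of any library.
    # Make sure the destination folder exists
    os.makedirs(folder, exist_ok=True)
    # Use the last part of the URL as the local file name
    filename = os.path.join(folder, url.split("/")[-1])
    # Fetch the URL and save its content to the local file
    urllib.request.urlretrieve(url, filename)
    return filename
```

Calling download_file() with either of the two links above should leave a CSV file of the same name in the chosen folder, provided the computer is online.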
Save the data files to a location that works well for you. I keep my data files in a folder called “mnist_dataset” next to my IPython notebooks as shown in the following screenshot. It gets messy if IPython notebooks and data files are scattered all over the place.
Before we can do anything with the data, like plotting it or training a neural network with it, we need to find a way to get at it from our Python code.
Opening a file and getting its content is really easy in Python. It’s best to just show it and explain it. Have a look at the following code:
data_file = open("mnist_dataset/mnist_train_100.csv", 'r')
data_list = data_file.readlines()
data_file.close()
There are only three lines of code here. Let’s talk through each one.
The first line uses a function open() to open a file. You can see the first parameter passed to the function is the name of the file. Actually, it is more than just the filename “mnist_train_100.csv”, it is the whole path which includes the directory the file is in. The second parameter is optional, and tells Python how we want to treat the file. The 'r' tells Python that we want to open the file for reading only, and not for writing. That way we avoid any accidents changing or deleting the data. If we tried to write to that file and change it, Python would stop us and raise an error. What’s that variable data_file? The open() function creates a file handle, a reference, to that file and we’ve assigned it to a variable named data_file. Now that we’ve opened the file, any further actions, like reading from it, are done through that handle.
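The read-only protection is easy to try out. The following sketch creates a small throwaway file in a temporary folder (so no real data is touched), opens it with 'r', and then attempts to write through the handle. The file name example.csv and its content are invented for the example.

```python
import io
import os
import tempfile

# Create a small throwaway file so the example does not touch real data
folder = tempfile.mkdtemp()
path = os.path.join(folder, "example.csv")
with open(path, "w") as f:
    f.write("5,0,0,0\n")

# Open the file for reading only
data_file = open(path, "r")

try:
    # Attempt to write through the read-only handle
    data_file.write("7,1,1,1\n")
    write_failed = False
except io.UnsupportedOperation:
    # Python refuses to write through a handle opened with 'r'
    write_failed = True

data_file.close()
print(write_failed)
```

The attempted write raises an error and the file's content stays exactly as it was.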
The next line is simple. We use the readlines() function associated with the file handle data_file to read all of the lines in the file into the variable data_list. The variable contains a list, where each item of the list is a string representing a line in the file. That’s much more useful because we can jump to specific lines just like we would jump to specific entries in a list. So data_list[0] is the first record, and data_list[9] is the tenth record, and so on.”
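Here is a small sketch of that list behaviour, again using an invented ten-line file in a temporary folder rather than the real MNIST data. It opens the file with Python's with statement, a common alternative that closes the file automatically when the block ends.

```python
import os
import tempfile

# Build a throwaway file with ten short made-up records
folder = tempfile.mkdtemp()
path = os.path.join(folder, "example.csv")
with open(path, "w") as f:
    for i in range(10):
        f.write(str(i) + ",0,0,0\n")

# Read every line of the file into a list of strings
with open(path, "r") as data_file:
    data_list = data_file.readlines()

print(len(data_list))   # number of records read
print(data_list[0])     # first record
print(data_list[9])     # tenth record
```

Note that readlines() keeps the newline character at the end of each string, so each item is the full line of text exactly as it appears in the file.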