Rebuilding AlexNet #
My role at GM (aka ex-Cruise) has started pulling me more and more into the machine learning space, in particular models that are critical to autonomous vehicles such as perception, planning, and world models. For context, I have traditionally worked more on the system performance / vehicle software side, not on the models directly.
As such, I’ve started trying to get a better grasp of ML fundamentals. One of my first forays into getting hands-on experience is rebuilding AlexNet, one of the most influential models in modern machine learning. It was the first model to achieve a top-5 error rate of 15.3% in the ImageNet Large Scale Visual Recognition Challenge, popularizing convolutional neural networks in the process.
This post serves to document some of my learnings as I attempted to recreate it.
My General Approach #
I didn’t start by recreating AlexNet immediately: I instead started with a simple neural net + ReLU based model, to try to understand the individual impact of the various architectural choices.
I also tried to use vanilla PyTorch, although I did build a Bazel-based build workflow since I prefer the hermeticity and relatively simple multi-language dependency management.
To keep training fast (since I was focused more on experimentation than production use cases), I used the much smaller imagenette dataset rather than ImageNet, which AlexNet was trained and tested against. Imagenette has only 10 classes and far less data.
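For concreteness, here’s a minimal sketch of one common way to load imagenette with vanilla PyTorch and torchvision’s ImageFolder. The paths and the 224x224 input size are illustrative, not exactly what my Bazel targets use.

```python
from torchvision import datasets, transforms

# Basic preprocessing: resize to a fixed size and convert to tensors.
# The directory paths assume the imagenette archive has been extracted locally.
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

train_data = datasets.ImageFolder("data/imagenette2/train", transform=preprocess)
val_data = datasets.ImageFolder("data/imagenette2/val", transform=preprocess)

print(train_data.classes)  # the 10 imagenette classes
```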
The Learnings #
Terminology #
If I had one big takeaway, it was understanding the terminology a lot more deeply! Knowing terms like logit, epoch, stochastic gradient descent, maxpool, generalization, and overfitting in a more intuitive and visceral way helps me understand ideas discussed within our AV organization much more quickly.
Always randomize datasets #
result: 9.99% -> 40%
My initial iteration had a 9.99% accuracy. After attaching the debugger and looking at the model’s predicted output by printing the final logit layer, I noticed that the output was always the same classification, no matter what the input was.
This was caused by me not randomizing the dataset: effectively, every epoch the model spent the last 10% of its time being trained exclusively against one classification, biasing it toward that one class.
After I fixed that, my simple 2-layer neural network got up to 40% accuracy!
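In PyTorch terms, the fix is just the `shuffle` flag on the `DataLoader`. A minimal sketch, reusing the `train_data` / `val_data` datasets from the earlier snippet:

```python
from torch.utils.data import DataLoader

# imagenette is laid out one directory per class, so without shuffle=True the
# samples arrive grouped by class and the tail of every epoch is dominated by
# a single label, biasing the model toward it.
train_loader = DataLoader(train_data, batch_size=32, shuffle=True)
val_loader = DataLoader(val_data, batch_size=32, shuffle=False)  # eval order doesn't matter
```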
Misc things that helped the neural net #
The highest I could get the neural network was 44%. Here’s a table of things I tried:
| Accuracy | Epochs | Batch Size | Learning Rate | Shuffled | Momentum |
|---|---|---|---|---|---|
| 44.6% | 10 | 32 | 0.01 | yes | 0 |
| 42.9% | 20 | 16 | 0.01 | yes | 0 |
| 41.9% | 10 | 16 | 0.001 | yes | 0 |
| 40.9% | 10 | 32 | 0.001 | yes | 0 |
| 38.0% | 5 | 16 | 0.001 | yes | 0 |
| 11% | 10 | 16 | 0.01 | no | 0.9 |
| 9.9% | 10 | 16 | 0.001 | no | 0 |
I could tweak batch size and learning rate to help it converge more quickly.
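For reference, the runs in the table all follow the standard PyTorch training loop. Here’s a sketch using the best row’s hyperparameters (10 epochs, batch size 32, lr 0.01, no momentum); the two-layer model shown is a stand-in, not my exact layer sizes:

```python
import torch
import torch.nn as nn

# Stand-in for the simple two-layer fully connected + ReLU baseline.
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(3 * 224 * 224, 512),
    nn.ReLU(),
    nn.Linear(512, 10),  # one logit per imagenette class
)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.0)

for epoch in range(10):
    model.train()
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)  # logits vs. ground-truth labels
        loss.backward()
        optimizer.step()
```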
Single-layer convolution #
Switching to even a single layer of convolution immediately caused the model to jump to 55%. As explained in the AlexNet paper, I think this can primarily be attributed to better generalization of what is learned in one particular segment of the image: without convolution and some form of feature consolidation, it’s really hard to divine general relationships that shift across the matrix, like the position of a dog in the image.
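A sketch of what a single-convolution variant might look like (filter counts and sizes here are illustrative, not my exact configuration):

```python
import torch.nn as nn

# One convolution + maxpool in front of the fully connected head. The conv
# filters slide across the image, so a feature like "dog ear" is detected
# regardless of where in the frame it appears.
conv_model = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=5, stride=2, padding=2),  # 224 -> 112
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2),                           # 112 -> 56
    nn.Flatten(),
    nn.Linear(32 * 56 * 56, 512),
    nn.ReLU(),
    nn.Linear(512, 10),
)
```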
60+ epochs is the good range #
Most of the convolution models kept showing gains up to about 45 to 60 epochs. I believe this will be highly dependent on how much training data is available, but it’s interesting to think that running the model against the same data 40+ times is required to hit some sort of ceiling.
dropout added 10% #
The original AlexNet paper suggests that introducing dropout in the final connected layers (2 layers of linear + ReLU combinations) results in a significant training improvement. They theorize this is because the individual “neurons” in the matrices start to embed more generalization into themselves, since they can no longer rely on the constant presence of specific upstream neurons.
In my experience, this increased the accuracy by roughly 10% over the non-dropout version. So clearly, neurons embedding more information is a critical part of better models.
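In PyTorch this is just `nn.Dropout` layers in the classifier head. A sketch of an AlexNet-style head with dropout, assuming the paper’s 4096-wide hidden layers and a 256x6x6 feature map coming out of the convolutional stack:

```python
import torch.nn as nn

# Classifier head with dropout: each training forward pass randomly zeroes
# half the activations, so no neuron can rely on a specific upstream neuron
# always being present. model.eval() disables dropout for inference.
classifier = nn.Sequential(
    nn.Dropout(p=0.5),
    nn.Linear(256 * 6 * 6, 4096),
    nn.ReLU(),
    nn.Dropout(p=0.5),
    nn.Linear(4096, 4096),
    nn.ReLU(),
    nn.Linear(4096, 10),  # 10 imagenette classes instead of ImageNet's 1000
)
```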
flipping image data added 10% accuracy #
There wasn’t much more I could do with the model architecture to increase accuracy, but I got some pointers to go look at the data instead.
The biggest contributor was just flipping the image data half the time: I think this effectively doubles the dataset to some extent, which implies that training data, rather than learning capacity, might now be the bottleneck.
Some other things I tried, which had less of an effect:
- Scaling the data up and cropping didn’t seem to improve much, maybe 1-2% at most.
- Changing exposure and brightness also made maybe a 1-2% difference, although training still ended up in the same range.
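Put together, these augmentation experiments amount to a training-only transform pipeline along these lines (the exact scale and jitter parameters are illustrative):

```python
from torchvision import transforms

# Training-time augmentation: flip left-right half the time, randomly crop
# after a slight upscale, and lightly jitter brightness/contrast. The
# validation split should skip all of this and use a plain resize.
train_transform = transforms.Compose([
    transforms.Resize(256),                                  # scale up...
    transforms.RandomCrop(224),                              # ...then take a random 224x224 crop
    transforms.RandomHorizontalFlip(p=0.5),                  # the single biggest win (~10%)
    transforms.ColorJitter(brightness=0.2, contrast=0.2),    # exposure/brightness tweaks
    transforms.ToTensor(),
])
```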
my current score: 78% #
For imagenette, I was able to get a 78% score after 60 epochs. This is pretty good, given that the original AlexNet, with a much larger dataset and longer training time, reached 85%.
I did not see any overfitting directly (which I’d expect to show up as a validation loss curve trending upward after enough epochs), but I did see the loss plateau after about 45 epochs or so.
Some quick notes:
- Adding RMSNorm in the first 2 layers added roughly 1-2%.
- Originally I had only convolutional layers with stride=2 rather than maxpool; switching to maxpool layers gave a 4-5% bump.
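To illustrate that last point, the change amounts to swapping a stride-2 convolution for a stride-1 convolution followed by a max pool (channel counts here are illustrative):

```python
import torch.nn as nn

# Original downsampling block: the convolution itself does the 2x downsample.
strided_block = nn.Sequential(
    nn.Conv2d(64, 192, kernel_size=3, stride=2, padding=1),
    nn.ReLU(),
)

# Replacement: convolve at stride 1, then let maxpool keep the strongest
# activation in each 2x2 window instead of whichever value happens to land
# on the stride grid.
maxpool_block = nn.Sequential(
    nn.Conv2d(64, 192, kernel_size=3, stride=1, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),
)
```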