Lecture 9 CNN Architectures
Dl.cs231n.lecture| 24 Oct 2018
Tags:
DeepLearning
CS231n
- AlexNet(Krizhevsky, Sutskever, and Hinton; University of Toronto): First large-scale CNN architecture, winner of the 2012 ImageNet (ILSVRC) classification challenge
- Uses CONV, POOL, and fully connected (FC) layers
- In 2014, GoogLeNet(Google) and VGGNet(Oxford) made huge improvements by using deeper networks(22 and 19 layers, respectively)
- VGGNet: Uses smaller filters(3×3) than AlexNet, and stacks 16~19 layers
- Stacking small filters is more effective than using big filters: three stacked 3×3 CONV layers have the same effective receptive field as one 7×7 layer, but add more non-linearities and have fewer params(see the parameter-count sketch after this block)
- High memory/parameter costs: most memory is used by the early CONV layers, most params live in the late FC layers
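A back-of-the-envelope Python sketch of this parameter comparison, assuming C input and C output channels and ignoring biases (C = 256 is just an example value):

```python
# Compare one 7x7 CONV layer with a stack of three 3x3 CONV layers.
# Both have a 7x7 effective receptive field, but the stack is cheaper
# and inserts two extra ReLU non-linearities in between.
C = 256  # example channel count (input depth == output depth)

params_single_7x7 = 7 * 7 * C * C        # 49 * C^2 = 3,211,264
params_three_3x3 = 3 * (3 * 3 * C * C)   # 27 * C^2 = 1,769,472

print(params_single_7x7, params_three_3x3)
```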
- GoogLeNet: Deeper network(22 layers), designed with computational efficiency as a primary concern
- Stacks ‘Inception modules’, each of which combines CONV, POOL, and filter-concatenation operations
- Concept of the Inception module: design a ‘good local network topology’(a network within a network, NiN) and stack those modules on top of each other
- Within each Inception module, several filter operations are applied in parallel and all outputs are concatenated depth-wise
- To deal with the computational-complexity problem(the # of operations and the output depth grow dramatically as Inception modules are stacked), ‘bottleneck layers’ are used: 1×1 CONV layers with fewer filters than the input depth, which combine the input feature maps and reduce depth before the expensive convolutions(see the sketch after this block)
- No FC layers at the end » far fewer params(about 12x fewer than AlexNet)
- Uses auxiliary classifiers on intermediate layers to inject additional gradient at the lower layers(so that the gradient does not shrink)
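A minimal PyTorch sketch of an Inception-style module with 1×1 bottleneck layers before the expensive 3×3 and 5×5 convolutions; the branch channel counts here are illustrative assumptions, not the values from the GoogLeNet paper:

```python
import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    """Illustrative Inception-style module: parallel 1x1, 3x3, 5x5 CONV
    branches plus a pooling branch, with 1x1 bottlenecks reducing depth
    before the expensive convolutions. Channel counts are made up."""
    def __init__(self, in_ch):
        super().__init__()
        self.branch1 = nn.Conv2d(in_ch, 64, kernel_size=1)            # 1x1 branch
        self.branch3 = nn.Sequential(
            nn.Conv2d(in_ch, 32, kernel_size=1),                      # bottleneck
            nn.Conv2d(32, 64, kernel_size=3, padding=1))              # 3x3 conv
        self.branch5 = nn.Sequential(
            nn.Conv2d(in_ch, 16, kernel_size=1),                      # bottleneck
            nn.Conv2d(16, 32, kernel_size=5, padding=2))              # 5x5 conv
        self.branch_pool = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_ch, 32, kernel_size=1))                      # 1x1 after pool

    def forward(self, x):
        # Concatenate all branch outputs along the channel dimension
        return torch.cat([self.branch1(x), self.branch3(x),
                          self.branch5(x), self.branch_pool(x)], dim=1)

x = torch.randn(1, 128, 28, 28)
print(InceptionModule(128)(x).shape)   # torch.Size([1, 192, 28, 28])
```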
- ResNet: Extremely deep network(152 layers) built on residual connections
- Simply stacking ever-deeper plain CNN layers does not work: the deeper plain network ends up with higher training and test error than a shallower one
- Hypothesis of the ResNet authors: this is an optimization problem; deeper models are harder to optimize
- In principle, a deeper model should do at least as well as a shallow one: copy the learned layers from the shallow model and set the additional layers to identity mappings
- That is, the CONV layers fit a residual F(x) = H(x) - x and the block outputs F(x) + x, whereas other models stack layers to fit the desired mapping H(x) directly(see the residual-block sketch after this block)
- Also uses bottleneck layers(1×1 CONV) in the deeper variants to prevent a dramatic increase in the # of operations
- Training details: Batch Normalization, Xavier/2 initialization, SGD+Momentum, decaying learning rate, weight decay, no dropout
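A minimal PyTorch sketch of a basic residual block that fits F(x) and outputs F(x) + x, assuming equal input and output channel counts (the 50/101/152-layer models use a bottleneck variant with additional 1×1 CONVs):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Illustrative basic residual block: two 3x3 CONV layers fit the
    residual F(x); the skip connection adds the input x back, so the
    block outputs F(x) + x. Channel count is kept constant for simplicity."""
    def __init__(self, ch):
        super().__init__()
        self.conv1 = nn.Conv2d(ch, ch, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(ch)
        self.conv2 = nn.Conv2d(ch, ch, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(ch)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        residual = self.relu(self.bn1(self.conv1(x)))   # F(x), first half
        residual = self.bn2(self.conv2(residual))       # F(x), second half
        return self.relu(residual + x)                  # H(x) = F(x) + x

x = torch.randn(1, 64, 56, 56)
print(ResidualBlock(64)(x).shape)   # torch.Size([1, 64, 56, 56])
```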
- Costs of CNN architectures: memory footprint, # of operations (compute time), and power consumption
- Other CNN architectures (for inspiration)
- Network in Network(NiN): Instead of plain CONV layers, uses MLP-CONV layers(‘micronetworks’) to compute more abstract features for local patches
- The micronetwork is a multilayer perceptron implemented with 1×1 CONV layers, which precedes the ‘bottleneck layer’ idea of GoogLeNet and ResNet(see the sketch after this block)
- Philosophical inspiration for the ‘good local network topology’ of GoogLeNet
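A minimal PyTorch sketch of the MLP-CONV idea, where stacked 1×1 CONVs act as a small MLP applied at every spatial location; the channel counts are illustrative assumptions:

```python
import torch
import torch.nn as nn

# Illustrative "mlpconv" layer: a normal CONV followed by 1x1 CONVs,
# which behave like a small MLP applied independently at each spatial
# position of the feature map.
mlpconv = nn.Sequential(
    nn.Conv2d(3, 96, kernel_size=5, padding=2), nn.ReLU(inplace=True),
    nn.Conv2d(96, 96, kernel_size=1), nn.ReLU(inplace=True),   # per-pixel MLP layer
    nn.Conv2d(96, 96, kernel_size=1), nn.ReLU(inplace=True),   # per-pixel MLP layer
)

x = torch.randn(1, 3, 32, 32)
print(mlpconv(x).shape)   # torch.Size([1, 96, 32, 32])
```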
- Improvements on ResNet
- Identity Mappings in Deep Residual Networks: Improves ResNet by redesigning the residual block with pre-activation ordering(BN » ReLU » CONV), i.e., moving the activations onto the residual mapping pathway
- Wide Residual Networks: Argues that the residuals are the important factor, not depth; uses k times more filters than the basic residual block, and gains computational efficiency because width parallelizes better than depth
- ResNeXt(Aggregated Residual Transformations for Deep Neural Networks): Increases the width of the residual block through multiple parallel pathways(so-called cardinality), echoing the philosophy of the Inception module
- ResNet with Stochastic Depth: Randomly drops a subset of residual blocks during each training pass, bypassing them with the identity function(motivation: reduce the vanishing-gradient problem and training time); see the sketch after this item
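A minimal PyTorch sketch of the stochastic-depth idea for a single residual block; the survival probability and block structure are simplified assumptions (the paper uses survival probabilities that decay linearly with depth):

```python
import torch
import torch.nn as nn

class StochasticDepthBlock(nn.Module):
    """Illustrative residual block with stochastic depth: during training,
    with probability (1 - survival_prob) the residual branch F(x) is skipped
    entirely and the block reduces to the identity."""
    def __init__(self, ch, survival_prob=0.8):
        super().__init__()
        self.survival_prob = survival_prob
        self.residual = nn.Sequential(
            nn.Conv2d(ch, ch, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(ch), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(ch))

    def forward(self, x):
        if self.training:
            if torch.rand(1).item() < self.survival_prob:
                return torch.relu(x + self.residual(x))    # block survives
            return x                                        # block dropped: identity
        # At test time the full network is used; the residual branch is
        # scaled by its survival probability
        return torch.relu(x + self.survival_prob * self.residual(x))

x = torch.randn(1, 64, 32, 32)
print(StochasticDepthBlock(64)(x).shape)   # torch.Size([1, 64, 32, 32])
```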
- Beyond ResNet: other network architectures (compared against ResNet)
- FractalNet: Argues that the key factor is transitioning effectively from shallow to deep networks, not residual representations; uses a fractal expansion rule with both shallow and deep paths to the output, and trains with dropout of sub-paths
- DenseNet: Introduces ‘Dense Blocks’ in which each layer is connected to every other layer in a feedforward fashion; this alleviates vanishing gradients, strengthens feature propagation, and encourages feature reuse(see the DenseNet sketch after this list)
- SqueezeNet: Uses ‘Fire modules’, each made of a squeeze layer(1×1 filters) feeding an expand layer(a mix of 1×1 and 3×3 filters); achieves AlexNet-level accuracy with 50x fewer params(see the Fire-module sketch after this list)
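A minimal PyTorch sketch of a dense block, where each layer receives the concatenation of all preceding feature maps; the layer count and growth rate are illustrative assumptions (the actual DenseNet also uses BN and bottleneck layers):

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """Illustrative dense block: each layer sees the concatenation of all
    previous feature maps and contributes growth_rate new channels."""
    def __init__(self, in_ch, growth_rate=32, num_layers=4):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Conv2d(in_ch + i * growth_rate, growth_rate,
                      kernel_size=3, padding=1)
            for i in range(num_layers)])

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            # Each layer gets every preceding feature map, concatenated
            out = torch.relu(layer(torch.cat(features, dim=1)))
            features.append(out)
        return torch.cat(features, dim=1)

x = torch.randn(1, 64, 32, 32)
print(DenseBlock(64)(x).shape)   # torch.Size([1, 192, 32, 32])
```

And a minimal PyTorch sketch of a SqueezeNet-style Fire module (squeeze with 1×1 filters, then expand with parallel 1×1 and 3×3 filters); the channel counts are illustrative assumptions:

```python
import torch
import torch.nn as nn

class FireModule(nn.Module):
    """Illustrative Fire module: a squeeze layer of 1x1 filters reduces the
    channel count, then an expand layer of parallel 1x1 and 3x3 filters
    (concatenated) restores it."""
    def __init__(self, in_ch, squeeze_ch=16, expand_ch=64):
        super().__init__()
        self.squeeze = nn.Conv2d(in_ch, squeeze_ch, kernel_size=1)
        self.expand1x1 = nn.Conv2d(squeeze_ch, expand_ch, kernel_size=1)
        self.expand3x3 = nn.Conv2d(squeeze_ch, expand_ch, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        s = self.relu(self.squeeze(x))
        # Concatenate the 1x1 and 3x3 expand branches along channels
        return torch.cat([self.relu(self.expand1x1(s)),
                          self.relu(self.expand3x3(s))], dim=1)

x = torch.randn(1, 96, 55, 55)
print(FireModule(96)(x).shape)   # torch.Size([1, 128, 55, 55])
```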