Lecture 9 CNN Architectures
Dl.cs231n.lecture| 24 Oct 2018
Tags:
DeepLearning
CS231n
- AlexNet(Krizhevsky, Sutskever, and Hinton; University of Toronto): First large-scale CNN architecture, winner of the 2012 ImageNet (ILSVRC) classification challenge
- Uses CONV, POOL, and fully connected (FC) layers
- In 2014, GoogLeNet(Google) and VGGNet(Oxford) made huge improvements by using deeper networks(22 and 19 layers, respectively)
- VGGNet: Uses smaller filters(3×3) than AlexNet, and stacks 16~19 layers
- Stacking small filters is more effective than using big filters: three stacked 3×3 CONV layers have the same effective receptive field as one 7×7 layer, but add more non-linearities and have fewer params(see the parameter-count sketch after this block)
- High memory/parameter costs: most memory is used by the early CONV layers, most params live in the late FC layers
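A back-of-the-envelope Python sketch of this parameter comparison, assuming C input and C output channels and ignoring biases (C = 256 is just an example value):

```python
# Compare one 7x7 CONV layer with a stack of three 3x3 CONV layers.
# Both have a 7x7 effective receptive field, but the stack is cheaper
# and inserts two extra ReLU non-linearities in between.
C = 256  # example channel count (input depth == output depth)

params_single_7x7 = 7 * 7 * C * C        # 49 * C^2 = 3,211,264
params_three_3x3 = 3 * (3 * 3 * C * C)   # 27 * C^2 = 1,769,472

print(params_single_7x7, params_three_3x3)
```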
- GoogLeNet: Deeper network(22 layers), designed with computational efficiency as a primary concern
- Stacks ‘Inception modules’, each of which combines CONV, POOL, and filter-concatenation operations
- Concept of the Inception module: design a ‘good local network topology’(a network within a network, NiN) and stack those modules on top of each other
- Within each Inception module, several filter operations are applied in parallel and all outputs are concatenated depth-wise
- To deal with the computational-complexity problem(the # of operations and the output depth grow dramatically as Inception modules are stacked), ‘bottleneck layers’ are used: 1×1 CONV layers with fewer filters than the input depth, which combine the input feature maps and reduce depth before the expensive convolutions(see the sketch after this block)
- No FC layers at the end » far fewer params(about 12x fewer than AlexNet)
- Uses auxiliary classifiers on intermediate layers to inject additional gradient at the lower layers(so that the gradient does not shrink)
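A minimal PyTorch sketch of an Inception-style module with 1×1 bottleneck layers before the expensive 3×3 and 5×5 convolutions; the branch channel counts here are illustrative assumptions, not the values from the GoogLeNet paper:

```python
import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    """Illustrative Inception-style module: parallel 1x1, 3x3, 5x5 CONV
    branches plus a pooling branch, with 1x1 bottlenecks reducing depth
    before the expensive convolutions. Channel counts are made up."""
    def __init__(self, in_ch):
        super().__init__()
        self.branch1 = nn.Conv2d(in_ch, 64, kernel_size=1)            # 1x1 branch
        self.branch3 = nn.Sequential(
            nn.Conv2d(in_ch, 32, kernel_size=1),                      # bottleneck
            nn.Conv2d(32, 64, kernel_size=3, padding=1))              # 3x3 conv
        self.branch5 = nn.Sequential(
            nn.Conv2d(in_ch, 16, kernel_size=1),                      # bottleneck
            nn.Conv2d(16, 32, kernel_size=5, padding=2))              # 5x5 conv
        self.branch_pool = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_ch, 32, kernel_size=1))                      # 1x1 after pool

    def forward(self, x):
        # Concatenate all branch outputs along the channel dimension
        return torch.cat([self.branch1(x), self.branch3(x),
                          self.branch5(x), self.branch_pool(x)], dim=1)

x = torch.randn(1, 128, 28, 28)
print(InceptionModule(128)(x).shape)   # torch.Size([1, 192, 28, 28])
```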
- ResNet: Extremely deep network(152 layers) built on residual connections
- Simply stacking ever-deeper plain CNN layers does not work: the deeper plain network ends up with higher training and test error than a shallower one
- Hypothesis of the ResNet authors: this is an optimization problem; deeper models are harder to optimize
- In principle, a deeper model should do at least as well as a shallow one: copy the learned layers from the shallow model and set the additional layers to identity mappings
- That is, the CONV layers fit a residual F(x) = H(x) - x and the block outputs F(x) + x, whereas other models stack layers to fit the desired mapping H(x) directly(see the residual-block sketch after this block)
- Also uses bottleneck layers(1×1 CONV) in the deeper variants to prevent a dramatic increase in the # of operations
- Training details: Batch Normalization, Xavier/2 initialization, SGD+Momentum, decaying learning rate, weight decay, no dropout
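A minimal PyTorch sketch of a basic residual block that fits F(x) and outputs F(x) + x, assuming equal input and output channel counts (the 50/101/152-layer models use a bottleneck variant with additional 1×1 CONVs):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Illustrative basic residual block: two 3x3 CONV layers fit the
    residual F(x); the skip connection adds the input x back, so the
    block outputs F(x) + x. Channel count is kept constant for simplicity."""
    def __init__(self, ch):
        super().__init__()
        self.conv1 = nn.Conv2d(ch, ch, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(ch)
        self.conv2 = nn.Conv2d(ch, ch, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(ch)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        residual = self.relu(self.bn1(self.conv1(x)))   # F(x), first half
        residual = self.bn2(self.conv2(residual))       # F(x), second half
        return self.relu(residual + x)                  # H(x) = F(x) + x

x = torch.randn(1, 64, 56, 56)
print(ResidualBlock(64)(x).shape)   # torch.Size([1, 64, 56, 56])
```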
- Costs of CNN architectures: memory footprint, # of operations (compute time), and power consumption
- Other CNN architectures (for inspiration)
- Network in Network(NiN): Instead of plain CONV layers, uses MLP-CONV layers(‘micronetworks’) to compute more abstract features for local patches
- The micronetwork is a multilayer perceptron implemented with 1×1 CONV layers, which precedes the ‘bottleneck layer’ idea of GoogLeNet and ResNet(see the sketch after this block)
- Philosophical inspiration for the ‘good local network topology’ of GoogLeNet
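A minimal PyTorch sketch of the MLP-CONV idea, where stacked 1×1 CONVs act as a small MLP applied at every spatial location; the channel counts are illustrative assumptions:

```python
import torch
import torch.nn as nn

# Illustrative "mlpconv" layer: a normal CONV followed by 1x1 CONVs,
# which behave like a small MLP applied independently at each spatial
# position of the feature map.
mlpconv = nn.Sequential(
    nn.Conv2d(3, 96, kernel_size=5, padding=2), nn.ReLU(inplace=True),
    nn.Conv2d(96, 96, kernel_size=1), nn.ReLU(inplace=True),   # per-pixel MLP layer
    nn.Conv2d(96, 96, kernel_size=1), nn.ReLU(inplace=True),   # per-pixel MLP layer
)

x = torch.randn(1, 3, 32, 32)
print(mlpconv(x).shape)   # torch.Size([1, 96, 32, 32])
```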
- Improvements on ResNet
- Identity Mappings in Deep Residual Networks: Improves ResNet by redesigning the residual block with pre-activation ordering(BN » ReLU » CONV), i.e., moving the activations onto the residual mapping pathway
- Wide Residual Networks: Argues that the residuals are the important factor, not depth; uses k times more filters than the basic residual block, and gains computational efficiency because width parallelizes better than depth
- ResNeXt(Aggregated Residual Transformations for Deep Neural Networks): Increases the width of the residual block through multiple parallel pathways(so-called cardinality), echoing the philosophy of the Inception module
- ResNet with Stochastic Depth: Randomly drops a subset of residual blocks during each training pass, bypassing them with the identity function(motivation: reduce the vanishing-gradient problem and training time); see the sketch after this item
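A minimal PyTorch sketch of the stochastic-depth idea for a single residual block; the survival probability and block structure are simplified assumptions (the paper uses survival probabilities that decay linearly with depth):

```python
import torch
import torch.nn as nn

class StochasticDepthBlock(nn.Module):
    """Illustrative residual block with stochastic depth: during training,
    with probability (1 - survival_prob) the residual branch F(x) is skipped
    entirely and the block reduces to the identity."""
    def __init__(self, ch, survival_prob=0.8):
        super().__init__()
        self.survival_prob = survival_prob
        self.residual = nn.Sequential(
            nn.Conv2d(ch, ch, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(ch), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(ch))

    def forward(self, x):
        if self.training:
            if torch.rand(1).item() < self.survival_prob:
                return torch.relu(x + self.residual(x))    # block survives
            return x                                        # block dropped: identity
        # At test time the full network is used; the residual branch is
        # scaled by its survival probability
        return torch.relu(x + self.survival_prob * self.residual(x))

x = torch.randn(1, 64, 32, 32)
print(StochasticDepthBlock(64)(x).shape)   # torch.Size([1, 64, 32, 32])
```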
- Beyond ResNet: other network architectures (compared against ResNet)
- FractalNet: Argues that the key factor is transitioning effectively from shallow to deep networks, not residual representations; uses a fractal expansion rule with both shallow and deep paths to the output, and trains with dropout of sub-paths
- DenseNet: Introduces ‘Dense Blocks’ in which each layer is connected to every other layer in a feedforward fashion; this alleviates vanishing gradients, strengthens feature propagation, and encourages feature reuse(see the DenseNet sketch after this list)
- SqueezeNet: Uses ‘Fire modules’, each made of a squeeze layer(1×1 filters) feeding an expand layer(a mix of 1×1 and 3×3 filters); achieves AlexNet-level accuracy with 50x fewer params(see the Fire-module sketch after this list)
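A minimal PyTorch sketch of a dense block, where each layer receives the concatenation of all preceding feature maps; the layer count and growth rate are illustrative assumptions (the actual DenseNet also uses BN and bottleneck layers):

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """Illustrative dense block: each layer sees the concatenation of all
    previous feature maps and contributes growth_rate new channels."""
    def __init__(self, in_ch, growth_rate=32, num_layers=4):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Conv2d(in_ch + i * growth_rate, growth_rate,
                      kernel_size=3, padding=1)
            for i in range(num_layers)])

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            # Each layer gets every preceding feature map, concatenated
            out = torch.relu(layer(torch.cat(features, dim=1)))
            features.append(out)
        return torch.cat(features, dim=1)

x = torch.randn(1, 64, 32, 32)
print(DenseBlock(64)(x).shape)   # torch.Size([1, 192, 32, 32])
```

And a minimal PyTorch sketch of a SqueezeNet-style Fire module (squeeze with 1×1 filters, then expand with parallel 1×1 and 3×3 filters); the channel counts are illustrative assumptions:

```python
import torch
import torch.nn as nn

class FireModule(nn.Module):
    """Illustrative Fire module: a squeeze layer of 1x1 filters reduces the
    channel count, then an expand layer of parallel 1x1 and 3x3 filters
    (concatenated) restores it."""
    def __init__(self, in_ch, squeeze_ch=16, expand_ch=64):
        super().__init__()
        self.squeeze = nn.Conv2d(in_ch, squeeze_ch, kernel_size=1)
        self.expand1x1 = nn.Conv2d(squeeze_ch, expand_ch, kernel_size=1)
        self.expand3x3 = nn.Conv2d(squeeze_ch, expand_ch, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        s = self.relu(self.squeeze(x))
        # Concatenate the 1x1 and 3x3 expand branches along channels
        return torch.cat([self.relu(self.expand1x1(s)),
                          self.relu(self.expand3x3(s))], dim=1)

x = torch.randn(1, 96, 55, 55)
print(FireModule(96)(x).shape)   # torch.Size([1, 128, 55, 55])
```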