READING NOTES: Practitioners’ Guide to CNN on sentence classification (Zhang and Wallace)

Reading notes on A Sensitivity Analysis of (and Practitioners’ Guide to) Convolutional Neural Networks for Sentence Classification.

Main Idea:

How the sentence classification performance of one-layer CNNs is affected, across 9 datasets, by the following architecture components:

  • input word vector representations
  • filter region size
  • number of feature maps
  • activation functions
  • pooling strategy
  • regularization

CNN Architecture

  • convolution: slide filters over the sentence matrix, whose rows are the word vectors; three filter region sizes with 2 filters each give 6 feature maps ->
  • max-pooling: take the maximum of each feature map ->
  • concatenation: feature vector = concatenation of the 6 max values ->
  • softmax: softmax layer for classification
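
The pipeline above can be sketched as a forward pass in plain Python. This is only an illustration, not the authors' code: the region sizes (2, 3, 4), the 5-dimensional word vectors, the random weights, the choice of ReLU and all function names are assumptions made for the example.

```python
import math
import random

def conv_feature_map(sentence, filt, bias=0.0):
    """Slide one filter (h x d) over the sentence matrix (n x d), applying ReLU."""
    h, n = len(filt), len(sentence)
    feature_map = []
    for start in range(n - h + 1):
        total = bias
        for i in range(h):
            total += sum(a * b for a, b in zip(sentence[start + i], filt[i]))
        feature_map.append(max(0.0, total))  # ReLU activation
    return feature_map

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def forward(sentence, filters, out_weights, out_bias):
    # 1-max pooling: keep only the largest value of each feature map,
    # then concatenate the maxima into one feature vector.
    pooled = [max(conv_feature_map(sentence, f)) for f in filters]
    scores = [sum(w * x for w, x in zip(row, pooled)) + b
              for row, b in zip(out_weights, out_bias)]
    return pooled, softmax(scores)

rng = random.Random(0)
n_words, dim, n_classes = 7, 5, 2  # toy sizes, chosen for illustration
sentence = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(n_words)]
# Three region sizes with 2 filters each -> 6 feature maps.
filters = [[[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(h)]
           for h in (2, 3, 4) for _ in range(2)]
out_weights = [[rng.uniform(-1, 1) for _ in range(len(filters))]
               for _ in range(n_classes)]
out_bias = [0.0] * n_classes

pooled, probs = forward(sentence, filters, out_weights, out_bias)
```

Because each feature map is reduced to a single number, the length of the feature vector depends only on the number of filters, not on the sentence length.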

Experiments and Results

  • 9 sentence classification datasets (7 of them also used in Kim (2014))
  • 10-fold cross validation on all datasets
  • Effects of factor:
    • input word vectors:
      • one-hot vectors perform worse than embeddings on sentence classification (one-hot input was originally used for document classification);
      • embeddings trained from scratch may be best.
    • filter sizes
      • a coarse grid search over region sizes can work
      • combining several filter region sizes close to the optimal one may help
      • number of feature maps per region size: depends on the dataset (100-600); too many may overfit
    • activation function
      • Best: ReLU, tanh, Iden (no activation)
      • with multiple hidden layers, consider ReLU and tanh
    • pooling
      • global (1-max) pooling works better than local max pooling, k-max pooling, and average pooling
    • regularization: does not help much
      • small dropout rate (0-0.5): effect depends on the dataset
      • large \(l2\) norm constraint
      • increase the number of feature maps to see if it helps
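
The evaluation protocol and the filter-size advice above can be sketched as follows. This is a minimal sketch under stated assumptions: `k_fold_indices`, `coarse_then_fine`, the coarse grid and the candidate combinations are hypothetical, and `toy_evaluate` is a placeholder scoring function, not a result from the paper. In practice `evaluate` would train the CNN and return its mean 10-fold accuracy.

```python
import random

def k_fold_indices(n_examples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    idx = list(range(n_examples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k roughly equal folds
    for i in range(k):
        test = folds[i]
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test

def coarse_then_fine(evaluate, coarse_sizes=(1, 3, 5, 7, 10)):
    """Find the best single region size on a coarse grid,
    then compare a few combinations of sizes close to it."""
    best_single = max(coarse_sizes, key=lambda s: evaluate((s,)))
    candidates = [
        (best_single,),
        (best_single,) * 3,
        (max(1, best_single - 1), best_single, best_single + 1),
        (best_single, best_single + 1, best_single + 2),
    ]
    best_combo = max(candidates, key=evaluate)
    return best_single, best_combo

def toy_evaluate(region_sizes):
    # Placeholder score that pretends sizes near 7 work best.
    # A real version would return mean cross-validated accuracy.
    return -sum(abs(s - 7) for s in region_sizes) / len(region_sizes)

splits = list(k_fold_indices(25, k=10))
best_single, best_combo = coarse_then_fine(toy_evaluate)
```

Every example appears in exactly one test fold, so each configuration is scored on all of the data once.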


Well, hyperparameter tuning is always time-consuming and sometimes even frustrating. Though the best settings are task-specific, the baseline configuration and suggestions included in the paper could be useful for practitioners on this task:

  • input word vectors: pre-trained word2vec or GloVe
  • filter region size: (3,4,5)
  • feature maps: 100
  • activation function: ReLU
  • pooling: 1-max pooling
  • dropout rate: 0.5
  • l2 norm constraint: 3
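
As a starting point, the settings above could be written down as a plain configuration, together with the two regularizers they refer to. This is a hedged sketch: the dictionary keys and function names are made up for illustration, the l2 norm constraint is implemented as rescaling a weight vector whose norm exceeds the threshold, and the dropout shown is the common "inverted" variant, which may differ from the exact scheme used in the paper.

```python
import math
import random

# Hypothetical key names; the values follow the list above.
BASELINE = {
    "word_vectors": "pretrained",
    "filter_region_sizes": (3, 4, 5),
    "feature_maps_per_size": 100,
    "activation": "relu",
    "pooling": "1-max",
    "dropout_rate": 0.5,
    "l2_norm_constraint": 3.0,
}

def clip_l2_norm(weights, max_norm):
    """Rescale a weight vector so that its l2 norm is at most max_norm."""
    norm = math.sqrt(sum(w * w for w in weights))
    if norm <= max_norm:
        return list(weights)
    scale = max_norm / norm
    return [w * scale for w in weights]

def dropout(features, rate, rng):
    """Inverted dropout (assumes rate < 1): zero each feature with
    probability `rate` and scale the survivors by 1 / (1 - rate)."""
    keep = 1.0 - rate
    return [x / keep if rng.random() < keep else 0.0 for x in features]

# With 1-max pooling, the feature vector has one entry per feature map.
feature_vector_size = (len(BASELINE["filter_region_sizes"])
                       * BASELINE["feature_maps_per_size"])

clipped = clip_l2_norm([3.0, 4.0], BASELINE["l2_norm_constraint"])
dropped = dropout([1.0] * 8, BASELINE["dropout_rate"], random.Random(0))
```

With this configuration the feature vector fed to the softmax layer has 3 x 100 = 300 entries, which is where dropout would be applied during training.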
