Ensemble (Summary)

1. Ensemble Learning
  • A set of models is integrated in some way to obtain the final prediction
  • Homogeneous : It uses only one induction algorithm
    • SVM1 + SVM2 + SVM3 ...
  • Heterogeneous : It uses different induction algorithms
    • SVM + DT + DNN + Bayesian ...
  • Adds complexity : appears to violate Occam's Razor
    • In practice, however, the combined decision boundary may become simpler than that of any single model
  • Data manipulation : Changes the training set in order to obtain different models
    • Manipulating the input features
    • Sub-sampling from the training set
  • Modeling process manipulation : Changes the induction algorithm
    • Manipulating the parameter sets
    • Manipulating the induction algorithm

  • Combine Models
    • Algebraic method : Average, Weighted Average, etc.
    • Voting method : Majority Voting, Weighted Majority Voting, etc.

  • Base Models : The base classifiers should be as accurate as possible and have diverse errors, with each classifier providing some positive evidence
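The two combiner families above can be sketched in a few lines. This is a minimal illustration, not a library API; the function names `majority_vote` and `weighted_average` are my own.

```python
from collections import Counter

def majority_vote(predictions):
    """Voting combiner: return the most common class label
    among the base models' predictions."""
    return Counter(predictions).most_common(1)[0][0]

def weighted_average(scores, weights):
    """Algebraic combiner: weighted average of base-model scores."""
    total = sum(w * s for w, s in zip(weights, scores))
    return total / sum(weights)

# Three base classifiers vote on a class label
print(majority_vote(["cat", "dog", "cat"]))              # -> cat
# Weighted average of three score outputs: (2*0.9 + 0.6 + 0.3) / 4
print(weighted_average([0.9, 0.6, 0.3], [2, 1, 1]))      # -> 0.675
```

Weighted variants (weighted average, weighted majority voting) simply replace the uniform weights with per-model weights, e.g. proportional to validation accuracy.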


2. Bagging (Bootstrap AGGregatING)
  • Averaging the prediction over a collection of predictors generated from bootstrap samples
  • On noisy data: not considerably worse than a single model, and more robust
  • Needs unstable (high-variance) classifier types, so that bootstrap samples produce diverse models
  • Decision trees are a typical unstable classifier $\rightarrow$ Random Forest
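The bagging loop can be sketched as follows. This is a toy illustration with assumed helper names (`bootstrap_sample`, `bagging_predict`); the "learner" here just predicts the mean of its training targets, standing in for any real unstable model.

```python
import random

def bootstrap_sample(data, rng):
    """Draw len(data) points with replacement (a bootstrap sample)."""
    return [rng.choice(data) for _ in range(len(data))]

def bagging_predict(train, x, fit, n_models=25, seed=0):
    """Fit one model per bootstrap sample and average their predictions."""
    rng = random.Random(seed)
    preds = [fit(bootstrap_sample(train, rng))(x) for _ in range(n_models)]
    return sum(preds) / len(preds)

# Toy 'learner': predicts the mean of the sampled targets, ignoring x
fit_mean = lambda sample: (lambda x: sum(t for _, t in sample) / len(sample))
data = [(0, 1.0), (1, 2.0), (2, 3.0)]
print(bagging_predict(data, x=None, fit=fit_mean))
```

For classification, the final `sum/len` average would be replaced by a majority vote over the models' predicted labels.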


3. Boosting (AdaBoost : Adaptive Boosting)
  • Weighted vote over a collection of classifiers trained sequentially, each on a training distribution that gives priority to instances wrongly classified by the previous ones
  • Focus on difficult examples which are not correctly classified in the previous steps
  • Using Different Data Distribution
    • Start with uniform weighting
    • During each step of learning
      • Not correctly learned by the weak learner $\rightarrow$ Increase weights
      • Correctly learned by the weak learner $\rightarrow$ Decrease weights
      • (Since the weights are relative, applying only one of the two adjustments is sufficient)
  • Risks overfitting the model to misclassified data $\rightarrow$ Use weighted sum/vote
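One AdaBoost reweighting round can be written out concretely. This is a minimal sketch of the standard update (classifier weight $\alpha = \frac{1}{2}\ln\frac{1-\epsilon}{\epsilon}$ from the weighted error $\epsilon$, then multiplicative up/down-weighting and renormalization); the function name is illustrative.

```python
import math

def adaboost_reweight(weights, correct, error):
    """One AdaBoost round: compute the classifier weight alpha from the
    weighted error, then up-weight misclassified examples, down-weight
    correct ones, and re-normalize so the weights sum to 1."""
    alpha = 0.5 * math.log((1 - error) / error)
    new = [w * math.exp(-alpha if ok else alpha)
           for w, ok in zip(weights, correct)]
    total = sum(new)
    return [w / total for w in new], alpha

# Start with uniform weighting; example 2 is misclassified (weighted error 0.25)
w, alpha = adaboost_reweight([0.25] * 4, [True, True, False, True], error=0.25)
print([round(x, 3) for x in w])  # misclassified example now carries half the mass
```

After normalization the misclassified example holds weight 1/2 and each correct one 1/6, which is why each subsequent weak learner must "focus on the difficult examples". The final prediction is the $\alpha$-weighted vote of all rounds.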


4. Random Forest
  • A variation of the bagging algorithm
  • Classification : each tree votes and the most popular class is returned
  • Regression : the result is the averaged prediction of all generated trees
  • Construct Random Forest
    • Forest-RI (random input selection) : randomly select a subset of attributes as candidates for the split at each node
    • Forest-RC (random linear combinations): new attributes that are a linear combination of the existing attributes (reduces the correlation between individual classifiers)
  • Faster than bagging or boosting
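The Forest-RI idea at a single node can be sketched as below. This is an assumption-laden toy (the helper name `ri_candidates` is mine); the subset size $\sqrt{F}$ is a common default for classification, not the only choice.

```python
import math
import random

def ri_candidates(n_features, rng):
    """Forest-RI: at each tree node, draw a random subset of feature
    indices (here sqrt(n_features)) as the only split candidates."""
    k = max(1, int(math.sqrt(n_features)))
    return rng.sample(range(n_features), k)

rng = random.Random(42)
print(ri_candidates(16, rng))  # 4 distinct indices drawn from 0..15
```

Restricting each split to a small random candidate set is what decorrelates the trees, and it is also why Random Forest is faster than plain bagging or boosting: each node evaluates only $k \ll F$ attributes.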


5. Other Combining Methods
  • Mixture of Experts : Combine votes or scores
  • Stacking : Combiner $f()$ is another learner
  • Cascading : Use the next classifier in the sequence only when the previous decision is not confident enough
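Stacking and cascading differ only in how the base models are consulted, which a short sketch makes concrete. All names here are illustrative, the "models" are toy lambdas, and the confidence threshold is an assumed parameter.

```python
def stack_predict(base_models, combiner, x):
    """Stacking: the base models' outputs become the input features
    of the combiner f(), which is itself a learned model."""
    meta_features = [m(x) for m in base_models]
    return combiner(meta_features)

def cascade_predict(models, x, threshold=0.8):
    """Cascading: fall through to the next classifier only while the
    current one's confidence is below the threshold."""
    for model in models:
        label, confidence = model(x)
        if confidence >= threshold:
            return label
    return label  # fall back to the last model's decision

# Toy stack: two base scorers, combiner averages their outputs
base = [lambda x: x * 2, lambda x: x + 1]
avg = lambda feats: sum(feats) / len(feats)
print(stack_predict(base, avg, 3))        # (6 + 4) / 2 = 5.0

# Toy cascade: the first model is unsure (0.6 < 0.8), so the second decides
models = [lambda x: ("A", 0.6), lambda x: ("B", 0.9)]
print(cascade_predict(models, None))      # -> B
```

In practice the stacking combiner is trained on out-of-fold base-model predictions to avoid leaking the training labels, and cascades order the classifiers from cheapest to most expensive.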









