# American Institute of Mathematical Sciences

January 2017, 2(1): 69-75. doi: 10.3934/bdia.2017009

## Multiple-instance learning for text categorization based on semantic representation

 National Key Laboratory for Novel Software Technology, Nanjing University, China

Published September 2017

Text categorization is a fundamental building block for other research in NLP. Researchers have proposed many effective text categorization methods that achieve good performance. However, these methods generally rely on raw or low-level features, e.g., tf or tf-idf, and neglect the semantic structure between words. Complex semantic information can affect the precision of text categorization. In this paper, we propose a new method that handles the semantic correlations between words and text features through both the representation and the learning scheme. We represent each document as multiple instances based on word2vec. Experiments validate the effectiveness of the proposed method compared with state-of-the-art text categorization methods.
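
To make the multiple-instance representation concrete, here is a minimal sketch of turning one document into a bag of instances. The abstract does not say how instances are formed, so this sketch assumes one instance per sentence, computed as the mean of that sentence's word vectors. The names `TOY_VECTORS`, `sentence_to_instance`, and `document_to_bag` are hypothetical, and the tiny hand-written vectors stand in for a trained word2vec model.

```python
import re

# Hypothetical 3-dimensional vectors standing in for a trained word2vec model.
TOY_VECTORS = {
    "engine": [0.9, 0.1, 0.0],
    "car": [0.8, 0.2, 0.0],
    "stock": [0.1, 0.9, 0.0],
    "market": [0.0, 0.8, 0.2],
}


def sentence_to_instance(sentence, vectors):
    """Average the vectors of the known words; return None if no word is known."""
    words = re.findall(r"[a-z0-9]+", sentence.lower())
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def document_to_bag(text, vectors):
    """Split a document into sentences and map each one to an instance vector."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    instances = (sentence_to_instance(s, vectors) for s in sentences)
    # Sentences with no known word contribute no instance (an assumption of this sketch).
    return [inst for inst in instances if inst is not None]


bag = document_to_bag(
    "The car engine failed. The stock market fell. Nothing else.", TOY_VECTORS
)
print(len(bag), bag)
```

Under these assumptions a document becomes a variable-length list of fixed-length vectors, and the category label is attached to the bag as a whole rather than to any single sentence.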

Citation: Jian-Bing Zhang, Yi-Xin Sun, De-Chuan Zhan. Multiple-instance learning for text categorization based on semantic representation. Big Data & Information Analytics, 2017, 2 (1) : 69-75. doi: 10.3934/bdia.2017009

Figure: The structure of Bag-of-Words and Skip-Gram

Figure: Pseudo-code for mi-SVM
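
As a rough illustration of the mi-SVM heuristic named in the caption above, the following sketch alternates between training a classifier on instance labels and relabelling the instances of positive bags. It is not the authors' pseudo-code. It assumes the commonly described form of the heuristic (instances start with their bag's label, negative bags stay all-negative, every positive bag keeps at least one positive instance), and it swaps in a simple linear SVM trained by sub-gradient descent so that it runs without external libraries. All function names, hyperparameters, and the 2-D toy data are hypothetical.

```python
def train_linear_svm(X, y, lam=0.1, epochs=100, lr=0.1):
    """Fit w, b by sub-gradient descent on the L2-regularised hinge loss."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for x, label in zip(X, y):
            margin = label * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            # Shrink w (regulariser), then step along the hinge sub-gradient if violated.
            w = [wi * (1 - lr * lam) for wi in w]
            if margin < 1:
                w = [wi + lr * label * xi for wi, xi in zip(w, x)]
                b += lr * label
    return w, b


def decision(w, b, x):
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def mi_svm(bags, bag_labels, max_rounds=20):
    """bags: list of lists of instance vectors; bag_labels: +1 or -1 per bag."""
    # Start by giving every instance the label of its bag.
    labels = [[lab] * len(bag) for bag, lab in zip(bags, bag_labels)]
    for _ in range(max_rounds):
        X = [x for bag in bags for x in bag]
        y = [l for bag_l in labels for l in bag_l]
        w, b = train_linear_svm(X, y)
        new_labels = []
        for bag, lab in zip(bags, bag_labels):
            if lab < 0:
                new_labels.append([-1] * len(bag))  # negative bags stay all-negative
                continue
            scores = [decision(w, b, x) for x in bag]
            relabelled = [1 if s > 0 else -1 for s in scores]
            if 1 not in relabelled:
                # Every positive bag must keep at least one positive instance.
                relabelled[scores.index(max(scores))] = 1
            new_labels.append(relabelled)
        if new_labels == labels:  # instance labels are stable: stop
            break
        labels = new_labels
    return w, b, labels


def predict_bag(w, b, bag):
    """A bag is predicted positive if its highest-scoring instance is positive."""
    return 1 if max(decision(w, b, x) for x in bag) > 0 else -1


# Toy data: positive bags mix one "on-topic" instance with an off-topic one.
bags = [
    [[2.0, 2.0], [-2.0, -2.0]],
    [[2.2, 1.8], [-1.8, -2.2]],
    [[-2.0, -1.8], [-2.2, -2.0]],
    [[-1.9, -2.1], [-2.0, -2.2]],
    [[-2.1, -1.9], [-1.8, -2.0]],
]
bag_labels = [1, 1, -1, -1, -1]
w, b, labels = mi_svm(bags, bag_labels)
print(labels)
print(predict_bag(w, b, [[1.9, 2.1], [-2.0, -2.1]]))
```

In a text setting, each bag would be one document and each instance one of its word2vec-based vectors. A kernel SVM from an established library could replace `train_linear_svm` without changing the outer relabelling loop.
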
Table: Results of experiments on sougouC

| Model | car | finance | IT | health | sport |
| --- | --- | --- | --- | --- | --- |
| SVM + TF-IDF | 0.8473 | 0.8420 | 0.8363 | 0.8326 | 0.8737 |
| SVM + Word2vec | 0.9303 | 0.8571 | 0.8755 | 0.9163 | 0.9828 |
| mi-SVM + Word2vec | 0.9599 | 0.8904 | 0.8943 | 0.9325 | 0.9842 |

Table: Results of experiments on 20newsgroup

| Model | SVM+tf-idf | SVM+Word2vec | mi-SVM+Word2vec |
| --- | --- | --- | --- |
| Average | 0.8508 | 0.8421 | 0.8619 |
