Improved Text Classification using Long Short-Term Memory and Word Embedding Technique

Text classification is an important problem for spam filtering, sentiment analysis, news filtering, document organization, document retrieval and many more. The complexity of text classification increases with the number of classes and training samples. The main objective of this paper is to improve the accuracy of text classification using long short-term memory (LSTM) with word embedding. Experiments are conducted on seven benchmark datasets, namely IMDB, Amazon review full score, Amazon review polarity, Yelp review polarity, AG news topic classification, Yahoo! Answers topic classification and DBpedia ontology classification, with different numbers of classes and training samples. Different experiments are conducted to evaluate the effect of each parameter on LSTM. Results show that 100 batch size, 50 epochs, the Adagrad optimizer, 5 hidden nodes, 100-word vector length, 2 LSTM layers, 0.001 L2 regularization and 0.001 learning rate give the highest accuracy. The results of LSTM are compared with the literature. For the IMDB, Amazon review full score and Yahoo! Answers topic classification datasets, the results obtained are better than those in the literature. The results of LSTM for Amazon review polarity, Yelp review polarity and AG news topic classification are close to the best-known results. For the DBpedia ontology classification dataset, the accuracy is more than 91% but less than the best known.


Introduction
Text classification is the task of assigning text to appropriate categories based on its content. Due to the large and growing amount of text data, automatic text classification methods are receiving more and more attention from the research community. Although many efforts have been made in this regard, it remains an open problem [1]. The World Wide Web needs efficient and effective classification algorithms to help people navigate and browse online documents quickly [2]. Text classification is used in various areas such as email classification and spam filtering [3], sentiment analysis [4], opinion and topic detection [5], author identification [6] and language identification [7], news filtering and organization [8], document organization and retrieval [9], etc.
Various techniques have been designed for text classification. Key methods commonly used for text classification include support vector machines (SVM) [10], decision trees (DT) [3], pattern (rule)-based methods, neural networks (NN) [11], Bayesian (generative) methods [12], k-nearest neighbour (KNN) [13], etc. Compared with other supervised machine learning algorithms, the SVM classifier is one of the most effective text classification methods.
Neural networks are machine learning (ML) models inspired by the human brain. They consist of many neurons that form a large network. Neural networks have a flexible architecture, with a variable number of nodes per layer and different numbers of weights and hidden layers. Neural networks with multiple hidden layers are called deep learning [11]. There are many types of deep learning models that can be used for classification, such as recurrent neural networks (RNN), convolutional neural networks (CNN), multilayer perceptrons, etc. The advantage of RNN is that the previous state is used to calculate the current state. However, simple RNNs have problems carrying information across long sequences. LSTM, an RNN with extra long-term memory proposed in 1997, is the solution to this problem. LSTM is one of the most successful RNN variants and has been developed for controlling robots, natural language text compression, automatic speech recognition, time series prediction, handwriting recognition, document classification and many more. It can prove equally good in the process of document classification [14].
In the literature, different text classifiers are investigated for text classification, such as naïve Bayes, SVM, logistic regression, stochastic gradient descent, NN and hybrid models of these. The main objective of this paper is to improve text classification using long short-term memory. The paper presents results of LSTM with Word2Vec for text classification on seven benchmark datasets, and investigates the effect of different parameters of LSTM.
The rest of this paper is organized as follows. Section 2 gives a brief review of related work. Section 3 discusses the proposed methodology. Experimental details, results and discussion are presented in Section 4. Finally, Section 5 presents our conclusions.

Literature Review
This section reviews research carried out for text classification using various machine learning (ML) based approaches and LSTM.
The KNN classification algorithm is given in [13]. KNN is a frequently used text classification technique. This method works well even when classifying multi-category documents. The limitation of KNN is that it needs more time to classify objects when given many training examples. RA, in contrast, has high computational efficiency and fast learning speed [15].
The NB classifier is based on Bayes' theorem with strong independence assumptions. The algorithm calculates the posterior probabilities that a document belongs to different classes and assigns the document to the class with the highest posterior probability [12]. It handles numerical and textual data extremely well. The disadvantage of the NB classification method is that its classification performance is relatively low compared with other ML algorithms [3].
The DT text classifier is built using a "divide and conquer" strategy. The DT checks whether all training examples have the identical label; if not, it selects a term to partition the mixed-class documents according to their values for that term and places each resulting subset in a separate subtree [16].
SVM is a supervised classification algorithm that can handle a large number of features. SVM is one of the most effective text classification methods in comparison with other ML algorithms [3]. SVM was first applied to text classification by Joachims [17], who verified the classification performance of SVM in text categorization by comparing it with KNN. Drucker used SVM to implement a spam filtering system and compared it with an NB implementation, showing that SVM is a better spam filtering method than NB. Since SVM has more parameters than the logistic regression and DT classifiers, it achieves the highest classification accuracy most of the time, but it requires more computation time and is very time-consuming. Logistic regression is computationally efficient compared to SVM.
On comparing DTs and NNs, their strengths and weaknesses are almost complementary. For example, people can easily understand the representation of a DT, which is not the case for NNs; conversely, DTs encounter difficulties in handling noise in training data, which NNs cope with better. DTs learn very quickly, while NNs learn relatively slowly. DT learning is used for qualitative analysis, and NNs are used for subsequent quantitative analysis.
An NN initialized with a DT is a hybrid approach that can be applied to text classification problems and tested for performance against many other text classification algorithms. Results show that the hybrid decision tree and neural network method improves the accuracy of the text classification task over a single DT or NN.
The probabilistic neural network, a combination of SVM, KNN and slightly modified versions of DTs, was proposed to better handle multi-label classification problems [18]. BFC is a hybrid method built on an NB vectorizer; compared with the simple Bayesian classification method, its SVM classifier improves the classification accuracy. In [19], a hybrid algorithm is proposed based on the variable precision rough set, which combines the strengths of the KNN and RA techniques to improve the accuracy of text classification and overcome the drawbacks of RA.
Long Short-Term Memory units, also called LSTM, are a variation of recurrent neural networks that is capable of learning long-term dependencies. They were proposed by the German researchers Hochreiter and Schmidhuber as a solution to the error backflow problem [20]. A challenge in sentiment analysis is the need for labeled datasets; to address this issue, Qurat Tul Ain et al. [21] combined deep learning techniques with sentiment analysis and found that deep learning techniques gave effective performance. Peerapon Vateekul et al. [22] applied deep learning techniques to sentiment analysis of Twitter data; LSTM and dynamic CNN (DCNN) performed better than traditional methods such as naïve Bayes and support vector machines. The deep learning techniques used word2vec, whereas the traditional approach used bag-of-words, which caused difficulty in the training process; LSTM and DCNN gave better accuracy. Dan Li et al. [23] trained emotion models using LSTM to find out which sentence belongs to which emotion, since LSTM is well suited to analyzing long sentences. Unfolding 10 layers in backpropagation gave better accuracy and recall rates for LSTM than for RNN [24]. An analysis of the performance of three RNN methods, namely vanilla RNN, LSTM and the gated recurrent unit (GRU), found that one layer with GRU gave the best accuracy [25]. Abdalraouf Hassan et al. used a continuous bag-of-words approach, which performed better than the traditional bag-of-words approach, with a single LSTM layer performing well [26]. Extended versions of RNN, namely LSTM and GRU, were investigated by Yong Zhang et al.; both achieve better performance than RNN [4]. Piotr Semberecki et al. proposed LSTM for classifying English Wikipedia articles, comparing dictionary-based word encoding with the Google News pre-trained word vectors for word embedding; the pre-trained word vectors achieved better performance [27]. Different ML techniques are used for document classification; the approach of Yash R. Ghorpade et al. [14] includes text pre-processing, feature selection (FS), feature extraction and class prediction, and their LSTM achieves 93% accuracy at 25 epochs.
[Table 1] summarizes the contributions for text classification using LSTM, with dataset, parameter details and results.

Methodology
This section presents algorithmic and implementation details of LSTM and the word embedding technique (Word2Vec) for text classification.

Word embedding technique - Word2Vec
LSTM needs its input in numeric format, not text. Different methods are used to convert text to numeric format, such as bag of words, term frequency-inverse document frequency (TF-IDF), Word2Vec, etc. The bag-of-words and TF-IDF methods for obtaining such vectors are based on very simple lexical coding; they work for binary classification, but as the number of classes (subject categories) increases, their accuracy decreases. Paper [28] suggested two methods for vector representations, CBOW (continuous bag of words) and skip-gram. The Word2Vec method is more stable and flexible. Word2Vec is a method of representing words in a multidimensional vector space [28] and has given better performance with different neural network techniques. Word2Vec is a shallow NN used to process text and create vectors for the words of the vocabulary.
This paper uses the skip-gram model to produce the input of the LSTM. It converts the text into a numeric form that the deep network can understand. Given a word, the skip-gram model predicts its surrounding context words. The network has input, hidden and output layers, and the activation function used for the output layer is softmax. Word2Vec is configured with the following parameters (a configuration sketch follows the list):
Batch size (100) - the number of words processed at a time.
Minimum word frequency (20) - the minimum number of times a word must occur in the corpus; for a large corpus, this value needs to be increased.
Layer size (100) - the size of the input vector.
Learning rate - the step size for each update of the coefficients.
Tokenizer - used to tokenize the sentences.
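As an illustration only, the following is a minimal sketch of this skip-gram configuration using gensim's Word2Vec; the paper's own implementation uses deeplearning4j, so the library, the toy corpus and the variable names here are assumptions rather than the authors' code.

```python
# Minimal skip-gram sketch with gensim (the paper itself uses deeplearning4j);
# the corpus and names below are illustrative assumptions.
from gensim.models import Word2Vec

# Hypothetical tokenized corpus: one list of tokens per document.
sentences = [
    ["this", "movie", "was", "great"],
    ["terrible", "plot", "and", "poor", "acting"],
]

model = Word2Vec(
    sentences,
    sg=1,             # 1 selects the skip-gram model, as in the paper
    vector_size=100,  # "layer size": dimensionality of the word vectors
    min_count=1,      # the paper uses 20; a toy corpus needs 1 to keep any words
    batch_words=100,  # number of words processed per batch
    alpha=0.025,      # learning rate: step size for each coefficient update
)

vec = model.wv["movie"]  # 100-dimensional vector for a vocabulary word
```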

Long Short-Term Memory for text classification
LSTM contains one memory cell and three gates, namely the forget gate, the input gate and the output gate; the size of the input vector is defined by the word embedding size. The input gate controls the flow of input, the output gate controls the flow of output, and the forget gate decides what information is stored in the memory cell and what information is thrown away from the cell. [Figure 1] shows the pseudo-code of LSTM for text classification.
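For reference, the standard LSTM gate computations follow Hochreiter and Schmidhuber's formulation (the notation below is the standard one, not taken from Figure 1):

$$
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$

where $x_t$ is the input word vector at time step $t$, $h_t$ the hidden state, $c_t$ the memory cell, $\sigma$ the sigmoid function and $\odot$ element-wise multiplication.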
Farhoodi and Yari [29] have listed performance measures for text classification. Evaluation of the LSTM classifier is carried out using two measures, namely accuracy and loss. Accuracy is calculated using equation (1).
The smaller the MSE value, the higher the accuracy, and vice versa [30]. Loss is calculated using the mean square error shown in equation (2) [31].
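Equations (1) and (2) are not reproduced here; assuming the standard definitions of accuracy and mean square error, they take the form

$$\text{Accuracy} = \frac{\text{number of correctly classified samples}}{\text{total number of samples}} \tag{1}$$

$$\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 \tag{2}$$

where $y_i$ is the true label of sample $i$, $\hat{y}_i$ the predicted output and $n$ the number of samples.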

Experimental details, results and discussion
This section presents details of the datasets used, the parameters on which the LSTM experimentation is carried out, results and discussion. LSTM performance is evaluated with six experiments conducted on batch size, epochs, optimizers, hidden nodes, word vector size, LSTM layers, L2 regularization and learning rate. Details of the seven datasets are given in [Table 2]. The deeplearning4j library is used for the implementation of the methodology. It is a Java-based toolkit for building, training and deploying deep neural networks, regressions and KNN [33].

Experiment 1: Identifying suitable batch size, epochs and optimizer
Experiments are carried out using different combinations of batch size and number of epochs. Results in [Table 3] show that a batch size of 100 at 50 epochs gives the best accuracy for all seven datasets. The accuracy ranges from 91% to 95.3%. We have tested the performance of seven optimizers, namely Adagrad, Adadelta, SGD, Adam, RMSprop, Adamax and Nadam. [Figure 2] shows that the performance of LSTM differs with the optimizer; for all seven datasets, Adagrad, Adadelta and SGD are the top three performers, respectively. Results also show that loss decreases with increasing epochs.

Experiment 2: Identifying a suitable number of hidden nodes
The goal of this experiment is to find the optimal output size of the LSTM, which can be thought of as the number of hidden nodes in the network. After experimentation, we found that the best accuracy comes from using 5 hidden nodes, as shown in [Table 4].
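To make these settings concrete, the following is a minimal sketch, assuming a Keras implementation and placeholder data (the paper's own implementation uses deeplearning4j), of a model wired with the best values from Experiments 1 and 2:

```python
# Hedged Keras sketch of the best settings from Experiments 1 and 2;
# vocabulary size, sequence length and data are assumed placeholders.
import numpy as np
from tensorflow import keras

vocab_size, seq_len, num_classes = 20000, 200, 2  # assumed placeholder values

model = keras.Sequential([
    keras.layers.Embedding(vocab_size, 100),  # 100-dimensional word vectors
    keras.layers.LSTM(5),                     # 5 hidden nodes (Experiment 2)
    keras.layers.Dense(num_classes, activation="softmax"),
])
model.compile(
    # Adagrad was the best optimizer; 0.001 anticipates Experiment 6.
    optimizer=keras.optimizers.Adagrad(learning_rate=0.001),
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)

# Dummy data standing in for a tokenized, index-encoded corpus.
X = np.random.randint(0, vocab_size, size=(500, seq_len))
y = keras.utils.to_categorical(np.random.randint(0, num_classes, 500))
model.fit(X, y, batch_size=100, epochs=50)  # batch size and epochs from Table 3
```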

Experiment 3: Impact of word vector length on accuracy of LSTM
The LSTM used in this experiment receives a word vector sequence; a vector length of 100 gives the best result, as shown in [Table 5]. Accuracy increases with vector length at the beginning, but stops improving beyond a certain point. All remaining experiments are conducted with the best-fitting vector length.

Experiment 4: Impact of layers on accuracy of LSTM
The performance of an algorithm is always associated with its architecture. [Table 6] shows the results of LSTM with different numbers of layers. LSTM with two layers shows the best accuracy for all datasets; performance is not directly proportional to the number of LSTM layers.

Experiment 5: Impact of L2 regularization on accuracy of LSTM
L2 regularization is used to avoid overfitting and minimize the error. [Table 7] shows that a value of 0.001 for L2 regularization gives the best performance.
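In the Keras-style sketch used earlier, this setting would correspond to something like the following (again an assumption; the paper configures regularization through deeplearning4j):

```python
# Hedged sketch: attaching L2 regularization of 0.001 to an LSTM layer.
from tensorflow import keras
from tensorflow.keras import regularizers

lstm = keras.layers.LSTM(
    5,
    kernel_regularizer=regularizers.l2(0.001),     # penalizes input weights
    recurrent_regularizer=regularizers.l2(0.001),  # penalizes recurrent weights
)
```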

Experiment 6: Impact of learning rate on performance of LSTM
[Table 8] shows that a learning rate of 0.001 gives the best performance. If the learning rate is low, training is more reliable and gives better accuracy, but optimization takes a long time because the steps towards the minimum of the loss function are tiny.

[Table 9] shows the best results obtained with suitable LSTM parameters. [Table 10] shows a result comparison for the IMDB dataset with 12 papers from the literature, all of which experimented with machine learning algorithms; the proposed methodology provides better results, with 95.78% accuracy. [Table 12] shows a comparison of results for datasets 3 to 7. To the best of our knowledge, results for dataset 2 (Amazon review full score with 2 classes) are not available; the accuracy obtained for dataset 2 is 94.93%. The accuracies obtained for datasets 3, 4 and 5 are 94.87%, 94.88% and 91.79% respectively, which are close to the best results in the literature and better than many other techniques listed in [Table 12]. Results obtained for dataset 6 are significantly better than the literature; the improvement in accuracy is more than 17%. For dataset 7, the accuracy is approximately 6% less than the best mentioned in [Table 12].

Conclusions
This paper presents the effect of different parameters on the performance of LSTM with Word2Vec for text classification. The text classification accuracies obtained by the proposed methodology for datasets 1 to 7 are 95.78%, 94.93%, 94.87%, 94.88%, 91.79%, 93.04% and 91.98% respectively. Six different experiments show that 100 batch size, 50 epochs, the Adagrad optimizer, 5 hidden nodes, 100-word vector length, 2 LSTM layers, 0.001 L2 regularization and 0.001 learning rate give the best accuracy. Empirical results on the IMDB, Amazon review full score, and Yahoo! Answers topic classification datasets demonstrate that the proposed architecture effectively improves classification performance compared with previously published results on the same datasets. The accuracy of LSTM for the Amazon review polarity, Yelp review polarity and AG news topic classification datasets is close to the best results in the literature. For the DBpedia ontology classification dataset, the accuracy is above 91% but 6% less than the best results in the literature. As future work, there is scope to improve the accuracy of LSTM with hybrid architectures.