An Efficient Topic Modeling Approach for Text Mining and Information Retrieval through K-means Clustering

Topic modeling is an effective text mining and information retrieval approach for organizing knowledge with varied content under specific topics. Text documents in the form of news articles are increasing rapidly on the web. Analysis of these documents is very important in the fields of text mining and information retrieval, but extracting meaningful information from them is a challenging task. One approach for discovering the themes of text documents is topic modeling, but this approach still needs a new perspective to improve its performance. In topic modeling, documents have topics and topics are collections of words. In this paper, we propose a new k-means topic modeling (KTM) approach that uses the k-means clustering algorithm. KTM discovers better semantic topics from a collection of documents. Experiments on two real-world datasets, Reuters 21578 and BBC News, show that KTM performs better than state-of-the-art topic models such as LDA (Latent Dirichlet Allocation) and LSA (Latent Semantic Analysis). KTM is also applicable to classification and clustering tasks in text mining, where it achieves higher performance than its competitors LDA and LSA.

Topic modeling is a probabilistic technique that is widely accepted in text mining and information retrieval fields [1,2]. The basic goal of topic modeling is shown in Fig. 1.

Documents are about several topics at the same time, and topics are associated with different words. The goal of topic modeling is to extract the thematic structure from the text corpus. The output of topic modeling is a set of multi-distributions of topics. In topic modeling, words are extracted from a collection of documents, and each word belongs to some topic.
Text mining is the process of examining text documents to create new documents and to transform unstructured text data into structured text data for further processing. Several statistical and probabilistic topic-modeling approaches are used for text mining tasks such as temporal text mining [3], contextual text mining [4], and comparative text mining [5].
Information retrieval is the process of obtaining, from a text corpus, an information resource that is relevant to an information need. Topic models incorporate the language model framework and achieve effective information retrieval results [2,6,7,8].
The cross-collection mixture model [5] for text mining, which is based on probabilistic latent semantic analysis [9], discovers themes from a collection of documents.
The LDA [10] model generates topics from documents in terms of a probability distribution of words for every topic. LDA is also used as a dimension reduction technique, but it does not achieve high classification accuracy [11].
LSA-based document representations are usable in many information retrieval applications [12,13] for discovering lexical semantics, and a simple word co-occurrence approach performs better [14,15].
Topic extraction from documents has also been performed by clustering text documents into groups according to the similarity of their semantic terms [16,17]. These models are good for classification, but their limitation is that every document is associated with only one cluster.
The PLSA (Probabilistic Latent Semantic Analysis) [9] topic model is an extension of LSA that fixes some of its issues. PLSA improves on LSA in a probabilistic sense and uses a generative model.
The aspect model is a statistical model that extends PLSA [18]. It is also called a latent variable model for co-occurrence data, and it associates an unobserved class with every observation [19]. The correlated topic model uses the logistic normal distribution to model relationships between topics and allows words to occur in every topic [20]. A limitation of this method is that it requires numerous calculations. The common semantic topic model [21] is used for filtering noise in short-text topic modeling. The word co-occurrence network-based topic model [22] extracts topics from short news texts and applies clustering to these topics.
In this research, we propose an efficient K-means topic modeling approach that discovers the hidden semantic topics in text documents. KTM is used for text classification and clustering tasks in text mining.
Experimental results on real-world datasets show that KTM performs better than LDA and LSA, which are state-of-the-art topic models.

Proposed k-means Topic Modeling Approach
The proposed K-means topic modeling approach consists of the following steps.
Step 1 Preprocessing of text documents is performed in this step, using several different preprocessing operations.
Step 2 After preprocessing, the Bag of Words (BOW) model is applied to the preprocessed text documents. The BOW model [23] records word occurrences in text documents. In natural language processing, a document is usually represented by a BOW, which is a word-document matrix.
A BOW example is shown in Table 1. There are six documents (d1, d2, d3, d4, d5, d6) and four words (bank, account, customer, manager). The word bank occurs three times in document one and four times in document three.
Different words occur with different frequencies in the documents, as shown in Table 1.
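The word-document matrix described above can be sketched in a few lines of Python. This is only an illustration: the three tokenized documents below are hypothetical and are not the contents of Table 1, although the counts for the word bank in the first and third documents are chosen to match the text. The helper name `bag_of_words` is likewise an assumption.

```python
from collections import Counter

# Hypothetical toy corpus (NOT the paper's Table 1); documents are already tokenized.
docs = [
    ["bank", "account", "bank", "customer", "bank"],
    ["manager", "customer"],
    ["bank", "bank", "account", "bank", "bank"],
]
vocab = ["bank", "account", "customer", "manager"]


def bag_of_words(documents, vocabulary):
    """Return a word-document matrix: one row per word, one column per document."""
    counts = [Counter(doc) for doc in documents]
    return [[c[word] for c in counts] for word in vocabulary]


bow = bag_of_words(docs, vocab)
for word, row in zip(vocab, bow):
    print(word, row)  # first line printed: bank [3, 0, 4]
```

Each row gives the occurrence counts of one word across the documents, which is the representation the later weighting steps start from.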
Step 3 Local term weighting is calculated in this step.
The term frequency method [24] is used to find the local term weights. This method measures how often a term appears in the document collection.
Step 4 Global term weighting is calculated in this step by using the entropy method. The GTW (Global Term Weighting) with entropy is calculated by using equation 4, where N is the number of documents and n_i is the number of documents in which term i appears. Entropy assigns a higher weight to a term that has a lower frequency across documents [25].
The output of this step is the entropy term matrix.
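As an illustration of entropy-based global weighting, the sketch below uses the common log-entropy formulation, g_i = 1 + sum_j (p_ij log p_ij) / log N with p_ij = tf_ij / gf_i, where gf_i is the total count of term i. This is an assumption: it is the textbook form and may differ in detail from the paper's equation 4. The function names and the toy count matrix are hypothetical.

```python
import math


def entropy_global_weights(tf):
    """Log-entropy global weight per term (assumed textbook form, not the paper's eq. 4).

    tf: word-document count matrix (rows = terms, columns = documents).
    """
    n_docs = len(tf[0])
    if n_docs < 2:
        raise ValueError("at least two documents are needed (log N must be non-zero)")
    weights = []
    for row in tf:
        gf = sum(row)  # global frequency of the term over all documents
        h = 0.0
        for count in row:
            if count > 0:
                p = count / gf
                h += p * math.log(p)
        weights.append(1.0 + h / math.log(n_docs))
    return weights


def entropy_term_matrix(tf):
    """Scale each term's local counts by its global entropy weight."""
    g = entropy_global_weights(tf)
    return [[count * g_i for count in row] for row, g_i in zip(tf, g)]


# Hypothetical counts: 3 terms x 3 documents.
tf = [[3, 0, 4], [1, 1, 1], [0, 5, 0]]
weights = entropy_global_weights(tf)
print(weights)
```

Under this formulation a term spread evenly over all documents gets a weight near 0, while a term concentrated in a single document gets a weight of 1, which matches the intent of favoring less widely distributed terms.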
Step 5 To avoid the negative impact of high dimensionality on the entropy-based global term weighting from Step 4, principal component analysis (PCA) [26] is used. The objective of PCA is to reduce a large set of variables to a small set that still contains most of the information in the large set.
To make the process fast, we select two dimensions in PCA, which is the minimum number of dimensions.
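The reduction to two dimensions can be sketched with a plain NumPy implementation of PCA through the singular value decomposition. This is a minimal sketch under stated assumptions: the function name `pca_project` and the random 6 x 4 input matrix are hypothetical stand-ins for the entropy term matrix, and the paper does not say which PCA implementation was used.

```python
import numpy as np


def pca_project(X, n_components=2):
    """Project the rows of X (documents x features) onto the top principal components."""
    X = np.asarray(X, dtype=float)
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:n_components].T


rng = np.random.default_rng(0)
X = rng.random((6, 4))  # hypothetical: 6 documents x 4 weighted terms
Z = pca_project(X, n_components=2)
print(Z.shape)  # (6, 2)
```

Keeping only two components makes the subsequent clustering cheap, at the cost of discarding whatever variance lies in the remaining directions.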
Step 6 In this step, the K-means clustering algorithm [27] is used to cluster the documents, which are represented by the entropy-based global term weighting. K-means clustering partitions the data into k clusters. The algorithm iteratively moves the k centers, and each data point is assigned to its closest centroid. The K-means clustering algorithm minimizes the objective function

$$J = \sum_{j=1}^{k} \sum_{i=1}^{n} \left\| x_i^{(j)} - c_j \right\|^2$$

where the chosen distance measure, shown in equation 6, is taken between a data point $x_i^{(j)}$ and the cluster center $c_j$; it indicates the distance of the n data points from their respective cluster centers.
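The clustering step can be sketched with Lloyd's algorithm, the usual iterative procedure for K-means. The following is a minimal NumPy sketch, not the authors' implementation: the function name `kmeans`, the random initialization, the iteration cap, and the toy 2-D points are all assumptions made for illustration. The returned objective is the sum of squared Euclidean distances between each point and its assigned center.

```python
import numpy as np


def kmeans(X, k, n_iter=100, seed=0):
    """Lloyd's algorithm: returns (labels, centers, objective)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Initialize the centers with k distinct data points chosen at random.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared distance from every point to every center, shape (n, k).
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Move each center to the mean of its points (keep it if the cluster is empty).
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    objective = d2[np.arange(len(X)), labels].sum()
    return labels, centers, objective


# Hypothetical 2-D points forming two well-separated groups.
points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
labels, centers, objective = kmeans(points, k=2)
print(labels, objective)
```

Note that k is an input here: the algorithm assigns points to a fixed number of clusters rather than discovering that number.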
Step 7 The document-term matrix is used with the GTW method in this step. Here i represents the number of documents. $P(D_j)$ is calculated by equation 7.
Step 8 The probability of document j in topic k, $P(D_j \mid T_k)$, is found in this step.

Step 9 The conditional probability of word i given document j is found in this step.
Step 10 The probability of word i in topic k is calculated through equation 10, using the quantities from the preceding steps.
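One way to read Steps 7 to 10 is as a marginalization over documents: the word distribution of a topic is the sum over documents of P(word | document) times P(document | topic). The sketch below follows that reading. It is an assumption, not a reproduction of the paper's equations 7 to 10: the function name `topic_word_distribution`, the choice to derive P(document | topic) from each document's share of the weight mass in its cluster, and the toy data are all hypothetical.

```python
import numpy as np


def topic_word_distribution(weights, labels, k):
    """Assumed reading of Steps 7-10: P(w | t) = sum_d P(w | d) * P(d | t).

    weights: documents x words non-negative matrix (each document has positive mass).
    labels:  cluster (topic) index of each document, e.g. from K-means.
    Returns a k x words matrix; rows of non-empty topics sum to 1.
    """
    weights = np.asarray(weights, dtype=float)
    labels = np.asarray(labels)
    doc_mass = weights.sum(axis=1)
    p_w_given_d = weights / doc_mass[:, None]
    p_d_given_t = np.zeros((k, len(weights)))
    for t in range(k):
        members = labels == t
        # A document's share of the total weight mass in its cluster.
        p_d_given_t[t, members] = doc_mass[members] / doc_mass[members].sum()
    return p_d_given_t @ p_w_given_d


# Hypothetical weighted matrix: 4 documents x 3 words, two clusters.
W = [[2, 0, 0], [1, 1, 0], [0, 0, 3], [0, 1, 3]]
P = topic_word_distribution(W, labels=[0, 0, 1, 1], k=2)
print(P)
```

Sorting each row of the result in descending order gives the top words of the corresponding topic.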

Datasets
We used two datasets: the Reuters 21578 news dataset and the BBC news articles dataset. The Reuters 21578 dataset contains many classes; in this research, two large classes, acq and earn, were selected for the experiments. The Reuters 21578 dataset is used for classification and the BBC news dataset is used for clustering. Table 2 describes the basic statistics of these two datasets.

The execution time of KTM for classification on the Reuters 21578 dataset is compared with that of LDA and LSA. In this experiment, the well-known Gibbs sampling method is used, which needs many iterations and increases the computational cost. KTM performance is stable as the number of topics increases and is better than that of the state-of-the-art topic models LDA and LSA, as shown in Fig. 3.

Clustering Results
Fig. 8 shows that the CH-index of KTM is higher than that of the LDA and LSA topic models with 125 topics.
So, the KTM clustering results are better than those of LDA and LSA with 125 topics.
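Assuming that the CH-index here denotes the Calinski-Harabasz index, as the abbreviation usually does in the clustering literature, it is the ratio of between-cluster dispersion to within-cluster dispersion, each normalized by its degrees of freedom; higher values indicate better-separated clusters. The sketch below is a minimal NumPy version with a hypothetical function name and toy data, not the evaluation code used in the experiments.

```python
import numpy as np


def calinski_harabasz(X, labels):
    """Calinski-Harabasz index: [B / (k - 1)] / [W / (n - k)]."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    n = len(X)
    clusters = np.unique(labels)
    k = len(clusters)
    overall_mean = X.mean(axis=0)
    between = 0.0  # B: size-weighted squared distance of cluster centers to the overall mean
    within = 0.0   # W: squared distance of points to their own cluster center
    for c in clusters:
        members = X[labels == c]
        center = members.mean(axis=0)
        between += len(members) * ((center - overall_mean) ** 2).sum()
        within += ((members - center) ** 2).sum()
    return (between / (k - 1)) / (within / (n - k))


# Hypothetical 2-D points with a good and a poor assignment.
pts = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
good = calinski_harabasz(pts, [0, 0, 0, 1, 1, 1])
poor = calinski_harabasz(pts, [0, 1, 0, 1, 0, 1])
print(good, poor)
```

On this toy data the assignment that matches the two visible groups scores far higher than the mixed one, which is the sense in which a higher CH-index indicates better clustering.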

Time Execution of Clustering
The execution time of KTM for clustering on the BBC news dataset is compared with that of LDA and LSA. KTM performance is stable as the number of topics increases and is better than that of the state-of-the-art topic models LDA and LSA, as shown in Fig. 12.

EXAMPLE OF TOPIC MODELING FOR BBC NEWS DATASET
This section presents an example of a sports topic from the BBC news dataset as produced by the LDA, LSA, and KTM topic models.

CONCLUSION
In this paper, we proposed a K-means topic modeling approach for news text documents. Topic modeling for news text is a challenging task due to the increase of news content on the web.
KTM discovered more precise topics from documents, where topics are multi-distributions of words.
The experimental results on two real-world datasets indicate that KTM can learn more coherent topics and is competitive with state-of-the-art topic models. The overall performance of KTM for classification and clustering is better than that of the LDA and LSA topic models. KTM can be utilized in the text mining and information retrieval fields.