Semantic-Based Cluster Content Discovery in the Description-First Clustering Algorithm

In the field of data analytics, grouping similar documents in textual data is a serious problem. A lot of work has been done in this field and many algorithms have been proposed. One category of algorithms first groups the documents on the basis of similarity and then assigns meaningful labels to those groups. Description-first clustering algorithms belong to the opposite category, in which a meaningful description is deduced first and the relevant documents are then assigned to that description. LINGO (Label Induction Grouping Algorithm) is a description-first clustering algorithm used for the automatic grouping of documents obtained from search results. It uses LSI (Latent Semantic Indexing), an IR (Information Retrieval) technique, for the induction of meaningful cluster labels, and VSM (Vector Space Model) for cluster content discovery. In this paper we present LINGO using LSI during both the cluster label induction and the cluster content discovery phases. Finally, we compare the results obtained from the algorithm when it uses VSM and when it uses latent semantic analysis during the cluster content discovery phase.

INTRODUCTION
It is a fact that more than 80% of the available data is in text form. Many search engines have been introduced to mine data from this vast library, and they use different algorithms to mine and analyze the text data. Description-first clustering algorithms are used to increase the quality of cluster labels and the readability of thematic groups. Besides lexical terms, they also consider phrases as candidates for cluster labels.
LINGO is an algorithm of the description-first type. In its existing form, it uses the novel IR technique LSI for cluster label induction and VSM for cluster content discovery. In this paper we use LSI during both phases: cluster label induction and cluster content discovery.

Vector Space Model
VSM is an IR technique in which a text document is represented as a multidimensional vector. In VSM we compare algebraic vectors instead of text documents, because once a text document is represented as an algebraic vector, algebraic operations can be used to find the similarities among the vectors. Each vector represents a document $j$ in multidimensional space, and each of its elements represents the strength of the relationship of a term $i$ to document $j$. The matrix built from these vectors is called the term-document matrix $A$: its rows correspond to the $t$ terms of the documents and its columns to the $d$ documents, and element $a_{ij}$ represents the relationship between term $i$ and document $j$. A number of term weighting schemes, including binary weighting, term frequency, and term frequency-inverse document frequency (tf-idf), can be used to measure this degree of relationship according to the requirements [8]. After the construction of matrix $A$, various methods are available to measure the distance between the vectors representing documents $a$ and $b$; most often the cosine similarity is used:

$$\cos(a, b) = \frac{a \cdot b}{\|a\|\,\|b\|}$$

A detailed review is available in [10]. When the matrix is later reduced to $k$ dimensions through SVD (as in LSI, described in the cluster label induction section), it depends upon the user to what extent the extraneous terms are eliminated: the larger the value of $k$, the closer the approximation to the original matrix. Therefore $k$ is chosen in such a way that at least 80% of the information content of the original matrix is retained.
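As a concrete illustration of the model, the following minimal Python sketch builds a tf-idf weighted term-document matrix and computes pairwise cosine similarities between documents. The use of scikit-learn and the toy documents are our own assumptions for illustration; the paper does not prescribe any particular implementation.

```python
# Minimal VSM sketch: tf-idf term-document matrix + cosine similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "latent semantic indexing for search results clustering",
    "vector space model represents documents as vectors",
    "clustering search results with meaningful labels",
]

vectorizer = TfidfVectorizer(stop_words="english")
A = vectorizer.fit_transform(docs).T      # term x document matrix A (t x d)

# sim(a, b) = (a . b) / (||a|| ||b||) for every pair of document vectors
sim = cosine_similarity(A.T)              # d x d matrix of pairwise cosines
print(sim.round(2))
```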

OVERVIEW OF THE ALGORITHM
While designing clustering algorithms it is necessary to pay special attention to the presentation of cluster labels and the contents of the clusters. It should be emphasized that the labels and contents of the clusters should be meaningful to users. Most algorithms in this field follow the strategy of finding the contents of the clusters first and then, on the basis of these contents, assigning appropriate labels to them. Without considering any similarity measure between labels and contents, the labels might not properly represent those groups. To avoid such problems, LINGO attempts to find meaningful cluster labels first and then assigns the appropriate documents to those labels. It considers frequent phrases and lexical terms from the input documents as label candidates and chooses the appropriate labels after pruning. After ensuring the descriptive quality of the labels, it assigns the documents to them.
Algorithm 1 gives the pseudo-code of the algorithm when it uses LSI in cluster content discovery. The particular steps are described in the following sections.

Preprocessing and Frequent Phrase Extraction
Preprocessing is the preliminary step in any IR technique.
In this step we remove stop words and other unnecessary tags from the available dataset, because these terms might affect the results negatively [11].
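A minimal sketch of this step, assuming a simple regular-expression tokenizer and a tiny illustrative stop-word list (a real system would use a much larger list):

```python
# Preprocessing sketch: lower-case, strip HTML-like tags, tokenize, drop stop words.
import re

STOP_WORDS = {"the", "a", "an", "of", "in", "and", "to", "is", "for"}

def preprocess(document: str) -> list[str]:
    text = re.sub(r"<[^>]+>", " ", document.lower())   # remove tags
    tokens = re.findall(r"[a-z0-9]+", text)            # simple tokenization
    return [t for t in tokens if t not in STOP_WORDS]  # remove stop words

print(preprocess("The <b>clustering</b> of search results is a problem."))
# -> ['clustering', 'search', 'results', 'problem']
```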
Frequent phrases are defined as ordered sequences of terms that occur in the input documents in a repeated manner. We consider these phrases, along with lexical terms, as cluster label candidates because good writers use synonymy to express their views and to hold the attention of their readers; by exploiting synonymy, a sentence can be expressed in a variety of ways while avoiding repetition. SVD has the potential to identify the abstract concepts behind the documents [12].
To be a candidate for a cluster label, a phrase or term must occur more often than a specified threshold, must not begin or end with a stop word or tag, and must be complete. A phrase or term is complete if no other candidate phrase or term can be obtained from it by adding or removing an element; that is, it can neither be extended nor shortened. These assumptions are discussed in [2,4]. A sketch of such an extraction procedure is given below.
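The following Python sketch illustrates one possible reading of these conditions: word n-grams are counted over tokenized documents (stop words retained), candidates below the frequency threshold or bounded by stop words are discarded, and the completeness condition is approximated by dropping candidates contained in an equally frequent longer candidate. The function and parameter names are our own, not the paper's.

```python
from collections import Counter

def frequent_phrases(token_lists, stop_words, min_freq=2, max_len=3):
    """Count n-grams and keep complete candidates above the frequency threshold
    that neither start nor end with a stop word."""
    counts = Counter()
    for tokens in token_lists:
        for n in range(1, max_len + 1):
            for i in range(len(tokens) - n + 1):
                gram = tuple(tokens[i:i + n])
                if gram[0] not in stop_words and gram[-1] not in stop_words:
                    counts[gram] += 1
    cands = {g: c for g, c in counts.items() if c >= min_freq}

    def contains(big, small):                 # contiguous sub-sequence test
        return any(big[i:i + len(small)] == small
                   for i in range(len(big) - len(small) + 1))

    # Completeness: drop a candidate if a longer candidate with the same
    # frequency contains it; the longer phrase is more informative.
    return [" ".join(g) for g, c in cands.items()
            if not any(len(h) > len(g) and cands[h] == c and contains(h, g)
                       for h in cands)]

docs = [["results", "clustering", "of", "search", "results"],
        ["search", "results", "clustering", "is", "useful"]]
print(frequent_phrases(docs, {"of", "is"}))
# -> ['results', 'results clustering', 'search results']
```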

Cluster Label Induction
Once the frequent phrases have been extracted from the input data, they are considered candidates for cluster labels [13][14][15]. Three steps are involved: construction of the term-document ($t \times d$) matrix, discovery of the abstract concepts, and phrase matching with label pruning.
The $t \times d$ matrix is constructed from the input dataset by representing each document as a vector over the terms that exceed the defined frequency threshold. The tf-idf (term frequency-inverse document frequency) weighting scheme is used to measure the weight of each term in a vector [16][17][18][19]. To discover the abstract concepts of the term-document matrix, SVD is applied to find an orthogonal basis: SVD decomposes the matrix $A$ into the sub-matrices $U$, $S$ and $V^T$, where the columns of $U$ represent the abstract concepts of $A$. For further calculation we use the matrix $U$ reduced to $k$ columns. The value of $k$ is selected by comparing the Frobenius norms of the matrix $A$ and of the matrix $A_k$, the approximation of $A$ reduced to $k$ dimensions. Let us define a threshold $q$, the percentage of the original information of matrix $A$ that should be retained; the larger the value of $k$, the more information is preserved in $A_k$. The smallest $k$ is chosen that satisfies

$$\frac{\|A_k\|_F}{\|A\|_F} \geq q,$$

where $\|X\|_F$ denotes the Frobenius norm of a matrix $X$. A sketch of this selection is given below.
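The selection of $k$ can be sketched in a few lines, using the fact that $\|A_k\|_F^2$ equals the sum of the $k$ largest squared singular values of $A$. The random matrix below is only a stand-in for a real term-document matrix:

```python
import numpy as np

def choose_k(A: np.ndarray, q: float = 0.8) -> int:
    """Smallest k for which the rank-k approximation A_k retains at least a
    fraction q of ||A||_F."""
    s = np.linalg.svd(A, compute_uv=False)         # singular values, descending
    cum = np.sqrt(np.cumsum(s**2) / np.sum(s**2))  # ||A_k||_F / ||A||_F per k
    return int(np.searchsorted(cum, q) + 1)        # first k meeting threshold

A = np.random.rand(50, 20)                         # stand-in term-document matrix
print(choose_k(A, q=0.8))
```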
In the phrase matching and label pruning step, the phrases and abstract concepts are represented as column vectors in the same vector space as matrix $A$, which allows the cosine similarity to be used to calculate how close a term or phrase is to an abstract concept. Let us refer to this matrix as $P$, of size $t \times (p+t)$, where $t$ is the number of terms and $p$ the number of phrases extracted from the original dataset. Many tools are available to extract phrases from a dataset; in this work we have used Maui Indexer and KEA for this purpose. Having $t$ and $p$, we can construct the matrix $P$ using one of the weighting schemes; here we have used tf-idf, where the frequency of a term is its frequency in the original dataset. Given the matrix $P$ and the $i$-th column vector of matrix $U$, the cosine distance between an abstract concept and a phrase is calculated as $m_i = U_i^T P$. Extending this to the entire reduced matrix $U_k$, a matrix $M$ of cosines between $P$ and $U$ is constructed as $M = U_k^T P$. For each abstract concept, the component of the corresponding row of $M$ with the maximum value is taken as a candidate cluster label. At the end of the cluster label induction phase, the candidate cluster labels are pruned to induce the final cluster labels. For this purpose we construct another matrix $Z$, in which the candidate cluster labels are represented as documents, and calculate $Z Z^T$, which produces a matrix of similarities among the cluster labels. From each row we pick the columns that exceed a defined threshold and keep only the candidate with the maximum score. A sketch of the matching step follows.
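The following sketch runs the matching step on random stand-in data; the matrix names follow the text, while the sizes and data are illustrative assumptions. Because the columns of $U_k$ are orthonormal and the columns of $P$ are normalized to unit length, the entries of $M$ are cosines:

```python
import numpy as np

rng = np.random.default_rng(0)
t, d, p, k = 40, 15, 10, 5
A = rng.random((t, d))                          # stand-in term-document matrix
U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_k = U[:, :k]                                  # k abstract concepts (orthonormal)

P = rng.random((t, p + t))                      # phrase and single-term vectors
P /= np.linalg.norm(P, axis=0)                  # unit columns, so M holds cosines

M = U_k.T @ P                                   # k x (p + t) matrix of cosines
label_idx = M.argmax(axis=1)                    # best label candidate per concept
label_score = M.max(axis=1)
print(label_idx, label_score.round(2))
```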

Cluster Content Discovery
In this phase the documents in the corpus are allocated to the cluster labels induced in the former phase. For this purpose the LSI technique is used. We construct a matrix $Q$ in which the induced labels are represented as column vectors and multiply it with the matrix $A_k$, the matrix reconstructed from the dimensions of $U$, $S$ and $V^T$ reduced to $k$: $C = Q^T A_k$. In the matrix $C$, the element $c_{ij}$ gives the strength of the relationship between document $j$ and cluster $i$.
We assign document $j$ to cluster $i$ when the value of $c_{ij}$ exceeds a specific threshold. The remaining documents, which do not fall into any cluster, end up in an artificial group called "Others". A sketch of this phase is given below.
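A sketch of this phase on random stand-in data; the threshold of 0.15 is an assumed value, not one reported by the paper, and the columns of $A_k$ are normalized so that the entries of $C$ behave like cosines:

```python
import numpy as np

rng = np.random.default_rng(1)
t, d, k, n_labels = 40, 15, 5, 3
A = rng.random((t, d))
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]      # rank-k approximation of A
A_k /= np.linalg.norm(A_k, axis=0)               # unit document vectors

Q = rng.random((t, n_labels))                    # induced label vectors (columns)
Q /= np.linalg.norm(Q, axis=0)

C = Q.T @ A_k                                    # c_ij: strength of doc j in cluster i
threshold = 0.15                                 # assumed, tuned per dataset
clusters = {i: set(np.where(C[i] > threshold)[0]) for i in range(n_labels)}
others = [j for j in range(d)
          if all(j not in members for members in clusters.values())]
print(clusters, "Others:", others)
```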

Final Cluster Formation
For the purpose of presentation, the clusters are presented in sorted order based on their score. The score of a cluster is calculated by the simple formula

$$C_{score} = \text{label score} \times \|C\|,$$

where $\|C\|$ is the number of documents in the cluster. This score function favors larger clusters over smaller ones and may be regarded as a measure of cluster quality.
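A small illustration of the ordering, with made-up labels, label scores and cluster sizes:

```python
# Final ordering sketch: cluster score = label score x cluster size |C|.
clusters = [("data mining", 0.92, 14),
            ("neural networks", 0.85, 9),
            ("information retrieval", 0.78, 17)]   # (label, label score, |C|)

for label, ls, size in sorted(clusters, key=lambda c: c[1] * c[2], reverse=True):
    print(f"{label}: {ls * size:.2f}")
```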

RESULTS AND EVALUATION
The study is evaluated by performing experiments on several datasets. In each experiment the proposed methodology gives better results than the existing one when used to find the cluster contents: it assigns the documents to the appropriate labels and reduces the group of unassigned documents named "Others". It can be observed clearly that the proposed methodology, in which the contents of the clusters are found using LSI, reduces the "Others" group remarkably. It has also been observed in the results that the technique groups the documents into the most relevant clusters. Fig. 2 shows the significant change in the "Others" group in a comparison among the three datasets D1, D2 and D3. The quality of a cluster is determined by its score. Fig. 3 shows a graph comparing the proposed and the existing methodology; a significant change in the cluster quality on D1 can be observed. The third comparison again identifies a major difference in the quality of the clusters under the new methodology.

CONCLUSION
In this paper we presented the LINGO description-first clustering algorithm with LSI used not only for cluster label induction but also for cluster content discovery. The experiments on three datasets indicate that discovering cluster contents with LSI assigns documents to more relevant clusters, reduces the size of the "Others" group, and improves the overall cluster quality compared with the VSM-based content discovery of the existing algorithm.