Bringing Shape to Textual Data – A Feasible Demonstration

The Internet has revolutionized the communication paradigm. This has led to an immense amount of unstructured (i.e. textual) data, which is a major source of useful knowledge about people in several application domains. TM (Text Mining) extracts high-quality information to discover knowledge by drawing out patterns and relationships in textual data. The field has attracted considerable attention from the research community; as a result, several attempts have been made to propose, introduce and refine techniques for uncovering knowledge from text data. This study aims at: (1) presenting existing TM techniques in the scientific literature, (2) reporting challenges, issues and gaps that still need attention, and (3) proposing a framework to bring shape to textual data. A prototype has been developed to demonstrate the effectiveness and potential worth of the proposed approach by showing how unstructured data (news articles, in this study) is brought to a shape representing interesting knowledge. The proposed framework implements basic NLP (Natural Language Processing) functions in combination with AYLIEN API (Application Programming Interface) functions. The results reveal how events, celebrities and popular news items have been covered in the electronic media, and they also represent the subjectivity of topical news events. The news coverage trends highlight the significance of daily news events, which may assist in gaining insight about the media groups.


INTRODUCTION
Individuals, organizations and devices have all become data generators that pump enormous amounts of data onto the web each day.
Undoubtedly, there are many insights within this data that cannot be ignored. It is important to bring structure to this text, which offers many possibilities for finding relationships, hierarchies, patterns and trends to discover knowledge. Manually exploring and analyzing such a large collection of data to find new insights for prediction and forecasting is unrealistic; the data is also beyond the capabilities of traditional applications for exploratory analysis. High volumes of data need to be analyzed for the discovery of trends and meaningful patterns, which are vital for effective decision-making.
There are several tools and techniques for mining text data effectively and extracting the new outcomes and rich insights it can bring. The TM techniques that aim at providing such insights include text categorization, clustering, summarization, concept extraction, topic detection, information retrieval and prediction.

LIFECYCLE OF TEXT MINING
Text Mining: TM or TDM (Text Data Mining) is the process of discovering interesting and actionable information by looking for emerging trends and patterns that are valid, useful, unexpected and understandable [5]. This means that through TM novel knowledge is captured automatically, providing new and exciting information about the world [6]. TM is considered a specialist technology requiring a multitude of skills, such as a linguistic background, statistics, computational skills and psychology [6]. It requires tools and techniques that vary in usability, accessibility and configurability, as these enable a deeper analysis of the information.
Besides, it helps in understanding and identifying business insights in the content and also highlights relationships between texts in a document or corpus, which would otherwise be undiscovered and ignored.
The TM in the business domain is referred to as text analytics [7].
TM has the potential to enrich information and knowledge management processes. It can explore large amounts of text containing extensive and detailed coverage of innumerable observations. To take advantage of the text, it is necessary to bring some level of structure into the textual data format, such that most of the available clues are identified and necessary actions can be taken on a timely basis. Although computer programs are linguistically challenged, current processing speeds [8] provide adequate power and potential to build and benefit from efficient tools, techniques and algorithms.
Lifecycle of Text Mining: It contains eight steps, which are described in Fig. 1.
Step-1: Problem Definition and Defining Specific Goals: This step identifies the fundamental problem that needs to be solved, and thus enables determining the right content needed for mining and the right TDM approach.
Understanding the problem definition is considered key to carrying out research in TM. Good knowledge of the problem, and defining it accurately, builds a path towards a solution or a recommendation. TDM analyzes the contents of data and also explores the outcomes, so that new connections or patterns are detected that would be next to impossible to find manually.
Step-2: Use/Design and Build/Outsource TDM Approach: A ready-made TDM approach is chosen only if the investigator has good technological skills. If the investigator possesses good programming skills, it is better to build a TDM approach. Alternatively, the approach may be outsourced, since solving a TM problem may require good NLP knowledge.
Each of these models helps in further processing.
Step-6: Feature Extraction/Build Concept and Category Model: The text features can be identified at three levels: words, sentences and documents [9]. The features are selected from, for instance, a representation model present in the considered data collection [10].
Feature selection is an important step, since it reduces complexity by selecting essential features and ignoring irrelevant ones. Feature extraction also helps in reducing dimensionality.
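As an illustration of word-level feature weighting, TF-IDF scores terms highly when they are frequent in one document but rare across the corpus. The following is a minimal standard-library Python sketch, not the paper's implementation; the sample documents are invented:

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def tf_idf(docs):
    """Compute TF-IDF weights for each document in a small corpus."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        total = len(tokens)
        weights.append({t: (c / total) * math.log(n / df[t])
                        for t, c in tf.items()})
    return weights

docs = ["the cat sat on the mat",
        "the dog chased the cat",
        "stock markets fell sharply today"]
w = tf_idf(docs)
# "the" occurs in two of the three documents, so its weight stays low,
# while "mat", unique to the first document, scores higher.
```

Terms shared by every document receive an IDF of zero and are effectively pruned, which is exactly the dimensionality reduction the step above describes.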
Step-7: There are different algorithms that can be chosen and applied to the concepts for detailed analysis. The best-scored concepts can be merged with other data to predict future behavior.
Step-8: Reach an Insight/Outcome/Recommendation: The selection, interpretation, evaluation and visualization of knowledge are performed at this stage [11]. Many visualization tools are available, and graphs or charts can be drawn for in-depth understanding and better analysis to find new outcomes and exciting discoveries.

ROOTS OF TEXT MINING
TM is a subfield of DM, which itself has grown from its parent disciplines: ML (Machine Learning), databases, data warehousing and knowledge discovery [6]. TM, being an interdisciplinary field, employs many computational technologies such as ML, NLP, AI (Artificial Intelligence), IT (Information Technology), CL (Computer Linguistics), Biostatistics, Pattern Recognition and Psychology. TM is also closely related to IR (Information Retrieval) and IE (Information Extraction).
Thus, the roots of TM are scattered over several overlapping fields that are described in the following.

Data Mining
DM serves two goals. Firstly, it identifies emerging patterns and trends, termed insight, which may help in taking actions on a timely basis; this provides tremendous economic value, often imperative to businesses looking for a competitive advantage. Secondly, it helps in prediction by building a model that predicts outcomes from given data [10]; this helps organizations plan, predict, forecast, take appropriate measures and provide recommendations effectively and on time. Precisely, in DM, previously unknown and particularly important patterns are extracted from large computerized databases, whilst in TM, patterns are extracted from free, unstructured natural-language text.

Machine Learning
ML builds algorithms that take input data and then use statistical analysis to predict outcomes within an acceptable range. Supervised and unsupervised learning are the two main approaches in ML [12]. In supervised learning, a model is trained on labeled training data and predictions are computed on unseen new data. In unsupervised learning, no training labels are provided to the model; such methods categorize data sets based on similarity among the data points in a given dataset.
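The supervised setting described above — train on labeled examples, then predict labels for unseen data — can be sketched with a tiny multinomial Naive Bayes classifier. The training documents and labels below are invented for illustration:

```python
import math
from collections import Counter, defaultdict

def train_nb(labeled_docs):
    """Train a multinomial Naive Bayes text classifier."""
    class_docs = defaultdict(int)      # documents per class
    word_counts = defaultdict(Counter) # word counts per class
    vocab = set()
    for words, label in labeled_docs:
        class_docs[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    total_docs = sum(class_docs.values())
    return class_docs, word_counts, vocab, total_docs

def predict_nb(model, words):
    """Pick the class with the highest log posterior (Laplace smoothing)."""
    class_docs, word_counts, vocab, total_docs = model
    best, best_score = None, float("-inf")
    for label in class_docs:
        score = math.log(class_docs[label] / total_docs)  # log prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in words:
            score += math.log((word_counts[label][w] + 1) / denom)
        if score > best_score:
            best, best_score = label, score
    return best

train = [(["goal", "match", "team"], "sports"),
         (["election", "vote", "party"], "politics"),
         (["score", "team", "win"], "sports")]
model = train_nb(train)
label = predict_nb(model, ["team", "goal"])  # classified as "sports"
```

The cost of producing those labeled examples is exactly the supervised-learning expense noted later in the challenges section.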

Natural Language Processing
NLP is a component of TM that helps the machine read and understand text by performing linguistic or grammatical analysis [13]. It needs a consistent knowledge base, such as a detailed thesaurus, a lexicon of words, a data set of linguistic and grammatical rules, an ontology and up-to-date entities.

Information Retrieval
An IR (Information Retrieval) system finds relevant texts from a large collection of documents and presents them to the user, much like most search engines. The main job of an IR system is to provide the right information to its users at the right time.
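The core IR loop — score each document against a query and return the best matches — can be sketched minimally as below. Real engines use inverted indexes and TF-IDF or BM25 weighting rather than raw term overlap; the documents are invented:

```python
def rank(query, docs):
    """Rank documents by how many query terms each one contains."""
    q = set(query.lower().split())
    scored = [(len(q & set(d.lower().split())), i) for i, d in enumerate(docs)]
    # Sort by descending overlap and drop documents that match no term.
    return [i for score, i in sorted(scored, reverse=True) if score > 0]

docs = ["Python text mining tutorial",
        "Cooking recipes for beginners",
        "Mining patterns from text data"]
result = rank("text mining", docs)
# Documents 0 and 2 match both query terms; document 1 is filtered out.
```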

Information Extraction
IE (Information Extraction) is a technique that extracts meaningful structured information from unstructured and/or semi-structured data [13]. The process also needs NLP in order to read machine-readable documents. The performance of text classifiers is evaluated with accuracy, or with precision and recall.
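Precision and recall can be computed directly from the set of extracted items and a gold-standard set; the entity lists below are illustrative:

```python
def precision_recall(predicted, actual):
    """Precision and recall of extracted items against a gold-standard set."""
    predicted, actual = set(predicted), set(actual)
    tp = len(predicted & actual)  # true positives
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(actual) if actual else 0.0
    return precision, recall

# The extractor found 4 entities, 3 of which appear in the 5-entity gold set.
p, r = precision_recall(["Paris", "IBM", "Obama", "Mars"],
                        ["Paris", "IBM", "Obama", "UN", "Tokyo"])
# p = 3/4 = 0.75, r = 3/5 = 0.6
```

High precision with low recall means the extractor is cautious; the reverse means it over-generates, which is why the two are usually reported together.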

Text Clustering
Text clustering is an unsupervised learning approach that groups similar documents together without any labeled training data.
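A minimal sketch of text clustering is a greedy single-pass scheme over word sets using Jaccard similarity; production systems typically run k-means over TF-IDF vectors instead. The documents and threshold are invented for illustration:

```python
def jaccard(a, b):
    """Similarity of two token sets: |intersection| / |union|."""
    return len(a & b) / len(a | b)

def cluster(docs, threshold=0.2):
    """Assign each document to the first cluster whose seed is similar
    enough, otherwise start a new cluster with it."""
    clusters = []  # list of (seed token set, [document indices])
    for i, doc in enumerate(docs):
        tokens = set(doc.lower().split())
        for seed, members in clusters:
            if jaccard(tokens, seed) >= threshold:
                members.append(i)
                break
        else:
            clusters.append((tokens, [i]))
    return [members for _, members in clusters]

docs = ["stocks fell on monday",
        "stocks rose on friday",
        "the team won the cup final"]
groups = cluster(docs)
# The two market sentences share enough words to land in one cluster,
# while the sports sentence forms its own.
```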

Building Ontology
Ontology construction is a growing research topic [15]. Interesting associations among co-occurring concepts are extracted, and directed and transitive associations are represented. The proposed approach significantly reduces the number of associations; however, only association rule support and confidence are considered.
Dinh and Tamine [20]. Technique: vector space model, BOW + word positions. Description: semantic indexing and retrieval through domain concepts on biomedical documents; concept reference scoring is performed. Strength: facilitates and improves biomedical information indexing and retrieval. Limitation: extracted concepts are limited to the MeSH thesaurus.

Chin et al. [21]. Technique: machine learning and semantic approach. Description: abstract discourse-level word sense disambiguation is performed on word clusters. Strength: offers incremental learning capability for NLP-based systems. Limitation: limited to WordNet only.

Abebe et al. [22]. Technique: Natural Language Processing (NLP). Description: domain concepts and relations are extracted from program identifiers. Strength: supports concept location in the context of bug fixing. Limitation: program element names must be chosen carefully by the programmer.

Ajgalk et al. [23]. Technique: classification, PageRank algorithm. Description: document representation is shifted from keywords to key concepts, since keywords are sometimes ambiguous while concepts are not; the approach is evaluated against a TF-IDF keyword model. Strength: the extracted key concepts are satisfyingly accurate and convenient for online key concept extraction on the web. Limitation: only the lightweight ontology WordNet is utilized.

Szwed [24]. Technique: rule-based approach. Description: concepts are extracted based on correct morphological forms in Polish text; annotations prepared by the user are compiled into transformation rules. Strength: quite general and applicable to texts in other languages. Limitation: does not work for 3-gram translation patterns.

Yong-Bin et al. [25]. Technique: heuristic approach combining NLP, statistics, domain-specific knowledge and the inner structural patterns of terms (CFinder). Description: best key-concept candidates are selected based on metrics leveraging the inner structure of terms. Limitation: not efficient when phrase length is large.

Khin and Lynn [27]. Technique: statistical and linguistic rules. Description: ontology concepts are extracted from multiple texts of the same type using mutual information and domain frequency; the corpus comprises financial and economic reports in Chinese. Strength: extraction of relevant ontological concepts from multiple sources. Limitation: extracted concepts are limited to two-word phrases.

Wang et al. [28]. Technique: word frequency analysis, clustering. Description: user experience is evaluated from user feedback using text mining, together with grounded theory. Strength: results show the various factors that influence users' emotions. Limitation: external validity, i.e. further analysis of the original data and visual networks is required.

Prameswari et al. [29]. Technique: sentiment analysis, summarization. Description: online hotel reviews are mined to support the hospitality sector as an integral part of Indonesia's tourism industry. Strength: the two text mining techniques, summarization and sentiment analysis, are combined and interesting outcomes are observed. Limitation: sentiment graphs of positive and negative reviews are shown in only five categories.

Jiang et al. [30]. Technique: multi-label text classification. Description: an embedded model for multi-label text classification is proposed, based on ELM (Extreme Learning Machine) with L21-norm minimization of the hidden-layer output weight matrix. Strength: inherits the merits of ELM, facilitates group sparsity and reduces the complexity of the learning method; the proposed algorithm obtains superior performance. Limitation: takes more time than the original ELM.

Forman and Kirshenbaum [31]. Technique: text feature extraction for classification and indexing. Description: a fast method for text feature extraction that folds together Unicode conversion, forced lowercasing, word-boundary detection and string-hash computation; word and phrase features are represented as hash integers rather than strings. Strength: requires less computation and less memory. Limitation: a collision may occur in the hash function between an important, predictive feature and a more frequent word.

Chintan et al. [32]. Technique: supervised learning (classification). Description: text mining is used to identify and predict cases of child abuse from medical texts in a public health institution in the Netherlands. Strength: both structured and unstructured data are taken into account for prediction; a real dataset has been used for experiments.
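The hashed-feature idea attributed to Forman and Kirshenbaum [31] — representing word features as hash integers so that no string dictionary is needed — can be sketched generically as below. This is a standard hashing-trick illustration under our own assumptions, not their exact method:

```python
import zlib

def hashed_features(tokens, dims=16):
    """Map tokens into a fixed-size count vector via hashing. Distinct
    words may collide in the same bucket, trading a little accuracy for
    constant memory and no vocabulary lookup."""
    vec = [0] * dims
    for t in tokens:
        # zlib.crc32 is a stable stdlib hash, so results are reproducible.
        vec[zlib.crc32(t.lower().encode()) % dims] += 1
    return vec

v = hashed_features("Text mining brings shape to text".split())
# Six tokens total; "text" appears twice after case-folding, so at least
# one bucket holds a count of 2.
```

The collision risk in the table's limitation column is visible here: two unrelated words hashing to the same bucket become indistinguishable features.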

Sentiment Analysis
Sentiment analysis determines the attitude or polarity (positive, negative or neutral) expressed in a piece of text.
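The simplest form of the technique is lexicon-based scoring, sketched below. The toy lexicon is our own assumption; real systems, including the AYLIEN API used later in this paper, rely on much larger lexicons or trained models:

```python
import re

# A toy polarity lexicon (an assumption for illustration only).
LEXICON = {"good": 1, "great": 1, "excellent": 1,
           "bad": -1, "poor": -1, "terrible": -1}

def sentiment(text):
    """Sum the polarity of known words: >0 positive, <0 negative, else neutral."""
    score = sum(LEXICON.get(w, 0) for w in re.findall(r"[a-z]+", text.lower()))
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(sentiment("The coverage was great and the analysis excellent"))  # positive
print(sentiment("A terrible report with poor sourcing"))               # negative
```

Lexicon approaches stumble on negation and sarcasm, which is one reason trained models dominate in practice.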

APPLICATIONS OF TEXT MINING
TM has remained in use by government organizations, law enforcement agencies, news agencies, business intelligence and customer relationship management, as well as researchers. It has helped to track, trace and understand contacts between individuals, among organizations and across different ideologies. A task such as identifying whether a news story is the same as one told a year earlier, for example, cannot feasibly be performed manually, since it requires error-free processing and rapid response.
A computer, however, never tires or loses interest and can do such tasks in the blink of an eye. TM processes dense text to seek information that lies beneath the considered data for better decision-making, to gain indispensable business insights and to mitigate operational risks.
The prominent application domains where TM has played a key role in extracting in-depth knowledge are, for instance,

Ambiguity in Text:
Ambiguity in text is a major challenge.
For example, one word can have several different meanings ("bank" may denote a financial institution or a riverside), and multiple different words can convey one general meaning.

Mostly Uses Supervised Learning: Many TM techniques use supervised learning, which is useful only when a sufficient amount of labeled training data is available, and creating such training data for textual data is quite expensive.

THE PROPOSED FRAMEWORK
This section reports the proposed framework to bring shape to text data. The phases involved in the framework are described in the following subsection.

The Building Blocks of Proposed Framework
The proposed framework comprises four blocks (i.e. components) that shape the textual data into understandable visual information, as shown in Fig. 3. Filtering is applied to prune such words from the entire set of documents [39]. The dictionary and the computational scores are produced using a method similar to that in [16]. The proposed framework allows injecting any other method and/or technique for knowledge discovery; a similar approach has been adopted in [37] for web navigational pattern extraction.

Block-1: Web
In the scientific research literature, several methods have also been proposed to detect meaningful information from a variety of data, such as text from video [40], clickstream data [37] and relational databases of criminal activities [38].

Block-4: Visual Representation:
The extracted trends or patterns are presented as effective visual graphs, which help in gaining in-depth information about the considered set of documents. For instance, study [38] proposed a framework to visually assess trends of criminal activities, though that study applied the approach to a structured data format.
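Even a plain-text frequency chart can convey coverage trends of the kind this block produces; the news-topic counts below are invented for illustration:

```python
from collections import Counter

def bar_chart(counts, width=30):
    """Render term frequencies as a simple text bar chart, most frequent first."""
    top = counts.most_common()
    peak = top[0][1]
    lines = []
    for term, n in top:
        bar = "#" * max(1, round(width * n / peak))  # scale bars to the peak
        lines.append(f"{term:>10} | {bar} {n}")
    return "\n".join(lines)

mentions = Counter({"elections": 42, "economy": 28, "sports": 14})
print(bar_chart(mentions))
```

In the prototype itself, richer graph libraries would replace this sketch, but the principle — mapping extracted counts onto visual marks — is the same.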
The experimental results are discussed in the following subsection.

Results Discussion
This section reports the experimental results carried out on several events and topical models from the news.

The extracted information, represented in an effective manner, helps in gaining easy and potentially valuable knowledge about the patterns, trends and insights hidden in unstructured data formats. The above-mentioned results validate the effectiveness of the proposed framework, which relies on basic text processing functions and thus requires fewer computational resources to bring shape to unstructured (i.e. textual) data.