Role of FCBF Feature Selection in Educational Data Mining

Educational Data Mining (EDM) is an active area of Data Mining (DM) that is helpful in predicting the performance of students. Student performance prediction is important not only for students but also for academic organizations, which can use it to detect the causes of student success and failure. Furthermore, the features selected through student performance prediction models help in developing action plans for academic welfare. Feature selection can increase the accuracy of a prediction model. In a student performance prediction model every feature matters, because neglecting an important feature can lead to misguided academic action plans. Feature selection is therefore a very important step in the development of student performance prediction models. There are different types of feature selection algorithms. In this paper, the Fast Correlation-Based Filter (FCBF) is selected as the feature selection algorithm. This paper is a step toward identifying the factors that affect the academic performance of students. The performance of FCBF is evaluated on three different student datasets. FCBF is found to perform well on the student dataset with the greater number of features.


INTRODUCTION
Student performance prediction models have received a significant amount of attention from both the research community and the educational sector. Such models tackle the prediction of students' grades [1], Grade Point Average (GPA) [2], Cumulative Grade Point Average (CGPA) [3], and course pass/fail outcomes [4]. The goal of student performance prediction models in EDM is thus not only to achieve high prediction accuracy but also to help educational stakeholders predict the performance of students, so that they can make proactive decisions and develop strategies that enhance the quality of education and improve students' academic performance. Students are the main assets of any community, and the main aim of any academic organization is to provide quality education to its students. Moreover, quality education helps in building skilled and capable students. This calls for analyzing student data in a way that reveals the features affecting student performance. A lot of research is being conducted on the development of student performance prediction models, but existing models are still inadequate in predicting the performance of students [5]. This leads us to work on student prediction models in order to find a suitable method for developing a student performance prediction model that supports proactive decisions for the betterment of students' performance.
1 Email: mobashar@utar.edu.my
There are different techniques available for building student performance prediction models. There are two main approaches to predicting academic success: supervised and unsupervised methods. According to [6], around 71.4% of research articles on student performance prediction models use the classification method. It is the most widely used method for performance prediction models [7]. In the classification method, the target variable is clearly defined: it is predicted as a grade, GPA, CGPA, or student pass/fail outcome. This leads us to build the student performance prediction model with the classification method in order to identify the dominant features affecting students' final results.
Feature selection can play a prominent role in enhancing the accuracy of a prediction model. In a student prediction model, the selected features not only help increase prediction accuracy but also form the basis for strategic plans for the educational environment. The study in [8] deduced that the information gain attribute evaluator is the best feature selection technique for improving the effectiveness of a student prediction model, whereas [9] claims that the CFS subset evaluator is the best feature selection method for predicting students' final semester examination performance. According to [10], there is no single feature selection method that is accurate for all datasets, even within a common domain. There is therefore a need to identify the feature selection methods that matter for predicting the performance of students. The importance of feature selection methods in predicting students' performance motivated us to examine the performance of feature selection for student performance prediction.
There are mainly two types of feature selection methods: filter and wrapper feature selection. Filter feature selection has been used and recommended by different studies in EDM, and it is further divided into different types. In this paper, we focus on one of the most important filter feature selection algorithms, FCBF.
The contribution of this paper is that it examines the performance of FCBF on three different student datasets, making it easier for new researchers to learn how FCBF performs on datasets with different numbers of instances and different numbers of features. To the best of our knowledge, this is the first article in EDM that evaluates the filter feature selection algorithm FCBF on three different student datasets.
The outline of the paper is as follows. Section 2 describes the methods used in this research, Section 3 presents and discusses the results of the filter feature selection algorithm on the three datasets, and Section 4 presents the conclusion of the paper.

METHOD
In this study, the performance of the FCBF filter feature selection algorithm is evaluated on three different student datasets. FCBF is applied to each dataset, the Support Vector Machine (SVM) classification algorithm is then applied to the resulting datasets to produce predictions, and finally the findings are evaluated. Prediction accuracy, F-measure, precision, and recall are taken as the performance evaluation measures. Fig. 1 describes the flow of the main methodology of the research presented in this paper. Three benchmark datasets of students' academic records, DS33, DS16, and DS21, are selected to check the performance of FCBF. These datasets contain different numbers of instances and features and belong to different educational domains. The three datasets are given as input to the FCBF feature selection algorithm one by one to select features from each dataset. An SVM classifier is then trained on the dataset with the selected features, and finally tested and evaluated through the performance evaluation measures (precision, recall, F-measure, and accuracy).

Description of Datasets
Three different student datasets are taken from different sources. The description of the three datasets is given below.

DS33:
The DS33 dataset is a Portuguese secondary school student dataset. It has been used in different EDM studies [10][11][12]. It contains the records of 395 students taking the Mathematics subject. The dataset includes 33 features covering demographic, academic, and personal information of students.

DS16:
The second dataset, DS16, consists of 500 student records. There are 16 features in the dataset, including demographic details, academic details, and behavioral features of the students. The dataset was previously used by Elaf [13].

DS21:
The third dataset, DS21, was collected from different colleges in India. The dataset consists of 300 student records and 21 features. It was used by the study in [14].

Filter Feature Selection Algorithm
Feature selection is a significant pre-processing technique applied in machine learning methods. It is also important in other fields of research for making proper decisions [15]. Filter feature selection is a type of feature selection that maximizes an evaluation function through a search strategy in order to obtain the best feature subset [16]. Filters have three main stages: feature set generation, measurement, and testing by a learning algorithm. Filter feature selection algorithms run quickly, and they compute information from the features themselves, so their results depend on the measured information of the features [17]. Filter feature selection algorithms are chosen here because they work well with any classification algorithm and have lower computational complexity [18].
FCBF was proposed by [19]. It is a multivariate feature selection method that attempts to discover the best feature subset based on the goodness of features [19]. It starts with the full set of features and uses symmetrical uncertainty to calculate the dependence between features. Symmetrical Uncertainty (SU) is a normalized information-theoretic measure that uses the values of entropy and conditional entropy to calculate the dependencies between features. FCBF is a correlation-based feature subset selection method that is faster than other subset selection methods [20]. In EDM, FCBF has been used to rank the features of graduate students in United States universities in order to detect the factors behind the high dropout rate and low graduation rate of four-year college students [21]. The authors of [22] applied FCBF in the pre-processing stage to predict student interactions in an intelligent learning environment. Furthermore, that study suggested that FCBF is well suited to selecting features from student datasets, because the correlation between features is crucial in this kind of dataset.
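The SU-based selection described above can be illustrated with a minimal Python sketch. It is not the implementation used in this study. It assumes that all features and the class label are discrete (continuous features would first need discretization), that a feature is relevant when its SU with the class is strictly above a user-chosen threshold, and that a feature is redundant when its SU with an already selected feature is at least as large as its SU with the class. The function names and the toy feature names are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def conditional_entropy(x, y):
    """H(X | Y) for two equally long sequences of discrete values."""
    n = len(x)
    groups = {}
    for xi, yi in zip(x, y):
        groups.setdefault(yi, []).append(xi)
    return sum(len(g) / n * entropy(g) for g in groups.values())


def symmetrical_uncertainty(x, y):
    """SU(X, Y) = 2 * (H(X) - H(X | Y)) / (H(X) + H(Y)), a value in [0, 1]."""
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0:
        return 0.0  # both variables are constant, so there is nothing to share
    return 2.0 * (hx - conditional_entropy(x, y)) / (hx + hy)


def fcbf(features, target, threshold=0.0):
    """Select feature names from a dict mapping name -> list of discrete values.

    The threshold and the strict '>' comparison are assumptions of this sketch.
    """
    # Step 1: score each feature against the class and rank the relevant ones.
    relevance = {name: symmetrical_uncertainty(col, target)
                 for name, col in features.items()}
    ranked = sorted((n for n in relevance if relevance[n] > threshold),
                    key=lambda n: relevance[n], reverse=True)

    # Step 2: walk down the ranking and skip any feature that is at least as
    # correlated with an already selected feature as it is with the class.
    selected = []
    for name in ranked:
        redundant = any(
            symmetrical_uncertainty(features[s], features[name]) >= relevance[name]
            for s in selected
        )
        if not redundant:
            selected.append(name)
    return selected


# Toy data with hypothetical feature names (not taken from DS33, DS16 or DS21).
toy_target = ["pass", "pass", "fail", "fail"]
toy_features = {
    "prior_grade": ["high", "high", "low", "low"],   # fully informative
    "prior_grade_copy": ["A", "A", "C", "C"],        # duplicate information
    "coin_flip": [0, 1, 0, 1],                       # unrelated to the class
}
toy_selected = fcbf(toy_features, toy_target)
```

On this toy data, the duplicated feature is dropped as redundant and the unrelated feature is dropped as irrelevant, which is the behaviour the filter is designed for.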

Classification Algorithm
There are two main methods in data mining: supervised and unsupervised. Classification is a type of supervised method. According to the existing literature on EDM, it is the method most frequently used to predict the performance of students [23][24][25].
Many classification algorithms are used in student performance prediction models, such as Decision Tree, Neural Network, Naïve Bayes, Random Forest, AdaBoost, and SVM. In this research work, SVM is used as the classification algorithm for student performance prediction.
SVM: SVM is a type of classification algorithm. It has been applied in a number of research areas, including face recognition, three-dimensional (3D) object recognition, text and image classification, and EDM. It has the distinctive benefit of solving small-sample, nonlinear, and high-dimensional pattern recognition problems [26]. SVM can use a Gaussian kernel function, with whose help the complex relationships between the given data points can be captured. SVM is appropriate for feature selection problems [27]. In this research, we have used SVM with a linear kernel. Equation (1) presents the linear kernel, where x_i and x_j denote data points. SVM classification with a linear kernel is very simple, and training with a linear kernel is typically faster than with other kernels.
Linear kernel: $K(x_i, x_j) = x_i^{T} x_j$ (1)
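As an illustration of Equation (1), the following Python sketch computes the linear kernel as a plain dot product and builds the kernel (Gram) matrix for a small set of points. The function names are hypothetical and the numbers are toy values, not data from the three datasets. An SVM with a linear kernel works with exactly these pairwise values.

```python
def linear_kernel(x_i, x_j):
    """K(x_i, x_j) = x_i^T x_j: the dot product of two feature vectors."""
    if len(x_i) != len(x_j):
        raise ValueError("vectors must have the same length")
    return sum(a * b for a, b in zip(x_i, x_j))


def gram_matrix(points):
    """Kernel matrix whose (i, j) entry is K(points[i], points[j])."""
    return [[linear_kernel(p, q) for q in points] for p in points]


# Toy vectors chosen only to make the arithmetic easy to follow.
k_value = linear_kernel([1, 2, 3], [4, 5, 6])   # 1*4 + 2*5 + 3*6
k_matrix = gram_matrix([[1, 0], [0, 2], [1, 1]])
```

Because the linear kernel needs no extra parameters and no nonlinear transformation, each kernel evaluation costs only one pass over the features.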

RESULTS AND DISCUSSION
In this section, we present the results of FCBF filter feature selection on all three datasets, DS33, DS16, and DS21. The results are evaluated using different evaluation measures. First, we show the previous results of FCBF using one-fold cross-validation on the three student datasets. Datasets 1 and 2 share two categories of features, Demographic (DF) and Academic (AF); in addition, dataset 1 has Lifestyle Information (LF1) and dataset 2 has Behavioral Features (BEF). Dataset 2 also includes features regarding parents' participation in the learning process (PPL). The third dataset has features regarding the demographic, academic, and socio-economic information of students. FCBF shows the highest accuracy on dataset 1 and the lowest on dataset 3, which has the lowest number of instances of the three. The results show that academic background is a very important category of features for predicting the performance of students, while student behavior and socio-economic factors also influence student performance. This motivated us to check the performance of FCBF on the three datasets using 10-fold cross-validation and different measures. Fig. 2 presents the comparison of prediction accuracy using FCBF on the three selected student datasets. The results show that FCBF achieves its best accuracy on DS33 and its lowest on DS21, the dataset with the fewest instances. Accuracy is the ratio of correct predictions to all predictions.

The accuracy is defined as
$\text{Prediction Accuracy} = \frac{\text{number of correct predictions}}{\text{total number of predictions}}$ (2)
Precision is the fraction of the retrieved instances that belong to the target class. The precision formula is presented through Equation (4).

Recall on Three Datasets
Fig. 5 shows a comparison of the recall of FCBF on the three datasets. The results show that FCBF performs best on DS33, whereas good results are not observed on DS21. On DS16, it achieves a recall of 66%.
The recall formula is presented through Equation (5).
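The four evaluation measures used in this section can be sketched in Python as follows. The sketch uses the standard textbook definitions for a two-class problem such as pass/fail, which is an assumption here: the experiments in this paper may involve more than two classes and may average the measures across classes. The function name, the label encoding, and the toy labels are hypothetical and are not results from the paper.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F-measure for one target class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    total = tp + fp + fn + tn
    # Each ratio falls back to 0.0 when its denominator is zero.
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f_measure": f_measure}


# Toy labels: 1 = pass, 0 = fail (made-up values for illustration only).
toy_true = [1, 1, 1, 0, 0, 0, 0, 0]
toy_pred = [1, 1, 0, 1, 0, 0, 0, 0]
toy_scores = binary_metrics(toy_true, toy_pred)
```

In this toy case there are two true positives, one false positive, one false negative, and four true negatives, so accuracy is 6/8 while precision and recall are both 2/3.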

CONCLUSION
Student performance prediction is a very important area of research: it is not only an interesting field for researchers in EDM but also beneficial for all educational stakeholders. Feature selection helps EDM in developing high-accuracy student prediction models. In this paper, we have evaluated the performance of FCBF. In terms of accuracy, F-measure, precision, and recall, FCBF shows its best results on DS33, whereas it does not perform up to the mark on DS21. The results indicate that FCBF performs satisfactorily on a student dataset with a large number of features. Moreover, FCBF does not give good results on a dataset with a small number of instances. It is therefore recommended to use FCBF feature selection on datasets with a large number of features. In the future, we will evaluate different feature selection algorithms on student datasets to compare their performance.