An Arabic Mispronunciation Detection System Based on the Frequency of Mistakes for Asian Speakers

Over the last few decades, the fields of artificial intelligence and machine learning have advanced considerably. As a result, much work has been done on Computer-Assisted Language Learning (CALL), which uses computers to assist language learning. Mispronunciation detection is one of the significant tasks of a CALL system. An efficient mispronunciation detection model benefits second language learners by providing phoneme-level feedback. In this paper, we introduce a phone grouping technique for mispronunciation detection that is based on mistake probability. We treat mispronunciation detection as a classification problem. Traditionally, a separate classifier is trained for each phoneme mistake, which requires a lot of memory and time. Instead, we group the phonemes by their mistake probability, which reduces the number of classifiers to be trained and saves memory and time. We use the Support Vector Machine (SVM) classifier and test it on an Arabic dataset (28 phonemes). The model is evaluated using the confusion matrix and achieves an accuracy of 88%. Our approach outperforms the existing systems developed for Arabic phonemes in terms of accuracy and is also time and memory efficient.


INTRODUCTION
Due to advancements in technology, the world has become a global village. People living in different parts of the world can easily communicate with one another, so there is an increasing demand for learning new languages [1]. Arabic is the fifth most prevalent language in terms of native speakers [2]. Speech technology has improved dramatically over the last decade, and by combining speech technology with machine learning techniques, many more useful intelligent CALL systems have been developed. In pronunciation scoring, an overall pronunciation score is calculated at the global level. These global scores are not very useful in pronunciation training, because learners are more concerned with the nature of the errors they make than with an overall score. Pronunciation scoring determines the speaker's proficiency in the language and is used to test different pronunciation applications. Mispronunciation detection requires calculating pronunciation scores at the local level, which is usually the phoneme level. Pronunciation scoring and mispronunciation detection therefore have different goals and produce different results.
Mispronunciation detection, on the other hand, can point out pronunciation mistakes and provide feedback at the phoneme level. There are many reasons for mispronunciation, such as the speaker's native speaking style or the speaker's unfamiliarity with words. Pronunciation errors are classified into phonemic errors and prosodic errors [5,6]. Phonemic errors are related to phones: a phoneme may be substituted with a similar phoneme, or phones may be added or deleted, and all these changes alter the sound. Prosodic errors are more difficult to categorize because they include errors of stress, rhythm, and intonation [4].
Second language learners make pronunciation mistakes frequently. In particular, when the target language contains phonemes that are not found in the learner's native language, learners replace these phonemes with ones that exist in their native language. Automatic detection of such errors is a fundamental and essential procedure in CALL frameworks [7].
In this paper, we propose a classifier-based approach for mispronunciation detection of Arabic phonemes. We treat mispronunciation detection as a classification problem. Traditionally, a separate classifier is trained for each phone mistake, which takes a lot of memory and training time. To address this problem, we categorize the data into groups and train only one classifier for each whole group instead of training a separate classifier for each pronunciation mistake. This grouping technique enhances the performance of the classifiers and is more efficient in terms of space and time.
The remainder of the paper is organized as follows. In Section 2, we present a detailed overview of related approaches for mispronunciation detection. In Section 3, we describe our proposed methodology and provide details of each step. In Section 4, we discuss the experiments and results and compare our approach with state-of-the-art approaches.

LITERATURE REVIEW
Mispronunciation detection systems can be categorized into three main groups: posterior probability-based methods, classifier-based methods, and deep-learning-based methods.

Posterior probability-based Methods
The initial work in this field started in the 1990s, when different scoring algorithms were proposed for error detection. Kim et al. [8] presented three Hidden Markov Model (HMM) based scores: 1) HMM-based log-likelihood scores, 2) HMM-based posterior probability scores, which later became an accepted standard, and 3) segment-duration-based scores. Similarly, the Goodness of Pronunciation (GOP) score uses a log-probability-based score. In posterior probability-based methods, different methodologies have been used. Witt et al. [9] introduced the GOP strategy to check the quality of pronunciation and combined the standard GOP strategy with a few refinements that improve scoring performance. The GOP score is computed in equation (1) as $\mathrm{GOP}(q) = \left|\log P(q \mid s)\right| / d$ (1), where s is the sequence of observations, q is the phone label, and d is the duration of the examined audio segment in frames. Zhang et al. [10] proposed the Scaled Log-Posterior Probability (SLPP) and weighted phone SLPP methods to improve the assessment of pronunciation quality. Hindi et al. [11] calculated the GOP score to identify pronunciation mistakes in five Arabic phonemes that were frequently mispronounced by non-native Arabic speakers. In the same manner, Kawai et al. [12] utilized log-probability scores in constrained arrangement mode. Extended versions of probability-based scores were effectively utilized by Mak et al. [13]. Posterior probability-based methods can assess pronunciation quality, but these scoring algorithms cannot identify the type and exact location of an error, so classifier-based techniques are used for this purpose.

Classifier-based Methods
In classifier-based approaches, Truong et al. [14] used Linear Discriminant Analysis or a decision tree for mispronunciation detection of three sounds (A, Y, and X) that are frequently mistaken by L2 learners (foreign/second language learners) of Dutch. Ito et al. [15] proposed a decision-based clustering technique to enhance the accuracy of error detection. They developed clusters of pronunciation rules and defined a threshold for each cluster. Amdal et al. [16] differentiated between short and long vowels by using acoustic-phonetic features and combined them in a Linear Discriminant Analysis (LDA) classifier. Georgoulas et al. [17] used SVM to detect speech articulation and to classify speech sounds. Strik et al. [3] compared four different approaches (GOP, decision tree, LDA-APF (acoustic-phonetic features), and LDA-MFCC (Mel Frequency Cepstral Coefficients)) for automatic pronunciation error detection. The comparative analysis showed that both LDA-APF and LDA-MFCC yielded better results than GOP scores and the decision tree. Wei et al. [4] presented an SVM framework with pronunciation space models to improve performance. Tongmu Zhao et al. [18] developed a system for error detection on eight confusing phonemes of Chinese using an SVM classifier with structural features. Yoon et al. [19] presented a confidence scoring method and a landmark-based SVM method to detect mispronunciation. The combination of both methods did not provide significant improvement when the data was not trained appropriately. Yang et al. [20] used six different classifiers (decision trees, random forest, gradient boosting, SVM with a linear kernel, SVM with a radial basis function kernel, and binomial logistic regression) for classification, and among those the support vector classifiers and logistic regression performed best. Maqsood et al. [21] developed an acoustic-phonetic feature-based Computer Assisted Pronunciation Training (CAPT) system for the most confusing Arabic phoneme pairs (/ط/ vs /ت/) and (/ح/ vs /خ/ or /هـ/). They applied four classifiers (Random Forest, Naïve Bayes, AdaBoost, and K-NN) to a dataset of 200 speakers and compared their performance; the results showed that the Random Forest classifier performed better than the other classifiers. Maqsood et al. [22] developed a system for mispronunciation detection of five phonemes of Arabic using the SVM classifier.

Deep-Learning based Methods
In deep-learning-based approaches, Lee et al. [23] used Deep Belief Network (DBN) posteriorgrams to detect word-level mispronunciation. DBNs have been effectively utilized for phone recognition with MFCCs or filterbank coefficients as input [24,25]. Li et al. used deep belief networks for lexical stress detection and demonstrated that the DBN performed better than the Gaussian Mixture Model. Hu et al. [11] proposed a Deep Neural Network (DNN) based approach to acoustic modeling of a tonal language. Joshi et al. [26] proposed a method for vowel mispronunciation detection using a DNN with cross-lingual training. Gao et al. [27] aimed at the robust detection of Pronunciation Erroneous Tendency (PET), proposed a DNN-HMM framework for error detection, and used three types of acoustic features, namely MFCC, Perceptual Linear Predictive (PLP), and filterbank features. Hu et al. [29] extended the GOP algorithm from the traditional GMM-HMM to DNN-HMM to detect phone-level mispronunciation and diagnose the tone of the L2 learner. Li et al. [30] focused on mispronunciation detection at the segmental and sub-segmental levels. They used speech attributes (voicing and aspiration) and deep neural network classifiers to address mispronunciation detection and diagnostic feedback. At the sub-segmental level, they used speech attribute scores to measure the pronunciation quality, and then they integrated the scores using NN classifiers to produce segmental-level pronunciation scores. Li et al. [31] proposed an Acoustic Phonological Model that used a multi-distribution DNN for mispronunciation detection and diagnosis.

Features used for mispronunciation detection
Apart from the methods used to detect pronunciation mistakes, an important aspect of a mispronunciation detection technique is the extraction of discriminative features that efficiently represent pronunciation variations. Researchers have used different types of features, including features based on confidence measures and log-likelihood scores, acoustic-phonetic features, statistical features, structural features, and combinations of many features. However, the most discriminative pronunciation features are still to be identified. The literature has highlighted that performance can be improved through the use of better classifiers [17,19,25], but this increases the computational cost. Therefore, there is a need for a system that can effectively and efficiently detect and classify mistakes in phonemes. We have used acoustic-phonetic features in our research to improve the efficiency of the system. Table 1 presents the details of the features used by researchers for pronunciation training.

PROPOSED METHODOLOGY
The flowchart of our proposed methodology is shown in Fig. 1. The first step is to extract the features from the audio signals labeled with phoneme class C = {c1, c2, …, cn}, where n represents the total number of phoneme classes. In the next step, we preprocess the data to clean it and remove sparsity, and then we apply dimensionality reduction to select the most discriminative features. After dimensionality reduction, we divide the Arabic dataset into two groups: frequently mistaken phonemes and less mistaken phonemes. Finally, we train the model using SVM, Naïve Bayes, and KNN on those discriminative features to get optimal results. We use k-fold cross-validation, in which the test data from each fold is passed to the trained classifier to find the phoneme label.
Algorithm 1 represents the sequence of steps. First, we extract the acoustic-phonetic features from each audio file present in dataset D (Line 2). If dataset D contains missing values, we apply pre-processing steps such as data cleaning and imputation of missing values (Lines 3 and 4). After that, we apply a feature selection algorithm to choose the most discriminative features for mispronunciation detection (Lines 6 and 7). After the feature selection step, we divide the dataset into two groups, frequently mistaken phonemes Pfm and less mistaken phonemes Plm (Line 9). We train a separate classifier for each of the two groups using k-fold cross-validation (Lines 11 and 12) to detect mispronunciation. We present the details of each step in the following subsections.

STEPS:
1. For each audio signal in the dataset D do
2. Extract acoustic-phonetic features (APF)
3.

Feature Extraction
To extract features from an audio file, we first divide the speech signal into small frames of 20 ms with 10 ms overlap and apply signal processing techniques to extract acoustic-phonetic features such as pitch, MFCC, energy, and formants from these frames. These features are called low-level descriptors. We also extract global statistical features such as mean, min, standard deviation, and slope by combining different frames; these features represent the global trend of a signal. Table 2 shows the acoustic-phonetic features used in this research work.
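The framing step and the global statistics described above can be sketched as follows. This is a minimal illustration, not the implementation used in this work: the function names, the plain-list signal representation, and the least-squares definition of the slope are our assumptions.

```python
def frame_signal(samples, sample_rate, frame_ms=20, overlap_ms=10):
    """Split a list of samples into overlapping frames (20 ms frames, 10 ms overlap by default)."""
    frame_len = int(sample_rate * frame_ms / 1000)
    hop = frame_len - int(sample_rate * overlap_ms / 1000)
    if hop <= 0:
        raise ValueError("overlap must be shorter than the frame")
    return [samples[start:start + frame_len]
            for start in range(0, len(samples) - frame_len + 1, hop)]


def mean_std_slope(values):
    """Global statistics over per-frame values: mean, standard deviation, least-squares slope."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    t_mean = (n - 1) / 2
    denom = sum((t - t_mean) ** 2 for t in range(n))
    # Slope of the best-fit line through (frame index, value); assumed definition.
    slope = (sum((t - t_mean) * (v - mean) for t, v in enumerate(values)) / denom
             if denom else 0.0)
    return mean, std, slope
```

At the 44100 Hz sampling rate of the dataset, a 20 ms frame holds 882 samples and consecutive frames start 441 samples apart.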
We use these features in our research work because they are the main features used in the literature, as listed in Table 1, and we also combine the statistical features with the other features to obtain good results. Each local-level descriptor is combined with each global statistical feature to form multiple features. For example, pitch, a local-level descriptor, is combined with the global descriptors mean, standard deviation, slope, period frequency, period amplitude, and entropy to form six features (Pitch_mean, Pitch_std, Pitch_slope, Pitch_period Frequency, Pitch_period Amplitude, Pitch_Entropy), and all other features are combined similarly to form 284 features. We provide details of each feature as follows.

Pitch
We define the pitch of a sound as the frequency of its vibrations. Sound is produced when air forced from the lungs passes through the vocal folds. The fundamental frequency, or pitch, of the sound is the frequency at which the vocal folds vibrate. Automatic speech recognition applications widely use the pitch of the sound. It has also proved useful for mispronunciation detection.

Roll-Off
The roll-off is defined as the frequency below which 95% of the power of the signal is concentrated. It is also a measure of spectral shape and yields higher values for signals with more high-frequency content. Therefore, it can be assumed that a strong correlation exists between these features. The roll-off frequency f is computed in equation (2) as $\sum_{k=0}^{f} |X_k|^2 = 0.95 \sum_{k=0}^{N-1} |X_k|^2$ (2). Here $X_k$ represents the discrete Fourier transform of the signal x and N is the number of frequency bins. The left-hand side of the equation is the sum of the power below the frequency value f, and the right-hand side is 95% of the total energy of the signal.
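The 95% roll-off can be illustrated with a short sketch. It uses a naive DFT to stay self-contained and sums power over the non-negative frequency bins only; both choices, and all function names, are our assumptions rather than the exact procedure used in this work.

```python
import cmath
import math


def power_spectrum(x):
    """Naive DFT power |X_k|^2 for the non-negative frequency bins k = 0 .. N//2."""
    n = len(x)
    return [abs(sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))) ** 2
            for k in range(n // 2 + 1)]


def spectral_rolloff(x, sample_rate, fraction=0.95):
    """Lowest frequency (Hz) at which the cumulative power reaches `fraction` of the total."""
    power = power_spectrum(x)
    target = fraction * sum(power)
    running = 0.0
    for k, p in enumerate(power):
        running += p
        if running >= target:
            return k * sample_rate / len(x)
    return (len(power) - 1) * sample_rate / len(x)
```

For a pure sinusoid almost all power sits in one bin, so the roll-off coincides with the tone frequency; broadband or noisy frames push it upward.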

Entropy
The entropy feature has been used in speech recognition applications to detect the voiced and unvoiced regions of a speech signal. Spectral entropy is a measure of signal complexity. It captures the formants, or the peakiness, of a distribution. Formants and their locations play an important part in speech tracking. We compute the entropy of a speech signal in equations (3) and (4) as $H = -\sum_i p_i \log p_i$ (3) with $p_i = X_i / \sum_j X_j$ (4), where $X_i$ represents the Power Spectral Density (PSD) of the speech signal and $p_i$ represents the normalized PSD. $X_i$ is calculated in equation (5) as $X_i = \frac{1}{N} |S_i|^2$ (5), where $S_i$ is the spectrum of the speech signal and N is the number of frequency bins.
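A minimal sketch of spectral entropy follows, assuming the usual definition as the Shannon entropy of the normalized power spectral density. The base-2 logarithm, the restriction to non-negative frequency bins, and the function name are our assumptions.

```python
import cmath
import math


def spectral_entropy(frame):
    """Shannon entropy (in bits) of the normalized power spectral density of one frame."""
    n = len(frame)
    # PSD X_i = |S_i|^2 / N over the non-negative frequency bins (naive DFT).
    psd = [abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))) ** 2 / n
           for k in range(n // 2 + 1)]
    total = sum(psd)
    if total == 0:
        return 0.0
    probs = [p / total for p in psd]
    return -sum(p * math.log2(p) for p in probs if p > 0)
```

A peaky spectrum (a single tone) gives entropy near zero, while a flat spectrum (an impulse) gives the maximum, the logarithm of the number of bins.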

Cepstrum
The Mel scale is a scale of pitches that listeners judge to be equally spaced, and it is employed by MFCCs. A frequency f in hertz is converted to the Mel scale by equation (6) as follows: $m = 2595 \log_{10}(1 + f/700)$ (6). The cepstrum is a measure used to extract information from a person's speech signal. We apply the logarithm to the signal spectrum and then take the inverse Fourier transform to obtain the cepstrum. Mathematically it is expressed in equation (7).
The cepstrum is the inverse Fourier transform (IFT) of the logarithm of the estimated spectrum of a signal.
$C(n) = \mathrm{DFT}^{-1}\{\log |\mathrm{DFT}\{x(n)\}|\}$ (7), where DFT represents the discrete Fourier transform and $\mathrm{DFT}^{-1}$ is the inverse DFT. The cepstrum contains information about the rate of change in the spectrum bands. The spectrum is first transformed by the Mel scale to give MFCCs, which are used for speech recognition. We retain the high coefficients if we are interested in the excitation signal, and we keep the low coefficients if we are interested in the vocal tract. Cepstral coefficients are a compressed representation of the spectral envelope. It can be shown that cepstral coefficients are not correlated. This information is useful, which is why speech recognition applications widely use cepstral coefficients.
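The Mel conversion and the cepstrum can be sketched as below. We assume the commonly used Mel formula with constants 2595 and 700, a naive DFT, and a small `eps` added before the logarithm to avoid log(0); the function names are illustrative.

```python
import cmath
import math


def hz_to_mel(f_hz):
    """Convert a frequency in hertz to the Mel scale (common 2595/700 variant, assumed)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def dft(x, inverse=False):
    """Naive discrete Fourier transform; the inverse is scaled by 1/N."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [sum(x[t] * cmath.exp(sign * 2j * math.pi * k * t / n) for t in range(n))
           for k in range(n)]
    return [v / n for v in out] if inverse else out


def real_cepstrum(x, eps=1e-12):
    """Inverse DFT of the log magnitude spectrum of x."""
    log_magnitude = [math.log(abs(v) + eps) for v in dft(x)]
    return [c.real for c in dft(log_magnitude, inverse=True)]
```

The low-index coefficients of `real_cepstrum` describe the smooth spectral envelope (vocal tract), and the high-index ones describe the excitation.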

Zero-Crossing Rate
Zero-Crossing Rate (ZCR) measures how often a signal crosses the zero axis; in other words, it counts the number of times within a given frame that the signal amplitude changes sign from positive to negative or vice versa. ZCR is a time-domain feature that is determined by the signal frequency, and it is a robust and discriminative feature for differentiating sound signals. We compute ZCR for a signal S of length T in equation (8) as follows: $\mathrm{ZCR} = \frac{1}{2(T-1)} \sum_{t=1}^{T-1} \left|\mathrm{sgn}(S_t) - \mathrm{sgn}(S_{t-1})\right|$ (8), where sgn(x) is 1 for x ≥ 0 and −1 otherwise. The zero-crossing value is low for periodic sounds and high for noisy sounds. Furthermore, to capture the zero crossings of the input speech signal, the sampling rate should be sufficiently high. Another important step is to normalize the input signal before calculating the zero-crossing rate. The zero-crossing rate is an important parameter for mispronunciation detection techniques.
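As a small illustration, the zero-crossing rate of one frame can be computed by counting sign changes between consecutive samples. Treating a zero sample as non-negative and dividing by the number of sample pairs are our assumptions.

```python
def zero_crossing_rate(frame):
    """Fraction of consecutive sample pairs whose signs differ (zero counts as non-negative)."""
    if len(frame) < 2:
        return 0.0
    crossings = sum(1 for prev, cur in zip(frame, frame[1:])
                    if (prev >= 0) != (cur >= 0))
    return crossings / (len(frame) - 1)
```

A rapidly alternating frame such as `[1, -1, 1, -1]` yields 1.0, while a frame that never changes sign yields 0.0.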

Energy features
In a speech signal, the power of the signal at a given time is called energy. Energy can also be described as the pressure exerted by the lungs and passed through the vocal tract. The signal amplitude varies with time due to variation in pronunciation. The amplitude of a spoken section changes considerably compared with an unspoken section of the speech signal. Correctly pronounced phonemes have different amplitude variation compared with mispronounced phonemes. These amplitude variations are represented by short-time energy, so energy is considered a potential feature for discriminating speech signals. The energy of the discrete-time signal $S_t$ is computed in equation (9), and the signal power that appears in it is computed in equation (10). Low short-time energy is characterized by the proportion of speech frames whose short-time energy is less than 0.5 times the average short-time energy in a one-second window. We compute it in equation (11) as $\mathrm{LSTER} = \frac{1}{2T} \sum_{t=1}^{T} \left[\mathrm{sgn}\left(0.5\,\overline{STE} - STE(t)\right) + 1\right]$ (11), where T is the total number of frames, STE(t) is the short-time energy of the t-th frame, and $\overline{STE}$ is the average short-time energy in a one-second window.
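The low short-time energy measure described above can be sketched as follows. We assume that the short-time energy of a frame is the sum of its squared samples and that the frames passed in together span the one-second window; neither detail is confirmed by the text, and the function names are illustrative.

```python
def short_time_energy(frame):
    """Short-time energy of one frame, taken here as the sum of squared samples (assumed)."""
    return sum(s * s for s in frame)


def low_energy_ratio(frames):
    """Fraction of frames whose energy is below 0.5 times the average energy of all frames."""
    energies = [short_time_energy(f) for f in frames]
    average = sum(energies) / len(energies)
    return sum(1 for e in energies if e < 0.5 * average) / len(energies)
```

For example, four frames with energies 2, 2, 0, and 18 have an average of 5.5, so three of the four fall below the 2.75 cut-off and the ratio is 0.75.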

Root Mean Square
The Root Mean Square (RMS) value represents the average power of a signal and is related to the amplitude of the signal. We compute RMS by squaring the signal amplitude, averaging over the time period, and taking the square root of the result, as in equation (12): $\mathrm{RMS} = \sqrt{\frac{1}{T} \sum_{t=1}^{T} S_t^2}$ (12). RMS is proportional to the effective power of the signal and is an important feature for discriminating correctly pronounced and mispronounced audio signals.
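The three steps named above (square, average, square root) translate directly into code. The function name is illustrative.

```python
import math


def rms(frame):
    """Root mean square of a frame: square each sample, average, take the square root."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))
```

For a sinusoid of amplitude 1 sampled over whole periods, this gives approximately 0.707, that is, the amplitude divided by the square root of 2.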

Spectral Features
The spectral features can be described as properties of the speech signal in the frequency domain other than the fundamental frequency F0. Formants are the most widely used spectral features of a speech signal. The speech spectrum is passed through a bank of band-pass filters whose center frequencies are based on human perception scales and are exponentially spaced. To model these frequencies, researchers have proposed two different scales, the Bark scale and the Mel scale. The Bark scale is defined in equation (13) as $\mathrm{Bark}(f_r) = 13 \arctan(0.00076 f_r) + 3.5 \arctan\left((f_r / 7500)^2\right)$ (13), where $f_r$ is the frequency in hertz.
Spectral features are then extracted from these signals [36,37].
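The Bark conversion can be written in a few lines. The constants follow the widely used arctangent approximation of the Bark scale, which we assume is the one intended here; the function name is illustrative.

```python
import math


def hz_to_bark(f_hz):
    """Convert a frequency in hertz to the Bark scale (arctangent approximation, assumed)."""
    return 13.0 * math.atan(0.00076 * f_hz) + 3.5 * math.atan((f_hz / 7500.0) ** 2)
```

The mapping is monotonic and compresses high frequencies: 1000 Hz maps to roughly 8.5 Bark.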

Data Pre-Processing
Data is often incomplete and inconsistent, so it is essential to preprocess the data before applying any machine learning algorithm so that the analysis yields optimal results. Our dataset is sparse (it contains missing values), so we apply a numerical cleaner filter [38] that detects and marks missing values in the dataset and then apply a replace-missing-values filter. This filter imputes each missing value with the mean of the data distribution. The cleaned and completed data is then fed to the feature selection process, which enhances the performance of the trained model.
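Mean imputation of missing values can be illustrated as follows. This sketch stands in for the filter named above and is not its actual code: marking missing values with `None`, imputing per feature column, and the function name are our assumptions.

```python
def impute_with_mean(rows):
    """Replace each None by the mean of the observed values in the same column."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        present = [row[j] for row in rows if row[j] is not None]
        # A column with no observed value falls back to 0.0 (assumed convention).
        means.append(sum(present) / len(present) if present else 0.0)
    return [[means[j] if row[j] is None else row[j] for j in range(n_cols)]
            for row in rows]
```

For instance, `impute_with_mean([[1.0, None], [3.0, 4.0], [None, 6.0]])` fills the gaps with the column means 2.0 and 5.0.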

Feature Selection
Feature selection aims at picking those features that discriminate among classes. The dataset contains 284 features, but not all of them are significant; a subset of discriminative features plays the important role in classification decisions. In our proposed methodology we use the Relief-F attribute evaluation technique for feature selection. Relief-F is an extension of the Relief feature selection procedure: Relief deals only with binary classification data, whereas Relief-F is designed to handle multiclass problems [39].
The Relief-F attribute selection method randomly selects an instance $X_i$ and then searches for its n nearest neighbors from the same class, called nearest hits H, and n nearest neighbors from each of the other classes, called nearest misses M, and updates the weights of all attributes. The weights are calculated using equation (14): $W[A] := W[A] - \sum_{j=1}^{n} \frac{\mathrm{diff}(A, X_i, H_j)}{m \cdot n} + \sum_{C \neq \mathrm{class}(X_i)} \frac{\Pr(C)}{1 - \Pr(\mathrm{class}(X_i))} \sum_{j=1}^{n} \frac{\mathrm{diff}(A, X_i, M_j(C))}{m \cdot n}$ (14), where diff(A, X, Y) is the difference between the values of attribute A for instances X and Y. This process is repeated m times, and the user-defined parameter n controls the number of nearest hits and misses. The Relief-F attribute evaluation filter provides a ranked list of attributes using the ranker search method, and a threshold is also required. The ranker method is used in combination with the feature evaluation method and ranks features by their individual evaluations. We set n = 10 and, after ranking the data with multiple thresholds, set the threshold to 0.0181, which provides the best features for further processing. We discard all attributes below that threshold from the ranked list and retain those above it. We retain 135 features at this threshold.

Algorithm 2 describes the steps to calculate the weighted attribute list. Initially, all attribute weights are set to zero (Line 1). The Relief-F algorithm randomly chooses an instance Xa (Line 3) and finds its nearest hits Hn, the n nearest neighbors from the same class (Line 4), and its nearest misses Mn(Cl), the n nearest neighbors from each different class (Lines 5 and 6). It then updates the weight estimate Weights[Attribute] for every attribute, depending on the attribute's values for Xa, the hits Hn, and the misses Mn(Cl) (Lines 6, 7, and 8). The update formula (Lines 7 and 8) averages the contributions of all the hits and all the misses. The contribution of each class of misses is weighted by the prior probability of that class, Pr(Cl). Because the class of the hits is absent from the sum, each probability weight is divided by the factor 1 − Pr(Cl(Xa)). The procedure is repeated k times.
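The weight update described above can be sketched in plain Python. This is an illustrative reading of Relief-F, not the implementation used in this work: we assume numeric features, a range-normalized absolute difference as the diff function, Manhattan distance for the neighbor search, and a fixed random seed; all names are ours.

```python
import random


def relieff(X, y, n_neighbors=10, n_iterations=50, seed=0):
    """Return one Relief-F weight per feature for numeric data X with class labels y."""
    rng = random.Random(seed)
    n_features = len(X[0])
    # Value range of each feature, used to scale differences into [0, 1].
    spans = []
    for j in range(n_features):
        column = [row[j] for row in X]
        spans.append((max(column) - min(column)) or 1.0)
    priors = {c: y.count(c) / len(y) for c in set(y)}

    def diff(j, a, b):
        return abs(a[j] - b[j]) / spans[j]

    def nearest(idx, cls):
        # The n nearest instances of class `cls`, excluding the sampled instance.
        candidates = [i for i in range(len(X)) if i != idx and y[i] == cls]
        candidates.sort(key=lambda i: sum(diff(j, X[idx], X[i]) for j in range(n_features)))
        return candidates[:n_neighbors]

    weights = [0.0] * n_features          # all weights start at zero
    for _ in range(n_iterations):
        idx = rng.randrange(len(X))       # randomly chosen instance
        hits = nearest(idx, y[idx])
        misses = {c: nearest(idx, c) for c in priors if c != y[idx]}
        for j in range(n_features):
            for h in hits:                # hits lower the weight
                weights[j] -= diff(j, X[idx], X[h]) / (n_iterations * n_neighbors)
            for c, group in misses.items():
                # Misses raise the weight, scaled by the class prior.
                scale = priors[c] / (1.0 - priors[y[idx]])
                for m in group:
                    weights[j] += scale * diff(j, X[idx], X[m]) / (n_iterations * n_neighbors)
    return weights
```

Features whose weight exceeds a chosen threshold would then be retained; a feature that separates the classes receives a clearly larger weight than a noisy one.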

Grouping of Phonemes
The Arabic language consists of 28 phonemes. Non-native Arabic speakers make pronunciation mistakes for a number of reasons. A Pakistani national learning Arabic confuses some phonemes, replacing one phone with a similar phone; such phonemes are called confusing pairs. Fig. 2 shows the confusing pairs of Arabic phonemes mistaken by Pakistani nationals. Sulaiman et al. [40] identified the Arabic phonemes mispronounced by Pakistani nationals and determined which phoneme is substituted for which, yielding the confusing phoneme pairs. When we treat mispronunciation detection as a classification task, we have to train a separate classifier for each confusing pair, which needs a lot of memory and training time. To make efficient use of memory and time, we group the phonemes into two main groups. These groups are based on the pronunciation errors made by Pakistani nationals [40]. Phonemes with a high probability of mistakes are placed in Group 1, and phonemes with a low probability of mistakes are placed in Group 2. We set a threshold: all phoneme pairs with a mistake probability greater than or equal to 10% are placed in Group 1 (frequently mistaken phonemes), and phoneme pairs with a mistake probability below that threshold are placed in Group 2 (less mistaken phonemes). Table 3 shows the mispronounced phonemes along with their mistake probabilities.
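The grouping rule with the 10% threshold can be illustrated with a short sketch. The phoneme labels and probabilities below are hypothetical placeholders, not the values of Table 3, and the function name is ours.

```python
def group_phonemes(mistake_probability, threshold=0.10):
    """Split phonemes into frequently mistaken (>= threshold) and less mistaken (< threshold)."""
    frequent = sorted(p for p, prob in mistake_probability.items() if prob >= threshold)
    less = sorted(p for p, prob in mistake_probability.items() if prob < threshold)
    return frequent, less


# Hypothetical example values, for illustration only.
example = {"ph_a": 0.25, "ph_b": 0.10, "ph_c": 0.04, "ph_d": 0.0}
group1, group2 = group_phonemes(example)
```

With these placeholder values, `ph_a` and `ph_b` fall into Group 1 and `ph_c` and `ph_d` into Group 2, so only two classifiers are needed instead of one per confusing pair.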

Classifiers
Many classifiers have been used for mispronunciation detection. In this research work, we classify the selected acoustic-phonetic features using SVM [43], Naïve Bayes, and KNN [44].

SVM Classifier
The support vector algorithm outputs an optimal hyperplane that separates the data by labeled class. The SVM is inherently a binary classifier, but it has been extended to deal with multiclass problems. In two-dimensional space, the SVM classifier constructs a linear hyperplane (a line) that separates the two classes. If the data is not linearly separable, we tune the SVM using techniques such as the kernel trick. A kernel function maps the nonlinear data to a higher-dimensional space in which it becomes linearly separable. We apply SVM to a multiclass dataset that contains correctly pronounced phonemes and mispronounced phonemes.
The earliest approach used for SVM multiclass classification is the one-versus-all strategy. In this strategy, k SVM models are built, where k is the number of classes. Another significant approach is one-versus-one, which was presented in [15]. This technique builds k(k−1)/2 classifiers, each trained on data from two classes. The one-versus-one approach is more efficient to train than the one-versus-all approach, since each classifier is trained on the data of only two classes, and we use the one-versus-one strategy. Our dataset consists of Arabic phonemes, and some phonemes resemble others, so the phonemes are hard to classify and the data is not linearly separable. For mispronunciation detection we therefore use kernel functions to map the data to a higher-dimensional space for better classification performance. We have used the linear, polynomial, and Radial Basis Function (RBF) kernels, which are expressed in equation (15) as $K(\nu, \nu_i) = \nu^{T} \nu_i + c$ (linear), $K(\nu, \nu_i) = (\gamma\, \nu^{T} \nu_i + c)^{d},\ \gamma > 0$ (polynomial), and $K(\nu, \nu_i) = \exp(-\gamma \|\nu - \nu_i\|^2),\ \gamma > 0$ (radial basis) (15), where ν represents the input vector and $\nu_i$ a support vector; c is a constant term and d is the degree of the polynomial, and these parameters are adjustable.
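The three kernels and the one-versus-one classifier count can be written out directly. This is an illustrative sketch with plain Python lists; the default parameter values and function names are our assumptions.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def linear_kernel(v, vi, c=0.0):
    return dot(v, vi) + c


def polynomial_kernel(v, vi, gamma=1.0, c=1.0, d=3):
    return (gamma * dot(v, vi) + c) ** d


def rbf_kernel(v, vi, gamma=1.0):
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(v, vi)))


def one_vs_one_count(k):
    """Number of pairwise classifiers for k classes: k(k-1)/2."""
    return k * (k - 1) // 2
```

For a group of ten phoneme classes the one-versus-one scheme builds 45 pairwise classifiers, and for eighteen classes it builds 153.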

Naïve Bayes
The Naïve Bayes classifier is a simple classifier based on Bayes' theorem with a strong (naïve) independence assumption between features: it assumes that the features are independent of each other. For classification, we assume a fixed number of phoneme classes, C ∈ {c1, c2, c3, …, ck}, where k is the total number of classes, each representing a unique phoneme and each with a fixed set of features. Each sample is characterized by an n-dimensional vector Ph = {ph1, ph2, ph3, …, phn}, where n is the number of features {A1, A2, A3, …, An}. Given a phone sample Ph, the classifier predicts whether Ph belongs to the correctly pronounced class or the mispronounced class according to the highest posterior probability of a class conditioned on Ph.
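The maximum-posterior decision can be illustrated with a small Gaussian Naïve Bayes sketch. Modeling each feature with a per-class Gaussian is our assumption (the text does not state the likelihood model), as are the variance floor and the function names.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(X, y, var_floor=1e-9):
    """Estimate the log prior and per-feature mean and variance for every class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        variances = [sum((v - m) ** 2 for v in col) / n + var_floor
                     for col, m in zip(columns, means)]
        model[label] = (math.log(n / len(X)), means, variances)
    return model


def predict_gaussian_nb(model, sample):
    """Return the class with the highest log posterior, assuming independent features."""
    def log_posterior(label):
        log_prior, means, variances = model[label]
        return log_prior + sum(
            -0.5 * math.log(2 * math.pi * var) - (x - m) ** 2 / (2 * var)
            for x, m, var in zip(sample, means, variances))
    return max(model, key=log_posterior)
```

Summing per-feature log likelihoods is what the independence assumption buys: the joint likelihood factorizes into a product over features.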

K-Nearest Neighbor
KNN is a simple and important instance-based machine learning classification algorithm [41]. KNN is used for classification and regression. For classification, an instance is classified by a majority vote of its K nearest neighbors. The nearest neighbors are found by linear search, although other search methods can also be used. By default, these search methods use the Euclidean distance as the selection criterion. K is a positive integer; if we set K = 1, the instance is classified based on its single closest neighbor, that is, it is assigned the same class as that neighbor. We apply KNN to our dataset to detect correctly pronounced and mispronounced phonemes. First, we train the classifier with labeled training data T. In KNN, the training data T is used to decide the label of an unknown sample A. The KNN classifier finds the K closest neighbors of sample A based on Euclidean distance. For two samples a and b, the distance is $d(a, b) = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2}$ (19), where n represents the number of features describing a and b. The label is assigned to sample A according to the majority voting rule: the label that occurs most frequently among the nearest neighbors is assigned to A. This classification scheme improves performance by defining nonlinear decision boundaries. We tried different values of K and found optimal results at K = 10.
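A minimal sketch of KNN with linear search, Euclidean distance, and majority voting is given below. The function names and the toy labels are illustrative assumptions; ties are resolved in favor of the label encountered first among the nearest neighbors.

```python
import math
from collections import Counter


def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def knn_predict(train_X, train_y, sample, k=10):
    """Label `sample` by majority vote among its k nearest training instances (linear search)."""
    nearest = sorted(zip(train_X, train_y),
                     key=lambda pair: euclidean(pair[0], sample))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]
```

With K = 1 the function simply returns the label of the closest training instance.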

Dataset
There are numerous CALL frameworks available for various languages such as English, Mandarin, Dutch, French, and Arabic. We used an Arabic dataset recorded from 400 Pakistani speakers learning Arabic as their second language. The dataset was recorded in an open office environment with a microphone in stereo using a 44100 Hz sampling frequency. We used Audacity software to record the dataset and for manual segmentation. Data recorded in the office environment contains noise, so we used a fifth-order high-pass Butterworth filter to remove low-frequency noise. The reading material includes isolated Arabic consonants. The Arabic language consists of 28 consonants, and Table 5 shows the details of the phonemes used in this research work. The recording process was held in five different sessions, and each speaker recorded the 28 phonemes three times. The repetitions per speaker were used to find the best-recorded consonant. The details of the dataset used for this experiment are given in Table 4. The dataset was created with an equal number of male and female speakers, aged 10 to 50, with different mother tongues such as Punjabi, Pushto, and Urdu. Some speakers were highly proficient, and some were at the beginning stage of learning the Arabic language.
The labeling of the dataset was carried out by five Arabic language experts. Each language expert labelled the data separately as correct and incorrect pronunciation classes. If three or more language experts assigned the same label to a certain phoneme then that class (data label) was assigned to that phoneme.
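The labeling rule (a label is accepted when at least three of the five experts agree) can be sketched as follows. The function name, the label strings, and returning `None` when no label reaches the agreement level are our assumptions.

```python
from collections import Counter


def consensus_label(expert_labels, min_agreement=3):
    """Return the label chosen by at least `min_agreement` experts, or None if there is none."""
    label, votes = Counter(expert_labels).most_common(1)[0]
    return label if votes >= min_agreement else None
```

With five experts and two classes, one label always reaches three votes, so every phoneme sample receives a label.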

Evaluation Metrics
To evaluate CALL frameworks, different evaluation metrics such as accuracy, recall, precision, and Mean Absolute Error have been used. In our research work, we use accuracy, recall, precision, and the Receiver Operating Characteristic (ROC) curve as evaluation measures based on the confusion matrix. Accuracy is computed in equation (20): $\mathrm{Accuracy} = N_c / N_t$ (20), where $N_c$ represents the number of mispronunciations detected correctly and $N_t$ represents the total number of mispronunciations detected. Recall and precision are defined in equations (21) and (22) as $\mathrm{Recall} = T_p / (T_p + F_n)$ (21) and $\mathrm{Precision} = T_p / (T_p + F_p)$ (22), where $T_p$ and $T_n$ represent the numbers of samples detected correctly (true positives and true negatives), whereas $F_p$ and $F_n$ represent the numbers detected incorrectly (false positives and false negatives). We also use a ROC curve (a graphical representation of sensitivity versus 1 − specificity) to evaluate the performance of our model. The area under the curve shows the performance of the model: the greater the area under the curve, the more accurate the model.
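The confusion-matrix-based measures can be computed as in the sketch below, assuming a binary setting with "mispronounced" as the positive class. The label strings and function names are illustrative.

```python
def confusion_counts(y_true, y_pred, positive="mispronounced"):
    """Count true positives, true negatives, false positives, and false negatives."""
    tp = tn = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)


def precision(tp, fp):
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp, fn):
    return tp / (tp + fn) if tp + fn else 0.0
```

Accuracy rewards every correct decision, whereas precision and recall focus on the positive class: precision penalizes false alarms and recall penalizes missed mispronunciations.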

Results and Discussions
This section presents the results for both groups of phonemes: frequently mistaken phonemes (P 5x) and less mistaken phonemes (P x). P 5x contains ten phoneme classes and P x contains eighteen phoneme classes. Each phoneme represents a unique class. Table 6 lists the phonemes included in the frequently mistaken group P 5x and the less mistaken group P x.
Three different classifiers, Naïve Bayes, K-Nearest Neighbor (KNN), and SVM, were tested on P 5x and P x. We used k-fold cross-validation (k=10) to divide the dataset for training and testing. We used an almost equal number of samples for each phone, and all classifiers were run with their default settings.
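The 10-fold splitting scheme can be sketched as follows. This is a minimal, library-free illustration: the function name `k_fold_indices`, the shuffling seed, and the sample count in the usage example are assumptions, and the sketch does not stratify by class, which the software used for the experiments may or may not have done.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of nearly equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

# Hypothetical usage with 25 samples: each sample is tested exactly once.
for train, test in k_fold_indices(25, k=10):
    assert not set(train) & set(test)
```

Each classifier would be trained on `train` and scored on `test` in every round, and the ten scores averaged.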
The performance of all three classifiers was evaluated for the frequently mistaken phonemes P 5x.
The performance of the same three classifiers, Naïve Bayes, K-Nearest Neighbor, and SVM, was evaluated for the less mistaken phonemes P x. The average accuracies for the less mistaken phonemes P x are 61.6%, 72.1%, and 86.7% respectively. The results for each group are presented in Table 7. The results show that the classifier-based approach handles mispronunciation detection efficiently in both groups. The results also show that the SVM classifier outperforms the Naïve Bayes and KNN classifiers. The Naïve Bayes classifier shows the worst results because its simplicity prevents it from coping with a complex problem like mispronunciation detection, while the SVM classifier performs best due to its robustness and generalization ability, as shown in Fig 3.
To check the effectiveness of the feature reduction technique, we executed our algorithm twice on P 5x and P x: once with the defined feature reduction technique and once without it. Table 8 presents the effectiveness of the feature reduction technique. The accuracies achieved by Naïve Bayes, KNN, and SVM on the P 5x group without feature reduction are 78%, 80%, and 88% respectively; after feature reduction, they are 78%, 81%, and 90% respectively. The accuracies achieved by Naïve Bayes, KNN, and SVM on the P x group without feature reduction are 57%, 73%, and 85.4% respectively; after feature reduction, they are 61.6%, 76.1%, and 86.9% respectively. The comparative analysis of the results shows that the feature reduction technique enhances the accuracy of the algorithm by around 2%.
In the plot of attribute weights, the y-axis represents the weights of the attributes; features with a weight value greater than 0.0181 are selected for this research work.
The Relief-F filter provides a list of weighted attributes. All attributes whose weights are greater than the cut-off (threshold) value are retained for the mispronunciation detection process; all other features, whose weights are below the threshold, are discarded.
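The thresholding step can be sketched as below. The sketch assumes the Relief-F weights have already been computed by some other tool; the weight values, the function names `select_features` and `reduce_sample`, and the example feature vector are hypothetical. Only the cut-off of 0.0181 comes from the text.

```python
def select_features(weights, threshold=0.0181):
    """Return indices of features whose weight exceeds the threshold."""
    return [i for i, w in enumerate(weights) if w > threshold]

def reduce_sample(sample, kept):
    """Keep only the selected feature columns of one feature vector."""
    return [sample[i] for i in kept]

# Hypothetical Relief-F weights for five features (not from the paper).
weights = [0.0502, 0.0040, 0.0181, 0.0275, -0.0010]
kept = select_features(weights)
print(kept)                                             # [0, 3]
print(reduce_sample([1.2, 0.7, 3.1, 0.4, 9.9], kept))   # [1.2, 0.4]
```

Note that a feature whose weight equals the threshold exactly is discarded here, because the text retains only weights greater than the cut-off.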
Since the comparison of the classifiers showed that SVM outperforms Naïve Bayes and KNN, we tested the accuracy of our algorithm by applying different kernels (linear, polynomial, and Gaussian) to both groups P 5x and P x. Table 9 presents the accuracies of the SVM classifier using the different kernels. The results show that the accuracies achieved by SVM using the linear kernel are 74.2% for P 5x and 76.7% for P x. The average accuracy obtained by SVM using the linear, polynomial, and Gaussian kernels is 75%, 88%, and 87% respectively. The comparative analysis shows that SVM with the polynomial kernel performs best compared with the linear and Gaussian kernels.
Fig. 7 shows the performance of our method, with a polynomial kernel of degree 3, on both groups P 5x and P x in terms of the confusion matrix. It shows that our approach achieved reasonable performance on most of the phoneme classes. The confusion matrix of the P 5x group shows that misclassification occurs only on confusing phones. In the P 5x group, we take three frequently mistaken confusing pairs, (د ذ ز ض ظ), (ث س ص), and (ح ه), so misclassification occurs only within a single confusing pair. The decrease in performance on the P x group is due to the presence of all the remaining phones, which have a low mistake probability but still contain confusing pairs that can mislead the classifiers.
The results of our method for mispronunciation detection are also presented using the ROC curve on both groups P 5x and P x, as shown in Fig 8. The curve plots the true positive rate against the false positive rate. Because this research work addresses a multiclass classification problem, the ROC curve is drawn for the P 5x group by taking an aggregate over its ten classes and for the P x group by taking an aggregate over its eighteen classes. Each point on the curve represents a sensitivity and specificity pair for a specific decision threshold. A perfect ROC curve passes through the top left corner.
The area under the ROC curve for both the P 5x and P x groups is close to unity, which shows that our approach demonstrates reasonable classification performance.
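To make the construction of a ROC curve and its area concrete, the following minimal sketch sweeps a decision threshold over classifier scores for a binary problem and integrates the curve with the trapezoidal rule. The function names, scores, and labels are hypothetical; the sketch covers only the binary case, and the exact way the per-class curves were aggregated in the experiments is not specified here.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points obtained by lowering the decision threshold."""
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only once all samples sharing this score are counted.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Hypothetical scores; label 1 marks the positive class.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 1, 0, 1, 0, 0]
print(round(auc(roc_points(scores, labels)), 4))  # 0.8889
```

A classifier that ranks every positive above every negative yields a curve through the top left corner and an area of exactly 1.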

Discussion
The results of our model show an accuracy of 88%, which is higher than the accuracies of other similar models [2,17,42] and lower than that of Al Hindi et al. [11]. Owing to the phoneme grouping technique, our method is more accurate than those of Georgoulas et al. [17], Kun Li et al. [42], Abdou et al. [2], and Kun Li et al. [31]. We also compare our work with that of Muazzam et al. [45], which uses the same dataset as our proposed method; our method performs better. We place the frequently mistaken confusing pairs in one group, and each confusing pair is distinct from the others. In our group of frequently mistaken phonemes, we take three confusing pairs with a high probability of mistakes. The first confusing pair consists of three phonemes that have similar sounds and are confused with each other, while the second confusing pair consists of five phonemes that are confused with each other. The second confusing pair is not confused with the first because the two have different sounds, which is why the proposed classifier achieves better accuracy than the previous approaches. The accuracy of Al Hindi et al. is 92.5%, which is higher because they focused on only five Arabic phonemes and treated mispronunciation detection as binary classification, while our proposed model is based on the multi-class classification of 28 Arabic consonants.

State-of-the-Art Comparison
We conclude the experiments and results by comparing our approach with the state of the art, as shown in Table 10. Our proposed method outperforms the listed state-of-the-art methods in terms of accuracy. It should be noted that the performance of our method is enhanced by the grouping of phonemes.

CONCLUSION
In this paper, we proposed a novel approach to deal with the pronunciation mistakes made in Arabic by Pakistani nationals. The proposed work demonstrated the development of an efficient mispronunciation detection framework for language learning systems. We considered mispronunciation detection as a classification problem. The main drawback of treating mispronunciation detection as a classification problem is that a separate classifier has to be trained for each phoneme mistake, which increases the use of memory and time. To handle this issue, we divided the dataset of 28 phonemes into two groups based on the mistake probability of the phonemes. The first group contained the frequently mistaken phonemes and the second group contained the less mistaken phonemes. We trained the SVM for both groups instead of training a separate classifier for each phoneme mistake. This grouping technique saves memory and reduces the number of classifiers that must be trained.
Moreover, most state-of-the-art methods focus on one or two confusing pairs, while the proposed model deals with all Arabic consonants. The grouping technique is not only efficient in terms of space and time but also enhances the performance of the classifier, which achieves an accuracy of 88%. Our approach also outperforms the state-of-the-art methods by around 6% in terms of accuracy. The system is implemented to detect the pronunciation mistakes of a second language learner and to provide feedback that makes language learning more efficient.