CNN-XGBoost fusion-based affective state recognition using EEG spectrogram image analysis

Figure 1 illustrates the proposed method, which is generally divided into two segments. On the left, we take a feature fusion-based approach, emphasizing signal processing on the acquired dataset by denoising it with a band-pass filter and extracting the alpha, beta, and theta bands for further processing. Numerous features were extracted from these bands using the Fast Fourier Transform, Discrete Cosine Transform, Poincaré descriptors, Power Spectral Density, Hjorth parameters, and several statistical features. The Chi-square and Recursive Feature Elimination procedures were used to choose the discriminative features among them. Finally, we utilized classification methods such as Support Vector Machine and Extreme Gradient Boosting to classify all the dimensions of emotion and obtain accuracy scores. On the other hand, we take a spectrogram image-based 2DCNN-XGBoost fusion approach, where we utilize a band-pass filter to denoise the data in the region of interest for different cognitive states. Following that, we performed the Short-Time Fourier Transform and obtained spectrogram images. We train a two-dimensional Convolutional Neural Network (CNN) on the obtained images and use an additional dense layer to retrieve the learned features from the CNN's trained layer. After that, we utilized Extreme Gradient Boosting to classify all of the dimensions of emotion based on the retrieved features. Finally, we compared the outcomes of both approaches.

An overview of the proposed method.

In the proposed method (i.e., Fig. 1), we have used the DREAMER3 dataset. Audio and video stimuli were used to develop the emotional responses of the participants in this dataset. The dataset consists of 18 stimuli tested on participants, selected and analyzed by Gabert-Quillen et al.16 to induce emotional sensation. The clips came from several films showing a wide variety of feelings, with two clips each targeting one emotion: amusement, excitement, happiness, calmness, anger, disgust, fear, sadness, and surprise. All of the clips are between 65 and 393 seconds long, giving participants plenty of time to convey their feelings17,18. However, only the last 60 s of the recordings were considered for the next steps of the study. The clips were shown to the participants on a 45-inch television monitor with an attached speaker so that they could hear the soundtrack. The EEG signals were captured with the EMOTIV EPOC, a wireless headset whose 16 contacts provide 14 EEG channels, acquiring data from distinct scalp locations. The wireless SHIMMER ECG sensor provided additional data. This study, however, focused solely on the EEG signals from the DREAMER dataset.

Initially, data collection was performed for 25 participants, but due to some technical problems, data collection from 2 of them was incomplete. As a result, the data from 23 participants were included in the final dataset. The dataset consists of trial and pre-trial signals; the pre-trial signals were collected as a baseline for each stimulus test. The data dimension of the EEG signals from the DREAMER dataset is shown in Table 2.

EEG signals usually contain a lot of noise. In particular, the great majority of ocular artifacts occur below 4 Hz, muscular motions occur above 30 Hz, and power line noise occurs between 50 and 60 Hz3. For a better analysis, the noise must be reduced or eliminated. Additionally, to work on a specific area, we must concentrate on the frequency range that provides us with the stimuli-induced signals. The information linked to the emotion recognition task is contained in a frequency band ranging from 4 to 30 Hz3. We utilized band-pass filtering to retain sample values in the 4-30 Hz range, removing the noise from the signals and isolating the band of interest.

The band-pass filter is a technique that accepts frequencies within a specified range while rejecting frequencies outside it; it combines a low-pass and a high-pass filter to eliminate frequencies that aren't required. The fundamental goal of such a filter is to limit the signal's bandwidth, allowing us to acquire the signal we need from the frequency range we require while also reducing unwanted noise by blocking frequency regions we won't be using anyway. In both sections of our proposed method, we used a band-pass filter. In the feature fusion-based approach, we used this filtering technique to keep the frequency band between 4 and 30 Hz, which contains the crucial information we require; this helps eliminate unwanted noise. We decided to divide the signals of interest into three further bands: theta, alpha, and beta. These bands were chosen because they are the most commonly used bands for EEG signal analysis. The definition of band borders is somewhat subjective; the ranges we use are theta between 4 and 8 Hz, alpha between 8 and 13 Hz, and beta between 13 and 20 Hz. For the 2DCNN-XGBoost fusion-based approach, we used the same filter technique to keep the 4-30 Hz range containing the relevant signals and generated spectrum images. Here the spectrograms were extracted from the signals using the STFT and transformed into RGB pictures.
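As a concrete illustration, the filtering step can be sketched with SciPy. The Butterworth filter family, the filter order, and zero-phase filtering are our assumptions; the paper names only a band-pass filter and the frequency ranges.

```python
# A minimal sketch of the band-pass filtering step, assuming a 4th-order
# Butterworth filter applied with zero-phase (forward-backward) filtering.
import numpy as np
from scipy.signal import butter, filtfilt

FS = 128  # DREAMER EEG sampling rate in Hz

def bandpass(signal, low, high, fs=FS, order=4):
    """Zero-phase Butterworth band-pass filter between `low` and `high` Hz."""
    nyq = fs / 2.0
    b, a = butter(order, [low / nyq, high / nyq], btype="band")
    return filtfilt(b, a, signal)

eeg = np.random.default_rng(0).standard_normal(FS * 60)  # stand-in 60 s channel
roi   = bandpass(eeg, 4, 30)   # 4-30 Hz region of interest
theta = bandpass(eeg, 4, 8)
alpha = bandpass(eeg, 8, 13)
beta  = bandpass(eeg, 13, 20)
```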

After pre-processing, we used several feature extraction techniques for our feature fusion-based and 2DCNN-XGBoost fusion-based approaches, discussed below:

The Fast Fourier Transform (FFT) is among the most useful methods for processing various signals19,20,21,22,23. We used the FFT algorithm to compute the Discrete Fourier Transform of a sequence. The FFT is valued because it makes operating in the frequency domain as computationally feasible as operating in the time or spatial domain, computing the result in O(N log N) operations, where N is the length of the vector. It works by first decomposing an N-point time-domain signal into N single-point time-domain signals, then estimating the frequency spectrum of each, and finally synthesizing the N spectra into a single frequency spectrum.

The equations of FFT are shown below (1), (2):

$$\begin{aligned} H(p) = \sum _{t=0}^{N-1} r(t)\, W_{N}^{pt}, \end{aligned}$$

(1)

$$\begin{aligned} r(t) = \frac{1}{N} \sum _{p=0}^{N-1} H(p)\, W_{N}^{-pt}. \end{aligned}$$

(2)

Here \(H(p)\) represents the Fourier coefficients of \(r(t)\), and \(W_{N} = e^{-j 2\pi /N}\) is the standard twiddle factor.

(a) A baseline EEG signal in time domain, (b) A baseline EEG signal in frequency domain using FFT, (c) A stimuli EEG signal in time domain, (d) A stimuli EEG signal in frequency domain using FFT.

We applied the FFT to obtain the coefficients shown in Fig. 2. The mean and maximum were then computed for each band. Therefore, we get 6 features for each channel across 3 bands, for a total of 84 features distributed across 14 channels.
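A minimal sketch of this feature step follows, assuming the mean and maximum are taken over the FFT magnitude spectrum of each band-filtered channel (the paper does not state whether magnitudes or raw coefficients are used).

```python
# FFT mean/max features per band: 2 features x 3 bands = 6 per channel.
import numpy as np

rng = np.random.default_rng(0)
theta, alpha, beta = (rng.standard_normal(128 * 60) for _ in range(3))  # stand-ins

def fft_features(band_signal):
    spectrum = np.abs(np.fft.rfft(band_signal))  # magnitude spectrum
    return spectrum.mean(), spectrum.max()

features = [v for band in (theta, alpha, beta) for v in fft_features(band)]
# 6 features for one channel; over 14 channels this yields 84 features
```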

The Discrete Cosine Transform (DCT) expresses a finite set of data points as a sum of cosine functions at varying frequencies and has been used in research24,25,26,27,28. The DCT is usually applied to the coefficients of a periodically and symmetrically extended sequence in the Fourier series, and it is among the most commonly used transformation methods in signal processing. For such a sequence, the imaginary part of the signal is zero in both the time domain and the frequency domain: the real part of the spectrum is even (symmetric), while the imaginary part is odd. With the following Eq. (3), we can compute the DCT coefficients:

$$\begin{aligned} X_{P}=\sum _{n=0}^{N-1} x_{n} \cos {\left[ \frac{\pi }{N}\left( n+\frac{1}{2}\right) P\right] }, \end{aligned}$$

(3)

where \(x_{n}\) is the list of N real numbers and \(X_{P}\) is the set of N transformed data values.

(a) A baseline EEG signal in time domain, (b) A baseline EEG signal in frequency domain using DCT, (c) A stimuli EEG signal in time domain, (d) A stimuli EEG signal in frequency domain using DCT.

We applied the DCT to obtain the coefficients shown in Fig. 3. The mean and maximum were then computed for each band. Therefore, we get 6 features for each channel across 3 bands, for a total of 84 features distributed across 14 channels.
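An analogous sketch for the DCT features uses SciPy's DCT-II, which matches Eq. (3); taking the statistics directly over the coefficients is our assumption.

```python
# DCT mean/max features per band, mirroring the FFT feature step.
import numpy as np
from scipy.fft import dct

rng = np.random.default_rng(0)
theta, alpha, beta = (rng.standard_normal(128 * 60) for _ in range(3))  # stand-ins

def dct_features(band_signal):
    coeffs = dct(band_signal, type=2, norm="ortho")  # DCT-II, as in Eq. (3)
    return coeffs.mean(), coeffs.max()

features = [v for band in (theta, alpha, beta) for v in dct_features(band)]
```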

The Hjorth parameters indicate the statistical properties of a signal in the time domain and comprise three quantities: Activity, Mobility, and Complexity. These parameters have been calculated in many studies29,30,31,32.

Activity: This parameter describes the power of the signal, i.e., the variance of the time function, and corresponds to the surface of the power spectrum in the frequency domain. The notation for activity is given below (4),

$$\begin{aligned} \mathrm{var}(y(t)). \end{aligned}$$

(4)

Mobility: This parameter represents the mean frequency, or the proportion of the standard deviation of the power spectrum. It is defined as the square root of the variance of the first derivative of the signal y(t) divided by the variance of y(t). The notation for mobility is given below (5),

$$\begin{aligned} \sqrt{\frac{\mathrm{var}(y'(t))}{\mathrm{var}(y(t))}}. \end{aligned}$$

(5)

Complexity: This parameter reflects the change in frequency. It compares the signal's resemblance to a pure sinusoidal wave, with the value converging to 1 as the signal becomes more similar to a sinusoid. The notation for complexity is given below (6),

$$\begin{aligned} \frac{\mathrm{mobility}(y'(t))}{\mathrm{mobility}(y(t))}. \end{aligned}$$

(6)

For our analysis, we calculated Hjorth's activity, mobility, and complexity parameters as features on each band. Therefore, we get 9 features for each channel across 3 bands, for a total of 126 features distributed across 14 channels.
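The three Hjorth parameters translate directly into NumPy, with np.diff as a discrete stand-in for the derivative; a minimal sketch:

```python
# Hjorth parameters of a 1-D signal, following Eqs. (4)-(6).
import numpy as np

def hjorth(y):
    """Return (activity, mobility, complexity) of a 1-D signal."""
    dy = np.diff(y)                                # discrete first derivative
    ddy = np.diff(dy)                              # discrete second derivative
    activity = np.var(y)                           # Eq. (4)
    mobility = np.sqrt(np.var(dy) / np.var(y))     # Eq. (5)
    complexity = np.sqrt(np.var(ddy) / np.var(dy)) / mobility  # Eq. (6)
    return activity, mobility, complexity
```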

Statistics is the application of mathematics to scientific data processing. We use statistical features to focus on the mathematical properties of the data, which helps other data science methods reach more accurate and structured solutions. Multiple studies33,34,35 on emotion analysis have used statistical features. The statistical features that we extracted are median, mean, max, skewness, and variance. As a result, we get 5 features for each channel, for a total of 70 features distributed across 14 channels.
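A minimal sketch of this step for one channel, using NumPy and SciPy:

```python
# The five statistical features per channel listed above.
import numpy as np
from scipy.stats import skew

def stat_features(x):
    """Median, mean, max, skewness, and variance of one channel."""
    return [np.median(x), np.mean(x), np.max(x), skew(x), np.var(x)]
```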

The Poincaré plot, which takes a series of intervals and plots each interval against the following one, is an emerging analysis technique. In clinical settings, the geometry of this plot has been shown to differentiate between healthy and unhealthy subjects. It is also used to visualize and quantify the association between consecutive data points in a time series. Since long-term correlation and memory are demonstrated in the dynamics of physiological rhythms, this analysis extends the Poincaré plot stepwise, capturing the association between sequential data points in a time series rather than only between two consecutive points. We used two parameters in our paper:

SD1: Represents the standard deviation of the distances of the points from axis 1 and defines the width of the ellipse (short-term variability). Descriptor SD1 can be defined as (7):

$$\begin{aligned} SD1 = \frac{\sqrt{2}}{2}SD(P_{n} - P_{n+1}). \end{aligned}$$

(7)

SD2: Represents the standard deviation of the distances of the points from axis 2 and defines the length of the ellipse (long-term variability). Descriptor SD2 can be defined as (8):

$$\begin{aligned} SD2 = \sqrt{2\,SD(P_{n})^{2} - \frac{1}{2}SD(P_{n} - P_{n+1})^{2}}. \end{aligned}$$

(8)

We extracted 2 features, SD1 and SD2, from each band (theta, alpha, beta). Therefore, we get 6 features for each channel across 3 bands, for a total of 84 features distributed across 14 channels.
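Eqs. (7) and (8) can be computed directly; this sketch assumes the point series \(P_n\) is the band-filtered sample sequence.

```python
# SD1 (ellipse width) and SD2 (ellipse length) of a 1-D series.
import numpy as np

def poincare_sd(p):
    """SD1 and SD2 per Eqs. (7) and (8)."""
    diff = p[:-1] - p[1:]                          # P_n - P_{n+1}
    sd1 = (np.sqrt(2) / 2) * np.std(diff)          # short-term variability
    sd2 = np.sqrt(2 * np.std(p) ** 2 - 0.5 * np.std(diff) ** 2)  # long-term
    return sd1, sd2
```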

The Welch method is a modified segment-averaging technique used to estimate the average periodogram, applied in papers3,23,36. The Welch method is applied to a time series and is concerned with decreasing the variance of the spectral density estimate. Power Spectral Density (PSD) tells us in which frequency ranges the variations are strong, which can be very helpful for further study. The Welch estimate of the PSD can be described by the following equations (9), (10) for the power spectra.

$$\begin{aligned} P_{i}(f) = \frac{1}{M U}\left| \sum _{n=0}^{M-1} x_{i}(n)\, w(n)\, e^{-j 2 \pi f n}\right| ^{2}, \end{aligned}$$

(9)

$$\begin{aligned} P_{\text{welch}}(f) = \frac{1}{L} \sum _{i=0}^{L-1} P_{i}(f). \end{aligned}$$

(10)

Here, the spectral density of each segment is defined first; the Welch power spectrum is then the average over all segments. We applied the Welch method to obtain the PSD of the signal, from which the mean power was extracted for each band. As a result, we get 3 features for each channel across 3 bands, for a total of 42 features distributed across 14 channels.
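A sketch of this step with scipy.signal.welch; the segment length (2 s) is our assumption, as the paper does not report Welch's windowing parameters.

```python
# Mean band power from the Welch PSD estimate, per Eqs. (9)-(10).
import numpy as np
from scipy.signal import welch

FS = 128  # DREAMER sampling rate

def mean_band_power(x, low, high, fs=FS):
    f, pxx = welch(x, fs=fs, nperseg=2 * fs)       # 2 s segments (our choice)
    band = (f >= low) & (f <= high)
    return pxx[band].mean()

eeg = np.random.default_rng(0).standard_normal(FS * 60)  # stand-in channel
powers = [mean_band_power(eeg, lo, hi)
          for lo, hi in ((4, 8), (8, 13), (13, 20))]     # theta, alpha, beta
```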

A Convolutional Neural Network (CNN) is primarily designed to process images, so the time series is first converted into a time-frequency diagram using the Short-Time Fourier Transform (STFT). The CNN extracts the required information from the input images using multilayer convolution and pooling, and then classifies the image using fully connected layers. We calculated the STFT of the filtered signal, which ranges between 4 and 30 Hz, and transformed the results into RGB images. Some of the generated images are shown in Fig. 4.

EEG signal spectrograms using STFT with classification (a) high arousal, high valence, and low dominance, (b) low arousal, high valence, and high dominance, (c) high arousal, low valence, and low dominance.

Wavelet algorithms and Fourier transforms are commonly utilized to convert time-series EEG signals into image representations, and we used them in our secondary process. In order to preserve the integrity of the original data, however, the EEG conversion should be done solely in the time-frequency domain. As a result, the STFT is the best method for preserving the most complete characteristics of the EEG signals, and we used it in our second process. The spectrograms were extracted from the signal using the STFT, given in Eq. (11) below:

$$\begin{aligned} X_{n}(e^{j\hat{\omega }}) = e^{-j\hat{\omega }n}\left[ \left( w(n)\, e^{j\hat{\omega }n}\right) * x(n)\right] , \end{aligned}$$

(11)

where the bracketed term is the output of a complex band-pass filter, modulated by \(e^{-j\hat{\omega }n}\). From the above equation we calculated the STFT of the filtered signals.
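A sketch of the spectrogram-image generation with SciPy and Matplotlib; the window length, colormap, and image size are our assumptions, since the paper does not state them.

```python
# STFT of the 4-30 Hz filtered signal, rendered and saved as an RGB image.
import numpy as np
import matplotlib.pyplot as plt
from scipy.signal import stft

FS = 128
roi = np.random.default_rng(0).standard_normal(FS * 60)  # stand-in filtered signal

f, t, Z = stft(roi, fs=FS, nperseg=FS)                   # 1 s windows (our choice)

fig, ax = plt.subplots()
ax.pcolormesh(t, f, np.abs(Z), shading="gouraud")        # magnitude spectrogram
ax.axis("off")                                           # keep only the image area
fig.savefig("spectrogram.png", bbox_inches="tight", pad_inches=0)
plt.close(fig)
```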

For our feature fusion-based approach, since we have pre-trial signals, we used 4 s of the pre-trial signals as baseline signals, resulting in 512 samples each at a 128 Hz sampling rate. The same features extracted for the stimuli were then extracted from the baseline signals. The stimuli features were then divided by the baseline features in order to retain only the changes induced by the stimulus test, as is also done in paper3.

After extracting all the features and calculating the ratio between stimuli and baseline features, we added the self-assessment ratings of arousal, valence, and dominance. The data set for the feature fusion-based approach now has 414 data points with 630 features each. We scaled the data using MinMax scaling to remove the large variation in our data set. The MinMax estimator scales and translates each feature individually so that it lies between 0 and 1, within the defined range.

The formula for the MinMax scale is (12),

$$\begin{aligned} X_{new}=\frac{X_{i}-{\text {Min}}(X)}{{\text {Max}}(X)-{\text {Min}}(X)}. \end{aligned}$$

(12)
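Eq. (12) is exactly what scikit-learn's MinMaxScaler computes per feature; a minimal sketch over the 414 × 630 feature matrix described above:

```python
# MinMax scaling of the feature matrix, per Eq. (12).
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.random.default_rng(0).random((414, 630))   # stand-in feature matrix
X_scaled = MinMaxScaler(feature_range=(0, 1)).fit_transform(X)
```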

Various feature selection techniques are used by researchers to drop unneeded features and keep only the important ones that play a big role in the prediction. In our paper we used two feature elimination methods: Recursive Feature Elimination (i.e., Fig. 5) and the Chi-square test (i.e., Fig. 6).

Procedure of recursive feature elimination (RFE).

Procedure of feature selection using Chi-square.

RFE (i.e., Fig. 5) is a wrapper-type feature selection technique for a vast span of features. The term recursive refers to the way the method loops backward, giving each predictor an importance score and then eliminating the predictor with the lowest score. Additionally, cross-validation is used to rank the various feature subsets and pick the optimal number of features. In this method, one attribute is combined with the target attribute, and the procedure keeps adding attributes and combining them with the target attribute to produce new models; different subsets of features thus generate different models through training. These models are then screened to find the one with maximum accuracy and its corresponding features. In short, we remove a feature if the accuracy stays higher or at least equal, and restore it if the accuracy drops after elimination. Here we used a step size of 1 to eliminate one feature at a time at each level, which helps remove the worst features early while keeping the best ones, improving the overall accuracy of the model.
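A sketch of this procedure with scikit-learn's cross-validated RFE; the linear-SVM estimator is one plausible choice consistent with the classifiers named earlier, not a detail the paper specifies.

```python
# Cross-validated RFE: step=1 removes one feature per iteration, and
# cross-validation picks the optimal feature count.
import numpy as np
from sklearn.feature_selection import RFECV
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.random((414, 630))                   # stand-in scaled features
y = rng.integers(0, 2, 414)                  # stand-in binary labels

selector = RFECV(SVC(kernel="linear"), step=1, cv=5, scoring="accuracy")
X_sel = selector.fit_transform(X, y)
print("optimal number of features:", selector.n_features_)
```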

The Chi-square test (i.e., Fig. 6) is a filter method that rates features by comparing predicted data with observed data based on their importance. It figures out whether a feature has an effect on nominal categorical data by comparing observed and expected frequencies. In this method, the observed data set is taken as the base point, and the expected values are calculated from the observed values with respect to that base point.

The Chi-square value is computed by (13):

$$\begin{aligned} \chi ^{2}=\sum _{i=1}^{m} \sum _{j=1}^{k} \frac{\left( A_{ij}-\frac{R_{i} C_{j}}{N}\right) ^{2}}{\frac{R_{i} C_{j}}{N}}, \end{aligned}$$

(13)

where m is the number of intervals, k is the number of classes, \(R_{i}\) is the number of patterns in the ith interval, \(C_{j}\) is the number of patterns in the jth class, \(A_{ij}\) is the number of patterns in both the ith interval and the jth class, and N is the total number of patterns.

After applying RFE and Chi-square, we observed from the achieved accuracies that Chi-square does not incorporate a machine learning (ML) model, while RFE trains an ML model to decide whether a feature is relevant. Moreover, in our research, the Chi-square method failed to choose the subset of features that provides the best results, whereas RFE, because of its exhaustive nature, mostly found the best subset. Therefore, we chose RFE over Chi-square for feature elimination.
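For completeness, the Chi-square filter can be sketched with scikit-learn; the subset size k is our assumption, since the kept feature count is not reported here.

```python
# Chi-square filter selection; chi2 requires non-negative inputs, which the
# MinMax-scaled features satisfy.
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)
X_scaled = rng.random((414, 630))            # stand-in MinMax-scaled features
y = rng.integers(0, 2, 414)                  # stand-in binary labels

X_chi = SelectKBest(chi2, k=100).fit_transform(X_scaled, y)  # k is an assumption
```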

In research3, on this data set, the mean and standard deviation of the self-assessment ratings were calculated, and each dimension was then divided into two classes, high or low, with the boundary at the midpoint of the (0-5) scale, which is 2.5. We adjusted this boundary in our secondary process based on some of our observations. We also calculated the mean and standard deviation of the self-assessment ratings, shown in Table 3, to separate each dimension of emotion into two classes, high (1) and low (0), representing two emotional categories per dimension.

Arousal: For our 2DCNN-XGBoost fusion-based approach, ratings > 2.5 are assigned to the Excited/Alert class (1) and ratings < 2.5 to the Uninterested/Bored class (0). Here, of the 5796 data points, 4200 were in the excited/alert class and 1596 in the uninterested/bored class. For the feature fusion-based approach, we focused on the average ratings for excitement, which correspond to stimuli 5 and 16, at 3.70 ± 0.70 and 3.35 ± 1.07 respectively. Additionally, for calmness, we can take stimuli 1 and 11 into consideration, with average ratings of 2.26 ± 0.75 and 1.96 ± 0.82 respectively. Therefore, ratings > 2 can be assigned to the Excited/Alert class and ratings < 2 to the Uninterested/Bored class. Here, of the 414 data points, 393 were in the excited/alert class and 21 in the uninterested/bored class. We also show the parallel coordinate plot for arousal in Fig. 8a to illustrate the impact of different features on the arousal level.

Valence: For our 2DCNN-XGBoost fusion-based approach, ratings > 2.5 are assigned to the happy/elated class and ratings < 2.5 to the unpleasant/stressed class. Here, of the 5796 data points, 2254 were in the unpleasant/stressed class and 3542 in the happy/elated class. To store these values in the new data set, unpleasant/stressed is coded as 0 and happy/elated as 1. For the feature fusion-based approach, we first concentrated on the average happiness ratings, which correspond to stimuli 7 and 13, at 4.52 ± 0.59 and 4.39 ± 0.66 respectively. Additionally, stimuli (4, 15) and (6, 10) for fear and disgust were considered, with average ratings of 2.04 ± 1.02, 2.48 ± 0.85, 2.70 ± 1.55, and 2.17 ± 1.15 respectively. It is thus clear that ratings > 4 can be assigned to the happy/elated class and ratings < 4 to the unpleasant/stressed class. Here, of the 414 data points, 359 were in the unpleasant/stressed class and 55 in the happy/elated class; again, unpleasant/stressed is coded as 0 and happy/elated as 1. We also show the parallel coordinate plot for valence in Fig. 8b to illustrate the impact of different features on the valence level.

Dominance: For our 2DCNN-XGBoost fusion-based approach, the same low/high scheme is followed: ratings > 2.5 are assigned to the helpless/without control class and ratings < 2.5 to the empowered class. Here, of the 5796 data points, 1330 were in the helpless/without control class and 4466 in the empowered class. To store these values in the new data set, helpless/without control is coded as 0 and empowered as 1. For the feature fusion-based approach, we targeted stimuli 4, 6, and 8, whose targeted emotions are fear, disgust, and anger, with mean ratings of 4.13 ± 0.87, 4.04 ± 0.98, and 4.35 ± 0.65 respectively. So ratings > 4 fall in the helpless/without control class and the rest in the empowered class. Here, of the 414 data points, 65 were in the helpless/without control class and 349 in the empowered class; again, helpless/without control is coded as 0 and empowered as 1. We also show the parallel coordinate plot for dominance in Fig. 8c to illustrate the impact of different features on the dominance level.
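The label construction for all three dimensions reduces to thresholding the ratings; a minimal sketch for the 2.5 cut-off used in the 2DCNN-XGBoost pipeline:

```python
# Binarizing self-assessment ratings into high (1) / low (0) classes.
import numpy as np

ratings = np.array([1, 4, 3, 5, 2])          # stand-in self-assessment ratings
labels = (ratings > 2.5).astype(int)         # 1 = high class, 0 = low class
```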

The overall class distribution for arousal, valence, and dominance is shown in Fig. 7.

Overall class distribution after conversion to a two-class rating score for arousal, valence and dominance.

Impact factor of features on (a) arousal, (b) valence and (c) dominance using parallel co-ordinate plot.

A Convolutional Neural Network (CNN) is a type of deep neural network used in deep learning to analyze visual imagery. Figure 9 represents the overall two-dimensional CNN model used in our proposed method (i.e., Fig. 1), our 2DCNN-XGBoost fusion approach. Before using this CNN architecture, we generated spectrum images by filtering the frequency band containing significant signals between 4 and 30 Hz. Following that, we computed the Short-Time Fourier Transform of the EEG signals and converted them to spectrogram images before extracting features with the 2D CNN. We train the model with 2D convolutional layers on the obtained spectrogram images, and then retrieve the trained features from the training layer with the help of an additional dense layer. We implemented a test bed to evaluate the performance of our proposed method. The proposed model is trained using the CNN described below.

The architecture of the implemented CNN model.

Basic features such as horizontal and diagonal edges are usually extracted by the first layer. This information is passed to the next layer, which detects more complicated characteristics such as corners and combined edges. Deeper into the network, it becomes capable of recognizing ever more complex features such as objects, faces, and so on. On the final convolution layer, the classification layer generates a series of confidence ratings (numbers between 0 and 1) indicating how likely the image is to belong to each class. In our proposed method, we used three Conv2D layers and identified the classes.

The pooling layer is in charge of shrinking the spatial size of the convolved features. Lowering the size reduces the computing power required to process the data. Pooling comes in two main types: average pooling and max pooling. We used max pooling because it gives better results than average pooling. Max pooling takes the maximum pixel value from the region of the image covered by the kernel; it suppresses noisy activations and performs de-noising along with dimensionality reduction. In general, any pooling function can be represented by the following formula (14):

$$\begin{aligned} q_{j}^{(l+1)} = \mathrm{Pool}\left( q_{1}^{(l)}, \ldots ,q_{i}^{(l)},\ldots ,q_{n}^{(l)}\right) ,\quad q_{i}\in R_{j}^{(l)}, \end{aligned}$$

(14)

where \(R_{j}^{(l)}\) is the jth pooled region at layer l and Pool(·) is the pooling function over the pooled region.

We added a dropout layer after the pooling layer to reduce overfitting. With a suitable dropout rate, the accuracy continues to improve while the loss decreases. Some of the max pooling outputs are randomly picked and completely ignored; they aren't transferred to the following layer.

After a set of 2D convolutions, it is always necessary to perform a flatten operation. Flattening turns the data into a one-dimensional array for further processing: we flatten the output of the convolutional layers into a single long feature vector, which is then linked to the final classification scheme.

A dense layer gives the neural network a fully connected layer: all outputs of the preceding layer are fed to each of its neurons, and each neuron delivers one output to the following layer.

In our proposed method, with this CNN architecture, diverse kernels are employed in the convolution layers to extract high-level features, resulting in different feature maps. At the end of the CNN model there is a fully connected layer, whose output generates the predicted class labels of emotions. Following our proposed method, we added a dense layer with 630 units after the trained layers to extract that number of features.
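A hedged Keras sketch of such an architecture follows; the filter counts, kernel sizes, dropout rate, and input size are our assumptions, since the text specifies only the layer types and the 630-unit feature layer.

```python
# Sketch: three Conv2D blocks (conv + max pooling + dropout), flatten,
# a 630-unit dense feature layer, and a softmax classification head.
from tensorflow import keras
from tensorflow.keras import layers

def build_cnn(input_shape=(224, 224, 3), n_classes=2):
    inputs = keras.Input(shape=input_shape)
    x = inputs
    for filters in (32, 64, 128):            # three Conv2D blocks (assumed sizes)
        x = layers.Conv2D(filters, 3, activation="relu")(x)
        x = layers.MaxPooling2D(2)(x)
        x = layers.Dropout(0.25)(x)
    x = layers.Flatten()(x)
    feats = layers.Dense(630, activation="relu", name="features")(x)
    outputs = layers.Dense(n_classes, activation="softmax")(feats)
    return keras.Model(inputs, outputs)

model = build_cnn()
# After training, the 630-dim feature vectors can be read out of the dense layer:
extractor = keras.Model(model.input, model.get_layer("features").output)
```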

Extreme Gradient Boosting (XGBoost) is a machine learning algorithm that uses a supervised learning strategy to accurately predict a target variable by combining the predictions of several weaker models. It is a common data mining tool with good speed and performance; the XGBoost model computes 10 times faster than the Random Forest model. The XGBoost model is built with an additive tree method, adding a new tree at each step to complement the trees already built; as more trees are added, the accuracy generally improves. In our proposed model, we applied XGBoost after the CNN: we extracted the features from the CNN's trained layer and then used Extreme Gradient Boosting to classify all of the dimensions of emotion based on them. The following Eqs. (15) and (16) are used in Extreme Gradient Boosting.

$$\begin{aligned} f(m) \approx f(k)+f^{\prime }(k)(m-k)+\frac{1}{2} f^{\prime \prime }(k)(m-k)^{2}, \end{aligned}$$

(15)

$$\begin{aligned} \mathcal {L}^{(t)} \simeq \sum _{i=1}^{n}\left[ l\left( q_{i}, \hat{q}_{i}^{(t-1)}\right) +r_{i} f_{t}\left( m_{i}\right) +\frac{1}{2} s_{i} f_{t}^{2}\left( m_{i}\right) \right] +\Omega \left( f_{t}\right) +C, \end{aligned}$$

(16)

where C is a constant, and \(r_{i}\) and \(s_{i}\) are defined as,

$$\begin{aligned} r_{i} = \partial _{\hat{z}_{i}^{(b-1)}}\, l\left( z_{i}, \hat{z}_{i}^{(b-1)}\right) , \end{aligned}$$

(17)

$$\begin{aligned} s_{i} = \partial ^{2}_{\hat{z}_{i}^{(b-1)}}\, l\left( z_{i}, \hat{z}_{i}^{(b-1)}\right) . \end{aligned}$$

(18)

After removing all the constants, the specific objective at step b becomes,

$$\begin{aligned} \sum _{i=1}^{n}\left[ r_{i} f_{t}\left( m_{i}\right) +\frac{1}{2} s_{i} f_{t}^{2}\left( m_{i}\right) \right] +\Omega \left( f_{t}\right) . \end{aligned}$$

(19)
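Putting the fusion together, a sketch of the final classification stage with the xgboost package; hyperparameters are left at library defaults, which is our assumption.

```python
# CNN-extracted features feeding an XGBoost classifier, one per emotion
# dimension (arousal, valence, dominance).
import numpy as np
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
features = rng.random((5796, 630))           # stand-in CNN-extracted features
arousal = rng.integers(0, 2, 5796)           # stand-in binary arousal labels

clf = XGBClassifier(eval_metric="logloss")
clf.fit(features, arousal)
print("train accuracy:", clf.score(features, arousal))
```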
