International Journal of Speech Technology
Notable scientific publications
* Data are for reference only
Automatic genre classification of Indian Tamil and western music using fractional MFCC
International Journal of Speech Technology - Volume 19 - Pages 551-563 - 2016
This paper presents the automatic genre classification of Indian Tamil music and western music using timbral features and fractional Fourier transform (FrFT) based Mel frequency cepstral coefficient (MFCC) features. The classifier model for the proposed system has been built using K-nearest neighbours and support vector machine (SVM) classifiers. In this work, the performance of various features extracted from music excerpts has been analyzed to identify the appropriate feature descriptors for the two major genres of Indian Tamil music, namely classical music (Carnatic based devotional hymn compositions) and folk music. The results have shown that the feature combination of spectral roll off, spectral flux, spectral skewness and spectral kurtosis, combined with fractional MFCC features, outperforms all other feature combinations, yielding a higher classification accuracy of 96.05%, as compared to the accuracy of 84.21% with conventional MFCC. It has also been observed that the FrFT-based MFCC, with timbral features and SVM, efficiently classifies the two western genres of rock and classical music from the GTZAN dataset, with fewer features and a higher classification accuracy of 96.25%, as compared to the classification accuracy of 80% with conventional MFCC.
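As a rough illustration of the fractional Fourier transform behind the fractional MFCC features, one common DFrFT construction takes a fractional matrix power of the unitary DFT. This is a minimal sketch under that assumption; the paper's exact DFrFT variant may differ, and for non-integer orders this naive construction depends on the eigenbasis chosen in degenerate eigenspaces.

```python
import numpy as np

def dfrft(x, a):
    """Discrete fractional Fourier transform of order a, built as a
    fractional matrix power of the unitary DFT (a=1 recovers the DFT,
    a=0 the identity). For non-integer a this naive construction is not
    unique: it depends on the eigenbasis within degenerate subspaces."""
    n = len(x)
    F = np.fft.fft(np.eye(n), norm="ortho")   # unitary DFT matrix
    w, V = np.linalg.eig(F)                   # F = V diag(w) V^{-1}
    Fa = V @ np.diag(w ** a) @ np.linalg.inv(V)
    return Fa @ x
```

In a fractional MFCC pipeline, `dfrft` would replace the FFT in each analysis frame before Mel filterbank and cepstral steps; the order `a` becomes a tunable parameter.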
Low rank sparse decomposition model based speech enhancement using gammatone filterbank and Kullback–Leibler divergence
International Journal of Speech Technology - Volume 21, Issue 2 - Pages 217-231 - 2018
A computationally efficient speech emotion recognition system employing machine learning classifiers and ensemble learning
International Journal of Speech Technology - 2024
Speech Emotion Recognition (SER) is the process of recognizing and classifying emotions expressed through speech. SER greatly facilitates personalized and empathetic interactions, enhances user experiences, enables sentiment analysis, and finds applications in psychology, healthcare, entertainment, and gaming industries. However, accurately detecting and classifying emotions is a highly challenging task for machines due to the complexity and multifaceted nature of emotions. This work gives a comparative analysis of two approaches for emotion recognition based on original and augmented speech signals. The first approach involves extracting 39 Mel Frequency Cepstrum Coefficient (MFCC) features, while the second involves using MFCC spectrograms and extracting features with deep learning models such as MobileNet V2, VGG16, Inception V3, VGG19 and ResNet 50. These features are then tested on machine learning classifiers such as SVM, Linear SVM, Naive Bayes, k-Nearest Neighbours, Logistic Regression and Random Forest. From the experiments, it is observed that the SVM classifier works best with all the feature extraction techniques. Furthermore, to enhance the results, ensembling techniques involving CatBoost and a Voting classifier alongside SVM were utilized, yielding improved test accuracies of 97.04% on the RAVDESS dataset, 93.24% on the SAVEE dataset, and 99.83% on the TESS dataset. It is worth noting that both approaches are computationally efficient as they required no training time.
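A minimal sketch of the voting-ensemble step around SVM described above. CatBoost is omitted here (a random forest stands in), and synthetic vectors stand in for the 39-dimensional MFCC features; dataset names, thresholds, and accuracies from the abstract are not reproduced.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Synthetic stand-ins for 39-dimensional MFCC feature vectors of two emotions.
X = np.vstack([rng.normal(0.0, 1.0, (100, 39)), rng.normal(2.0, 1.0, (100, 39))])
y = np.array([0] * 100 + [1] * 100)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0, stratify=y)

# Soft-voting ensemble around SVM (CatBoost omitted; random forest stands in).
ens = VotingClassifier(
    [("svm", SVC(probability=True)), ("rf", RandomForestClassifier(random_state=0))],
    voting="soft",
)
ens.fit(Xtr, ytr)
acc = ens.score(Xte, yte)
```

Soft voting averages the members' class probabilities, which typically smooths out individual classifiers' errors on borderline samples.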
A noise robust speech features extraction approach in multidimensional cortical representation using multilinear principal component analysis
International Journal of Speech Technology - Volume 18 - Pages 351-365 - 2015
In this paper, we propose a new type of noise-robust feature extraction method based on multidimensional perceptual representation of speech in the auditory cortex (AI). Differently coded features in different dimensions increase the discrimination power of the system. On the other hand, this representation greatly increases the volume of information, producing the curse of dimensionality. In this study, we propose a second-level feature extraction stage to make the features suitable and noise robust for classifier training. In this second level, we target two main concerns: dimensionality reduction and noise robustness, using a singular value decomposition (SVD) approach. A multilinear principal component analysis framework based on higher-order SVD is proposed to extract the final features in the high-dimensional AI output space. Phoneme classification results on different subsets of the phonemes of the additive-noise-contaminated TIMIT database confirmed that the proposed method not only increased the classification rate considerably, but also enhanced robustness significantly compared to conventional Mel-frequency cepstral coefficient and cepstral mean normalization features, which were used to train the same classifier.
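The second-level stage can be illustrated with a minimal truncated higher-order SVD, the decomposition underlying multilinear PCA: one projection matrix per tensor mode, then projection of the tensor onto the retained subspaces. The tensor shape and ranks below are illustrative assumptions, not the paper's AI-output layout.

```python
import numpy as np

def unfold(T, mode):
    """Mode-n unfolding: move `mode` to the front and flatten the rest."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def mpca(T, ranks):
    """Truncated higher-order SVD: compute a projection matrix per mode
    from the left singular vectors of each unfolding, then project the
    tensor onto the retained subspaces to obtain a small core tensor."""
    Us = []
    for mode, r in enumerate(ranks):
        U, _, _ = np.linalg.svd(unfold(T, mode), full_matrices=False)
        Us.append(U[:, :r])
    core = T
    for mode, U in enumerate(Us):
        core = np.moveaxis(np.tensordot(U.T, np.moveaxis(core, mode, 0), axes=1), 0, mode)
    return core, Us

# A toy multidimensional (e.g. scale x rate x frequency) representation,
# reduced to a compact core that serves as the final feature tensor.
T = np.random.default_rng(0).normal(size=(6, 5, 4))
core, Us = mpca(T, (3, 3, 2))
```

The core tensor (here 3×3×2 instead of 6×5×4) is what would be vectorized and fed to the classifier, addressing the dimensionality concern directly.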
Application of glottal flow descriptors for pathological voice diagnosis
International Journal of Speech Technology - Volume 23 - Pages 205-222 - 2020
Acoustic analysis of the speech signal enables automatic detection and classification of voice disorders along with their severity. This automatic assessment helps the clinician in the initial diagnosis of a pathological larynx in a non-intrusive way. Voice pathologies damage the vocal cords and consequently alter their dynamics (fluctuation speed). In this article, we have estimated the glottal volume velocity waveform (GVVW) from the speech pressure waveforms of healthy and pathological subjects using the quasi closed phase (QCP) glottal inverse filtering algorithm to capture the altered dynamics of the vocal cords. Closed-phase methods have shown notable stability across diverse voice qualities and sub-glottal pressures. The GVVW is the source of significant acoustical cues embedded in speech. The estimated GVVW is then parameterized by various time-based, frequency-based and Liljencrants–Fant (LF) model based glottal descriptors. The glottal descriptor vectors are passed to a stochastic gradient descent (SGD) classifier for voice disorder evaluation. Normal-pitch utterances of the sustained vowel /a/ drawn from German, English, Arabic and Spanish voice databases are used. The information gain (IG) feature scoring technique is employed to select optimal descriptors and to rank them. Several intra- and cross-database experiments were performed to explore the usefulness of glottal descriptors for voice disorder detection, severity detection and classification. Student's t-tests were performed to validate the obtained results.
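A minimal sketch of the descriptor-ranking and classification pipeline. Mutual information is used here as a stand-in for the information-gain scoring, and synthetic vectors replace the real glottal descriptors; the number of retained descriptors is an illustrative choice.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
# Synthetic stand-ins for glottal descriptor vectors (healthy vs. pathological).
X = np.vstack([rng.normal(0.0, 1.0, (80, 10)), rng.normal(1.5, 1.0, (80, 10))])
y = np.array([0] * 80 + [1] * 80)

# Score descriptors (mutual information as an information-gain-style score)
# and keep the five highest-ranked ones.
scores = mutual_info_classif(X, y, random_state=0)
top = np.argsort(scores)[::-1][:5]

# SGD classifier on the selected descriptors (scaling matters for SGD).
clf = make_pipeline(StandardScaler(), SGDClassifier(random_state=0))
clf.fit(X[:, top], y)
acc = clf.score(X[:, top], y)
```

In a cross-database experiment, `fit` would use one database's descriptors and `score` another's, keeping the selected descriptor indices fixed.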
Training augmentation with TANDEM acoustic modelling in Punjabi adult speech recognition system
International Journal of Speech Technology - Volume 24 - Pages 473-481 - 2021
Processing of low-resource acoustic signals has always faced the challenge of data scarcity in training. It is difficult to obtain high system accuracy with a limited training corpus, which forces the extraction of large discriminative feature vectors; the information in these vectors is further distorted by acoustic mismatch arising from real environments and inter-speaker variations. In this paper, context-independent information of the input speech signal is pre-processed using bottleneck features, and a Tandem-NN model is then employed in the modelling phase to enhance system accuracy. To address the shortage of training data, in-domain training augmentation is performed by fusing the original clean data with artificially created noisy training data; the training set is further enlarged by tempo modification of the input speech while preserving its spectral envelope and pitch. Experimental results show relative improvements of 13.53% in clean and 32.43% in noisy conditions with the Tandem-NN system compared to the baseline system.
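The step of creating artificial noisy training copies can be sketched as mixing noise into clean speech at a chosen signal-to-noise ratio. The SNR level and Gaussian noise source are illustrative assumptions; the tempo-modification step is not shown.

```python
import numpy as np

def add_noise_at_snr(clean, noise, snr_db):
    """Mix noise into a clean signal at a target signal-to-noise ratio (dB),
    producing an augmented 'noisy' training copy."""
    noise = np.resize(noise, clean.shape)          # loop/trim noise to length
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + scale * noise

rng = np.random.default_rng(0)
clean = rng.normal(size=16000)                     # stand-in for a clean utterance
noise = rng.normal(size=8000)                      # stand-in for recorded noise
noisy = add_noise_at_snr(clean, noise, 10.0)       # augmented training copy at 10 dB SNR
```

The augmented set would then be the union of the clean utterances and such noisy copies, matching the "fusion of original clean and artificially created noisy data" described above.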
Automatic age recognition, call-type classification, and speaker identification of Zebra Finches (Taeniopygia guttata) using hidden Markov models (HMMs)
International Journal of Speech Technology - Volume 26 - Pages 641-650 - 2023
Hidden Markov models (HMMs) were developed and implemented to discriminate between each of the 2 ages, 11 call-types, and 51 bird speakers, using cross-validation on the 3314 recordings in the database of chick (19–25 days of age) and adult (60 days–7 years of age) vocalizations of Zebra Finches (Taeniopygia guttata). By applying both temporal [delta (velocity) and delta-delta (acceleration) coefficients] and spectral [Mel-Frequency Cepstral Coefficients (MFCCs)] features, the HMMs produced excellent performance on the three tasks: (1) 96.68% (age recognition); (2) 94.62% (chicks) and 79.30% (adults) (call-type classification); and (3) from 55.32% (12 speakers, chicks) and 16.78% (33 speakers, adults) up to 100.00% (2 speakers, chicks) and 100.00% (3 speakers, adults) (speaker identification). Based on these performances, the HMMs could be extended to other animals for automatic recognition, classification, and identification tasks.
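Classification with per-class HMMs works by scoring a sequence under each class's model and taking the maximum likelihood. A toy version with discrete emissions is sketched below; real systems like the one above use continuous (typically Gaussian-mixture) emissions over MFCC + delta + delta-delta features, and the toy parameters here are purely illustrative.

```python
import numpy as np

def lse(a, axis=None):
    """Numerically stable log-sum-exp."""
    m = np.max(a, axis=axis, keepdims=True)
    s = m + np.log(np.sum(np.exp(a - m), axis=axis, keepdims=True))
    return np.squeeze(s) if axis is None else np.squeeze(s, axis=axis)

def log_forward(obs, log_pi, log_A, log_B):
    """Log-likelihood of a discrete observation sequence under an HMM,
    via the forward algorithm in the log domain."""
    alpha = log_pi + log_B[:, obs[0]]
    for o in obs[1:]:
        alpha = log_B[:, o] + lse(alpha[:, None] + log_A, axis=0)
    return lse(alpha)

# Two toy call-type models sharing initial/transition probabilities but
# differing in emissions: model 0 mostly emits symbol 0, model 1 symbol 1.
log_pi = np.log(np.array([0.5, 0.5]))
log_A = np.log(np.full((2, 2), 0.5))
log_B0 = np.log(np.array([[0.9, 0.1], [0.8, 0.2]]))
log_B1 = np.log(np.array([[0.1, 0.9], [0.2, 0.8]]))

obs = np.array([0, 0, 1, 0, 0, 0, 1, 0, 0, 0])     # looks like call-type 0
pred = int(np.argmax([log_forward(obs, log_pi, log_A, log_B0),
                      log_forward(obs, log_pi, log_A, log_B1)]))
```

The same classify-by-maximum-likelihood scheme extends directly to the age, call-type, and speaker tasks by training one HMM per category.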
Detection of replay signals using excitation source and shifted CQCC features
International Journal of Speech Technology - Volume 24, Issue 2 - Pages 497-507 - 2021
Correction to: Performance analysis of neural network, NMF and statistical approaches for speech enhancement
International Journal of Speech Technology - Volume 23 - Page 939 - 2020
Two entries were missing from the reference list of the original publication: Hu, G., & Wang, D. L. (2010). A tandem algorithm for pitch estimation and voiced speech segregation. IEEE Transactions on Audio, Speech, and Language Processing, 18, 2067–2079. Hu, G. (2004). 100 nonspeech environmental sounds. Available:
http://www.cse.ohio-state.edu/pnl/corpus/HuCorpus.html
The original article has been corrected.
HTK-based speech recognition and corpus-based English vocabulary online guiding system
International Journal of Speech Technology - Volume 25 - Pages 921-931 - 2022
With the popularization of computers and the development of modern educational technology, the connection between corpora and intelligent foreign-language instruction has grown ever closer. Corpora were first used for vocabulary instruction in foreign-language teaching, and there is a substantial body of research in this field. In practice, however, English vocabulary teaching remains a major problem for teachers and students. This thesis studies an online English vocabulary instruction system from the perspective of speech recognition; such systems have become an essential tool for English learners. Speech recognition is the technology that converts speech signals into text: automatic speech recognition, also known as computer speech recognition, aims to let computers recognize continuous speech from different speakers and transcribe it. It is a comprehensive technology that integrates many disciplines, including phonetics, linguistics, and computer science. Hence, this paper analyzes HTK-based speech recognition technology and the construction of the corpus, and studies the online English vocabulary guidance system. Novel speech analysis techniques are considered for the implementation of the guiding system. In comparative simulations against other state-of-the-art systems, the designed system performs best.
Total: 852