International Journal of Speech Technology

Notable scientific publications

* Data is for reference only

nameGist: a novel phonetic algorithm with bilingual support
International Journal of Speech Technology - Volume 22 - Pages 1135-1148 - 2019
Shahidul Islam Khan, Md. Mahmudul Hasan, Mohammad Imran Hossain, Abu Sayed Md. Latiful Hoque
Phonetic algorithms play an essential role in many applications, including name matching, database record linkage, spelling correction, and search recommendations. Since 1918, many phonetic algorithms have been proposed. Soundex, Match Rating Codex, NYSIIS, Metaphone, and Double Metaphone are among the most frequently used. These algorithms were primarily developed for English phonetics, and they perform well for their intended purposes; however, they do not support the Bengali language and perform poorly on Bengali phonetic representations written in English. Some phonetic algorithms, e.g., NameSignificance and Modified NameSignificance, have recently been proposed to deal with Bengali phonetic names, but their performance on English names is not up to the mark. Besides, these algorithms do not support names written in the Bengali script, i.e., Bengali Unicode. The Bengali language, known as Bangla among natives, is counted as the seventh most spoken language in the world, with more than 250 million speakers worldwide. Use of Bengali Unicode is increasing in Bangladesh and around the globe with the increasing use of computers everywhere. For example, in different healthcare systems a patient's name can be stored either in an English representation of Bengali or in Bengali Unicode. Being unable to process Bengali Unicode leads to failure to link information about the same patient across multiple databases, which creates a problem for record linkage and entity matching. In this paper, we propose a novel phonetic algorithm, nameGist, which can efficiently encode Bengali phonetic names in English representation, Bengali Unicode names, and English phonetic names. We have tested nameGist on various datasets containing Bengali phonetic names, Bengali Unicode names, English phonetic (American or British) names, and mixtures of these types. In each case, our proposed algorithm, nameGist, performed better than the other algorithms in terms of accuracy and F-measure. NameGist can be used to solve record linkage and entity resolution problems for Bengali, English, and mixed names effectively.
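The nameGist encoding itself is not specified in the abstract, so it is not reproduced here; as a minimal sketch of the kind of phonetic encoding the abstract compares against, classic Soundex (one of the named baselines) can be implemented as follows:

```python
def soundex(name: str) -> str:
    """Classic Soundex: keep the first letter, encode the rest as digits.

    Adjacent letters with the same code collapse to one digit; vowels
    reset the previous code, while H and W are transparent to it.
    """
    codes = {}
    for letters, digit in [("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                           ("L", "4"), ("MN", "5"), ("R", "6")]:
        for ch in letters:
            codes[ch] = digit
    name = name.upper()
    encoded = name[0]
    prev = codes.get(name[0], "")
    for ch in name[1:]:
        digit = codes.get(ch, "")
        if digit and digit != prev:
            encoded += digit
        if ch not in "HW":          # H and W do not reset the previous code
            prev = digit
    return (encoded + "000")[:4]    # pad or truncate to letter + 3 digits
```

Names that sound alike map to the same 4-character code, e.g. `soundex("Robert")` and `soundex("Rupert")` both yield `"R163"`, which is what makes such codes useful for record linkage.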
Usefulness, localizability, humanness, and language-benefit: additional evaluation criteria for natural language dialogue systems
International Journal of Speech Technology - Volume 19 - Pages 373-383 - 2016
Bayan AbuShawar, Eric Atwell
Human–computer dialogue systems interact with human users using natural language. We used the ALICE/AIML chatbot architecture as a platform to develop a range of chatbots covering different languages, genres, text types, and user groups, to illustrate qualitative aspects of natural language dialogue system evaluation. We present some of the different evaluation techniques used for natural language dialogue systems, including black-box and glass-box, comparative, quantitative, and qualitative evaluation. Four aspects of NLP dialogue system evaluation are often overlooked: "usefulness" in terms of a user's qualitative needs, "localizability" to new genres and languages, "humanness" or "naturalness" compared to human–human dialogues, and "language benefit" compared to alternative interfaces. We illustrate these aspects with respect to our work on machine-learnt chatbot dialogue systems; we believe these aspects are worthwhile for impressing potential new users and customers.
Gender and age-evolution detection based on audio forensic analysis using light deep neural network
International Journal of Speech Technology - Volume 26 - Pages 1091-1098 - 2023
Noor D. AL-Shakarchy, Huda Rageb, Mais Saad Safoq
Forensic audio analysis is a foundation stone of many crime investigations. In forensic evidence, the audio file of a human voice is analyzed to extract much information beyond the content of the speech, such as the speaker's identity, emotions, gender, and origin. Accurately assigning individuals to groups based on their age-development stage and their gender is often used in early investigations to differentiate them and determine the legal rights and responsibilities associated with them. This work introduces a light CNN model with a new architecture that detects a human being's age-evolution stage (kid or adult) and, at the same time, the gender of an adult (male or female) based on the individual's voice characteristics, offering a balance between computational efficiency and model accuracy. The temporal information in the audio file is prepared by scaling and normalization. This information is then exploited to extract and track the unique and salient audio features that make up the feature-map pattern for each target class, through several convolutional layers each followed by max-pooling layers. Finally, the decision is made from these feature maps by fully connected layers. Successful and promising results are achieved in terms of accuracy and loss, reaching 0.99 and 0.017, respectively, on the enriched VoxCeleb2 dataset. The proposed model underscores the importance of leveraging light DNNs for gender and age-evolution detection, offering a robust and ethically sound solution for real-world applications in audio forensics, such as speaker identification, victim profiling, and deception detection, contributing to the advancement of audio forensic analysis.
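The abstract does not give the exact CNN architecture, but the max-pooling step it describes (downsampling a feature sequence by keeping local maxima after each convolution) can be sketched in a framework-free way; the function name and window size below are illustrative assumptions:

```python
import numpy as np

def maxpool1d(x: np.ndarray, size: int = 2) -> np.ndarray:
    """Non-overlapping 1-D max pooling over a feature sequence.

    Splits x into consecutive windows of `size` samples and keeps the
    maximum of each window, halving (for size=2) the temporal resolution.
    """
    T = (len(x) // size) * size          # drop the ragged tail, if any
    return x[:T].reshape(-1, size).max(axis=1)

activations = np.array([0.1, 0.9, 0.3, 0.2, 0.7, 0.5])
print(maxpool1d(activations, 2))         # [0.9 0.3 0.7]
```

Stacking convolution and pooling this way is what lets a "light" network track salient features while keeping the parameter count low.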
Voice assessments for detecting patients with Parkinson’s diseases using PCA and NPCA
International Journal of Speech Technology - Volume 19 - Pages 743-754 - 2016
Achraf Benba, Abdelilah Jilbab, Ahmed Hammouch
In this study, we aimed to discriminate between two groups of people. The database used contains 20 patients with Parkinson's disease (PD) and 20 healthy people. Three sustained vowels (/a/, /o/, and /u/) were recorded from each participant, and the analyses were performed on these voice samples. First, an initial feature vector was extracted from the time, frequency, and cepstral domains. We then applied linear and nonlinear feature-extraction techniques: principal component analysis (PCA) and nonlinear PCA (NPCA). These techniques reduce the number of parameters and select the most effective acoustic features for classification. A support vector machine with different kernels was used for classification. We obtained an accuracy of up to 87.50% in discriminating between PD patients and healthy people.
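The linear PCA step described above (projecting the initial feature vectors onto a few principal components before classification) can be sketched with NumPy; the function name and data shapes here are illustrative, not the paper's:

```python
import numpy as np

def pca_reduce(X: np.ndarray, n_components: int) -> np.ndarray:
    """Project feature vectors onto the top principal components.

    X: (n_samples, n_features) matrix of acoustic features.
    Returns the (n_samples, n_components) scores in the reduced space.
    """
    Xc = X - X.mean(axis=0)                  # center each feature
    # SVD of the centered data: rows of Vt are the principal axes
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

rng = np.random.default_rng(0)
features = rng.normal(size=(40, 12))         # e.g. 40 voice samples, 12 features
reduced = pca_reduce(features, n_components=3)
print(reduced.shape)                         # (40, 3)
```

The reduced vectors would then be fed to the SVM classifier; NPCA replaces the linear projection with a nonlinear mapping (e.g. an autoencoder), which is not shown here.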
Combining evidences from excitation source and vocal tract system features for Indian language identification using deep neural networks
International Journal of Speech Technology - Volume 21 - Pages 501-508 - 2017
Mounika Kamsali Veera, Ravi Kumar Vuddagiri, Suryakanth V. Gangashetty, Anil Kumar Vuppala
In this paper, a combination of excitation source information and vocal tract system information is explored for the task of language identification (LID). The excitation source information is represented by features extracted from the linear prediction (LP) residual signal, called residual cepstral coefficients (RCC). Vocal tract system information is represented by mel-frequency cepstral coefficients (MFCC). To incorporate additional temporal information, shifted delta cepstra (SDC) are computed. An LID system is built using SDC over both MFCC and RCC features individually and evaluated in terms of equal error rate (EER). Experiments have been performed on a dataset of 13 Indian languages with about 115 h for training and 30 h for testing, using a deep neural network (DNN), a DNN with attention (DNN-WA), and a state-of-the-art i-vector system. DNN-WA outperforms the baseline i-vector system. EERs of 9.93% and 6.25% are achieved using RCC and MFCC features, respectively. By combining evidence from both features using a late fusion mechanism, an EER of 5.76% is obtained. This result indicates the complementary nature of the excitation source information to that of the widely used vocal tract system information for the task of LID.
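The SDC computation mentioned above stacks several delta-cepstral vectors taken at shifted offsets; a sketch under the common N-d-P-k parameterization (the paper's exact configuration is not given in the abstract, so d=1, P=3, k=7 below are assumptions):

```python
import numpy as np

def shifted_delta_cepstra(C: np.ndarray, d: int = 1, P: int = 3, k: int = 7) -> np.ndarray:
    """Compute SDC features from a cepstral matrix C of shape (T, N).

    For each frame t, k delta vectors are stacked:
        delta_i(t) = C[t + i*P + d] - C[t + i*P - d],  i = 0 .. k-1
    Indices falling outside the signal are clamped to the edge frames.
    """
    T, N = C.shape
    out = np.zeros((T, N * k))
    for t in range(T):
        blocks = []
        for i in range(k):
            hi = min(t + i * P + d, T - 1)
            lo = min(max(t + i * P - d, 0), T - 1)
            blocks.append(C[hi] - C[lo])
        out[t] = np.concatenate(blocks)
    return out

mfcc = np.random.default_rng(1).normal(size=(100, 13))   # T=100 frames, N=13 coeffs
sdc = shifted_delta_cepstra(mfcc)
print(sdc.shape)                                         # (100, 91): N*k per frame
```

The same function applies unchanged to RCC features, since SDC operates on any per-frame cepstral matrix.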
Dual estimation based vocal tract shape computation
International Journal of Speech Technology - Volume 22 - Pages 575-584 - 2018
Subhasmita Sahoo, Aurobinda Routray
This paper presents a new method for direct estimation of the vocal tract shape from the speech signal. The method computes the cross-sectional areas of uniform-length cylindrical tubes comprising the vocal tract. Cross-sectional areas are calculated from reflection coefficients at the tube junctions, whose values depend on the areas of the adjoining tubes. A new state-space representation of the speech production system has been formulated in which the reflection coefficients are parameters. The state-space model has been constructed using state equations of the glottal flow signal and the vocal tract, formulated from the Liljencrants–Fant model and the concatenated tube model, respectively. A dual extended Kalman filtering algorithm has been used to estimate the unknown parameters of the system. The estimated reflection coefficients are then used to compute the cross-sectional areas of the vocal tract. The performance of the proposed technique has been compared to an existing shape-estimation method proposed by Wakita. For both synthesized and natural speech signals, the performance of the proposed method has been found comparable to the existing one. Nevertheless, the Kalman filter algorithm used in the proposed method allows the measurement noise covariance to be tuned according to the noise level in the speech. The performance of the proposed method has therefore been seen to be comparatively more robust to noise than the existing technique.
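In the concatenated tube model, the junction relation between reflection coefficients and areas lets areas be recovered recursively once the coefficients are estimated (the dual Kalman filtering step is not shown here). A sketch, assuming the convention k_m = (A_{m+1} - A_m) / (A_{m+1} + A_m) and an arbitrary reference area at the first tube; sign conventions differ between texts:

```python
import numpy as np

def tube_areas(reflection_coeffs, a0: float = 1.0) -> np.ndarray:
    """Cross-sectional areas of a concatenated-tube vocal tract model.

    Under k_m = (A_{m+1} - A_m) / (A_{m+1} + A_m), each next area is
        A_{m+1} = A_m * (1 + k_m) / (1 - k_m),
    starting from a reference area a0 for the first tube. Areas are
    therefore only determined up to the scale of a0.
    """
    areas = [a0]
    for k in reflection_coeffs:
        areas.append(areas[-1] * (1.0 + k) / (1.0 - k))
    return np.array(areas)

print(tube_areas([1/3, 0.0]))    # [1. 2. 2.]: k=1/3 doubles the area, k=0 keeps it
```

This is why only area ratios, not absolute areas, are recoverable from the speech signal alone.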
Clean speech/speech with background music classification using HNGD spectrum
International Journal of Speech Technology - Volume 20, Issue 4 - Pages 1023-1036 - 2017
Banriskhem K. Khonglah, S. R. Mahadeva Prasanna
ILATalk: a new multilingual text-to-speech synthesizer with machine learning
International Journal of Speech Technology - Volume 19 - Pages 55-64 - 2015
Saleh M. Abu-Soud
In this paper, a new multilingual text-to-speech system based on inductive learning, called ILATalk, has been developed. It is composed of three phases: the analysis phase, the learning phase, and the synthesis phase. It can accept any language: all that is needed is to store, in data tables used as input to the system, a data set of training examples generated from a representative, selected subset of words of the required language together with the associated phonemes of that language. The system has been thoroughly tested in many sets of experiments with various parameters and sizes, and compared with two well-known approaches: ID3 and backpropagation neural networks. The results show that ILATalk produces correct phonemes with high accuracy and outperforms these algorithms in most cases.
A statistical framework for EEG channel selection and seizure prediction on mobile
International Journal of Speech Technology - 2019
Fatma E. Ibrahim, Saly Abd-Elateif El-Gindy, Sami A. El-Dolil, Adel S. El‐Fishawy, El-Sayed M. El-Rabaie, M. I. Dessouky, Ibrahim M. El-Dokany, Turky N. Alotaiby, Saleh A. Alshebeili, Fathi E. Abd El‐Samie
Maximum entropy PLDA for robust speaker recognition under speech coding distortion
International Journal of Speech Technology - Volume 22 - Pages 1115-1122 - 2019
Ahmed Krobba, Mohamed Debyeche, Sid. Ahmed Selouani
The system combining i-vectors and probabilistic linear discriminant analysis (PLDA) has been applied with great success to the speaker recognition task. The i-vector space gives a low-dimensional representation of a speech segment and of the training data of a PLDA model, which offers greater robustness under different conditions. In this paper, we propose a new framework based on i-vector/PLDA and maximum entropy (ME) to improve the performance of a speaker identification system in the presence of speech coding distortion. Results are reported on the TIMIT database and on coded speech obtained by passing the TIMIT test utterances through the AMR encoder/decoder. Our results show that the proposed method achieves improved performance compared with i-vector/PLDA and MEGMM.
Total: 851