Speech emotion recognition using spectrograms and convolutional neural networks

Majid, Taiba

Please use this identifier to cite or link to this item: http://studentrepo.iium.edu.my/handle/123456789/10783

Full metadata record

DC Field	Value	Language
dc.contributor.author	Majid, Taiba	en_US
dc.date.accessioned	2022-01-11T00:54:34Z	-
dc.date.available	2022-01-11T00:54:34Z	-
dc.date.issued	2021	-
dc.identifier.uri	http://studentrepo.iium.edu.my/handle/123456789/10783	-
dc.description.abstract	Speech Emotion Recognition (SER) is the task of recognising the emotional aspects of speech irrespective of its semantic content. Recognising human speech emotions has gained much importance in recent years as a way to improve both the naturalness and efficiency of Human-Machine Interaction (HCI). Deep Learning techniques have proved to be better suited to emotion recognition than traditional techniques because they are fast and scalable, offer all-purpose parameter fitting and act as infinitely flexible functions. Nevertheless, there is no consensus on how to measure or categorise emotions, as they are subjective. The crucial aspects of an SER system are selecting the speech emotion corpus (database), recognising the various features inherent in speech, and choosing a flexible model to classify those features. Therefore, this research proposes a different Convolutional Neural Network (CNN) architecture, known as the Deep Stride Convolutional Neural Network (DSCNN), which uses the plain nets strategy to learn discriminative features and then classify them. The main objective is to formulate an optimum model by using a smaller number of convolutional layers and eliminating the pooling layers to increase computational stability. This elimination tends to increase the accuracy and decrease the computational time of the SER system. Instead of pooling layers, strided convolutions are used for the necessary dimension reduction. CNN and DSCNN are trained on three databases: the Berlin Emotional Database (Emo-DB), a German database; Surrey Audio-Visual Expressed Emotion (SAVEE), an English database; and the Indian Institute of Technology Kharagpur Simulated Emotion Hindi Speech Corpus (IITKGP-SEHSC), a Hindi database. The speech signals of the three databases are converted to clean spectrograms by applying the short-time Fourier transform (STFT) after preprocessing. 
For the evaluation process, four emotions (angry, happy, neutral and sad) have been considered. In addition, F1 scores have been calculated for all the considered emotions in all databases. Evaluation results show that the proposed CNN and DSCNN architectures both outperform the state-of-the-art models in terms of validation accuracy. The proposed CNN architecture improves accuracy by an absolute 6.37%, 9.72% and 5.22% for the Emo-DB, SAVEE and IITKGP-SEHSC databases respectively, while the DSCNN architecture improves it by an absolute 6.37%, 10.72% and 7.22% for the same databases, in both cases compared with the best existing models. Furthermore, the proposed DSCNN architecture outperforms the proposed CNN architecture on all three databases in terms of computational time. The difference in computational time is 60 seconds, 58 seconds and 56 seconds for Emo-DB, SAVEE and IITKGP-SEHSC respectively over 300 epochs. This study sets new benchmarks on all three databases for future work, which demonstrates the effectiveness and significance of the proposed SER techniques. Future work is warranted to examine the capability of CNN and DSCNN for voice-based gender identification and image/video-based emotion recognition.	en_US
dc.language.iso	en	en_US
dc.publisher	Kuala Lumpur : Kulliyyah of Engineering, International Islamic University Malaysia, 2021	en_US
dc.subject.lcsh	Automatic speech recognition	en_US
dc.subject.lcsh	Speech processing systems	en_US
dc.title	Speech emotion recognition using spectrograms and convolutional neural networks	en_US
dc.type	Master Thesis	en_US
dc.description.identity	t11100392670TaibaMajid	en_US
dc.description.identifier	Thesis : Speech emotion recognition using spectrograms and convolutional neural networks /by Taiba Majid	en_US
dc.description.kulliyah	Kulliyyah of Engineering	en_US
dc.description.programme	Master of Science (Communication Engineering)	en_US
dc.description.abstractarabic	Speech Emotion Recognition (SER) is the task of recognising the emotional aspects of speech regardless of its semantic content. Recognising human speech emotions has gained great importance in recent years as a way to improve the naturalness and efficiency of Human-Machine Interaction (HCI). Deep learning techniques have proved to be the most suitable for emotion recognition compared with traditional techniques because of their speed and scalability, their all-purpose parameter fitting and their support for infinitely flexible functions. However, there is no common consensus on how to measure or categorise emotions, because they are subjective. The crucial aspects of an SER system are selecting the speech emotion corpus (database), recognising the various features inherent in speech, and classifying those features with a flexible model. Therefore, this research proposes a different architecture of the deep learning technique Convolutional Neural Networks (CNNs), known as the Deep Stride Convolutional Neural Network (DSCNN), using the plain nets strategy to learn discriminative features and then classify them. The main objective is to design a suitable model by using fewer convolutional layers and by eliminating the pooling layers to increase computational stability. This elimination tends to increase the accuracy and reduce the computational time of the SER system. Instead of pooling layers, special strides are used for the necessary dimension reduction. CNN and DSCNN are trained on three databases: the German Berlin Emotional Database (Emo-DB); an English database, Surrey Audio-Visual Expressed Emotion (SAVEE); and a Hindi database from the Indian Institute of Technology Kharagpur, the Simulated Emotion Hindi Speech Corpus (IITKGP-SEHSC). The speech signals of the three databases are converted to clean spectrograms by applying STFT to the signals after preprocessing. For the evaluation process, four emotions were adopted: angry, happy, neutral and sad. In addition, F1 scores were calculated for all the studied emotions in all databases. The evaluation results show that the proposed architectures of both CNN and DSCNN outperform the state-of-the-art models in terms of validation accuracy. 
The proposed CNN architecture improves accuracy by an absolute 6.37%, 9.72% and 5.22% for the Emo-DB, SAVEE and IITKGP-SEHSC databases respectively, while the DSCNN architecture improves performance by 6.37%, 10.72% and 7.22% for the Emo-DB, SAVEE and IITKGP-SEHSC databases respectively, compared with the best existing models. Furthermore, the proposed DSCNN architecture performs better than the proposed CNN architecture on the three databases examined in terms of computational time. The difference in computational time was found to be 60 seconds, 58 seconds and 56 seconds for Emo-DB, SAVEE and IITKGP-SEHSC respectively over 300 epochs. This study has set new benchmarks for all three databases for future work, which demonstrates the effectiveness and significance of the proposed SER techniques. Future work is warranted to examine the capability of CNN and DSCNN for voice-based gender identification and image/video-based emotion recognition.	en_US
dc.description.callnumber	t TK 7882 S65 M233S 2021	en_US
dc.description.notes	Thesis (MSCE)--International Islamic University Malaysia, 2020.	en_US
dc.description.physicaldescription	xiv, 135 leaves : colour illustrations ; 30cm.	en_US
item.openairetype	Master Thesis	-
item.grantfulltext	open	-
item.fulltext	With Fulltext	-
item.languageiso639-1	en	-
item.openairecristype	http://purl.org/coar/resource_type/c_18cf	-
item.cerifentitytype	Publications	-
Appears in Collections:	KOE Thesis
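
The abstract's central design choice, using strides instead of pooling layers for dimension reduction, can be illustrated with the standard output-size formula for a sliding window: out = floor((n + 2p - k) / s) + 1. The sketch below is only an illustration under assumed values, not the thesis's implementation: the 128-pixel spectrogram side, the 3x3 kernel and the padding of 1 are hypothetical and do not come from the record. It compares a stride-1 convolution followed by 2x2 pooling against a single stride-2 convolution.

```python
def conv_output_size(n: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Output length along one axis for a sliding window (convolution or pooling)."""
    return (n + 2 * padding - kernel) // stride + 1


# Hypothetical sizes chosen only for illustration; not taken from the thesis.
side = 128     # assumed spectrogram height/width in pixels
kernel = 3     # assumed convolution kernel size
padding = 1    # assumed padding that preserves size at stride 1

# Conventional block: stride-1 convolution followed by 2x2 pooling with stride 2.
after_conv = conv_output_size(side, kernel, stride=1, padding=padding)
pooled = conv_output_size(after_conv, kernel=2, stride=2)

# Strided block: a single stride-2 convolution and no pooling layer.
strided = conv_output_size(side, kernel, stride=2, padding=padding)

# Number of positions at which the convolution kernel is evaluated per feature map.
positions_conventional = after_conv ** 2
positions_strided = strided ** 2

print(pooled, strided)                              # both paths halve each axis
print(positions_conventional // positions_strided)  # ratio of kernel evaluations
```

Under these assumed sizes both paths produce the same reduced feature map, but the strided convolution evaluates its kernel at a quarter of the positions and needs no separate pooling pass. This is one plausible reason for a lower computational time; the timing differences reported in the abstract are the thesis's own measurements and are not reproduced by this sketch.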

Files in This Item:

File	Description	Size	Format
t11100392670TaibaMajid_24.pdf	24 pages file	554.13 kB	Adobe PDF
t11100392670TaibaMajid_SEC.pdf (Restricted Access)	Full text secured file	3.67 MB	Adobe PDF

Items in this repository are protected by copyright, with all rights reserved, unless otherwise indicated. Please give due acknowledgement and credits to the original authors and IIUM where applicable. No items shall be used for commercialization purposes except with written consent from the author.
