Temporal Convolutional Networks for Efficient Voice Activity Detection with an Event Camera (Olay Kamerası ile Verimli Konuşma Sesi Tespiti için Zamansal Evrişimsel Ağlar)

Arman Savran

Date

2024

Authors

Arman Savran

Open Access Color

GOLD

Green Open Access

No

Publicly Funded

No

Impulse

Average

Influence

Average

Popularity

Average

Abstract

Voice activity detection (VAD) is a widely used and necessary pre-processing step for human-computer interfaces. The presence of complex acoustic background noise necessitates the use of large deep neural networks, at the cost of a heavy computational load. Visual VAD is a preferable alternative because it is not affected by acoustic background noise. When audio data is inaccessible, the visual channel is the only option anyway. However, visual VAD, which is generally expected to run continuously for long periods, consumes significant energy because of video camera hardware and video data processing requirements. This study investigates the use of an event camera for visual VAD; thanks to neuromorphic technology, its efficiency is considerably higher than that of a conventional video camera. Because the event camera senses at high temporal resolution, the spatial dimension is reduced entirely, and extremely lightweight yet accurate models are designed that rely on learning patterns only in the time dimension. The designs combine different types of convolution dilation, down-sampling methods, and convolution separation techniques, taking temporal receptive field sizes into account. The experiments measure the robustness of VAD against various facial actions. The results show that down-sampling is necessary for high performance and efficiency, and that max-pooling achieves better performance than down-sampling with strided convolution. The best-performing standard design of this kind runs with 1.57 million floating-point operations (MFLOPS). Applying dilation with a constant factor and combining it with down-sampling is found to reduce the computation requirement by more than half at similar performance. In addition, applying depthwise separation reduces the computation requirement to 0.30 MFLOPS, that is, to less than one fifth of the standard model.
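The abstract's design trade-offs (dilation type, down-sampling, and depthwise separation against temporal receptive field and operation count) can be illustrated with a small calculation. The following is a minimal sketch and not the paper's architecture: the layer lists, kernel sizes, and channel counts are hypothetical values chosen only for illustration, and the helper names `receptive_field` and `conv1d_macs` are assumptions introduced here.

```python
def receptive_field(layers):
    """Temporal receptive field, in input samples, of a stack of 1D layers.

    layers: iterable of (kernel_size, dilation, stride) tuples ordered from
    input to output. Pooling is modelled as a layer with stride > 1.
    """
    rf, jump = 1, 1  # jump = spacing of this layer's inputs in input samples
    for kernel, dilation, stride in layers:
        rf += (kernel - 1) * dilation * jump
        jump *= stride
    return rf


def conv1d_macs(c_in, c_out, kernel, separable=False):
    """Multiply-accumulates per output time step of one 1D convolution.

    separable=True counts a depthwise convolution (one filter per input
    channel) followed by a 1x1 pointwise convolution; biases are ignored.
    """
    if separable:
        return c_in * kernel + c_in * c_out
    return c_in * c_out * kernel


# Hypothetical stacks, NOT taken from the paper.
# (a) exponentially growing dilation, no down-sampling
exp_dilated = [(3, 1, 1), (3, 2, 1), (3, 4, 1), (3, 8, 1)]
# (b) undilated convolutions interleaved with stride-2 pooling (kernel 2)
pooled = [(3, 1, 1), (2, 1, 2), (3, 1, 1), (2, 1, 2), (3, 1, 1)]

print("receptive field, exponential dilation:", receptive_field(exp_dilated))
print("receptive field, conv + pooling:", receptive_field(pooled))

# Hypothetical 16-channel layer with kernel size 3.
standard = conv1d_macs(16, 16, 3)
separable = conv1d_macs(16, 16, 3, separable=True)
print("MACs per step, standard:", standard)
print("MACs per step, depthwise separable:", separable)
```

Down-sampling also reduces cost in a way the per-step count above does not show: every layer after a stride-2 stage is evaluated at half as many time steps, so its total operation count over a fixed-length input halves as well.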

ORCID

0000-0001-5142-6384

Keywords

Computer Science, Software Engineering

OpenCitations Citation Count

N/A

Source

Journal of Intelligent Systems: Theory and Applications

Volume

7

Issue

2

Start Page

102

End Page

115

URI

https://gcris.yasar.edu.tr/handle/123456789/10449
https://search.trdizin.gov.tr/en/yayin/detay/1267098

Collections

TR-Dizin Indexed Publications Collection

OpenAlex FWCI

0.3577
