#360: LARNet-STC: Spatio-temporal orthogonal region selection network for laryngeal closure detection in endoscopy videos


The vocal folds (VFs) are a pair of muscles in the larynx that play a critical role in breathing, swallowing, and speaking. VF function can be adversely affected by various medical conditions, including head or neck injuries, stroke, tumors, and neurological disorders. In this paper, we propose a deep learning system for automated detection of laryngeal adductor reflex (LAR) events in laryngeal endoscopy videos to enable objective, quantitative analysis of VF function. The system combines our novel orthogonal region selection network with temporal context. The network learns to map its input directly to a VF open/closed state without first segmenting or tracking the VF region. This one-step approach drastically reduces manual annotation needs, from labor-intensive segmentation masks or VF motion tracks to frame-level class labels. The proposed spatio-temporal network, with its orthogonal region selection subnetwork, integrates local image features, global image features, and VF state information over time for robust LAR event detection. We evaluate the proposed network against several network variations that incorporate temporal context and show that it performs better. The experimental results show promising performance for automated, objective, and quantitative analysis of LAR events from laryngeal endoscopy videos, with F1 scores above 90% for LAR frames and above 99% for non-LAR frames.
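As an illustration of the per-class, frame-level F1 metric reported above, the following minimal sketch computes F1 separately for LAR and non-LAR frames from frame-level labels. The function name `per_class_f1`, the label encoding (1 = LAR frame, 0 = non-LAR frame), and the toy label sequences are assumptions made for this example only; they are not taken from the paper and do not reproduce its results.

```python
from typing import Sequence


def per_class_f1(y_true: Sequence[int], y_pred: Sequence[int], positive: int) -> float:
    """F1 score for one class, treating `positive` as the positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    denom = 2 * tp + fp + fn
    # F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when the class never occurs
    return 2 * tp / denom if denom else 0.0


# Hypothetical toy data: 1 = LAR (closure) frame, 0 = non-LAR frame
y_true = [0, 0, 1, 1, 1, 0, 0, 0]
y_pred = [0, 0, 1, 1, 0, 0, 0, 0]

f1_lar = per_class_f1(y_true, y_pred, positive=1)
f1_non_lar = per_class_f1(y_true, y_pred, positive=0)
print(f"F1 (LAR): {f1_lar:.3f}, F1 (non-LAR): {f1_non_lar:.3f}")
```

Reporting the two classes separately matters when LAR frames are much rarer than non-LAR frames, since a single aggregate score could hide poor performance on the minority class.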