
SOTA | State of the art of Multimodal Mobile Stress Detection
By Melanie Maier, 21.01.2026
Evaluating Multimodal Sensor Fusion, Machine Learning Models and Mobile Feasibility in Stress Detection Systems
Abstract
Over the last few years, mobile and wearable stress detection has evolved rapidly, driven by maturing sensor technology, machine learning (ML) and multimodal fusion strategies. Today's fast-paced world is driving a rise in stress-induced illnesses, which makes stress detection increasingly important. This state-of-the-art article critically compares 11 recent and influential papers (2022-2025) on multimodal stress detection.
The analysis examines datasets, preprocessing strategies, fusion architectures, machine learning approaches, evaluation protocols and the extent to which current models are feasible for real-world mobile deployment. Findings reveal strong progress in multimodal fusion and deep learning, but limited attention to on-device constraints, poor cross-subject generalization, and an overreliance on laboratory datasets. Clear research gaps and design recommendations are identified to guide future work toward robust, scalable, and ecologically valid mobile stress-detection systems.
KEYWORDS
Stress Detection, Mobile, Machine Learning, Sensors, Mobile Feasibility
1 Introduction
Stress is a multidimensional physiological and psychological response involving changes in the autonomic nervous system, endocrine activity, metabolism, behavior, cognition and movement. Today, wearables such as smartwatches enable continuous monitoring of the human physiological state (Zhao, Masood, & Ning, 2025). Wearables and mobile systems have proven to be promising platforms for continuous stress monitoring due to high user acceptance, broad availability and the increasing fidelity of embedded sensors.
However, mobile stress detection in real-world settings remains challenging. Signals from wearable sensors are easily affected by motion artifacts, environmental conditions, sensor placement variability and individual physiological differences. Approaches that rely on a single sensor, such as photoplethysmography (PPG, which detects changes in blood volume using a pulse oximeter) or electrodermal activity (EDA, which measures the electrical conductivity of the skin), generally struggle with robustness under real-world conditions.
Multimodal stress detection attempts to resolve this by combining heterogeneous sensor modalities (e.g., PPG, EDA, accelerometer (ACC), skin temperature). The resulting sensor fusion pipelines can capture complementary aspects of the stress response, making them more reliable and generalizable.
This SOTA review analyzes eleven key papers from 2022-2025, comparing their multimodality, ML models, fusion techniques, dataset characteristics and mobile feasibility. The goal is to build a structured understanding of the field’s strengths, limitations and future directions.
2 Multimodal Datasets and Data Characteristics
Multimodal datasets are datasets that contain multiple types of data (modalities) collected from different sources, in this case different sensors. Instead of relying on one kind of information, multimodal datasets combine several complementary data types to provide a richer and more robust understanding of the phenomenon under study. Each analyzed paper is based on either an established or a custom dataset.
2.1 WESAD as the benchmark
Five of the reviewed papers make use of the WESAD dataset. It’s a laboratory-based multimodal dataset capturing EDA, ECG, BVP/PPG, accelerometer data, respiration, EMG and skin temperature during a Trier Social Stress Test (TSST) scenario.
It provides a broad set of sensor data connected to stress and therefore allows for recognition of stress in individuals using a combination of data inputs.
Despite its popularity, WESAD has key limitations:
- Only 15 subjects
- Heavily lab-controlled
- Limited ecological validity
Nevertheless, its multimodal nature makes it valuable for fusion research.
2.2 Rare real-world datasets
Only two studies (Islam & Washington, 2023; Darwish, et al., 2025) include real-life data. These capture stress under natural conditions (e.g., daily life, workload), but suffer from event uncertainty, self-report noise and device variability.
2.3 Emerging multimodal datasets
A major 2025 contribution is EmpathicSchool (Hosseini, et al., 2025), a large multimodal dataset. It combines:
- Facial video
- PPG, ECG, EDA
- Accelerometer
- Structured academic stressors
This dataset expands the field toward a richer and visually enhanced multimodality, though it’s not yet optimized for mobile devices.
3 Preprocessing and Feature Engineering
Preprocessing and feature engineering are two essential steps in machine learning that transform raw data from datasets into a form that machine learning models can understand and learn from effectively. They happen before training the model and often determine whether the model performs poorly or achieves the desired results.
The analyzed papers show that most studies follow a consistent preprocessing pipeline (a minimal code sketch follows the list):
- Signal cleaning through band-pass filtering or artifact removal
- Normalization
- Segmentation (typical window sizes 30-60 seconds)
- Feature extraction:
  - Physiological: HRV metrics, EDA peaks, spectral features
  - Behavioral: ACC-derived activity levels
  - Visual: CNN-extracted features
  - Contextual: metadata, event labels
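To make these steps concrete, the sketch below shows one possible minimal pipeline for a single wearable channel (here EDA) in Python. The sampling rate, cutoff frequency, window length and feature set are illustrative assumptions, not values prescribed by the reviewed papers, and a simple low-pass filter stands in for the cleaning step.

```python
import numpy as np
from scipy.signal import butter, filtfilt, find_peaks

FS = 4            # assumed EDA sampling rate in Hz (illustrative)
WINDOW_S = 60     # 60-second windows, as commonly reported

def clean(signal, fs=FS, cutoff=1.0):
    """Low-pass filter to suppress high-frequency noise/artifacts."""
    b, a = butter(4, cutoff / (fs / 2), btype="low")
    return filtfilt(b, a, signal)

def normalize(signal):
    """Z-score normalization per recording."""
    return (signal - signal.mean()) / (signal.std() + 1e-8)

def segment(signal, fs=FS, window_s=WINDOW_S):
    """Non-overlapping fixed-length windows."""
    n = fs * window_s
    usable = len(signal) - len(signal) % n
    return signal[:usable].reshape(-1, n)

def extract_features(window, fs=FS):
    """Simple hand-crafted EDA features per window."""
    peaks, _ = find_peaks(window, prominence=0.05)   # SCR-like peaks
    return np.array([window.mean(), window.std(), len(peaks) / (len(window) / fs)])

raw = np.random.randn(FS * 600).cumsum() * 0.01      # stand-in for 10 min of EDA
windows = segment(normalize(clean(raw)))
features = np.vstack([extract_features(w) for w in windows])
print(features.shape)                                # (10, 3): one row per window
```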
A recurring issue is inconsistent preprocessing documentation, which reduces transparency and reproducibility. Only a few papers (e.g., Md Santo, et al., 2025; Zhao, Masood, & Ning, 2025) describe their pipelines in sufficient detail for replication.
4 Multimodal Fusion Strategies
Fusion strategies define how different modalities are combined to yield a stress prediction. Three main paradigms dominate.
4.1 Early Fusion
Early fusion means that features or raw data from all modalities are concatenated before being fed to a single model.
Advantages: simple, efficient
Disadvantages: sensitive to missing modalities
Used in:
- TEANet (Md Santo, et al., 2025)
- Image-encoding CNN models (Ghosh, Kim, Ijaz, Singh, & Mahmud, 2022)
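For illustration, the following is a minimal sketch of feature-level (early) fusion, assuming per-window feature vectors have already been extracted per modality. The feature arrays, their sizes and the logistic-regression classifier are placeholders, not the models used in the cited papers.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_windows = 200
ppg_feats = rng.normal(size=(n_windows, 6))     # e.g. HRV-style features per window
eda_feats = rng.normal(size=(n_windows, 4))     # e.g. SCR count, tonic level, ...
acc_feats = rng.normal(size=(n_windows, 3))     # e.g. mean activity per axis
labels    = rng.integers(0, 2, size=n_windows)  # 0 = baseline, 1 = stress

# Early fusion: concatenate all modality features before a single model.
fused = np.concatenate([ppg_feats, eda_feats, acc_feats], axis=1)
clf = LogisticRegression(max_iter=1000).fit(fused, labels)
print(clf.predict(fused[:5]))
```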
4.2 Late Fusion
Late fusion means that each modality is processed independently and then predictions are combined.
Advantages: robust to missing or weak signals
Disadvantages: limited modeling of cross-modality relationships
Used in:
- From lab to real-life (Darwish, et al., 2025)
- EmpathicSchool Dataset (Hosseini, et al., 2025)
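A corresponding decision-level (late) fusion sketch is shown below: each modality gets its own classifier, and the per-modality class probabilities are averaged. The data, the choice of Random Forests and the averaging rule are illustrative assumptions; the cited studies use their own aggregation schemes.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
n = 200
ppg_feats = rng.normal(size=(n, 6))
eda_feats = rng.normal(size=(n, 4))
labels = rng.integers(0, 2, size=n)

# One model per modality, trained independently.
ppg_clf = RandomForestClassifier(n_estimators=50).fit(ppg_feats, labels)
eda_clf = RandomForestClassifier(n_estimators=50).fit(eda_feats, labels)

# Late fusion: average the class probabilities.
# A missing modality could simply be dropped from the average.
proba = (ppg_clf.predict_proba(ppg_feats) + eda_clf.predict_proba(eda_feats)) / 2
prediction = proba.argmax(axis=1)
print(prediction[:10])
```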
4.3 Hybrid / Cross-Modality Fusion
In hybrid or cross-modality fusion, ML models learn relationships between modalities through different methods, e.g.:
- Transformer cross-attention (Oliver & Dakshit, 2025)
- Privileged modality learning (Zhao, Masood, & Ning, 2025)
- Context-aware adaptive fusion (Rashid, Mortlock, & Al Faruque, 2023)
These methods achieve the best results but also have the highest computational cost, making mobile deployment difficult without compression techniques.
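As a sketch of the cross-modality idea, the snippet below lets an EDA token sequence attend to a PPG token sequence using PyTorch's multi-head attention. The embedding size, sequence length and classification head are arbitrary; this is a generic illustration, not the architecture of any specific cited paper.

```python
import torch
import torch.nn as nn

d_model = 32
cross_attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=4, batch_first=True)

# Per-window token sequences from two modality encoders: (batch, tokens, d_model).
eda_tokens = torch.randn(8, 20, d_model)
ppg_tokens = torch.randn(8, 20, d_model)

# Cross-attention: EDA queries attend over PPG keys/values, fusing the modalities.
fused, attn_weights = cross_attn(query=eda_tokens, key=ppg_tokens, value=ppg_tokens)

classifier = nn.Linear(d_model, 2)          # 2 classes: baseline vs. stress
logits = classifier(fused.mean(dim=1))      # pool over tokens, then classify
print(logits.shape)                         # torch.Size([8, 2])
```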
5 Machine Learning Model Landscape
5.1 Classical ML
Classical machine learning models include, for example, Random Forests, SVMs and logistic regression. They:
- Have low computational cost
- Depend on hand-crafted features
- Are less effective on high-dimensional multimodal signals
An example of such a model is the Global HRV + RF approach (Dahal, Bogue-Jimenez, & Doblas, 2023). It demonstrates high mobile feasibility but low multimodal richness.
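As a sketch of this model family (loosely in the spirit of an HRV + Random Forest pipeline, but not a reproduction of the cited work), a Random Forest is trained on a few hand-crafted HRV features; the feature names and data are placeholders.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

rng = np.random.default_rng(42)
# Placeholder HRV feature matrix: e.g. mean RR, SDNN, RMSSD, pNN50, LF/HF ratio.
X = rng.normal(size=(300, 5))
y = rng.integers(0, 2, size=300)          # 0 = no stress, 1 = stress

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
print("F1:", f1_score(y_te, clf.predict(X_te)))
```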
5.2 Deep Learning Models
The analyzed studies employ several types of deep learning models:
5.2.1 CNNs and CNN-LSTM hybrids. A CNN (Convolutional Neural Network) is a deep learning model originally developed for images but widely used for signal processing. A CNN-LSTM combines it with a Long Short-Term Memory network for modeling long-term time dependencies: the CNN captures short-term patterns within a window, while the LSTM captures their temporal evolution across windows.
Strength: automatic feature extraction
Weakness: mobile computational limitations
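A minimal CNN-LSTM sketch in PyTorch is shown below, assuming single-channel signal windows as input; the channel counts, kernel sizes and window length are illustrative only.

```python
import torch
import torch.nn as nn

class CnnLstmStressNet(nn.Module):
    def __init__(self, n_classes=2):
        super().__init__()
        # CNN: local (within-window) pattern extraction.
        self.cnn = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool1d(2),
            nn.Conv1d(16, 32, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool1d(2),
        )
        # LSTM: temporal evolution over the CNN feature sequence.
        self.lstm = nn.LSTM(input_size=32, hidden_size=32, batch_first=True)
        self.head = nn.Linear(32, n_classes)

    def forward(self, x):                    # x: (batch, 1, samples_per_window)
        h = self.cnn(x).permute(0, 2, 1)     # -> (batch, time, channels)
        _, (h_n, _) = self.lstm(h)
        return self.head(h_n[-1])            # last hidden state -> class logits

model = CnnLstmStressNet()
print(model(torch.randn(4, 1, 240)).shape)   # torch.Size([4, 2])
```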
5.2.2 Transformers. Transformers are a deep learning architecture designed to handle sequences of data (such as language, signals, video, or sensor time series). They learn global structure, whereas CNNs learn local features and LSTMs learn sequential dependencies, and they outperform RNNs and CNNs in cross-modality modeling (Oliver & Dakshit, 2025). They are based on self-attention, a mechanism that allows the model to look at any part of the input at any time, and on parallel processing, which lets them handle all time steps simultaneously rather than one by one. This makes them extremely fast on GPUs and well suited to long signals, multimodal fusion, cross-modal attention (Oliver & Dakshit, 2025) and complex temporal patterns. However, they have a high memory footprint and high inference latency, and are therefore unsuitable for on-device execution without quantization or pruning.
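The following is a minimal sketch of a Transformer encoder classifying windows of fused sensor features; the feature count, model width and pooling choice are assumptions for illustration, not the configuration of the cited study.

```python
import torch
import torch.nn as nn

class TinyStressTransformer(nn.Module):
    def __init__(self, n_features=8, d_model=32, n_heads=4, n_layers=2, n_classes=2):
        super().__init__()
        self.input_proj = nn.Linear(n_features, d_model)   # per-time-step embedding
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, x):                 # x: (batch, time_steps, n_features)
        h = self.encoder(self.input_proj(x))
        return self.head(h.mean(dim=1))   # average-pool over time, then classify

model = TinyStressTransformer()
windows = torch.randn(16, 240, 8)         # 16 windows, 240 time steps, 8 fused features
print(model(windows).shape)               # torch.Size([16, 2])
```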
5.2.3 Self-Supervised Learning. Self-Supervised Learning is an approach in which the model derives its own learning objectives from the data in order to learn useful representations, without relying on costly human-labeled data. It is particularly powerful when large amounts of unlabeled data are available, as is often the case with physiological or sensor-based stress data. Islam and Washington show that SSL pretraining significantly improves person-specific stress detection, reducing training data requirements (Islam & Washington, 2023).
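The sketch below illustrates the general idea with a masked-reconstruction pretext task: an encoder learns to fill in masked segments of unlabeled windows, after which it can be fine-tuned with only a few labeled windows. This is a generic illustration, not the pretraining scheme used by Islam and Washington; all sizes and the masking ratio are assumptions.

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(240, 64), nn.ReLU(), nn.Linear(64, 32))  # representation
decoder = nn.Linear(32, 240)                                               # only for pretraining

opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)
unlabeled = torch.randn(256, 240)            # plenty of unlabeled sensor windows

# Pretext task: reconstruct the full window from a randomly masked version.
for _ in range(100):
    mask = (torch.rand_like(unlabeled) > 0.2).float()   # hide ~20% of each window
    recon = decoder(encoder(unlabeled * mask))
    loss = nn.functional.mse_loss(recon, unlabeled)
    opt.zero_grad()
    loss.backward()
    opt.step()

# Fine-tuning: reuse the pretrained encoder with a small labeled set.
head = nn.Linear(32, 2)
few_labeled, few_labels = torch.randn(20, 240), torch.randint(0, 2, (20,))
logits = head(encoder(few_labeled))
print(logits.shape)                          # torch.Size([20, 2])
```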
5.2.4 Autoencoders. Autoencoders are a class of neural networks designed to learn efficient representations of data, typically for purposes like dimensionality reduction, feature learning, or denoising. They are self-supervised in nature because they use the input data itself as the target for training, meaning no external labels are required. TEANet (Md Santo, et al., 2025) uses a transpose-enhanced autoencoder for feature compression, promising for low-resource mobile inference.
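A minimal plain-autoencoder sketch for feature compression follows (not TEANet itself): a high-dimensional multimodal feature vector is squeezed through a small bottleneck, and the resulting code could serve as a compact input for on-device classification. Feature and code sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class FeatureAutoencoder(nn.Module):
    def __init__(self, n_features=64, code_size=8):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_features, 32), nn.ReLU(),
                                     nn.Linear(32, code_size))
        self.decoder = nn.Sequential(nn.Linear(code_size, 32), nn.ReLU(),
                                     nn.Linear(32, n_features))

    def forward(self, x):
        code = self.encoder(x)               # compact representation (here 8 values)
        return self.decoder(code), code

ae = FeatureAutoencoder()
opt = torch.optim.Adam(ae.parameters(), lr=1e-3)
features = torch.randn(512, 64)              # unlabeled multimodal feature vectors

for _ in range(100):                         # trained on reconstruction only (no labels)
    recon, _ = ae(features)
    loss = nn.functional.mse_loss(recon, features)
    opt.zero_grad()
    loss.backward()
    opt.step()

_, compressed = ae(features)
print(compressed.shape)                      # torch.Size([512, 8]) -> 8x smaller input
```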
6 Mobile Feasibility Analysis
Mobile feasibility refers to whether a machine-learning model and sensor setup, i.e. a complete stress-detection system, can realistically run on a mobile device such as:
- A smartphone
- A smartwatch
- A fitness tracker
- An embedded IoT health device
A crucial dimension often missing in published research is evaluation on actual mobile or wearable hardware.
From the papers analyzed:
- None provide full mobile benchmarking (e.g., latency, battery and resource usage)
- Only three report any on-device considerations
| Paper / Study | Sensors Used for Training | Sensors Required at Deployment (Inference) | Comment |
| PULSE (Zhao, Masood, & Ning, 2025) | EDA, PPG/BVP, ECG, ACC, Temperature | PPG + ACC only (EDA used only during training) | Privileged knowledge transfer enables strong sensor reduction |
| TEANet (Md Santo, et al., 2025) | PPG/BVP, EDA, ACC | PPG/BVP + ACC | EDA used for reconstruction during training; not required at inference |
| Cross-Modality Transformer (Oliver & Dakshit, 2025) | ECG, EDA, EMG, RESP, TEMP, ACC | All modalities required | No sensor reduction, computationally heavy |
| Individualized SSL (Islam & Washington, 2023) | HRV (ECG/PPG), EDA, ACC | HRV + ACC | EDA optional; personalization reduces training burden |
| SELF-CARE (Rashid, Mortlock, & Al Faruque, 2023) | PPG, EDA, ACC, contextual signals (GPS, phone logs) | PPG + ACC + contextual signals | EDA helpful but optional, high sensor complexity |
| From Lab to Real-Life (Darwish, et al., 2025) | PPG, EDA, ACC | PPG + ACC | Classical ML models, mobile-feasible |
| Image-Encoding CNN (Ghosh, Kim, Ijaz, Singh, & Mahmud, 2022) | PPG, EDA, ECG | PPG + EDA + ECG | No reduction, image-encoding pipeline requires all modalities |
| Global HRV + RF (Dahal, Bogue-Jimenez, & Doblas, 2023) | ECG-derived HRV features | ECG or PPG HRV only | Only true single-sensor solution |
| EmpathicSchool (Hosseini, et al., 2025) | Facial video, EDA, ECG, PPG, ACC | Video + physiological signals | Very sensor-intensive, not suitable for mobile deployment |
| Stressor Type Matters (Prajod, Mahesh, & André, 2024) | ECG, EDA, ACC, RESP | All modalities required | Focus on generalization, no deployment optimization |
Table 1: Training and Deployment Sensors
6.1 Critical barriers
The analyzed studies reveal the following critical barriers to the mobile feasibility of the presented stress-detection systems:
- Large model sizes (Transformers, CNN-LSTM hybrids)
- High inference latency for multimodal pipelines
- Sensor synchronization issues in mobile settings
- Energy consumption rarely measured
- Multimodal dropout (missing modalities in real life)
6.2 Promising mobile-oriented techniques
There are still some promising mobile-oriented techniques:
- Model compression (quantization, pruning; see the sketch below)
- Teacher-student learning (Zhao, Masood, & Ning, 2025)
- Lightweight multimodal autoencoders (Md Santo, et al., 2025)
- Contextual gating (Rashid, Mortlock, & Al Faruque, 2023)
These approaches are promising but still underexplored.
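As one concrete example of the compression direction, PyTorch's dynamic quantization can convert the linear layers of a trained model to 8-bit weights. The model below is a stand-in, and actual latency and energy savings on wearable hardware would still need to be measured.

```python
import torch
import torch.nn as nn

# Stand-in for an already trained stress classifier built from dense layers.
model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(),
                      nn.Linear(128, 64), nn.ReLU(),
                      nn.Linear(64, 2))

# Post-training dynamic quantization: Linear weights are stored as int8,
# roughly quartering their memory footprint compared to float32.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

logits = quantized(torch.randn(1, 64))    # inference still runs on the quantized model
print(logits.shape)                       # torch.Size([1, 2])
```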
7 Key Comparative Insights
Across the reviewed studies, five clear patterns emerge:
- Multimodal fusion consistently outperforms unimodal approaches, but its computational cost is restrictive for mobile implementation
- Cross-subject generalization remains weak; person-specific models still perform better
- Transformer architectures lead in accuracy, but remain unsuitable for real-time mobile inference without compression
- Real-world datasets are severely lacking, limiting ecological validity
- Mobile feasibility is the largest research gap, almost entirely unaddressed in current literature
8 Research Gaps and Future Directions
- Multimodal real-world datasets. The field urgently needs datasets collected under natural conditions, with motion artifacts, lighting changes and real-world stressors, to test stress-detection systems under realistic conditions.
- Standardization of preprocessing pipelines. Current inconsistency makes cross-paper comparisons unreliable.
- Explicit modeling of modality dropout. In mobile contexts, sensors fail frequently.
- Energy-aware model design. Most published architectures are unsuited to wearables.
- Fairness and personalization. Little work exists on how stress-detection performance varies across gender, age, or physiology.
- Cross-dataset generalizability. Transfer learning and domain adaptation are required for practical real-world deployment.
9 Conclusion
Multimodal mobile stress detection is progressing rapidly, particularly in fusion architectures and learning models. However, the field remains dominated by laboratory datasets and computationally expensive models. Practical mobile feasibility, real-world robustness, subject variability and missing-modality sensitivity are insufficiently addressed. Future research must integrate lightweight multimodal models, mobile-optimized architectures and real-world datasets to build usable, scalable and ethically sound stress-detection systems.
| Study | Dataset | Modalities Used | Fusion Type | ML Model | Metrics | Mobile Feasibility | Key Limitation |
| PULSE (Zhao, Masood, & Ning, 2025) | WESAD (and derivatives) | EDA, ECG, BVP/PPG, ACC, Temperature | Privileged modality (teacher-student) fusion | Deep Learning (teacher-student privileged knowledge transfer) | Accuracy, F1-score | Medium (model compression intended; not fully evaluated on-device) | Small lab dataset; EDA required during training; limited real-world evaluation |
| TEANet (Md Santo, et al., 2025) | WESAD | BVP/PPG, EDA, ACC | Early fusion (feature-level); autoencoder compression | Transpose-enhanced Autoencoder (DL) | Accuracy, F1-score, Kappa | Medium (feature compression promising; no full mobile benchmark) | Limited reporting on multimodality and energy/latency metrics |
| Cross-Modality Investigation (Oliver & Dakshit, 2025) | WESAD | ECG, EDA, EMG, RESP, TEMP, ACC | Attention-based cross-modality fusion (Transformer) | Transformer (cross-modal) | Accuracy, F1-score | Low (high compute & memory; not mobile-friendly) | High computational cost; lacks on-device evaluation |
| Individualized SSL (Islam & Washington, 2023) | Custom wearable + phone (real-world) | HRV (from PPG/ECG), EDA, ACC | Late fusion (separate encoders, combined features) | Self-Supervised Learning encoder + classifier | F1-score, ROC-AUC | Medium (personalized models reduce data needs; mobile deployment possible but not fully demonstrated) | Less generalizable across users; requires SSL pretraining |
| SELF-CARE (Rashid, Mortlock, & Al Faruque, 2023) | Custom context-aware dataset | PPG, EDA, ACC, contextual sensors (phone logs/GPS) | Hybrid/adaptive fusion (context-aware gating) | DNN + context module (hybrid) | Accuracy, F1-score | Medium (adaptive methods promising; requires many sensors) | High system complexity; many sensors reduce deployability |
| From Lab to Real-Life (Darwish, et al., 2025) | Wearable study (lab + real-world) | PPG, EDA, ACC | Late fusion (decision-level aggregation) | Classical ML (Random Forest, SVM) and simple ensembles | Accuracy (real-world vs lab comparisons) | High (simpler models validated in real-world; low latency/energy reported) | Decrease in accuracy in wild conditions; device variability |
| Image-encoding CNN (Ghosh, Kim, Ijaz, Singh, & Mahmud, 2022) | WESAD + other datasets | PPG, EDA, ECG (time series encoded as images) | Early fusion via image-encoding (GAF/Recurrence plots) | CNN on image-encoded signals | Accuracy | Low (image encoding + CNN pipeline is compute-heavy) | Complex pipeline; not optimized for on-device inference |
| Global HRV + RF (Dahal, Bogue-Jimenez, & Doblas, 2023) | Custom HRV dataset (real-world) | HRV only (ECG-derived features) | Single-modality (no fusion) | Random Forest (classical ML) | Accuracy, F1-score | High (lightweight; mobile-ready) | Limited to HRV signal; lacks multimodal robustness |
| Recent Advances Review (Ghonge, Shukla, Pradeep, & Solanki, 2025) | Multiple (review) | Multiple (physiological, contextual, visual) | Survey of fusion strategies (various) | Survey (various ML/DL approaches) | N/A (review) | N/A (review paper; discusses feasibility conceptually) | Not empirical; synthesizes literature only |
| EmpathicSchool (Hosseini, et al., 2025) | EmpathicSchool (new multimodal dataset) | Facial video, EDA, ECG, PPG, ACC | Late multimodal fusion (visual + physiological) | CNN (visual) + physiological models; late fusion | Accuracy, F1-score | Low (video processing heavy; not optimized for wearables) | High sensor cost and compute; limited mobile applicability |
| Stressor Type Matters (Prajod, Mahesh, & André, 2024) | WESAD + multiple datasets (cross-dataset) | ECG, EDA, ACC, RESP (varies) | Analysis of modelling factors; mixed approaches | Mixed (classical ML + DL experiments) | Accuracy, F1-score (cross-dataset evaluation) | Low (study highlights generalization issues; no mobile eval) | Poor cross-dataset generalization; stressor sensitivity |
Table 2: Overview of Reviewed Studies
REFERENCES
[1] Dahal, K., Bogue-Jimenez, B., & Doblas, A. (2023). Global Stress Detection Framework Combining a Reduced Set of HRV Features and Random Forest Model. Sensors, Basel, Switzerland. doi:https://doi.org/10.3390/s23115220
[2] Darwish, B. A., Rehman, S. U., Sadek, I., Salem, N. M., Kareem, G., & Mahmoud, L. N. (2025). From lab to real-life: A three-stage validation of wearable technology for stress monitoring. MethodsX. doi:https://doi.org/10.1016/j.mex.2025.103205
[3] Ghonge, M., Shukla, V. K., Pradeep, N., & Solanki, R. K. (2025). Recent Advances in Multimodal Deep Learning for Stress Prediction: Toward Cycle-Aware and Gender-Sensitive Health Analytics. doi:https://doi.org/10.1051/epjconf/202534101059
[4] Ghosh, S., Kim, S., Ijaz, M. F., Singh, P. K., & Mahmud, M. (2022). Classification of Mental Stress from Wearable Physiological Sensors Using Image-Encoding-Based Deep Neural Network. Biosensors. doi:https://doi.org/10.3390/bios12121153
[5] Hosseini, M., Sohrab, F., Gottumukkala, R., Bhupatiraju, R. T., Katragadda, S., Raitoharju, J., . . . Gabbouj, M. (2025). A multimodal stress detection dataset with facial expressions and physiological signals. doi:https://doi.org/10.1038/s41597-025-05812-0
[6] Islam, T., & Washington, P. (2023). Individualized Stress Mobile Sensing Using Self-Supervised Pre-Training. Applied Sciences. doi:https://doi.org/10.3390/app132112035
[7] Md Santo, A., Sapnil Sarker, B., Mohammod, A. M., Sumaiya, K., Manish, S., & Chowdhury, M. (2025). TEANet: A Transpose-Enhanced Autoencoder Network for Wearable Stress Monitoring. Retrieved from https://arxiv.org/abs/2503.12657
[8] Oliver, E., & Dakshit, S. (2025). Cross-Modality Investigation on WESAD Stress Classification. Retrieved from https://arxiv.org/abs/2502.18733
[9] Prajod, P., Mahesh, B., & André, E. (2024). Stressor Type Matters! — Exploring Factors Influencing Cross-Dataset Generalizability of Physiological Stress Detection. doi:10.48550/arXiv.2405.09563
[10] Rashid, N., Mortlock, T., & Al Faruque, M. A. (2023). Stress Detection using Context-Aware Sensor Fusion from Wearable Devices. Retrieved from https://arxiv.org/abs/2303.08215
[11] Zhao, Z., Masood, M., & Ning, Y. (2025). PULSE: Privileged Knowledge Transfer from Electrodermal Activity to Low-Cost Sensors for Stress Monitoring. CA, USA. doi:10.48550/arXiv.2510.24058