Week 2: Initial Design & Elaboration #
Project name: VoiceDiary
Code repository: link
VoiceDiary is an AI-powered voice journaling tool that analyzes tone, emotion, and key themes in spoken entries. It generates personalized emotional insights and well-being suggestions based on recorded reflections.
Team #
Team member | Telegram alias | Innopolis Email | Responsibilities |
---|---|---|---|
Dziyana Melnikava | @meldilen24 | dz.melnikava@innopolis.university | PM, Frontend |
Anastasia Kuchumova | @n_rngk | a.kuchumova@innopolis.university | Frontend, UX/UI |
Dzhamilia Fatkullina | @jam11a | d.fatkullina@innopolis.university | ML |
Elina Kuzmichyova | @lin_anile | e.kuzmichyova@innopolis.university | ML |
Olesia Novoselova | @doiwannaknoww8 | o.novoselova@innopolis.university | Backend |
Danil Davydyan | @chocop | d.davydyan@innopolis.university | Backend |
Detailed Requirements Elaboration #
Responsible: Dzhamilia Fatkullina, Dziyana Melnikava
Expanded User Stories for MVP #
As a user, I want to speak freely into the app without typing or structuring my thoughts, so that I can reflect naturally, even when tired or overwhelmed.
Acceptance Criteria:
- One-tap recording:
✅ Single press of a microphone button starts/stops recording.
✅ Visual feedback (waveform/timer) confirms active recording.
- Uninterrupted flow:
✅ Pause/resume functionality without data loss.
- Accessibility:
✅ No forced sign-up to record.
As a user, I want the app to require almost no setup or cognitive effort, so that I can use it daily, even on busy days.
Acceptance Criteria:
- Instant insights:
✅ Auto-generated emotion summary (e.g., “You sound reflective today”).
✅ 2-3 key themes extracted without user input.
- Zero friction:
✅ The app suggests hints for what the user can talk about (see the payload sketch below).
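To make these criteria concrete, here is a minimal sketch of what one auto-generated insight could look like. Every field name is hypothetical and illustrative, not part of the agreed API contract.

```python
# Hypothetical shape of the auto-generated insight for one voice entry.
# All field names are illustrative; the real contract lives in the ML API
# document linked in the Backend section.
insight_example = {
    "summary": "You sound reflective today",    # one-line emotion summary
    "dominant_emotion": "neutral",              # one of the model's classes
    "key_themes": ["work", "sleep", "family"],  # 2-3 themes, no user input
    "suggested_prompts": [                      # hints of what to talk about
        "What drained your energy today?",
        "What is one thing you are grateful for?",
    ],
}
```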
Backlog in Kaiten: CLICK TO SEE #
Design #
Responsible: Dziyana Melnikava, Anastasia Kuchumova
User Flow Diagrams: CLICK TO SEE
Low-fidelity wireframes in Figma: CLICK TO SEE
Frontend #
Responsible: Dziyana Melnikava, Anastasia Kuchumova
Link to frontend branch: CLICK TO SEE
- The frontend structure was reworked and made more detailed after the MVP vision changed.
- Skeleton components were created.
- Onboarding page was implemented according to the wireframe design in Figma.
Backend #
Responsible: Olesia Novoselova, Danil Davydyan
Project Architecture: Schema
Database schema: Schema
Link to API contract for ML: CLICK TO SEE
Link to Golang HTTP service connected to PostgreSQL: CLICK TO SEE
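As a rough illustration of the boundary between the Go service and the ML side, below is a minimal sketch of the Python ML API that the POST endpoint forwards a record’s data URL to. The route and field names here are assumptions for illustration; the linked API contract is authoritative.

```python
# Minimal sketch (FastAPI) of the Python ML API the Go service forwards
# records to. Route and field names are assumptions, not the agreed contract.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class AnalyzeRequest(BaseModel):
    record_id: int
    audio_url: str  # data URL of the stored voice record

class AnalyzeResponse(BaseModel):
    record_id: int
    emotion: str       # e.g. "sad", "neutral", "happy"
    themes: list[str]  # extracted key themes

@app.post("/analyze", response_model=AnalyzeResponse)
def analyze(req: AnalyzeRequest) -> AnalyzeResponse:
    # Placeholder: the real implementation fetches the audio and runs the
    # transcription and emotion recognition models.
    return AnalyzeResponse(record_id=req.record_id, emotion="neutral", themes=[])
```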
ML #
Responsible: Dzhamilia Fatkullina, Elina Kuzmichyova
Link to ML contribution: CLICK TO SEE
Research Summary #
Models #
1. Emotion Recognition From Speech (V1.0) - we do not yet have access to this model #
- Predicts six emotions (sad, neutral, happy, fear, angry, disgust) from speech without linguistic information.
- Method: Extracts audio features and passes them to an emotion recognition model (a pipeline sketch follows this list).
- Datasets: Uses four publicly available datasets (Ravdess, CREMA-D, TESS, SAVEE) with labeled emotional speech.
- Relevance to VoiceDiary:
- Can enhance emotion analysis in voice entries by identifying emotional tones.
- Provides a framework for integrating emotion recognition into VoiceDiary’s AI analysis.
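Since we do not yet have access to the model itself, here is a minimal sketch of the pipeline it describes (audio features fed into an emotion classifier), with librosa MFCCs and a generic scikit-learn classifier standing in for the actual components:

```python
# Sketch of the described pipeline: extract audio features, pass them to an
# emotion classifier. librosa MFCCs and an MLP stand in for the real model.
import librosa
import numpy as np
from sklearn.neural_network import MLPClassifier

EMOTIONS = ["sad", "neutral", "happy", "fear", "angry", "disgust"]

def extract_features(path: str) -> np.ndarray:
    """Mean MFCC vector over the recording (a common baseline feature)."""
    audio, sr = librosa.load(path, sr=16_000)
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=40)
    return mfcc.mean(axis=1)  # shape: (40,)

# Training would use the labeled datasets listed above (RAVDESS, CREMA-D,
# TESS, SAVEE): fit on extracted features, map predictions back to EMOTIONS.
clf = MLPClassifier(hidden_layer_sizes=(128,), max_iter=500)
# clf.fit(train_features, train_labels)
# emotion = EMOTIONS[clf.predict([extract_features("entry.wav")])[0]]
```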
2. Robust Speech Recognition via Large-Scale Weak Supervision - Whisper article #
- Improves speech recognition robustness using large-scale, weakly supervised pre-training.
- Method: Trains models on 680,000 hours of multilingual audio to generalize across domains without fine-tuning.
- Key Features:
- Zero-shot transfer learning for diverse speech tasks.
- Performance competitive with human accuracy in real-world conditions.
- Relevance to VoiceDiary:
- Enhances transcription accuracy for diverse user voices and emotional tones.
- Supports multilingual entries, broadening VoiceDiary’s usability.
- Transcripts can be used for more effective and accurate recommendations and can assist in emotion recognition (a transcription sketch follows this list).
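A minimal transcription sketch with the open-source `whisper` package; the model size and file name are placeholders:

```python
# Minimal Whisper transcription sketch; model size and file name are
# placeholders. Install with: pip install openai-whisper
import whisper

model = whisper.load_model("base")            # zero-shot, no fine-tuning
result = model.transcribe("diary_entry.wav")
print(result["text"])      # transcript, usable for themes/recommendations
print(result["language"])  # detected language (multilingual support)
```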
Datasets used in MVP #
3. Large Raw Emotional Dataset with Aggregation Mechanism - Dusha, a large dataset for emotion recognition in Russian #
- Introduces “Dusha,” a large Russian speech emotion dataset with acted and real-life recordings.
- Method: Combines crowd-sourced acted speech and podcast segments, annotated via crowd-sourcing.
- Key Features:
- 350 hours of audio with transcripts, labeled for anger, happiness, sadness, and neutral.
- Includes raw and aggregated labels for flexible use.
- Relevance to VoiceDiary:
- Will be used for additional training of speech emotion recognition models (a data-preparation sketch follows).
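A sketch of the data preparation we have in mind, assuming a local copy of Dusha with a tab-separated manifest; the manifest path and column names are assumptions, not the dataset’s official layout:

```python
# Sketch of preparing Dusha samples for additional training of the speech
# emotion model. Manifest path and column names are assumptions about a
# local copy, not the official dataset layout.
import csv

DUSHA_EMOTIONS = ["angry", "happy", "sad", "neutral"]  # labels from the paper

def load_manifest(path: str) -> list[tuple[str, int]]:
    """Return (wav_path, label_index) pairs for fine-tuning."""
    samples = []
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            if row["emotion"] in DUSHA_EMOTIONS:  # skip ambiguous raw labels
                samples.append((row["wav_path"],
                                DUSHA_EMOTIONS.index(row["emotion"])))
    return samples

# train_samples = load_manifest("dusha/train.tsv")
```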
Datasets that we plan to use for future training beyond MVP (Data collection plan) #
Individual commitments #
Team member | Telegram alias | Contribution | Link |
---|---|---|---|
Dziyana Melnikava (Lead) | @meldilen24 | Developed User Flow Diagrams and low-fidelity wireframes in Figma (Onboarding, User Profile, enhanced Main pages), updated GitHub with a detailed frontend project structure, set up the PM tool for tracking, wrote the report | Commits, User Flow Diagrams, Figma Design |
Anastasia Kuchumova | @n_rngk | Developed low-fidelity wireframes in Figma (Main pages, components, Results page), implemented Onboarding Page | Commits, Figma Design |
Dzhamilia Fatkullina | @jam11a | Tested the Whisper model and wrote the research summary | Python code |
Elina Kuzmichyova | @lin_anile | Tested the Wav2Vec model, prepared datasets and the data collection plan, started additional training | Python code |
Olesia Novoselova | @doiwannaknoww8 | Prepared the database schema; created a Golang HTTP service that connects to PostgreSQL, fetches user records as JSON via a GET endpoint, and offers a POST endpoint to forward a record’s data URL to the Python ML API | Pull Request |
Danil Davydyan | @chocop | Defined the initial Python API contract for the ML service, drew the project architecture schema | Pull Request |
Plan for Next Week #
ML #
- Finish training the Wav2Vec model
- Generate a summary from the voice-entry transcription
- Orchestrate the two models to predict a final emotion from voice and text (a late-fusion sketch follows this list)
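One way the orchestration could work is late fusion of the two models’ per-emotion probabilities; the fusion weight and label set here are assumptions to be tuned on validation data:

```python
# Possible orchestration of the two models: late fusion of per-emotion
# probabilities from the audio (Wav2Vec) and text (transcript) models.
# The weight and label set are assumptions to be tuned on validation data.
import numpy as np

LABELS = ["angry", "happy", "sad", "neutral"]

def fuse(audio_probs: np.ndarray, text_probs: np.ndarray,
         w_audio: float = 0.6) -> str:
    """Weighted average of the two probability vectors; argmax wins."""
    combined = w_audio * audio_probs + (1 - w_audio) * text_probs
    return LABELS[int(np.argmax(combined))]

# Example: audio model leans "sad", text model leans "neutral".
print(fuse(np.array([0.1, 0.1, 0.5, 0.3]), np.array([0.1, 0.1, 0.3, 0.5])))
```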
Frontend #
- Create components for displaying emotion prediction results
- Connect frontend components to backend APIs
Backend #
- Integrate the model’s predictions into the backend
- Implement transmission of voice recordings to the backend
- Research MinIO for object storage (see the sketch below)
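As a starting point for the MinIO research, here is what storing a voice record in a bucket looks like with the official `minio` Python client. The backend itself is in Go, so this only illustrates the flow; endpoint, credentials, and names are placeholders.

```python
# Starting point for the MinIO research: store an uploaded voice record in
# an object-store bucket. Endpoint, credentials, and names are placeholders;
# the real backend would use the equivalent Go client.
from minio import Minio

client = Minio(
    "localhost:9000",
    access_key="minioadmin",
    secret_key="minioadmin",
    secure=False,  # local development only
)

BUCKET = "voice-records"
if not client.bucket_exists(BUCKET):
    client.make_bucket(BUCKET)

# The object name could encode user id and entry date for later retrieval.
client.fput_object(BUCKET, "user42/2024-06-01.wav", "local_entry.wav")
```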
Confirmation of the code’s operability #
We confirm that the code in the main branch:
- [✓] is in working condition.