hgzero/design/backend/sequence/inner/stt-녹음시작및인식.puml

@startuml
!theme mono

title STT Service - Start Recording and Speaker Identification (Integrated)

participant "Frontend<<E>>" as Frontend
participant "API Gateway<<E>>" as Gateway
participant "RecordingController" as Controller
participant "RecordingService" as Service
participant "AudioStreamManager" as StreamManager
participant "SpeakerIdentifier" as Speaker
participant "RecordingRepository" as Repository
participant "AzureSpeechClient" as AzureClient
database "STT DB" as DB
database "Azure Blob Storage<<E>>" as BlobStorage
queue "Azure Event Hubs<<E>>" as EventHub

== Receive Meeting Start Event and Prepare Recording ==

EventHub -> Controller: MeetingStarted event received\n(meetingId, sessionId)
activate Controller

Controller -> Service: prepareRecording(meetingId, sessionId)
activate Service

Service -> Service: Validate recording session
note right
  - Check for duplicate recordings
  - Validate meetingId
end note

Service -> Repository: createRecording(meetingId, sessionId)
activate Repository

Repository -> DB: Create recording session\n(recordingId, meetingId, sessionId, status, createdAt)
activate DB
DB --> Repository: Return recordingId
deactivate DB

Repository --> Service: Return RecordingEntity
deactivate Repository

== Initialize Azure Speech Service ==

Service -> AzureClient: initializeRecognizer(recordingId, sessionId)
activate AzureClient

AzureClient -> AzureClient: Configure speech recognizer
note right
  Azure Speech settings:
  - Language: ko-KR
  - Audio format: PCM
  - Sample rate: 16 kHz
  - Speaker identification enabled
  - Real-time streaming mode
  - Continuous recognition
end note

AzureClient -> BlobStorage: Create recording file storage path\n(path: recordings/{meetingId}/{sessionId}.wav)
activate BlobStorage
BlobStorage --> AzureClient: Return storage path URL
deactivate BlobStorage

AzureClient --> Service: Return RecognizerConfig
deactivate AzureClient

== Update Recording Status ==

Service -> Repository: updateRecordingStatus(recordingId, "RECORDING")
activate Repository

Repository -> DB: Update recording status\n(status='RECORDING', startedAt, storagePath)
activate DB
DB --> Repository: Update complete
deactivate DB

Repository --> Service: Update complete
deactivate Repository

Service --> Controller: RecordingResponse(recordingId, status, storagePath)
deactivate Service

Controller --> EventHub: Publish RecordingStarted event\n(recordingId, meetingId, status)

Controller --> Gateway: 200 OK\n{sessionId, streamUrl}
deactivate Controller

== Audio Streaming and Speaker Identification ==

Frontend -> Gateway: WebSocket /ws/stt/{sessionId}\n[audio stream]
activate Gateway

Gateway -> Controller: Audio data received
activate Controller

Controller -> Service: processAudioStream(sessionId, audioData)
activate Service

Service -> StreamManager: streamAudio(audioData)
activate StreamManager

StreamManager -> AzureClient: recognizeAsync(audioData)
activate AzureClient

AzureClient --> StreamManager: partial result\n(text, timestamp)
deactivate AzureClient

StreamManager --> Service: recognized text
deactivate StreamManager

== Speaker Identification ==

Service -> Speaker: identifySpeaker(audioFrame)
activate Speaker

Speaker -> AzureClient: analyzeSpeakerProfile()\n(Speaker Recognition API)
activate AzureClient
note right
  Speaker identification:
  - Generate voice signature
  - Match against existing profiles
  - Auto-register new speakers
end note

AzureClient --> Speaker: speakerId
deactivate AzureClient

Speaker --> Service: speaker info
deactivate Speaker

== Save Segments by Speaker ==

Service -> Repository: saveSttSegment(segment)
activate Repository

Repository -> DB: Save STT segment\n(sessionId, text, speakerId, timestamp, confidence)
activate DB
DB --> Repository: segment saved
deactivate DB

Repository --> Service: saved
deactivate Repository

Service -> Repository: updateSpeakerInfo(recordingId, speakerId)
activate Repository

Repository -> DB: Save/update speaker info\n(recordingId, speakerId, segmentCount)
activate DB
DB --> Repository: Update complete
deactivate DB

Repository --> Service: Done
deactivate Repository

Service --> Controller: streaming response\n{text, speaker, timestamp, confidence}
deactivate Service

Controller --> Gateway: WebSocket message
deactivate Controller

Gateway --> Frontend: Send real-time captions\n{text, speaker, timestamp}
deactivate Gateway

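note over Frontend, EventHub
Processing time:
- DB recording creation: ~100 ms
- Azure recognizer initialization: ~500 ms
- Blob path creation: ~200 ms
- Speaker identification: ~300 ms
- Real-time recognition latency: < 1 s
- Total initialization time: ~1.1 s (recording creation + recognizer initialization + Blob path creation + speaker identification)

Accuracy:
- Speaker identification accuracy: > 90%
- Speech recognition accuracy: 60-95%
end note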

@enduml