mirror of https://github.com/hwanny1128/HGZero.git synced 2025-12-06 07:56:24 +00:00

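Minseo-Jo afbfc7f947 Write STT implementation plan document

- Summarize speech recognition (STT) technology and the characteristics of Korean language processing
- Compare the OpenAI Whisper API and AWS Transcribe
- Design architectures for real-time and batch processing
- Define the WebSocket-based real-time STT processing flow
- Propose performance optimization and accuracy improvement measures
- Establish the cost analysis and monitoring strategy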

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

2025-10-21 13:52:30 +09:00


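STT (Speech-to-Text) Implementation Plan

📋 Document information

Created: 2025-10-21
Last modified: 2025-10-21
Author: Meeting Minutes Service development team
Version: 2.0
Reviewers: 박서연 (AI), 이준호 (Backend), 이동욱 (Backend), 최유진 (Frontend), 홍길동 (Architect), 정도현 (QA)
STT engine: Azure Speech Services (real-time streaming + speaker identification)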

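1. Overview

1.1 Purpose

Recognize meeting participants' speech in real time, convert it to text, and provide the base data for AI-driven automatic meeting minutes.

1.2 Core requirements

Real-time: text appears on screen within 1 second of an utterance (Azure real-time streaming)
Accuracy: STT confidence score of 90% or higher
Speaker identification: utterances are attributed to participants automatically (Azure Speaker Diarization)
Reliability: recorded audio is preserved even during a network failure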

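1.3 Why Azure Speech Services

✅ Real-time streaming: latency under 1 second meets the requirement
✅ Built-in speaker identification: Speaker Diarization is included (no separate implementation needed)
✅ Korean optimization: high accuracy from Microsoft's Korean-specific model
✅ Enterprise reliability: 99.9% SLA
✅ Azure ecosystem integration: easier expansion onto Azure-based infrastructure later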

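1.4 Differentiation strategy

STT itself is a baseline feature (hygiene factor), but it feeds the following differentiators:

Context-aware term explanations (RAG)
Automatic AI meeting minutes
Automatic Todo extraction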

2. Architecture design

2.1 Overall structure

┌─────────────┐      ┌──────────────┐      ┌─────────────────┐
│   Client    │─────▶│ STT Gateway  │─────▶│ Azure Speech    │
│ (Browser)   │      │   Service    │      │   Services      │
│             │      │              │      │ (streaming STT) │
└─────────────┘      └──────────────┘      └─────────────────┘
      │                     │                      │
      │                     │                      │
      │                     │              ┌───────▼───────┐
      │                     │              │ Speaker       │
      │                     │              │ Diarization   │
      │                     │              │ (speaker ID)  │
      │                     │              └───────────────┘
      │                     │                      │
      ▼                     │                      │
┌─────────────┐      ┌──────▼──────┐      ┌───────▼─────┐
│  WebSocket  │◀─────│  RabbitMQ   │◀─────│ Claude API  │
│   Server    │      │    Queue    │      │ (post-proc) │
└─────────────┘      └─────────────┘      └─────────────┘
      │                     │
      │                     ▼
      │              ┌─────────────┐
      └─────────────▶│    Redis    │
                     │   Cache     │
                     └─────────────┘

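2.2 Role of each layer

Client Layer (Frontend)

MediaRecorder API: captures audio in the browser in real time
WebSocket Client: receives text in real time and keeps the screen in sync
Local storage: temporarily stores audio data during a network failure

STT Gateway Service

Audio stream intake: receives the real-time audio stream from the client
Azure Speech integration: calls the Azure Speech Services real-time streaming API
Speaker identification: receives Azure Speaker Diarization results and matches them to participants
Event publishing: publishes TextTranscribed events to RabbitMQ

Azure Speech Services

Real-time streaming STT: converts speech to text in real time (< 1 second latency)
Speaker Diarization: separates utterances by speaker automatically
Language model: model optimized for Korean
Confidence score: provides a confidence score for each utterance

Message Queue (RabbitMQ)

Asynchronous processing: delivers STT results to downstream services asynchronously
Event routing: TextTranscribed → AI Service, Meeting Service
Retry logic: automatic reprocessing on failure (up to 3 times)

AI Service (Claude API)

Text post-processing: converts spoken style to written style and corrects grammar
Minutes structuring: arranges content according to the template
Todo extraction: identifies action items automatically

Cache Layer (Redis)

Live utterance cache: meeting:{meeting_id}:live_text
Per-section content cache: meeting:{meeting_id}:sections:{section_id}
Speaker information cache: meeting:{meeting_id}:speakers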
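
The key patterns above are easy to mistype when several services build them independently. The following is a minimal sketch of a shared helper; `meetingKeys` and its id validation rule are illustrative assumptions, not part of the original design.

```javascript
// Hypothetical helper: builds the Redis keys listed above for one meeting.
function meetingKeys(meetingId) {
  // Assumption: meeting ids contain only letters, digits, "_" and "-",
  // so they can never collide with the ":" separator used in keys.
  if (!/^[A-Za-z0-9_-]+$/.test(meetingId)) {
    throw new Error(`Invalid meeting id: ${meetingId}`);
  }
  const base = `meeting:${meetingId}`;
  return {
    liveText: `${base}:live_text`,
    section: (sectionId) => `${base}:sections:${sectionId}`,
    speakers: `${base}:speakers`,
  };
}

const keys = meetingKeys('MTG_001');
console.log(keys.liveText);          // meeting:MTG_001:live_text
console.log(keys.section('agenda')); // meeting:MTG_001:sections:agenda
```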

WebSocket Server

Real-time sync: sends transcription results to all participants immediately
Delta transmission: sends only the changed part to save bandwidth
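
The document does not specify a delta format. One possible shape, sketched here under the assumption that updates mostly append to or revise the tail of a text, is a "replace from index" delta. `computeDelta` and `applyDelta` are hypothetical names.

```javascript
// Computes the smallest tail replacement that turns `previous` into `next`.
// Indices count UTF-16 code units, which is safe for Hangul (all in the BMP).
function computeDelta(previous, next) {
  let start = 0;
  const max = Math.min(previous.length, next.length);
  while (start < max && previous[start] === next[start]) start++;
  return {
    start,                                  // first position that differs
    deleteCount: previous.length - start,   // characters to drop from the old text
    insert: next.slice(start),              // characters to add in their place
  };
}

// Rebuilds the new text on the receiving side.
function applyDelta(text, delta) {
  return text.slice(0, delta.start) + delta.insert + text.slice(delta.start + delta.deleteCount);
}

const before = '회의를 시작';
const after = '회의를 시작하겠습니다';
const delta = computeDelta(before, after);
console.log(delta); // { start: 6, deleteCount: 0, insert: '하겠습니다' }
console.log(applyDelta(before, delta) === after); // true
```

Only `delta` needs to travel over the WebSocket; the client keeps the previous text and applies it locally.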

3. Data structure design

3.1 Azure Speech streaming connection settings

{
  "session_id": "SESSION_001",
  "meeting_id": "MTG_001",
  "config": {
    "language": "ko-KR",
    "sample_rate": 16000,
    "format": "audio/wav",
    "enable_diarization": true,
    "max_speakers": 10,
    "profanity_filter": "masked",
    "enable_dictation": true
  },
  "participants": [
    {
      "user_id": "USR_001",
      "name": "김철수",
      "voice_signature": null
    },
    {
      "user_id": "USR_002",
      "name": "이영희",
      "voice_signature": null
    }
  ]
}

3.2 Real-time audio stream transmission (WebSocket)

{
  "type": "audio_chunk",
  "session_id": "SESSION_001",
  "audio_data": "base64_encoded_audio",
  "timestamp": "2025-10-21T14:30:15.000Z",
  "sequence": 42
}
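
To get a feel for what one `audio_chunk` costs, the arithmetic can be sketched as below. It assumes 16-bit mono PCM at the configured 16 kHz sample rate and the 4096-sample buffer used by the frontend code in section 5.1; `chunkStats` is a hypothetical helper.

```javascript
// Size and duration of one audio_chunk, assuming 16-bit mono PCM.
function chunkStats(samplesPerChunk, sampleRate) {
  const bytes = samplesPerChunk * 2;             // 2 bytes per Int16 sample
  const base64Chars = 4 * Math.ceil(bytes / 3);  // Base64 maps every 3 bytes to 4 characters
  const durationMs = (samplesPerChunk * 1000) / sampleRate;
  return { bytes, base64Chars, durationMs, chunksPerSecond: sampleRate / samplesPerChunk };
}

console.log(chunkStats(4096, 16000));
// { bytes: 8192, base64Chars: 10924, durationMs: 256, chunksPerSecond: 3.90625 }
```

Each chunk therefore holds 256 ms of audio, which already consumes about a quarter of the 1-second latency budget before any network or recognition time, and Base64 adds roughly a third to the payload size.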

3.3 Azure Speech real-time response (WebSocket)

The offset and duration fields are expressed in 100-nanosecond ticks, as in the Azure Speech SDK (10,000,000 ticks = 1 second).

{
  "type": "recognition_result",
  "session_id": "SESSION_001",
  "result_id": "RESULT_001",
  "recognition_status": "Success",
  "duration": 45000000,
  "offset": 0,
  "text": "회의를 시작하겠습니다. 오늘은 프로젝트 킥오프 회의입니다.",
  "confidence": 0.95,
  "speaker_id": "Speaker_1",
  "lexical": "회의를 시작하겠습니다 오늘은 프로젝트 킥오프 회의입니다",
  "itn": "회의를 시작하겠습니다. 오늘은 프로젝트 킥오프 회의입니다.",
  "display": "회의를 시작하겠습니다. 오늘은 프로젝트 킥오프 회의입니다.",
  "words": [
    {
      "word": "회의를",
      "offset": 0,
      "duration": 4000000,
      "confidence": 0.96
    },
    {
      "word": "시작하겠습니다",
      "offset": 4000000,
      "duration": 11000000,
      "confidence": 0.94
    }
  ],
  "is_final": true,
  "timestamp": "2025-10-21T14:30:16.000Z"
}
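
A consumer of this message typically needs timings in milliseconds and a list of words that fall below the accuracy target. The sketch below assumes `offset` and `duration` are 100-nanosecond ticks (the Azure Speech SDK convention) and uses the 90% confidence requirement from section 1.2 as the default threshold. `summarizeResult` and the sample values are illustrative.

```javascript
const TICKS_PER_MS = 10000; // 1 tick = 100 ns, so 10,000 ticks = 1 ms

// Converts tick-based timings to milliseconds and lists low-confidence words.
function summarizeResult(result, minConfidence = 0.9) {
  return {
    startMs: result.offset / TICKS_PER_MS,
    durationMs: result.duration / TICKS_PER_MS,
    lowConfidenceWords: (result.words || [])
      .filter((w) => w.confidence < minConfidence)
      .map((w) => w.word),
  };
}

// Illustrative sample (not real service output).
const sample = {
  offset: 0,
  duration: 45000000,
  words: [
    { word: '회의를', offset: 0, duration: 4000000, confidence: 0.96 },
    { word: '시작하겠습니다', offset: 4000000, duration: 11000000, confidence: 0.88 },
  ],
};
console.log(summarizeResult(sample));
// { startMs: 0, durationMs: 4500, lowConfidenceWords: [ '시작하겠습니다' ] }
```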

3.4 Speaker matching result (internal to the STT Gateway)

{
  "result_id": "RESULT_001",
  "azure_speaker_id": "Speaker_1",
  "matched_user": {
    "user_id": "USR_001",
    "name": "김철수",
    "confidence": 0.88
  },
  "matching_method": "voice_pattern",
  "timestamp": "2025-10-21T14:30:16.000Z"
}

3.5 Claude API call structure

Request (AI Service → Claude API)

{
  "model": "claude-3-5-sonnet-20241022",
  "max_tokens": 2048,
  "messages": [
    {
      "role": "user",
      "content": "다음은 회의 발언 내용입니다. 회의록 형식에 맞춰 정리해주세요.\n\n발언: \"회의를 시작하겠습니다. 오늘은 프로젝트 킥오프 회의입니다.\"\n화자: 김철수\n시간: 2025-10-21 14:30:15\n\n템플릿 섹션: 안건, 논의 내용, 결정 사항, Todo"
    }
  ],
  "temperature": 0.3,
  "system": "당신은 회의록 작성 전문가입니다. 발언 내용을 구조화하여 명확하고 간결하게 정리합니다."
}

Response (Claude API → AI Service)

{
  "id": "msg_01XYZ...",
  "type": "message",
  "role": "assistant",
  "content": [
    {
      "type": "text",
      "text": "## 안건\n- 프로젝트 킥오프 회의 진행\n\n## 논의 내용\n- (발언 내용을 기반으로 자동 작성됩니다)\n\n## 결정 사항\n- (아직 결정된 사항 없음)\n\n## Todo\n- (아직 할당된 작업 없음)"
    }
  ],
  "model": "claude-3-5-sonnet-20241022",
  "stop_reason": "end_turn",
  "usage": {
    "input_tokens": 245,
    "output_tokens": 128
  }
}
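
The response carries the minutes as Markdown inside `content[].text`. To store each section separately, the AI Service could split that text by heading. This is a minimal sketch that assumes the model follows the layout of the sample above ("## " headings, "- " items); `parseSections` is a hypothetical helper, and real responses may need more tolerant parsing.

```javascript
// Splits the Markdown in a Messages API response into { heading: [item, ...] }.
function parseSections(apiResponse) {
  const text = apiResponse.content
    .filter((block) => block.type === 'text')
    .map((block) => block.text)
    .join('\n');

  const sections = {};
  let current = null;
  for (const line of text.split('\n')) {
    if (line.startsWith('## ')) {
      current = line.slice(3).trim();
      sections[current] = [];
    } else if (current && line.startsWith('- ')) {
      sections[current].push(line.slice(2).trim());
    }
  }
  return sections;
}

const response = {
  content: [{ type: 'text', text: '## 안건\n- 프로젝트 킥오프 회의 진행\n\n## Todo\n- (아직 할당된 작업 없음)' }],
};
console.log(parseSections(response));
// { '안건': [ '프로젝트 킥오프 회의 진행' ], Todo: [ '(아직 할당된 작업 없음)' ] }
```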

3.6 RabbitMQ event structure

{
  "event_type": "TextTranscribed",
  "event_id": "EVT_001",
  "timestamp": "2025-10-21T14:30:18.000Z",
  "correlation_id": "CORR_001",
  "payload": {
    "meeting_id": "MTG_001",
    "speaker": {
      "id": "USR_001",
      "name": "김철수"
    },
    "transcription": {
      "text": "회의를 시작하겠습니다. 오늘은 프로젝트 킥오프 회의입니다.",
      "confidence": 0.95,
      "segments": [...]
    },
    "timestamp": "2025-10-21T14:30:15.000Z"
  },
  "metadata": {
    "source": "stt-gateway-service",
    "version": "1.0"
  }
}

3.7 Redis cache structure

// 1. Live utterance (TTL: 10 minutes)
Key: "meeting:MTG_001:live_text"
Value: {
  "speaker": "김철수",
  "text": "회의를 시작하겠습니다...",
  "timestamp": "2025-10-21T14:30:15.000Z",
  "is_final": true
}

// 2. Per-section content (TTL: 1 hour after the meeting ends)
Key: "meeting:MTG_001:sections:agenda"
Value: {
  "section_id": "agenda",
  "section_name": "안건",
  "content": "프로젝트 킥오프 회의 진행\n- 프로젝트 목표 및 범위 확정\n- 역할 분담 및 일정 계획",
  "verified": false,
  "last_updated": "2025-10-21T14:32:00.000Z"
}

// 3. Speaker information (TTL: 1 hour after the meeting ends)
Key: "meeting:MTG_001:speakers"
Value: [
  {
    "id": "USR_001",
    "name": "김철수",
    "role": "주관자",
    "speech_count": 15,
    "speech_duration_ms": 180000
  },
  {
    "id": "USR_002",
    "name": "이영희",
    "role": "참석자",
    "speech_count": 12,
    "speech_duration_ms": 150000
  }
]

// 4. Meeting metadata (TTL: 24 hours after the meeting ends)
Key: "meeting:MTG_001:metadata"
Value: {
  "meeting_id": "MTG_001",
  "title": "프로젝트 킥오프 회의",
  "status": "in_progress",
  "start_time": "2025-10-21T14:00:00.000Z",
  "participants": ["USR_001", "USR_002", "USR_003"],
  "total_speech_count": 42,
  "last_activity": "2025-10-21T14:32:00.000Z"
}

3.8 WebSocket real-time sync message

{
  "type": "transcription_update",
  "message_id": "WS_MSG_001",
  "timestamp": "2025-10-21T14:30:18.000Z",
  "data": {
    "meeting_id": "MTG_001",
    "speaker": {
      "id": "USR_001",
      "name": "김철수"
    },
    "transcription": {
      "text": "회의를 시작하겠습니다.",
      "is_final": true,
      "confidence": 0.95
    },
    "target_section": "agenda",
    "action": "append"
  }
}
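
On the client, each sync message has to be folded into the minutes currently on screen. The sketch below shows one way to do that with an immutable section map. It assumes `append` adds a line to the target section; the `replace` action is a hypothetical extension not defined in this document, and `applyUpdate` is an illustrative name.

```javascript
// Applies one transcription_update message to a { section: [lines] } map
// and returns a new map (the input is not mutated).
function applyUpdate(sections, message) {
  if (message.type !== 'transcription_update') return sections;
  const { target_section: section, action, transcription, speaker } = message.data;
  if (!transcription.is_final) return sections; // interim results are not stored

  const line = `${speaker.name}: ${transcription.text}`;
  const previous = sections[section] || [];
  const next = action === 'replace' ? [line] : [...previous, line];
  return { ...sections, [section]: next };
}

const update = {
  type: 'transcription_update',
  data: {
    meeting_id: 'MTG_001',
    speaker: { id: 'USR_001', name: '김철수' },
    transcription: { text: '회의를 시작하겠습니다.', is_final: true, confidence: 0.95 },
    target_section: 'agenda',
    action: 'append',
  },
};
console.log(applyUpdate({}, update));
// { agenda: [ '김철수: 회의를 시작하겠습니다.' ] }
```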

4. Processing flow (Sequence)

4.1 Real-time streaming flow

Client          STT Gateway      Azure Speech     RabbitMQ      AI Service    WebSocket Server
  │                   │                │              │             │                │
  │─1.WS connect─────▶│                │              │             │                │
  │                   │─2.Start───────▶│              │             │                │
  │                   │   session      │              │             │                │
  │                   │◀─3.Ready───────│              │             │                │
  │                   │                │              │             │                │
  │─4.Audio stream───▶│                │              │             │                │
  │                   │─5.Audio───────▶│              │             │                │
  │                   │                │              │             │                │
  │                   │◀─6.Text+speaker│              │             │                │
  │                   │                │              │             │                │
  │                   │──────7.Publish event─────────▶│             │                │
  │                   │                │              │──8.Consume─▶│                │
  │                   │                │              │             │──9.Claude─────▶│
  │                   │                │              │             │   post-process │
  │◀────────────────────────────────────────────────────────────10.Real-time sync────│

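Step-by-step description:

1. Client: connects to the STT Gateway over WebSocket
2. STT Gateway: starts an Azure Speech Services streaming session
3. Azure Speech: responds that the session is ready
4. Client: sends the real-time audio stream captured with MediaRecorder
5. STT Gateway: forwards the audio stream to Azure Speech
6. Azure Speech: converts speech to text in real time and identifies the speaker (< 1 second latency)
7. STT Gateway: publishes a TextTranscribed event to RabbitMQ
8. AI Service: receives the event through its RabbitMQ subscription
9. AI Service: post-processes the text with the Claude API (structuring, summarization)
10. WebSocket Server: synchronizes the result to all participants in real time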

4.2 Speaker identification flow

Azure Speech      STT Gateway      Redis Cache      Participants DB
  │                   │                │                   │
  │─1.Speaker_1──────▶│                │                   │
  │   result          │                │                   │
  │                   │─2.Speaker_1───▶│                   │
  │                   │   lookup       │                   │
  │                   │◀─3.No mapping──│                   │
  │                   │                │                   │
  │                   │────────4.Get participants─────────▶│
  │                   │◀───────5.Participant list──────────│
  │                   │                │                   │
  │                   │─6.Pattern-based│                   │
  │                   │   matching     │                   │
  │                   │                │                   │
  │                   │─7.Speaker_1 =─▶│                   │
  │                   │   USR_001 saved│                   │

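Speaker matching strategy:

First utterance: match the labels provided by Azure (Speaker_1, Speaker_2, ...) against the participant list
Voice pattern analysis: estimate from speaking order, speaking frequency, and voice characteristics
Redis caching: cache the matching result and reuse it for later utterances
Manual correction: users can assign the speaker manually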
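
The four strategies imply a precedence: a manual correction should win over a cached match, which should win over a fresh estimate. The sketch below illustrates only that ordering. It uses in-memory Maps in place of Redis and substitutes "next unassigned participant" for the voice pattern analysis, which this document leaves open; `resolveSpeaker` and the `state` shape are assumptions.

```javascript
// Resolves a diarization label to a participant.
// Priority: manual override > cached mapping > next unassigned participant.
function resolveSpeaker(label, state, participants) {
  if (state.manual.has(label)) return state.manual.get(label);
  if (state.cache.has(label)) return state.cache.get(label);

  // Placeholder heuristic standing in for voice pattern analysis.
  const taken = new Set(
    [...state.manual.values(), ...state.cache.values()].map((p) => p.user_id)
  );
  const candidate = participants.find((p) => !taken.has(p.user_id)) || null;
  if (candidate) state.cache.set(label, candidate); // reuse for later utterances
  return candidate;
}

const state = { manual: new Map(), cache: new Map() };
const participants = [
  { user_id: 'USR_001', name: '김철수' },
  { user_id: 'USR_002', name: '이영희' },
];
console.log(resolveSpeaker('Speaker_1', state, participants).user_id); // USR_001
state.manual.set('Speaker_1', participants[1]); // a user corrects the match
console.log(resolveSpeaker('Speaker_1', state, participants).user_id); // USR_002
```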

5. Implementation details

5.1 Frontend (React)

Audio capture and WebSocket streaming

// Azure Speech real-time streaming
class AzureSpeechRecorder {
  constructor(meetingId, speakerId) {
    this.meetingId = meetingId;
    this.speakerId = speakerId;
    // Generated once so that session_start, audio_chunk, and session_end share the same id
    this.sessionId = `SESSION_${Date.now()}`;
    this.ws = null;
    this.mediaRecorder = null;
    this.audioContext = null;
  }

  async start() {
    // Open the WebSocket connection (use wss:// outside local development)
    this.ws = new WebSocket(`ws://localhost:3001/api/stt/stream`);

    this.ws.onopen = () => {
      // Request a new session
      this.ws.send(JSON.stringify({
        type: 'session_start',
        session_id: this.sessionId,
        meeting_id: this.meetingId,
        config: {
          language: 'ko-KR',
          sample_rate: 16000,
          format: 'audio/wav',
          enable_diarization: true,
          max_speakers: 10
        }
      }));
    };

    this.ws.onmessage = (event) => {
      const message = JSON.parse(event.data);
      if (message.type === 'session_ready') {
        this.startRecording();
      }
    };

    this.ws.onerror = (error) => {
      console.error('WebSocket error:', error);
    };
  }

  async startRecording() {
    // Keep a reference to the stream so that stop() can release the microphone
    this.stream = await navigator.mediaDevices.getUserMedia({
      audio: {
        sampleRate: 16000, // recommended for Azure
        channelCount: 1,
        echoCancellation: true,
        noiseSuppression: true,
        autoGainControl: true
      }
    });

    // Convert to PCM through an AudioContext
    this.audioContext = new AudioContext({ sampleRate: 16000 });
    const source = this.audioContext.createMediaStreamSource(this.stream);
    // Note: ScriptProcessorNode is deprecated; AudioWorklet is the modern replacement
    this.processor = this.audioContext.createScriptProcessor(4096, 1, 1);
    this.sequence = 0;

    this.processor.onaudioprocess = (e) => {
      if (this.ws && this.ws.readyState === WebSocket.OPEN) {
        const audioData = e.inputBuffer.getChannelData(0);

        // Convert Float32 PCM to Int16 PCM, clamping to the Int16 range
        const int16Array = new Int16Array(audioData.length);
        for (let i = 0; i < audioData.length; i++) {
          int16Array[i] = Math.max(-32768, Math.min(32767, audioData[i] * 32768));
        }

        // Base64-encode and send
        const base64Audio = this.arrayBufferToBase64(int16Array.buffer);

        this.ws.send(JSON.stringify({
          type: 'audio_chunk',
          session_id: this.sessionId,
          audio_data: base64Audio,
          timestamp: new Date().toISOString(),
          sequence: this.sequence++ // matches the sequence field in section 3.2
        }));
      }
    };

    source.connect(this.processor);
    this.processor.connect(this.audioContext.destination);
  }

  arrayBufferToBase64(buffer) {
    let binary = '';
    const bytes = new Uint8Array(buffer);
    for (let i = 0; i < bytes.byteLength; i++) {
      binary += String.fromCharCode(bytes[i]);
    }
    return btoa(binary);
  }

  stop() {
    // send() throws unless the socket is open, so check the state first
    if (this.ws && this.ws.readyState === WebSocket.OPEN) {
      this.ws.send(JSON.stringify({
        type: 'session_end',
        session_id: this.sessionId
      }));
    }
    if (this.ws) {
      this.ws.close();
    }

    // Stop audio processing and release the microphone
    if (this.processor) {
      this.processor.onaudioprocess = null;
      this.processor.disconnect();
    }
    if (this.stream) {
      this.stream.getTracks().forEach((track) => track.stop());
    }
    if (this.audioContext) {
      this.audioContext.close();
    }
  }
}

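Real-time WebSocket reception

class TranscriptionWebSocket {
  constructor(meetingId, onTranscription) {
    this.meetingId = meetingId;
    this.onTranscription = onTranscription;
    this.ws = null;
    this.shouldReconnect = true;
  }

  connect() {
    this.shouldReconnect = true;
    this.ws = new WebSocket(`ws://localhost:8080/ws/meetings/${this.meetingId}`);

    this.ws.onmessage = (event) => {
      const message = JSON.parse(event.data);

      if (message.type === 'transcription_update') {
        this.onTranscription(message.data);
      }
    };

    this.ws.onerror = (error) => {
      console.error('WebSocket error:', error);
    };

    // Reconnect when the connection closes unexpectedly.
    // A close event also follows an error, so reconnecting here avoids duplicate sockets.
    this.ws.onclose = () => {
      if (this.shouldReconnect) {
        setTimeout(() => this.connect(), 3000);
      }
    };
  }

  disconnect() {
    this.shouldReconnect = false; // intentional close: do not reconnect
    if (this.ws) {
      this.ws.close();
    }
  }
}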

5.2 Backend (Node.js + Azure Speech SDK)

STT Gateway Service (WebSocket Server)

const WebSocket = require('ws');
const sdk = require('microsoft-cognitiveservices-speech-sdk');
const amqp = require('amqplib');
const redis = require('redis');

const wss = new WebSocket.Server({ port: 3001, path: '/api/stt/stream' });
const redisClient = redis.createClient({ url: process.env.REDIS_URL });
// node-redis v4 clients must be connected before use
redisClient.connect().catch((err) => console.error('Redis connection error:', err));

// Azure Speech settings
const AZURE_SPEECH_KEY = process.env.AZURE_SPEECH_KEY;
const AZURE_SPEECH_REGION = process.env.AZURE_SPEECH_REGION; // e.g., 'koreacentral'

// Session store
const sessions = new Map();

// Stops transcription and releases the resources of one session
function closeSession(sessionId) {
  const session = sessions.get(sessionId);
  if (!session) return;
  session.transcriber.stopTranscribingAsync();
  session.pushStream.close();
  sessions.delete(sessionId);
}

wss.on('connection', (ws) => {
  console.log('Client connected');
  let sessionId = null;

  ws.on('message', async (data) => {
    try {
      const message = JSON.parse(data);

      switch (message.type) {
        case 'session_start':
          sessionId = message.session_id;
          await startAzureSpeechSession(ws, sessionId, message.meeting_id, message.config);
          break;

        case 'audio_chunk':
          // Audio chunks are written to the push stream by the listener
          // registered in startAzureSpeechSession
          break;

        case 'session_end':
          closeSession(sessionId);
          break;
      }
    } catch (error) {
      console.error('WebSocket message error:', error);
      ws.send(JSON.stringify({
        type: 'error',
        error: error.message
      }));
    }
  });

  ws.on('close', () => {
    closeSession(sessionId);
    console.log('Client disconnected');
  });
});

// Reads the confidence of the top hypothesis from the detailed JSON result.
// Returns null when the service does not report one.
function extractConfidence(result) {
  try {
    const json = result.properties.getProperty(sdk.PropertyId.SpeechServiceResponse_JsonResult);
    const confidence = JSON.parse(json)?.NBest?.[0]?.Confidence;
    return typeof confidence === 'number' ? confidence : null;
  } catch {
    return null;
  }
}

// Start an Azure Speech session
async function startAzureSpeechSession(ws, sessionId, meetingId, config) {
  // Configure the Azure Speech SDK
  const speechConfig = sdk.SpeechConfig.fromSubscription(
    AZURE_SPEECH_KEY,
    AZURE_SPEECH_REGION
  );

  speechConfig.speechRecognitionLanguage = config.language || 'ko-KR';
  speechConfig.enableDictation();
  speechConfig.setProfanity(sdk.ProfanityOption.Masked);
  // Request the detailed output format so that confidence scores are part of the JSON result
  speechConfig.outputFormat = sdk.OutputFormat.Detailed;

  // Audio input (push stream); the default format is 16 kHz, 16-bit, mono PCM
  const pushStream = sdk.AudioInputStream.createPushStream();
  const audioConfig = sdk.AudioConfig.fromStreamInput(pushStream);

  // Conversation Transcriber (includes speaker diarization)
  const transcriber = new sdk.ConversationTranscriber(speechConfig, audioConfig);

  // Handler for final recognition results
  transcriber.transcribed = async (s, e) => {
    if (e.result.reason === sdk.ResultReason.RecognizedSpeech) {
      const result = {
        text: e.result.text,
        speaker_id: e.result.speakerId,
        confidence: extractConfidence(e.result),
        offset: e.result.offset,
        duration: e.result.duration
      };

      console.log(`[${result.speaker_id}]: ${result.text}`);

      // Match the diarization label to a participant
      const matchedUser = await matchSpeaker(meetingId, result.speaker_id);

      // Publish the RabbitMQ event
      const event = {
        event_type: 'TextTranscribed',
        event_id: `EVT_${Date.now()}`,
        timestamp: new Date().toISOString(),
        payload: {
          meeting_id: meetingId,
          speaker: {
            id: matchedUser?.user_id || 'Unknown',
            name: matchedUser?.name || result.speaker_id,
            azure_speaker_id: result.speaker_id
          },
          transcription: {
            text: result.text,
            // null means the service reported no score; no default value is invented
            confidence: result.confidence
          },
          timestamp: new Date().toISOString()
        },
        metadata: {
          source: 'azure-speech-service',
          version: '2.0'
        }
      };

      await publishToQueue('text-transcribed', event);

      // Send the result to the client in real time over the WebSocket
      ws.send(JSON.stringify({
        type: 'recognition_result',
        session_id: sessionId,
        result_id: `RESULT_${Date.now()}`,
        recognition_status: 'Success',
        text: result.text,
        confidence: result.confidence,
        speaker_id: result.speaker_id,
        matched_user: matchedUser,
        is_final: true,
        timestamp: new Date().toISOString()
      }));
    }
  };

  // Error handler
  transcriber.canceled = (s, e) => {
    console.error(`Recognition canceled: ${e.errorDetails}`);
    ws.send(JSON.stringify({
      type: 'error',
      error: e.errorDetails
    }));
  };

  // Start recognition (success callback, then error callback)
  transcriber.startTranscribingAsync(
    () => {
      console.log('Azure Speech recognition started');
      ws.send(JSON.stringify({
        type: 'session_ready',
        session_id: sessionId
      }));

      // Store the session
      sessions.set(sessionId, {
        transcriber,
        pushStream,
        meetingId
      });
    },
    (err) => {
      console.error(`Failed to start transcription: ${err}`);
      ws.send(JSON.stringify({ type: 'error', error: String(err) }));
    }
  );

  // Forward audio data received over the WebSocket to the push stream
  ws.on('message', (data) => {
    let message;
    try {
      message = JSON.parse(data);
    } catch {
      return; // malformed input is reported by the main message handler
    }
    if (message.type === 'audio_chunk' && message.session_id === sessionId) {
      const audioBuffer = Buffer.from(message.audio_data, 'base64');
      pushStream.write(audioBuffer);
    }
  });
}

// Speaker matching logic
async function matchSpeaker(meetingId, azureSpeakerId) {
  // Look up an existing mapping in Redis
  const cacheKey = `meeting:${meetingId}:speaker_mapping:${azureSpeakerId}`;
  const cached = await redisClient.get(cacheKey);

  if (cached) {
    return JSON.parse(cached);
  }

  // New speaker: estimate from the participant list
  // TODO: in practice, analyze speech patterns, speaking order, etc. to match
  const participants = await getParticipants(meetingId);

  // Use the trailing 1-based number of the label (e.g. Speaker_1);
  // labels without a number, such as an unknown speaker, stay unmatched
  const labelNumber = /(\d+)$/.exec(azureSpeakerId || '');
  const speakerIndex = labelNumber ? parseInt(labelNumber[1], 10) - 1 : -1;

  if (speakerIndex >= 0 && participants && participants.length > 0) {
    // Simple matching strategy: assign participants in order
    const matchedUser = participants[speakerIndex % participants.length];

    // Cache the result in Redis
    await redisClient.setEx(cacheKey, 3600, JSON.stringify(matchedUser));

    return matchedUser;
  }

  return null;
}

// Fetch the participant list
async function getParticipants(meetingId) {
  // TODO: query the actual database
  // Temporarily read from Redis
  const key = `meeting:${meetingId}:participants`;
  const data = await redisClient.get(key);
  return data ? JSON.parse(data) : [];
}

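// Publish to RabbitMQ.
// The connection and channel are created once and reused, because opening
// a new connection for every utterance is slow and wasteful.
let channelPromise = null;

function getChannel() {
  if (!channelPromise) {
    channelPromise = amqp.connect(process.env.RABBITMQ_URL)
      .then((connection) => connection.createChannel())
      .catch((err) => {
        channelPromise = null; // allow a later call to retry
        throw err;
      });
  }
  return channelPromise;
}

async function publishToQueue(queueName, message) {
  const channel = await getChannel();
  await channel.assertQueue(queueName, { durable: true });
  channel.sendToQueue(queueName, Buffer.from(JSON.stringify(message)), {
    persistent: true
  });
}

console.log('Azure Speech STT Gateway running on port 3001');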

AI Service (Claude post-processing)

const Anthropic = require('@anthropic-ai/sdk');
const amqp = require('amqplib');
const redis = require('redis');

const anthropic = new Anthropic({
  apiKey: process.env.ANTHROPIC_API_KEY
});

const redisClient = redis.createClient({
  url: process.env.REDIS_URL
});

// Subscribe to RabbitMQ
async function consumeQueue() {
  const connection = await amqp.connect(process.env.RABBITMQ_URL);
  const channel = await connection.createChannel();
  await channel.assertQueue('text-transcribed', { durable: true });

  channel.consume('text-transcribed', async (msg) => {
    if (!msg) return; // the consumer was cancelled by the broker
    try {
      const event = JSON.parse(msg.content.toString());
      await processTranscription(event);
      channel.ack(msg);
    } catch (error) {
      console.error('Failed to process TextTranscribed event:', error);
      // Reject without requeue so that a failing message is not redelivered forever
      channel.nack(msg, false, false);
    }
  });
}

// Post-process text with Claude
async function processTranscription(event) {
  const { meeting_id, speaker, transcription } = event.payload;

  // Load the current content of the section that this function updates.
  // (A KEYS scan with a wildcard blocks Redis and returns keys in no particular order.)
  const context =
    (await redisClient.get(`meeting:${meeting_id}:sections:discussion`)) ||
    '(새로운 회의)'; // "(new meeting)"

  // Call the Claude API
  const message = await anthropic.messages.create({
    model: 'claude-3-5-sonnet-20241022',
    max_tokens: 2048,
    temperature: 0.3,
    system: '당신은 회의록 작성 전문가입니다. 발언 내용을 구조화하여 명확하고 간결하게 정리합니다.',
    messages: [
      {
        role: 'user',
        content: `다음은 회의 발언 내용입니다. 회의록 형식에 맞춰 정리해주세요.

발언: "${transcription.text}"
화자: ${speaker.name}
시간: ${event.payload.timestamp}

기존 회의록 내용:
${context}

템플릿 섹션: 안건, 논의 내용, 결정 사항, Todo`
      }
    ]
  });

  const structuredContent = message.content[0].text;

  // Store the updated content in Redis
  await redisClient.setEx(
    `meeting:${meeting_id}:sections:discussion`,
    3600,
    structuredContent
  );

  // Synchronize in real time over the WebSocket
  await broadcastToWebSocket(meeting_id, {
    type: 'transcription_update',
    data: {
      meeting_id,
      speaker,
      transcription: {
        text: structuredContent,
        is_final: true,
        confidence: transcription.confidence
      },
      target_section: 'discussion',
      action: 'append'
    }
  });
}

// WebSocket broadcast
async function broadcastToWebSocket(meetingId, message) {
  // Send the message to the WebSocket server (to be implemented)
  // In practice: Redis Pub/Sub or integration with a separate WebSocket server
}

// Start the service
(async () => {
  await redisClient.connect();
  await consumeQueue();
  console.log('AI Service started');
})();

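6. Error handling and recovery strategy

6.1 Error scenarios

Scenario	Detection	Response
Azure Speech outage	SDK error callback	Automatic reconnection (exponential backoff), save the recording locally
Network disconnection	WebSocket connection lost	Automatic reconnection (up to 5 times), save on the client
Low confidence	score < 0.7	Show a warning to the user, recommend manual correction
Speaker identification failure	Speaker_Unknown	Show as "미지정 화자" (unassigned speaker), provide a manual assignment interface
RabbitMQ outage	Message publish failure	After 3 retries, store temporarily in Redis for manual recovery
Azure API quota exceeded	429 Too Many Requests	Send a warning alert, recommend pausing the meeting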
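
Two rows of the table (low confidence and speaker identification failure) are decided per result, so they can be expressed as a small pure function. The sketch below uses the 0.7 threshold and the `Speaker_Unknown` label from the table; `reviewFlags` and the flag names are hypothetical.

```javascript
const LOW_CONFIDENCE = 0.7; // threshold from the table above

// Returns the UI flags a recognition result should raise.
function reviewFlags(result) {
  const flags = [];
  if (typeof result.confidence === 'number' && result.confidence < LOW_CONFIDENCE) {
    flags.push('low_confidence_warning'); // warn and recommend manual correction
  }
  if (!result.speaker_id || result.speaker_id === 'Speaker_Unknown') {
    flags.push('unassigned_speaker'); // offer the manual assignment interface
  }
  return flags;
}

console.log(reviewFlags({ confidence: 0.62, speaker_id: 'Speaker_Unknown' }));
// [ 'low_confidence_warning', 'unassigned_speaker' ]
console.log(reviewFlags({ confidence: 0.95, speaker_id: 'Speaker_1' }));
// []
```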

6.2 Retry logic

async function retryWithExponentialBackoff(fn, maxRetries = 3) {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      return await fn();
    } catch (error) {
      if (attempt === maxRetries - 1) throw error;
      const delay = Math.pow(2, attempt) * 1000; // 1s, then 2s (doubles on each retry)
      await new Promise(resolve => setTimeout(resolve, delay));
    }
  }
}
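
With `maxRetries` attempts there are only `maxRetries - 1` waits, because the last failure is rethrown immediately. The sketch below makes the resulting schedule explicit; `backoffDelays` is a hypothetical helper, and the 30-second cap is an assumption added to keep long retry budgets bounded.

```javascript
// Lists the waits (in ms) between attempts for a given retry budget.
function backoffDelays(maxAttempts, baseMs = 1000, capMs = 30000) {
  const delays = [];
  for (let attempt = 0; attempt < maxAttempts - 1; attempt++) {
    delays.push(Math.min(capMs, baseMs * 2 ** attempt));
  }
  return delays;
}

console.log(backoffDelays(3)); // [ 1000, 2000 ]
console.log(backoffDelays(5)); // [ 1000, 2000, 4000, 8000 ]
```

For the 5 reconnection attempts listed in section 6.1, the total wait before giving up would be 15 seconds under these assumptions.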

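7. Performance and scalability

7.1 Performance targets

Metric	Target	Measurement
STT latency	< 1 second	Start of utterance → on-screen display (Azure real-time streaming)
WebSocket latency	< 100 ms	Message published → received by the client
Claude API response	< 2 seconds	API call → response received
Concurrent meetings	100	Load test of concurrent Azure Speech sessions
Speaker identification accuracy	> 85%	Speaker Diarization accuracy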
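
The table gives target values but not the statistic to compare against them. One common choice, assumed here rather than taken from this document, is the 95th percentile of measured latencies. `percentile` and `meetsTarget` are hypothetical helpers using the nearest-rank method.

```javascript
// Nearest-rank percentile of a list of numbers.
function percentile(values, p) {
  if (values.length === 0) throw new Error('no samples');
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

// True when the chosen percentile is strictly below the target.
function meetsTarget(samplesMs, targetMs, p = 95) {
  return percentile(samplesMs, p) < targetMs;
}

const sttLatenciesMs = [620, 700, 710, 740, 780, 800, 820, 850, 900, 1150];
console.log(percentile(sttLatenciesMs, 50)); // 780
console.log(meetsTarget(sttLatenciesMs, 1000)); // false (p95 is 1150 ms)
```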

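7.2 Scalability strategy

Horizontal scaling: distribute the STT Gateway WebSocket server across multiple instances (Load Balancer)
Caching: cache speaker matching information in Redis to avoid duplicate processing
Queue partitioning: partition RabbitMQ by meeting ID
Azure resource management: monitor Azure Speech Services quotas and scale automatically
CDN: use S3 + CloudFront when storing audio files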
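
Partitioning by meeting ID keeps all events of one meeting in order on a single queue. One way to approximate it, assumed here, is one queue per partition with a stable hash choosing the partition. The hash below is an FNV-1a-style hash over UTF-16 code units; `partitionFor` and the `text-transcribed.N` naming are hypothetical.

```javascript
// Maps a meeting id to a stable partition number in [0, partitions).
function partitionFor(meetingId, partitions) {
  let hash = 0x811c9dc5; // FNV offset basis (32-bit)
  for (let i = 0; i < meetingId.length; i++) {
    hash ^= meetingId.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // multiply by the FNV prime, keep 32 bits unsigned
  }
  return hash % partitions;
}

// Queue name for a meeting, e.g. "text-transcribed.2".
const queueName = (meetingId, partitions = 4) =>
  `text-transcribed.${partitionFor(meetingId, partitions)}`;

console.log(queueName('MTG_001') === queueName('MTG_001')); // true: same meeting, same queue
```

Changing the partition count remaps most meetings, so it should only be done when no meetings are in progress.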

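8. Security and privacy

8.1 Security requirements

Transport encryption: use HTTPS/WSS
Authentication/authorization: JWT-based access control for meetings
Audio data protection: store recordings encrypted (AES-256)
Personal data handling: GDPR compliance, limited retention period for audio data (30 days)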
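
The requirement names AES-256 but not a mode of operation. The sketch below assumes AES-256-GCM, which also detects tampering, and uses only Node's built-in `crypto` module. `encryptAudio` and `decryptAudio` are hypothetical names, and key management (where the 32-byte key comes from) is out of scope here.

```javascript
const nodeCrypto = require('node:crypto');

// Encrypts a Buffer with AES-256-GCM. The key must be 32 bytes.
function encryptAudio(plain, key) {
  const iv = nodeCrypto.randomBytes(12); // a fresh 96-bit nonce for every recording
  const cipher = nodeCrypto.createCipheriv('aes-256-gcm', key, iv);
  const data = Buffer.concat([cipher.update(plain), cipher.final()]);
  return { iv, tag: cipher.getAuthTag(), data };
}

// Decrypts and verifies; throws if the data or tag was modified.
function decryptAudio({ iv, tag, data }, key) {
  const decipher = nodeCrypto.createDecipheriv('aes-256-gcm', key, iv);
  decipher.setAuthTag(tag);
  return Buffer.concat([decipher.update(data), decipher.final()]);
}

const key = nodeCrypto.randomBytes(32); // for illustration only
const sealed = encryptAudio(Buffer.from('pcm-bytes'), key);
console.log(decryptAudio(sealed, key).toString()); // pcm-bytes
```

The `iv` and `tag` are not secret and can be stored next to the ciphertext.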

8.2 Data lifecycle

Recording starts → real-time processing → Redis cache (10 minutes)
                          ↓
                    PostgreSQL storage (30 days)
                          ↓
                    Automatic deletion (30 days after the meeting ends)
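
The deletion step can be driven by a simple date calculation. The sketch below assumes the retention clock starts at the meeting's end time and works in UTC; `deletionDate` and `isExpired` are hypothetical helpers.

```javascript
const DAY_MS = 24 * 60 * 60 * 1000;

// The moment after which a meeting's stored data should be deleted.
function deletionDate(meetingEndedAt, retentionDays = 30) {
  return new Date(new Date(meetingEndedAt).getTime() + retentionDays * DAY_MS);
}

// True once the retention period has elapsed.
function isExpired(meetingEndedAt, now = new Date(), retentionDays = 30) {
  return now >= deletionDate(meetingEndedAt, retentionDays);
}

console.log(deletionDate('2025-10-21T15:00:00.000Z').toISOString());
// 2025-11-20T15:00:00.000Z
```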

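9. Monitoring and logging

9.1 Monitoring metrics

STT success rate: Azure Speech recognition success rate
Average confidence score: tracks transcription quality
Processing latency: time spent in each stage
Error rate: share of API errors and network errors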

9.2 Logging strategy

// Structured log
{
  "timestamp": "2025-10-21T14:30:18.000Z",
  "level": "INFO",
  "service": "stt-gateway",
  "event": "transcription_success",
  "request_id": "REQ_001",
  "meeting_id": "MTG_001",
  "provider": "azure-speech",
  "confidence": 0.95,
  "processing_time_ms": 850
}
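
Log entries of this shape can be produced by a thin wrapper that writes one JSON object per line. This is a minimal sketch with no logging library; `logEvent` is a hypothetical name, and the `write` parameter exists so that tests can capture output.

```javascript
// Writes one structured log line and returns the entry that was written.
function logEvent(level, event, fields = {}, write = console.log) {
  const entry = {
    timestamp: new Date().toISOString(),
    level,
    service: 'stt-gateway',
    event,
    ...fields,
  };
  write(JSON.stringify(entry));
  return entry;
}

logEvent('INFO', 'transcription_success', {
  request_id: 'REQ_001',
  meeting_id: 'MTG_001',
  provider: 'azure-speech',
  confidence: 0.95,
  processing_time_ms: 850,
});
```

Transcribed text is deliberately left out of the fields, in line with the privacy requirements in section 8.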

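10. Test strategy

10.1 Unit tests

Mocking the Azure Speech SDK integration
Claude API response parsing
Redis caching logic
Speaker matching algorithm

10.2 Integration tests

Full flow: STT Gateway (Azure Speech) → RabbitMQ → AI Service
Bidirectional real-time WebSocket synchronization
Verification of speaker identification and matching accuracy

10.3 Performance tests

Simulation of 100 concurrent meetings (concurrent Azure Speech sessions)
Measurement of real-time streaming latency (target: < 1 second)
Azure API quota and throughput tests

10.4 Quality tests

STT accuracy measured under varied audio quality conditions
Verification of speaker identification accuracy (target: > 85%)
Tests with Korean dialects and accents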

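11. Implementation schedule

Phase	Task	Owner	Estimated duration
1	Frontend audio capture (WebSocket)	최유진	4 days
2	Azure Speech SDK integration and STT Gateway development	이준호	6 days
3	Speaker identification and matching logic	박서연	3 days
4	RabbitMQ setup and event handling	이동욱	3 days
5	AI Service (Claude integration)	박서연	4 days
6	Redis caching	이준호	2 days
7	Bidirectional real-time WebSocket synchronization	최유진	4 days
8	Integration tests and speaker identification verification	정도현	6 days
9	Performance optimization and Azure resource tuning	All	3 days

Total estimated duration: 35 days (about 5 weeks)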

12. References

Azure Speech Services

Other references

13. Change history

Version	Date	Author	Changes
1.0	2025-10-21	Entire development team	Initial version (Whisper + Google hybrid strategy)
2.0	2025-10-21	Entire development team	Switched entirely to an Azure Speech Services-only strategy - STT engine: Whisper → Azure Speech Services - Real-time streaming applied (latency < 1 second) - Speaker Diarization supported by default - Fallback strategy removed (Azure only) - Implementation code fully rewritten (Frontend/Backend) - Schedule adjusted (4 weeks → 5 weeks)

Document approval:

AI Specialist: 박서연
Backend Developer: 이준호, 이동욱
Frontend Developer: 최유진
Architect: 홍길동
QA Engineer: 정도현
