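# KT Event Marketing Service - Production Physical Architecture Design

## 1. Overview

### 1.1 Design Purpose

This document defines the physical architecture of the production environment for the KT Event Marketing Service.

- **Design scope**: Physical infrastructure design for the production environment only
- **Design goals**:
  - A production environment built for high availability and scalability
  - Enterprise-grade security and monitoring
  - Performance tuned to the real user load
  - A stable configuration centered on managed services
- **Target environment**: Azure-based production environment (Production)
- **Target system**: 7 microservices plus managed backing services

### 1.2 Design Principles

The production environment follows five core principles.

| Principle | Description | How it is applied |
|-----------|-------------|-------------------|
| **High availability** | At least 99.9% availability | Multi-Zone deployment, managed services |
| **Scalability** | Automatic scaling | HPA, cluster autoscaler |
| **Security first** | Multi-layer security architecture | WAF, Private Endpoints, RBAC |
| **Observability** | Comprehensive monitoring | Azure Monitor, Application Insights |
| **Disaster recovery** | Automated backup and recovery | Cross-region replication, automatic failover |

### 1.3 Reference Architecture

| Architecture document | Relationship | How it is referenced |
|-----------------------|--------------|----------------------|
| [Architecture patterns](../pattern/architecture-pattern.md) | Based on microservice patterns | Service decomposition and communication patterns |
| [Logical architecture](../logical/) | Logical component structure | Physical placement and connectivity |
| [Data design](../database/) | Data store requirements | Managed database configuration |
| [High-level architecture](../high-level-architecture.md) | Overall system structure | CI/CD and enterprise services |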

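## 2. Production Environment Architecture Overview

### 2.1 Environment Characteristics

| Characteristic | Production setting | Rationale |
|----------------|--------------------|-----------|
| **Purpose** | Serve real users | Business continuity |
| **User scale** | 10,000 to 100,000 concurrent users | Scalable architecture |
| **Availability target** | 99.9% (about 8.76 hours of downtime per year) | SLA-based availability |
| **Scalability** | Automatic scaling (2-10x) | Respond to traffic patterns |
| **Security level** | Enterprise-grade (multi-layer security) | Data protection and regulatory compliance |
| **Data protection** | Protection of real personal data | Compliance with GDPR and the Personal Information Protection Act |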
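
The downtime budget behind the 99.9% target follows from simple arithmetic. As a sketch, assuming a 365-day year and that every minute of unavailability, planned or not, counts against the budget:

$$
(1 - 0.999) \times 365 \times 24\ \text{h} = 8.76\ \text{h per year} \approx 43.8\ \text{min per month}
$$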

### 2.2 Overall Architecture

Traffic flows through the system as CDN → Application Gateway → AKS → managed services.

- **Architecture diagram**: [physical-architecture-prod.mmd](./physical-architecture-prod.mmd)
- **Network diagram**: [network-prod.mmd](./network-prod.mmd)

**Key components**:
- **Azure Front Door + CDN**: Global acceleration and DDoS protection
- **Application Gateway + WAF**: L7 load balancing and web security
- **AKS (Standard tier)**: Multi-Zone Kubernetes cluster
- **Azure Database for PostgreSQL**: Managed primary database
- **Azure Cache for Redis**: Managed cache service
- **Azure Service Bus Premium**: Enterprise messaging

## 3. Compute Architecture

### 3.1 Kubernetes Cluster Configuration

#### 3.1.1 Cluster Settings

| Setting | Value | Description |
|---------|-------|-------------|
| **Kubernetes version** | 1.28.x | Stable release line |
| **Service tier** | Standard | Supports production workloads |
| **CNI plugin** | Azure CNI | High-performance networking |
| **DNS** | CoreDNS + Private DNS | Internal name resolution |
| **RBAC** | Strict access control | Principle of least privilege |
| **Pod Security** | Restricted policy | Hardened security settings |
| **Ingress Controller** | Application Gateway | Azure-native integration |

#### 3.1.2 Node Pool Configuration

| Node pool | Instance size | Node count | Multi-Zone | Scaling | Purpose |
|-----------|---------------|------------|------------|---------|---------|
| **System** | Standard_D2s_v3 | 3 (1 per Zone) | 3-Zone | Manual | System workloads |
| **Application** | Standard_D4s_v3 | 6 (2 per Zone) | 3-Zone | Automatic (3-15) | Application workloads |

### 3.2 High Availability Configuration

#### 3.2.1 Multi-Zone Deployment

| Availability strategy | Setting | Description |
|-----------------------|---------|-------------|
| **Zone distribution** | Even spread across 3 Zones | Uses every Korea Central Zone |
| **Pod Anti-Affinity** | Enabled | Prevents replicas from concentrating in one Zone |
| **Pod Disruption Budget** | At least 1 Pod kept available | Stability during rolling updates |
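
The Pod Anti-Affinity and Pod Disruption Budget rows could be expressed roughly as follows for one service. This is a minimal sketch, not the project's actual manifests: the `app: event-service` label, the `prod` namespace (taken from the service DNS names in section 4.4), and the image reference are assumptions.

```yaml
# PodDisruptionBudget: keep at least one event-service Pod during voluntary disruptions
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: event-service-pdb
  namespace: prod
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: event-service   # assumed label
---
# Deployment: prefer placing replicas in different availability zones
apiVersion: apps/v1
kind: Deployment
metadata:
  name: event-service
  namespace: prod
spec:
  replicas: 3
  selector:
    matchLabels:
      app: event-service
  template:
    metadata:
      labels:
        app: event-service
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              topologyKey: topology.kubernetes.io/zone
              labelSelector:
                matchLabels:
                  app: event-service
      containers:
      - name: event-service
        image: example.azurecr.io/event-service:1.0.0   # placeholder image
```

A `preferred` rule keeps replicas schedulable when a zone is unavailable; `requiredDuringSchedulingIgnoredDuringExecution` would enforce strict separation but leaves any replica beyond the third unschedulable in a 3-Zone cluster.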

### 3.3 Resource Allocation by Service

#### 3.3.1 Application Services

| Service | CPU Requests | CPU Limits | Memory Requests | Memory Limits | Replicas | HPA |
|----------|--------------|------------|-----------------|---------------|----------|-----|
| **user-service** | 200m | 500m | 256Mi | 512Mi | 3 | 2-10 |
| **event-service** | 300m | 800m | 512Mi | 1Gi | 3 | 3-15 |
| **content-service** | 200m | 500m | 256Mi | 512Mi | 2 | 2-8 |
| **ai-service** | 500m | 1000m | 1Gi | 2Gi | 2 | 2-8 |
| **participation-service** | 200m | 500m | 256Mi | 512Mi | 2 | 2-10 |
| **analytics-service** | 300m | 800m | 512Mi | 1Gi | 2 | 2-6 |
| **distribution-service** | 200m | 500m | 256Mi | 512Mi | 2 | 2-8 |
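
As a rough capacity check (a sketch, assuming every Pod lands on the Application node pool), the CPU requests can be summed from the Replicas column and from the HPA maximums, in table order:

$$
\begin{aligned}
\text{CPU requests (baseline)} &= 3(200) + 3(300) + 2(200) + 2(500) + 2(200) + 2(300) + 2(200) = 4300\text{m} \\
\text{CPU requests (HPA max)} &= 10(200) + 15(300) + 8(200) + 8(500) + 10(200) + 6(300) + 8(200) = 17500\text{m}
\end{aligned}
$$

The six Standard_D4s_v3 nodes provide 6 × 4 = 24 vCPU nominally, so even 17.5 vCPU of requests fits on the baseline node count, although the allocatable CPU per node is somewhat lower than nominal because of system reservations.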

#### 3.3.2 HPA Configuration

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: event-service-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: event-service
  minReplicas: 3
  maxReplicas: 15
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
      - type: Percent
        value: 100
        periodSeconds: 15
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 10
        periodSeconds: 60
```
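
For reference, the HPA controller derives the target replica count with the standard Kubernetes formula. A worked example against this policy's 70% CPU target, assuming a hypothetical observed average of 90% across the 3 minimum replicas:

$$
\text{desiredReplicas} = \left\lceil \text{currentReplicas} \times \frac{\text{currentMetricValue}}{\text{desiredMetricValue}} \right\rceil = \left\lceil 3 \times \frac{90}{70} \right\rceil = \lceil 3.86 \rceil = 4
$$

When several metrics are configured, the controller evaluates each one and uses the largest result. The `scaleUp` policy then caps growth at 100% of the current replica count per 15-second period.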

## 4. Network Architecture

### 4.1 Network Topology

**Network layout**: [network-prod.mmd](./network-prod.mmd)

#### 4.1.1 Virtual Network Configuration

| Subnet | Address range | Purpose | Special settings |
|--------|---------------|---------|------------------|
| **Gateway Subnet** | 10.0.4.0/24 | Application Gateway | Static IP assignment |
| **Application Subnet** | 10.0.1.0/24 | AKS cluster | CNI integration |
| **Database Subnet** | 10.0.2.0/24 | Managed database | Private Endpoint |
| **Cache Subnet** | 10.0.3.0/24 | Managed cache | Private Endpoint |

#### 4.1.2 Network Security Groups

| Direction | Rule name | Port | Source/Destination | Purpose |
|-----------|-----------|------|--------------------|---------|
| **Inbound** | AllowHTTPS | 443 | Internet | Web traffic |
| **Inbound** | AllowHTTP | 80 | Internet | HTTP redirect |
| **Inbound** | DenyAll | * | * | Default deny |
| **Outbound** | AllowInternal | * | VNet | Internal communication |

### 4.2 Traffic Routing

#### 4.2.1 Application Gateway Configuration

| Setting | Value | Description |
|---------|-------|-------------|
| **SKU** | WAF_v2 | Includes Web Application Firewall |
| **Instances** | 2-10 (autoscaling) | Adjusted dynamically to traffic |
| **Public IP** | Static IP | For domain binding |
| **Backend pool** | AKS NodePort services | Ports 30080-30086 |

#### 4.2.2 WAF Configuration

```yaml
# Example WAF policy
apiVersion: network.azure.com/v1
kind: ApplicationGatewayWebApplicationFirewallPolicy
metadata:
  name: kt-event-waf-policy
spec:
  policySettings:
    mode: Prevention
    state: Enabled
    fileUploadLimitInMb: 100
    maxRequestBodySizeInKb: 128
  managedRules:
    managedRuleSets:
    - ruleSetType: OWASP
      ruleSetVersion: "3.2"
      ruleGroupOverrides:
      - ruleGroupName: REQUEST-920-PROTOCOL-ENFORCEMENT
        rules:
        - ruleId: "920230"
          state: Disabled
  customRules:
  - name: RateLimitRule
    priority: 1
    ruleType: RateLimitRule
    rateLimitDuration: PT1M
    rateLimitThreshold: 100
    matchConditions:
    - matchVariables:
      - variableName: RemoteAddr
    action: Block
```

### 4.3 Network Policies

#### 4.3.1 Controlling Communication Between Microservices

| Policy name | Ingress rule | Egress rule | Applies to |
|-------------|--------------|-------------|------------|
| **default-deny** | Deny all traffic | Deny all traffic | Entire namespace |
| **allow-ingress** | Allow only the Ingress Controller | Unrestricted | Web services |
| **allow-database** | Allow only applications | DNS and PostgreSQL only | Database communication |

### 4.4 Service Discovery

| Service | Internal DNS name | Port | External access | LoadBalancer type |
|----------|---------------|------|----------------|--------------------|
| **user-service** | user-service.prod.svc.cluster.local | 8080 | NodePort 30080 | Application Gateway |
| **event-service** | event-service.prod.svc.cluster.local | 8080 | NodePort 30081 | Application Gateway |
| **content-service** | content-service.prod.svc.cluster.local | 8080 | NodePort 30082 | Application Gateway |
| **ai-service** | ai-service.prod.svc.cluster.local | 8080 | NodePort 30083 | Application Gateway |
| **participation-service** | participation-service.prod.svc.cluster.local | 8080 | NodePort 30084 | Application Gateway |
| **analytics-service** | analytics-service.prod.svc.cluster.local | 8080 | NodePort 30085 | Application Gateway |
| **distribution-service** | distribution-service.prod.svc.cluster.local | 8080 | NodePort 30086 | Application Gateway |
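
One row of the table maps to a Service manifest along these lines. It is a minimal sketch: the `app: event-service` selector label and the container port are assumptions, while the name, namespace, and ports come from the table.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: event-service          # resolves as event-service.prod.svc.cluster.local
  namespace: prod
spec:
  type: NodePort
  selector:
    app: event-service         # assumed label
  ports:
  - name: http
    protocol: TCP
    port: 8080                 # cluster-internal service port
    targetPort: 8080           # container port (assumed)
    nodePort: 30081            # fixed port targeted by the Application Gateway backend pool
```

Pinning `nodePort` keeps the Application Gateway backend settings stable across redeployments; the value must fall inside the cluster's NodePort range, which is 30000-32767 by default.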

## 5. Data Architecture

### 5.1 Managed Primary Database

#### 5.1.1 Database Configuration

| Setting | Value | Description |
|---------|-------|-------------|
| **Service** | Azure Database for PostgreSQL Flexible Server | Managed database |
| **Version** | PostgreSQL 15 | Stable major version |
| **SKU** | GP_Standard_D4s_v3 | 4 vCPU, 16GB RAM |
| **Storage** | 1TB Premium SSD | High-performance storage |
| **High availability** | Zone Redundant | Replication across Zones |
| **Backup** | 35-day automated backups | Point-in-time recovery |
| **Security** | Private Endpoint | VNet-internal traffic only |

#### 5.1.2 Read-Only Replica

```yaml
# Example read replica configuration
apiVersion: dbforpostgresql.azure.com/v1beta1
kind: FlexibleServer
metadata:
  name: kt-event-db-replica
spec:
  location: Korea Central
  sourceServerId: /subscriptions/.../kt-event-db-primary
  replicaRole: Read
  sku:
    name: GP_Standard_D2s_v3
    tier: GeneralPurpose
  storage:
    sizeGB: 512
    tier: P20  # 512 GiB maps to the P20 performance tier (P4 is 32 GiB)
  highAvailability:
    mode: ZoneRedundant
```

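### 5.2 Managed Cache Service

#### 5.2.1 Cache Cluster Configuration

| Setting | Value | Description |
|---------|-------|-------------|
| **Service** | Azure Cache for Redis Premium | Managed cache |
| **Size** | P2 (13 GB) | Production workloads |
| **Replication** | 3 replicas | High availability |
| **Clustering** | Enabled | Supports horizontal scaling |
| **Persistence** | RDB + AOF | Durable data storage |
| **Security** | Private Endpoint + TLS | Encrypted communication |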

#### 5.2.2 Cache Strategy

```yaml
# Example cache policy
apiVersion: v1
kind: ConfigMap
metadata:
  name: redis-config
data:
  redis.conf: |
    # Memory policy
    maxmemory-policy allkeys-lru

    # RDB snapshots
    save 900 1
    save 300 10
    save 60 10000

    # AOF settings
    appendonly yes
    appendfsync everysec

    # Cluster settings
    cluster-enabled yes
    cluster-config-file nodes.conf
    cluster-node-timeout 5000
```

### 5.3 Data Backup and Recovery

#### 5.3.1 Automated Backup Strategy

```yaml
# Example backup policy
apiVersion: backup.azure.com/v1
kind: BackupPolicy
metadata:
  name: kt-event-backup-policy
spec:
  postgresql:
    retentionPolicy:
      dailyBackups: 35
      weeklyBackups: 12
      monthlyBackups: 12
      yearlyBackups: 7
    backupSchedule:
      dailyBackup:
        time: "02:00"
      weeklyBackup:
        day: "Sunday"
        time: "01:00"
  redis:
    retentionPolicy:
      dailyBackups: 7
      weeklyBackups: 4
    persistencePolicy:
      rdbEnabled: true
      aofEnabled: true
```

## 6. Messaging Architecture

### 6.1 Managed Message Queue

#### 6.1.1 Message Queue Configuration

```yaml
# Service Bus Premium configuration
apiVersion: servicebus.azure.com/v1beta1
kind: Namespace
metadata:
  name: kt-event-servicebus-prod
spec:
  sku:
    name: Premium
    capacity: 1
  zoneRedundant: true
  encryption:
    enabled: true
  networkRuleSets:
    defaultAction: Deny
    virtualNetworkRules:
    - subnetId: /subscriptions/.../vnet/subnets/application
      ignoreMissingVnetServiceEndpoint: false
  privateEndpoints:
  - name: kt-event-sb-pe
    subnetId: /subscriptions/.../vnet/subnets/application
```

#### 6.1.2 Queue and Topic Design

```yaml
# Example queue configuration
apiVersion: servicebus.azure.com/v1beta1
kind: Queue
metadata:
  name: ai-schedule-generation
  namespace: kt-event-servicebus-prod
spec:
  maxSizeInMegabytes: 16384
  maxDeliveryCount: 10
  duplicateDetectionHistoryTimeWindow: PT10M
  enablePartitioning: true
  deadLetteringOnMessageExpiration: true
  enableBatchedOperations: true
  autoDeleteOnIdle: P14D
  forwardTo: ""
  forwardDeadLetteredMessagesTo: "ai-schedule-dlq"
```

## 7. Security Architecture

### 7.1 Multi-Layer Security Architecture

#### 7.1.1 Security Layers

```yaml
# Security layer definitions (L1-L4)
securityLayers:
  L1_Network:
    components:
      - Azure Front Door (DDoS Protection)
      - Application Gateway WAF
      - Network Security Groups
    purpose: "Network-level security"

  L2_Platform:
    components:
      - AKS RBAC
      - Pod Security Standards
      - Network Policies
    purpose: "Platform-level security"

  L3_Application:
    components:
      - Azure Active Directory
      - Managed Identity
      - OAuth 2.0 / JWT
    purpose: "Application-level authentication and authorization"

  L4_Data:
    components:
      - Private Endpoints
      - TLS 1.3 Encryption
      - Azure Key Vault
    purpose: "Data protection and encryption"
```

### 7.2 Authentication and Authorization

#### 7.2.1 Cloud Identity Integration

```yaml
# User-assigned managed identity and its pod binding (AAD Pod Identity CRDs)
apiVersion: aadpodidentity.k8s.io/v1
kind: AzureIdentity
metadata:
  name: kt-event-identity
spec:
  type: 0  # User Assigned Identity
  resourceID: /subscriptions/.../kt-event-identity
  clientID: xxxx-xxxx-xxxx-xxxx
---
apiVersion: aadpodidentity.k8s.io/v1
kind: AzureIdentityBinding
metadata:
  name: kt-event-identity-binding
spec:
  azureIdentity: kt-event-identity
  selector: kt-event-app
```

#### 7.2.2 RBAC Configuration

```yaml
# Cluster role, service account, and binding
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: kt-event-app-role
rules:
- apiGroups: [""]
  resources: ["secrets", "configmaps"]
  verbs: ["get", "list"]
- apiGroups: ["apps"]
  resources: ["deployments"]
  verbs: ["get", "list", "watch"]
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: kt-event-app-sa
  namespace: prod  # same namespace as the service DNS names (*.prod.svc.cluster.local)
  annotations:
    azure.workload.identity/client-id: xxxx-xxxx-xxxx-xxxx
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: kt-event-app-binding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: kt-event-app-role
subjects:
- kind: ServiceAccount
  name: kt-event-app-sa
  namespace: prod
```

### 7.3 Network Security

#### 7.3.1 Private Endpoints

```yaml
# PostgreSQL Private Endpoint
apiVersion: network.azure.com/v1beta1
kind: PrivateEndpoint
metadata:
  name: kt-event-db-pe
spec:
  subnet: /subscriptions/.../vnet/subnets/database
  privateLinkServiceConnections:
  - name: kt-event-db-plsc
    privateLinkServiceId: /subscriptions/.../kt-event-postgresql
    groupIds: ["postgresqlServer"]
  customDnsConfigs:
  - fqdn: kt-event-db.privatelink.postgres.database.azure.com
    ipAddresses: ["10.0.2.10"]
```

### 7.4 Encryption and Key Management

#### 7.4.1 Managed Key Vault Configuration

```yaml
# Key Vault configuration
apiVersion: keyvault.azure.com/v1beta1
kind: Vault
metadata:
  name: kt-event-keyvault-prod
spec:
  location: Korea Central
  sku:
    family: A
    name: premium
  enabledForDeployment: true
  enabledForDiskEncryption: true
  enabledForTemplateDeployment: true
  enableSoftDelete: true
  softDeleteRetentionInDays: 30
  enablePurgeProtection: true
  networkAcls:
    defaultAction: Deny
    virtualNetworkRules:
    - id: /subscriptions/.../vnet/subnets/application
  accessPolicies:
  - tenantId: xxxx-xxxx-xxxx-xxxx
    objectId: xxxx-xxxx-xxxx-xxxx  # Managed Identity
    permissions:
      secrets: ["get", "list"]
      keys: ["get", "list", "decrypt", "encrypt"]
```

## 8. Monitoring and Observability

### 8.1 Comprehensive Monitoring Stack

#### 8.1.1 Cloud Monitoring Integration

```yaml
# Azure Monitor settings
apiVersion: insights.azure.com/v1beta1
kind: Workspace
metadata:
  name: kt-event-workspace-prod
spec:
  location: Korea Central
  sku:
    name: PerGB2018
  retentionInDays: 30
  publicNetworkAccessForIngestion: Enabled
  publicNetworkAccessForQuery: Enabled
---
# Application Insights
apiVersion: insights.azure.com/v1beta1
kind: Component
metadata:
  name: kt-event-appinsights-prod
spec:
  applicationType: web
  workspaceId: /subscriptions/.../kt-event-workspace-prod
  samplingPercentage: 100
```

#### 8.1.2 Metrics and Alerts

```yaml
# Critical alert settings
apiVersion: insights.azure.com/v1beta1
kind: MetricAlert
metadata:
  name: high-cpu-alert
spec:
  description: "High CPU usage alert"
  severity: 2
  enabled: true
  scopes:
  - /subscriptions/.../resourceGroups/kt-event-prod/providers/Microsoft.ContainerService/managedClusters/kt-event-aks-prod
  evaluationFrequency: PT1M
  windowSize: PT5M
  criteria:
    allOf:
    - metricName: "cpuUsagePercentage"
      operator: GreaterThan
      threshold: 80
      timeAggregation: Average
  actions:
  - actionGroupId: /subscriptions/.../actionGroups/kt-event-alerts
---
# Resource health alert
apiVersion: insights.azure.com/v1beta1
kind: ActivityLogAlert
metadata:
  name: resource-health-alert
spec:
  description: "Resource health degradation"
  enabled: true
  scopes:
  - /subscriptions/xxxx-xxxx-xxxx-xxxx
  condition:
    allOf:
    - field: category
      equals: ResourceHealth
    - field: properties.currentHealthStatus
      equals: Degraded
```

### 8.2 Logging and Tracing

#### 8.2.1 Centralized Logging

```yaml
# Fluentd configuration (ConfigMap consumed by the log-collection DaemonSet)
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluentd-config
data:
  fluent.conf: |
    <source>
      @type tail
      path /var/log/containers/*.log
      pos_file /var/log/fluentd-containers.log.pos
      tag kubernetes.*
      read_from_head true
      <parse>
        @type json
        time_format %Y-%m-%dT%H:%M:%S.%NZ
      </parse>
    </source>

    <match kubernetes.**>
      @type azure-loganalytics
      customer_id "#{ENV['WORKSPACE_ID']}"
      shared_key "#{ENV['WORKSPACE_KEY']}"
      log_type ContainerLogs
      <buffer>
        @type file
        path /var/log/fluentd-buffers/kubernetes.buffer
        flush_mode interval
        flush_interval 30s
        chunk_limit_size 2m
        queue_limit_length 8
        retry_max_times 17
        retry_wait 1.0
      </buffer>
    </match>
```

#### 8.2.2 Application Performance Monitoring

```yaml
# APM settings and custom metrics
apiVersion: v1
kind: ConfigMap
metadata:
  name: apm-config
data:
  applicationinsights.json: |
    {
      "connectionString": "InstrumentationKey=xxxx-xxxx-xxxx-xxxx;IngestionEndpoint=https://koreacentral-1.in.applicationinsights.azure.com/",
      "role": {
        "name": "kt-event-services"
      },
      "sampling": {
        "percentage": 100
      },
      "instrumentation": {
        "logging": {
          "level": "INFO"
        },
        "micrometer": {
          "enabled": true
        }
      },
      "customMetrics": [
        {
          "name": "business.events.created",
          "description": "Number of events created"
        },
        {
          "name": "business.participants.registered",
          "description": "Number of participants registered"
        }
      ]
    }
```

## 9. Deployment Components

| Component | Role | Configuration | Security scanning | Rollback policy |
|-----------|------|---------------|-------------------|-----------------|
| **GitHub Actions** | CI/CD pipeline | Enterprise workflows | Snyk, SonarQube | Automatic rollback |
| **Azure Container Registry** | Container image registry | Premium tier | Vulnerability scanning | Image versioning |
| **ArgoCD** | GitOps deployment | HA mode | Policy validation | Git-based rollback |
| **Helm** | Package management | Chart versioning | Security policies | Release history |

## 10. Disaster Recovery and High Availability

### 10.1 Disaster Recovery Strategy

#### 10.1.1 Backup and Recovery Objectives

```yaml
# RTO/RPO definitions
disasterRecovery:
  objectives:
    RTO: "1 hour"       # Recovery Time Objective
    RPO: "15 minutes"   # Recovery Point Objective
  strategies:
    database:
      primaryRegion: "Korea Central"
      secondaryRegion: "Korea South"
      replication: "Geo-Redundant"
      automaticFailover: true
    application:
      multiRegion: false
      backupRegion: "Korea South"
      restoreTime: "30 minutes"
    storage:
      replication: "GRS"  # Geo-Redundant Storage
      accessTier: "Hot"
```

#### 10.1.2 Automatic Failover

```yaml
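# Database failover group (failover groups are an Azure SQL construct;
# the PostgreSQL Flexible Server equivalent is a geo read replica that is promoted on failover)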
apiVersion: sql.azure.com/v1beta1
kind: FailoverGroup
metadata:
  name: kt-event-db-fg
spec:
  primaryServer: kt-event-db-primary
  partnerServers:
  - name: kt-event-db-secondary
    location: Korea South
  readWriteEndpoint:
    failoverPolicy: Automatic
    failoverWithDataLossGracePeriodMinutes: 60
  readOnlyEndpoint:
    failoverPolicy: Enabled
  databases:
  - kt_event_marketing
---
# Redis Cache Failover
apiVersion: cache.azure.com/v1beta1
kind: RedisCache
metadata:
  name: kt-event-cache-secondary
spec:
  location: Korea South
  sku:
    name: Premium
    capacity: P2
  redisConfiguration:
    rdb-backup-enabled: "true"
    rdb-backup-frequency: "60"
    rdb-backup-max-snapshot-count: "1"
```

### 10.2 Business Continuity

#### 10.2.1 Operating Procedures

```yaml
# Incident response procedure
incidentResponse:
  severity1_critical:
    responseTime: "15 minutes"
    escalation: "CTO, development team lead"
    communicationChannel: "Slack #incident-critical"
    actions:
      - "Verify autoscaling"
      - "Review failover"
      - "Prepare customer notice"

  severity2_high:
    responseTime: "30 minutes"
    escalation: "Development team lead, infrastructure team"
    communicationChannel: "Slack #incident-high"

  maintenanceWindow:
    schedule: "Every Sunday 02:00-04:00"
    duration: "2 hours"
    approvalRequired: true

  changeManagement:
    approvalProcess: "2-person approval"
    testingRequired: true
    rollbackPlan: "mandatory"
```

## 11. Cost Optimization

### 11.1 Production Cost Structure

#### 11.1.1 Monthly Cost Analysis

| Component | Specification | Estimated monthly cost (USD) | Optimization |
|-----------|---------------|------------------------------|--------------|
| **AKS cluster** | Standard tier | $75 | - |
| **VM nodes** | 3 x D2s_v3 + 6 x D4s_v3 (1-year Reserved) | $650 | 30% Reserved Instance discount |
| **Application Gateway** | WAF_v2 + autoscaling | $200 | Traffic-based optimization |
| **PostgreSQL** | GP_Standard_D4s_v3 + Replica | $450 | Reserved discount, read replica right-sizing |
| **Redis Cache** | Premium P2 | $300 | Usage-based scaling |
| **Service Bus** | Premium 1 Unit | $700 | Sized to message throughput |
| **Storage** | 2TB Premium + backups | $150 | Lifecycle policies |
| **Network** | Traffic + Private Endpoint | $200 | CDN cache optimization |
| **Monitoring** | Log Analytics + App Insights | $100 | Data retention policy |
| **Total estimated cost** | - | **$2,825** | **Up to 30% savings with Reserved Instances** |

#### 11.1.2 Cost Optimization Strategy

```yaml
# Cost optimization strategy
costOptimization:
  computing:
    reservedInstances:
      commitment: "1 year"
      savings: "30%"
      targetServices: ["VM", "PostgreSQL"]
    autoScaling:
      schedule: "Based on business hours"
      metrics: ["CPU", "Memory", "Custom"]
      savings: "20%"

  storage:
    lifecyclePolicy:
      hotTier: "30 days"
      coolTier: "90 days"
      archiveTier: "1 year"
      savings: "40%"
    compression:
      enabled: true
      savings: "25%"

  network:
    cdnOptimization:
      cacheHitRatio: ">90%"
      savings: "50%"
    privateEndpoints:
      dataTransferSavings: "60%"
```

### 11.2 Cost Efficiency Relative to Performance

#### 11.2.1 Auto Scaling Optimization

```yaml
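# Fast scale-up settings driven by CPU and a request-rate metric (HPA reacts to observed load)
# Intended as a replacement for event-service-hpa: two HPAs must not target the same Deployment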
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: predictive-scaling-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: event-service
  minReplicas: 3
  maxReplicas: 15
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 30
      policies:
      - type: Percent
        value: 100
        periodSeconds: 15
      - type: Pods
        value: 4
        periodSeconds: 15
      selectPolicy: Max
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 50
        periodSeconds: 60
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: "1k"
```

## 12. Operations Guide

### 12.1 Routine Operating Procedures

#### 12.1.1 Regular Check Items

```yaml
# Operations checklist
operationalChecklist:
  daily:
    - name: "Check health status"
      command: "kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded"
      expected: "No pods listed"
    - name: "Check resource utilization"
      command: "kubectl top nodes && kubectl top pods -A"
      threshold: "CPU 70%, Memory 80%"
    - name: "Check error logs"
      query: "ContainerLogs_CL | where LogLevel_s == 'ERROR'"
      timeRange: "Last 24 hours"

  weekly:
    - name: "Verify backup status"
      service: "PostgreSQL, Redis"
      retention: "35 days"
    - name: "Review security updates"
      scope: "Node images, container images"
      action: "Apply security patches"
    - name: "Analyze performance trends"
      metrics: "Response time, throughput, error rate"
      comparison: "Versus the previous week"

  monthly:
    - name: "Analyze and optimize costs"
      scope: "Entire infrastructure"
      report: "Monthly cost report"
    - name: "Plan capacity"
      forecast: "3-month outlook"
      action: "Resource expansion plan"
```

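### 12.2 Incident Response

#### 12.2.1 Failure Response Procedure

```yaml
# Response procedure by severity
incidentManagement:
  severity1_critical:
    definition: "Complete service outage"
    responseTeam: ["CTO", "Development team lead", "SRE team"]
    responseTime: "15 minutes"
    communication:
      internal: "Slack #incident-war-room"
      external: "Customer notification system"
    actions:
      - step1: "Verify automatic failover"
      - step2: "Reconfigure traffic routing"
      - step3: "Scale up manually"
      - step4: "Perform root cause analysis"

  severity2_high:
    definition: "Partial feature failure"
    responseTeam: ["Development team lead", "Owning service developer"]
    responseTime: "30 minutes"
    escalation: "Escalate to Severity 1 if unresolved within 1 hour"

  severity3_medium:
    definition: "Performance degradation"
    responseTeam: ["Owning service developer"]
    responseTime: "2 hours"
    monitoring: "Intensified continuous monitoring"
```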

#### 12.2.2 Automatic Recovery Mechanisms

```yaml
# Automatic recovery settings
autoRecovery:
  podRestart:
    livenessProbe:
      httpGet:
        path: /actuator/health
        port: 8080
      initialDelaySeconds: 60
      periodSeconds: 30
      timeoutSeconds: 10
      failureThreshold: 3
    restartPolicy: Always

  nodeReplacement:
    trigger: "Node failure detected"
    action: "Automatic node replacement"
    timeLimit: "10 minutes"

  trafficRerouting:
    healthCheck:
      interval: "10 seconds"
      unhealthyThreshold: 3
    action: "Automatic traffic rerouting"
    rollback: "Automatic restoration once health checks pass"
```

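## 13. Expansion Plan

### 13.1 Phased Expansion Roadmap

#### 13.1.1 Phase 1-3

```yaml
# Three-phase expansion plan
scalingRoadmap:
  phase1_foundation:
    period: "0-6 months"
    target: "Stable service launch"
    objectives:
      - "Complete the base infrastructure"
      - "Establish the monitoring system"
      - "Support 10,000 initial users"
    deliverables:
      - "Production deployment"
      - "CI/CD pipeline"
      - "Baseline security controls"

  phase2_growth:
    period: "6-12 months"
    target: "Handle user growth"
    objectives:
      - "Support 50,000 users"
      - "Optimize performance"
      - "Prepare for global service"
    deliverables:
      - "Multi-region deployment"
      - "CDN optimization"
      - "Advanced monitoring"

  phase3_scale:
    period: "12-24 months"
    target: "Operate at large scale"
    objectives:
      - "Support 100,000+ users"
      - "Advance AI features"
      - "Complete the global service"
    deliverables:
      - "Multi-cloud configuration"
      - "Edge computing adoption"
      - "Real-time AI recommendations"
```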

### 13.2 Technical Scalability

#### 13.2.1 Horizontal Scaling Strategy

```yaml
# Scaling strategy by tier
horizontalScaling:
  application:
    currentCapacity: "2-15 replicas depending on the service"
    maxCapacity: "50 replicas per service"
    scalingTrigger: "CPU 70%, Memory 80%"
    estimatedUsers: "100,000 concurrent users"

  database:
    currentSetup: "Primary + Read Replica"
    scalingPath:
      - step1: "Add read replicas (up to 5)"
      - step2: "Introduce sharding (per service)"
      - step3: "Cross-region replication"
    estimatedCapacity: "1 million transactions/day"

  cache:
    currentSetup: "Premium P2 (13 GB)"
    scalingPath:
      - step1: "Scale up to P4 (53 GB)"
      - step2: "Add shards in cluster mode"
      - step3: "Regional cache clusters"
    estimatedCapacity: "1M ops/sec"
```

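## 14. Production Environment Summary

**Core design principles**:
- **High availability first**: Multi-Zone deployment and managed services to reach 99.9% availability
- **Hardened security**: Enterprise-grade security through a multi-layer architecture
- **Observability**: Comprehensive monitoring to detect and respond to problems early
- **Automation**: Fully automated scaling, backup, and recovery
- **Cost efficiency**: Cost optimization through Reserved Instances and autoscaling

**Key performance targets**:
- **Availability**: 99.9% (at most about 8.76 hours of downtime per year)
- **Performance**: Average response time of 200 ms or less, with support for 100,000 concurrent users
- **Scalability**: Automatic scaling for 2-10x traffic
- **Security**: Zero security incidents and full data encryption
- **Recovery**: RTO of 1 hour and RPO of 15 minutes or less

**Optimization targets**:
- **Performance**: Cache hit ratio of 90%+ and faster global response times through the CDN
- **Cost**: 30% savings from Reserved Instances and a further 20% from autoscaling
- **Operational efficiency**: 80% of operations automated, with automatic incident detection and response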

---

**Document version**: v1.0
**Last modified**: 2025-10-29
**Author**: System Architect (박영자, "Expert Architect")
**Reviewer**: DevOps Engineer (송근정, "DevOps Master")