Crash recovery architecture

Overview

Content Analytics uses PersistentHitQueue to protect against data loss during the batching window (0-5 seconds). Events are written to disk immediately when tracked. On next app launch, any persisted events are recovered from disk into memory for processing, then cleared from disk (no data loss - events are safely in memory before disk cleanup).

How It Works

User tracks event
  └─> Event added to memory + disk (crash-safe)
      │
      └─> Batching (0-5 seconds)
          │
          └─> Flush triggered
              │
              ├─> Process accumulated events
              ├─> Calculate aggregated metrics
              └─> Dispatch to Edge Network (Edge guarantees delivery)

Architecture Components

BatchCoordinator

Responsibilities:

Key Methods:

Android

fun addAssetEvent(event: Event)
  ├─> assetHitProcessor.accumulateEvent(event)      // Add to memory
  ├─> persistEventImmediately(event, queue)         // Write to disk
  └─> checkAndFlushIfNeeded()                       // Check thresholds

suspend fun performFlush()
  ├─> val events = assetHitProcessor.processAccumulatedEvents()
  └─> [Orchestrator processes events → dispatches to Edge]
      └─> Edge guarantees delivery from here

iOS

func addAssetEvent(_ event: Event)
  ├─> assetHitProcessor.accumulateEvent(event)      // Add to memory
  ├─> persistEventImmediately(event, to: queue)    // Write to disk
  └─> checkAndFlushIfNeeded()                       // Check thresholds

func performFlush()
  ├─> let events = assetHitProcessor.processAccumulatedEvents()
  └─> [Orchestrator processes events → dispatches to Edge]
      └─> Edge guarantees delivery from here

DirectHitProcessor

Responsibilities:

Event Lifecycle:

Android

override suspend fun processHit(entity: DataEntity): Boolean
  ├─> Decode event from disk
  ├─> Accumulate in memory (if not already present)
  └─> return true → clear from disk (event now in memory)

iOS

func processHit(entity: DataEntity, completion: (Bool) -> Void)
  ├─> Decode event from disk
  ├─> Accumulate in memory (if not already present)
  └─> completion(true) → clear from disk (event now in memory)

PersistentHitQueue (AEPServices)

Provides:

Storage:

Detailed Timeline Example

Time   │ Event                                │ Memory │ Disk │ Safe?
───────┼──────────────────────────────────────┼────────┼──────┼───────
00.00s │ User views Asset A                   │ ✓      │ ✓    │ ✅ YES
00.01s │ Event written to disk                │ ✓      │ ✓    │ ✅ YES
00.50s │ User clicks Asset B                  │ ✓      │ ✓    │ ✅ YES
01.00s │ User clicks Asset B                  │ ✓      │ ✓    │ ✅ YES
       │ [Batching window - events on disk]   │        │      │
02.00s │ Timer fires → Flush triggered        │ ✓      │ ✓    │ ✅ YES
02.01s │ Process accumulated events           │ ✓      │ ✓    │ ✅ YES
02.02s │ Calculate metrics (1 view, 2 clicks) │ ✓      │ ✓    │ ✅ YES
02.03s │ Dispatch to Edge Network             │ ✗      │ ✗    │ ✅ YES*
       │ (*Edge guarantees delivery)          │        │      │

Legend:
✓ = Present
✗ = Not present

Events stay on disk during the entire batching window. Once events are handed off to Edge, their persistence takes over.

Crash Scenarios

Scenario 1: Crash During Batching (0-5s window)

Status: Events in memory + disk
Crash:  ⚡ App terminated
        └─> Memory lost ✗
        └─> Disk persists ✓

Recovery on Next Launch:
1. PersistentHitQueue.beginProcessing() starts
2. DirectHitProcessor.processHit() called for each persisted event
3. Events accumulated in memory, cleared from disk
4. Normal batch processing resumes

Result: ✅ ZERO DATA LOSS

Scenario 2: Crash During Flush

Status: Events being processed
Crash:  ⚡ App terminated mid-dispatch
        └─> Memory lost ✗
        └─> Events may still be on disk if not yet processed

Recovery on Next Launch:
1. Any remaining events on disk are recovered
2. Re-accumulated and dispatched on next flush

Result: ✅ ZERO DATA LOSS (possible duplicate if crash after Edge dispatch)

Scenario 3: Crash After Edge Dispatch

Status: Events dispatched to Edge
Crash:  ⚡ App terminated
        └─> Disk already cleared during processHit()
        └─> Edge has the events

Result: ✅ ZERO DATA LOSS - Edge guarantees delivery

Edge Network Handoff

Once events are dispatched to Edge extension:

ContentAnalytics → runtime.dispatch(event) → Event Hub → Edge Extension
                                                           └─> Edge.PersistentHitQueue
                                                               └─> Network retries
                                                               └─> Exponential backoff

Handoff Point: After eventDispatcher.dispatch() completes, Edge extension owns persistence.

Edge Guarantees: Once Edge receives the event, it handles persistence, retries, and delivery confirmation.

Metrics Calculation

Metrics are derived from events, not stored separately:

Android

// On flush (ContentAnalyticsOrchestrator.kt)
private fun buildAssetMetricsCollection(events: List<Event>): AssetMetricsCollection {
    val groupedEvents = events.groupBy { it.assetKey ?: "" }
    val metricsMap = mutableMapOf<String, AssetMetrics>()
    
    for ((key, events) in groupedEvents) {
        val views = events.count { it.interactionType == InteractionType.VIEW }
        val clicks = events.count { it.interactionType == InteractionType.CLICK }
        metricsMap[key] = AssetMetrics(viewCount = views, clickCount = clicks, ...)
    }
    
    return AssetMetricsCollection(metricsMap)
}

iOS

// On flush (ContentAnalyticsOrchestrator.swift)
func buildAssetMetricsCollection(from events: [Event]) -> AssetMetricsCollection {
    let groupedEvents = Dictionary(grouping: events) { $0.assetKey ?? "" }
    var metricsMap: [String: AssetMetrics] = [:]
    
    for (key, events) in groupedEvents {
        let views = events.filter { $0.interactionType == .view }.count
        let clicks = events.filter { $0.interactionType == .click }.count
        metricsMap[key] = AssetMetrics(viewCount: views, clickCount: clicks, ...)
    }
    
    return AssetMetricsCollection(metrics: metricsMap)
}

This avoids state sync issues. Just events are counted on flush. If the app crashes, the restored events give the same metrics.

Configuration

{
  "contentanalytics.batchingEnabled": true,
  "contentanalytics.maxBatchSize": 10,
  "contentanalytics.batchFlushInterval": 2000
}

Parameters:

Performance Characteristics

Operation
Time
Notes
Event persistence
~1-2ms
SQLite write
Event recovery
~5-10ms
SQLite read on launch
Batch flush
~10-20ms
Metrics calculation + Edge dispatch
Memory per event
~2KB
Event object + metadata
Disk per event
~1-2KB
JSON encoding

Memory Usage: With default batch size (10), worst-case memory is ~20-40KB (negligible).

Network Efficiency: Batching reduces Edge Network calls by 10x for high-volume tracking.

Thread Safety

All operations use Kotlin coroutines with Mutex for thread-safe access:

Android

// BatchCoordinator
private val scope = CoroutineScope(Dispatchers.IO + SupervisorJob())
private val stateMutex = kotlinx.coroutines.sync.Mutex()

// DirectHitProcessor
private val mutex = Mutex()

// All state mutations wrapped in mutex.withLock { }

Testing Crash Recovery

Test 1: Crash During Batching

  1. Track 5 asset events
  2. DO NOT wait for flush timer
  3. Force-quit app (⌘+Q or kill process)
  4. Relaunch app
  5. Track 5 more asset events
  6. Wait 2 seconds for flush
  7. Verify: 1 Edge event with 10 aggregated interactions

Test 2: Crash During Flush

  1. Track 10 asset events (triggers immediate flush)
  2. Set breakpoint in sendToEdge()
  3. Force-quit app at breakpoint
  4. Relaunch app
  5. Wait 5 seconds
  6. Verify: Events re-dispatched (possible duplicate)

Test 3: Background Termination

  1. Track events
  2. Background app
  3. OS terminates app (memory pressure)
  4. Relaunch app
  5. Verify: Events recovered and dispatched

Implementation Details

Key Files

Thread Safety

Data Flow

Event tracked
  └─> BatchCoordinator.addAssetEvent()
      ├─> DirectHitProcessor.accumulateEvent()  [memory]
      ├─> PersistentHitQueue.queue()            [disk]
      └─> checkAndFlushIfNeeded()
          └─> performFlush()
              └─> DirectHitProcessor.processAccumulatedEvents()
                  └─> Orchestrator.processAssetEvents()
                      └─> EventDispatcher.dispatch()  [→ Edge]

Callback Chain Architecture

The SDK uses a callback chain to decouple components while maintaining type safety:

┌─────────────────────────────────────────────────────────────────────────────┐
│                           INITIALIZATION PHASE                              │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  ContentAnalyticsFactory.createOrchestrator()                               │
│    │                                                                        │
│    ├─> Creates BatchCoordinator(assetQueue, experienceQueue, state)         │
│    │     └─> DirectHitProcessor initialized with no-op callbacks            │
│    │                                                                        │
│    ├─> Creates ContentAnalyticsOrchestrator(batchCoordinator, ...)          │
│    │                                                                        │
│    └─> Wires callbacks: batchCoordinator.setCallbacks(                      │
│          assetCallback: orchestrator.processAssetEvents,                    │
│          experienceCallback: orchestrator.processExperienceEvents           │
│        )                                                                    │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────────────┐
│                            RUNTIME DATA FLOW                                │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  User calls ContentAnalytics.trackAssetInteraction()                        │
│    │                                                                        │
│    v                                                                        │
│  ┌──────────────────┐                                                       │
│  │ BatchCoordinator │                                                       │
│  │  addAssetEvent() │──────────────────────────────────────────┐            │
│  └────────┬─────────┘                                          │            │
│           │                                                    │            │
│           v                                                    v            │
│  ┌────────────────────┐                           ┌─────────────────────┐   │
│  │ DirectHitProcessor │                           │ PersistentHitQueue  │   │
│  │ accumulateEvent()  │                           │ queue() [disk]      │   │
│  │ [memory buffer]    │                           └─────────────────────┘   │
│  └────────┬───────────┘                                                     │
│           │                                                                 │
│           │ (on flush trigger: count >= 10 or timer >= 2s)                  │
│           v                                                                 │
│  ┌────────────────────────────┐                                             │
│  │ DirectHitProcessor         │                                             │
│  │ processAccumulatedEvents() │                                             │
│  └────────┬───────────────────┘                                             │
│           │                                                                 │
│           │ invokes processingCallback([events])                            │
│           v                                                                 │
│  ┌─────────────────────────────────┐                                        │
│  │ ContentAnalyticsOrchestrator    │                                        │
│  │ processAssetEvents([events])    │                                        │
│  │   ├─> Group by asset key        │                                        │
│  │   ├─> Calculate metrics         │                                        │
│  │   └─> Build XDM payload         │                                        │
│  └────────┬────────────────────────┘                                        │
│           │                                                                 │
│           v                                                                 │
│  ┌───────────────────┐                                                      │
│  │ EdgeEventDispatcher│                                                     │
│  │ dispatch()         │──────────────> Edge Network                         │
│  └───────────────────┘                                                      │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

Callbacks avoid circular dependencies - BatchCoordinator doesn't need to import Orchestrator. Also this makes testing easier since you can inject mocks.

Logging

Enable verbose logging to debug crash recovery:

Log.setLogLevel(.trace)

Look for:

[BATCH_PROCESSOR] Accumulated ASSET event | ID: <uuid>
[BATCH_PROCESSOR] Recovered event from disk | Type: asset | ID: <uuid>
[BATCH_PROCESSOR] Processing 5 asset events

Comparison with Edge Extension

Feature
Content Analytics
Edge Extension
Pre-dispatch persistence
✅ YES (0-5s)
❌ NO
Batching
✅ YES
❌ NO
Post-dispatch persistence
✅ Edge's queue
✅ PersistentHitQueue
Network retries
✅ Edge handles
✅ Exponential backoff
Crash recovery during batch
✅ FULL
N/A

Content Analytics batches events for 0-5 seconds before dispatch. Without disk persistence during that window, crashes would lose data. Edge dispatches immediately so it doesn't need this.

Known Limitations

  1. No dispatch confirmation: Extensions cannot receive callbacks from Edge to confirm receipt
  2. Possible duplicates: Crash during Edge dispatch may cause duplicate events (Edge deduplication handles this)
  3. Memory overhead: Events held in memory + disk during batching (minimal: ~40KB)