Crash recovery architecture
Overview
Content Analytics uses PersistentHitQueue to protect against data loss during the batching window (0-5 seconds). Events are written to disk as soon as they are tracked. On the next app launch, any persisted events are loaded from disk into memory for processing and only then cleared from disk, so each event is held in memory before its disk copy is removed.
How It Works
User tracks event
  └─> Event added to memory + disk (crash-safe)
       │
       └─> Batching (0-5 seconds)
            │
            └─> Flush triggered
                 │
                 ├─> Process accumulated events
                 ├─> Calculate aggregated metrics
                 └─> Dispatch to Edge Network (Edge guarantees delivery)
Architecture Components
BatchCoordinator
Responsibilities:
- Manages batching logic (count threshold and time-based flush)
- Writes incoming events to disk immediately via PersistentHitQueue
- Maintains in-memory event counters
- Triggers flush when a threshold is reached (10 events, or a 5-second maximum wait; the periodic flush timer defaults to 2 seconds)
- Coordinates between DirectHitProcessor and ContentAnalyticsOrchestrator
Key Methods:
Android
fun addAssetEvent(event: Event)
  ├─> assetHitProcessor.accumulateEvent(event)   // Add to memory
  ├─> persistEventImmediately(event, queue)      // Write to disk
  └─> checkAndFlushIfNeeded()                    // Check thresholds
suspend fun performFlush()
  ├─> val events = assetHitProcessor.processAccumulatedEvents()
  └─> [Orchestrator processes events → dispatches to Edge]
       └─> Edge guarantees delivery from here
iOS
func addAssetEvent(_ event: Event)
  ├─> assetHitProcessor.accumulateEvent(event)     // Add to memory
  ├─> persistEventImmediately(event, to: queue)    // Write to disk
  └─> checkAndFlushIfNeeded()                      // Check thresholds
func performFlush()
  ├─> let events = assetHitProcessor.processAccumulatedEvents()
  └─> [Orchestrator processes events → dispatches to Edge]
       └─> Edge guarantees delivery from here
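The body of checkAndFlushIfNeeded() is not shown here. A minimal sketch of the decision it would have to make, assuming a flush fires when either the count threshold or the maximum wait is reached. FlushPolicy and its parameter names are hypothetical, not SDK types; the clock is passed in explicitly so the logic can be tested without timers.

```kotlin
// Hypothetical policy object; the defaults mirror the documented 10 events / 5000 ms.
class FlushPolicy(
    private val maxBatchSize: Int = 10,
    private val maxWaitMillis: Long = 5_000
) {
    // True when the batch is full or the oldest pending event has waited long enough.
    fun shouldFlush(pendingCount: Int, oldestEventAtMillis: Long, nowMillis: Long): Boolean {
        if (pendingCount == 0) return false // nothing to send
        val waitedMillis = nowMillis - oldestEventAtMillis
        return pendingCount >= maxBatchSize || waitedMillis >= maxWaitMillis
    }
}
```

With the defaults, three events that have waited one second do not trigger a flush, while ten events, or three events that have waited five seconds, do.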
DirectHitProcessor
Responsibilities:
- Implements HitProcessing protocol for PersistentHitQueue integration
- Accumulates events in memory for fast batching
- On recovery: loads events from disk into memory, then clears disk (no data loss)
Event Lifecycle:
Android
override suspend fun processHit(entity: DataEntity): Boolean
├─> Decode event from disk
├─> Accumulate in memory (if not already present)
└─> return true → clear from disk (event now in memory)
iOS
func processHit(entity: DataEntity, completion: (Bool) -> Void)
├─> Decode event from disk
├─> Accumulate in memory (if not already present)
└─> completion(true) → clear from disk (event now in memory)
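The contract above (accumulate only if absent, then report success so the stored copy can be deleted) can be sketched in isolation. StoredHit, RecoveringProcessor, and drain are hypothetical, simplified stand-ins for illustration; the real DataEntity and queue types live in AEPServices and are not reproduced here.

```kotlin
// Hypothetical, simplified stand-in for a persisted hit.
data class StoredHit(val id: String, val payload: String)

class RecoveringProcessor {
    private val accumulated = LinkedHashMap<String, String>()

    // Returns true once the event is in memory, signalling that the stored copy may be deleted.
    fun processHit(hit: StoredHit): Boolean {
        if (hit.id !in accumulated) accumulated[hit.id] = hit.payload // skip duplicates
        return true
    }

    fun pendingIds(): List<String> = accumulated.keys.toList()
}

// Drains stored hits in order, deleting each one only after the processor accepts it.
fun drain(stored: MutableList<StoredHit>, processor: RecoveringProcessor) {
    val iterator = stored.iterator()
    while (iterator.hasNext()) {
        if (processor.processHit(iterator.next())) iterator.remove() else break
    }
}
```

Because accumulation is keyed by event ID, replaying a hit that is already in memory leaves the buffer unchanged.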
PersistentHitQueue (AEPServices)
Provides:
- Two separate queues: asset.events and experience.events
- SQLite-backed persistence (survives crashes, force-quit, background termination)
- Automatic processing via beginProcessing()
- Thread-safe operations
Storage:
- Events encoded as JSON via Event: Codable
- Each event wrapped with type metadata (asset or experience)
- Unique identifier: event.id.uuidString
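As a rough illustration of the wrapping described above, the sketch below pairs each stored event with its type and uses that type to pick a queue. Only the queue names asset.events and experience.events come from this document; HitType, HitEnvelope, and routeByQueue are hypothetical names, and the JSON payload is treated as an opaque string.

```kotlin
// Hypothetical envelope types; only the two queue names are taken from the document.
enum class HitType(val queueName: String) {
    ASSET("asset.events"),
    EXPERIENCE("experience.events")
}

data class HitEnvelope(val type: HitType, val eventId: String, val eventJson: String)

// Groups event IDs by the queue that their type metadata points to.
fun routeByQueue(envelopes: List<HitEnvelope>): Map<String, List<String>> =
    envelopes.groupBy({ it.type.queueName }, { it.eventId })
```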
Detailed Timeline Example
Time   │ Event                                │ Memory │ Disk │ Safe?
───────┼──────────────────────────────────────┼────────┼──────┼───────
00.00s │ User views Asset A                   │   ✓    │  ✓   │ ✅ YES
00.01s │ Event written to disk                │   ✓    │  ✓   │ ✅ YES
00.50s │ User clicks Asset B                  │   ✓    │  ✓   │ ✅ YES
01.00s │ User clicks Asset B                  │   ✓    │  ✓   │ ✅ YES
       │ [Batching window - events on disk]   │        │      │
02.00s │ Timer fires → Flush triggered        │   ✓    │  ✓   │ ✅ YES
02.01s │ Process accumulated events           │   ✓    │  ✓   │ ✅ YES
02.02s │ Calculate metrics (1 view, 2 clicks) │   ✓    │  ✓   │ ✅ YES
02.03s │ Dispatch to Edge Network             │   ✗    │  ✗   │ ✅ YES*
       │ (*Edge guarantees delivery)          │        │      │
Legend:
✓ = Present
✗ = Not present
Events stay on disk during the entire batching window. Once events are handed off to Edge, the Edge extension's own persistence takes over.
Crash Scenarios
Scenario 1: Crash During Batching (0-5s window)
Status: Events in memory + disk
Crash: ⚡ App terminated
└─> Memory lost ✗
└─> Disk persists ✓
Recovery on Next Launch:
1. PersistentHitQueue.beginProcessing() starts
2. DirectHitProcessor.processHit() called for each persisted event
3. Events accumulated in memory, cleared from disk
4. Normal batch processing resumes
Result: ✅ ZERO DATA LOSS
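Scenario 1 can be simulated without a real crash by discarding one session object and creating another over the same store. This is a sketch under stated assumptions: FakeDisk stands in for the SQLite-backed queue and AppSession for the in-memory state of one process lifetime; both names are hypothetical.

```kotlin
// Hypothetical durable store that outlives a session; the real one is SQLite-backed.
class FakeDisk {
    private val rows = LinkedHashMap<String, String>()
    fun write(id: String, payload: String) { rows[id] = payload }
    fun remove(id: String) { rows.remove(id) }
    fun snapshot(): Map<String, String> = LinkedHashMap(rows)
}

// Hypothetical session: its memory map is what a crash would wipe out.
class AppSession(private val disk: FakeDisk) {
    val memory = LinkedHashMap<String, String>()

    fun track(id: String, payload: String) {
        memory[id] = payload    // in-memory copy used for batching
        disk.write(id, payload) // durable copy used for crash recovery
    }

    // Recovery steps 1-3: load each persisted event into memory, then clear it from disk.
    fun recover() {
        for ((id, payload) in disk.snapshot()) {
            if (id !in memory) memory[id] = payload
            disk.remove(id)
        }
    }
}
```

Tracking two events in one AppSession, dropping it, and calling recover() on a fresh AppSession over the same FakeDisk restores both events to memory and leaves the store empty.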
Scenario 2: Crash During Flush
Status: Events being processed
Crash: ⚡ App terminated mid-dispatch
└─> Memory lost ✗
└─> Events may still be on disk if not yet processed
Recovery on Next Launch:
1. Any remaining events on disk are recovered
2. Re-accumulated and dispatched on next flush
Result: ✅ ZERO DATA LOSS (possible duplicate if crash after Edge dispatch)
Scenario 3: Crash After Edge Dispatch
Status: Events dispatched to Edge
Crash: ⚡ App terminated
└─> Disk already cleared during processHit()
└─> Edge has the events
Result: ✅ ZERO DATA LOSS - Edge guarantees delivery
Edge Network Handoff
Once events are dispatched to Edge extension:
ContentAnalytics → runtime.dispatch(event) → Event Hub → Edge Extension
                                                           └─> Edge.PersistentHitQueue
                                                                └─> Network retries
                                                                     └─> Exponential backoff
Handoff Point: After eventDispatcher.dispatch() completes, Edge extension owns persistence.
Edge Guarantees: Once Edge receives the event, it handles persistence, retries, and delivery confirmation.
Metrics Calculation
Metrics are derived from events, not stored separately:
Android
// On flush (ContentAnalyticsOrchestrator.kt)
private fun buildAssetMetricsCollection(events: List<Event>): AssetMetricsCollection {
    // Group events by asset key; events without a key share the "" bucket
    val groupedEvents = events.groupBy { it.assetKey ?: "" }
    val metricsMap = mutableMapOf<String, AssetMetrics>()
    // "group" avoids shadowing the "events" parameter
    for ((key, group) in groupedEvents) {
        val views = group.count { it.interactionType == InteractionType.VIEW }
        val clicks = group.count { it.interactionType == InteractionType.CLICK }
        metricsMap[key] = AssetMetrics(viewCount = views, clickCount = clicks, ...)
    }
    return AssetMetricsCollection(metricsMap)
}
iOS
// On flush (ContentAnalyticsOrchestrator.swift)
func buildAssetMetricsCollection(from events: [Event]) -> AssetMetricsCollection {
    // Group events by asset key; events without a key share the "" bucket
    let groupedEvents = Dictionary(grouping: events) { $0.assetKey ?? "" }
    var metricsMap: [String: AssetMetrics] = [:]
    // "group" avoids shadowing the "events" parameter
    for (key, group) in groupedEvents {
        let views = group.filter { $0.interactionType == .view }.count
        let clicks = group.filter { $0.interactionType == .click }.count
        metricsMap[key] = AssetMetrics(viewCount: views, clickCount: clicks, ...)
    }
    return AssetMetricsCollection(metrics: metricsMap)
}
Deriving metrics from events avoids state synchronization issues: only events are stored, and counts are computed at flush time. If the app crashes, the recovered events produce the same metrics.
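The point that recovered events reproduce the same metrics follows from the calculation being a pure function of the event list. A self-contained sketch, using hypothetical minimal types (TrackedEvent, AssetCounts) in place of the SDK's Event and AssetMetrics:

```kotlin
// Hypothetical minimal types; the SDK's own classes carry more fields.
enum class InteractionType { VIEW, CLICK }
data class TrackedEvent(val assetKey: String, val type: InteractionType)
data class AssetCounts(val views: Int, val clicks: Int)

// Pure function: the same set of events yields the same counts, whatever their order.
fun countByAsset(events: List<TrackedEvent>): Map<String, AssetCounts> =
    events.groupBy { it.assetKey }.mapValues { (_, group) ->
        AssetCounts(
            views = group.count { it.type == InteractionType.VIEW },
            clicks = group.count { it.type == InteractionType.CLICK }
        )
    }
```

Since no counter is persisted alongside the events, there is nothing that can drift out of step with them after a crash.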
Configuration
{
  "contentanalytics.batchingEnabled": true,
  "contentanalytics.maxBatchSize": 10,
  "contentanalytics.batchFlushInterval": 2000
}
Parameters:
- maxBatchSize: Event count threshold (default: 10)
- batchFlushInterval: Timer interval for periodic flush in milliseconds (default: 2000 ms = 2s). Max wait time is derived from this (2.5× = 5000 ms).
- batchingEnabled: Set to false for immediate dispatch (no batching)
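The relationship between the flush interval and the derived maximum wait can be written down directly. The sketch below is illustrative only: BatchingConfig and parseBatchingConfig are hypothetical names, and the configuration is assumed to arrive as a loosely typed map keyed by the three documented keys.

```kotlin
// Hypothetical holder for the three documented keys; not the SDK's own configuration type.
data class BatchingConfig(
    val batchingEnabled: Boolean = true,
    val maxBatchSize: Int = 10,
    val batchFlushIntervalMillis: Long = 2_000
) {
    // Max wait is documented as 2.5x the flush interval (2000 ms -> 5000 ms).
    val maxWaitMillis: Long get() = batchFlushIntervalMillis * 5 / 2
}

// Reads the documented keys from a loosely typed map, falling back to the defaults.
fun parseBatchingConfig(raw: Map<String, Any?>): BatchingConfig {
    val defaults = BatchingConfig()
    return BatchingConfig(
        batchingEnabled = raw["contentanalytics.batchingEnabled"] as? Boolean
            ?: defaults.batchingEnabled,
        maxBatchSize = (raw["contentanalytics.maxBatchSize"] as? Number)?.toInt()
            ?: defaults.maxBatchSize,
        batchFlushIntervalMillis = (raw["contentanalytics.batchFlushInterval"] as? Number)?.toLong()
            ?: defaults.batchFlushIntervalMillis
    )
}
```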
Performance Characteristics
Memory Usage: With default batch size (10), worst-case memory is ~20-40KB (negligible).
Network Efficiency: With the default batch size (10), batching reduces Edge Network calls by up to 10x for high-volume tracking.
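The reduction factor is simple ceiling division, assuming every flush is triggered by the count threshold; timer-triggered flushes send smaller batches, so the real factor can be lower. dispatchCount is a hypothetical helper used only for this arithmetic.

```kotlin
// Number of Edge dispatches for a given event count, assuming every flush is count-triggered.
fun dispatchCount(eventCount: Int, maxBatchSize: Int): Int {
    require(eventCount >= 0 && maxBatchSize > 0) { "counts must be non-negative, batch size positive" }
    return (eventCount + maxBatchSize - 1) / maxBatchSize // ceiling division
}
```

For 100 events, a batch size of 1 gives 100 dispatches and a batch size of 10 gives 10.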
Thread Safety
On Android, all operations run in Kotlin coroutines, with a Mutex guarding shared state:
Android
// BatchCoordinator
private val scope = CoroutineScope(Dispatchers.IO + SupervisorJob())
private val stateMutex = kotlinx.coroutines.sync.Mutex()
// DirectHitProcessor
private val mutex = Mutex()
// All state mutations wrapped in mutex.withLock { }
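The "every mutation inside the lock" pattern can be shown with the standard library alone. Note the substitution: this sketch uses java.util.concurrent.locks.ReentrantLock and plain threads instead of the coroutine Mutex named above, purely to avoid a kotlinx.coroutines dependency. GuardedEventCounter and recordConcurrently are hypothetical names.

```kotlin
import java.util.concurrent.locks.ReentrantLock
import kotlin.concurrent.thread
import kotlin.concurrent.withLock

// Hypothetical counter; every read and write of count happens while holding the lock.
class GuardedEventCounter {
    private val lock = ReentrantLock()
    private var count = 0

    fun increment() {
        lock.withLock { count += 1 }
    }

    fun current(): Int = lock.withLock { count }
}

// Starts several threads that each record the same number of events, then waits for all of them.
fun recordConcurrently(counter: GuardedEventCounter, threads: Int, perThread: Int) {
    val workers = List(threads) {
        thread { repeat(perThread) { counter.increment() } }
    }
    workers.forEach { it.join() }
}
```

Without the lock, concurrent increments could interleave and lose updates; with it, the final count equals threads × perThread.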
Testing Crash Recovery
Test 1: Crash During Batching
- Track 5 asset events
- DO NOT wait for flush timer
- Force-quit app (⌘+Q or kill process)
- Relaunch app
- Track 5 more asset events
- Wait 2 seconds for flush
- Verify: 1 Edge event with 10 aggregated interactions
Test 2: Crash During Flush
- Track 10 asset events (triggers immediate flush)
- Set breakpoint in sendToEdge()
- Force-quit app at breakpoint
- Relaunch app
- Wait 5 seconds
- Verify: Events re-dispatched (possible duplicate)
Test 3: Background Termination
- Track events
- Background app
- OS terminates app (memory pressure)
- Relaunch app
- Verify: Events recovered and dispatched
Implementation Details
Key Files
- BatchCoordinator.swift - Batching logic and persistence coordination
- DirectHitProcessor.swift - Crash recovery and event accumulation
- ContentAnalyticsOrchestrator.swift - Metrics calculation and Edge dispatch
- PersistentHitQueue (AEPServices) - SQLite-backed queue
Thread Safety
- On iOS, all operations use serial dispatch queues:
  - batchQueue (BatchCoordinator) - batch operations
  - queue (DirectHitProcessor) - hit processing
Data Flow
Event tracked
  └─> BatchCoordinator.addAssetEvent()
       ├─> DirectHitProcessor.accumulateEvent() [memory]
       ├─> PersistentHitQueue.queue() [disk]
       └─> checkAndFlushIfNeeded()
            └─> performFlush()
                 └─> DirectHitProcessor.processAccumulatedEvents()
                      └─> Orchestrator.processAssetEvents()
                           └─> EventDispatcher.dispatch() [→ Edge]
Callback Chain Architecture
The SDK uses a callback chain to decouple components while maintaining type safety:
┌─────────────────────────────────────────────────────────────────────────────┐
│                            INITIALIZATION PHASE                             │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  ContentAnalyticsFactory.createOrchestrator()                               │
│    │                                                                        │
│    ├─> Creates BatchCoordinator(assetQueue, experienceQueue, state)         │
│    │     └─> DirectHitProcessor initialized with no-op callbacks            │
│    │                                                                        │
│    ├─> Creates ContentAnalyticsOrchestrator(batchCoordinator, ...)          │
│    │                                                                        │
│    └─> Wires callbacks: batchCoordinator.setCallbacks(                      │
│          assetCallback: orchestrator.processAssetEvents,                    │
│          experienceCallback: orchestrator.processExperienceEvents           │
│        )                                                                    │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────────┐
│                              RUNTIME DATA FLOW                              │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  User calls ContentAnalytics.trackAssetInteraction()                        │
│           │                                                                 │
│           v                                                                 │
│  ┌──────────────────┐                                                       │
│  │ BatchCoordinator │                                                       │
│  │ addAssetEvent()  │──────────────────────────────────────────┐            │
│  └────────┬─────────┘                                          │            │
│           │                                                    │            │
│           v                                                    v            │
│  ┌────────────────────┐                             ┌─────────────────────┐ │
│  │ DirectHitProcessor │                             │ PersistentHitQueue  │ │
│  │ accumulateEvent()  │                             │ queue() [disk]      │ │
│  │ [memory buffer]    │                             └─────────────────────┘ │
│  └────────┬───────────┘                                                     │
│           │                                                                 │
│           │ (on flush trigger: count >= 10 or timer >= 2s)                  │
│           v                                                                 │
│  ┌────────────────────────────┐                                             │
│  │ DirectHitProcessor         │                                             │
│  │ processAccumulatedEvents() │                                             │
│  └────────┬───────────────────┘                                             │
│           │                                                                 │
│           │ invokes processingCallback([events])                            │
│           v                                                                 │
│  ┌─────────────────────────────────┐                                        │
│  │ ContentAnalyticsOrchestrator    │                                        │
│  │ processAssetEvents([events])    │                                        │
│  │   ├─> Group by asset key        │                                        │
│  │   ├─> Calculate metrics         │                                        │
│  │   └─> Build XDM payload         │                                        │
│  └────────┬────────────────────────┘                                        │
│           │                                                                 │
│           v                                                                 │
│  ┌─────────────────────┐                                                    │
│  │ EdgeEventDispatcher │                                                    │
│  │ dispatch()          │──────────────> Edge Network                        │
│  └─────────────────────┘                                                    │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
Callbacks avoid circular dependencies: BatchCoordinator does not need to import the orchestrator. They also make testing easier, because mocks can be injected in place of the real callbacks.
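A stripped-down version of this wiring shows why neither class needs to know the other's type. SimpleBatchCoordinator and RecordingOrchestrator are hypothetical stand-ins, with events reduced to string IDs; only the setCallbacks and processAssetEvents names echo the diagram.

```kotlin
// Hypothetical coordinator: it depends on a function type, not on the orchestrator class.
class SimpleBatchCoordinator {
    private var assetCallback: (List<String>) -> Unit = {} // no-op until wired
    private val pending = mutableListOf<String>()

    fun setCallbacks(assetCallback: (List<String>) -> Unit) {
        this.assetCallback = assetCallback
    }

    fun addAssetEvent(eventId: String) {
        pending.add(eventId)
    }

    // Hands a copy of the pending events to whatever is wired in, then clears the buffer.
    fun flush() {
        val batch = pending.toList()
        pending.clear()
        assetCallback(batch)
    }
}

// Hypothetical orchestrator stand-in that records every batch it is asked to process.
class RecordingOrchestrator {
    val processed = mutableListOf<List<String>>()

    fun processAssetEvents(events: List<String>) {
        processed.add(events)
    }
}
```

In a test, RecordingOrchestrator plays the role of the mock: wire it with coordinator.setCallbacks(assetCallback = orchestrator::processAssetEvents) and inspect processed after a flush.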
Logging
Enable verbose logging to debug crash recovery:
Log.setLogLevel(.trace)
Look for:
[BATCH_PROCESSOR] Accumulated ASSET event | ID: <uuid>
[BATCH_PROCESSOR] Recovered event from disk | Type: asset | ID: <uuid>
[BATCH_PROCESSOR] Processing 5 asset events
Comparison with Edge Extension
Content Analytics batches events for 0-5 seconds before dispatch. Without disk persistence during that window, a crash would lose the batched events. Edge persists each event as soon as it receives it, so it has no comparable unprotected window.
Known Limitations
- No dispatch confirmation: Extensions cannot receive callbacks from Edge to confirm receipt
- Possible duplicates: Crash during Edge dispatch may cause duplicate events (Edge deduplication handles this)
- Memory overhead: Events held in memory + disk during batching (minimal: ~40KB)