
Conversation

@adamjacobmuller commented Dec 3, 2025

Summary

This PR adds an intelligent in-memory snapshot caching system for the /api/frame.jpeg endpoint to eliminate slow response times caused by large keyframe intervals.

Problem

The current /api/frame.jpeg implementation blocks until the next keyframe arrives. When cameras use large keyframe intervals (5-10+ seconds), snapshot requests can take that long to respond:

  • Typical response time: 150ms - 10,000ms (depends on when the next keyframe arrives)
  • Cached response time: <10ms (99% improvement)

This creates a poor user experience in:

  • Home Assistant dashboards (loading spinners, timeouts)
  • Notification thumbnails (delayed or missing images)
  • Mobile apps (perceived as "broken" or "slow")

Additional benefit: Reduces camera load, since multiple clients can share the same cached snapshot instead of each triggering a separate connection.

Solution

Implements background snapshot caching with:

  • Background keyframe consumer that continuously captures frames
  • Always-ready snapshots - no waiting for next keyframe
  • Configurable idle timeout (default: 10 minutes) - stops when not needed
  • Stale detection - auto-restarts if cache gets too old
  • Opt-in by design - clients must request ?cached=true (or configure globally)
  • Graceful fallback - serves fresh snapshot if cache unavailable
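
Below is a minimal Go sketch of the request path these bullets describe. The names handleFrameJPEG, lookupCachedSnapshot, and serveFreshSnapshot are illustrative assumptions, not the PR's actual functions in internal/mjpeg/init.go; the query parameter, response headers, and fallback behavior follow the description above.

package mjpeg

import (
	"net/http"
	"strconv"
	"time"
)

// lookupCachedSnapshot and serveFreshSnapshot stand in for the real cache
// accessor and the existing keyframe-wait path; both names are hypothetical.
var lookupCachedSnapshot func(src string) (jpeg []byte, ts time.Time, ok bool)
var serveFreshSnapshot func(w http.ResponseWriter, r *http.Request, src string)

func handleFrameJPEG(w http.ResponseWriter, r *http.Request) {
	src := r.URL.Query().Get("src")
	useCache := r.URL.Query().Get("cached") == "true" // or snapshot_serve_cached_by_default

	if useCache {
		if jpeg, ts, ok := lookupCachedSnapshot(src); ok {
			age := time.Since(ts)
			w.Header().Set("X-Snapshot-Cached", "true")
			w.Header().Set("X-Snapshot-Age-Ms", strconv.FormatInt(age.Milliseconds(), 10))
			w.Header().Set("X-Snapshot-Timestamp", ts.UTC().Format(time.RFC3339Nano))
			w.Header().Set("Content-Type", "image/jpeg")
			_, _ = w.Write(jpeg)
			return
		}
		// graceful fallback: cache unavailable, fall through to a fresh snapshot
	}

	w.Header().Set("X-Snapshot-Cached", "false")
	serveFreshSnapshot(w, r, src) // traditional behavior: blocks until the next keyframe
}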

Configuration

mjpeg:
  snapshot_cache: true                        # Enable/disable (default: true)
  snapshot_cache_timeout: 600                 # Idle timeout in seconds (default: 600)
  snapshot_serve_cached_by_default: false     # Default behavior (default: false)

Usage

# Request cached snapshot - returns immediately (<10ms)
curl "http://localhost:1984/api/frame.jpeg?src=camera1&cached=true"

# Request fresh snapshot - waits for next keyframe (traditional behavior)
curl "http://localhost:1984/api/frame.jpeg?src=camera1&cached=false"

# Check cache age via response headers
# X-Snapshot-Age-Ms: 1234
# X-Snapshot-Timestamp: 2025-12-03T12:34:56.789Z
# X-Snapshot-Cached: true/false
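
For clients that care about freshness, here is a small illustrative Go client that checks the headers above before trusting the image; the 5-second threshold is an arbitrary example, not part of the PR.

package main

import (
	"io"
	"log"
	"net/http"
	"strconv"
)

func main() {
	resp, err := http.Get("http://localhost:1984/api/frame.jpeg?src=camera1&cached=true")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	ageMs, _ := strconv.Atoi(resp.Header.Get("X-Snapshot-Age-Ms"))
	cached := resp.Header.Get("X-Snapshot-Cached") == "true"
	if cached && ageMs > 5000 {
		// snapshot is older than 5s; re-request with cached=false if freshness matters
	}

	img, _ := io.ReadAll(resp.Body)
	log.Printf("got %d bytes, cached=%v, age=%dms", len(img), cached, ageMs)
}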

Performance Impact

Before (fresh snapshot):

  • First request: 150-2700ms (average keyframe interval)
  • Subsequent requests: 150-2700ms (each waits for keyframe)
  • 10-second keyframe interval = 10-second snapshot delay

After (cached snapshot):

  • First request: <10ms (if cache exists)
  • Subsequent requests: <10ms (served from memory)
  • Eliminates keyframe wait entirely

Memory overhead: ~300KB JPEG per cached stream (only while accessed)

Benefits

  • Instant snapshots - no keyframe waiting
  • Better UX - dashboards/apps feel responsive
  • Reduced camera load - single persistent connection shared by all clients
  • Production-ready - extensive logging at TRACE level for debugging

Implementation Details

  • New file: internal/streams/snapshot_cache.go (223 lines)
  • Modified: internal/mjpeg/init.go - add caching logic to frame handler
  • Modified: internal/streams/stream.go - add cache storage fields
  • Includes fix: pkg/h265/rtp.go - prevent panic from stale buffer pointers

Testing

Tested with:

  • Cameras with 1s, 5s, and 10s keyframe intervals
  • Multiple concurrent clients requesting cached snapshots
  • Idle timeout triggering and cache restart
  • Stale cache detection and recovery
  • H264, H265, and JPEG source codecs

Related Issues

Directly addresses:

May help with:


This change is backward compatible - default behavior is unchanged unless clients opt in with ?cached=true.

Adam Jacob Muller added 5 commits December 3, 2025 11:48

Implements a high-performance snapshot caching system that dramatically
reduces latency for repeated snapshot requests from 150-2700ms to <10ms.

Problem Statement:
- Every /api/frame.jpeg request required waiting for RTSP connection,
  keyframe arrival, and FFmpeg transcoding (150-2700ms total)
- Home Assistant dashboards, motion detection systems, and preview grids
  generate dozens of requests per minute, causing high latency and
  resource usage

Solution:
- Background consumer continuously transcodes keyframes to JPEG
- Snapshots cached in memory (~100-500KB per stream)
- Configurable idle timeout stops producer after inactivity
- Zero waste when producers already running (piggybacks existing streams)
- Cache persists in memory even after timeout for instant resumption

Architecture:
1. Stream-level cache storage (internal/streams/stream.go)
   - Thread-safe JPEG data and timestamp storage
   - RWMutex for concurrent read access

2. Background SnapshotCacher (internal/streams/snapshot_cache.go)
   - Persistent keyframe consumer with idle timeout (600s default)
   - Continuous JPEG transcoding via injected function
   - Graceful shutdown via consumer.Stop() to unblock WriteTo
   - Automatic cleanup on idle timeout or stream termination

3. Enhanced snapshot handler (internal/mjpeg/init.go)
   - Check cache first, serve instantly if available
   - Fall back to fresh snapshot if cache miss or client requests
   - HTTP headers expose cache age/timestamp for client decisions
   - Query parameter override: ?cached=true/false
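
A rough sketch of the stream-level storage in (1): cachedJPEGMu is named in a later commit in this PR, but the other field and method names are assumptions, not the actual code in internal/streams/stream.go.

package streams

import (
	"sync"
	"time"
)

type Stream struct {
	// ... existing stream fields ...

	cachedJPEG   []byte
	cachedJPEGAt time.Time
	cachedJPEGMu sync.RWMutex
}

// Write path, called by the background cacher on every transcoded keyframe.
func (s *Stream) SetCachedJPEG(b []byte) {
	s.cachedJPEGMu.Lock()
	s.cachedJPEG = b
	s.cachedJPEGAt = time.Now()
	s.cachedJPEGMu.Unlock()
}

// Read path, used by the snapshot handler; the RWMutex lets many concurrent
// readers share the lock without blocking each other.
func (s *Stream) CachedJPEG() ([]byte, time.Time, bool) {
	s.cachedJPEGMu.RLock()
	defer s.cachedJPEGMu.RUnlock()
	return s.cachedJPEG, s.cachedJPEGAt, s.cachedJPEG != nil
}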

Configuration:
  mjpeg:
    snapshot_cache: true                        # Enable/disable
    snapshot_cache_timeout: 600                 # Idle timeout (seconds)
    snapshot_serve_cached_by_default: false     # Serve policy

HTTP Response Headers:
  X-Snapshot-Age-Ms: 234          # Milliseconds since capture
  X-Snapshot-Timestamp: 2025-...  # ISO 8601 capture time
  X-Snapshot-Cached: true         # true if from cache

Performance:
- First request: 150-2700ms (unchanged - cold start)
- Subsequent requests: <10ms (~99% improvement)
- Memory usage: ~300KB per stream (30MB for 100 cameras)
- Works with WebRTC/HLS: zero additional overhead when consumers active

Key Implementation Details:
- Dependency injection for transcode function avoids import cycles
- WriteBuffer.WriteTo blocks until Stop() called on consumer
- Idempotent stop() via atomic.Bool prevents double cleanup
- Cache never evicted (negligible memory for typical deployments)
- Per-request policy override via query parameter
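
Continuing the Stream sketch above, here is one way the injected transcode function and the idempotent stop could fit together; every name beyond the concepts listed in this commit message is an assumption, not the actual contents of internal/streams/snapshot_cache.go.

package streams

import (
	"sync/atomic"
	"time"
)

type SnapshotCacher struct {
	stream    *Stream
	transcode func(keyframe []byte) ([]byte, error) // injected to avoid an import cycle
	timeout   time.Duration                         // idle timeout, 600s by default
	lastHit   atomic.Int64                          // unix nanos of the last cached-snapshot request
	stopped   atomic.Bool
}

// Touch records client interest; the run loop compares it against timeout.
func (c *SnapshotCacher) Touch() { c.lastHit.Store(time.Now().UnixNano()) }

// stop is idempotent: the CompareAndSwap guarantees the cleanup path
// (removing the consumer, which unblocks WriteTo) runs exactly once.
func (c *SnapshotCacher) stop() {
	if !c.stopped.CompareAndSwap(false, true) {
		return
	}
	// remove the consumer from the stream here; WriteTo returns and run() exits
}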

Tested:
- Multi-architecture Docker image (linux/amd64, linux/arm64)
- Verified cache updates continuously in background
- Confirmed graceful shutdown on idle timeout
- Validated 11x performance improvement in production test

Addresses panic: runtime error: slice bounds out of range [32382:2108]

The issue occurred when nuStart wasn't reset after buffer clearing,
causing it to point beyond the buffer length on subsequent fragmented
units. This adds:

1. Bounds checking before writing NAL unit size to prevent invalid
   slice operations
2. nuStart reset when buffer is cleared to prevent stale state

The panic typically occurred during H265 RTP stream processing when
fragmentation unit (FU) state became inconsistent.
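
The following is only an illustrative shape of those two guards, not the actual diff to pkg/h265/rtp.go; in particular, the 4-byte length prefix is an assumption about how reassembled NAL units are packaged.

package h265

import "encoding/binary"

// flushFragment is a hypothetical helper showing the two guards described
// above: bounds-check before writing the NAL unit size, and always reset
// nuStart so a stale offset can never index past a cleared buffer.
func flushFragment(buffer []byte, nuStart int) ([]byte, int) {
	if nuStart >= 0 && nuStart+4 <= len(buffer) {
		// write the accumulated NAL unit size into its 4-byte length prefix
		binary.BigEndian.PutUint32(buffer[nuStart:], uint32(len(buffer)-nuStart-4))
	} else {
		// a stale nuStart would panic with "slice bounds out of range"; drop the data instead
		buffer = buffer[:0]
	}
	return buffer, -1 // nuStart is reset whenever the buffer has been handled or cleared
}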

Modified TouchSnapshotCache call to include stream name parameter,
enabling per-stream logging and diagnostics in snapshot cache operations.
This improves troubleshooting and monitoring of cache behavior across
multiple streams.

Key improvements to snapshot cache implementation:

1. Enhanced logging with stream names:
   - Added stream name field to SnapshotCacher for per-stream logging
   - All log messages now include stream name for better diagnostics
   - Added trace-level logging for detailed troubleshooting

2. Stale cache detection and recovery:
   - Detect when cache age exceeds 2x timeout threshold
   - Automatically restart cacher when cache becomes stale
   - Prevents serving outdated snapshots from stuck cachers

3. Improved lifecycle management:
   - Clear cacher reference in run() loop on exit for auto-restart
   - Better error handling when consumer fails to start
   - Retain old cached snapshot when new cacher fails to start

4. Fixed potential deadlock:
   - Check cache age before acquiring snapshotCacherMu
   - Prevents lock ordering issues between cachedJPEGMu and snapshotCacherMu

5. Better observability:
   - Log bytes written on consumer errors
   - Log when WriteTo completes normally vs error
   - Track timestamp with nanosecond precision in logs
   - Added stopAndClear helper for clarity (future use)

These changes make the snapshot cache more resilient to transient
producer failures and easier to debug in production; a sketch of the
stale-cache check follows below.
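
A sketch of the stale-cache check and lock ordering from points (2) and (4): the 2x threshold and the mutex names come from this commit message, while the helper function and the package-level variables are illustrative (in the PR the cacher state appears to live on the stream).

package streams

import (
	"sync"
	"time"
)

// Placeholders for the cacher state described in this PR; placement here is
// illustrative only.
var (
	snapshotCacherMu sync.Mutex
	snapshotCacher   *SnapshotCacher
)

// maybeRestartStaleCacher is a hypothetical helper for points (2) and (4).
func maybeRestartStaleCacher(stream *Stream, timeout time.Duration) {
	// read the cache age first, using only cachedJPEGMu (read lock) ...
	_, ts, ok := stream.CachedJPEG()
	if !ok || time.Since(ts) <= 2*timeout {
		return
	}

	// ... and only then take snapshotCacherMu, avoiding the
	// cachedJPEGMu -> snapshotCacherMu lock-ordering problem
	snapshotCacherMu.Lock()
	defer snapshotCacherMu.Unlock()
	if snapshotCacher != nil {
		snapshotCacher.stop() // idempotent; the old snapshot is retained if a restart fails
		snapshotCacher = nil  // cleared so the next cached request starts a fresh cacher
	}
}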

Change most operational snapshot-cache logs from DEBUG to TRACE:
- Startup sequence (creating cacher, adding consumer, etc)
- Run loop messages
- Cache update messages (fires on every keyframe)
- Stop/cleanup sequence

Keeps important messages at DEBUG or higher:
- Successfully started cacher
- Idle timeout events
- Warning/error conditions
@felipecrs (Contributor) commented:

Interesting. I wonder how much this overlaps with GOP cache.

I mean, if GOP cache was implemented, I suppose it could be reused for the snapshot too.
