Critical fixes for NimBLE host task / BLE loop task concurrency:
- Defer all disconnect map cleanup from NimBLE callbacks to loop task via
SPSC ring buffer, preventing iterator invalidation and use-after-free
- Defer enterErrorRecovery() from callback context to loop task
- Add WDT feed in enterErrorRecovery() host-sync polling loop
Operational hardening:
- Cache NimBLERemoteCharacteristic* pointers in write() to avoid repeated
service/characteristic lookups per fragment
- Add isConnected() checks before GATT operations (read, enableNotifications)
- Validate peer address in notification callback to guard against handle reuse
- Skip stuck-state detector during CONNECTING/CONN_STARTING states
- Expire stale pending data entries after HANDSHAKE_TIMEOUT (30s)
- Read actual connection RSSI via ble_gap_conn_rssi() for peripheral connections
instead of hardcoding 0
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Reduces crash rate from every 60-85s to 1 reboot per 6+ minutes.
Zero WDT triggers in 10-minute stability test.
BLE mutex fixes (BLEInterface.cpp):
- Release _mutex before blocking GATT ops in onConnected() and
onServicesDiscovered() — prevents 5-30s main-loop stalls during
service discovery, notification subscribe, identity exchange
- Non-blocking try_lock() for peerCount(), getConnectedPeerSummaries(),
get_stats() — returns empty/default if BLE task holds mutex
- Write-without-response in initiateHandshake()
WDT and persistence (main.cpp, sdkconfig.defaults, microReticulum):
- 30s WDT timeout (up from 10s) for SPIFFS flash I/O headroom
- Register Identity::set_persist_yield_callback() to feed WDT every
5 entries during save_known_destinations() (70+ entries = 30-50s)
- WDT feeds between reticulum and identity persist calls
BLE host desync recovery (NimBLEPlatform):
- Time-based desync tracking instead of aggressive counter-based reboot
- 60s tolerance without connections, 5 minutes with active connections
(data still flows over existing BLE mesh links)
- Remove immediate recoverBLEStack() from 574 handler and
enterErrorRecovery() — let startScan() manage reboot decision
- Increase CONNECTION_COOLDOWN from 3s to 10s to reduce 574 risk
- Increase SCAN_FAIL_RECOVERY_THRESHOLD from 5 to 10
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Subscribe loopTask and BLE task to the ESP32 Task Watchdog (10s timeout)
to detect and recover from silent hangs. Per-step WDT feeds in the main
loop prevent false triggers from cumulative slow operations.
Fix BLE mutex starvation that blocked the main loop for 3-6s:
- Move processDiscoveredPeers() out of performMaintenance() so _mutex
is not held during blocking NimBLE connect calls
- Use try_lock() in send_outgoing() to skip sends when BLE task has
the mutex, rather than blocking (Reticulum retransmits)
- Switch BLE data writes to write-without-response (non-blocking)
- Add WDT feeds to all NimBLE blocking wait loops
Replace NimBLE soft-reset recovery with immediate reboot — deinit()
during sync failures caused CORRUPT HEAP panics. With atomic file
persistence, data survives reboots reliably.
Reduce loop task stack from 49KB to 16KB (measured peak ~6KB).
Add NimBLE PHY update null guard to patch_nimble.py.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Fix systemic One Definition Rule violation where BLEInterface.h included
headers from deps/microReticulum/src/BLE/ while .cpp files compiled
against local lib/ble_interface/ versions, causing struct layout mismatches
(PeerInfo field shifting corrupted conn_handle/mtu) and class layout
mismatches (BLEPeerManager member differences caused LoadProhibited crash).
Key fixes:
- Include local BLE headers instead of deps versions in BLEInterface.h
- Sync PeerInfo keepalive tracking fields and BLETypes constants with deps
- Shutdown re-entrancy guard and proper client cleanup via deinit(true)
- Host sync checks before scan, advertise, and connect operations
- Avoid deadlock by deferring _on_connected from NimBLE host task
- Duplicate identity detection, stale handle cross-check in keepalives
- Bounds validation on conn_handle in setPeerHandle/promoteToIdentityKeyed
- Periodic persist_data() call for display name persistence across reboots
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Cleanup after BLE stability debugging - the Serial.printf heartbeat and
loop_count were temporary instrumentation no longer needed.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Root cause: connectNative() used raw ble_gap_connect() which bypasses
NimBLE's client management. The NimBLEClient created afterwards wasn't
associated with the connection handle, causing service discovery to fail
with "could not retrieve services". This led to a connect-disconnect loop
where no BLE peers could complete handshakes.
Fix: Replace raw ble_gap_connect() with NimBLEClient::connect() which
properly manages the GAP event handler, connection handle tracking, MTU
exchange, and service discovery. Connections now succeed with MTU 517
and identity handshakes complete.
Also fixed:
- Error recovery escalates to full stack reset (deinit/reinit) when
NimBLE host fails to sync, instead of looping in a dead state
- Added recursion guard in enterErrorRecovery()
- Promoted key BLE logs (scan, connect, peer status) to INFO level
for visibility during monitoring
- Added 10-second serial heartbeat with connection/peer/heap stats
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Root cause: Bytes objects stored in PSRAM-allocated BLEInterface had
corrupted shared_ptr members from uninitialized memory, causing crashes
in processDiscoveredPeers(). Fixed by using heap_caps_calloc instead of
heap_caps_malloc for PSRAM placement-new allocation.
Additional fixes:
- Reduce pool sizes to fit memory budget (reassembler 134KB→17KB,
fragmenters 8→4, handshakes 32→4, pending data 64→8)
- Store local MAC as BLEAddress struct instead of Bytes to avoid
heap allocation in PSRAM-resident object
- Move setLocalMac after platform start (NimBLE needs to be running
for valid random address), add lazy MAC init fallback in loop()
- Add stuck-state detector: resets GAP state machine if hardware
is idle but state machine thinks it's busy
- Enhance getLocalAddress with 3 fallback methods (NimBLE API,
ble_hs_id_copy_addr RANDOM, esp_read_mac efuse)
- Fix C++17 structured binding to C++11 compatibility
- Increase BLE task stack 8KB→12KB for string ops in debug logs
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Replace std::map and std::vector with fixed-size pools in
BLEInterface (fragmenters, pending handshakes, pending data)
- Track keepalive failures and disconnect after 3 consecutive
- Force-disconnect zombie peers detected by BLEPeerManager
- Add periodic advertising refresh (every 60s) to combat silent stops
- Buffer incoming data when identity not yet mapped instead of dropping
- Subtract ATT_OVERHEAD from MTU in NimBLEPlatform connection setup
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Split T-Deck firmware from microReticulum examples/lxmf_tdeck/ into its
own repo. microReticulum is consumed as a git submodule dependency pinned
to feat/t-deck. All include paths updated from relative symlinks to bare
includes resolved via library build flags.
Both tdeck (NimBLE) and tdeck-bluedroid environments compile successfully.
Licensed under AGPLv3.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>