The unprotected _clients.find() could race with
processPendingDisconnects() erasing from the map concurrently.
Mutex is released before the blocking getService() GATT call.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Previously, a mutex timeout left characteristic caches empty but
still signalled success to callers, making all GATT ops silently
fail for the connection.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Re-check hasActiveWriteOperations() after acquiring mutex in
processPendingDisconnects() to close race where write() registers
an op between the pre-mutex check and mutex acquisition
- Move cached char pointer writes inside connection-exists guard in
discoverServices() to prevent dangling pointers on handle reuse
- Add WARNING logs to both onConnect callbacks on mutex timeout
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Move getService()/getCharacteristic() out of mutex-held paths in
writeCharacteristic(), read(), enableNotifications() by caching all
three char pointers (RX, TX, Identity) during discoverServices()
- Replace 5-second spin-wait in processPendingDisconnects() with
non-blocking deferral: break if GATT ops in flight, retry next loop
- Add WARNING logs to all read-path helpers on mutex timeout
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Move beginWriteOperation() before xSemaphoreGive(_conn_mutex) in
write(), writeCharacteristic(), read(), and enableNotifications()
so the active-op counter is incremented while the mutex is still
held. This closes the window where processPendingDisconnects()
could observe hasActiveWriteOperations()==false and delete the
client before the GATT caller has registered its operation.
- Add _conn_mutex around _connections/_clients insertions in both
server and client onConnect() callbacks, preventing concurrent
map insertions from corrupting the red-black tree.
- Protect updateConnectionMTU() with _conn_mutex.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
writeCharacteristic(), read(), and enableNotifications() resolve
characteristic pointers under _conn_mutex then call blocking GATT
ops after releasing it — same pattern as write(). Without the
active-operation guard, processPendingDisconnects() could delete
the client (and its child characteristics) during the GATT call.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Defer NimBLEDevice::deleteClient() in processPendingDisconnects()
until after releasing _conn_mutex and waiting for any active write
operations to complete. Prevents use-after-free when write() holds
a child NimBLERemoteCharacteristic* pointer across the mutex boundary.
- Add _conn_mutex protection to getConnectionCount(), isConnectedTo(),
and isDeviceConnected() which read _connections without synchronization.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
send_outgoing() on loopTask (core 1) calls write() which reads
_connections, _clients, and _cached_rx_chars maps, while
processPendingDisconnects() on the BLE task (core 0) erases from
them — with no synchronization. This causes std::map red-black tree
corruption, manifesting as LoadProhibited crashes in map rotate/insert
operations (EXCVADDR=0x00000008).
Protect all map accesses in write(), writeCharacteristic(), read(),
enableNotifications(), getConnection(), getConnections(), and
processPendingDisconnects() with _conn_mutex. The mutex is released
before any blocking GATT operations (writeValue, readValue, subscribe)
to avoid holding it during 10-30s NimBLE timeouts.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Consolidate #ifndef/#ifdef into single #ifdef/#else/#endif block.
Add warning comment to generated header about static linkage.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Move BG_COLOR inline into #ifndef block to avoid unused variable
when HAS_SPLASH_IMAGE is defined. Make show_splash() private since
it's only called internally from init_hardware_only().
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Matches the guard already on notify() to prevent use-after-free
of _tx_char during a NimBLE host reset.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
BLE task is no longer subscribed to WDT, so these 23 calls were
silently returning ESP_ERR_NOT_FOUND. Removes dead code and the
now-unused esp_task_wdt.h include.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Prevents double PSRAM allocation and LVGL driver re-registration
if init() were called more than once.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Move Display::init_hardware_only() and POWER_EN to right after serial
banner, before GPS/WiFi/SD/Reticulum init. Add 150ms delay after
POWER_EN HIGH so ST7789V power rail stabilizes before SPI commands
(without this, SWRESET is sent to an unpowered chip and silently lost).
Splash now visible for entire boot period (~18s) until LVGL takes over.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Two root causes for frequent device reboots:
1. LVGL task (priority 2) starved loopTask (priority 1) on core 1.
During heavy screen rendering, loopTask couldn't run for 30+ seconds,
triggering the Task WDT. Fixed by lowering LVGL to priority 1 so
FreeRTOS round-robins both tasks fairly.
2. BLE task was registered with the 30s Task WDT, but blocking NimBLE
GATT operations (connect + service discovery + subscribe + read) can
legitimately take 30-60s total. Removed BLE task from WDT since
NimBLE has its own internal ~30s timeouts per GATT operation.
Also added ble_hs_synced() guards to write(), read(), notify(),
writeCharacteristic(), discoverServices(), and enableNotifications()
to prevent use-after-free on stale NimBLE client pointers during
host resets.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Display and LoRa were creating separate SPIClass(HSPI) instances which
claimed GPIO pins via the matrix, preventing SD card (on FSPI) from
accessing MISO after Display init. Now all three peripherals use the
global SPI (FSPI) instance, eliminating GPIO routing conflicts.
- Display: use &SPI instead of new SPIClass(HSPI)
- SX1262Interface: use &SPI instead of new SPIClass(HSPI)
- SDAccess: enable format_if_empty for unformatted cards
Verified on device: SD (128GB SDHC), display, and LoRa all coexist.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
SD card was unresponsive (MISO stuck 0xFF) because Display's HSPI
peripheral had already claimed the GPIO pins via the matrix, preventing
FSPI from routing MISO. Fix by initializing SD card BEFORE Display,
using the global SPI (FSPI) instance — matching LilyGo's reference code.
- Move SD card init before display init in boot sequence
- Use global SPI (FSPI) instead of Display's SPIClass(HSPI)
- Lower SPI frequency to 800kHz matching LilyGo example
- Drive all CS lines (display, LoRa, SD) high before SD init
- Add MISO=38 to Display's SPI.begin for post-init bus sharing
- Add Display::get_spi() accessor for future shared use
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The T-Deck Plus shares HSPI across the display (CS=12), LoRa (CS=9),
and SD card (CS=39). Previously SD logging was disabled because
SD.begin() reconfigured the SPI bus and blanked the display.
This introduces a FreeRTOS mutex created in main.cpp and injected into
Display, SX1262Interface, and a new SDAccess class so all three
peripherals serialize their SPI transactions safely.
- Add SDAccess class wrapping SD.begin() and file ops with mutex
- Add set_spi_mutex() to Display and SX1262Interface
- Wrap Display flush, fill, draw, and power ops in mutex
- Refactor SDLogger to use SDAccess mutex instead of owning SD.begin()
- Wire up mutex creation and injection order in setup()
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
UIManager.cpp includes Tone.h, so tdeck_ui should declare this
dependency rather than relying on implicit global discovery.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
After a 574 connection failure, the NimBLE controller's scan state can
become corrupted (returning rc=530 / Invalid HCI Params) even after the
host re-syncs. This led to scan failure escalation and device reboots.
Key fixes:
- Add ble_gap_conn_cancel() to enterErrorRecovery() — stuck GAP master
connection operations were blocking all subsequent scans
- Add ble_hs_sched_reset(BLE_HS_ECONTROLLER) in error recovery to force
a full host-controller resynchronization after desync
- Proactively cancel stale GAP connections before scan start
- Reduce SCAN_FAIL_RECOVERY_THRESHOLD from 10 to 5 for faster recovery
- Enhanced scan failure logging with GAP state diagnostics
- Move ESP reset reason logging after WiFi init for UDP log visibility
- Suppress connection candidate log spam when at max connections
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add 30-second cooldown after NimBLE host desync recovery before
allowing new connection attempts. During desync, client->connect()
blocks waiting for a host-task completion event that never arrives,
causing WDT crashes. The cooldown skips connection attempts while
the host is desynced or recently recovered.
Also adds ESP reset reason logging at boot to diagnose crash types
(WDT, panic, brownout, etc.) in soak test logs.
Soak test results: Run 3 (before) had 17 reboots in ~4 hours with
a 12-crash-in-14-minutes loop. Run 4 (after) has 1 early reboot
then 19+ hours of continuous uptime with the same desync frequency.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
platformio.ini:
- Replace -I${PROJECT_DIR}/lib, -I${PROJECT_DIR}/deps/... with relative
paths (-Ilib, -Ideps/...) in both tdeck-bluedroid and tdeck environments;
${PROJECT_DIR} is mangled on Windows inside build_flags, causing include
paths to resolve inside the PlatformIO builder directory instead of the
project root
- Remove hardcoded -I.pio/libdeps/tdeck/TinyGPSPlus/src and
-I.pio/libdeps/tdeck/NimBLE-Arduino/src; these paths reference generated
cache, break on fresh clones, and are redundant with lib_ldf_mode = deep+
- Fix OTA upload_command: replace python3 with $PYTHONEXE so it resolves
to PlatformIO's bundled Python on Windows, macOS, and Linux
src/main.cpp, lib/tdeck_ui/UI/LXMF/UIManager.cpp:
- Change #include "tone/Tone.h" to #include "Tone.h"; PlatformIO
automatically adds -Ilib/tone for local libraries, making the
subdirectory prefix unnecessary and broken when -Ilib is not effective
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Patch 3 (ble_gap.c): Handle BLE_ERR_CONN_ESTABLISHMENT (574) unconditionally.
NimBLE only handled 574 under BLE_PERIODIC_ADV_WITH_RESPONSES (disabled on
ESP32), causing ble_gap_master_failed() to never be called. This left the
master GAP state stuck in BLE_GAP_OP_M_CONN, permanently blocking scan and
advertising. Also clean up master state in the default case instead of
assert(0).
Patch 4 (NimBLEDevice.cpp): Expose host reset reason via global volatile int.
NimBLE's onReset callback logs the reason code through ESP_LOG (serial UART
only). This patch adds nimble_host_reset_reason that the BLE loop polls to
capture the reason in UDP log output for remote soak test monitoring.
NimBLEPlatform.cpp: Escalate persistent scan failures to full stack recovery.
After 3 consecutive enterErrorRecovery() rounds fail to restore scanning (30
total scan failures), escalate to recoverBLEStack() (clean reboot) instead
of looping indefinitely in a broken state.
Validated with 17+ hour soak test: device recovers from desyncs and maintains
3 active BLE connections with stable heap (~43K).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
After boot, the conversation list called recall_app_data() once during
initial load. If announces hadn't arrived yet (or known destinations
hadn't been loaded with app_data), conversations showed raw hashes
permanently until the user navigated away and back.
Add a lazy name resolution check to update_status() (called every 3s):
if any conversations have unresolved names, try recall_app_data() again
and refresh the list when a display name becomes available.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Critical fixes for NimBLE host task / BLE loop task concurrency:
- Defer all disconnect map cleanup from NimBLE callbacks to loop task via
SPSC ring buffer, preventing iterator invalidation and use-after-free
- Defer enterErrorRecovery() from callback context to loop task
- Add WDT feed in enterErrorRecovery() host-sync polling loop
Operational hardening:
- Cache NimBLERemoteCharacteristic* pointers in write() to avoid repeated
service/characteristic lookups per fragment
- Add isConnected() checks before GATT operations (read, enableNotifications)
- Validate peer address in notification callback to guard against handle reuse
- Skip stuck-state detector during CONNECTING/CONN_STARTING states
- Expire stale pending data entries after HANDSHAKE_TIMEOUT (30s)
- Read actual connection RSSI via ble_gap_conn_rssi() for peripheral connections
instead of hardcoding 0
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Propagation sync (microReticulum submodule):
- Fix msgpack interop: send nil (not 0) for per_transfer_limit so
Python server doesn't reject all messages as exceeding "0 KB limit"
- Fix Resource response routing: extract request_id from packed data
when not present in Resource advertisement, route to pending request
callback instead of generic concluded handler
- Fix Link::request() to manually build packed arrays, avoiding
Bytes::to_msgpack() BIN-wrapping that breaks protocol interop
UI enhancements:
- PropagationNodesScreen: manual node entry via 32-char hex hash in
search field, with paste support and radio button selection
- StatusScreen: display stamp cost from propagation node
- UIManager: NVS persistence for selected propagation node, proactive
path request on node selection, sync state machine with timeout
handling
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Fix use-after-free crash on hangup: set _call_state=IDLE before deleting
_lxst_audio, preventing pump_call_tx() (runs without LVGL lock) from
accessing freed memory
- Replace single-slot _call_signal_pending with 8-element ring buffer queue
to prevent signal loss when CONNECTING+ESTABLISHED arrive in rapid succession
- Extract TX pump into pump_call_tx() called right after reticulum->loop()
for low-latency audio TX without LVGL lock dependency (was buried at step 10)
- Tune ES7210 mic gain to 21dB (was 15dB) to improve Codec2 input level
without ADC clipping that occurred at 24dB
- I2S capture: use APLL for accurate 8kHz clock, direct 8kHz sampling
(no more 16→8kHz decimation), DMA 16x64 for encode burst headroom
- Reduce Reticulum log verbosity to LOG_INFO (was LOG_TRACE)
- BLE: add ble_hs_sched_reset() tiered recovery before reboot on desync,
widen supervision timeout to 4.0s for WiFi coexistence
- Add UDP multicast log broadcasting and OTA flash support
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Columba's native OboePlaybackEngine ring buffer expects exactly
frameSamples (1600 for Codec2 3200 mode) decoded samples per
writeEncodedPacket call = 10 sub-frames of 160 samples each.
Changes:
- Batch exactly 10 sub-frames per fixarray element (82 bytes each:
codec_type + mode_header + 10*8 raw bytes)
- Up to 2 batches per msgpack packet, matching Columba C2C format
- Proper fixarray wrapping for multi-batch, bare bin8 for single
- Add codec_type byte (0x02) prefix per batch element
- Respond to PREFERRED_PROFILE negotiation with LBW (Codec2 3200)
- Add capture diagnostics (raw PCM peaks, I2S dump, rate logging)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
PCM_RING_FRAMES 16→50 (320ms→1000ms capacity) and
PREBUFFER_FRAMES 3→15 (60ms→300ms prebuffer) to match
LXST-kt's buffering strategy. Interop test suite confirms
zero underruns with ±100ms jitter at these settings.
Also adds tests/interop/ with 48 Python tests verifying
wire format, codec round-trip, and pipeline compatibility
between Pyxis, Python LXST, and LXST-kt implementations.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
A desynced NimBLE host can't actually communicate over existing
connections, so they're effectively zombies. Waiting 5 minutes left
the device unresponsive. 90s gives enough time for self-recovery
while avoiding prolonged dead states.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Only destinations that have exchanged messages are written to SPIFFS.
UIManager marks destinations as persistent on send_message() and
on_message_received(). Reduces persist time from 40-50s to <1s.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Reduces crash rate from every 60-85s to 1 reboot per 6+ minutes.
Zero WDT triggers in 10-minute stability test.
BLE mutex fixes (BLEInterface.cpp):
- Release _mutex before blocking GATT ops in onConnected() and
onServicesDiscovered() — prevents 5-30s main-loop stalls during
service discovery, notification subscribe, identity exchange
- Non-blocking try_lock() for peerCount(), getConnectedPeerSummaries(),
get_stats() — returns empty/default if BLE task holds mutex
- Write-without-response in initiateHandshake()
WDT and persistence (main.cpp, sdkconfig.defaults, microReticulum):
- 30s WDT timeout (up from 10s) for SPIFFS flash I/O headroom
- Register Identity::set_persist_yield_callback() to feed WDT every
5 entries during save_known_destinations() (70+ entries = 30-50s)
- WDT feeds between reticulum and identity persist calls
BLE host desync recovery (NimBLEPlatform):
- Time-based desync tracking instead of aggressive counter-based reboot
- 60s tolerance without connections, 5 minutes with active connections
(data still flows over existing BLE mesh links)
- Remove immediate recoverBLEStack() from 574 handler and
enterErrorRecovery() — let startScan() manage reboot decision
- Increase CONNECTION_COOLDOWN from 3s to 10s to reduce 574 risk
- Increase SCAN_FAIL_RECOVERY_THRESHOLD from 5 to 10
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Subscribe loopTask and BLE task to the ESP32 Task Watchdog (10s timeout)
to detect and recover from silent hangs. Per-step WDT feeds in the main
loop prevent false triggers from cumulative slow operations.
Fix BLE mutex starvation that blocked the main loop for 3-6s:
- Move processDiscoveredPeers() out of performMaintenance() so _mutex
is not held during blocking NimBLE connect calls
- Use try_lock() in send_outgoing() to skip sends when BLE task has
the mutex, rather than blocking (Reticulum retransmits)
- Switch BLE data writes to write-without-response (non-blocking)
- Add WDT feeds to all NimBLE blocking wait loops
Replace NimBLE soft-reset recovery with immediate reboot — deinit()
during sync failures caused CORRUPT HEAP panics. With atomic file
persistence, data survives reboots reliably.
Reduce loop task stack from 49KB to 16KB (measured peak ~6KB).
Add NimBLE PHY update null guard to patch_nimble.py.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
NimBLE crash fix:
- Patch ble_hs.c assert(0) in BLE_HS_SYNC_STATE_BRINGUP timer handler
via pre-build script (patch_nimble.py). The assert fires when a timer
callback races with host re-sync — harmless, but kills the ESP32 and
corrupts any file writes in progress.
Persistence fixes (in microReticulum submodule):
- Atomic save: write to temp file then rename, protecting existing data
- Fast persist: 5s after dirty flag instead of waiting 60s interval
- Corrupt file recovery: delete invalid files, recover from temp files
- INFO-level logging for load/save visibility
Other:
- Wrap LXMF announce in try/catch for crash safety
- Call Identity::should_persist_data() from main loop
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Fix systemic One Definition Rule violation where BLEInterface.h included
headers from deps/microReticulum/src/BLE/ while .cpp files compiled
against local lib/ble_interface/ versions, causing struct layout mismatches
(PeerInfo field shifting corrupted conn_handle/mtu) and class layout
mismatches (BLEPeerManager member differences caused LoadProhibited crash).
Key fixes:
- Include local BLE headers instead of deps versions in BLEInterface.h
- Sync PeerInfo keepalive tracking fields and BLETypes constants with deps
- Shutdown re-entrancy guard and proper client cleanup via deinit(true)
- Host sync checks before scan, advertise, and connect operations
- Avoid deadlock by deferring _on_connected from NimBLE host task
- Duplicate identity detection, stale handle cross-check in keepalives
- Bounds validation on conn_handle in setPeerHandle/promoteToIdentityKeyed
- Periodic persist_data() call for display name persistence across reboots
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Cleanup after BLE stability debugging - the Serial.printf heartbeat and
loop_count were temporary instrumentation no longer needed.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Root cause: connectNative() used raw ble_gap_connect() which bypasses
NimBLE's client management. The NimBLEClient created afterwards wasn't
associated with the connection handle, causing service discovery to fail
with "could not retrieve services". This led to a connect-disconnect loop
where no BLE peers could complete handshakes.
Fix: Replace raw ble_gap_connect() with NimBLEClient::connect() which
properly manages the GAP event handler, connection handle tracking, MTU
exchange, and service discovery. Connections now succeed with MTU 517
and identity handshakes complete.
Also fixed:
- Error recovery escalates to full stack reset (deinit/reinit) when
NimBLE host fails to sync, instead of looping in a dead state
- Added recursion guard in enterErrorRecovery()
- Promoted key BLE logs (scan, connect, peer status) to INFO level
for visibility during monitoring
- Added 10-second serial heartbeat with connection/peer/heap stats
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Root cause: Bytes objects stored in PSRAM-allocated BLEInterface had
corrupted shared_ptr members from uninitialized memory, causing crashes
in processDiscoveredPeers(). Fixed by using heap_caps_calloc instead of
heap_caps_malloc for PSRAM placement-new allocation.
Additional fixes:
- Reduce pool sizes to fit memory budget (reassembler 134KB→17KB,
fragmenters 8→4, handshakes 32→4, pending data 64→8)
- Store local MAC as BLEAddress struct instead of Bytes to avoid
heap allocation in PSRAM-resident object
- Move setLocalMac after platform start (NimBLE needs to be running
for valid random address), add lazy MAC init fallback in loop()
- Add stuck-state detector: resets GAP state machine if hardware
is idle but state machine thinks it's busy
- Enhance getLocalAddress with 3 fallback methods (NimBLE API,
ble_hs_id_copy_addr RANDOM, esp_read_mac efuse)
- Fix C++17 structured binding to C++11 compatibility
- Increase BLE task stack 8KB→12KB for string ops in debug logs
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace crude 2-tap averaging with proper 15-tap half-band FIR filter
for 16kHz→8kHz decimation (~60dB stopband attenuation, Kaiser beta=6).
Exploits symmetry + half-band zeros for only 5 MACs per output sample.
Separate TDM deinterleave (CH0 extraction at 16kHz) from FIR decimation
for cleaner signal processing pipeline.
Reduce ES7210 mic gain from 8 (24dB) to 5 (15dB) to avoid ADC clipping;
AGC in the voice filter chain compensates for quieter input.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Split Codec2 into separate encode/decode instances to eliminate mutex
contention between capture task (core 0) and main thread decode
- Fix TDM deinterleave: stride-4 (was stride-2) for [CH0,CH1] at 16kHz
to produce 8kHz mono output matching Codec2's expected sample rate
- Add 2-tap anti-aliasing average before decimation to reduce >4kHz alias
- Add hard limiter at ±16000 to prevent ADC clipping artifacts
- TX batching: send exactly 8 Codec2 frames per packet (2560 decoded
samples) to match Columba's PacketRingBuffer.frameSamples requirement
- Add capture diagnostics: sample rate, raw/downsampled peaks, hex dumps
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>