pyxis

mirror of https://github.com/torlando-tech/pyxis.git synced 2026-03-30 13:45:38 +00:00

Author	SHA1	Message	Date
torlando-tech	e2df70161b	Protect _clients lookup in discoverServices with _conn_mutex The unprotected _clients.find() could race with processPendingDisconnects() erasing from the map concurrently. Mutex is released before the blocking getService() GATT call. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-05 12:30:52 -05:00
torlando-tech	602d8f7083	Improve write() cache-miss warning to indicate discovery dependency Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-05 12:28:35 -05:00
torlando-tech	74def922a7	Guard enableNotifications against missing connection entry Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-05 12:05:50 -05:00
torlando-tech	8013597a5f	Avoid redundant mutex retry in discoverServices error path Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-05 11:59:05 -05:00
torlando-tech	d9883c9e36	Report failure on mutex timeout in discoverServices Previously, a mutex timeout left characteristic caches empty but still signalled success to callers, making all GATT ops silently fail for the connection. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-05 11:46:43 -05:00
torlando-tech	8c0dd227f4	Add missing mutex timeout warning in updateConnectionMTU Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-05 11:44:30 -05:00
torlando-tech	20a072d258	Fix TOCTOU in disconnect, stale cache in discovery, silent onConnect failures - Re-check hasActiveWriteOperations() after acquiring mutex in processPendingDisconnects() to close race where write() registers an op between the pre-mutex check and mutex acquisition - Move cached char pointer writes inside connection-exists guard in discoverServices() to prevent dangling pointers on handle reuse - Add WARNING logs to both onConnect callbacks on mutex timeout Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-05 11:01:55 -05:00
torlando-tech	9bd075c91e	Cache char pointers in discoverServices, defer disconnect, add mutex timeout logs - Move getService()/getCharacteristic() out of mutex-held paths in writeCharacteristic(), read(), enableNotifications() by caching all three char pointers (RX, TX, Identity) during discoverServices() - Replace 5-second spin-wait in processPendingDisconnects() with non-blocking deferral: break if GATT ops in flight, retry next loop - Add WARNING logs to all read-path helpers on mutex timeout Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-05 01:50:46 -05:00
torlando-tech	334f024179	Close TOCTOU gap and protect onConnect map insertions - Move beginWriteOperation() before xSemaphoreGive(_conn_mutex) in write(), writeCharacteristic(), read(), and enableNotifications() so the active-op counter is incremented while the mutex is still held. This closes the window where processPendingDisconnects() could observe hasActiveWriteOperations()==false and delete the client before the GATT caller has registered its operation. - Add _conn_mutex around _connections/_clients insertions in both server and client onConnect() callbacks, preventing concurrent map insertions from corrupting the red-black tree. - Protect updateConnectionMTU() with _conn_mutex. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-05 00:16:58 -05:00
torlando-tech	fffd8ec79e	Add beginWriteOperation() guards to all blocking GATT methods writeCharacteristic(), read(), and enableNotifications() resolve characteristic pointers under _conn_mutex then call blocking GATT ops after releasing it — same pattern as write(). Without the active-operation guard, processPendingDisconnects() could delete the client (and its child characteristics) during the GATT call. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-04 23:54:01 -05:00
torlando-tech	71bc4ae82b	Address Greptile review: fix use-after-free and unprotected accessors - Defer NimBLEDevice::deleteClient() in processPendingDisconnects() until after releasing _conn_mutex and waiting for any active write operations to complete. Prevents use-after-free when write() holds a child NimBLERemoteCharacteristic* pointer across the mutex boundary. - Add _conn_mutex protection to getConnectionCount(), isConnectedTo(), and isDeviceConnected() which read _connections without synchronization. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-04 23:05:47 -05:00
torlando-tech	9174d27183	Fix cross-thread race condition on BLE connection maps send_outgoing() on loopTask (core 1) calls write() which reads _connections, _clients, and _cached_rx_chars maps, while processPendingDisconnects() on the BLE task (core 0) erases from them — with no synchronization. This causes std::map red-black tree corruption, manifesting as LoadProhibited crashes in map rotate/insert operations (EXCVADDR=0x00000008). Protect all map accesses in write(), writeCharacteristic(), read(), enableNotifications(), getConnection(), getConnections(), and processPendingDisconnects() with _conn_mutex. The mutex is released before any blocking GATT operations (writeValue, readValue, subscribe) to avoid holding it during 10-30s NimBLE timeouts. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-04 22:54:29 -05:00
torlando-tech	ed8c08109f	Add ble_hs_synced() guard to notifyAll() Matches the guard already on notify() to prevent use-after-free of _tx_char during a NimBLE host reset. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-04 16:20:06 -05:00
torlando-tech	4ba97057c5	Remove no-op esp_task_wdt_reset() calls from NimBLEPlatform BLE task is no longer subscribed to WDT, so these 23 calls were silently returning ESP_ERR_NOT_FOUND. Removes dead code and the now-unused esp_task_wdt.h include. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-04 15:32:44 -05:00
torlando-tech	c80e63dee9	Fix Task WDT crashes: LVGL priority starvation + BLE WDT false positives Two root causes for frequent device reboots: 1. LVGL task (priority 2) starved loopTask (priority 1) on core 1. During heavy screen rendering, loopTask couldn't run for 30+ seconds, triggering the Task WDT. Fixed by lowering LVGL to priority 1 so FreeRTOS round-robins both tasks fairly. 2. BLE task was registered with the 30s Task WDT, but blocking NimBLE GATT operations (connect + service discovery + subscribe + read) can legitimately take 30-60s total. Removed BLE task from WDT since NimBLE has its own internal ~30s timeouts per GATT operation. Also added ble_hs_synced() guards to write(), read(), notify(), writeCharacteristic(), discoverServices(), and enableNotifications() to prevent use-after-free on stale NimBLE client pointers during host resets. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-04 14:11:51 -05:00
torlando-tech	46ce057a1e	BLE stability: host-controller resync, stuck GAP conn cancel, scan diagnostics After a 574 connection failure, the NimBLE controller's scan state can become corrupted (returning rc=530 / Invalid HCI Params) even after the host re-syncs. This led to scan failure escalation and device reboots. Key fixes: - Add ble_gap_conn_cancel() to enterErrorRecovery() — stuck GAP master connection operations were blocking all subsequent scans - Add ble_hs_sched_reset(BLE_HS_ECONTROLLER) in error recovery to force a full host-controller resynchronization after desync - Proactively cancel stale GAP connections before scan start - Reduce SCAN_FAIL_RECOVERY_THRESHOLD from 10 to 5 for faster recovery - Enhanced scan failure logging with GAP state diagnostics - Move ESP reset reason logging after WiFi init for UDP log visibility - Suppress connection candidate log spam when at max connections Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-03 19:57:55 -05:00
torlando-tech	2cc9441f0a	BLE stability: desync connect cooldown prevents crash-on-connect Add 30-second cooldown after NimBLE host desync recovery before allowing new connection attempts. During desync, client->connect() blocks waiting for a host-task completion event that never arrives, causing WDT crashes. The cooldown skips connection attempts while the host is desynced or recently recovered. Also adds ESP reset reason logging at boot to diagnose crash types (WDT, panic, brownout, etc.) in soak test logs. Soak test results: Run 3 (before) had 17 reboots in ~4 hours with a 12-crash-in-14-minutes loop. Run 4 (after) has 1 early reboot then 19+ hours of continuous uptime with the same desync frequency. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-03 18:34:40 -05:00
torlando-tech	74d832fb63	NimBLE patches: fix 574 stuck GAP state, add desync diagnostics Patch 3 (ble_gap.c): Handle BLE_ERR_CONN_ESTABLISHMENT (574) unconditionally. NimBLE only handled 574 under BLE_PERIODIC_ADV_WITH_RESPONSES (disabled on ESP32), causing ble_gap_master_failed() to never be called. This left the master GAP state stuck in BLE_GAP_OP_M_CONN, permanently blocking scan and advertising. Also clean up master state in the default case instead of assert(0). Patch 4 (NimBLEDevice.cpp): Expose host reset reason via global volatile int. NimBLE's onReset callback logs the reason code through ESP_LOG (serial UART only). This patch adds nimble_host_reset_reason that the BLE loop polls to capture the reason in UDP log output for remote soak test monitoring. NimBLEPlatform.cpp: Escalate persistent scan failures to full stack recovery. After 3 consecutive enterErrorRecovery() rounds fail to restore scanning (30 total scan failures), escalate to recoverBLEStack() (clean reboot) instead of looping indefinitely in a broken state. Validated with 17+ hour soak test: device recovers from desyncs and maintains 3 active BLE connections with stable heap (~43K). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-03 12:49:41 -05:00
torlando-tech	d6d4eb2c9c	BLE stability: defer disconnect processing, fix data races, harden operations Critical fixes for NimBLE host task / BLE loop task concurrency: - Defer all disconnect map cleanup from NimBLE callbacks to loop task via SPSC ring buffer, preventing iterator invalidation and use-after-free - Defer enterErrorRecovery() from callback context to loop task - Add WDT feed in enterErrorRecovery() host-sync polling loop Operational hardening: - Cache NimBLERemoteCharacteristic* pointers in write() to avoid repeated service/characteristic lookups per fragment - Add isConnected() checks before GATT operations (read, enableNotifications) - Validate peer address in notification callback to guard against handle reuse - Skip stuck-state detector during CONNECTING/CONN_STARTING states - Expire stale pending data entries after HANDSHAKE_TIMEOUT (30s) - Read actual connection RSSI via ble_gap_conn_rssi() for peripheral connections instead of hardcoding 0 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-03 00:15:24 -05:00
torlando-tech	6744eb136d	LXST voice call stability: fix hangup crash, signal queue, TX pump, mic tuning - Fix use-after-free crash on hangup: set _call_state=IDLE before deleting _lxst_audio, preventing pump_call_tx() (runs without LVGL lock) from accessing freed memory - Replace single-slot _call_signal_pending with 8-element ring buffer queue to prevent signal loss when CONNECTING+ESTABLISHED arrive in rapid succession - Extract TX pump into pump_call_tx() called right after reticulum->loop() for low-latency audio TX without LVGL lock dependency (was buried at step 10) - Tune ES7210 mic gain to 21dB (was 15dB) to improve Codec2 input level without ADC clipping that occurred at 24dB - I2S capture: use APLL for accurate 8kHz clock, direct 8kHz sampling (no more 16→8kHz decimation), DMA 16x64 for encode burst headroom - Reduce Reticulum log verbosity to LOG_INFO (was LOG_TRACE) - BLE: add ble_hs_sched_reset() tiered recovery before reboot on desync, widen supervision timeout to 4.0s for WiFi coexistence - Add UDP multicast log broadcasting and OTA flash support Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-25 10:57:14 -05:00
torlando-tech	5949cd97ff	Reduce BLE desync reboot tolerance from 5min to 90s with connections A desynced NimBLE host can't actually communicate over existing connections, so they're effectively zombies. Waiting 5 minutes left the device unresponsive. 90s gives enough time for self-recovery while avoiding prolonged dead states. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-23 16:39:43 -05:00
torlando-tech	e343caf2d2	Stability: WDT yield, BLE mutex fixes, time-based desync recovery Reduces crash rate from every 60-85s to 1 reboot per 6+ minutes. Zero WDT triggers in 10-minute stability test. BLE mutex fixes (BLEInterface.cpp): - Release _mutex before blocking GATT ops in onConnected() and onServicesDiscovered() — prevents 5-30s main-loop stalls during service discovery, notification subscribe, identity exchange - Non-blocking try_lock() for peerCount(), getConnectedPeerSummaries(), get_stats() — returns empty/default if BLE task holds mutex - Write-without-response in initiateHandshake() WDT and persistence (main.cpp, sdkconfig.defaults, microReticulum): - 30s WDT timeout (up from 10s) for SPIFFS flash I/O headroom - Register Identity::set_persist_yield_callback() to feed WDT every 5 entries during save_known_destinations() (70+ entries = 30-50s) - WDT feeds between reticulum and identity persist calls BLE host desync recovery (NimBLEPlatform): - Time-based desync tracking instead of aggressive counter-based reboot - 60s tolerance without connections, 5 minutes with active connections (data still flows over existing BLE mesh links) - Remove immediate recoverBLEStack() from 574 handler and enterErrorRecovery() — let startScan() manage reboot decision - Increase CONNECTION_COOLDOWN from 3s to 10s to reduce 574 risk - Increase SCAN_FAIL_RECOVERY_THRESHOLD from 5 to 10 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-23 12:30:30 -05:00
torlando-tech	3ca27f53f6	Task watchdog, BLE mutex fixes, NimBLE crash-safe recovery Subscribe loopTask and BLE task to the ESP32 Task Watchdog (10s timeout) to detect and recover from silent hangs. Per-step WDT feeds in the main loop prevent false triggers from cumulative slow operations. Fix BLE mutex starvation that blocked the main loop for 3-6s: - Move processDiscoveredPeers() out of performMaintenance() so _mutex is not held during blocking NimBLE connect calls - Use try_lock() in send_outgoing() to skip sends when BLE task has the mutex, rather than blocking (Reticulum retransmits) - Switch BLE data writes to write-without-response (non-blocking) - Add WDT feeds to all NimBLE blocking wait loops Replace NimBLE soft-reset recovery with immediate reboot — deinit() during sync failures caused CORRUPT HEAP panics. With atomic file persistence, data survives reboots reliably. Reduce loop task stack from 49KB to 16KB (measured peak ~6KB). Add NimBLE PHY update null guard to patch_nimble.py. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-23 10:45:43 -05:00
torlando-tech	43a7e1088f	BLE P2P stability: fix ODR violation, shutdown safety, connection robustness Fix systemic One Definition Rule violation where BLEInterface.h included headers from deps/microReticulum/src/BLE/ while .cpp files compiled against local lib/ble_interface/ versions, causing struct layout mismatches (PeerInfo field shifting corrupted conn_handle/mtu) and class layout mismatches (BLEPeerManager member differences caused LoadProhibited crash). Key fixes: - Include local BLE headers instead of deps versions in BLEInterface.h - Sync PeerInfo keepalive tracking fields and BLETypes constants with deps - Shutdown re-entrancy guard and proper client cleanup via deinit(true) - Host sync checks before scan, advertise, and connect operations - Avoid deadlock by deferring _on_connected from NimBLE host task - Duplicate identity detection, stale handle cross-check in keepalives - Bounds validation on conn_handle in setPeerHandle/promoteToIdentityKeyed - Periodic persist_data() call for display name persistence across reboots Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-23 00:24:45 -05:00
torlando-tech	869963c33c	BLE: use NimBLEClient for connections, fix service discovery and host reset Root cause: connectNative() used raw ble_gap_connect() which bypasses NimBLE's client management. The NimBLEClient created afterwards wasn't associated with the connection handle, causing service discovery to fail with "could not retrieve services". This led to a connect-disconnect loop where no BLE peers could complete handshakes. Fix: Replace raw ble_gap_connect() with NimBLEClient::connect() which properly manages the GAP event handler, connection handle tracking, MTU exchange, and service discovery. Connections now succeed with MTU 517 and identity handshakes complete. Also fixed: - Error recovery escalates to full stack reset (deinit/reinit) when NimBLE host fails to sync, instead of looping in a dead state - Added recursion guard in enterErrorRecovery() - Promoted key BLE logs (scan, connect, peer status) to INFO level for visibility during monitoring - Added 10-second serial heartbeat with connection/peer/heap stats Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-22 21:34:28 -05:00
torlando-tech	769c9952bd	BLE P2P stability: PSRAM zero-init, pool sizing, stuck-state recovery Root cause: Bytes objects stored in PSRAM-allocated BLEInterface had corrupted shared_ptr members from uninitialized memory, causing crashes in processDiscoveredPeers(). Fixed by using heap_caps_calloc instead of heap_caps_malloc for PSRAM placement-new allocation. Additional fixes: - Reduce pool sizes to fit memory budget (reassembler 134KB→17KB, fragmenters 8→4, handshakes 32→4, pending data 64→8) - Store local MAC as BLEAddress struct instead of Bytes to avoid heap allocation in PSRAM-resident object - Move setLocalMac after platform start (NimBLE needs to be running for valid random address), add lazy MAC init fallback in loop() - Add stuck-state detector: resets GAP state machine if hardware is idle but state machine thinks it's busy - Enhance getLocalAddress with 3 fallback methods (NimBLE API, ble_hs_id_copy_addr RANDOM, esp_read_mac efuse) - Fix C++17 structured binding to C++11 compatibility - Increase BLE task stack 8KB→12KB for string ops in debug logs Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-22 20:57:05 -05:00
torlando-tech	ec181e18f8	BLE interop fixes: pools, keepalive tracking, data buffering - Replace std::map and std::vector with fixed-size pools in BLEInterface (fragmenters, pending handshakes, pending data) - Track keepalive failures and disconnect after 3 consecutive - Force-disconnect zombie peers detected by BLEPeerManager - Add periodic advertising refresh (every 60s) to combat silent stops - Buffer incoming data when identity not yet mapped instead of dropping - Subtract ATT_OVERHEAD from MTU in NimBLEPlatform connection setup Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-19 00:18:02 -05:00
torlando-tech	ac6ceca9f8	Initial commit: standalone Pyxis T-Deck firmware Split T-Deck firmware from microReticulum examples/lxmf_tdeck/ into its own repo. microReticulum is consumed as a git submodule dependency pinned to feat/t-deck. All include paths updated from relative symlinks to bare includes resolved via library build flags. Both tdeck (NimBLE) and tdeck-bluedroid environments compile successfully. Licensed under AGPLv3. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-06 19:48:33 -05:00

28 Commits
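
Several of the 2026-03-04 and 2026-03-05 commits above (fffd8ec79e, 334f024179, 9bd075c91e, 20a072d258) describe one locking discipline: register an active operation while `_conn_mutex` is still held, release the mutex before the blocking GATT call, and have `processPendingDisconnects()` re-check the counter under the mutex and defer if anything is in flight. The following is a minimal host-side sketch of that idea, not the firmware's code. It assumes `std::mutex` in place of the FreeRTOS semaphore, and `FakeClient`, `ConnectionTable`, `acquire()`, `release()`, and `requestDisconnect()` are hypothetical names introduced only for illustration.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <memory>
#include <mutex>
#include <vector>

// Hypothetical stand-in for a BLE client object; not a NimBLE type.
struct FakeClient {
    int writes = 0;
    void blockingWrite() { ++writes; }  // stands in for a slow GATT write
};

// Hypothetical connection table illustrating the locking discipline.
class ConnectionTable {
public:
    void add(uint16_t handle) {
        std::lock_guard<std::mutex> lock(_conn_mutex);
        _clients[handle] = std::unique_ptr<FakeClient>(new FakeClient());
    }

    // Look up the client and register an active operation while the
    // mutex is still held, so the disconnect path cannot miss it.
    FakeClient* acquire(uint16_t handle) {
        std::lock_guard<std::mutex> lock(_conn_mutex);
        auto it = _clients.find(handle);
        if (it == _clients.end()) return nullptr;
        ++_active_ops;
        return it->second.get();
    }

    // Mark the operation as finished; called after the blocking call.
    void release() {
        assert(_active_ops.load() > 0);
        --_active_ops;
    }

    // Caller-side pattern: the blocking call runs with the mutex released.
    bool write(uint16_t handle) {
        FakeClient* client = acquire(handle);
        if (client == nullptr) return false;
        client->blockingWrite();
        release();
        return true;
    }

    void requestDisconnect(uint16_t handle) {
        std::lock_guard<std::mutex> lock(_conn_mutex);
        _pending.push_back(handle);
    }

    // Returns how many clients were deleted in this pass. If any
    // operation is in flight, nothing is erased and the pending list
    // is kept for the next pass (non-blocking deferral).
    std::size_t processPendingDisconnects() {
        std::vector<std::unique_ptr<FakeClient>> doomed;
        {
            std::lock_guard<std::mutex> lock(_conn_mutex);
            if (_active_ops.load() > 0) return 0;  // checked under the mutex
            for (uint16_t h : _pending) {
                auto it = _clients.find(h);
                if (it != _clients.end()) {
                    doomed.push_back(std::move(it->second));
                    _clients.erase(it);
                }
            }
            _pending.clear();
        }
        std::size_t n = doomed.size();
        doomed.clear();  // objects are destroyed after the mutex is released
        return n;
    }

private:
    std::mutex _conn_mutex;
    std::map<uint16_t, std::unique_ptr<FakeClient>> _clients;
    std::vector<uint16_t> _pending;
    std::atomic<int> _active_ops{0};
};
```

In this sketch, a caller that has already called `acquire()` keeps its pointer valid until `release()`, because any disconnect pass in between returns 0 and retries later. A caller that arrives after the erase simply gets `nullptr`. Whether the real firmware uses a single global counter or per-connection tracking is not stated in the log, so the single counter here is an assumption.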