Commit Graph

8 Commits

Author SHA1 Message Date
torlando-tech 76ffd29b01 feat(ble): TX/RX fragment + byte counters in BLE heartbeat
Before this change, the 10s heartbeat reported running/scanning/connected/peers
state but nothing about whether data was actually flowing. With
counters added to BLEInterface and threaded into the heartbeat
snprintf, the line now also surfaces:

  tx_pkt   — outbound RNS packets attempted
  tx_frag  — BLE fragments actually written/notified
  tx_b     — total bytes written
  tx_fail  — platform write/notify returned false
  rx_frag  — BLE fragments handed to the reassembler
  rx_b     — total bytes received

That was enough to root-cause the Columba-side stalls observed
during the BLE end-to-end testing session: pyxis showed connected=1
but tx_pkt frozen, which revealed that the keepalive loop wasn't firing
for a peer whose handshake had completed but whose identity recording
had raced. Counters are cumulative since start and never reset; they
are cheap enough to keep on always.
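
A minimal sketch of how such counters might be kept and formatted into a heartbeat line. The struct, its method names, and the output format are assumptions for illustration, not the BLEInterface code; in particular, counting a failed write only under tx_fail (not tx_frag or tx_b) is a guess based on the label descriptions above.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical counter block; field names mirror the heartbeat labels.
struct BleTrafficCounters {
    uint32_t tx_pkt = 0;   // outbound RNS packets attempted
    uint32_t tx_frag = 0;  // BLE fragments actually written/notified
    uint64_t tx_b = 0;     // total bytes written
    uint32_t tx_fail = 0;  // platform write/notify returned false
    uint32_t rx_frag = 0;  // BLE fragments handed to the reassembler
    uint64_t rx_b = 0;     // total bytes received

    // Record the outcome of one fragment write (assumed accounting rule).
    void on_fragment_write(std::size_t len, bool ok) {
        if (ok) {
            ++tx_frag;
            tx_b += len;
        } else {
            ++tx_fail;
        }
    }

    // Record one received fragment.
    void on_fragment_received(std::size_t len) {
        ++rx_frag;
        rx_b += len;
    }
};

// Format the counter portion of a heartbeat line with snprintf.
inline std::string format_heartbeat(const BleTrafficCounters& c) {
    char buf[192];
    std::snprintf(buf, sizeof(buf),
                  "tx_pkt=%lu tx_frag=%lu tx_b=%llu tx_fail=%lu rx_frag=%lu rx_b=%llu",
                  static_cast<unsigned long>(c.tx_pkt),
                  static_cast<unsigned long>(c.tx_frag),
                  static_cast<unsigned long long>(c.tx_b),
                  static_cast<unsigned long>(c.tx_fail),
                  static_cast<unsigned long>(c.rx_frag),
                  static_cast<unsigned long long>(c.rx_b));
    return std::string(buf);
}
```

With counters like these, a line showing connected=1 alongside a tx_pkt value that stops changing between heartbeats points at the send path rather than the link.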
2026-05-10 15:23:09 -04:00
torlando-tech a0ff631001 Track A.5/6/7: Identity persistence + Transport stats + Interface overrides
Three API migrations to keep the graft moving against vanilla
attermann/microReticulum @ 0.3.0:

(A.5) Identity persistence migrated to OS::set_loop_callback.
  Was:  Identity::set_persist_yield_callback(cb)        // fork-only
        Identity::should_persist_data()                 // fork-only
  Now:  RNS::Utilities::OS::set_loop_callback(cb)       // upstream global
        reticulum->should_persist_data()                // already used
  The fork's split between Identity-specific 5s fast-flush and
  Reticulum-level 60s full-persist is unified upstream into a single
  Reticulum::should_persist_data() entry point. The fast cadence is
  folded into microStore's dirty-tracking. If we observe excessive
  lost-known-destinations after crashes, revisit microStore's flush
  cadence rather than re-adding the fork-only Identity API.
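
The shape of the (A.5) migration can be sketched as one global loop hook plus one dirty-tracked persist entry point. Everything below is a stand-in written for illustration: LoopHook and StoreSketch are hypothetical classes, not the upstream RNS::Utilities::OS or microStore types, and calling the hook once per written entry is an assumption.

```cpp
#include <functional>
#include <utility>

namespace sketch {

// Stand-in for a process-wide loop hook (hypothetical; not the upstream class).
class LoopHook {
public:
    static void set_loop_callback(std::function<void()> cb) { slot() = std::move(cb); }
    static void run() {
        if (slot()) slot()();
    }

private:
    static std::function<void()>& slot() {
        static std::function<void()> cb;
        return cb;
    }
};

// Stand-in for a store with dirty tracking and a single persist entry point.
class StoreSketch {
public:
    void mark_dirty() { dirty_ = true; }

    // Writes only when something changed; returns the number of entries written.
    int should_persist_data(int entries) {
        if (!dirty_) return 0;
        for (int i = 0; i < entries; ++i) {
            // ... write entry i to storage ...
            LoopHook::run();  // lets the application feed a watchdog, etc.
        }
        dirty_ = false;
        return entries;
    }

private:
    bool dirty_ = false;
};

}  // namespace sketch
```

The point of the sketch is that the application registers a single callback and calls a single persist function; whether a write happens is decided by the dirty flag rather than by a second, Identity-specific timer.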

(A.6) Transport stats diagnostics disabled — vanilla upstream doesn't
  expose the *_count() getter family the fork added. Two [TABLES]
  diagnostic blocks in main.cpp now print a placeholder. Restore by
  porting to upstream's get_path_table().size() and friends, or PR the
  getters back to upstream Transport. Tracked in
  pyxis_microReticulum_graft_spike_findings.md.

(A.7) BLE/SX1262 Interface stat methods are no longer virtual overrides.
  Vanilla upstream Interface base class doesn't declare get_stats /
  get_rssi / get_snr. Kept the methods as plain (non-virtual)
  BLEInterface / SX1262Interface members; callers needing stats access
  must hold the concrete type, not the base Interface*. Propose an
  upstream PR adding them to the base API if polymorphic access matters.
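
A small sketch of what (A.7) means for callers, using stand-in classes. The Interface base and BLEInterfaceSketch below are simplified assumptions, not the upstream or Pyxis definitions; only the get_rssi name comes from the message above.

```cpp
// Stand-in for a base class that does not declare any stat getters.
class Interface {
public:
    virtual ~Interface() = default;
    virtual const char* name() const = 0;
};

// Stand-in for a concrete interface that keeps its own stat accessor.
class BLEInterfaceSketch : public Interface {
public:
    explicit BLEInterfaceSketch(float rssi) : rssi_(rssi) {}

    const char* name() const override { return "ble"; }

    // Plain member: no `virtual`, no `override`, because the base class
    // has no get_rssi() to override.
    float get_rssi() const { return rssi_; }

private:
    float rssi_;
};

// Stats code must take the concrete type. Given only an Interface*,
// a call to get_rssi() would not compile.
inline float read_rssi(const BLEInterfaceSketch& ble) {
    return ble.get_rssi();
}
```

In practice this means whoever creates the interface keeps a pointer or reference of the concrete type for diagnostics, while the rest of the stack continues to see it through the base class.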

Also: setLogCallback -> set_log_callback (renamed in upstream commit
4d6f0b9 "Added dual-class PSRAM/TLSF allocator system").

Pyxis still doesn't build — next failures (4 distinct):
  - OS::register_filesystem signature changed to microStore::FileSystem&.
    Real microStore migration needed for UniversalFileSystem.
  - LXMRouter::process_sync still missing despite vendored src-shim copy.
    Include-order or shadowing — needs investigation.
  - MEMORY_MONITOR_POLL macro not picked up despite -I src-shim/Instrumentation.
  - Identity::should_persist_data appears to still be referenced via
    LXMF or another vendored layer — would surface once the above land.
2026-05-04 20:25:02 -04:00
torlando-tech d6d4eb2c9c BLE stability: defer disconnect processing, fix data races, harden operations
Critical fixes for NimBLE host task / BLE loop task concurrency:
- Defer all disconnect map cleanup from NimBLE callbacks to loop task via
  SPSC ring buffer, preventing iterator invalidation and use-after-free
- Defer enterErrorRecovery() from callback context to loop task
- Add WDT feed in enterErrorRecovery() host-sync polling loop
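
The deferral pattern above can be sketched with a single-producer/single-consumer ring: the callback context only pushes an event, and the loop task pops and does the map cleanup. This is a generic sketch; the SpscRing and DisconnectEvent names and fields are assumptions, not the Pyxis implementation.

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <cstdint>

// Lock-free ring for exactly one producer thread and one consumer thread.
// Usable capacity is N - 1 because one slot distinguishes full from empty.
template <typename T, std::size_t N>
class SpscRing {
    static_assert(N >= 2, "ring needs at least two slots");

public:
    // Producer side (e.g. a host-task callback). Returns false when full.
    bool push(const T& value) {
        const std::size_t head = head_.load(std::memory_order_relaxed);
        const std::size_t next = (head + 1) % N;
        if (next == tail_.load(std::memory_order_acquire)) return false;
        buf_[head] = value;
        head_.store(next, std::memory_order_release);
        return true;
    }

    // Consumer side (e.g. the loop task). Returns false when empty.
    bool pop(T& out) {
        const std::size_t tail = tail_.load(std::memory_order_relaxed);
        if (tail == head_.load(std::memory_order_acquire)) return false;
        out = buf_[tail];
        tail_.store((tail + 1) % N, std::memory_order_release);
        return true;
    }

private:
    std::array<T, N> buf_{};
    std::atomic<std::size_t> head_{0};
    std::atomic<std::size_t> tail_{0};
};

// Hypothetical event recorded by the callback and processed later.
struct DisconnectEvent {
    uint16_t conn_handle;
    int reason;
};
```

Because the callback never touches the peer maps, the loop task can iterate and erase entries without another context invalidating its iterators.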

Operational hardening:
- Cache NimBLERemoteCharacteristic* pointers in write() to avoid repeated
  service/characteristic lookups per fragment
- Add isConnected() checks before GATT operations (read, enableNotifications)
- Validate peer address in notification callback to guard against handle reuse
- Skip stuck-state detector during CONNECTING/CONN_STARTING states
- Expire stale pending data entries after HANDSHAKE_TIMEOUT (30s)
- Read actual connection RSSI via ble_gap_conn_rssi() for peripheral connections
  instead of hardcoding 0

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-03 00:15:24 -05:00
torlando-tech e343caf2d2 Stability: WDT yield, BLE mutex fixes, time-based desync recovery
Reduces crash rate from one reboot every 60-85s to one reboot per 6+ minutes.
Zero WDT triggers in 10-minute stability test.

BLE mutex fixes (BLEInterface.cpp):
- Release _mutex before blocking GATT ops in onConnected() and
  onServicesDiscovered() — prevents 5-30s main-loop stalls during
  service discovery, notification subscribe, identity exchange
- Non-blocking try_lock() for peerCount(), getConnectedPeerSummaries(),
  get_stats() — returns empty/default if BLE task holds mutex
- Write-without-response in initiateHandshake()
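
The non-blocking accessor idea can be sketched with std::unique_lock and std::try_to_lock. PeerTableSketch is a hypothetical stand-in for the real class, and returning 0 when the lock is unavailable is the "empty/default" behavior described above.

```cpp
#include <cstddef>
#include <mutex>
#include <vector>

// Hypothetical stand-in for an object shared between a BLE task and a UI loop.
class PeerTableSketch {
public:
    // Writer path: may block, intended for the task that owns the data.
    void add_peer(int id) {
        std::lock_guard<std::mutex> lock(mutex_);
        peers_.push_back(id);
    }

    // Reader path: never blocks. If another thread holds the mutex,
    // report a default value instead of stalling the caller.
    std::size_t peerCount() {
        std::unique_lock<std::mutex> lock(mutex_, std::try_to_lock);
        if (!lock.owns_lock()) return 0;
        return peers_.size();
    }

private:
    std::mutex mutex_;
    std::vector<int> peers_;
};
```

The trade-off is that a caller can briefly see a stale or empty answer, which is acceptable for display code but not for logic that must act on the exact peer set.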

WDT and persistence (main.cpp, sdkconfig.defaults, microReticulum):
- 30s WDT timeout (up from 10s) for SPIFFS flash I/O headroom
- Register Identity::set_persist_yield_callback() to feed WDT every
  5 entries during save_known_destinations() (70+ entries take 30-50s)
- WDT feeds between reticulum and identity persist calls
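
A sketch of the yield-every-N-entries idea: a long save loop invokes a caller-supplied callback at a fixed interval so the application can feed the watchdog. The function name and signature are assumptions; only the interval of 5 entries comes from the message above.

```cpp
#include <cstddef>
#include <functional>

// Interval taken from the commit message: yield every 5 entries.
constexpr std::size_t YIELD_EVERY = 5;

// Hypothetical save loop. Returns how many times the callback was invoked.
inline std::size_t save_entries(std::size_t count,
                                const std::function<void()>& yield_cb) {
    std::size_t yields = 0;
    for (std::size_t i = 0; i < count; ++i) {
        // ... write entry i to flash (slow) ...
        if (yield_cb && (i + 1) % YIELD_EVERY == 0) {
            yield_cb();  // e.g. feed the task watchdog
            ++yields;
        }
    }
    return yields;
}
```

For a 70-entry table this yields 14 times, so no single stretch between watchdog feeds covers more than five writes.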

BLE host desync recovery (NimBLEPlatform):
- Time-based desync tracking instead of aggressive counter-based reboot
- 60s tolerance without connections, 5 minutes with active connections
  (data still flows over existing BLE mesh links)
- Remove immediate recoverBLEStack() from 574 handler and
  enterErrorRecovery() — let startScan() manage reboot decision
- Increase CONNECTION_COOLDOWN from 3s to 10s to reduce 574 risk
- Increase SCAN_FAIL_RECOVERY_THRESHOLD from 5 to 10
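
The time-based tolerance can be sketched as a small tracker that records when desync began and compares elapsed time against one of two limits. The class and constant names are assumptions; the 60s and 5 minute values come from the list above.

```cpp
#include <cstdint>

// Tolerances from the commit message.
constexpr uint32_t DESYNC_TOLERANCE_IDLE_MS = 60u * 1000u;
constexpr uint32_t DESYNC_TOLERANCE_CONNECTED_MS = 5u * 60u * 1000u;

// Hypothetical tracker: remembers when the host first went out of sync.
class DesyncTracker {
public:
    // Call periodically with the current sync state and a millisecond clock.
    void on_sync_state(bool synced, uint32_t now_ms) {
        if (synced) {
            desynced_ = false;
        } else if (!desynced_) {
            desynced_ = true;
            since_ms_ = now_ms;  // start of the current desync episode
        }
    }

    // True once the desync has outlasted the applicable tolerance.
    bool should_reboot(uint32_t now_ms, bool has_connections) const {
        if (!desynced_) return false;
        const uint32_t limit = has_connections ? DESYNC_TOLERANCE_CONNECTED_MS
                                               : DESYNC_TOLERANCE_IDLE_MS;
        // Unsigned subtraction stays correct across clock wraparound.
        return (now_ms - since_ms_) >= limit;
    }

private:
    bool desynced_ = false;
    uint32_t since_ms_ = 0;
};
```

Compared with a failure counter, a brief burst of errors no longer triggers a reboot on its own; only a desync that persists past the limit does.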

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-23 12:30:30 -05:00
torlando-tech 43a7e1088f BLE P2P stability: fix ODR violation, shutdown safety, connection robustness
Fix systemic One Definition Rule violation where BLEInterface.h included
headers from deps/microReticulum/src/BLE/ while .cpp files compiled
against local lib/ble_interface/ versions, causing struct layout mismatches
(PeerInfo field shifting corrupted conn_handle/mtu) and class layout
mismatches (BLEPeerManager member differences caused LoadProhibited crash).

Key fixes:
- Include local BLE headers instead of deps versions in BLEInterface.h
- Sync PeerInfo keepalive tracking fields and BLETypes constants with deps
- Shutdown re-entrancy guard and proper client cleanup via deinit(true)
- Host sync checks before scan, advertise, and connect operations
- Avoid deadlock by deferring _on_connected from NimBLE host task
- Duplicate identity detection, stale handle cross-check in keepalives
- Bounds validation on conn_handle in setPeerHandle/promoteToIdentityKeyed
- Periodic persist_data() call for display name persistence across reboots

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-23 00:24:45 -05:00
torlando-tech 769c9952bd BLE P2P stability: PSRAM zero-init, pool sizing, stuck-state recovery
Root cause: Bytes objects stored in PSRAM-allocated BLEInterface had
corrupted shared_ptr members from uninitialized memory, causing crashes
in processDiscoveredPeers(). Fixed by using heap_caps_calloc instead of
heap_caps_malloc for PSRAM placement-new allocation.
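
The allocation pattern described above (zeroed raw memory followed by placement-new) can be sketched portably. Here std::calloc stands in for heap_caps_calloc so the example runs off-target, and PeerState is a hypothetical type; the sketch shows the pattern only and does not reproduce the original crash.

```cpp
#include <cstddef>
#include <cstdlib>
#include <memory>
#include <new>

// Hypothetical object holding a shared_ptr member.
struct PeerState {
    std::shared_ptr<int> payload;
    int conn_handle = -1;
};

// Allocate zero-filled storage, then construct T in place.
// std::calloc is a stand-in for a capability-aware zeroing allocator.
template <typename T>
T* create_zeroed() {
    static_assert(alignof(T) <= alignof(std::max_align_t),
                  "calloc alignment is insufficient for T");
    void* mem = std::calloc(1, sizeof(T));
    if (mem == nullptr) return nullptr;
    return new (mem) T();
}

// Destroy and free an object created by create_zeroed().
template <typename T>
void destroy_zeroed(T* obj) {
    if (obj == nullptr) return;
    obj->~T();
    std::free(obj);
}
```

Objects created this way must be released with the matching destroy function, since a plain delete would pass calloc'd memory to the wrong deallocator.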

Additional fixes:
- Reduce pool sizes to fit memory budget (reassembler 134KB→17KB,
  fragmenters 8→4, handshakes 32→4, pending data 64→8)
- Store local MAC as BLEAddress struct instead of Bytes to avoid
  heap allocation in PSRAM-resident object
- Move setLocalMac after platform start (NimBLE needs to be running
  for valid random address), add lazy MAC init fallback in loop()
- Add stuck-state detector: resets GAP state machine if hardware
  is idle but state machine thinks it's busy
- Enhance getLocalAddress with 3 fallback methods (NimBLE API,
  ble_hs_id_copy_addr RANDOM, esp_read_mac efuse)
- Replace C++17 structured bindings with C++11-compatible code
- Increase BLE task stack 8KB→12KB for string ops in debug logs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-22 20:57:05 -05:00
torlando-tech ec181e18f8 BLE interop fixes: pools, keepalive tracking, data buffering
- Replace std::map and std::vector with fixed-size pools in
  BLEInterface (fragmenters, pending handshakes, pending data)
- Track keepalive failures and disconnect after 3 consecutive failures
- Force-disconnect zombie peers detected by BLEPeerManager
- Add periodic advertising refresh (every 60s) to combat silent stops
- Buffer incoming data when identity not yet mapped instead of dropping
- Subtract ATT_OVERHEAD from MTU in NimBLEPlatform connection setup
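
A sketch of a fixed-size pool as a replacement for a heap-backed std::map. The FixedPool template and its uint16_t key (imagined here as a connection handle) are assumptions for illustration, not the BLEInterface pools.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Fixed-capacity keyed pool: no heap allocation, linear lookup over N slots.
template <typename V, std::size_t N>
class FixedPool {
public:
    // Return the slot for `key`, claiming a free one if needed.
    // Returns nullptr when every slot is in use.
    V* acquire(uint16_t key) {
        if (V* existing = find(key)) return existing;
        for (auto& s : slots_) {
            if (!s.used) {
                s.used = true;
                s.key = key;
                s.value = V{};
                return &s.value;
            }
        }
        return nullptr;
    }

    // Look up an existing entry; nullptr if absent.
    V* find(uint16_t key) {
        for (auto& s : slots_) {
            if (s.used && s.key == key) return &s.value;
        }
        return nullptr;
    }

    // Free the slot for `key`; returns false if it was not present.
    bool release(uint16_t key) {
        for (auto& s : slots_) {
            if (s.used && s.key == key) {
                s.used = false;
                return true;
            }
        }
        return false;
    }

private:
    struct Slot {
        bool used = false;
        uint16_t key = 0;
        V value{};
    };
    std::array<Slot, N> slots_{};
};
```

Memory use is fixed at compile time and exhaustion becomes an explicit nullptr the caller must handle, instead of an allocation failure at an arbitrary point.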

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-19 00:18:02 -05:00
torlando-tech ac6ceca9f8 Initial commit: standalone Pyxis T-Deck firmware
Split T-Deck firmware from microReticulum examples/lxmf_tdeck/ into its
own repo. microReticulum is consumed as a git submodule dependency pinned
to feat/t-deck. All include paths updated from relative symlinks to bare
includes resolved via library build flags.

Both tdeck (NimBLE) and tdeck-bluedroid environments compile successfully.
Licensed under AGPLv3.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-06 19:48:33 -05:00