Mirror of https://github.com/simplex-chat/simplexmq.git (synced 2026-03-30). Commit: update specs.
`retrySubActions` holds the list of subs-to-retry in a `TVar`. Each iteration, the action function returns only the subs that got temporary errors (via `splitResults`). The `TVar` is overwritten with this shrinking list. On success or permanent error, subs drop out. This means retry batches get smaller over time.

`splitResults` implements a three-way partition: temporary or host errors → retry, permanent errors → null the action + notify, successes → continue pipeline.

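A minimal sketch of the partition, using a stand-in error type (`Err` and the signature are illustrative, not the actual simplexmq types):

```haskell
-- Sketch only: Err and splitResults are illustrative stand-ins,
-- not the actual simplexmq types.
data Err = ErrTemporary | ErrHost | ErrPermanent deriving (Eq, Show)

-- First list = retry later, second = permanent failure (null the action
-- + notify), third = success, continue the pipeline.
splitResults :: [(sub, Either Err r)] -> ([sub], [(sub, Err)], [(sub, r)])
splitResults = foldr step ([], [], [])
  where
    step (s, Left e) (retry, perm, ok)
      | e == ErrTemporary || e == ErrHost = (s : retry, perm, ok)
      | otherwise = (retry, (s, e) : perm, ok)
    step (s, Right r) (retry, perm, ok) = (retry, perm, (s, r) : ok)
```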
### 3. rescheduleWork deferred wake-up

This is the mechanism for time-scheduled subscription health checks.

When the notification router returns `AUTH` for a subscription check, the subscription is not simply marked as failed — it is fully recreated from scratch by resetting to `NSASMP NSASmpKey` state. This handles the case where the notification router has lost its subscription state (restart, data loss). The SMP worker is kicked to re-establish notifier credentials.

Successful check responses with statuses not in `subscribeNtfStatuses` also trigger recreation via `recreateNtfSub`.

### 5. deleteToken two-phase with restart survival

### 8. Stats counting groups by userId

`incStatByUserId` groups batch subscriptions by `userId` before incrementing stats counters, ensuring per-user counts are accurate even when a single batch contains subscriptions from multiple users.

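The grouping step can be sketched as follows (names and types are assumptions, not the simplexmq API):

```haskell
-- Hypothetical sketch of grouping a batch by userId before bumping
-- per-user counters; names are assumptions, not the simplexmq API.
import qualified Data.Map.Strict as M

type UserId = Int

-- Each user's counter is incremented once per subscription it owns,
-- so a mixed batch yields accurate per-user totals.
countByUser :: [(UserId, sub)] -> M.Map UserId Int
countByUser subs = M.fromListWith (+) [(uId, 1) | (uId, _) <- subs]
```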
### 9. sendNtfSubCommand — gated on instant mode

`sendNtfSubCommand` only enqueues work if instant notifications are active (`hasInstantNotifications` checks `NTActive` status + `NMInstant` mode). In periodic mode, the entire subscription creation pipeline is dormant — no commands reach the supervisor.

### 10. deleteNotifierKeys — credential reset before disable

`resetCredsGetQueue` clears the queue's notification credentials in the store *before* sending the disable command to the SMP router. This "clean first" ordering means local state is already consistent even if the network call fails.

### 11. runNtfTknDelWorker — permanent error discards record

When token deletion gets a permanent (non-temporary, non-host) error, the deletion record is removed from the queue rather than retried. This prevents stuck deletion records from blocking the worker. The error is reported to the client.

### 12. getNtfServer — random selection from multiple

When multiple notification routers are configured, one is selected randomly using `randomR` with a session-stable `TVar` generator. Single-server configurations skip the randomness.

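A sketch of the selection, substituting a simple LCG step for `randomR`/`StdGen` to stay self-contained (the helper name and generator representation are assumptions):

```haskell
-- Sketch with a homemade LCG step in place of System.Random's randomR
-- (the real code uses randomR with a TVar generator); names are assumptions.
import Control.Concurrent.STM

pickServer :: TVar Int -> [srv] -> STM srv
pickServer _ [srv] = pure srv             -- single server: skip the randomness
pickServer gVar srvs = do
  g <- readTVar gVar
  let g' = 6364136223846793005 * g + 1442695040888963407  -- LCG step
  writeTVar gVar g'                       -- session-stable generator state
  pure (srvs !! (abs g' `mod` length srvs))
```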
### 13. closeNtfSupervisor — atomic swap then cancel

`swapTVar` atomically replaces the workers map with empty, then cancels all extracted workers. This ensures all existing workers at the point of shutdown are captured for cancellation. Prevention of new work is handled by the supervisor loop termination and operation bracket lifecycle, not by the swap itself.

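The swap-then-cancel shutdown can be sketched with plain `ThreadId`s (the real code manages its own worker type; names here are assumptions):

```haskell
-- Hypothetical sketch of "atomic swap then cancel"; the worker map type
-- and names are assumptions, not the simplexmq types.
import Control.Concurrent (ThreadId, forkIO, killThread, threadDelay)
import Control.Concurrent.STM
import qualified Data.Map.Strict as M

closeWorkers :: TVar (M.Map k ThreadId) -> IO ()
closeWorkers wsVar = do
  ws <- atomically $ swapTVar wsVar M.empty  -- capture and empty in one step
  mapM_ killThread (M.elems ws)              -- cancel everything captured
```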
## RcvQueueSub — subscription-optimized projection

`RcvQueueSub` strips cryptographic fields from `RcvQueue`, keeping only what's needed for subscription tracking in [TSessionSubs](./TSessionSubs.md). This reduces memory pressure when tracking thousands of subscriptions in STM.

## rcvSMPQueueAddress exposes sender-facing ID

`rcvSMPQueueAddress` constructs the `SMPQueueAddress` from a receive queue using `sndId` (not `rcvId`). The address shared with senders in connection requests contains the sender ID, the public key derived from `e2ePrivKey`, and `queueMode`. The `rcvId` is never exposed externally.

## enableNtfs is duplicated between queue and connection

`enableNtfs` exists on both `StoredRcvQueue` and `ConnData`. The comment marks it as "duplicated from ConnData." The queue-level copy enables subscription operations (which work at the queue level) to check notification status without loading the full connection.

## deleteErrors — queue deletion retry counter

`StoredRcvQueue` has a `deleteErrors :: Int` field that counts failed deletion attempts. This allows the agent to give up on queue deletion after repeated failures rather than retrying indefinitely.

## Two-level message preparation

`SndMsgData` optionally carries `SndMsgPrepData` with a `sndMsgBodyId` reference to a separately stored message body. `PendingMsgData` optionally carries `PendingMsgPrepData` with the actual `AMessage` body. This split allows large message bodies to be stored once and referenced by ID during the send pipeline, avoiding redundant serialization.

## Per-message retry backoff

`PendingMsgData` includes `msgRetryState :: Maybe RI2State` — each pending message independently tracks its retry backoff state. This means messages that fail to send don't reset the retry timers of other pending messages in the same connection.

## Soft deletion and optional contact connection

`ConnData` has `deleted :: Bool` for soft deletion — connections are marked deleted before queue cleanup completes. `Invitation` has `contactConnId_ :: Maybe ConnId` (note the trailing underscore) — invitations can outlive their originating contact connection.

## SEBadQueueStatus is vestigial

`SEBadQueueStatus` is documented in the source as "Currently not used." It was intended for queue status transition validation but was never implemented.

At ~3700 lines, this is the largest module in the codebase. It implements all database operations for the agent, compiled with CPP for both SQLite and PostgreSQL backends. Most functions are straightforward SQL CRUD, but several patterns are non-obvious.

The module re-exports `withConnection`, `withTransaction`, `withTransactionPriority`, `firstRow`, `firstRow'`, and `maybeFirstRow` from the backend-specific Common module. It also exports `fromOnlyBI` (a local helper) and `getWorkItem`/`getWorkItems`.

## Dual-backend compilation

The module uses `#if defined(dbPostgres)` throughout.

## getWorkItem / getWorkItems — worker store pattern

`getWorkItem` implements the store-side pattern for the [worker framework](../Client.md): `getId → getItem → markFailed`. If `getId` throws an IO exception, `handleWrkErr` wraps it as `SEWorkItemError` (via `mkWorkItemError`), which signals the worker to suspend rather than retry. If `getItem` fails (returning Left or throwing), `tryGetItem` calls `markFailed` (also wrapped by `handleWrkErr`) and rethrows the original error. This prevents crash loops on corrupt data.

`getWorkItems` extends this to batch work items, where each item failure is independent.

**Consumed by**: `getPendingQueueMsg`, `getPendingServerCommand`, `getNextNtfSubNTFActions`, `getNextNtfSubSMPActions`, `getNextDeletedSndChunkReplica`, `getNextNtfTokenToDelete`, `getNextRcvChunkToDownload`, `getNextRcvFileToDecrypt`, `getNextSndChunkToUpload`, `getNextSndFileToPrepare`.

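A minimal sketch of the `getId → getItem → markFailed` sequence (signatures and names are assumptions; the real code additionally wraps failures via `handleWrkErr`/`mkWorkItemError`):

```haskell
-- Hypothetical sketch of the work-item pattern; names and error handling
-- are assumptions, not the simplexmq API.
import Control.Exception (SomeException, throwIO, try)

getWorkItem ::
     IO (Maybe itemId)       -- getId: next pending id, if any
  -> (itemId -> IO item)     -- getItem: load the item (may throw)
  -> (itemId -> IO ())       -- markFailed: record the failure
  -> IO (Maybe item)
getWorkItem getId getItem markFailed = do
  mId <- getId
  case mId of
    Nothing -> pure Nothing
    Just iId -> do
      r <- try (getItem iId)
      case r of
        Right item -> pure (Just item)
        Left e -> do
          markFailed iId                 -- mark so the item is skipped next time
          throwIO (e :: SomeException)   -- rethrow the original error
```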
## Notification subscription — supervisor/worker coordination
## deleteConn — conditional delivery wait

Four deletion paths:

1. No timeout: immediate delete.
2. Timeout + no pending deliveries: immediate delete.
3. Timeout + pending deliveries + `deleted_at_wait_delivery` expired: delete.
4. Timeout + pending deliveries + not expired: return `Nothing` (skip deletion).

This allows graceful delivery completion before connection cleanup.

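The four paths can be sketched as a pure decision function (the types are illustrative stand-ins, not the simplexmq API):

```haskell
-- Sketch of the four-way decision; types are illustrative stand-ins.
data Decision = DeleteNow | Skip deriving (Eq, Show)

decideDelete ::
     Maybe Int  -- wait-delivery timeout, if any
  -> Bool       -- pending deliveries exist?
  -> Bool       -- deleted_at_wait_delivery expired?
  -> Decision
decideDelete Nothing  _     _     = DeleteNow  -- 1. no timeout
decideDelete (Just _) False _     = DeleteNow  -- 2. nothing pending
decideDelete (Just _) True  True  = DeleteNow  -- 3. wait expired
decideDelete (Just _) True  False = Skip       -- 4. still waiting
```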
## setRcvQueuePrimary / setSndQueuePrimary — two-step primary swap

First clears primary flag on all queues in the connection, then sets it on the target queue. Also clears `replace_*_queue_id` on the new primary — this completes the queue rotation by removing the "replacing" marker.

## checkConfirmedSndQueueExists_ — dpPostgres typo

The CPP guard reads `#if defined(dpPostgres)` (note `dp` instead of `db`). This means the `FOR UPDATE` clause is never included for any backend. The check still works correctly for SQLite (single-writer model) but on PostgreSQL the query runs without row locking, which could allow a TOCTOU race between checking and inserting.

## createCommand — silent drop for deleted connections

When `createCommand` encounters a constraint violation (the referenced connection was already deleted), it logs the error and returns successfully rather than throwing. This means commands targeting deleted connections are silently dropped. The rationale: the connection is already gone, so there's nothing useful to do with the error.

## updateNewConnRcv — retry tolerance

`updateNewConnRcv` accepts both `NewConnection` and `RcvConnection` connection states. The `RcvConnection` case is explicitly commented as "to allow retries" — if the initial queue insertion succeeded but the caller didn't get the response, a retry would find the connection already upgraded. `updateNewConnSnd` does not have this tolerance.

## setLastBrokerTs — monotonic advance

The WHERE clause includes `AND (last_broker_ts IS NULL OR last_broker_ts < ?)`, which ensures the timestamp only moves forward. Out-of-order message processing (e.g., from different queues) cannot regress the broker timestamp.

## deleteDeliveredSndMsg — FOR UPDATE + count zero check

On PostgreSQL, acquires a `FOR UPDATE` lock on the message row before counting pending deliveries. This prevents a race where two concurrent delivery completions both see count > 0 before either deletes, then both try to delete. Only deletes the message when the count reaches exactly 0.

## createWithRandomId' — savepoint-based retry

Uses `withSavepoint` around each insertion attempt rather than bare execute. This is critical for PostgreSQL: a failed statement within a transaction aborts the entire transaction, but savepoints allow rolling back just the failed INSERT and retrying with a new ID.

## Explicit row-lock functions

`lockConnForUpdate`, `lockRcvFileForUpdate`, and `lockSndFileForUpdate` are PostgreSQL-only explicit lock acquisition that compile to no-ops on SQLite. They acquire `FOR UPDATE` locks on rows that need serialized access without modifying them.

## XFTP work item retry ordering

`getNextRcvChunkToDownload` and `getNextSndChunkToUpload` order by `retries ASC, created_at ASC`. This prioritizes chunks with fewer retries, ensuring a repeatedly-failing chunk doesn't starve others. Same pattern for `getNextDeletedSndChunkReplica`.

## getRcvFileRedirects — error resilience

When loading redirect chains, errors loading individual redirect files are silently swallowed (`either (const $ pure Nothing) (pure . Just)`). This prevents a corrupt redirect from blocking access to the main file.

## enableNtfs defaults to True when NULL

Both `toRcvQueue` and `rowToConnData` default `enableNtfs` to `True` when the database value is NULL (`maybe True unBI enableNtfs_`). This is a backward-compatibility default for connections created before the field existed.

## primaryFirst — queue ordering

The `primaryFirst` comparator sorts queues with the primary queue first (`Down` on primary flag), then by `dbReplaceQId` to place the "replacing" queue second. This ensures all queue lists are consistently ordered for connection reconstruction.

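A sketch of such a comparator using `Down` (the record type and field names are assumptions, not the simplexmq types):

```haskell
-- Hypothetical sketch of a primary-first comparator; field names are
-- assumptions, not the simplexmq types.
import Data.List (sortBy)
import Data.Ord (Down (..), comparing)

data Q = Q { primary :: Bool, replaceQId :: Maybe Int } deriving (Eq, Show)

-- Down reverses the ordering: the primary queue (True) sorts first, and
-- among the rest, queues carrying a replace marker (Just _) sort next.
primaryFirst :: [Q] -> [Q]
primaryFirst = sortBy (comparing (Down . primary) <> comparing (Down . replaceQId))
```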
## getAnyConn_ — connection GADT reconstruction

Reconstructs the type-level `Connection'` GADT by combining connection mode with the presence/absence of receive and send queues. The `CMContact` mode only maps to `ContactConnection` (receive-only); all other combinations use `CMInvitation` mode. When neither rcv nor snd queues exist, the result is always `NewConnection` regardless of mode.

## deleteNtfSubscription — soft delete when supervisor active

When `updated_by_supervisor` is true, `deleteNtfSubscription` doesn't actually delete the row. Instead, it nulls out the IDs and sets status to `NASDeleted`, preserving the row for the supervisor to observe. Only when the supervisor has not intervened does it perform a real DELETE.

## simplex_xor_md5_combine — custom SQLite function

A C-exported SQLite function registered at connection time. Takes an existing `IdsHash` and a `RecipientId`, XORs the hash with the MD5 of the ID. This is the SQLite implementation of the accumulative IdsHash used by service subscriptions (see [TSessionSubs.md](../TSessionSubs.md#updateActiveService--accumulative-xor-merge)). PostgreSQL uses `pgcrypto`'s `digest()` function for MD5 and a custom `xor_combine` PL/pgSQL function for the XOR.

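The XOR-merge step can be sketched as follows (MD5 hashing omitted to stay self-contained; the real function XORs the hash with the MD5 of the id):

```haskell
-- Sketch of the accumulative XOR merge without the MD5 step; the real
-- implementations XOR with the MD5 digest of the RecipientId.
import Data.Bits (xor)
import qualified Data.ByteString as B

xorCombine :: B.ByteString -> B.ByteString -> B.ByteString
xorCombine a b = B.pack (B.zipWith xor a b)
-- XOR is self-inverse and order-independent, so ids can be added to and
-- removed from the accumulated hash in any order.
```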
## openSQLiteStore_ — connection swap under MVar

## storeKey — conditional key retention

`storeKey key keepKey` stores the encryption key in the `dbKey` TVar if `keepKey` is true or if the key is empty (no encryption). This means unencrypted stores can always be reopened. If `keepKey` is false and the key is non-empty, `reopenDBStore` fails with "no key".

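A sketch of the retention rule (the `TVar (Maybe ByteString)` layout is an assumption):

```haskell
-- Hypothetical sketch of conditional key retention; the TVar layout
-- is an assumption, not the simplexmq types.
import Control.Concurrent.STM
import qualified Data.ByteString as B

storeKey :: TVar (Maybe B.ByteString) -> B.ByteString -> Bool -> IO ()
storeKey keyVar key keepKey =
  atomically . writeTVar keyVar $
    if keepKey || B.null key  -- empty key means an unencrypted store
      then Just key
      else Nothing
```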
## dbBusyLoop — initial connection retry

`connectSQLiteStore` wraps `connectDB` in `dbBusyLoop` to handle database locking during initial connection. All transactions (`withTransactionPriority`) are also wrapped in `dbBusyLoop` as a retry layer on top of the `busy_timeout` PRAGMA.

### 2. Batch operations return per-item errors

`ntfCreateSubscriptions` and `ntfCheckSubscriptions` return `NonEmpty (Either NtfClientError result)` — individual items in a batch can fail independently. Callers must handle partial success (some created, some failed). The singular variants throw on any error.

### 3. Default port is 443

`defaultNTFClientConfig` sets the default transport to `("443", transport @TLS)`. Unlike the SMP protocol which typically uses port 5223, the NTF protocol defaults to the standard HTTPS port.

### 4. okNtfCommand parameter ordering

`okNtfCommand` has an unusual parameter order — the command comes first, then client, mode, key, entityId. This enables partial application in the `ntfDeleteToken`, `ntfVerifyToken` etc. definitions, where the command is fixed and the remaining parameters flow through.

### 7. DeviceToken hex validation

`DeviceToken` string parsing has two paths: a hardcoded literal match for `"apns_null test_ntf_token"` (test tokens), and hex string validation for real tokens (must be even-length hex). The wire encoding (`smpP`) does not perform this validation — it accepts any `ByteString`.

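The even-length hex check can be sketched as follows (illustrative only; the real parser works on `ByteString` via attoparsec-style combinators):

```haskell
-- Sketch of even-length hex validation; the real simplexmq parser differs.
import Data.Char (isHexDigit)

validTokenHex :: String -> Bool
validTokenHex s = not (null s) && even (length s) && all isHexDigit s
```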
### 8. SMPQueueNtf parsing applies updateSMPServerHosts

Both `smpP` and `strP` for `SMPQueueNtf` apply `updateSMPServerHosts` to the parsed SMP server. This normalizes server host addresses on deserialization, ensuring consistent comparison even if the on-wire format uses different host representations.

### 9. NRTknId response tag comment

The `NRTknId_` tag encodes as `"IDTKN"` with a source comment: "it should be 'TID', 'SID'". This indicates a naming inconsistency that was preserved for backward compatibility — the tag names don't follow the pattern of other NTF protocol tags.

### 10. useServiceAuth is False

The `Protocol` instance explicitly returns `False` for `useServiceAuth`, meaning the NTF protocol never uses service-level authentication. All authentication is entity-level (per token/subscription).

### 2. TNEW idempotent re-registration

When TNEW is received for an already-registered token, the server:

1. Looks up the existing token via `findNtfTokenRegistration` (matches on push provider, device token, AND verify key)
2. Verifies the DH secret matches (recomputed from the new `dhPubKey` and stored `tknDhPrivKey`)
3. If DH secrets differ → AUTH error (prevents token hijacking)
4. If they match → re-sends verification push notification

If the verify key doesn't match in step 1, the lookup returns `Nothing` and a new token is created instead — the DH secret check never runs. This makes TNEW safe for client retransmission after connection drops.

### 3. SNEW idempotent subscription

### 11. receive separates error responses from commands

The `receive` function processes incoming transmissions and partitions results: malformed/unauthorized requests are written directly to `sndQ` as error responses, while valid commands go to `rcvQ` for processing. This ensures protocol errors get immediate responses without competing for the command processing queue.

### 12. Maintenance mode saves state then exits immediately

When `maintenance` is set in `startOptions`, the server restores stats, calls `stopServer` (closes DB, saves stats), and exits with `exitSuccess`. It never starts transport listeners, subscriber threads, or resubscription. This provides a way to run database migrations without the server serving traffic.

### 13. Resubscription runs as a detached fork

`resubscribe` is launched via `forkIO` before `raceAny_` starts — it is **not part of the `raceAny_` group**. Most exceptions are silently lost per `forkIO` semantics. However, `ExitCode` exceptions (like `exitFailure` from pattern 20) are special-cased by GHC's runtime and propagate to the main thread, terminating the process.

### 14. TNEW re-registration resets status for non-verifiable tokens

When a re-registration TNEW matches on DH secret but `allowTokenVerification tknStatus` is `False` (token is `NTNew`, `NTInvalid`, or `NTExpired`), the server resets status to `NTRegistered` before sending the verification push. This makes TNEW a "status repair" mechanism — clients with stuck tokens can restart the verification flow by re-registering with the same DH key.

### 15. DELD unconditionally updates status (no session validation)

Unlike `SMP.END` which checks `activeClientSession'` to prevent stale session messages from changing state, `SMP.DELD` updates subscription status to `NSDeleted` unconditionally. This is correct because DELD means the queue was permanently deleted on the SMP router — the information is valid regardless of which session reports it.

### 16. TRPL generates new code but reuses the DH key

`TRPL` (token replace) creates a new registration code and resets status to `NTRegistered`, but does NOT generate a new server DH key pair. The existing `tknDhPrivKey` and `tknDhSecret` are preserved — only the push provider token and registration code change. The encrypted channel between client and NTF router persists across device token replacements.

### 17. PNMessage delivery requires NTActive, verification and cron do not

`ntfPush` applies `checkActiveTkn` only to `PNMessage` notifications. Verification pushes (`PNVerification`) and cron check-messages pushes (`PNCheckMessages`) are delivered regardless of token status. This is necessary because verification pushes must be sent before NTActive, and cron pushes are already filtered at the database level.

### 18. CAServiceSubscribed validates count and hash with warning-only behavior

When a service subscription is confirmed, the NTF router compares expected and confirmed subscription count and IDs hash. Mismatches in either are logged as warnings but no corrective action is taken. Only when both match is an informational message logged.

### 19. subscribeLoop uses 100x database batch multiplier

`dbBatchSize = batchSize * 100` reads subscriptions from the database in chunks 100 times larger than the SMP subscription batches. This reduces database round-trips during resubscription while keeping individual SMP batches small enough to avoid overwhelming SMP routers.

### 20. subscribeLoop calls exitFailure on database error

If `getServerNtfSubscriptions` returns `Left _` during startup resubscription, the server terminates via `exitFailure`. Since `resubscribe` runs in a forked thread (pattern 13), this `exitFailure` terminates the entire process — a transient database error during startup resubscription kills the server.

### 21. Stats log aligns to wall-clock time of day

The stats logging thread calculates an `initialDelay` to synchronize the first flush to `logStatsStartTime`. If the target time already passed today, it adds 86400 seconds to schedule for the next day. Subsequent flushes occur at exact `logInterval` cadence from that aligned start point.

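The alignment arithmetic, sketched over seconds since midnight (the helper name and representation are assumptions):

```haskell
-- Hypothetical sketch of wall-clock alignment; names are assumptions.
-- Both arguments are seconds since midnight.
initialDelay :: Int -> Int -> Int
initialDelay nowSecs startSecs
  | d >= 0    = d           -- target time still ahead today
  | otherwise = d + 86400   -- already passed: schedule for tomorrow
  where
    d = startSecs - nowSecs
```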
### 22. NMSG AUTH errors silently counted, not logged

When `addTokenLastNtf` returns `Left AUTH` (notification for a queue whose subscription/token association is invalid), the server increments `ntfReceivedAuth` but takes no corrective action. Other error types are silently ignored. This is expected — subscriptions may be deleted while messages are in-flight.

### 23. PNVerification delivery transitions token to NTConfirmed

When a verification push is successfully delivered to the push provider, `setTknStatusConfirmed` transitions the token to `NTConfirmed`, but only if not already `NTConfirmed` or `NTActive`. This creates a two-phase confirmation: push delivery confirms the channel works (`NTConfirmed`), then TVFY confirms the client received it (`NTActive`).

### 24. disconnectTransport always passes noSubscriptions = True

Unlike the SMP router which checks active subscriptions before disconnecting idle clients, the NTF router always returns `True` for the "no subscriptions" check. NTF clients are disconnected purely on inactivity timeout — the NTF protocol has no long-lived client subscriptions.

### 1. Service credentials are lazily generated

`mkDbService` in `newNtfServerEnv` generates service credentials on demand: when `getCredentials` is called for an SMP server, it checks the database. If the server is known and already has credentials, they are reused. If the server is known but has no credentials yet (first connection), new credentials are generated via `genCredentials`, stored in the database, and returned. If the server is not in the database at all, `PCEServiceUnavailable` is thrown (this case should not occur in practice, as clients only connect to servers already tracked in the database).

Service credentials are only used when `useServiceCreds` is enabled in the config.

### 3. getPushClient lazy initialization

`getPushClient` looks up the push client by provider in `pushClients` TMap. If not found, it calls `newPushClient` to create and register one. Push provider connections are established on first use, not at server startup.

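A sketch of lazy, on-first-use initialization in a TVar-held map (names are assumptions; simplexmq uses its own TMap wrapper). Note that the insert overwrites an existing entry, which mirrors the benign duplicate-creation race described below:

```haskell
-- Hypothetical sketch of lazy map-backed client initialization;
-- names are assumptions, not the simplexmq API.
import Control.Concurrent.STM
import qualified Data.Map.Strict as M

getOrCreate :: Ord k => TVar (M.Map k v) -> k -> IO v -> IO v
getOrCreate mapVar k mkV = do
  m <- readTVarIO mapVar
  case M.lookup k m of
    Just v -> pure v                                -- already initialized
    Nothing -> do
      v <- mkV                                      -- create on first use
      atomically $ modifyTVar' mapVar (M.insert k v)
      pure v
```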
### 4. Service credential validity: 25h backdating, ~2700yr forward

`genCredentials` creates self-signed Ed25519 certificates valid from 25 hours in the past to `24 * 999999` hours (~2,739 years) in the future. The 25-hour backdating protects against clock skew between NTF and SMP routers. The near-permanent forward validity avoids the need for credential rotation infrastructure.

### 5. newPushClient race creates duplicate clients

`newPushClient` atomically inserts into `pushClients` after creating the client. A concurrent `getPushClient` call between creation start and TMap insert will see `Nothing`, create a second client, and overwrite the first. This race is tolerable — APNS connections are cheap and the overwritten client is garbage collected.

### 6. Bidirectional activity timestamps

`NtfServerClient` has separate `rcvActiveAt` and `sndActiveAt` TVars, both initialized to connection time and updated independently. `disconnectTransport` considers both — a client that only receives (or only sends) is still considered active.

### 7. pushQ bounded TBQueue creates backpressure

`pushQ` in `NtfPushServer` is a `TBQueue` sized by `pushQSize`. When full, any thread writing to it (NMSG processing, periodic cron, verification) blocks in STM until space is available. This prevents the push delivery pipeline from being overwhelmed.

### 8. subscriberSeq provides monotonic session variable ordering

The `subscriberSeq` TVar is used by `getSessVar` to assign monotonically increasing IDs to subscriber session variables. `removeSessVar` uses compare-and-swap with this ID — only the variable with the matching ID can be removed, preventing stale removal when a new subscriber has already replaced the old one.

### 9. SMPSubscriber holds Weak ThreadId for GC-based cleanup

`subThreadId` is `Weak ThreadId`, not `ThreadId`. Using `Weak ThreadId` allows the GC to collect thread resources when no strong references remain. `stopSubscriber` uses `deRefWeak` to obtain the `ThreadId` (if the thread hasn't been GC'd) before calling `killThread`. The `Nothing` case (thread already collected) is simply skipped.

### 7. EC key type assumption

`readECPrivateKey` uses a specific pattern match for EC keys (`PrivKeyEC_Named`). It will crash at runtime if the APNS key file contains a different key type. The comment acknowledges this limitation.

### 8. JWT signature uses DER-encoded ASN.1, not raw r||s

`signedJWTToken` serializes the ECDSA signature as a DER-encoded ASN.1 SEQUENCE of two INTEGERs, then base64url-encodes it. RFC 7518 Section 3.4 requires raw concatenation of fixed-length r and s values instead. This deviation works because Apple's APNS server accepts DER-encoded signatures, but it would break if Apple enforced strict JWS compliance.

### 9. Two different base64url encodings

The encryption path uses `U.encode` (base64url **with** padding `=`), while the JWT path uses `U.encodeUnpadded` (base64url **without** padding). JWT requires unpadded base64url per RFC 7515, but the encrypted notification ciphertext is padded before being embedded as a JSON text value.

### 10. Error response defaults to empty string on parse failure

If the APNS error response body is empty, malformed, or not JSON, `decodeStrict'` returns `Nothing` and the reason defaults to `""`. This empty string never matches named error patterns, so unparseable error bodies fall through to the catch-all of whichever status code branch matches. For 410, this means a malformed body is treated as `PPRetryLater` rather than a token invalidation.

### 11. 410 unknown reasons are retryable, unlike 400/403 unknowns

Unknown 410 (Gone) reasons fall through to `PPRetryLater`, while unknown 400 and 403 reasons fall through to `PPResponseError`. This means an unexpected APNS 410 reason string triggers retry behavior rather than permanent failure.

### 12. 429 TooManyRequests is not explicitly handled

There is a commented-out note but no actual 429 handler. A rate-limiting response falls through to the `otherwise` branch and becomes `PPResponseError`, surfacing as a generic error rather than a retryable condition.

### 13. Nonce generation is STM-atomic, separate from encryption
|
||||
|
||||
The per-notification nonce is generated inside `atomically` using the `ChaChaDRG` TVar, guaranteeing uniqueness under concurrent delivery. The nonce is then used by `cbEncrypt` outside STM. This separation means the nonce is committed to the DRG state even if encryption or send subsequently fails — correct behavior since nonce reuse would be catastrophic.
### 14. Background notifications use priority 5, alerts use default 10
`apnsRequest` conditionally appends `apns-priority: 5` only for `APNSBackground` notifications. Alert and mutable-content notifications omit the header, relying on APNS's default priority of 10. Apple requires background pushes to use priority 5 — using 10 can cause APNS to reject them.
### 15. APNSErrorResponse is data, not newtype
The comment explicitly states `APNSErrorResponse` is `data` rather than `newtype` "to have a correct JSON encoding as a record." With `deriveFromJSON`, a newtype around `Text` would serialize as a bare string, not `{"reason": "..."}`. The `data` wrapper forces record encoding matching APNS's JSON error format.
### 16. HTTP/2 requests go through a serializing queue
`sendRequest` routes through the HTTP2Client's `reqQ` (a `TBQueue`), serializing all requests through a single sender thread. Concurrent push deliveries are implicitly serialized at the HTTP/2 layer, meaning high-throughput scenarios bottleneck on this queue rather than utilizing HTTP/2's multiplexing.
### 17. Connection initialization is fire-and-forget
`createAPNSPushClient` calls `connectHTTPS2` and discards the result with `void`. If the initial connection fails, the error is only logged — the client is still created. The first push delivery triggers `getApnsHTTP2Client` which reconnects. This means the server can start even if APNS is unreachable.
@@ -8,7 +8,7 @@
### 1. incServerStat double lookup
`incServerStat` performs a non-STM IO lookup first, then only enters an STM transaction on cache miss. The STM block re-checks the map to handle races (another thread may have inserted between the IO lookup and STM entry). This avoids contention on the shared TMap in the common case where the server's counter TVar already exists.
`incServerStat` performs a non-STM IO lookup first. On cache hit, the STM transaction only touches the per-server `TVar Int` without reading the shared TMap, avoiding contention. On cache miss, the STM block re-checks the map to handle races (another thread may have inserted between the IO lookup and STM entry).
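The double-checked pattern can be modeled with a dict and a lock. This is a sketch, not the Haskell code: the lock stands in for the STM transaction, and the single-element list stands in for the per-server `TVar Int` (the final increment models `modifyTVar'` and would itself need synchronization in real concurrent Python):

```python
import threading

def inc_server_stat(stats: dict, lock: threading.Lock, server: str) -> None:
    """Fast path reads the shared map without the lock; only a miss
    enters the critical section, which re-checks the map so a racing
    insert by another thread is not clobbered."""
    counter = stats.get(server)           # non-STM IO lookup
    if counter is None:
        with lock:                        # STM transaction analogue
            counter = stats.setdefault(server, [0])  # re-check on miss
    counter[0] += 1                       # per-server TVar increment analogue
```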
### 2. setNtfServerStats is not thread safe
@@ -17,3 +17,23 @@
### 3. Backward-compatible parsing
The `strP` parser uses `opt` which defaults missing fields to 0. This allows reading stats files from older server versions that don't include newer fields (`ntfReceivedAuth`, `ntfFailed`, `ntfVrf*`, etc.).
### 4. getNtfServerStatsData is a non-atomic snapshot
`getNtfServerStatsData` reads each `IORef` and `TMap` field sequentially in plain `IO`, not inside a single STM transaction. The returned `NtfServerStatsData` is not a consistent point-in-time snapshot — invariants like "received >= delivered" may not hold. The same applies to `getStatsByServer`, which does one `readTVarIO` for the map root TVar, then a separate `readTVarIO` for each per-server TVar. This is acceptable for periodic reporting where approximate consistency suffices.
### 5. Mixed IORef/TVar concurrency primitives
Aggregate counters (`ntfReceived`, `ntfDelivered`, etc.) use `IORef Int` incremented via `atomicModifyIORef'_`, while per-server breakdowns use `TMap Text (TVar Int)` incremented atomically via STM in `incServerStat`. Although both individual operations are atomic, the aggregate and per-server increments are separate operations, so their values can drift: a thread could increment the aggregate `IORef` before `incServerStat` runs, or vice versa.
### 6. setStatsByServer replaces TMap atomically but orphans old TVars
`setStatsByServer` builds a fresh `Map Text (TVar Int)` in IO via `newTVarIO`, then atomically replaces the TMap's root TVar. Old per-server TVars are not reused — any other thread holding a reference from a prior `TM.lookupIO` would modify an orphaned counter. This is safe only because it is called at startup (like `setNtfServerStats`), but it lacks the explicit "not thread safe" comment.
### 7. Positional parser format despite key=value appearance
The parser is strictly positional: fields must appear in exactly the serialization order. The `opt` alternatives only handle entirely absent fields (defaulting to 0), not reordered fields. Despite the `key=value` on-disk appearance, this is a sequential format — the named prefixes are for human readability, not key-lookup parsing.
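A hypothetical Python sketch of this positional-with-defaults scheme shows why absent fields parse but reordered fields do not (field names and format are illustrative, not the server's actual stats fields):

```python
def parse_positional(text: str, fields: list[str]) -> dict[str, int]:
    """Positional parse: fields must appear in serialization order.
    An absent field defaults to 0 (the `opt` analogue); a field that
    appears out of order is never matched against its name."""
    lines = iter(text.splitlines())
    out, cur = {}, next(lines, None)
    for name in fields:
        if cur is not None and cur.startswith(name + "="):
            out[name] = int(cur.split("=", 1)[1])
            cur = next(lines, None)  # consume the matched line
        else:
            out[name] = 0            # field absent at this position
    return out
```

Note how the second test below demonstrates the failure mode: a field serialized out of order silently parses as 0.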
### 8. B.unlines trailing newline asymmetry
`strEncode` uses `B.unlines`, which appends `\n` after every element including the last. The parser compensates with `optional A.endOfLine` on the last field. The file always ends with `\n`, but the parser tolerates its absence.
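The asymmetry is easy to model in Python (illustrative helpers, assuming a simple line-per-field format):

```python
def encode_lines(items: list[str]) -> str:
    # B.unlines analogue: '\n' after every element, including the last
    return "".join(s + "\n" for s in items)

def decode_lines(text: str) -> list[str]:
    # `optional A.endOfLine` analogue: tolerate a missing final '\n'
    lines = text.split("\n")
    return lines[:-1] if lines and lines[-1] == "" else lines
```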
@@ -21,3 +21,35 @@ When a token is activated, `stmRemoveInactiveTokenRegistrations` removes ALL oth
### 4. tokenLastNtfs accumulates via prepend
New notifications are prepended to the `NonEmpty PNMessageData` list via `(<|)`. The list is unbounded in the STM store — bounding is handled at the push delivery layer (the Postgres store limits to 6).
### 5. stmDeleteNtfToken prunes empty registration maps
When `stmDeleteNtfToken` removes a token, it deletes the entry from the inner `TMap` of `tokenRegistrations`, then checks whether that inner map is now empty via `TM.null`. If empty, it removes the outer `DeviceToken` key entirely, preventing unbounded growth of empty inner maps. In contrast, `stmRemoveInactiveTokenRegistrations` does **not** perform this cleanup — the surviving active token's registration always remains.
### 6. stmRemoveTokenRegistration is identity-guarded
`stmRemoveTokenRegistration` looks up the registration entry for the token's own verify key and only deletes it if the stored `NtfTokenId` matches the token's own ID. This guard prevents a token from accidentally removing a **different** token's registration that was inserted under the same `(DeviceToken, verifyKey)` pair due to a re-registration race.
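The identity guard amounts to a compare-then-delete over the registration map; a minimal Python model (hypothetical helper, flattening the `(DeviceToken, verifyKey)` key into one map key):

```python
def remove_token_registration(regs: dict, verify_key: str, own_id: str) -> None:
    """Identity-guarded delete: only remove the entry if it still points
    at this token's own id, so a rival token's re-registration under the
    same key is left untouched."""
    if regs.get(verify_key) == own_id:
        del regs[verify_key]
```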
### 7. stmDeleteNtfToken silently succeeds on missing tokens
`stmDeleteNtfToken` uses `lookupDelete` chained with monadic bind over `Maybe`. If the token ID does not exist in the `tokens` map, the registration-cleanup branch is silently skipped, and the function still proceeds to delete from `tokenLastNtfs` and `deleteTokenSubs`. It returns an empty list rather than signaling an error — the caller cannot distinguish "deleted a token with no subscriptions" from "token never existed."
### 8. deleteTokenSubs returns SMP queues for upstream unsubscription
`deleteTokenSubs` atomically collects all `SMPQueueNtf` values from the deleted subscriptions and returns them. This is how the server layer knows which SMP notifier subscriptions to tear down. `stmRemoveInactiveTokenRegistrations` discards this list (`void $`), meaning rival-token cleanup does **not** trigger SMP unsubscription — only explicit token deletion does.
### 9. stmAddNtfSubscription always returns Just (vestigial Maybe)
`stmAddNtfSubscription` has return type `STM (Maybe ())` with a comment "return Nothing if subscription existed before," but **unconditionally returns `Just ()`**. `TM.insert` overwrites any existing subscription silently. The `Maybe` return type is vestigial — the function never detects duplicates.
### 10. stmDeleteNtfSubscription leaves empty tokenSubscriptions entries
When `stmDeleteNtfSubscription` removes a subscription, it deletes the `subId` from the token's `Set NtfSubscriptionId` in `tokenSubscriptions` but never checks whether the set became empty. Tokens with all subscriptions individually deleted accumulate empty set entries — these are only cleaned up when the token itself is deleted via `deleteTokenSubs`.
### 11. stmSetNtfService — asymmetric cleanup with Postgres store
`stmSetNtfService` uses `maybe TM.delete TM.insert` to either remove or set the service association for an SMP server. This is purely a key-value update with no cascading effects on subscriptions. The Postgres store's `removeServiceAndAssociations` handles subscription cleanup separately, meaning the STM and Postgres stores have **different cleanup semantics** for service removal.
### 12. Subscription index triple-write invariant
`stmAddNtfSubscription` writes to three maps atomically: `subscriptions` (subId → data), `subscriptionLookup` (smpQueue → subId), and `tokenSubscriptions` (tokenId → Set subId). Single-subscription deletion (`stmDeleteNtfSubscription`) cleans the first two but only removes from the Set in the third. Bulk-token deletion (`deleteTokenSubs`) deletes the outer `tokenSubscriptions` entry entirely. Different deletion paths have different completeness guarantees.
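The triple write can be sketched with a dict-of-dicts model of the three maps (an illustrative sketch — in the real code the three writes happen inside one STM transaction, which is what makes them a unit):

```python
def add_ntf_subscription(store: dict, sub_id: bytes, smp_queue: tuple,
                         token_id: bytes, sub: object) -> None:
    """The three index writes performed atomically by stmAddNtfSubscription:
    primary data, reverse lookup by queue, and the per-token index set."""
    store["subscriptions"][sub_id] = sub
    store["subscriptionLookup"][smp_queue] = sub_id
    store["tokenSubscriptions"].setdefault(token_id, set()).add(sub_id)
```

A complete delete would have to undo all three writes; as the section notes, the existing deletion paths differ in how thoroughly they do so.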
@@ -52,3 +52,43 @@ Only non-service-associated subscriptions (`NOT ntf_service_assoc`) are returned
### 10. Service association tracking
`batchUpdateSrvSubStatus` atomically updates both subscription status and `ntf_service_assoc` flag. When notifications arrive via a service subscription (`newServiceId` is `Just`), all affected subscriptions are marked as service-associated. `removeServiceAndAssociations` resets all subscriptions for a server to `NSInactive` with `ntf_service_assoc = FALSE`.
### 11. uninterruptibleMask_ wraps most store operations
`withDB_` and `withClientDB` wrap the database transaction in `E.uninterruptibleMask_`. This prevents async exceptions from interrupting a PostgreSQL transaction mid-flight, which could leave a connection in a half-committed state and corrupt the pool. Functions that take a raw `DB.Connection` parameter (`getNtfServiceCredentials`, `setNtfServiceCredentials`, `updateNtfServiceId`) operate within a caller-managed transaction and are not independently wrapped. `getUsedSMPServers` uses `withTransaction` directly (intentionally: it is expected to crash on error at startup).
### 12. Silent error swallowing with sentinel returns
`withDB_` catches all `SomeException`, logs the error, and returns `Left (STORE msg)` — callers never see database failures as exceptions. Additionally, `batchUpdateSrvSubStatus` and `batchUpdateSrvSubErrors` use `fromRight (-1)` to convert database errors into a `-1` count, and `withPeriodicNtfTokens` uses `fromRight 0`, making database failures indistinguishable from "zero results" at the call site.
### 13. getUsedSMPServers uncorrelated EXISTS
The `EXISTS` subquery in `getUsedSMPServers` has no join condition to the outer `smp_servers` table — it returns ALL servers if ANY subscription anywhere has a subscribable status. This is intentional for server startup: the server needs all SMP server records (including `ServiceSub` data) to rebuild in-memory state, and the EXISTS clause is a cheap guard against an empty subscription table.
### 14. Trigger-maintained XOR hash aggregates
Subscription insert, update, and delete trigger functions incrementally maintain `smp_notifier_count` and `smp_notifier_ids_hash` on `smp_servers` using XOR-based hash aggregation of MD5 digests. Every `batchUpdateSrvSubStatus` or cascade-delete from token deletion implicitly fires these triggers. The XOR hash is self-inverting: adding and removing the same notifier ID restores the previous hash. `updateNtfServiceId` resets these counters to zero when the service ID changes, invalidating the previous aggregate.
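The self-inverting property of the XOR aggregate is easy to demonstrate in Python (a model of the trigger arithmetic, not the actual PL/pgSQL):

```python
import hashlib

def xor16(a: bytes, b: bytes) -> bytes:
    # byte-wise XOR of two 16-byte values
    return bytes(x ^ y for x, y in zip(a, b))

def toggle_id(agg: bytes, notifier_id: bytes) -> bytes:
    """XOR the MD5 digest of a notifier id into the 16-byte aggregate.
    XOR is self-inverting, so the same operation both adds and removes,
    and the aggregate is independent of insertion order."""
    return xor16(agg, hashlib.md5(notifier_id).digest())
```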
### 15. updateNtfServiceId asymmetric credential cleanup
Setting a new service ID preserves existing TLS credentials (`ntf_service_cert`, etc.) while only resetting aggregate counters. Setting service ID to `NULL` clears both credentials AND counters. In both cases, if a previous service ID existed, all subscription associations are reset first via `removeServiceAssociation_`, and a `logError` is emitted — treating a service ID change as anomalous.
### 16. Server upsert no-op DO UPDATE for RETURNING
The `insertServer` fallback uses `ON CONFLICT ... DO UPDATE SET smp_host = EXCLUDED.smp_host` — a no-op update solely to make `RETURNING smp_server_id` work. PostgreSQL's `ON CONFLICT DO NOTHING` does not support `RETURNING` for conflicting rows, so this pattern forces a row to always be "affected" and thus returnable. This handles races where two concurrent `addNtfSubscription` calls both miss the initial SELECT.
### 17. getNtfServiceCredentials FOR UPDATE serializes provisioning
`getNtfServiceCredentials` acquires `FOR UPDATE` on the server row even though it is a read operation. The caller needs to atomically check whether credentials exist and then set them in the same transaction. Without `FOR UPDATE`, two concurrent provisioning attempts could both see `Nothing` and both provision, resulting in credential mismatch.
### 18. deleteNtfToken string_agg with hex parsing
`deleteNtfToken` uses `string_agg(s.smp_notifier_id :: TEXT, ',')` to aggregate `BYTEA` notifier IDs into comma-separated text, then parses with `parseByteaString` which drops the `\x` prefix and hex-decodes. `mapMaybe` silently drops any IDs that fail hex decoding, which could mask data corruption.
### 19. withPeriodicNtfTokens streams with DB.fold
`withPeriodicNtfTokens` uses `DB.fold` to stream token rows one at a time through a callback that performs IO (sending push notifications), meaning the database transaction and connection are held open for the duration of the entire notification sweep. This is deliberately routed through the non-priority pool to avoid blocking client-facing operations.
### 20. Cursor-based pagination with byte-ordering
`getServerNtfSubscriptions` uses `subscription_id > ?` with `ORDER BY subscription_id LIMIT ?`. Since `subscription_id` is `BYTEA`, ordering is by raw byte comparison. The batch status update uses `FROM (VALUES ...)` pattern instead of `WHERE IN (...)`, and the `s.status != upd.status` guard prevents no-op writes from firing XOR hash triggers.
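The cursor scheme can be modeled over a list of byte-string ids (a sketch of the query's semantics, not the SQL itself — Python's `bytes` comparison matches `BYTEA` raw byte ordering):

```python
def paginate(ids: list[bytes], batch: int):
    """Cursor pagination modeling
    `WHERE subscription_id > ? ORDER BY subscription_id LIMIT ?`."""
    cursor = b""  # empty bytes sort before every nonempty id
    while True:
        page = sorted(i for i in ids if i > cursor)[:batch]
        if not page:
            return
        yield page
        cursor = page[-1]  # next batch starts strictly after the last id
```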
@@ -8,7 +8,7 @@
### 1. ALPN-dependent version range
`ntfServerHandshake` advertises `legacyServerNTFVRange` (v1 only) when ALPN is not available (`getSessionALPN` returns `Nothing`). When ALPN is present, it advertises the full `supportedServerNTFVRange`. This is the backward-compatibility mechanism for pre-ALPN clients that cannot negotiate newer protocol features.
`ntfServerHandshake` advertises `legacyServerNTFVRange` (v1 only) when ALPN is not available (`getSessionALPN` returns `Nothing`). When ALPN is present, it advertises the caller-provided `ntfVRange`. This is the backward-compatibility mechanism for pre-ALPN clients that cannot negotiate newer protocol features.
### 2. Version-gated features
@@ -23,8 +23,16 @@ Pre-v2 connections have no command encryption or batching — commands are sent
### 3. Unused Protocol typeclass parameters
`ntfClientHandshake` accepts `_proxyServer` and `_serviceKeys` parameters that are ignored. These exist because the `Protocol` typeclass (shared with SMP) requires `protocolClientHandshake` to accept them. The NTF protocol does not support proxy routing or service authentication.
`ntfClientHandshake` accepts `_proxyServer` and `_serviceKeys` parameters that are ignored. These are passed through from the `Protocol` typeclass's `protocolClientHandshake` method for consistency with SMP. A third parameter (`Maybe C.KeyPairX25519` for key agreement) is discarded at the Protocol instance wrapper level. The NTF protocol does not support proxy routing or service authentication.
### 4. Block size
NTF uses a 512-byte block size (`ntfBlockSize`), significantly smaller than SMP. Notification commands and responses are short — the main payload is the `PNMessageData` which contains encrypted message metadata.
NTF uses a 512-byte block size (`ntfBlockSize`), significantly smaller than SMP. This is sufficient because NTF protocol commands (TNEW, SNEW, TCHK, etc.) and their responses are short. `PNMessageData` (which contains encrypted message metadata) is not sent over the NTF transport — it is delivered via APNS push notifications.
### 5. Initial THandle has version 0
`ntfTHandle` creates a THandle with `thVersion = VersionNTF 0` — a version that no real protocol supports. This is a placeholder value that gets overwritten during version negotiation. All feature gates check `v >= authBatchCmdsNTFVersion` (v2), so the v0 placeholder disables all optional features.
### 6. Server handshake always sends authPubKey
`ntfServerHandshake` always includes `authPubKey = Just sk` in the server handshake, regardless of the advertised version range. The encoding functions (`encodeAuthEncryptCmds`) then decide whether to actually serialize it based on the max version. This means the key is computed even when it won't be sent.
@@ -16,4 +16,4 @@
### 3. NSADelete and NSARotate are deprecated
These `NtfSubNTFAction` values are no longer generated by current code but are retained in the type for processing legacy database records. `NSARotate` is logically "delete + recreate" while `NSADelete` is "delete notifier on NTF router + delete credentials on SMP router".
These `NtfSubNTFAction` values are no longer generated by current code but are retained in the type for processing legacy database records. `NSARotate` is logically "delete + recreate" while `NSADelete` is "delete subscription on NTF server + delete notifier credentials on SMP server".