simplexmq/plans/20260328_01_server_batched_sub_processing.md
Evgeny f0b7a4be73 messaging services (#1667)
2026-05-21 14:14:03 +01:00

Server: batched SUB command processing

Implementation plan for Part 1 of RFC 2026-03-28-subscription-performance.

Current state

When a batch of ~135 SUB commands arrives, the server already batches:

  • Queue record lookups (getQueueRecs in receive, Server.hs:1151)
  • Command verification (verifyLoadedQueue, Server.hs:1152)

But command processing is per-command (foldrM process in client, Server.hs:1372-1375). Each SUB calls subscribeQueueAndDeliver, which calls tryPeekMsg, issuing one DB query per queue. For Postgres, that is ~135 individual SELECT ... FROM messages WHERE recipient_id = ? ORDER BY message_id ASC LIMIT 1 queries per batch.

Goal

Replace ~135 individual message peek queries with 1 batched query per batch. No protocol changes.

Implementation

Step 1: Add tryPeekMsgs to MsgStoreClass

File: src/Simplex/Messaging/Server/MsgStore/Types.hs

Add to MsgStoreClass:

tryPeekMsgs :: s -> [StoreQueue s] -> ExceptT ErrorType IO (Map RecipientId Message)

Returns a map from recipient ID to earliest pending message for each queue that has one. Queues with no messages are absent from the map.

Step 2: Parameterize deliver to accept pre-fetched message

File: src/Simplex/Messaging/Server.hs

Currently deliver (inside subscribeQueueAndDeliver, line 1641) calls tryPeekMsg ms q. Add a parameter for an optional pre-fetched message:

deliver :: Maybe Message -> (Bool, Maybe Sub) -> M s ResponseAndMessage
deliver prefetchedMsg (hasSub, sub_) = do
  stats <- asks serverStats
  fmap (either ((,Nothing) . err) id) $ liftIO $ runExceptT $ do
    msg_ <- maybe (tryPeekMsg ms q) (pure . Just) prefetchedMsg
    ...

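When Nothing is passed, deliver falls back to an individual tryPeekMsg (existing behavior). When Just msg is passed, it uses that message directly (batched path). Note that with this signature a SUB queue that was pre-fetched but has no pending message also arrives as Nothing, so it still takes the fallback path and issues its own query; only queues with a pending message skip the individual peek.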
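The selection logic in deliver can be isolated into a small helper to make the two paths explicit. The following is a minimal sketch, not the server code: Message is a stand-in type, and resolveMsg and demo are hypothetical names introduced only for illustration. The peek argument plays the role of tryPeekMsg ms q.

```haskell
import Data.IORef (modifyIORef', newIORef, readIORef)

-- Stand-in for the server's Message type (assumption for this sketch).
newtype Message = Message String deriving (Eq, Show)

-- Use the pre-fetched message when present; otherwise run the per-queue
-- peek action (the role tryPeekMsg ms q plays in the plan).
resolveMsg :: Monad m => m (Maybe Message) -> Maybe Message -> m (Maybe Message)
resolveMsg peek = maybe peek (pure . Just)

-- Runs one batched and one unbatched call and reports how many times
-- the fallback peek was executed.
demo :: IO (Maybe Message, Maybe Message, Int)
demo = do
  peeks <- newIORef (0 :: Int)
  let peek = modifyIORef' peeks (+ 1) >> pure (Just (Message "from store"))
  batched <- resolveMsg peek (Just (Message "prefetched"))
  single <- resolveMsg peek Nothing
  n <- readIORef peeks
  pure (batched, single, n)
```

In this sketch the fallback runs only for the call that received Nothing, so the counter ends at 1.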

Step 3: Pre-fetch messages before the processing loop

File: src/Simplex/Messaging/Server.hs

Currently (lines 1372-1375):

forever $
  atomically (readTBQueue rcvQ)
    >>= foldrM process ([], [])
    >>= \(rs_, msgs) -> ...

Add a pre-fetch step before the existing loop:

forever $ do
  batch <- atomically (readTBQueue rcvQ)
  msgMap <- prefetchMsgs batch
  foldrM (process msgMap) ([], []) batch
    >>= \(rs_, msgs) -> ...

prefetchMsgs scans the batch, collects the queues from SUB commands that have a verified queue (q_ = Just (q, _)), calls tryPeekMsgs once, and returns the map. For batches with no SUBs it returns an empty map without making a DB call.

process passes the looked-up message (or Nothing) through to processCommand and down to deliver.

The foldrM process loop, processCommand, subscribeQueueAndDeliver, and all other command handlers stay structurally the same. Only deliver gains one parameter, and the client loop gains one pre-fetch call.
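The pre-fetch step described in this section might look roughly like the sketch below. It uses deliberately simplified, hypothetical types (Cmd, Entry, a String recipient ID) rather than the server's real batch element type, and takes the batched peek as a function argument (peekMany, standing in for tryPeekMsgs) so the example stays self-contained.

```haskell
import qualified Data.Map.Strict as M
import Data.Map.Strict (Map)

-- Simplified stand-ins for the server types (assumptions for this sketch).
type RecipientId = String
newtype Message = Message String deriving (Eq, Show)
data Cmd = SUB | ACK | GET deriving (Eq, Show)

-- Hypothetical batch entry: Left is a verification error,
-- Right carries the optional verified queue and the command.
type Entry = Either String (Maybe RecipientId, Cmd)

-- Keep only SUB commands that have a verified queue.
subQueues :: [Entry] -> [RecipientId]
subQueues batch = [q | Right (Just q, SUB) <- batch]

-- Call the batched peek once, or skip the store entirely
-- when the batch contains no SUB commands.
prefetchMsgs ::
  Monad m =>
  ([RecipientId] -> m (Map RecipientId Message)) ->
  [Entry] ->
  m (Map RecipientId Message)
prefetchMsgs peekMany batch = case subQueues batch of
  [] -> pure M.empty
  qs -> peekMany qs
```

The list comprehension silently skips Left entries, non-SUB commands, and SUBs without a verified queue, which covers the mixed-batch and error-queue cases without extra branching.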

Step 4: Review

Review the typeclass signature and server usage. Confirm the interface has the right shape before implementing store backends.

Step 5: Implement for each store backend

Postgres

File: src/Simplex/Messaging/Server/MsgStore/Postgres.hs

Single query using DISTINCT ON:

SELECT DISTINCT ON (recipient_id)
  recipient_id, msg_id, msg_ts, msg_quota, msg_ntf_flag, msg_body
FROM messages
WHERE recipient_id IN ?
ORDER BY recipient_id, message_id ASC

Build Map RecipientId Message from results.
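As a reference for what the query is expected to return, here is a pure Haskell model of DISTINCT ON (recipient_id) combined with ORDER BY recipient_id, message_id ASC: one row per recipient, namely the one with the smallest message ID. The Row type and its fields are hypothetical simplifications, and this is a model of the result shape, not the Postgres backend code.

```haskell
import qualified Data.Map.Strict as M
import Data.Map.Strict (Map)

-- Simplified stand-ins (assumptions for this sketch).
type RecipientId = String

data Row = Row
  { rowRcpt :: RecipientId, -- recipient_id
    rowSeq :: Int, -- message_id (insertion order)
    rowBody :: String -- msg_body
  }
  deriving (Eq, Show)

-- Keep, for every recipient, the row with the smallest message_id.
-- This mirrors what DISTINCT ON with the ORDER BY above selects.
earliestPerQueue :: [Row] -> Map RecipientId Row
earliestPerQueue = M.fromListWith keepEarlier . map (\r -> (rowRcpt r, r))
  where
    keepEarlier new old = if rowSeq new < rowSeq old then new else old
```

Because the real query already returns at most one row per recipient, building the map from its results should need no merge function; the merge here exists only to model the selection over unsorted input.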

STM

File: src/Simplex/Messaging/Server/MsgStore/STM.hs

Loop over queues, call tryPeekMsg for each, collect into map.

Journal

File: src/Simplex/Messaging/Server/MsgStore/Journal.hs

Loop over queues, call tryPeekMsg for each, collect into map.
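For the STM and Journal backends, the loop-based implementation could be sketched as follows. The types are simplified stand-ins, peekLoop and toyPeek are hypothetical names, and the per-queue peek is passed in as a function so the example does not depend on either store. One assumption worth checking against the real stores: in this sketch an error from any single queue aborts the whole batch.

```haskell
import Control.Monad.Trans.Except (ExceptT, runExceptT, throwE)
import qualified Data.Map.Strict as M
import Data.Map.Strict (Map)
import Data.Maybe (catMaybes)

-- Simplified stand-ins (assumptions for this sketch).
type RecipientId = String
newtype Message = Message String deriving (Eq, Show)

-- Peek each queue in turn and keep only the queues that have a message.
peekLoop ::
  Monad m =>
  (RecipientId -> ExceptT e m (Maybe Message)) ->
  [RecipientId] ->
  ExceptT e m (Map RecipientId Message)
peekLoop peekOne qs = M.fromList . catMaybes <$> mapM peekWithId qs
  where
    peekWithId q = fmap ((,) q) <$> peekOne q

-- Toy in-memory store used only to exercise peekLoop.
toyPeek :: Map RecipientId [Message] -> RecipientId -> ExceptT String IO (Maybe Message)
toyPeek store q = case M.lookup q store of
  Nothing -> throwE ("unknown queue " ++ q)
  Just [] -> pure Nothing
  Just (m : _) -> pure (Just m)
```

With a store holding two messages for q1 and none for q2, running runExceptT (peekLoop (toyPeek store) ["q1", "q2"]) yields a map containing only q1 and its first message.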

Step 6: Handle edge cases

  1. Mixed batches: prefetchMsgs collects only SUB queues. Non-SUB commands get Nothing for the pre-fetched message and process unchanged.

  2. Already-subscribed queues: Include in pre-fetch - deliver is called for re-SUBs too (delivers pending message).

  3. Service subscriptions: The pre-fetch doesn't care about service state. sharedSubscribeQueue handles service association in STM; message peek is the same.

  4. Error queues: Verification errors from receive are Left values in the batch. prefetchMsgs only looks at Right values with SUB commands.

  5. Empty pre-fetch: If batch has no SUBs (e.g., all ACKs), prefetchMsgs returns empty map, no DB call made.

Step 7: Batch other commands (future, not in scope)

The same pattern (pre-fetch before loop, parameterize handler) can extend to:

  • ACK with tryDelPeekMsg - batch delete+peek
  • GET with tryPeekMsg - same map lookup

Lower priority since these don't have the N-at-once pattern of subscriptions.

File changes summary

| File | Change |
|------|--------|
| src/Simplex/Messaging/Server/MsgStore/Types.hs | Add tryPeekMsgs to typeclass |
| src/Simplex/Messaging/Server/MsgStore/Postgres.hs | Implement tryPeekMsgs with batch SQL |
| src/Simplex/Messaging/Server/MsgStore/STM.hs | Implement tryPeekMsgs as loop |
| src/Simplex/Messaging/Server/MsgStore/Journal.hs | Implement tryPeekMsgs as loop |
| src/Simplex/Messaging/Server.hs | Add prefetchMsgs, parameterize deliver |

Testing

  1. Existing server tests must pass unchanged (correctness preserved).
  2. Add a test that subscribes a batch of queues (some with pending messages, some without) and verifies all get correct SOK + MSG responses.
  3. Prometheus metrics: existing qSub stat should still increment correctly.

Performance expectation

For 300K queues across ~2200 batches:

  • Before: ~300K individual DB queries
  • After: ~2200 batched DB queries (one per batch of ~135)
  • ~136x reduction in DB round-trips
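The estimate can be reproduced with a few lines of arithmetic. This sketch assumes every batch is full (135 SUBs) and counts one batched query per batch; batchesFor and reduction are names introduced here for illustration.

```haskell
-- Number of batches needed for a given batch size (ceiling division).
batchesFor :: Int -> Int -> Int
batchesFor batchSize queues = (queues + batchSize - 1) `div` batchSize

-- Ratio of per-queue queries to per-batch queries.
reduction :: Int -> Int -> Double
reduction batchSize queues =
  fromIntegral queues / fromIntegral (batchesFor batchSize queues)
```

For 300,000 queues and batches of 135, batchesFor gives 2223 batches and reduction gives a factor of about 135, consistent with the rounded figures above: the reduction factor is essentially the batch size.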