simplexmq/spec/encoding.md
Evgeny @ SimpleX Chat 66d7efa61e some modules documented
2026-03-11 08:53:57 +00:00

# Encoding
> Binary and string encoding used across all SimpleX protocols.
**Source files**: [`Encoding.hs`](../src/Simplex/Messaging/Encoding.hs), [`Encoding/String.hs`](../src/Simplex/Messaging/Encoding/String.hs), [`Parsers.hs`](../src/Simplex/Messaging/Parsers.hs)
## Overview
Two encoding layers serve different purposes:
- **`Encoding`** — Binary wire format for SMP protocol transmissions. Compact, no delimiters between fields. Used in all on-the-wire protocol messages.
- **`StrEncoding`** — Human-readable string format for configuration, URIs, logs, and JSON serialization. Uses base64url for binary data, decimal for numbers, comma-separated lists, space-separated tuples.
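Both are typeclasses with `MINIMAL` pragmas that require the encoder plus either the decoder or the parser (`smpEncode` with `smpDecode` or `smpP`; `strEncode` with `strDecode` or `strP`). The missing method is derived from the one provided.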
## Binary Encoding (`Encoding` class)
```haskell
class Encoding a where
  smpEncode :: a -> ByteString
  smpDecode :: ByteString -> Either String a -- default: parseAll smpP
  smpP :: Parser a -- default: smpDecode <$?> smpP (the right-hand smpP is the ByteString parser)
```
### Length-prefix conventions
| Type | Prefix | Max size |
|------|--------|----------|
| `ByteString` | 1-byte length (Word8 as Char) | 255 bytes |
| `Large` (newtype) | 2-byte length (Word16 big-endian) | 65535 bytes |
| `Tail` (newtype) | None — consumes rest of input | Unlimited |
| Lists (`smpEncodeList`) | 1-byte count prefix, then concatenated items | 255 items |
| `NonEmpty` | Same as list (fails on count=0) | 255 items |
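The two length-prefixed conventions can be sketched with plain `bytestring` functions. This is a minimal illustration under stated assumptions, not the library's code: `encodeShort`, `encodeLarge` and `decodeShort` are hypothetical names, and the explicit bounds checks are added for the example.

```haskell
import qualified Data.ByteString as B
import Data.Bits (shiftR)

-- Hypothetical helper: 1-byte length prefix, rejecting input over 255 bytes.
encodeShort :: B.ByteString -> Maybe B.ByteString
encodeShort s
  | B.length s > 255 = Nothing
  | otherwise = Just (B.cons (fromIntegral (B.length s)) s)

-- Hypothetical helper: 2-byte big-endian length prefix (the `Large` convention).
encodeLarge :: B.ByteString -> Maybe B.ByteString
encodeLarge s
  | n > 65535 = Nothing
  | otherwise = Just (B.pack [fromIntegral (n `shiftR` 8), fromIntegral n] <> s)
  where
    n = B.length s

-- Decode a 1-byte-length-prefixed string; returns it with the remaining input.
decodeShort :: B.ByteString -> Maybe (B.ByteString, B.ByteString)
decodeShort bs = do
  (len, rest) <- B.uncons bs
  let n = fromIntegral len
  if B.length rest < n then Nothing else Just (B.splitAt n rest)
```

Returning the unconsumed input from `decodeShort` is what lets prefixed fields sit back to back with no delimiter.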
### Scalar types
| Type | Encoding | Bytes |
|------|----------|-------|
| `Char` | Raw byte | 1 |
| `Bool` | `'T'` / `'F'` (0x54 / 0x46) | 1 |
| `Word16` | Big-endian | 2 |
| `Word32` | Big-endian | 4 |
| `Int64` | Two big-endian Word32s (high then low) | 8 |
| `SystemTime` | `systemSeconds` as Int64 (nanoseconds dropped) | 8 |
| `Text` | UTF-8 then ByteString encoding (1-byte length prefix) | 1 + len |
| `String` | `B.pack` then ByteString encoding | 1 + len |
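As a rough sketch of the fixed-size rows in this table, the following encodes `Word32`, `Int64` and `Bool` by hand. The function names are hypothetical and the code uses only `base` and `bytestring`; it illustrates the byte layout rather than reproducing the instances in `Encoding.hs`.

```haskell
import qualified Data.ByteString as B
import Data.Bits (shiftL, shiftR, (.|.))
import Data.Int (Int64)
import Data.Word (Word32)

-- Big-endian bytes of a Word32, most significant byte first.
encodeWord32 :: Word32 -> B.ByteString
encodeWord32 w = B.pack [fromIntegral (w `shiftR` s) | s <- [24, 16, 8, 0]]

-- Int64 as two big-endian Word32s: high half, then low half.
encodeInt64 :: Int64 -> B.ByteString
encodeInt64 i =
  encodeWord32 (fromIntegral (i `shiftR` 32)) <> encodeWord32 (fromIntegral i)

-- Bool as the ASCII characters 'T' (0x54) and 'F' (0x46).
encodeBool :: Bool -> B.ByteString
encodeBool b = B.singleton (if b then 0x54 else 0x46)

-- Inverse of encodeWord32 for an input of exactly four bytes.
decodeWord32 :: B.ByteString -> Maybe Word32
decodeWord32 bs
  | B.length bs == 4 = Just (B.foldl' (\acc b -> (acc `shiftL` 8) .|. fromIntegral b) 0 bs)
  | otherwise = Nothing
```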
### `Maybe a`
```
Nothing → '0' (0x30)
Just x → '1' (0x31) ++ smpEncode x
```
Tags are ASCII characters `'0'`/`'1'`, not binary 0x00/0x01.
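A minimal sketch of this tagging scheme, assuming the payload encoder and decoder are passed in explicitly (`encodeMaybe` and `decodeMaybe` are hypothetical names, and the decoder here works on a whole input rather than as a parser):

```haskell
import qualified Data.ByteString.Char8 as BC

-- Encode an optional value given an encoder for the payload.
encodeMaybe :: (a -> BC.ByteString) -> Maybe a -> BC.ByteString
encodeMaybe _ Nothing = BC.singleton '0'
encodeMaybe enc (Just x) = BC.cons '1' (enc x)

-- Decode a whole input: inspect the ASCII tag, then decode the payload.
decodeMaybe :: (BC.ByteString -> Maybe a) -> BC.ByteString -> Maybe (Maybe a)
decodeMaybe dec bs = case BC.uncons bs of
  Just ('0', rest) | BC.null rest -> Just Nothing
  Just ('1', rest) -> Just <$> dec rest
  _ -> Nothing -- any other tag, including binary 0x00/0x01, is rejected
```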
### Tuples
Tuples (2 through 8) encode as simple concatenation — no length prefix, no separator. Fields are parsed sequentially using each component's `smpP`. This works because each component's parser knows how many bytes to consume (via its own length prefix or fixed size).
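Sequential parsing of concatenated fields can be illustrated with hand-written decoders that return the unconsumed input. The `Decoder` type and the functions below are assumptions made for this sketch; the real instances use attoparsec parsers.

```haskell
import qualified Data.ByteString as B
import Data.Word (Word16)

-- A decoder consumes a prefix of the input and returns the remaining bytes.
type Decoder a = B.ByteString -> Maybe (a, B.ByteString)

-- Self-delimiting by length prefix: 1-byte length, then that many bytes.
shortP :: Decoder B.ByteString
shortP bs = do
  (len, rest) <- B.uncons bs
  let n = fromIntegral len
  if B.length rest < n then Nothing else Just (B.splitAt n rest)

-- Self-delimiting by fixed size: big-endian Word16.
word16P :: Decoder Word16
word16P bs = case B.unpack (B.take 2 bs) of
  [hi, lo] -> Just (fromIntegral hi * 256 + fromIntegral lo, B.drop 2 bs)
  _ -> Nothing

-- A pair is decoded by running the component decoders in sequence.
pairP :: Decoder a -> Decoder b -> Decoder (a, b)
pairP pa pb bs = do
  (a, rest) <- pa bs
  (b, rest') <- pb rest
  Just ((a, b), rest')
```

For the input bytes `02 61 62 01 00`, `pairP shortP word16P` reads the two-byte string `"ab"` and then the number 256, with no separator between them.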
### Combinators
| Function | Signature | Purpose |
|----------|-----------|---------|
| `_smpP` | `Parser a` | Space-prefixed parser (`A.space *> smpP`) |
| `smpEncodeList` | `[a] -> ByteString` | 1-byte count + concatenated items |
| `smpListP` | `Parser [a]` | Parse count then that many items |
| `lenEncode` | `Int -> Char` | Int to single-byte length char |
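The count-prefixed list format might look like the following when written without the typeclass. `encodeList` and `decodeList` are hypothetical stand-ins for `smpEncodeList` and `smpListP` that take the item encoder or decoder as an argument; the 255-item check is an addition for the example.

```haskell
import qualified Data.ByteString as B

-- 1-byte item count, then the concatenated item encodings.
encodeList :: (a -> B.ByteString) -> [a] -> Maybe B.ByteString
encodeList enc xs
  | length xs > 255 = Nothing
  | otherwise = Just (B.cons (fromIntegral (length xs)) (B.concat (map enc xs)))

-- Read the count, then run the item decoder exactly that many times.
decodeList ::
  (B.ByteString -> Maybe (a, B.ByteString)) ->
  B.ByteString ->
  Maybe ([a], B.ByteString)
decodeList item bs = do
  (count, rest) <- B.uncons bs
  go (fromIntegral count :: Int) rest
  where
    go 0 rest = Just ([], rest)
    go n rest = do
      (x, rest') <- item rest
      (xs, rest'') <- go (n - 1) rest'
      Just (x : xs, rest'')
```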
## String Encoding (`StrEncoding` class)
```haskell
class StrEncoding a where
  strEncode :: a -> ByteString
  strDecode :: ByteString -> Either String a -- default: parseAll strP
  strP :: Parser a -- default: strDecode <$?> base64urlP
```
Key difference from `Encoding`: the default `strP` parses base64url input first, then applies `strDecode`. This means types that only implement `strDecode` will automatically accept base64url-encoded input.
### Instance conventions
| Type | Encoding |
|------|----------|
| `ByteString` | base64url (non-empty required) |
| `Word16`, `Word32` | Decimal string |
| `Int`, `Int64` | Signed decimal |
| `Char`, `Bool` | Delegates to `Encoding` (`smpEncode`/`smpP`) |
| `Maybe a` | Empty string = `Nothing`, otherwise `strEncode a` |
| `Text` | UTF-8 bytes, parsed until space/newline |
| `SystemTime` | `systemSeconds` as Int64 (decimal) |
| `UTCTime` | ISO 8601 string |
| `CertificateChain` | Comma-separated base64url blobs |
| `Fingerprint` | base64url of fingerprint bytes |
### Collection encoding
| Type | Separator |
|------|-----------|
| Lists (`strEncodeList`) | Comma `,` |
| `NonEmpty` | Comma (fails on empty) |
| `Set a` | Comma |
| `IntSet` | Comma |
| Tuples (2-6) | Space (` `) |
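To make the separator rules concrete, here is a small sketch that builds the string form of a list and of a pair. The helper names are hypothetical and the encoders are passed explicitly instead of going through `StrEncoding`.

```haskell
import qualified Data.ByteString.Char8 as BC
import Data.Word (Word16)

-- Numbers are written in decimal.
strWord16 :: Word16 -> BC.ByteString
strWord16 = BC.pack . show

-- Lists: items joined with commas.
strList :: (a -> BC.ByteString) -> [a] -> BC.ByteString
strList enc = BC.intercalate (BC.pack ",") . map enc

-- Tuples: components joined with a single space.
strPair :: (a -> BC.ByteString) -> (b -> BC.ByteString) -> (a, b) -> BC.ByteString
strPair ea eb (a, b) = ea a <> BC.pack " " <> eb b
```

With these helpers, `strPair strWord16 (strList strWord16) (1, [2, 3])` yields `1 2,3`: the space separates tuple components and the comma separates list items.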
### `Str` newtype
Raw string (not base64url-encoded). Parses until space, consumes trailing space. Used for string-valued protocol fields that should not be base64-encoded.
### `TextEncoding` class
```haskell
class TextEncoding a where
  textEncode :: a -> Text
  textDecode :: Text -> Maybe a
```
Separate from `StrEncoding` — operates on `Text` rather than `ByteString`. Used for types that need Text representation (e.g., enum display names).
### JSON bridge functions
| Function | Purpose |
|----------|---------|
| `strToJSON` | `StrEncoding a => a -> J.Value` via `decodeLatin1 . strEncode` |
| `strToJEncoding` | Same, for Aeson encoding |
| `strParseJSON` | `StrEncoding a => String -> J.Value -> JT.Parser a` — parse JSON string via `strP` |
| `textToJSON` | `TextEncoding a => a -> J.Value` |
| `textToEncoding` | Same, for Aeson encoding |
| `textParseJSON` | `TextEncoding a => String -> J.Value -> JT.Parser a` |
## Parsers
**Source**: [`Parsers.hs`](../src/Simplex/Messaging/Parsers.hs)
### Core parsing functions
| Function | Signature | Purpose |
|----------|-----------|---------|
| `parseAll` | `Parser a -> ByteString -> Either String a` | Parse consuming all input (fails if bytes remain) |
| `parse` | `Parser a -> e -> ByteString -> Either e a` | `parseAll` with custom error type (discards error string) |
| `parseE` | `(String -> e) -> Parser a -> ByteString -> ExceptT e IO a` | `parseAll` lifted into `ExceptT` |
| `parseE'` | `(String -> e) -> Parser a -> ByteString -> ExceptT e IO a` | Like `parseE` but allows trailing input |
| `parseRead1` | `Read a => Parser a` | Parse a word then `readMaybe` it |
| `parseString` | `(ByteString -> Either String a) -> String -> a` | Parse from `String` (errors with `error`) |
### `base64P`
Standard base64 parser (not base64url — uses `+`/`/` alphabet). Takes alphanumeric + `+`/`/` characters, optional `=` padding, then decodes. Contrast with `base64urlP` in `Encoding/String.hs` which uses `-`/`_` alphabet.
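The difference between the two alphabets comes down to two characters. The predicates below are a hypothetical sketch of the character classes each parser would accept (padding is not handled), and `urlToStd` is an illustrative converter that is not part of the library.

```haskell
import Data.Char (isAlphaNum, isAscii)

-- Characters accepted by the standard alphabet.
isBase64Char :: Char -> Bool
isBase64Char c = (isAscii c && isAlphaNum c) || c == '+' || c == '/'

-- Characters accepted by the URL-safe alphabet.
isBase64UrlChar :: Char -> Bool
isBase64UrlChar c = (isAscii c && isAlphaNum c) || c == '-' || c == '_'

-- Map URL-safe characters to their standard counterparts.
urlToStd :: String -> String
urlToStd = map conv
  where
    conv '-' = '+'
    conv '_' = '/'
    conv c = c
```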
### JSON options helpers
Platform-conditional JSON encoding for cross-platform compatibility (Haskell ↔ Swift).
| Function | Purpose |
|----------|---------|
| `enumJSON` | All-nullary constructors as strings, with tag modifier |
| `sumTypeJSON` | Platform-conditional: `taggedObjectJSON` on non-Darwin, `singleFieldJSON` on Darwin |
| `taggedObjectJSON` | `{"type": "Tag", "data": {...}}` format |
| `singleFieldJSON` | `{"Tag": value}` format |
| `defaultJSON` | Default options with `omitNothingFields = True` |
Pattern synonyms for JSON field names:
- `TaggedObjectJSONTag = "type"`
- `TaggedObjectJSONData = "data"`
- `SingleFieldJSONTag = "_owsf"`
### String helpers
| Function | Purpose |
|----------|---------|
| `fstToLower` | Lowercase first character |
| `dropPrefix` | Remove prefix string, lowercase the first character of the remainder |
| `textP` | Parse rest of input as UTF-8 `String` |
## Auxiliary Types and Utilities
### TMap
**Source**: [`TMap.hs`](../src/Simplex/Messaging/TMap.hs)
```haskell
type TMap k a = TVar (Map k a)
```
STM-based concurrent map. Wraps `Data.Map.Strict` in a `TVar`. All mutations use `modifyTVar'` (strict) to prevent thunk accumulation.
| Function | Notes |
|----------|-------|
| `emptyIO` | IO allocation (`newTVarIO`) |
| `singleton` | STM allocation |
| `clear` | Reset to empty |
| `lookup` / `lookupIO` | STM / non-transactional IO read |
| `member` / `memberIO` | STM / non-transactional IO membership |
| `insert` / `insertM` | Insert value / insert from STM action |
| `delete` | Remove key |
| `lookupInsert` | Atomic lookup-then-insert (returns old value) |
| `lookupDelete` | Atomic lookup-then-delete |
| `adjust` / `update` / `alter` / `alterF` | Standard Map operations lifted to STM |
| `union` | Merge `Map` into `TMap` |
`lookupIO`/`memberIO` use `readTVarIO` — single-read outside STM transaction, useful when you need a snapshot without composing with other STM operations.
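A few of these operations can be sketched directly on top of `stm` and `containers`. This is an illustration of the pattern, assuming `stm` 2.5 or later for `stateTVar`; the bodies are written for this example and may differ from `TMap.hs`.

```haskell
import Control.Concurrent.STM
import Data.Map.Strict (Map)
import qualified Data.Map.Strict as M

type TMap k a = TVar (Map k a)

emptyIO :: IO (TMap k a)
emptyIO = newTVarIO M.empty

-- Strict modification avoids building up thunks inside the TVar.
insert :: Ord k => k -> a -> TMap k a -> STM ()
insert k v m = modifyTVar' m (M.insert k v)

-- Read the old value and write the new one in a single transaction.
lookupInsert :: Ord k => k -> a -> TMap k a -> STM (Maybe a)
lookupInsert k v m = stateTVar m (\mp -> (M.lookup k mp, M.insert k v mp))

-- Snapshot read outside of any transaction.
lookupIO :: Ord k => k -> TMap k a -> IO (Maybe a)
lookupIO k m = M.lookup k <$> readTVarIO m
```

Because `lookupInsert` runs inside `STM`, no other thread can observe the map between the lookup and the insert.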
### SessionVar
**Source**: [`Session.hs`](../src/Simplex/Messaging/Session.hs)
Race-safe session management using TMVar + monotonic ID.
```haskell
data SessionVar a = SessionVar
  { sessionVar :: TMVar a -- result slot
  , sessionVarId :: Int -- monotonic ID from TVar counter
  , sessionVarTs :: UTCTime -- creation timestamp
  }
```
| Function | Purpose |
|----------|---------|
| `getSessVar` | Lookup or create session. Returns `Left new` or `Right existing` |
| `removeSessVar` | Delete session only if ID matches (prevents removing a replacement) |
| `tryReadSessVar` | Non-blocking read of session result |
The ID-match check in `removeSessVar` prevents a race where:
1. Thread A creates session #5, starts work
2. Thread B creates session #6 (replacing #5 in TMap)
3. Thread A finishes and tries to remove its session; the stored ID (#6) does not match its own (#5), so the removal is skipped and session #6 stays in the map
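The ID check can be sketched as follows. The `Sess` record is a simplified, hypothetical version of `SessionVar` without the timestamp, and `getSess`, `newSess` and `removeSess` are illustrative names, not the functions from `Session.hs`.

```haskell
import Control.Concurrent.STM
import qualified Data.Map.Strict as M

-- Simplified session record: the timestamp field is omitted in this sketch.
data Sess a = Sess {sessVar :: TMVar a, sessId :: Int}

-- Return the existing session (Right) or create a new one (Left).
getSess :: Ord k => TVar Int -> k -> TVar (M.Map k (Sess a)) -> STM (Either (Sess a) (Sess a))
getSess counter k m = do
  existing <- M.lookup k <$> readTVar m
  case existing of
    Just s -> pure (Right s)
    Nothing -> Left <$> newSess counter k m

-- Create a session with the next ID, replacing any session stored under k.
newSess :: Ord k => TVar Int -> k -> TVar (M.Map k (Sess a)) -> STM (Sess a)
newSess counter k m = do
  sid <- stateTVar counter (\n -> (n, n + 1))
  s <- Sess <$> newEmptyTMVar <*> pure sid
  modifyTVar' m (M.insert k s)
  pure s

-- Remove the entry only if it still holds the session with the same ID.
removeSess :: Ord k => Sess a -> k -> TVar (M.Map k (Sess a)) -> STM ()
removeSess s k m = modifyTVar' m (M.update keepIfReplaced k)
  where
    keepIfReplaced s' = if sessId s' == sessId s then Nothing else Just s'
```

In the scenario above, thread A's `removeSess` finds session #6 under the key and leaves it in place.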
### ServiceScheme
**Source**: [`ServiceScheme.hs`](../src/Simplex/Messaging/ServiceScheme.hs)
```haskell
data ServiceScheme = SSSimplex | SSAppServer SrvLoc
data SrvLoc = SrvLoc HostName ServiceName
```
URI scheme for SimpleX service addresses. `SSSimplex` encodes as `"simplex:"`, `SSAppServer` as `"https://host:port"`.
`simplexChat` is the constant `SSAppServer (SrvLoc "simplex.chat" "")`.
### SystemTime
**Source**: [`SystemTime.hs`](../src/Simplex/Messaging/SystemTime.hs)
```haskell
newtype RoundedSystemTime (t :: Nat) = RoundedSystemTime { roundedSeconds :: Int64 }
type SystemDate = RoundedSystemTime 86400 -- day precision
type SystemSeconds = RoundedSystemTime 1 -- second precision
```
Phantom-typed time rounding. The `Nat` type parameter specifies rounding granularity in seconds.
| Function | Purpose |
|----------|---------|
| `getRoundedSystemTime` | Get current time rounded to `t` seconds |
| `getSystemDate` | Alias for day-rounded time |
| `getSystemSeconds` | Second-precision (no rounding needed, just drops nanoseconds) |
| `roundedToUTCTime` | Convert back to `UTCTime` |
`RoundedSystemTime` derives `FromField`/`ToField` for SQLite storage and `FromJSON`/`ToJSON` for API serialization.
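A pure sketch of how the type-level granularity can drive the rounding is shown below. `roundSeconds` is a hypothetical helper, and rounding down to a multiple of `t` is an assumption of this example; the source states only that the time is rounded to `t` seconds.

```haskell
{-# LANGUAGE DataKinds #-}
{-# LANGUAGE KindSignatures #-}
{-# LANGUAGE ScopedTypeVariables #-}

import Data.Int (Int64)
import Data.Proxy (Proxy (..))
import GHC.TypeLits (KnownNat, Nat, natVal)

newtype RoundedSystemTime (t :: Nat) = RoundedSystemTime {roundedSeconds :: Int64}
  deriving (Eq, Show)

-- Hypothetical helper: round epoch seconds down to a multiple of `t`.
-- The granularity is read from the phantom type parameter via natVal.
roundSeconds :: forall t. KnownNat t => Int64 -> RoundedSystemTime t
roundSeconds s = RoundedSystemTime (s - s `mod` step)
  where
    step = fromIntegral (natVal (Proxy :: Proxy t))
```

For example, `roundSeconds 90000 :: RoundedSystemTime 86400` holds 86400 under this assumption, while the same input at granularity 1 is unchanged.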
### Util
**Source**: [`Util.hs`](../src/Simplex/Messaging/Util.hs)
Selected utilities used across the codebase:
**Monadic combinators**:
| Function | Signature | Purpose |
|----------|-----------|---------|
| `<$?>` | `MonadFail m => (a -> Either String b) -> m a -> m b` | Lift fallible function into parser |
| `$>>=` | `(Monad m, Monad f, Traversable f) => m (f a) -> (a -> m (f b)) -> m (f b)` | Monadic bind through nested monad |
| `ifM` / `whenM` / `unlessM` | | Monadic conditionals |
| `anyM` | | Short-circuit `any` for monadic predicates (strict) |
**Error handling**:
| Function | Purpose |
|----------|---------|
| `tryAllErrors` | Catch all exceptions (including async) into `ExceptT` |
| `catchAllErrors` | Same with handler |
| `tryAllOwnErrors` | Catch only "own" exceptions (re-throws async cancellation) |
| `catchAllOwnErrors` | Same with handler |
| `isOwnException` | `StackOverflow`, `HeapOverflow`, `AllocationLimitExceeded` |
| `isAsyncCancellation` | Any `SomeAsyncException` except own exceptions |
| `catchThrow` | Catch exceptions, wrap in Left |
| `allFinally` | `tryAllErrors` + `final` + `except` (like `finally` for ExceptT) |
The own-vs-async distinction is critical: `catchOwn`/`tryAllOwnErrors` never swallow async cancellation (`ThreadKilled`, `UserInterrupt`, etc.), only synchronous exceptions and resource exhaustion (`StackOverflow`, `HeapOverflow`, `AllocationLimitExceeded`).
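The classification described above can be expressed with `Control.Exception` alone. The following is a sketch consistent with the table, not a copy of `Util.hs`: the function names match the table, but the bodies are written for this example.

```haskell
{-# LANGUAGE ScopedTypeVariables #-}

import Control.Exception

-- Resource-exhaustion exceptions caused by the thread's own work.
isOwnException :: SomeException -> Bool
isOwnException e = case fromException e of
  Just StackOverflow -> True
  Just HeapOverflow -> True
  _ -> case fromException e of
    Just AllocationLimitExceeded -> True
    _ -> False

-- Any asynchronous exception that is not one of the "own" exceptions,
-- for example ThreadKilled or UserInterrupt.
isAsyncCancellation :: SomeException -> Bool
isAsyncCancellation e = case fromException e of
  Just (_ :: SomeAsyncException) -> not (isOwnException e)
  Nothing -> False
```

A handler that wants the "own" behaviour would re-throw whenever `isAsyncCancellation` returns `True` and handle everything else.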
**STM**:
| Function | Purpose |
|----------|---------|
| `tryWriteTBQueue` | Non-blocking bounded queue write, returns success |
**Database result helpers**:
| Function | Purpose |
|----------|---------|
| `firstRow` | Extract first row with transform, or Left error |
| `maybeFirstRow` | Extract first row as Maybe |
| `firstRow'` | Like `firstRow` but transform can also fail |
**Collection utilities**:
| Function | Purpose |
|----------|---------|
| `groupOn` | `groupBy` using equality on projected key |
| `groupAllOn` | `groupOn` after `sortOn` (groups non-adjacent elements) |
| `toChunks` | Split list into `NonEmpty` chunks of size n |
| `packZipWith` | Optimized ByteString zipWith (direct memory access) |
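The first three helpers are small enough to sketch with `base` only. These definitions follow the descriptions in the table and are written for illustration; they may differ from the versions in `Util.hs`.

```haskell
import Data.Function (on)
import Data.List (groupBy, sortOn)
import Data.List.NonEmpty (NonEmpty (..))

-- Group adjacent elements whose projected keys are equal.
groupOn :: Eq k => (a -> k) -> [a] -> [[a]]
groupOn f = groupBy ((==) `on` f)

-- Sort by key first, so equal keys end up adjacent and share one group.
groupAllOn :: Ord k => (a -> k) -> [a] -> [[a]]
groupAllOn f = groupOn f . sortOn f

-- Split into non-empty chunks of at most n elements.
toChunks :: Int -> [a] -> [NonEmpty a]
toChunks _ [] = []
toChunks n (x : xs) = (x :| take (n - 1) xs) : toChunks n (drop (n - 1) xs)
```

With the input `[2, 4, 1, 6]` and key `even`, `groupOn` produces three groups because the odd element splits the run, whereas `groupAllOn` produces two.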
**Miscellaneous**:
| Function | Purpose |
|----------|---------|
| `safeDecodeUtf8` | Decode UTF-8 replacing errors with `'?'` |
| `bshow` / `tshow` | `show` to `ByteString` / `Text` |
| `threadDelay'` | `Int64` delay (handles overflow by looping) |
| `diffToMicroseconds` / `diffToMilliseconds` | `NominalDiffTime` conversion |
| `labelMyThread` | Label current thread for debugging |
| `encodeJSON` / `decodeJSON` | `ToJSON a => a -> Text` / `FromJSON a => Text -> Maybe a` |
| `traverseWithKey_` | `Map` traversal discarding results |
## Security notes
- **Length prefix overflow**: `ByteString` encoding uses a 1-byte length and does not check that the input fits, so strings longer than 255 bytes are silently mis-encoded instead of rejected. Callers must ensure size bounds before encoding. `Large` extends the limit to 65535 bytes via a Word16 prefix.
- **`Tail` unbounded**: `Tail` consumes all remaining input with no size check. Only safe when total message size is already bounded (e.g., within a padded SMP block).
- **base64 vs base64url**: `Parsers.base64P` uses standard alphabet (`+`/`/`), while `String.base64urlP` uses URL-safe alphabet (`-`/`_`). Mixing them causes silent decode failures.
- **`safeDecodeUtf8`**: Replaces invalid UTF-8 with `'?'` rather than failing. Suitable for logging/display, not for security-critical string comparison.