From 7ece87f1b63e28f042c5331f48f174709ccd26a5 Mon Sep 17 00:00:00 2001
From: "Evgeny @ SimpleX Chat" <259188159+evgeny-simplex@users.noreply.github.com>
Date: Wed, 11 Mar 2026 09:47:18 +0000
Subject: [PATCH] encoding notes

---
 spec/modules/Simplex/Messaging/Encoding.md        | 4 ++++
 spec/modules/Simplex/Messaging/Encoding/String.md | 7 +++++--
 2 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/spec/modules/Simplex/Messaging/Encoding.md b/spec/modules/Simplex/Messaging/Encoding.md
index f485aeaa4..8db63d0cc 100644
--- a/spec/modules/Simplex/Messaging/Encoding.md
+++ b/spec/modules/Simplex/Messaging/Encoding.md
@@ -36,6 +36,10 @@ Sequential concatenation with no separators. Works because each element's encodi
 
 Only seconds are encoded (as Int64); nanoseconds are discarded on encode and set to 0 on decode.
 
+## String instance
+
+`smpEncode` goes through `B.pack`, which silently truncates any Unicode character above codepoint 255 to its lowest byte. A String containing non-Latin-1 characters is silently corrupted on encode with no error. Same issue exists in the `StrEncoding String` instance — see [Simplex.Messaging.Encoding.String](./Encoding/String.md#string-instance).
+
 ## smpEncodeList / smpListP
 
 1-byte length prefix for lists — same 255-item limit as ByteString's 255-byte limit.
diff --git a/spec/modules/Simplex/Messaging/Encoding/String.md b/spec/modules/Simplex/Messaging/Encoding/String.md
index 60ac9e496..1e60295b8 100644
--- a/spec/modules/Simplex/Messaging/Encoding/String.md
+++ b/spec/modules/Simplex/Messaging/Encoding/String.md
@@ -27,9 +27,12 @@ Inherits from ByteString via `B.pack` / `B.unpack`. Only Char8 (Latin-1) charact
 
 `strToJSON` uses `decodeLatin1`, not `decodeUtf8'`. This preserves arbitrary byte sequences (e.g., base64url-encoded binary data) as JSON strings without UTF-8 validation errors, but means the JSON representation is Latin-1, not UTF-8.
 
-## Default strP fallback
+## Class default: strP assumes base64url for all types
 
-If only `strDecode` is defined (no custom `strP`), the default parser runs `base64urlP` first, then passes the decoded bytes to `strDecode`. This means the type's own `strDecode` receives raw bytes, not the base64url text. Easy to confuse when implementing a new instance.
+The `MINIMAL` pragma allows defining only `strDecode` without `strP`. But the default `strP = strDecode <$?> base64urlP` then assumes input is base64url-encoded — for *any* type, not just ByteString. Two consequences:
+
+1. The type's `strDecode` receives raw decoded bytes, not the base64url text. Easy to confuse when implementing a new instance.
+2. `base64urlP` requires non-empty input (`takeWhile1`), so the default `strP` cannot parse empty values — even if `strDecode ""` would succeed. Types that can encode to empty output must define `strP` explicitly.
 
 ## listItem
 
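
Two behaviours described in these notes, the low-byte truncation in `B.pack` and the empty-input failure of `takeWhile1`, can be checked with a minimal Haskell sketch. It assumes the `bytestring` and `attoparsec` packages are available. `base64urlChars` is a hypothetical stand-in for the character-matching step of `base64urlP`, not the library's definition, and it performs no base64url decoding.

```haskell
module Main where

import qualified Data.Attoparsec.ByteString.Char8 as A
import qualified Data.ByteString.Char8 as B
import Data.Char (isAlphaNum)
import Data.Either (isLeft)

-- Hypothetical stand-in (an assumption, not the simplexmq definition):
-- it only mirrors the "at least one character" requirement of takeWhile1.
base64urlChars :: A.Parser B.ByteString
base64urlChars = A.takeWhile1 (\c -> isAlphaNum c || c == '-' || c == '_')

main :: IO ()
main = do
  -- B.pack keeps only the low 8 bits of each Char, so U+044F collapses
  -- to byte 0x4F, the same byte as the ASCII letter 'O'.
  print (B.pack "\x044F" == B.pack "O")
  -- takeWhile1 needs at least one matching character, so empty input fails.
  print (isLeft (A.parseOnly (base64urlChars <* A.endOfInput) B.empty))
  -- Non-empty input made of base64url characters is accepted.
  print (A.parseOnly (base64urlChars <* A.endOfInput) (B.pack "YWJj"))
```

The first two lines printed should be `True`, and the third a `Right` value. A type whose encoding may be empty would therefore need its own `strP` rather than the class default, as the second consequence above states.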