From 7ece87f1b63e28f042c5331f48f174709ccd26a5 Mon Sep 17 00:00:00 2001
From: "Evgeny @ SimpleX Chat"
 <259188159+evgeny-simplex@users.noreply.github.com>
Date: Wed, 11 Mar 2026 09:47:18 +0000
Subject: [PATCH] encoding notes

---
 spec/modules/Simplex/Messaging/Encoding.md        | 4 ++++
 spec/modules/Simplex/Messaging/Encoding/String.md | 7 +++++--
 2 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/spec/modules/Simplex/Messaging/Encoding.md b/spec/modules/Simplex/Messaging/Encoding.md
index f485aeaa4..8db63d0cc 100644
--- a/spec/modules/Simplex/Messaging/Encoding.md
+++ b/spec/modules/Simplex/Messaging/Encoding.md
@@ -36,6 +36,10 @@ Sequential concatenation with no separators. Works because each element's encodi
 
 Only seconds are encoded (as Int64); nanoseconds are discarded on encode and set to 0 on decode.
 
+## String instance
+
+`smpEncode` goes through `B.pack`, which truncates every character above code point 255 to its low 8 bits. A String containing non-Latin-1 characters is therefore corrupted on encode without any error. The same issue exists in the `StrEncoding String` instance; see [Simplex.Messaging.Encoding.String](./Encoding/String.md#string-instance).
+
 ## smpEncodeList / smpListP
 
 1-byte length prefix for lists — same 255-item limit as ByteString's 255-byte limit.
diff --git a/spec/modules/Simplex/Messaging/Encoding/String.md b/spec/modules/Simplex/Messaging/Encoding/String.md
index 60ac9e496..1e60295b8 100644
--- a/spec/modules/Simplex/Messaging/Encoding/String.md
+++ b/spec/modules/Simplex/Messaging/Encoding/String.md
@@ -27,9 +27,12 @@ Inherits from ByteString via `B.pack` / `B.unpack`. Only Char8 (Latin-1) charact
 
 `strToJSON` uses `decodeLatin1`, not `decodeUtf8'`. This preserves arbitrary byte sequences (e.g., base64url-encoded binary data) as JSON strings without UTF-8 validation errors, but means the JSON representation is Latin-1, not UTF-8.
 
-## Default strP fallback
+## Class default: strP assumes base64url for all types
 
-If only `strDecode` is defined (no custom `strP`), the default parser runs `base64urlP` first, then passes the decoded bytes to `strDecode`. This means the type's own `strDecode` receives raw bytes, not the base64url text. Easy to confuse when implementing a new instance.
+The `MINIMAL` pragma allows defining only `strDecode`, without `strP`. The default `strP = strDecode <$?> base64urlP` then assumes the input is base64url-encoded for *any* type, not just ByteString. Two consequences:
+
+1. The type's `strDecode` receives the raw decoded bytes, not the base64url text. This is easy to get wrong when implementing a new instance.
+2. `base64urlP` requires non-empty input (`takeWhile1`), so the default `strP` cannot parse empty values, even if `strDecode ""` would succeed. Types that can encode to empty output must define `strP` explicitly.
 
 ## listItem