From 3b4bf92015e3787edc56edabf3dee66693fd5a00 Mon Sep 17 00:00:00 2001
From: Narasimha-sc <166327228+Narasimha-sc@users.noreply.github.com>
Date: Wed, 13 May 2026 15:45:39 +0000
Subject: [PATCH] core: include trailing "_" and "!" characters in links
 (#6973)

* core: include trailing "_" and "!" characters in links

* docs: plan for keeping trailing "_" and "!" in links
---
 ...12-link-trailing-underscore-exclamation.md | 151 ++++++++++++++++++
 src/Simplex/Chat/Markdown.hs                  |   2 +
 tests/MarkdownTests.hs                        |   4 +
 3 files changed, 157 insertions(+)
 create mode 100644 plans/2026-05-12-link-trailing-underscore-exclamation.md

diff --git a/plans/2026-05-12-link-trailing-underscore-exclamation.md b/plans/2026-05-12-link-trailing-underscore-exclamation.md
new file mode 100644
index 0000000000..c92da7e8df
--- /dev/null
+++ b/plans/2026-05-12-link-trailing-underscore-exclamation.md
@@ -0,0 +1,151 @@
+# Links: trailing `_` and `!` dropped from the highlighted link
+
+Design doc for the fix shipped in PR #6973.
+
+## Problem
+
+A bare URL or domain ending in `_` (or `!`) was highlighted as a link only up
+to the last non-`_` character — the trailing `_` rendered as plain,
+non-clickable text. For example `https://en.wikipedia.org/wiki/The_Lord_of_the_Rings_`
+showed `…The_Lord_of_the_Rings` as a blue link followed by a separate, plain
+`_`. Reported for `_`; the same defect applies to `!`.
+
+## Background — how bare links are parsed
+
+`parseMarkdown` (`src/Simplex/Chat/Markdown.hs`) splits a message into
+fragments; a fragment that isn't a recognized markdown construct falls through
+`wordP` → `wordMD`, which decides whether the "word" (the run up to the next
+space) is a URI / SimpleX link / domain / email.
+
+To handle the very common case of a link immediately followed by sentence
+punctuation — `check out https://simplex.chat.` or `(https://simplex.chat)` —
+`wordMD` peels a trailing run of "punctuation" off the word and re-emits it as
+`unmarked` text:
+
+```haskell
+where
+  punct   = T.takeWhileEnd isPunctuation' s
+  s'      = T.dropWhileEnd isPunctuation' s
+  res md' = if T.null punct then md' else md' :|: unmarked punct
+```
+
+`isPunctuation'` is `Data.Char.isPunctuation` with exemptions for characters
+that legitimately *end* a URL: `/` (trailing path separator, e.g.
+`https://github.com/simplex-chat/`) and `)` (Wikipedia disambiguation, e.g.
+`…/wiki/Servo_(software)`).
+
+All link-highlighting surfaces derive from the result of this parser: the
+desktop/Android UI (`TextItemView.kt`) and iOS UI (`MsgContentView.swift`)
+both call `chatParseMarkdown` (FFI into the bundled Haskell core) and style a
+`Uri` / `HyperLink` fragment by its whole `text`; the compose-preview path uses
+the same function; `Styled.hs` does the terminal rendering. So whatever the
+parser puts inside the `Uri` fragment is exactly what gets highlighted, on
+every platform.
+
+## Root cause
+
+`Data.Char.isPunctuation '_' == True` — `_` is Unicode `ConnectorPunctuation`
+(`Pc`). `isPunctuation '!' == True` — `!` is `OtherPunctuation` (`Po`). Neither
+was in the `isPunctuation'` exemption list, so a trailing `_` or `!` was always
+stripped from the URI text.
+
+For `https://simplex.chat/page_name_`:
+
+- `punct = "_"`, `s' = "https://simplex.chat/page_name"`
+- output: `Uri "https://simplex.chat/page_name" :|: unmarked "_"`
+
+The UI highlights the `Uri` fragment by its text, so the `_` lands outside the
+blue/clickable span — exactly the reported behaviour.
+
+## Fix
+
+Add `_` and `!` to the `isPunctuation'` exemptions, alongside `/` and `)`:
+
+```haskell
+isPunctuation' = \case
+  '/' -> False
+  ')' -> False
+  '_' -> False
+  '!' -> False
+  c -> isPunctuation c
+```
+
+`T.takeWhileEnd isPunctuation'` now stops at a trailing `_`/`!`, so the full
+token is kept in `s'` and emitted as a single `Uri` fragment. Anything still
+trailing it (`.`, `,`, ` …`) is peeled off as before:
+
+- `https://simplex.chat/page_name_` → `Uri "https://simplex.chat/page_name_"`
+- `https://simplex.chat/page_name_, hello` → `Uri "…/page_name_" :|: unmarked ", hello"`
+- `https://simplex.chat/page!` → `Uri "https://simplex.chat/page!"`
+
+## Why this is the right place
+
+- `wordMD`/`isPunctuation'` is the single point where bare-link text is
+  trimmed, and it already encodes "these characters legitimately end a link."
+  `_` and `!` belong in that list next to `/` and `)`.
+- `_` and `!` are RFC 3986–valid URL characters (`_` is in `unreserved`, `!` is
+  a `sub-delim`); `_` is never sentence-ending punctuation.
+- Fixing it in the parser fixes every surface at once (desktop, Android, iOS,
+  terminal, compose preview), because they all consume the same `FormattedText`.
+  A UI-layer patch would have to be repeated per platform and would leave
+  `Styled.hs` wrong.
+
+## Why a wider change is not in scope
+
+- The reported bug is fully resolved by the two-line addition to the exemption
+  `case`. Nothing more is required.
+- `isPunctuation'` is shared by the URI, domain and email branches of `wordMD`.
+  Exempting `_`/`!` for all three is the intended behaviour, with one minor
+  knock-on: `user@example.com!` now renders as plain text rather than
+  `Email "user@example.com" :|: unmarked "!"`, because `user@example.com!`
+  isn't a valid email so the whole token isn't recognized. This is acceptable —
+  `!` is now treated consistently as part of the token everywhere — and is
+  preferable to splitting `isPunctuation'` into a URI predicate and an email
+  predicate, which adds structure for a marginal case. Phones are unaffected
+  (`phoneP` is a separate parser that doesn't use `isPunctuation'`).
+- `good-code-v5.md` — *"Find the minimal change … the smallest structural
+  modification that achieves the goal."* The smallest modification that
+  resolves the report is two lines in the exemption `case`.
+
+## Backward compatibility
+
+Pure parsing change, no wire-format impact. `FormattedText` keeps the same
+shape; only which characters fall inside a `Uri`/`Email` fragment changes.
+Messages already stored keep their previously-parsed formatting — re-parsing
+happens on compose / receive, not on display of stored items. An old client
+receiving a message authored by a fixed client parses the raw text itself and
+behaves per its own (older) rule — no incompatibility either way.
+
+## Verification
+
+`tests/MarkdownTests.hs`, `describe "text with Uri"` — four cases added:
+
+- `"https://simplex.chat/page_name_" <==> uri "https://simplex.chat/page_name_"`
+  — the trailing `_` is part of the link.
+- `"https://simplex.chat/page_name_, hello" <==> uri "https://simplex.chat/page_name_" <> ", hello"`
+  — `_` kept, the `, hello` after it still peeled off.
+- `"https://simplex.chat/page!" <==> uri "https://simplex.chat/page!"`
+- `"https://simplex.chat/page!, hello" <==> uri "https://simplex.chat/page!" <> ", hello"`
+
+`MarkdownTests` suite: 38 examples, 0 failures. The existing exemption / peel
+coverage is unchanged — `…/simplex-chat/`, `…/wiki/Servo_(software)`,
+`https://simplex.chat.` → link + `.`, `https://simplex.chat, hello` → link +
+`, hello`, etc.
+
+Manual sanity (desktop, Linux AppImage build): a message containing
+`https://en.wikipedia.org/wiki/The_Lord_of_the_Rings_` highlights the whole URL
+including the trailing `_`.
+
+## Alternatives considered and rejected
+
+- **Split `isPunctuation'` into a URI predicate and an email predicate** so `!`
+  is kept only inside URLs. Adds a second predicate and a branch solely to
+  preserve `Email "x@y.z" :|: unmarked "!"` on `x@y.z!` — a marginal case. The
+  shared predicate is simpler; rejected.
+- **Strip `_`/`!` only when followed by more URL-looking text.** Requires
+  look-ahead the trailing-trim model doesn't have, for no real benefit — `_`
+  and `!` aren't sentence punctuation in the first place.
+- **Extend the link span over a trailing `_` in the UI layer.** Wrong layer:
+  the parser is the single source of truth for `FormattedText`, consumed by
+  three platforms plus the terminal renderer; a UI-only patch would diverge per
+  platform.
diff --git a/src/Simplex/Chat/Markdown.hs b/src/Simplex/Chat/Markdown.hs
index 39000fde84..9325de41eb 100644
--- a/src/Simplex/Chat/Markdown.hs
+++ b/src/Simplex/Chat/Markdown.hs
@@ -302,6 +302,8 @@ markdownP = mconcat <$> A.many' fragmentP
     isPunctuation' = \case
       '/' -> False
       ')' -> False
+      '_' -> False
+      '!' -> False
       c -> isPunctuation c
     isUri s = T.length s >= 10 && any (`T.isPrefixOf` s) ["http://", "https://", "simplex:/"]
     -- matches what is likely to be a domain, not all valid domain names
diff --git a/tests/MarkdownTests.hs b/tests/MarkdownTests.hs
index 54483babec..a82e18f988 100644
--- a/tests/MarkdownTests.hs
+++ b/tests/MarkdownTests.hs
@@ -227,6 +227,10 @@ textWithUri = describe "text with Uri" do
     "https://github.com/simplex-chat/ - SimpleX on GitHub" <==> uri "https://github.com/simplex-chat/" <> " - SimpleX on GitHub"
     -- "SimpleX on GitHub (https://github.com/simplex-chat/)" <==> "SimpleX on GitHub (" <> uri "https://github.com/simplex-chat/" <> ")"
     "https://en.m.wikipedia.org/wiki/Servo_(software)" <==> uri "https://en.m.wikipedia.org/wiki/Servo_(software)"
+    "https://simplex.chat/page_name_" <==> uri "https://simplex.chat/page_name_"
+    "https://simplex.chat/page_name_, hello" <==> uri "https://simplex.chat/page_name_" <> ", hello"
+    "https://simplex.chat/page!" <==> uri "https://simplex.chat/page!"
+    "https://simplex.chat/page!, hello" <==> uri "https://simplex.chat/page!" <> ", hello"
     "example.com" <==> uri "example.com"
     "example.com." <==> uri "example.com" <> "."
     "example.com..." <==> uri "example.com" <> "..."