Reverts element-hq/synapse#18416
Unfortunately, this causes failures on `/sendToDevice` endpoint in
normal circumstances. If a single user has, say, a hundred devices then
we easily go over the limit. This blocks message sending entirely in
encrypted rooms.
cc @MadLittleMods @MatMaul
Fixes: #8088
Previously we would perform OIDC discovery on startup,
which involves making HTTP requests to the identity provider(s).
If that took a long time, we would block startup.
If that failed, we would crash startup.
This commit:
- makes the loading happen in the background on startup
- makes an error in the 'preload' non-fatal (though it logs at CRITICAL
for visibility)
- adds a templated error page to show on failed redirects (for
unavailable providers), as otherwise you get a JSON response in your
navigator.
- This involves introducing 2 new exception types to mark other
exceptions and keep the error handling fine-grained.
The machinery was already there to load-on-demand the discovery config,
so when the identity provider
comes back up, the discovery is reattempted and login can succeed.
Signed-off-by: Olivier 'reivilibre <oliverw@matrix.org>
Fixes https://github.com/element-hq/synapse/issues/19494
MSC4284 policy servers
This:
* removes the old `/check` (recommendation) support because it's from an
older design. Policy servers should have updated to `/sign` by now. We
also remove optionality around the policy server's public key because it
was only optional to support `/check`.
* supports the stable `m.room.policy` state event and `/sign` endpoints,
falling back to unstable if required. Note the changes between unstable
and stable:
* Stable `/sign` uses errors instead of an empty signatures block to
indicate refusal.
* Stable `m.room.policy` nests the public key in an object with explicit
key algorithm (always ed25519 for now)
* does *not* introduce tests that the above fallback to unstable works.
If it breaks, we're not going to be sad about an early transition. Tests
can be added upon request, though.
* fixes a bug where the policy server was asked to sign policy server
state events (the events were correctly skipped in `is_event_allowed`,
but `ask_policy_server_to_sign_event` didn't do the same).
* fixes a bug where the original event sender's signature can be deleted
if the sending server is the same as the policy server.
* proxies Matrix-shaped errors from the policy server to the
Client-Server API as `SynapseError`s (a new capability of the stable
API).
Membership event handling (from the issue) is expected to be a different
PR due to the size of changes involved (tracked by
https://github.com/element-hq/synapse/issues/19587).
### Pull Request Checklist
<!-- Please read
https://element-hq.github.io/synapse/latest/development/contributing_guide.html
before submitting your pull request -->
* [x] Pull request is based on the develop branch
* [x] Pull request includes a [changelog
file](https://element-hq.github.io/synapse/latest/development/contributing_guide.html#changelog).
The entry should:
- Be a short description of your change which makes sense to users.
"Fixed a bug that prevented receiving messages from other servers."
instead of "Moved X method from `EventStore` to `EventWorkerStore`.".
- Use markdown where necessary, mostly for `code blocks`.
- End with either a period (.) or an exclamation mark (!).
- Start with a capital letter.
- Feel free to credit yourself, by adding a sentence "Contributed by
@github_username." or "Contributed by [Your Name]." to the end of the
entry.
* [x] [Code
style](https://element-hq.github.io/synapse/latest/code_style.html) is
correct (run the
[linters](https://element-hq.github.io/synapse/latest/development/contributing_guide.html#run-the-linters))
---------
Co-authored-by: turt2live <1190097+turt2live@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Eric Eastwood <madlittlemods@gmail.com>
Fixes#19347
This deprecates MSC2697 which has been closed since May 2024. As per
#19347 this seems to be a thing we can just rip out. The crypto team
have moved onto MSC3814 and are suggesting that developers who rely on
MSC2697 should use MSC3814 instead.
MSC2697 implementation originally introduced by https://github.com/matrix-org/synapse/pull/8380
Fixes https://github.com/element-hq/synapse/issues/19175
This PR moves tracking of what lazy loaded membership we've sent to each
room out of the required state table. This avoids that table from
continuously growing, which massively helps performance as we pull out
all matching rows for the connection when we receive a request.
The new table is only read when we have data in a room to send, so we
end up reading a lot fewer rows from the DB. Though we now read from
that table for every room we have events to return in, rather than once
at the start of the request.
For an explanation of how the new table works, see the
[comment](https://github.com/element-hq/synapse/blob/erikj/sss_better_membership_storage2/synapse/storage/schema/main/delta/93/02_sliding_sync_members.sql#L15-L38)
on the table schema.
The table is designed so that we can later prune old entries if we wish,
but that is not implemented in this PR.
Reviewable commit-by-commit.
---------
Co-authored-by: Eric Eastwood <erice@element.io>
This changes the arguments in clock functions to be `Duration` and
converts call sites and constants into `Duration`. There are still some
more functions around that should be converted (e.g.
`timeout_deferred`), but we leave that to another PR.
We also changes `.as_secs()` to return a float, as the rounding broke
things subtly. The only reason to keep it (its the same as
`timedelta.total_seconds()`) is for symmetry with `as_millis()`.
Follows on from https://github.com/element-hq/synapse/pull/19223
MSC4380 aims to be a simplified implementation of MSC4155; the hope is
that we can get it specced and rolled out rapidly, so that we can
resolve the fact that `matrix.org` has enabled MSC4155.
The implementation leans heavily on what's already there for MSC4155.
It has its own `experimental_features` flag. If both MSC4155 and MSC4380
are enabled, and a user has both configurations set, then we prioritise
the MSC4380 one.
Contributed wearing my 🎩 Spec Core Team hat.
I noticed this in some profiling. Basically, we prune the ratelimiters
by copying and iterating over every entry every 60 seconds. Instead,
let's use a wheel timer to track when we should potentially prune a
given key, and then we a) check fewer keys, and b) can run more
frequently. Hopefully this should mean we don't have a large pause
everytime we prune a ratelimiter with lots of keys.
Also fixes a bug where we didn't prune entries that were added via
`record_action` and never subsequently updated. This affected the media
and joins-per-room ratelimiter.
Spawning from adding some logcontext debug logs in
https://github.com/element-hq/synapse/pull/18966 and since we're not
logging at the `set_current_context(...)` level (see reasoning there),
this removes some usage of `set_current_context(...)`.
Specifically, `MockClock.call_later(...)` doesn't handle logcontexts
correctly. It uses the calling logcontext as the callback context
(wrong, as the logcontext could finish before the callback finishes) and
it didn't reset back to the sentinel context before handing back to the
reactor. It was like this since it was [introduced 10+ years
ago](38da9884e7).
Instead of fixing the implementation which would just be a copy of our
normal `Clock`, we can just remove `MockClock`
### Background
As part of Element's plan to support a light form of vhosting (virtual
host) (multiple instances of Synapse in the same Python process), we're
currently diving into the details and implications of running multiple
instances of Synapse in the same Python process.
"Per-tenant logging" tracked internally by
https://github.com/element-hq/synapse-small-hosts/issues/48
### Prior art
Previously, we exposed `server_name` by providing a static logging
`MetadataFilter` that injected the values:
205d9e4fc4/synapse/config/logger.py (L216)
While this can work fine for the normal case of one Synapse instance per
Python process, this configures things globally and isn't compatible
when we try to start multiple Synapse instances because each subsequent
tenant will overwrite the previous tenant.
### What does this PR do?
We remove the `MetadataFilter` and replace it by tracking the
`server_name` in the `LoggingContext` and expose it with our existing
[`LoggingContextFilter`](205d9e4fc4/synapse/logging/context.py (L584-L622))
that we already use to expose information about the `request`.
This means that the `server_name` value follows wherever we log as
expected even when we have multiple Synapse instances running in the
same process.
### A note on logcontext
Anywhere, Synapse mistakenly uses the `sentinel` logcontext to log
something, we won't know which server sent the log. We've been fixing up
`sentinel` logcontext usage as tracked by
https://github.com/element-hq/synapse/issues/18905
Any further `sentinel` logcontext usage we find in the future can be
fixed piecemeal as normal.
d2a966f922/docs/log_contexts.md (L71-L81)
### Testing strategy
1. Adjust your logging config to include `%(server_name)s` in the format
```yaml
formatters:
precise:
format: '%(asctime)s - %(server_name)s - %(name)s - %(lineno)d -
%(levelname)s - %(request)s - %(message)s'
```
1. Start Synapse: `poetry run synapse_homeserver --config-path
homeserver.yaml`
1. Make some requests (`curl
http://localhost:8008/_matrix/client/versions`, etc)
1. Open the homeserver logs and notice the `server_name` in the logs as
expected. `unknown_server_from_sentinel_context` is expected for the
`sentinel` logcontext (things outside of Synapse).
Introduce `Clock.call_when_running(...)` to wrap startup code in a
logcontext, ensuring we can identify which server generated the logs.
Background:
> Ideally, nothing from the Synapse homeserver would be logged against the `sentinel`
> logcontext as we want to know which server the logs came from. In practice, this is not
> always the case yet especially outside of request handling.
>
> Global things outside of Synapse (e.g. Twisted reactor code) should run in the
> `sentinel` logcontext. It's only when it calls into application code that a logcontext
> gets activated. This means the reactor should be started in the `sentinel` logcontext,
> and any time an awaitable yields control back to the reactor, it should reset the
> logcontext to be the `sentinel` logcontext. This is important to avoid leaking the
> current logcontext to the reactor (which would then get picked up and associated with
> the next thing the reactor does).
>
> *-- `docs/log_contexts.md`
Also adds a lint to prefer `Clock.call_when_running(...)` over
`reactor.callWhenRunning(...)`
Part of https://github.com/element-hq/synapse/issues/18905
This can be reviewed commit by commit
There are a few improvements over the experimental support:
- authorisation of Synapse <-> MAS requests is simplified, with a single
shared secret, removing the need for provisioning a client on the MAS
side
- the tests actually spawn a real server, allowing us to test the rust
introspection layer
- we now check that the device advertised in introspection actually
exist, making it so that when a user logs out, the tokens are
immediately invalidated, even if the cache doesn't expire
- it doesn't rely on discovery anymore, rather on a static endpoint
base. This means users don't have to override the introspection endpoint
to avoid internet roundtrips
- it doesn't depend on `authlib` anymore, as we simplified a lot the
calls done from Synapse to MAS
We still have to update the MAS documentation about the Synapse setup,
but that can be done later.
---------
Co-authored-by: reivilibre <oliverw@element.io>
Bulk refactor `Counter` metrics to be homeserver-scoped. We also add
lints to make sure that new `Counter` metrics don't sneak in without
using the `server_name` label (`SERVER_NAME_LABEL`).
All of the "Fill in" commits are just bulk refactor.
Part of https://github.com/element-hq/synapse/issues/18592
### Testing strategy
1. Add the `metrics` listener in your `homeserver.yaml`
```yaml
listeners:
# This is just showing how to configure metrics either way
#
# `http` `metrics` resource
- port: 9322
type: http
bind_addresses: ['127.0.0.1']
resources:
- names: [metrics]
compress: false
# `metrics` listener
- port: 9323
type: metrics
bind_addresses: ['127.0.0.1']
```
1. Start the homeserver: `poetry run synapse_homeserver --config-path
homeserver.yaml`
1. Fetch `http://localhost:9322/_synapse/metrics` and/or
`http://localhost:9323/metrics`
1. Observe response includes the `synapse_user_registrations_total`,
`synapse_http_server_response_count_total`, etc metrics with the
`server_name` label
Default values will be 1 room per minute, with a burst count of 10.
It's hard to imagine most users will be affected by this default rate,
but it's intentionally non-invasive in case of bots or other users that
need to create rooms at a large rate.
Server admins might want to down-tune this on their deployments.
---------
Signed-off-by: Olivier 'reivilibre <oliverw@matrix.org>
The main goal of this PR is to handle device list changes onto multiple
writers, off the main process, so that we can have logins happening
whilst Synapse is rolling-restarting.
This is quite an intrusive change, so I would advise to review this
commit by commit; I tried to keep the history as clean as possible.
There are a few things to consider:
- the `device_list_key` in stream tokens becomes a
`MultiWriterStreamToken`, which has a few implications in sync and on
the storage layer
- we had a split between `DeviceHandler` and `DeviceWorkerHandler` for
master vs. worker process. I've kept this split, but making it rather
writer vs. non-writer worker, using method overrides for doing
replication calls when needed
- there are a few operations that need to happen on a single worker at a
time. Instead of using cross-worker locks, for now I made them run on
the first writer on the list
---------
Co-authored-by: Eric Eastwood <erice@element.io>
Refactor `Measure` block metrics to be homeserver-scoped (add
`server_name` label to block metrics).
Part of https://github.com/element-hq/synapse/issues/18592
### Testing strategy
#### See behavior of previous `metrics` listener
1. Add the `metrics` listener in your `homeserver.yaml`
```yaml
listeners:
- port: 9323
type: metrics
bind_addresses: ['127.0.0.1']
```
1. Start the homeserver: `poetry run synapse_homeserver --config-path
homeserver.yaml`
1. Fetch `http://localhost:9323/metrics`
1. Observe response includes the block metrics
(`synapse_util_metrics_block_count`,
`synapse_util_metrics_block_in_flight`, etc)
#### See behavior of the `http` `metrics` resource
1. Add the `metrics` resource to a new or existing `http` listeners in
your `homeserver.yaml`
```yaml
listeners:
- port: 9322
type: http
bind_addresses: ['127.0.0.1']
resources:
- names: [metrics]
compress: false
```
1. Start the homeserver: `poetry run synapse_homeserver --config-path
homeserver.yaml`
1. Fetch `http://localhost:9322/_synapse/metrics` (it's just a `GET`
request so you can even do in the browser)
1. Observe response includes the block metrics
(`synapse_util_metrics_block_count`,
`synapse_util_metrics_block_in_flight`, etc)
We do this by shoving it into Rust. We believe our python http client is
a bit slow.
Also bumps minimum rust version to 1.81.0, released last September (over
six months ago)
To allow for async Rust, includes some adapters between Tokio in Rust
and the Twisted reactor in Python.
This can be reviewed commit by commit.
This enables the `flake8-logging` and `flake8-logging-format` rules in
Ruff, as well as logging exception stack traces in a few places where it
makes sense
- https://docs.astral.sh/ruff/rules/#flake8-logging-log
- https://docs.astral.sh/ruff/rules/#flake8-logging-format-g
### Linting to avoid pre-formatting log messages
See [`adamchainz/flake8-logging` -> *LOG011 avoid pre-formatting log
messages*](152db2f167/README.rst (log011-avoid-pre-formatting-log-messages))
Practically, this means prefer placeholders (`%s`) over f-strings for
logging.
This is because placeholders are passed as args to loggers, so they can
do special handling of them.
For example, Sentry will record the args separately in their logging
integration:
c15b390dfe/sentry_sdk/integrations/logging.py (L280-L284)
One theoretical small perf benefit is that log levels that aren't
enabled won't get formatted, so it doesn't unnecessarily create
formatted strings
This PR addresses a test failure for
`tests.handlers.test_worker_lock.WorkerLockTestCase.test_lock_contention`
which consistently times out on the RISC-V (specifically `riscv64`)
architecture.
The test simulates high lock contention and has a default timeout of 5
seconds, which seems sufficient for architectures like x86_64 but proves
too short for current RISC-V hardware/environment performance
characteristics, leading to spurious `tests.utils.TestTimeout` failures.
This fix introduces architecture detection using `platform.machine()`.
If a RISC-V architecture is detected:
* The timeout for this specific test is increased (e.g., to 15 seconds
).
The original, stricter timeout (5 seconds) and lock count (500) are
maintained for all other architectures to avoid masking potential
performance regressions elsewhere.
This change has been tested locally on RISC-V, where the test now passes
reliably, and on x86_64, where it continues to pass with the original
constraints.
---
### Pull Request Checklist
<!-- Please read
https://element-hq.github.io/synapse/latest/development/contributing_guide.html
before submitting your pull request -->
* [X] Pull request is based on the develop branch *(Assuming you based
it correctly)*
* [X] Pull request includes a [changelog
file](https://element-hq.github.io/synapse/latest/development/contributing_guide.html#changelog).
*(See below)*
* [X] [Code
style](https://element-hq.github.io/synapse/latest/code_style.html) is
correct (run the
[linters](https://element-hq.github.io/synapse/latest/development/contributing_guide.html#run-the-linters))
*(Please run linters locally)*
Spawning from
https://github.com/element-hq/synapse/pull/18375#discussion_r2071768635,
This updates some sliding sync tests to use a higher level function in
order to move test coverage to cover both fallback & new tables.
Important when https://github.com/element-hq/synapse/pull/18375 is
merged.
In other words, adjust tests to target `compute_interested_room(...)`
(relevant to both new and fallback path) instead of the lower level
`get_room_membership_for_user_at_to_token(...)` that only applies to the
fallback path.
### Dev notes
```
SYNAPSE_TEST_LOG_LEVEL=INFO poetry run trial tests.handlers.test_sliding_sync.ComputeInterestedRoomsTestCase_new
```
```
SYNAPSE_TEST_LOG_LEVEL=INFO poetry run trial tests.rest.client.sliding_sync
```
```
SYNAPSE_POSTGRES=1 SYNAPSE_POSTGRES_USER=postgres SYNAPSE_TEST_LOG_LEVEL=INFO poetry run trial tests.handlers.test_sliding_sync.ComputeInterestedRoomsTestCase_new.test_display_name_changes_leave_after_token_range
```
### Pull Request Checklist
<!-- Please read
https://element-hq.github.io/synapse/latest/development/contributing_guide.html
before submitting your pull request -->
* [x] Pull request is based on the develop branch
* [x] Pull request includes a [changelog
file](https://element-hq.github.io/synapse/latest/development/contributing_guide.html#changelog).
The entry should:
- Be a short description of your change which makes sense to users.
"Fixed a bug that prevented receiving messages from other servers."
instead of "Moved X method from `EventStore` to `EventWorkerStore`.".
- Use markdown where necessary, mostly for `code blocks`.
- End with either a period (.) or an exclamation mark (!).
- Start with a capital letter.
- Feel free to credit yourself, by adding a sentence "Contributed by
@github_username." or "Contributed by [Your Name]." to the end of the
entry.
* [x] [Code
style](https://element-hq.github.io/synapse/latest/code_style.html) is
correct
(run the
[linters](https://element-hq.github.io/synapse/latest/development/contributing_guide.html#run-the-linters))
---------
Co-authored-by: Eric Eastwood <erice@element.io>