Fix 'different_runtime_limit' race conditions

This commit fixes 3 related issues with the way runtime_limit was administered,
which could lead to race conditions (and hence the wrong runtime_limit
applying at some point in time). Post-fix, the following holds:

1. We use thread_locals to store this info, since there are at least 2 sources of
    threaded code that touch this (snappea's workers and the django debugserver)

2. We distinguish between the "from connection settings" timeout and the
    "temporarily overridden" ones, since we cannot assume
    connection-initialization happens first (as per the comment in base.py)

3. We store runtime-limits per alias ('using'). Needed for [2] (each connection
    may have a different moment-of-initialization, clobbering CM-set values from
    the other connection) and also needed once you realize there may be
    different defaults for the timeouts.
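The three points above can be sketched as follows. This is a minimal, hypothetical illustration of the bookkeeping described (thread-local storage, per-alias dicts, and a default/override distinction that survives late connection-initialization); the names are made up for this sketch and are not the actual implementation.

```python
import threading
from contextlib import contextmanager

_local = threading.local()  # per-thread, see point [1]


def _limits():
    # One dict per thread: alias -> {"default": ..., "override": ...}
    if not hasattr(_local, "limits"):
        _local.limits = {}
    return _local.limits


def set_default_runtime_limit(seconds, using="default"):
    # Called from connection-initialization; this may run *after* a context
    # manager has already set an override, so it must not clobber it ([2]).
    _limits().setdefault(using, {})["default"] = seconds


def get_runtime_limit(using="default"):
    entry = _limits().get(using, {})
    # An explicit override wins; otherwise fall back to the connection default.
    return entry.get("override", entry.get("default"))


@contextmanager
def different_runtime_limit(seconds, using="default"):
    # Per-alias ([3]): overriding one connection's limit leaves others alone.
    entry = _limits().setdefault(using, {})
    had_override = "override" in entry
    previous = entry.get("override")
    entry["override"] = seconds
    try:
        yield
    finally:
        # Unwind to the previous state so nested use works correctly.
        if had_override:
            entry["override"] = previous
        else:
            del entry["override"]
```

Note in particular that with this shape, a connection initializing *inside* an active `different_runtime_limit` block only fills in the "default" slot and does not reset the override.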

General context: I've recently started using the 'different runtime'
helper quite a bit more, and across connections (snappea!), which created more
and more doubts about whether it actually works as advertised.

Thoughts on "using" being required: I used to think "you can reason about a
global timeout value, and the current transaction makes clear what you're
actually doing", but as per the notes above that doesn't really work.

Thoughts on reproducing:
A few thoughts/notes on reproducing problems with race conditions. Basic note:
that's always hairy. So in the end I settled on a solution that's hopefully
easy to reason about, even if it's verbose.

When I started work on this commit, I focused on thread-safety; "proving the
problem" consisted of F5/^R on a web page with 2 context managers with different
timeouts, hoping to show that the stack unrolling didn't work properly.
However, during those "tests" I noticed quite a few resets-to-5s (coming from
the connection defaults), which prompted fix [2] from above.
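For the thread-safety half of the problem, the failure mode is easy to demonstrate in isolation. The following is a standalone illustration (not the project's code) of why a plain module-level variable races between threads while a thread-local does not; the barrier forces both writes to land before either thread reads back.

```python
import threading

_global_limit = None          # shared: last writer wins
_tl = threading.local()       # isolated per thread
observed_global = []
observed_local = []
barrier = threading.Barrier(2)


def worker(my_limit):
    global _global_limit
    _global_limit = my_limit  # both threads write the shared variable...
    _tl.limit = my_limit      # ...and their own thread-local slot
    barrier.wait()            # ensure both writes happen before either read
    observed_global.append((my_limit, _global_limit))
    observed_local.append((my_limit, _tl.limit))


threads = [threading.Thread(target=worker, args=(x,)) for x in (0.1, 5.0)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Thread-local reads always match what that thread set...
assert all(mine == seen for mine, seen in observed_local)
# ...while exactly one thread reads back the *other* thread's global value.
assert sum(1 for mine, seen in observed_global if mine != seen) == 1
```

The real-world version of this (requests racing snappea workers) is much hairier to trigger on demand, hence the verbose-but-deterministic approach above.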
Klaas van Schelven
2025-04-22 22:08:53 +02:00
parent 5c0e45a16d
commit 53d4be8183
5 changed files with 65 additions and 18 deletions


@@ -107,7 +107,7 @@ class Stats:
         with immediate_atomic(using="snappea"):  # explicit is better than impl.; and we combine read/write here
             # having stats is great, but I don't want to hog task-processing too long (which would happen
             # precisely when the backlog grows large)
-            with different_runtime_limit(0.1):
+            with different_runtime_limit(0.1, using="snappea"):
                 try:
                     task_counts = Task.objects.values("task_name").annotate(count=Count("task_name"))
                 except OperationalError as e: