When searching by tag, there is no need to join with Event; especially when
just counting results or determining first/last digest_order (for navigation).
(For the above "no need" to actually hold, digest_order was denormalized
into EventTag.)
The above is implemented in `search_events_optimized`.
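The idea can be sketched at the SQL level with an in-memory sqlite3 database (table and column names are simplified here, not Bugsink's actual schema): because digest_order lives on the tag row itself, counting and first/last can be answered from a single table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE event (id INTEGER PRIMARY KEY, digest_order INTEGER);
    -- digest_order is denormalized onto the tag table, so tag searches
    -- never need to touch event:
    CREATE TABLE event_tag (event_id INTEGER, tag TEXT, digest_order INTEGER);
    CREATE INDEX event_tag_idx ON event_tag (tag, digest_order);
""")
for i in range(100):
    conn.execute("INSERT INTO event VALUES (?, ?)", (i, i))
    conn.execute("INSERT INTO event_tag VALUES (?, ?, ?)",
                 (i, "env:prod" if i % 2 == 0 else "env:dev", i))

# count and first digest_order, straight from event_tag (no join):
(count,) = conn.execute(
    "SELECT COUNT(*) FROM event_tag WHERE tag = ?", ("env:prod",)).fetchone()
(first,) = conn.execute(
    "SELECT digest_order FROM event_tag WHERE tag = ? "
    "ORDER BY digest_order LIMIT 1", ("env:prod",)).fetchone()
print(count, first)  # 50 0
```

Both queries are fully served by the (tag, digest_order) index.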
Further improvements:
* the bounds of `digest_order` are fetched only once; for first/last this info
is reused.
* explicitly pass `event_qs_count` to the templates
* non-event pages used to calculate a "last event" to generate a tab with a
correct event.id; since we now simply have the "last" idiom, we use that
instead. This also makes clear that the "none" idiom was never needed, so we
remove it again.
Results:
Locally (60K event DB, 30K events on largest issue) my testbatch now
runs in 25% of time (overall).
* The effect on AND-ed tag searches is in fact very large (13% of runtime
remaining).
* The event details page is not noticeably improved.
* denormalize IssueTag.key; this allows key to be used in an index
(issue, key, count).
* rewrite to grouping-first, per-key-query-second, i.e. revert part of
bbfee84c6a. Reasoning: I don't want to rely on "mostly unique" always
guessing correctly, and we don't dynamically determine that yet. This
means that (in the single-query version), if some tag had a per-event
value, you could end up iterating over as many values as there are events,
which won't work.
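A minimal sketch of "grouping-first, per-key-query-second" with sqlite3 (schema simplified and hypothetical): even when a tag has a per-event value, step 2 is bounded by its LIMIT, so we never iterate over all values of such a tag.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE issue_tag (
        issue_id INTEGER, key TEXT, value TEXT, count INTEGER);
    -- key is denormalized so it can appear in the index:
    CREATE INDEX issue_tag_idx ON issue_tag (issue_id, key, count);
""")
rows = [(1, "browser", "chrome", 30), (1, "browser", "firefox", 10)]
# a "per-event" tag: as many distinct values as there are events
rows += [(1, "trace_id", f"trace-{i}", 1) for i in range(1000)]
conn.executemany("INSERT INTO issue_tag VALUES (?, ?, ?, ?)", rows)

# step 1: grouping-first -- one row per key, never one per value
keys = [k for (k,) in conn.execute(
    "SELECT DISTINCT key FROM issue_tag WHERE issue_id = ?", (1,))]

# step 2: per-key query, bounded by LIMIT regardless of value cardinality
top = {key: conn.execute(
    "SELECT value FROM issue_tag WHERE issue_id = ? AND key = ? "
    "ORDER BY count DESC LIMIT 3", (1, key)).fetchall()
    for key in keys}
print(top["browser"][0])  # ('chrome',)
```

The per-key queries cost one roundtrip each, but the number of keys is small and bounded in practice, unlike the number of values.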
* in tags.py, do the tab-check first to avoid doing the tag-calculation twice.
* further denormalization (of key__key, of value__str) actually turns out not
to be required for both the grouping and individual queries to be fast.
Performance tests, as always, against sqlite3.
--
Roads not taken/background
* This commit removes a future TODO that "A point _could_ be made for
['issue', '?value?' 'count']", I tried both versions of that index
(against the group-then-query version, the only one which I trust)
but without denormalization of key, I could not get it to be fast.
* I thought about a hybrid approach (for those keys with low counts of values
do the single-query thing) but as it stands the extra complexity isn't worth
it.
---
on the 1.2M events, 3 (user defined) tags / event test env this
basically lowers the time from "seconds" to "milliseconds".
Done by denormalizing EventTag.issue, and adding that into an index. Targets:
* get-event-within-query (when it's 'last' or 'first')
* .count (of search query results)
* min/max (for the first/prev/next/last buttons)
(The min/max queries' performance was significantly improved by the addition
of the index, but they were also rewritten into simple SELECTs rather than
MIN/MAX.)
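The rewrite can be illustrated with sqlite3 (schema simplified; column names are illustrative): with an index on (issue_id, digest_order), an ORDER BY ... LIMIT 1 just reads one end of the index range.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE event (
        id INTEGER PRIMARY KEY, issue_id INTEGER, digest_order INTEGER);
    CREATE INDEX event_issue_idx ON event (issue_id, digest_order);
""")
conn.executemany("INSERT INTO event VALUES (?, 1, ?)",
                 [(i, i) for i in range(1000)])

# instead of SELECT MIN(digest_order), MAX(digest_order):
(first,) = conn.execute(
    "SELECT digest_order FROM event WHERE issue_id = ? "
    "ORDER BY digest_order ASC LIMIT 1", (1,)).fetchone()
(last,) = conn.execute(
    "SELECT digest_order FROM event WHERE issue_id = ? "
    "ORDER BY digest_order DESC LIMIT 1", (1,)).fetchone()
print(first, last)  # 0 999
```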
When this code was written, I thought I had spectacularly improved performance.
I now believe this was based on an error in my measurements, but that this
still represents (mostly) an improvement, so I'll let it stand and will take
it from here in subsequent commits.
Prompted by a user being confused about the number of events in their DB;
not 100% sure I'll keep this info here, but I'm introducing it for now at
least.
In b76e474ef1, the event-navigation was changed into the next/prev idiom (I
think completely, i.e. also in the .html files, but I did not check), but the
elif structure and error message did not fully reflect that (they still talked
about digest_order/id, even though nav is now one of the primary methods).
I briefly considered removing the lookup-by-digest-order-only, but I figure it
may come in handy at some point (if only for users to directly edit the URL);
in any case, I did not check whether it is actually unused.
Avoiding any (1406, "Data too long for column ...") on MySQL.
For the 'plainly provided' fields I followed the documented maximums which are
also our DB maximums. For calculated_* I harmonized with what Sentry &
GlitchTip both do (and which was already partially reflected in the code), i.e.
128 and 1024.
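A hedged sketch of the truncation idea (the helper and field names below are hypothetical, not Bugsink's actual code; the 128/1024 limits are the ones mentioned above):

```python
# Illustrative only: clip values to the column's max_length before saving,
# so MySQL never raises (1406, "Data too long for column ...").
MAX_LENGTHS = {
    "calculated_type": 128,    # harmonized with Sentry & GlitchTip
    "calculated_value": 1024,
}

def truncate_for_db(field: str, value: str) -> str:
    return value[:MAX_LENGTHS[field]]

print(len(truncate_for_db("calculated_type", "x" * 500)))  # 128
```

In practice the same effect can also be enforced at the model layer (max_length plus explicit truncation before save).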
"possibly expensive" turned out to be "actually expensive". On 'emu', with 1.5M
events, the counts take 85 and 154 ms for Project and Issue respectively;
bottlenecking our digestion to ~3 events/s.
Note: this is single-issue, single-project (presumably, the cost would be lower
for more spread-out cases)
Note on indexes: Event already has indexes for both Project & Issue (though
only as the first item of a multi-column index). Without checking further:
that appears not to "magically solve counting".
This commit also optimizes the .count() on the issue-detail event list (via
Paginator).
This commit also slightly changes the value passed as `stored_event_count`
(as used by `get_random_irrelevance`) to be the post-eviction value. That
won't matter much in practice, but it is slightly more correct IMHO.
Triggered by issue_event_list being more than 5s on "emu" (my 1,500,000 event
test-machine). Reason: sorting those events on non-indexed field. Switching
to a field-with-index solved it.
I then analysed (grepped) for "ordering" and "order_by" and set indexes
accordingly and more or less indiscriminately (i.e. even on tables that are
assumed to have relatively few rows, such as Project & Team).
## Goal
Reduce the number of migrations for _fresh installs_ of Bugsink. This implies: squash as
broadly as possible.
## How?
"throw-away-and-rerun". In particular, for a given app:
* throw away the migrations from some starting point up until and including the last one.
* run "makemigrations" for that app. Django will see what's missing and just redo it
* rename to 000n_b_squashed or similar.
* manually set a `replaces` list on the migration to the just-removed migrations
* manually check dependencies; check that they are:
* as low as possible, e.g. an FK should only depend on the model's
existence. This reduces the risk of circular dependencies.
* pointing to "original migrations", i.e. not to a just-created squashed migration.
because the squashed migrations "contain a lot" they increase the risk of circular
dependencies.
* restore (git checkout) the thrown-away migration
## Further tips:
* "Some starting point" is often not 0000, but some higher number (see e.g. the outcome
in the present commit). Leaving the migrations for creation of base models (Event,
Issue, Project) in place saves you from a lot of circular dependency problems.
* Move db.sqlite3 out of the way to avoid superfluous warnings.
## RunPython worries
I grepped for RunPython in the replaced migrations, with the following results:
* phonehome's create_installation_id was copied-over to the squashed migration.
* all others were ignored, because:
* they "do something with events", i.e. only when events are present will they have
an effect. This means they are no-ops for _new installs_.
* for existing installs, for any given app, they will only be missed (replaced) when
the first replaced migration is not yet executed.
I used the following command (reading from the bottom) to establish that this means only
people that did a fresh install after 8ad6059722 (June 14, 2024), but before
c01d332e18 (July 16) _and then never did any upgrades_ would be affected. There are no
such people.
    git log --name-only \
        events/migrations/0004_event_irrelevance_for_retention.py \
        issues/migrations/0004_rename_event_count_issue_digested_event_count.py \
        phonehome/migrations/0001_initial.py \
        projects/migrations/0002_initial.py \
        teams/migrations/0001_initial.py
Note that the above observation will still hold for the next squashmigration
(assuming squashing starts at the same starting migrations).
## Cleanup of the replaced migrations
Django says:
> Once you’ve squashed your migration, you should then commit it alongside the
> migrations it replaces and distribute this change to all running instances of your
> application, making sure that they run migrate to store the change in their database.
Given that I'm not in control of all running instances of my application, this means the
cleanup must not happen "too soon", and only after announcing a migration path ("update
to version X before updating to version Y").
## Roads not taken
Q: Why not just use squashmigrations? A: It didn't work reliably (for me),
presumably because of the high number of strongly interdependent apps in
combination with some RunPython.
Seen after I was mostly done, not explored seriously (yet):
* https://github.com/3YOURMIND/django-replace-migrations
* https://pypi.org/project/django-squash/
* https://django-extensions.readthedocs.io/en/latest/delete_squashed_migrations.html