This reverts course on 4201fbd778, and restores event.schema.json from that
commit. In that commit we said: 'this is not used'. Not true: it's used in a
test, though this test used the validity check to silently skip.
In this commit:
1. Do _not_ just silently skip invalid samples. Since we have a way of properly
validating, let's use that so that we know how useful the samples that we have
actually are.
2. Deal with "_meta", a field that we sometimes see in the "private samples" (data
that ultimately comes from running a somewhat recent python-sdk against my
actual codebase). The need for this was exposed by [1]
3. Add a test for the up-to-date-ness of event.json.schema
4. remove special-cased attribute-checks in `is_valid`; `send_json` was, at the
time, an opportunistic way to just get my hands on some sample data. the
approach at validation reflected that: I just did some tests on the existence
of certain attributes to determine which json files were even events. But in
the end I did a full validation using an API schema, which kinda made the
whole business useless. This commit cleans up the individual checks.
In the previous commit I put the code for a small performance-experiment.
The results are (very) obvious: don't do this. Response times go through
the roof, and more importantly, the server becomes unreliable. Reason:
time-outs caused by waiting for the write-lock.
* issue.event_count to digested_event_count
* event.ingest_order to event.digest_order
* issue.ingest_order to digest_order
This is generally more correct/explicit, and is also in preparation
of doing work on-digest (which may or may not happen)
The direct cause for this was the following observation: there was no mechanism
in place to safeguard counted events across evictions, i.e. the following order
of events was not accounted for:
* ingest/digest a bunch of events (PCs correctly updated)
* eviction (PC still correct)
* server/snappea restart (PC reloaded, but based on new events. not correct).
I though about various approaches to fix this (e.g. snapshotting) but in the end
such approaches added even more complexity to the PC mechanism. I decided to first
check how non-performant the SQL route would be, and this PoC seems to say: just
go SQL.
There's also a small semantic change (probably in the direction of what you'd
expect), namely: the periods are no longer 'calendar' periods.
I _think_ I meant that I had never actually seen those code-paths in action
(i.e. the note was not about automated tests but rather any kind of visual
confirmation that it worked) but I have seen that now
Using a pid-file that's implied by the ingestion directory.
We do this in `get_pc_registry`, i.e. on the first request. This means failure is
in the first request on the 2nd process.
Why not on startup? Because we don't have a configtest or generic on-startup location
(yet). Making _that_ could be another source of fragility, and getting e.g. the nr
of processes might be non-trivial / config-dependent.
Replacing it with passing the thresholds on each call to `inc`.
The event-based approach was broken in a multi-process setup (such as having a separate
gunicorn and snappea), because the unmute events would be registered GUI-side
(gunicorn), and the single process where the counting happened had a different PC
instance.
The solution is to get rid of the event-listener approach, and just make an inventory of
the threshold-checks that need to be done right before each call to `inc`. Because the
calls to `inc` happen in a single process (we [will] enforce this elsewhere) this fixes
the problem.
During refactoring it became clear that this is probably a good idea anyway: many
comments about corner-cases could be removed.
Other things I found:
* The now-removed `_digest_event_python_postprocessing` did more than Python alone (it
also touched the DB for unmutes) so that was probably a separate bug (now fixed).
* In the event-listener-based code, I foresaw the need for `on_become_false` (but did
not use it yet). The idea was probably that this could be useful in the quota setting
(a quota can become unmet after a while) but in fact it isn't useful, because when a
quota becomes unmet you'd still need to check all quota and OR them.
Tests have not been truly refactored (the new architecture probably points to a new
desired set of tests) but rather have been made to run in the simplest way possible.
The shaving-off of queries that's discussed in the comment is no longer
relevant because the associated branch is fully seen as ValidationError
these days.