Reverting this change entirely was too slow, so instead change the joins in
the query from inner joins to left joins. With left joins, NULL values get
inserted if any derivations or derivation outputs are missing, which should
cause an error rather than silently skipping the affected derivation inputs.
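
As a rough illustration of the difference between the two join types, here is
a minimal sketch using Python's sqlite3 module in place of PostgreSQL. The
schema, the pending_inputs staging table and the function names are all
hypothetical and much simpler than the real tables; the sketch also assumes
the target column has a NOT NULL constraint, which is what turns the NULL from
the left join into an error.

```python
import sqlite3


def make_db():
    # Hypothetical, simplified schema; not the real Guix Data Service tables.
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE derivations (
            id INTEGER PRIMARY KEY,
            file_name TEXT UNIQUE NOT NULL
        );
        CREATE TABLE derivation_inputs (
            derivation_id INTEGER NOT NULL,
            input_file_name TEXT NOT NULL
        );
        CREATE TABLE pending_inputs (
            derivation_file_name TEXT NOT NULL,
            input_file_name TEXT NOT NULL
        );
        INSERT INTO derivations (file_name) VALUES ('a.drv');
        -- 'missing.drv' has no row in derivations.
        INSERT INTO pending_inputs VALUES
            ('a.drv', 'b.drv'),
            ('missing.drv', 'c.drv');
    """)
    return conn


def insert_inputs(conn, join_type):
    """Insert the pending inputs and return the resulting row count."""
    if join_type not in ("INNER", "LEFT"):
        raise ValueError("join_type must be 'INNER' or 'LEFT'")
    conn.execute(f"""
        INSERT INTO derivation_inputs (derivation_id, input_file_name)
        SELECT derivations.id, pending_inputs.input_file_name
        FROM pending_inputs
        {join_type} JOIN derivations
          ON derivations.file_name = pending_inputs.derivation_file_name
    """)
    return conn.execute(
        "SELECT count(*) FROM derivation_inputs").fetchone()[0]


if __name__ == "__main__":
    # The inner join silently drops the row for the missing derivation.
    print("inner join inserted:", insert_inputs(make_db(), "INNER"))

    # The left join yields a NULL derivation_id, which violates NOT NULL.
    try:
        insert_inputs(make_db(), "LEFT")
    except sqlite3.IntegrityError as e:
        print("left join failed:", e)
```
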
This reverts commit edeb89e0cf.
I'm concerned that this approach is more error-prone and won't raise an error
if there are issues with the data in the database.
This reverts commit 3081887b90.
These changes were motivated by switching to a mechanism of loading data that
isn't dependent on the big advisory lock that prevents more than one revision
from being processed at a time.
Since INSERT ... RETURNING id; is used, this can block if another transaction
inserts the same data, and then cause an error when that transaction
commits. The solution is to use ON CONFLICT DO NOTHING, but you have to handle
the case when the INSERT doesn't return an id since the other transaction has
inserted it.
This commit rewrites insert-missing-data-and-return-all-ids to work as
described above. The rewrite also detects existing data more efficiently and
makes more use of vectors. Other utilities for inserting data are added as well.
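
The conflict handling described above can be sketched roughly as follows. This
is not the actual insert-missing-data-and-return-all-ids implementation: it
uses Python's sqlite3 module rather than PostgreSQL, the items table and the
function name are hypothetical, and a row that already exists stands in for
one inserted by another transaction. Instead of RETURNING, the sketch checks
the cursor's rowcount to tell whether the INSERT actually added a row.

```python
import sqlite3


def insert_missing_and_return_ids(conn, names):
    """Return the id for each name, inserting rows that don't exist yet.

    Hypothetical helper operating on a hypothetical table:
      items (id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL)
    """
    ids = {}
    for name in names:
        if name in ids:
            continue
        cur = conn.execute(
            "INSERT INTO items (name) VALUES (?) "
            "ON CONFLICT (name) DO NOTHING",
            (name,),
        )
        if cur.rowcount == 1:
            # The row was inserted here, so the new id is available.
            ids[name] = cur.lastrowid
        else:
            # The INSERT did nothing because the row already exists
            # (for example, inserted elsewhere), so look the id up.
            row = conn.execute(
                "SELECT id FROM items WHERE name = ?", (name,)
            ).fetchone()
            ids[name] = row[0]
    return [ids[name] for name in names]


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL)"
    )
    conn.execute("INSERT INTO items (name) VALUES ('existing')")
    print(insert_missing_and_return_ids(conn, ["new", "existing", "new"]))
```
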
This should help with query performance, as the recursive queries using
derivation_inputs and derivation_outputs are particularly sensitive to the
n_distinct values for these tables.
This means you can query for derivations where builds exist or don't exist on
a given build server.
I think this will come in useful when submitting builds from a Guix Data
Service instance.
Switch from using a recursive query to doing a breadth-first search through the
graph of derivations, as I think PostgreSQL wasn't doing a great job of
planning the recursive queries (it would overestimate the rows involved, and
prefer sequential scans for the derivation_outputs table).
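
A minimal sketch of a breadth-first search over a graph of derivations, in
Python rather than the project's own code. The direct_inputs callback is a
hypothetical stand-in for a simple, non-recursive lookup of one derivation's
inputs; here it is backed by an in-memory dictionary instead of a database.

```python
from collections import deque


def all_inputs_breadth_first(direct_inputs, start):
    """Return every derivation reachable from start, in breadth-first order.

    direct_inputs is a hypothetical callback taking a derivation and
    returning its direct inputs (standing in for a non-recursive query).
    """
    seen = {start}
    queue = deque([start])
    found = []
    while queue:
        derivation = queue.popleft()
        for inp in direct_inputs(derivation):
            # Visit each derivation only once, even if it is shared.
            if inp not in seen:
                seen.add(inp)
                found.append(inp)
                queue.append(inp)
    return found


if __name__ == "__main__":
    # Hypothetical graph: a.drv depends on b.drv and c.drv, which share d.drv.
    graph = {
        "a.drv": ["b.drv", "c.drv"],
        "b.drv": ["d.drv"],
        "c.drv": ["d.drv"],
        "d.drv": [],
    }
    print(all_inputs_breadth_first(lambda drv: graph[drv], "a.drv"))
```
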
Previously it would compute a long list of strings, potentially more than
100,000 elements long, then split this list up and insert it in chunks. Only
then could the memory be freed.
This new approach builds the strings in batches for the insertion query, then
moves on to the next batch. This should mean that more memory can be freed and
reused along the way.
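
The batching idea can be sketched like this, again in Python with sqlite3 as a
stand-in. The file_names table, the function names and the batch size are
assumptions for illustration only. The point is that the query string and
parameters are built for one batch at a time from a lazy source, so nothing
the size of the whole input needs to be held in memory.

```python
import sqlite3
from itertools import islice


def batches(iterable, size):
    """Yield lists of at most size items, consuming the iterable lazily."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def insert_in_batches(conn, names, batch_size=500):
    """Insert names into the hypothetical file_names table batch by batch."""
    for chunk in batches(names, batch_size):
        # Only this batch's placeholders and values exist at this point.
        placeholders = ", ".join("(?)" for _ in chunk)
        conn.execute(
            f"INSERT INTO file_names (name) VALUES {placeholders}", chunk
        )


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE file_names (name TEXT NOT NULL)")
    # A generator, so the full list of strings is never materialised.
    insert_in_batches(conn, (f"/gnu/store/item-{i}" for i in range(1234)))
    print(conn.execute("SELECT count(*) FROM file_names").fetchone()[0])
```
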