Quarantining Duplicates
I’ve run into the following problem multiple times now at different workplaces:
I have a database table where some column has rows with duplicate values when they should have been unique. Think of a column like external_id, email, or url.
Maybe someone forgot to add a unique index originally, or perhaps uniqueness was only recognized as a requirement of the problem domain later in the table’s life.
Either way, I have duplicate rows. Let’s say it’s not obvious how to clean things up, but in the meantime, while I figure that out, I at least want to “stop the bleeding” and prevent any new duplicates from being inserted.
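To get a sense of the scale of the problem first, a quick aggregate query will list the offending values. This is just a sketch; some_table and wish_it_was_unique are the placeholder names used throughout, standing in for your table and the column that should have been unique:

-- List each duplicated value and how many rows share it.
SELECT wish_it_was_unique, COUNT(*) AS copies
FROM some_table
GROUP BY wish_it_was_unique
HAVING COUNT(*) > 1
ORDER BY copies DESC;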
Since I’ve been working with Postgres for a while now, I’ve been meaning to try the following plan the next time I’m in this situation.
Step #1: Add New Flag Column
First, add a new column that will serve to mark these existing duplicates as “known”:
ALTER TABLE some_table
ADD COLUMN known_duplicate BOOLEAN
DEFAULT false;
The DEFAULT is important for later, as we’ll never want new rows to be marked as a duplicate, since their uniqueness will be enforced.
Step #2: Backfill the Flag
At this point every row will have known_duplicate set to FALSE. We need to find each set of duplicate rows and set all of their known_duplicate columns to TRUE, except for one of them.
Having known_duplicate set to FALSE for one of the rows allows it to participate in the index added in the next step, preventing new rows from being added to that duplicate set. Deciding which row to keep will be specific to your problem domain, but there will likely be an obvious “primary” row that your application is already preferring in some way for that duplicated value.
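Here’s a sketch of that backfill, assuming the table has an id primary key and that the column in question is the wish_it_was_unique column used in the next step’s index. It ranks the rows within each duplicate set and flags everything except the first-ranked row; swap the ORDER BY for whatever rule your application already uses to pick its preferred row:

-- Flag every row except the "primary" one in each duplicate set.
WITH ranked AS (
  SELECT
    id,
    ROW_NUMBER() OVER (
      PARTITION BY wish_it_was_unique
      ORDER BY id
    ) AS row_in_set
  FROM some_table
)
UPDATE some_table
SET known_duplicate = true
FROM ranked
WHERE some_table.id = ranked.id
  AND ranked.row_in_set > 1;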
Step #3: Add the Partial Index
Finally, add the UNIQUE index, but instead of having it cover all rows like you wish it originally did, it’ll need to be a partial index over the rows that aren’t a known duplicate:
CREATE UNIQUE INDEX partial_idx_on_some_table
ON some_table (wish_it_was_unique)
WHERE NOT known_duplicate;
Now your set of duplicates can’t grow any larger, allowing you to slow down and figure out your next steps: how there were duplicates in the first place (race condition, bad data import, etc.), and how you’ll de-duplicate them.
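As a quick sanity check (assuming wish_it_was_unique is the only column that needs a value), inserting a value that already exists on a non-flagged row should now fail, since new rows get known_duplicate = false from the DEFAULT and are therefore covered by the partial index. Roughly, with the exact wording varying by Postgres version:

INSERT INTO some_table (wish_it_was_unique) VALUES ('already-taken');
-- ERROR:  duplicate key value violates unique constraint "partial_idx_on_some_table"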
Or at least that’s the idea. Has anyone tried something like this? Is there a better way? If anyone is reading this, I’d love to know.