Quarantining Duplicates

I’ve run into the following problem multiple times now at different workplaces:

I have a database table where some column has duplicate values across rows, when those values should have been unique. Think of something like an external_id, email, or url. Maybe someone forgot to add a unique index originally, or perhaps uniqueness was only discovered to be a requirement of the problem domain later in the table’s life.

Either way, I have duplicate rows. Let’s say it’s not obvious how to clean things up, but while I figure that out, I at least want to “stop the bleeding” and prevent any new duplicates from being inserted.

Since I’ve been working with Postgres for a while now, I’ve been meaning to try the following plan the next time I’m in this situation.

Step #1: Add New Flag Column

First, add a new column that will serve to mark these existing duplicates as “known”:

ALTER TABLE some_table
ADD COLUMN known_duplicate BOOLEAN
DEFAULT false;

The DEFAULT is important for later, as we’ll never want new rows to be marked as a duplicate, since their uniqueness will be enforced.

Step #2: Backfill the Flag

At this point every row will have known_duplicate set to FALSE. We need to find each set of duplicate rows and set known_duplicate to TRUE on all of them, except for one.
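
To get a sense of the damage first, a query along these lines (using the same placeholder table and column names as the index in step #3) lists each duplicated value and how many rows share it:

-- Values that currently appear on more than one row
SELECT wish_it_was_unique, COUNT(*) AS row_count
FROM some_table
GROUP BY wish_it_was_unique
HAVING COUNT(*) > 1;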

Leaving known_duplicate set to FALSE for one of the rows allows it to participate in the index added in the next step, preventing any new rows from being added to that duplicate set. Deciding which row to pick will be specific to your problem domain, but there’s often an obvious “primary” row that your application already prefers in some way for that duplicated value.
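
As a rough sketch, assuming the table has an id primary key and that the lowest id in each duplicate set should be kept as the “primary” row, the backfill could look something like this:

UPDATE some_table
SET known_duplicate = TRUE
WHERE id NOT IN (
  -- Keep the first row of each set (by id) as the primary
  SELECT MIN(id)
  FROM some_table
  GROUP BY wish_it_was_unique
);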

Step #3: Add the Partial Index

Finally, add the UNIQUE index. But instead of having it cover all rows like you wish it originally did, it’ll need to be a partial index over only the rows that aren’t known duplicates:

CREATE UNIQUE INDEX partial_idx_on_some_table
ON some_table (wish_it_was_unique)
WHERE NOT known_duplicate;

Now your set of duplicates can’t grow any larger, giving you time to slow down and figure out your next steps: how the duplicates got there in the first place (race condition, bad data import, etc.), and how you’ll de-duplicate them.
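
As a quick sanity check (with a made-up value for illustration), trying to insert a row whose value already exists on a non-flagged row should now be rejected:

INSERT INTO some_table (wish_it_was_unique)
VALUES ('already-taken@example.com');
-- ERROR:  duplicate key value violates unique constraint "partial_idx_on_some_table"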

Or at least that’s the idea. Has anyone tried something like this? Is there a better way? If anyone is reading this, I’d love to know.