All living databases has its duplicates. Depending on their nature or number, it may be interesting to dedupe them quickly. For this goal SQL is our friend.

Below a simple example in order to describe the main concepts of the method. Then you will be able to do this in more complex situations. Because the final formula is a bit long, we will proceed iteratively, to understand the different stages.

Import the attached SQL table into a database.

In this anonymized file we can easily view and count duplicates with a GROUP BY:

SELECT firstname, lastname, email, COUNT(id) AS total
FROM my_users
GROUP BY firstname, lastname, email
HAVING total > 1
ORDER BY total DESC ;
 firstname lastname email total 
 David ROSS Cette adresse e-mail est protégée contre les robots spammeurs. Vous devez activer le JavaScript pour la visualiser. 6
 Jay SHORR Cette adresse e-mail est protégée contre les robots spammeurs. Vous devez activer le JavaScript pour la visualiser. 6
Allan KOK Cette adresse e-mail est protégée contre les robots spammeurs. Vous devez activer le JavaScript pour la visualiser. 5
... ... ... ...


This total field counts the duplicates by each record using the unique identifiers (id field). So SQL is able to group the identifiers from duplicates.

Well, now imagine we would able to list these identifiers, to see them, and give them an order of importance. We would then be able to say which ones should be deleted, and which ones should be kept.

Let's start by listing these identifiers beside our duplicates, with a GROUP_CONCAT instruction. It will groups our ids textually, concatenated:

SELECT firstname, lastname, email, COUNT(id) AS total,
GROUP_CONCAT(id) AS all_id_related
FROM my_users
GROUP BY firstname, lastname, email
HAVING total > 1
ORDER BY total DESC ;
 firstname lastname email total  all_id_related
 David ROSS Cette adresse e-mail est protégée contre les robots spammeurs. Vous devez activer le JavaScript pour la visualiser. 6 440,441,442,443,444,445
 Jay SHORR Cette adresse e-mail est protégée contre les robots spammeurs. Vous devez activer le JavaScript pour la visualiser. 6 867,868,869,870,871,872
Allan KOK Cette adresse e-mail est protégée contre les robots spammeurs. Vous devez activer le JavaScript pour la visualiser. 5 71,72,73,74,75
... ... ... ... ...

 

OK, now assume that the last record in a series of duplicates is the one to keep (typically for a users table, the last record from a same user will be the user's reference record, with the most up-to-date information). We keep here the highest id from a duplicate series.

Let's start by just displaying it beside our duplicates, thanks to a MAX statement. With an ORDER BY in the GROUP_CONCAT.

SELECT firstname, lastname, email, COUNT(id) AS total,
GROUP_CONCAT(id ORDER BY id ASC) AS all_id_related,
MAX(id) AS id_to_keep
FROM my_users
GROUP BY firstname, lastname, email
HAVING total > 1
ORDER BY total DESC ;
 firstname lastname email total  all_id_related id_to_keep
 David ROSS Cette adresse e-mail est protégée contre les robots spammeurs. Vous devez activer le JavaScript pour la visualiser. 6 440,441,442,443,444,445 445
 Jay SHORR Cette adresse e-mail est protégée contre les robots spammeurs. Vous devez activer le JavaScript pour la visualiser. 6 867,868,869,870,871,872 872
Allan KOK Cette adresse e-mail est protégée contre les robots spammeurs. Vous devez activer le JavaScript pour la visualiser. 5 71,72,73,74,75 75
... ... ... ... ... ...

 

Well, now let's try to separate our id_to_keep from other id_related, with a simple SQL REPLACE above the GROUP_CONCAT:

SELECT firstname, lastname, email, COUNT(id) AS total,
REPLACE(GROUP_CONCAT(id ORDER BY id ASC), MAX(id), '') AS id_to_delete,
MAX(id) AS id_to_keep
FROM my_users
GROUP BY firstname, lastname, email
HAVING total > 1
ORDER BY total DESC ;
 firstname lastname email total  id_to_delete id_to_keep
 David ROSS Cette adresse e-mail est protégée contre les robots spammeurs. Vous devez activer le JavaScript pour la visualiser. 6 440,441,442,443,444, 445
 Jay SHORR Cette adresse e-mail est protégée contre les robots spammeurs. Vous devez activer le JavaScript pour la visualiser. 6 867,868,869,870,871, 872
Allan KOK Cette adresse e-mail est protégée contre les robots spammeurs. Vous devez activer le JavaScript pour la visualiser. 5 71,72,73,74, 75
... ... ... ... ... ...

 

Caution: in this simple case we keep the MAX id record. This protects us from errors if the sequence of number composing the id_to_keep was present in the id_to_delete (the REPLACE could cause errors).

OK we progress. Remove now the last comma, if not it will bother us for the following of the operations. This with a SUBSTRING and its CHAR_LENGTH:

SELECT firstname, lastname, email, COUNT(id) AS total,
SUBSTR(REPLACE(GROUP_CONCAT(id ORDER BY id ASC), MAX(id), ''), 1,
CHAR_LENGTH(REPLACE(GROUP_CONCAT(id ORDER BY id ASC), MAX(id) , '')) - 1) AS id_to_delete,
MAX(id) AS id_to_keep
FROM my_users
GROUP BY firstname, lastname, email
HAVING total > 1
ORDER BY total DESC ;
 firstname lastname email total  id_to_delete id_to_keep
 David ROSS Cette adresse e-mail est protégée contre les robots spammeurs. Vous devez activer le JavaScript pour la visualiser. 6 440,441,442,443,444 445
 Jay SHORR Cette adresse e-mail est protégée contre les robots spammeurs. Vous devez activer le JavaScript pour la visualiser. 6 867,868,869,870,871 872
Allan KOK Cette adresse e-mail est protégée contre les robots spammeurs. Vous devez activer le JavaScript pour la visualiser. 5 71,72,73,74 75
... ... ... ... ... ...

 

Great! All that remains is to generate DELETE queries with an IN clause using our id_to_delete:

SELECT COUNT(id) AS total,
CONCAT('DELETE FROM my_users WHERE id IN (',
SUBSTR(REPLACE(GROUP_CONCAT(id ORDER BY id ASC), MAX(id), ''), 1,
CHAR_LENGTH(REPLACE(GROUP_CONCAT(id ORDER BY id ASC), MAX(id) , '')) - 1),
') ;') AS delete_query
FROM my_users
GROUP BY firstname, lastname, email
HAVING total > 1
ORDER BY total DESC ;

Working with MySQL, we are forced to display a HAVING in a GROUP BY ... Rrrrrr!

No matter, use a subquery to make it disappear from our results:

SELECT delete_query FROM (
SELECT COUNT(id) AS total,
CONCAT('DELETE FROM my_users WHERE id IN (',
SUBSTR(REPLACE(GROUP_CONCAT(id ORDER BY id ASC), MAX(id), ''), 1,
CHAR_LENGTH(REPLACE(GROUP_CONCAT(id ORDER BY id ASC), MAX(id) , '')) - 1),
') ;') AS delete_query
FROM my_users
GROUP BY firstname, lastname, email
HAVING total > 1
ORDER BY total DESC) AS x ;
 delete_query
 DELETE FROM my_users WHERE id IN (440,441,442,443,444) ;
 DELETE FROM my_users WHERE id IN (867,868,869,870,871) ;
 DELETE FROM my_users WHERE id IN (71,72,73,74) ;
 ...

 

Recover your results. Separated by a semicolon, it forms a SQL script that just needs to be executed.

DELETE FROM my_users WHERE id IN (440,441,442,443,444) ;
DELETE FROM my_users WHERE id IN (867,868,869,870,871) ;
DELETE FROM my_users WHERE id IN (71,72,73,74) ;
...

Perfect! You can now focus you on more important things ...


Note that with a previous standardization on a temporary table (correction, harmonisation...), with the exact same ids of course, your final script will be still more efficient (finding still more duplicates).

But our example just eliminates duplicates. Indeed it was only perfect duplicates (exact same name, first name and email).

The next step will be to merge the duplicates. That is, to repatriate the richest information on a single record.

This in a relational database, with potentially phones or addresses located in joined tables. We will see that later ...

Liens ou pièces jointes
Télécharger ce fichier (DuplicatesDemo.sql)Test table[ ]104 Ko