Deduplication - Clean & Remove Duplicate Data Records

All too often I am asked the same question. How do I remove duplicate records from my file or database? This often comes from potential brokers looking to list their database on our site, and if we find duplication in the data, it is rejected automatically. Then the next question without fail is "How do I do that?" So for the benefit of sharing this information which seems to be in demand (as well as making my job easier by just pointing them to this post for now on) I decided to put this post together. So if you are in need of data DE duplication cleaning and filtering, read on. Based on your data format, I have broken things down by application and database type.

MySQL

If you are working with a MySQL database, then there are several different scenarios we can cover where 'removing duplicates' has a different meaning based on your needs. Are you looking to remove dupe records based on a single field, a number of select fields, or duplicates based on the entire record being unique? The great part about working with a query language, is the power you have, which we can deduplicate based on any scenario that you may need. Those three scenarios will typically cover the needs of most, so I'll cover each:

Unique Field DeDuplication

Remove records based upon a single field being unique. For example, you may have a table with multiple customer records, and you want to remove duplicates based on the email address field, so that the value becomes unique and only shows up once at most:

Assuming you have a primary key defined on the table, the following will work:

delete from table_name where primary_key_field in (

  select primary_key_field from (

    select primary_key_field from table_name a group by field_you_want_unique having count(*) > 1

  ) b

);

If there is no primary key defined on the table:

Were going to add one in so we can get things to work correctly, and remove it at the end. SO the SQL statements needed are exactly as above, but with the addition of two lines, one at the beginning to add a Primary Key field, and another at the end to remove it:

ALTER TABLE table_name ADD `temp_id` int primary key auto_increment;

delete from table_name where `temp_id` in (

  select `temp_id` from (

    select `temp_id` from table_name a group by field_you_want_unique having count(*) > 1

  ) b

);

ALTER TABLE table_name DROP `temp_id`;

Multiple Fields As Unique DeDuplication

In the above examples we based the removal on a single field that we wanted unique within the table. In some cases you may be looking to have two or more fields unique in combination. A common example is a database with addresses, lets say you want to make sure that the Address, City and State fields are unique combinations, and only appear once. Just follow the same outline as above for both scenarios (with or with a Primary Key defined) and change the GROUP BY Clause to accommodate the fields you need:

With a Primary Key Already, The altered line would be like:


select primary_key_field from table_name a group by field_you_want_unique1, field_you_want_unique2, field_you_want_unique3 having count(*) > 1

And without an existing Primary Index:


select `temp_id` from table_name a group by field_you_want_unique1, field_you_want_unique2, field_you_want_unique3 having count(*) > 1

Full Record De-Dupe based on an entire row as unique

In some cases you may need to remove duplicates based on the entire record, e.g. no matter the number of fields, the values in each field must be unique combination. e.g. a completely duplicated record. (this makes an assumption that there is no auto-increment / primary key defined on the table as that obviously makes every record unique in it's own right)


CREATE TEMPORARY TABLE good_temp AS SELECT DISTINCT * FROM table_name;

DELETE FROM table_name;

INSERT INTO table_name SELECT * FROM good_temp;

Microsoft Excel 2007

Excel 2007 offers up several options for filtering unique values and remove duplicates. The most common is Filtering and Deletion, basically it is a matter of if you want to simple hide the duplicated selections or actually remove them completely. For me, Filtering is a test case and once I have it the way I want, I will delete them completely and save the results as a copy.

Filter (Hide or Copy)

Select the needed range of cells

Data Tab -> Sort & Filter -> Advanced

Advanced Filter (select one):
- Clicking Filter the list, in-place to filter the range of cells or the entire table in place (hiding the rows)
  - OR -
- Copy to another location -> Copy to if you want to copy the results of the filter to another area or worksheet
Unique records only -> OK
- The unique values within the selected range will be copied to the new location

Remove (Delete the duplicate values)

Select the needed range of cells

Data Tab -> Data Tools (group) -> Remove Duplicates

Do what you need here:
- Columns (select one or more columns)
- Select All
- Unselect All
If you don't want to deduplicate by all columns in each row: Unselect All -> Columns (then just select the columns you need)

Click OK

A message will be displayed letting you know how many values were removed as well as the number of uniques that were left.

Click OK

Microsoft Excel 2010

To find unique records & remove duplicate values with Excel 2010 read below or check out a Video

Click the top left cell of the range, and then drag to the bottom right cell
- Data menu -> Filter -> Advanced Filter
- In the dialog, select "Filter the list in place"
- Select "Unique records only", click OK
- The filtered list is then displayed with the duplicated rows hidden
Edit menu -> Office Clipboard
- The clipboard task panel should then be displayed
Verify that the filtered list is selected still and click 'Copy'
- The filtered list should be highlighted and the selection should appears as at the top of the Clipboard as an option
Data menu -> Filter -> Show All
- The original list is displayed again
Press DELETE
- The original list is then deleted
Clipboard -> click on the filtered list item
- The filtered list should reappear in the same location like the original list did

Microsoft Access Multiple Versions

I'm not going to reinvent any wheels here, that and to be honest I hate Access and don't use it very often unless. But, Microsoft actually has some decent knowledge base articles on the duplicate removal process with Access Tables if you hunt around for them. I'll just save you the googling time:

Oddity Software - Databases, Development and Design

Download Website & Marketing Database Lists

Deduplication - Clean & Remove Duplicate Data Records

MySQL

Unique Field DeDuplication

Assuming you have a primary key defined on the table, the following will work:

If there is no primary key defined on the table:

Multiple Fields As Unique DeDuplication

With a Primary Key Already, The altered line would be like:

And without an existing Primary Index:

Full Record De-Dupe based on an entire row as unique

Microsoft Excel 2007

Filter (Hide or Copy)

Remove (Delete the duplicate values)

Microsoft Excel 2010

Microsoft Access Multiple Versions

Resolve and help prevent duplicate data in MS Access 2010

Remove Duplucates In Microsoft Access 2000 using primary keys

Remove Duplucates In Microsoft Access 7.0 and 97 using primary keys

Mathew says:

carrie says:

Scotty G says:

Leave a Comment

SEARCH DATABASES

Executive Contacts Marketing Database

RECENT DATA UPDATES