Deduplication - Clean & Remove Duplicate Data Records
All too often I am asked the same question. How do I remove duplicate records from my file or database? This often comes from potential brokers looking to list their database on our site, and if we find duplication in the data, it is rejected automatically. Then the next question without fail is "How do I do that?" So for the benefit of sharing this information which seems to be in demand (as well as making my job easier by just pointing them to this post for now on) I decided to put this post together. So if you are in need of data DE duplication cleaning and filtering, read on. Based on your data format, I have broken things down by application and database type.
MySQL
If you are working with a MySQL database, then there are several different scenarios we can cover where 'removing duplicates' has a different meaning based on your needs. Are you looking to remove dupe records based on a single field, a number of select fields, or duplicates based on the entire record being unique? The great part about working with a query language, is the power you have, which we can deduplicate based on any scenario that you may need. Those three scenarios will typically cover the needs of most, so I'll cover each:Unique Field DeDuplication
Remove records based upon a single field being unique. For example, you may have a table with multiple customer records, and you want to remove duplicates based on the email address field, so that the value becomes unique and only shows up once at most:Assuming you have a primary key defined on the table, the following will work:
delete from table_name where primary_key_field in (
select primary_key_field from (
select primary_key_field from table_name a group by field_you_want_unique having count(*) > 1
) b
);
If there is no primary key defined on the table:
Were going to add one in so we can get things to work correctly, and remove it at the end. SO the SQL statements needed are exactly as above, but with the addition of two lines, one at the beginning to add a Primary Key field, and another at the end to remove it:
ALTER TABLE table_name ADD `temp_id` int primary key auto_increment;
delete from table_name where `temp_id` in (
select `temp_id` from (
select `temp_id` from table_name a group by field_you_want_unique having count(*) > 1
) b
);
ALTER TABLE table_name DROP `temp_id`;
Multiple Fields As Unique DeDuplication
In the above examples we based the removal on a single field that we wanted unique within the table. In some cases you may be looking to have two or more fields unique in combination. A common example is a database with addresses, lets say you want to make sure that the Address, City and State fields are unique combinations, and only appear once. Just follow the same outline as above for both scenarios (with or with a Primary Key defined) and change the GROUP BY Clause to accommodate the fields you need:With a Primary Key Already, The altered line would be like:
select primary_key_field from table_name a group by field_you_want_unique1, field_you_want_unique2, field_you_want_unique3 having count(*) > 1
And without an existing Primary Index:
select `temp_id` from table_name a group by field_you_want_unique1, field_you_want_unique2, field_you_want_unique3 having count(*) > 1
Full Record De-Dupe based on an entire row as unique
In some cases you may need to remove duplicates based on the entire record, e.g. no matter the number of fields, the values in each field must be unique combination. e.g. a completely duplicated record. (this makes an assumption that there is no auto-increment / primary key defined on the table as that obviously makes every record unique in it's own right)
CREATE TEMPORARY TABLE good_temp AS SELECT DISTINCT * FROM table_name;
DELETE FROM table_name;
INSERT INTO table_name SELECT * FROM good_temp;
Microsoft Excel 2007
Excel 2007 offers up several options for filtering unique values and remove duplicates. The most common is Filtering and Deletion, basically it is a matter of if you want to simple hide the duplicated selections or actually remove them completely. For me, Filtering is a test case and once I have it the way I want, I will delete them completely and save the results as a copy.Filter (Hide or Copy)
- Select the needed range of cells
- Data Tab -> Sort & Filter -> Advanced
- Advanced Filter (select one):
- Clicking Filter the list, in-place to filter the range of cells or the entire table in place (hiding the rows)
- OR - - Copy to another location -> Copy to if you want to copy the results of the filter to another area or worksheet
- Clicking Filter the list, in-place to filter the range of cells or the entire table in place (hiding the rows)
- Unique records only -> OK
- The unique values within the selected range will be copied to the new location
Remove (Delete the duplicate values)
- Select the needed range of cells
- Data Tab -> Data Tools (group) -> Remove Duplicates
- Do what you need here:
- Columns (select one or more columns)
- Select All
- Unselect All
- Click OK
A message will be displayed letting you know how many values were removed as well as the number of uniques that were left. - Click OK
Microsoft Excel 2010
To find unique records & remove duplicate values with Excel 2010 read below or check out a Video- Click the top left cell of the range, and then drag to the bottom right cell
- Data menu -> Filter -> Advanced Filter
- In the dialog, select "Filter the list in place"
- Select "Unique records only", click OK
- The filtered list is then displayed with the duplicated rows hidden
- Edit menu -> Office Clipboard
- The clipboard task panel should then be displayed
- Verify that the filtered list is selected still and click 'Copy'
- The filtered list should be highlighted and the selection should appears as at the top of the Clipboard as an option
- Data menu -> Filter -> Show All
- The original list is displayed again
- Press DELETE
- The original list is then deleted
- Clipboard -> click on the filtered list item
- The filtered list should reappear in the same location like the original list did
Microsoft Access Multiple Versions
I'm not going to reinvent any wheels here, that and to be honest I hate Access and don't use it very often unless. But, Microsoft actually has some decent knowledge base articles on the duplicate removal process with Access Tables if you hunt around for them. I'll just save you the googling time:Resolve and help prevent duplicate data in MS Access 2010
http://office.microsoft.com/en-us/access-help/resolve-and-help-prevent-duplicate-data-HA010341696.aspx
Remove Duplucates In Microsoft Access 2000 using primary keys
http://support.microsoft.com/kb/209183
Remove Duplucates In Microsoft Access 7.0 and 97 using primary keys
http://support.microsoft.com/kb/109329
carrie says:
Jan 31stI get the following error when I try to run the top SQL statement
MySQL said: Documentation
#1093 - You can't specify target table 'users_temp' for update in FROM clause
Scotty G says:
Jul 14thAwesome, thanks for the post, just what I needed. All the other examples I have seen for the full duplicate row removal before were like 50+ lines of statements.
Worked like a charm, ty again
Mathew says:
Jan 21stDeduplication is become the best and most effective way this days for clean or remove duplicate data record. Now, deduplication software is also used for removing duplicate data.