Microsoft Query Remove Duplicates


Madison Rapelyea

Aug 3, 2024, 3:47:07 PM

Power Query is case-sensitive. When working with duplicate values, Power Query considers the case of the text, which might lead to undesired results. As a workaround, users can apply an uppercase or lowercase transform prior to removing duplicates.
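The workaround above can be sketched in plain Python (not M; the M equivalent would use Text.Lower before Table.Distinct), just to illustrate why normalizing case first changes which rows count as duplicates:

```python
# Power Query's Remove Duplicates compares text case-sensitively, so
# "Alice" and "ALICE" survive as two rows. Lowercasing before de-duplicating
# merges them. Sample names are illustrative.
names = ["Alice", "ALICE", "Bob", "bob", "Carol"]

# Case-sensitive dedup (Power Query's default behavior):
seen, case_sensitive = set(), []
for n in names:
    if n not in seen:
        seen.add(n)
        case_sensitive.append(n)

# Workaround: normalize case first, then dedup:
seen, case_insensitive = set(), []
for n in names:
    key = n.lower()
    if key not in seen:
        seen.add(key)
        case_insensitive.append(n)

print(case_sensitive)    # ['Alice', 'ALICE', 'Bob', 'bob', 'Carol']
print(case_insensitive)  # ['Alice', 'Bob', 'Carol']
```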

You have four rows that are duplicates. Your goal is to remove those duplicate rows so there are only unique rows in your table. Select all columns from your table, and then select Remove duplicates.

You have four rows that are duplicates. Your goal in this example is to keep only the rows that are duplicated in your table. Select all the columns in your table, and then select Keep duplicates.

In this example, you have multiple duplicates and you want to keep only those duplicates from your table. To keep duplicates from the id column, select the id column, and then select Keep duplicates.
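The "Keep duplicates on one column" operation amounts to retaining only rows whose id occurs more than once. A minimal Python sketch of that logic, with made-up sample rows:

```python
# Keep only rows whose "id" value appears more than once in the table.
from collections import Counter

rows = [
    {"id": 1, "name": "a"},
    {"id": 2, "name": "b"},
    {"id": 1, "name": "c"},
    {"id": 3, "name": "d"},
]
counts = Counter(r["id"] for r in rows)
kept = [r for r in rows if counts[r["id"]] > 1]
print(kept)  # both id=1 rows survive; ids 2 and 3 are dropped
```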

I'm running the following script to view USB file activity, but it produces duplicate filenames that I'm having trouble removing. I have tried playing with summarize arg_max() but had no luck, as it doesn't work with the FileName field. MountTime and Timestamp are fields I want to avoid.

I believe you may need to use MountTime in an arg_max() for this. arg_max() removes duplicates, but unfortunately it needs a numeric or datetime column to maximize over.

The arg_max() below would work; it will show the last MountTime for each FileName going to a specific USB device. You may still see duplicate FileNames, but that's because the file is being saved to multiple USB devices.
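The arg_max()-over-MountTime idea can be sketched in Python (the real query would be KQL's summarize arg_max(MountTime, *) by FileName, DeviceId; grouping by device here is an assumption, and the field values are made up):

```python
# For each (FileName, DeviceId) group, keep the row with the latest MountTime.
from datetime import datetime

events = [
    {"FileName": "a.docx", "DeviceId": "usb1", "MountTime": datetime(2024, 8, 1)},
    {"FileName": "a.docx", "DeviceId": "usb1", "MountTime": datetime(2024, 8, 3)},
    {"FileName": "a.docx", "DeviceId": "usb2", "MountTime": datetime(2024, 8, 2)},
]

latest = {}
for e in events:
    key = (e["FileName"], e["DeviceId"])
    if key not in latest or e["MountTime"] > latest[key]["MountTime"]:
        latest[key] = e

# One row per (FileName, DeviceId). a.docx still appears twice because it
# was written to two different USB devices, as the reply notes.
print(list(latest.values()))
```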

So I'm trying to get some creation events for App Services, though there seem to be multiple entries for the same app. I'm therefore trying to find a way to remove duplicates on a column but retain the rest of the columns in the output (or a defined set of columns). However, after doing distinct on a specific column, only that column is retained in the output.

Hi. There are some basics you need to understand about Log Analytics. Data in Log Analytics is stored with different timestamps (the TimeGenerated column). So basically, when a solution sends new data, it does not delete the old data; it just writes a new record with a new date. That way you can view the data historically. With that said, if you want to get the latest record for something, there is a function for that: arg_max(). You can read about it here:

So my original query to get information about deleted App Services returns multiple events for the deletion of the same app, though these share the same CorrelationId. With my limited knowledge of the query language, I was trying to return only one record per CorrelationId while still getting all the columns the event contains.

If one ignores time for the moment, this almost gets me there, but it only retains CorrelationId in the result output, and I need to get all the columns (or use project to choose the ones I need).

Hmm, the query you've posted last is different from the one I posted: it does not contain arg_max(). arg_max() retains all columns. If you do not want to use arg_max(), then when you get two results, what is the difference between them? Which of the two do you want displayed?
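The point that arg_max() retains all columns can be illustrated with a Python sketch (the KQL form would be summarize arg_max(TimeGenerated, *) by CorrelationId; the column names besides CorrelationId and TimeGenerated are invented for the example):

```python
# For each CorrelationId, keep the single latest row with every column intact.
from datetime import datetime

events = [
    {"CorrelationId": "c1", "TimeGenerated": datetime(2024, 8, 1, 9),
     "Caller": "userA", "Status": "Started"},
    {"CorrelationId": "c1", "TimeGenerated": datetime(2024, 8, 1, 10),
     "Caller": "userA", "Status": "Succeeded"},
    {"CorrelationId": "c2", "TimeGenerated": datetime(2024, 8, 2, 9),
     "Caller": "userB", "Status": "Succeeded"},
]

latest = {}
for e in events:
    cid = e["CorrelationId"]
    if cid not in latest or e["TimeGenerated"] > latest[cid]["TimeGenerated"]:
        latest[cid] = e

# One full row per CorrelationId, all columns retained:
print(list(latest.values()))
```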

I have a table that has ID and date columns. I want to remove duplicates and keep only the most recent record for each ID. I thought I'd be able to do this simply by sorting ascending or descending and then removing duplicates, but it doesn't keep the first or last record of the sort. It only keeps the first record in the source dataset regardless of sort order. Is there a way around this?

Do check this video from curbal.com. It beautifully explains why changing the sort order in Power Query doesn't work for removing duplicates, and how to use Table.Buffer to keep the new sort order in memory.

The underlying idea in both cases is to prevent the query from using the original cached table in memory (since it's not in the order we want). It's conceivable that in some cases the internal engine might fold in the index step without reloading the table, which would somewhat defeat the purpose of adding that step. Table.Buffer explicitly reloads the table into memory, so I'd guess you have fewer edge cases to worry about.

I have run into this issue many times and wanted to see if anyone has any feedback. In this specific case I have 2 lists of user names. I append the user names into a new query to get a single list of user names. I then TRIM the text to remove any spacing. I then select Remove Duplicates in the Query Editor. I close and apply my changes. When I try to create a relationship between my list of user names and other tables, I get a many-to-many notification. I then pull all user names, do a count, and find a single user name is listed twice even after removing duplicates and trimming the text. Has anyone else had this, and am I missing an obvious reason for my issue, i.e. user error?

Just to elaborate on this, as it's an issue I have run into before as well: as previously said, Power Query is case-sensitive for values whilst DAX is not. So when you use Remove Duplicates in Power Query, it will not consider two values in different cases to be duplicates, but when you then create the relationship in DAX, it will consider them to be duplicates, hence the many-to-many.

If you have to use the text field as a key for a relationship, trim as you already have, but also force the case before de-duping; that will give you a result more like you are expecting and the relationship you want in DAX. And as per az38's post, do the same thing prior to doing your merges in Power Query as well.

@az38 thanks for your information! Out of curiosity, is it the same logic when using Merge Queries? For instance, if I have user names "username" and "Username", will they merge, or will they not merge because they are considered unique?

When running Remove Duplicates in Power Query, does it leave the 1st instance alone and then delete any following duplicates? E.g. if there were duplicates on rows 10, 11 and 12, would it delete rows 11 & 12? Is there documentation on this somewhere?

As far as I am aware, Remove Duplicates removes items based on the order the data was initially loaded into Power Query. Any sorting or other operations you have performed after the data is loaded are not factored in. So the duplicate items on rows 11 and 12 would be removed in your example, even if you had sorted the data so that the items on rows 11 and 12 were above the item on row 10.

It is possible to make Remove Duplicates follow the current sort order if you use the function Table.Buffer() on the data before using the remove-duplicates function in PQ (the actual function it runs is Table.Distinct()). This is because Table.Buffer() loads the table, in the state it is in when called, into memory, and this resets the "load" order that Table.Distinct() uses when removing duplicates.
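The load-order behavior described above can be sketched in Python: a stable "first row wins" dedup keeps whichever row it scans first, so sorting only changes the outcome if the dedup actually scans the sorted (buffered) order. The sample rows are made up.

```python
# (key, value) rows in their original load order.
rows = [("A", 1), ("A", 3), ("B", 2)]

def dedup_first_wins(table):
    """Keep the first row per key in the order the table is scanned,
    analogous to Table.Distinct over the cached load order."""
    seen, out = set(), []
    for key, val in table:
        if key not in seen:
            seen.add(key)
            out.append((key, val))
    return out

# Dedup over the original load order keeps ("A", 1):
print(dedup_first_wins(rows))

# Materializing the sorted table first (the role Table.Buffer plays) makes
# the dedup scan the sorted order, so ("A", 3) survives instead:
buffered = sorted(rows, key=lambda r: r[1], reverse=True)
print(dedup_first_wins(buffered))
```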

I have several spreadsheets with information I need to consolidate, each with different (and sometimes overlapping) portions of the final data set I'm hoping to create. Typically, combining spreadsheets with similar data is easy to do by dumping all the data into one sheet and using Remove Duplicates. However, I want to make sure that the line items which are left represent the latest version of the duplicates.

This array formula simply performs the equivalent of an SQL WHERE clause with two conditions: for all rows that match both the UID (A column) and DATE (B column), return the row's value in the C column (STATUS).

The first formula should have been good enough, but since we don't have a way to pull out only the non-null (or non-FALSE) value, and Excel does not have a COALESCE function, we need to resort to a little indirection.
The MATCH formula searches the array returned by the IF (same conditions as above, but it simply returns TRUE on a match) for the first TRUE value. The third parameter, 0, demands an exact match.
This formula simply returns the index of the first (and only) row that matches the previous conditions: matching UID and DATE, where DATE is the maximum date for that UID.

Now it is simple enough to take the index of the matching row from the MATCH and pull out the corresponding STATUS value from the IF array. This returns a single value, your new STATUS, which is guaranteed (if you've done all these steps correctly) to be from the latest date for each UID.
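The end-to-end worksheet logic (max DATE per UID, then the STATUS on that row) can be summarized in a short Python sketch. Column letters follow the thread (A = UID, B = DATE, C = STATUS); the sample rows are invented.

```python
# For each UID, find its maximum DATE and return the STATUS on that row.
from datetime import date

rows = [
    ("u1", date(2024, 1, 1), "open"),
    ("u1", date(2024, 3, 1), "closed"),
    ("u2", date(2024, 2, 1), "open"),
]

latest = {}
for uid, d, status in rows:
    if uid not in latest or d > latest[uid][0]:
        latest[uid] = (d, status)

statuses = {uid: s for uid, (d, s) in latest.items()}
print(statuses)  # {'u1': 'closed', 'u2': 'open'}
```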

@AviD is correct in that Excel isn't a database, but you can import your data into another spreadsheet via a Microsoft Query data source. It's a bit ugly, but it will give you access to a SQL statement, which should enable you to get what you want.

If you're willing to consider a third-party tool, I would recommend BeyondCompare. It's easy to use, it has a very generous trial before buying, fairly low price, and is nice for both comparing and merging many different file types and directories, including Excel. You can copy individual lines from one file to the other.
