Mastering Data Manipulation with awesome Py Library - dfply

Neeraj Kaushik

Apr 16, 2026, 8:07:53 PM

to dataanalysistraining

Dear Friends

If you are looking for a way to bring the "pipe" syntax of R's dplyr to Python, the dfply library is well worth a look. It lets you write complex data transformations as a clean, readable chain of steps: with the pipe operator >>, the output of one verb such as mutate, filter_by, or summarize is passed as the input to the next. I have summarized some prominent verbs used in the library:

select(): Returns a subset of columns based on their names or indices, discarding the rest.
drop(): Returns the dataframe with the named columns removed, keeping all other columns.
filter_by(): Returns only the rows that evaluate to True for the provided logical expressions.
rename(): Updates the names of specific columns using key-value pairs while leaving all other columns untouched.
arrange(): Sorts the rows of the dataframe in ascending or descending order based on the values within one or more columns.
mutate(): Adds new columns or modifies existing ones by computing values from existing columns, keeping the overall row count the same.
summarize(): Computes summary statistics (like mean or count), returning one consolidated row of results per group.
group_by(): Segments the dataframe into distinct categories so that subsequent operations are applied group-wise rather than to the whole dataset.
ungroup(): Removes the internal grouping structure, returning the dataframe to a standard, unsegmented state.
distinct(): Returns only the unique rows, dropping any duplicates based on all columns or a specified subset.
separate(): Splits a single string-based column into multiple new columns using a specified delimiter or regular expression.
unite(): Combines multiple columns into a single new string column, joining their values together with a specified separator.
sample(): Returns a random subset of rows, determined either by a fixed number of rows or by a fraction of the data.

Here is a quick reference for some of the database-style join functions:

left_join(): Returns all rows from the left data frame, appending matched columns from the right and filling missing matches with NaN (the pandas equivalent of R's NA).
right_join(): Returns all rows from the right data frame, appending matched columns from the left and filling missing matches with NaN.
inner_join(): Returns only the rows that share matching keys in both the left and right data frames, dropping all unmatched rows.
outer_join(): (Referred to as full_join() in R's dplyr) Returns all rows from both data frames, retaining all data and inserting NaN wherever there are missing matches.
anti_join(): Filters the left data frame to return only the rows that do not have a corresponding match in the right data frame.
semi_join(): Filters the left data frame to return only the rows that do have a match in the right data frame, without actually adding any new columns from the right.

I've explained these concepts in the following videos:

Python Data Handling using dfply 1 (10 basic verbs): https://youtu.be/EFtf8nGERTM
Python Data Handling using dfply 2 (Various join verbs): https://youtu.be/NXUKxoxYb2I
Python Data Handling using dfply 3 (Mutate advance options): https://youtu.be/AHgo9M66aJI
Python Data Handling using dfply 4 (Data Cleaning using mutate): https://youtu.be/9jpgltxLqug
Python Data Handling using dfply 5 (Working on practical assignments): https://youtu.be/q3p1Wa-Yi6M

Best wishes

Neeraj
