Mastering Data Manipulation with the Python Library dfply

Neeraj Kaushik

Apr 16, 2026, 8:07:53 PM
to dataanalysistraining
Dear Friends

If you want to bring the "pipe" syntax of R's dplyr to Python, the dfply library is well worth a look. It works on pandas DataFrames and makes complex data transformations cleaner and more readable. Using the pipe operator >>, you can chain verbs such as mutate, filter_by, and summarize into a single pipeline. I have summarized some of the most prominent verbs in the library:

  • select(): Returns a subset of columns based on their names or indices, discarding the rest.

  • drop(): Returns the dataframe with only the specifically named columns removed.

  • filter_by(): Returns only the rows that evaluate to True for the provided logical expressions.

  • rename(): Updates the names of specific columns using key-value pairs while leaving all other columns untouched.

  • arrange(): Sorts the rows of the dataframe in ascending or descending order based on the values within one or more columns.

  • mutate(): Adds new columns or modifies existing ones by calculating values row-by-row, keeping the overall row count the same.

  • summarize(): Computes summary statistics (like mean or count), returning one consolidated row of results per group, or a single row if the dataframe is not grouped.

  • group_by(): Segments the dataframe into distinct categories so that subsequent operations are applied group-wise rather than to the whole dataset.

  • ungroup(): Removes the internal grouping structure, returning the dataframe to a standard, unsegmented state.

  • distinct(): Returns only the unique rows, dropping any duplicates based on all columns or a specified subset.

  • separate(): Splits a single string-based column into multiple new columns using a specified delimiter or regular expression.

  • unite(): Combines multiple columns into a single new string column, joining their values together with a specified separator.

  • sample(): Returns a random subset of rows, determined either by a fixed number or a percentage fraction of the data.
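
To make a few of the verbs above concrete, here is a minimal sketch. It is written in plain pandas (which dfply builds on) so that it runs without dfply installed; the comments show the dfply pipeline I would expect to be roughly equivalent, assuming `from dfply import *` has made `X` and the verbs available. The sample data and column names are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data, invented for this illustration.
df = pd.DataFrame({
    "name":  ["Asha", "Ben", "Chen", "Dev", "Ben"],
    "dept":  ["HR", "IT", "IT", "HR", "IT"],
    "score": [72, 85, 91, 64, 85],
})

# Rough dfply equivalent (my understanding of its syntax):
#   df >> distinct() >> filter_by(X.score > 70) >> mutate(curved=X.score + 5)
#      >> select(X.name, X.dept, X.curved) >> arrange(X.curved, ascending=False)
cleaned = (
    df.drop_duplicates()                       # distinct(): drop the repeated Ben row
      .loc[lambda d: d["score"] > 70]          # filter_by(): keep rows where the test is True
      .assign(curved=lambda d: d["score"] + 5) # mutate(): add a column, row count unchanged
      .loc[:, ["name", "dept", "curved"]]      # select(): keep only these columns
      .sort_values("curved", ascending=False)  # arrange(): sort descending
      .reset_index(drop=True)
)

# Rough dfply equivalent:
#   df >> group_by(X.dept) >> summarize(mean_score=X.score.mean(), top_score=X.score.max())
summary = (
    df.groupby("dept", as_index=False)         # group_by(): one group per department
      .agg(mean_score=("score", "mean"),       # summarize(): one row of results per group
           top_score=("score", "max"))
)

print(cleaned)
print(summary)
```

Reading the pandas chain next to the dfply comment shows what each verb does: every step takes a dataframe and returns a new one, which is what makes the >> chaining possible.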

Here is a quick reference for the join functions, which work like database joins:

  • left_join(): Returns all rows from the left data frame, appending matched columns from the right and filling missing matches with NaN (the pandas counterpart of R's NA).

  • right_join(): Returns all rows from the right data frame, appending matched columns from the left and filling missing matches with NaN (the pandas counterpart of R's NA).

  • inner_join(): Returns only the rows that share matching keys in both the left and right data frames, dropping all unmatched rows.

  • outer_join(): (Referred to as full_join() in R's dplyr) Returns all rows from both data frames, retaining all data and inserting NaN (the pandas counterpart of R's NA) wherever there are missing matches.

  • anti_join(): Filters the left data frame to return only the rows that do not have a corresponding match in the right data frame.

  • semi_join(): Filters the left data frame to return only the rows that do have a match in the right data frame, without actually adding any new columns from the right.
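
The difference between the six joins is easiest to see on two tiny tables. The sketch below uses pandas `merge` and `isin` to reproduce the behaviour described above, so it runs without dfply; in dfply I would expect the corresponding calls to look like `left >> left_join(right, by="id")`, and likewise for the other verbs. The two tables are hypothetical.

```python
import pandas as pd

# Hypothetical tables: ids 2 and 3 appear in both, 1 only on the left, 4 only on the right.
left = pd.DataFrame({"id": [1, 2, 3], "name": ["Asha", "Ben", "Chen"]})
right = pd.DataFrame({"id": [2, 3, 4], "city": ["Pune", "Delhi", "Agra"]})

# Mutating joins: columns from both tables end up in the result.
left_j = left.merge(right, on="id", how="left")    # ids 1, 2, 3; city is NaN for id 1
right_j = left.merge(right, on="id", how="right")  # ids 2, 3, 4; name is NaN for id 4
inner_j = left.merge(right, on="id", how="inner")  # ids 2, 3 only
outer_j = left.merge(right, on="id", how="outer")  # ids 1, 2, 3, 4

# Filtering joins: only rows of the left table, with no columns added from the right.
has_match = left["id"].isin(right["id"])
semi_j = left[has_match]    # ids 2, 3 (rows that have a match)
anti_j = left[~has_match]   # id 1 (rows that have no match)

for label, frame in [("left", left_j), ("right", right_j), ("inner", inner_j),
                     ("outer", outer_j), ("semi", semi_j), ("anti", anti_j)]:
    print(label, frame["id"].tolist())
```

Note that semi_join and anti_join never change the columns of the left table; they only decide which of its rows survive.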

I've explained these concepts in the following videos:

Python Data Handling using dfply 1 (10 basic verbs): https://youtu.be/EFtf8nGERTM
Python Data Handling using dfply 2 (Various join verbs): https://youtu.be/NXUKxoxYb2I
Python Data Handling using dfply 3 (Mutate advance options): https://youtu.be/AHgo9M66aJI
Python Data Handling using dfply 4 (Data Cleaning using mutate): https://youtu.be/9jpgltxLqug
Python Data Handling using dfply 5 (Working on practical assignments): https://youtu.be/q3p1Wa-Yi6M

Best wishes
Neeraj 
