Using data frames with inclusion mining

Ryan Wisnesky

May 4, 2026, 8:41:52 PM

to Desbordante Q&A

Hi all, I've successfully run the inclusion mining demo on the associated example CSV files, and it is great. But I can't figure out how to do inclusion mining on a set of data frames directly, that don't come from CSV files (e.g., that might come from SQL). Any help would be appreciated, thanks! -Ryan

zr9ihi

May 5, 2026, 8:42:43 AM

to Desbordante Q&A

Hi Ryan,

Thanks for your question!

At the moment, inclusion mining (and, more generally, the Desbordante project) supports data provided as CSV files or pandas DataFrames (you can find an example in `basic/mining_aind.py`). Direct integration with other data sources is not yet available.

As a workaround, you can prepare your data by converting it into one of the supported formats. For example, you can export your SQL results to CSV or load them into a pandas DataFrame within your workflow.
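As a rough sketch of the CSV route: the standard-library `sqlite3` module stands in for your SQL source here, and the `orders` table and the `export_query_to_csv` helper are made up for illustration. The last line assumes the `(path, separator, has_header)` tuple form that `examples/basic/mining_ind.py` uses to describe CSV tables.

```python
import csv
import os
import sqlite3
import tempfile


def export_query_to_csv(conn, query, path):
    """Write the result of `query` to `path` as a CSV file with a header row."""
    cursor = conn.execute(query)
    # cursor.description holds one tuple per result column; item 0 is its name.
    header = [column[0] for column in cursor.description]
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(cursor)
    return path


# Hypothetical in-memory database standing in for a real SQL source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (orderkey INTEGER, custkey INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 10), (2, 20)])

csv_path = os.path.join(tempfile.mkdtemp(), "orders.csv")
export_query_to_csv(conn, "SELECT * FROM orders ORDER BY orderkey", csv_path)

# Assumed table description: (path, separator, has_header).
table = (csv_path, ",", True)
```

A list of such tuples, one per exported query, could then be passed wherever the examples pass their CSV tables.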

If you have any trouble, feel free to ask!

Alexey Shlyonskikh

May 5, 2026, 10:37:12 AM

to Desbordante Q&A

Hello

DataFrames can be passed as elements of the list for the `tables=` option. E.g., in `examples/basic/mining_ind.py`, if you `import pandas`, you can replace `TABLES = [(f'examples/datasets/ind_datasets/{table_name}.csv', ',', True) for table_name in` with `TABLES = [pandas.read_csv(f'examples/datasets/ind_datasets/{table_name}.csv', sep=',', header=0) for table_name in` and get exactly the same results. In general, wherever a CSV file's parameters are passed in an example, a corresponding DataFrame can be used in its place.

Ryan Wisnesky

May 5, 2026, 11:25:16 AM

to Alexey Shlyonskikh, Desbordante Q&A

This is exactly what I was looking for, thanks! And thanks to everyone who responded, all suggestions were helpful. -Ryan

Ryan Wisnesky

May 11, 2026, 7:47:01 PM

to Alexey Shlyonskikh, Desbordante Q&A

Quick follow-up: I can pass the data frame in instead of a CSV file, but the data frame is missing the table/query name that would identify it. Here’s a full example using Trino and its built-in test database. As you can see, the inclusions say “Pandas dataframe” instead of the CSV file name, making the inclusions ambiguous when column names overlap. Is there a way to set the data frame name? Thanks again! -Ryan.

from sqlalchemy import create_engine
import desbordante
import pandas as pd

engine = create_engine('trino://ry...@nuc.local:8080/tpch/tiny')

df_table1 = pd.read_sql("lineitem", engine)
df_table2 = pd.read_sql("orders", engine)
TABLES = [df_table1, df_table2]

algo = desbordante.ind.algorithms.Default()
algo.load_data(tables=TABLES)
algo.execute()
inds = algo.get_inds()

for ind in inds:
    print(ind)

Prints:

(Pandas dataframe, [orderkey]) -> (Pandas dataframe, [orderkey])
(Pandas dataframe, [suppkey]) -> (Pandas dataframe, [partkey])
(Pandas dataframe, [linenumber]) -> (Pandas dataframe, [orderkey])
(Pandas dataframe, [linenumber]) -> (Pandas dataframe, [partkey])
(Pandas dataframe, [linenumber]) -> (Pandas dataframe, [suppkey])
(Pandas dataframe, [linenumber]) -> (Pandas dataframe, [orderkey])
(Pandas dataframe, [tax]) -> (Pandas dataframe, [discount])
(Pandas dataframe, [linestatus]) -> (Pandas dataframe, [orderstatus])
(Pandas dataframe, [commitdate]) -> (Pandas dataframe, [shipdate])
(Pandas dataframe, [commitdate]) -> (Pandas dataframe, [receiptdate])
(Pandas dataframe, [orderkey]) -> (Pandas dataframe, [orderkey])
(Pandas dataframe, [custkey]) -> (Pandas dataframe, [partkey])

But it’s hard to tell which data frame is which in the printout (the names lineitem and orders have been lost). Thanks again - Ryan

Anton Chizhov

May 12, 2026, 4:04:46 AM

to Desbordante Q&A

At the moment, Desbordante doesn't allow storing a user-defined name for a pandas DataFrame, so you see the default string representation in the output. But there are several ways to solve your problem.

One option is to use `ind.to_short_string()`, which prints table indices and column indices:

for ind in inds:
    print(ind.to_short_string())

Example output:
(0, [2]) -> (1, [0])
(2, [2]) -> (1, [0])
(3, [2]) -> (1, [0])

This at least makes it possible to identify which table is being referenced.
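If you stay with `to_short_string()`, the indices can be turned back into names with a little parsing. This is only a sketch: it assumes the short form always looks like the example output above, `(table, [columns]) -> (table, [columns])`. The contents of `TABLE_NAMES` and `COLUMN_NAMES` and the `resolve_short_ind` helper are hypothetical; in practice the column names would come from each DataFrame's `columns`.

```python
import re

# Hypothetical names, listed in the same order as the tables passed to load_data.
TABLE_NAMES = ["lineitem", "orders"]
COLUMN_NAMES = [
    ["orderkey", "partkey", "suppkey"],      # columns of table 0
    ["orderkey", "custkey", "orderstatus"],  # columns of table 1
]

# Matches one side of a short-form IND, e.g. "(0, [2])" or "(1, [0, 2])".
SIDE_PATTERN = re.compile(r"\((\d+), \[([\d, ]*)\]\)")


def resolve_short_ind(short_ind):
    """Replace table and column indices in a short-form IND with their names."""

    def resolve_side(match):
        table_index = int(match.group(1))
        column_indices = [int(i) for i in match.group(2).split(",") if i.strip()]
        columns = [COLUMN_NAMES[table_index][i] for i in column_indices]
        return f"({TABLE_NAMES[table_index]}, [{', '.join(columns)}])"

    return SIDE_PATTERN.sub(resolve_side, short_ind)


print(resolve_short_ind("(0, [2]) -> (1, [0])"))
# prints: (lineitem, [suppkey]) -> (orders, [orderkey])
```

String parsing like this is brittle if the short format ever changes, so working with the IND object's own fields is the safer route when you can.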

You can also define a custom print method for INDs. IND objects are not just strings - they expose their internal structure.
Each IND has two sides (lhs and rhs), and each side is a column combination containing a table index and column indices.

So you can map those indices back to your table names and column names yourself.

TABLE_NAMES = ["lineitem", "orders"]
TABLES = [pd.read_sql(table_name, engine) for table_name in TABLE_NAMES]

# run algorithm
# ...

def format_cc(cc):
    table = TABLE_NAMES[cc.table_index]
    columns = [
        TABLES[cc.table_index].columns[column_index]
        for column_index in cc.column_indices
    ]
    return f"({table}, {columns})"

for ind in inds:
    print(f"{format_cc(ind.get_lhs())} -> {format_cc(ind.get_rhs())}")
