Using data frames with inclusion mining

Ryan Wisnesky

May 4, 2026, 8:41:52 PM

to Desbordante Q&A

Hi all, I've successfully run the inclusion mining demo on the associated example CSV files, and it is great. But I can't figure out how to do inclusion mining on a set of data frames directly, that don't come from CSV files (e.g., that might come from SQL). Any help would be appreciated, thanks! -Ryan

zr9ihi

May 5, 2026, 8:42:43 AM

to Desbordante Q&A

Hi Ryan,

Thanks for your question!

At the moment, inclusion mining (and, more generally, the Desbordante project) supports data provided as CSV files or pandas DataFrames (you can find an example in `basic/mining_aind.py`). Direct integration with other data sources is not yet available.

As a workaround, you can prepare your data by converting it into one of the supported formats. For example, you can export your SQL results to CSV or load them into a pandas DataFrame within your workflow.
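
As a rough illustration of the DataFrame route, here is a minimal sketch that loads a SQL query result into a pandas DataFrame. It uses Python's built-in `sqlite3` with an in-memory database purely as a stand-in for a real SQL source; the table, columns, and values are made up for the example.

```python
import sqlite3

import pandas as pd

# In-memory SQLite database as a stand-in for a real SQL source (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (orderkey INTEGER, custkey INTEGER);
    INSERT INTO orders VALUES (1, 10), (2, 20), (3, 10);
    """
)

# Materialize the query result as a DataFrame.
orders_df = pd.read_sql_query("SELECT orderkey, custkey FROM orders", conn)
conn.close()

print(orders_df.shape)  # (3, 2)
```

For other databases, the same `pd.read_sql_query` call would typically take a SQLAlchemy engine instead of the `sqlite3` connection.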

If you have any trouble, feel free to ask!

Alexey Shlyonskikh

May 5, 2026, 10:37:12 AM

to Desbordante Q&A

Hello

Dataframes can be passed as elements of the list for the `tables=` option. E.g., in examples/basic/mining_ind.py, if you `import pandas`, you can replace `TABLES = [(f'examples/datasets/ind_datasets/{table_name}.csv', ',', True) for table_name in` with `TABLES = [pandas.read_csv(f'examples/datasets/ind_datasets/{table_name}.csv', sep=',', header=0) for table_name in` and get exactly the same results. In general, wherever a CSV file's parameters are passed in an example, a corresponding DataFrame can be used in its place.

Ryan Wisnesky

May 5, 2026, 11:25:16 AM

to Alexey Shlyonskikh, Desbordante Q&A

This is exactly what I was looking for, thanks! And thanks to everyone who responded, all suggestions were helpful. -Ryan

--
You received this message because you are subscribed to the Google Groups "Desbordante Q&A" group.
To unsubscribe from this group and stop receiving emails from it, send an email to desbordante...@googlegroups.com.
To view this discussion visit https://groups.google.com/d/msgid/desbordante/58f1bcd7-ebe2-4741-bf64-ea1021482944n%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Ryan Wisnesky

May 11, 2026, 7:47:01 PM

to Alexey Shlyonskikh, Desbordante Q&A

Quick follow-up: I can pass the data frame in instead of a CSV file, but the data frame is missing the table/query name that would identify it. Here’s a full example using Trino and its built-in test database. As you can see, the inclusions say “Pandas dataframe” instead of the CSV file name, making the inclusions ambiguous when column names overlap. Is there a way to set the data frame name? Thanks again! -Ryan

from sqlalchemy import create_engine
import desbordante
import pandas as pd

engine = create_engine('trino://ry...@nuc.local:8080/tpch/tiny')

df_table1 = pd.read_sql("lineitem", engine)
df_table2 = pd.read_sql("orders", engine)
TABLES = [df_table1, df_table2]

algo = desbordante.ind.algorithms.Default()
algo.load_data(tables=TABLES)
algo.execute()

inds = algo.get_inds()
for ind in inds:
    print(ind)

Prints:

(Pandas dataframe, [orderkey]) -> (Pandas dataframe, [orderkey])
(Pandas dataframe, [suppkey]) -> (Pandas dataframe, [partkey])
(Pandas dataframe, [linenumber]) -> (Pandas dataframe, [orderkey])
(Pandas dataframe, [linenumber]) -> (Pandas dataframe, [partkey])
(Pandas dataframe, [linenumber]) -> (Pandas dataframe, [suppkey])
(Pandas dataframe, [linenumber]) -> (Pandas dataframe, [orderkey])
(Pandas dataframe, [tax]) -> (Pandas dataframe, [discount])
(Pandas dataframe, [linestatus]) -> (Pandas dataframe, [orderstatus])
(Pandas dataframe, [commitdate]) -> (Pandas dataframe, [shipdate])
(Pandas dataframe, [commitdate]) -> (Pandas dataframe, [receiptdate])
(Pandas dataframe, [orderkey]) -> (Pandas dataframe, [orderkey])
(Pandas dataframe, [custkey]) -> (Pandas dataframe, [partkey])

But it’s hard to tell from the printout which data frame is which (the names lineitem and orders have been lost). Thanks again. -Ryan

Anton Chizhov

May 12, 2026, 4:04:46 AM

to Desbordante Q&A

At the moment, Desbordante doesn't let you assign a user-defined name to a pandas DataFrame, so you see the default string representation in the output. But there are several ways to solve your problem.

One option is to use `ind.to_short_string()`, which prints table indices and column indices:

for ind in inds:
    print(ind.to_short_string())

Example output:
(0, [2]) -> (1, [0])
(2, [2]) -> (1, [0])
(3, [2]) -> (1, [0])

This at least makes it possible to identify which table is being referenced.

You can also define a custom print method for INDs. IND objects are not just strings; they expose their internal structure.
Each IND has two sides (lhs and rhs), and each side is a column combination containing a table index and column indices.

So you can map those indices back to your table names and column names yourself.

TABLE_NAMES = ["lineitem", "orders"]
TABLES = [pd.read_sql(table_name, engine) for table_name in TABLE_NAMES]

# run algorithm
# ...

def format_cc(cc):
    table = TABLE_NAMES[cc.table_index]
    columns = [
        TABLES[cc.table_index].columns[column_index]
        for column_index in cc.column_indices
    ]
    return f"({table}, {columns})"

for ind in inds:
    print(f"{format_cc(ind.get_lhs())} -> {format_cc(ind.get_rhs())}")
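
Another possible workaround, which is plain pandas rather than a Desbordante feature: since the default printout shows column names, every column could be prefixed with its table name before loading. A minimal sketch using `DataFrame.add_prefix`, with small hand-made frames standing in for the real tables (the values are made up):

```python
import pandas as pd

# Tiny stand-ins for the real tables (made-up values).
frames = {
    "lineitem": pd.DataFrame({"orderkey": [1, 1, 2], "linenumber": [1, 2, 1]}),
    "orders": pd.DataFrame({"orderkey": [1, 2, 3], "custkey": [10, 20, 10]}),
}

# Prefix each column with its table name so the name survives in any
# printout that shows column names.
TABLES = [df.add_prefix(f"{name}.") for name, df in frames.items()]

for table in TABLES:
    print(list(table.columns))
```

The renamed frames would then be passed as `tables=TABLES` as before. This is untested against Desbordante itself; it only relies on column names appearing in the printed INDs, as they do in the output quoted earlier in the thread.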

Ryan Wisnesky

Jun 25, 2026, 6:35:25 PM

to desbo...@googlegroups.com

This is all working great, thank you. One question though: is there a way to indicate progress? I’ve been testing on small examples but am about to run on large examples that might take hours and am wondering if there’s a way for desbordante to report a completion time estimate or progress report while it runs. Thanks again!

Alexey Shlyonskikh

Jun 27, 2026, 10:58:22 AM

to Desbordante Q&A

No, sorry, we don't have anything like that. Thanks for asking, we'll try to figure something out. Generally, the running time of these algorithms can be difficult to predict, though I'm not familiar with Spider in particular (which is what I'm assuming you're using).
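
In the meantime, one generic workaround is to run the call in a worker thread and print an elapsed-time heartbeat from the main thread. The sketch below is plain Python, not a Desbordante feature: `run_with_heartbeat` is a hypothetical helper, and `slow_task` stands in for `algo.execute`. Whether the heartbeat actually fires while `execute()` runs depends on the binding releasing the GIL, which is an unverified assumption here. It reports elapsed time only, not a completion estimate.

```python
import threading
import time


def run_with_heartbeat(task, interval=1.0, report=print):
    """Run task() in a worker thread, calling report() every `interval` seconds.

    Hypothetical helper: reports elapsed time only, not actual progress.
    """
    outcome = {}

    def worker():
        try:
            outcome["value"] = task()
        except Exception as exc:  # re-raised in the caller below
            outcome["error"] = exc

    thread = threading.Thread(target=worker, daemon=True)
    start = time.monotonic()
    thread.start()
    while True:
        # Wait up to `interval` seconds for the worker to finish.
        thread.join(timeout=interval)
        if not thread.is_alive():
            break
        report(f"still running, {time.monotonic() - start:.0f}s elapsed")
    if "error" in outcome:
        raise outcome["error"]
    return outcome.get("value")


def slow_task():
    # Stand-in for algo.execute (assumption); sleeps instead of mining.
    time.sleep(0.35)
    return "done"


messages = []
result = run_with_heartbeat(slow_task, interval=0.1, report=messages.append)
print(result, len(messages))
```

With a real run, the call would look like `run_with_heartbeat(algo.execute, interval=60)`.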

Ryan Wisnesky

Jun 29, 2026, 4:18:39 PM

to desbo...@googlegroups.com

Thanks! Something else arising from using this in production: we’re seeing many dependencies arise because columns are degenerate, e.g., every value in a column is “None”. Is there a way to exclude columns from analysis, or should we create a new data frame without the degenerate columns? Perhaps there is some kind of ‘virtual’ data frame defined by a query we could use to avoid data replication if a new frame is needed? Thanks again- Ryan

George Chernishev

Jun 30, 2026, 5:37:08 AM

to Desbordante Q&A

Hi Ryan,

Thanks a lot for your interest in our tool!

> Is there a way to exclude columns from analysis, or should we create a new data frame without the degenerate columns?

Our general philosophy is to keep the core as simple as possible, so it stays easy to maintain. The algorithms are already quite complex, and since we are a small team, we try to avoid adding extra complexity/functionality where we can. So the idea is that users handle any necessary preprocessing on the Python side before running algorithms.

As for your second question, I’m not competent enough in Python to answer it, so let’s wait for my colleagues on that one.

Best regards,
George

Alexey Shlyonskikh

Jul 14, 2026, 5:37:35 AM

to Desbordante Q&A

> Perhaps there is some kind of ‘virtual’ data frame defined by a query we could use to avoid data replication if a new frame is needed?

That looks like it'd be quite complex to implement on our side. Pandas 3.0 enabled copy-on-write by default, so creating a DataFrame with excluded columns should work without extra overhead, since the library does not write to the DataFrames.
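
For the preprocessing step itself, a minimal pandas sketch that drops columns holding at most one distinct value (counting missing values as a value) could look like this. `drop_degenerate_columns` is a hypothetical helper name, and the sample frame is made up.

```python
import pandas as pd


def drop_degenerate_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a new frame without columns that hold at most one distinct value."""
    # dropna=False counts None/NaN as a value, so an all-None column
    # has exactly one distinct value and is treated as degenerate.
    degenerate = [c for c in df.columns if df[c].nunique(dropna=False) <= 1]
    return df.drop(columns=degenerate)


sample = pd.DataFrame(
    {
        "orderkey": [1, 2, 3],
        "comment": [None, None, None],  # degenerate: every value is None
        "status": ["O", "O", "O"],      # degenerate: a single constant
        "custkey": [10, 20, 10],
    }
)

cleaned = drop_degenerate_columns(sample)
print(list(cleaned.columns))  # ['orderkey', 'custkey']
```

The original frame is left untouched, so the cleaned frames can be built just before the `load_data(tables=...)` call.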
