Using data frames with inclusion mining

Ryan Wisnesky

May 4, 2026, 8:41:52 PM
to Desbordante Q&A
Hi all, I've successfully run the inclusion mining demo on the associated example CSV files, and it is great. But I can't figure out how to do inclusion mining directly on a set of data frames that don't come from CSV files (e.g., ones that come from SQL). Any help would be appreciated, thanks! -Ryan

zr9ihi

May 5, 2026, 8:42:43 AM
to Desbordante Q&A

Hi Ryan,

Thanks for your question!

At the moment, inclusion mining (and, more generally, the Desbordante project) supports data provided as CSV files or pandas DataFrames (you can find an example in `basic/mining_aind.py`). Direct integration with other data sources is not yet available.

As a workaround, you can prepare your data by converting it into one of the supported formats. For example, you can export your SQL results to CSV or load them into a pandas DataFrame within your workflow.
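
The CSV route of this workaround can be sketched with the Python standard library alone. This is only an illustration, not part of Desbordante: `export_query_to_csv`, the in-memory SQLite database, and the `orders` rows are hypothetical stand-ins for a real SQL source. The final tuple follows the `(path, separator, has_header)` form used for CSV inputs in `examples/basic/mining_ind.py`.

```python
import csv
import sqlite3
import tempfile
from pathlib import Path


def export_query_to_csv(connection, query, csv_path):
    """Write the result set of `query` to `csv_path`, with a header row."""
    cursor = connection.execute(query)
    # cursor.description holds one 7-tuple per result column; item 0 is the name.
    header = [description[0] for description in cursor.description]
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        writer.writerows(cursor)
    return csv_path


# Hypothetical in-memory database standing in for the real SQL source.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE orders (orderkey INTEGER, custkey INTEGER)")
connection.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 10), (2, 20)])

csv_path = Path(tempfile.mkdtemp()) / "orders.csv"
export_query_to_csv(
    connection,
    "SELECT orderkey, custkey FROM orders ORDER BY orderkey",
    csv_path,
)

# (path, separator, has_header): the tuple form the CSV examples pass in `tables=`.
table_spec = (str(csv_path), ",", True)
```

Because the file is named after the table, the file name that Desbordante reports for CSV inputs keeps the table identifiable in the mined inclusions.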

If you have any trouble, feel free to ask!

Alexey Shlyonskikh

May 5, 2026, 10:37:12 AM
to Desbordante Q&A
Hello

Dataframes can be passed as elements of the list for the `tables=` option. E.g., in examples/basic/mining_ind.py, if you `import pandas`, you can replace `TABLES = [(f'examples/datasets/ind_datasets/{table_name}.csv', ',', True) for table_name in` with `TABLES = [pandas.read_csv(f'examples/datasets/ind_datasets/{table_name}.csv', sep=',', header=0) for table_name in` and get exactly the same results. In general, wherever a CSV file's parameters are passed in an example, a corresponding DataFrame can be used in its place.

Ryan Wisnesky

May 5, 2026, 11:25:16 AM
to Alexey Shlyonskikh, Desbordante Q&A
This is exactly what I was looking for, thanks!  And thanks to everyone who responded, all suggestions were helpful.  -Ryan


Ryan Wisnesky

May 11, 2026, 7:47:01 PM
to Alexey Shlyonskikh, Desbordante Q&A
Quick follow-up: I can pass the data frame in instead of a CSV file, but the data frame is missing the table/query name that would identify it. Here’s a full example using Trino and its built-in test database. As you can see, the inclusions say “Pandas dataframe” instead of the CSV file name, which makes them ambiguous when column names overlap. Is there a way to set the data frame name? Thanks again! -Ryan

from sqlalchemy import create_engine
import desbordante
import pandas as pd

engine = create_engine('trino://ry...@nuc.local:8080/tpch/tiny')

df_table1 = pd.read_sql("lineitem", engine)
df_table2 = pd.read_sql("orders", engine)

TABLES = [df_table1, df_table2]

algo = desbordante.ind.algorithms.Default()
algo.load_data(tables=TABLES)
algo.execute()
inds = algo.get_inds()

for ind in inds: print(ind)

Prints: 

(Pandas dataframe, [orderkey]) -> (Pandas dataframe, [orderkey])
(Pandas dataframe, [suppkey]) -> (Pandas dataframe, [partkey])
(Pandas dataframe, [linenumber]) -> (Pandas dataframe, [orderkey])
(Pandas dataframe, [linenumber]) -> (Pandas dataframe, [partkey])
(Pandas dataframe, [linenumber]) -> (Pandas dataframe, [suppkey])
(Pandas dataframe, [linenumber]) -> (Pandas dataframe, [orderkey])
(Pandas dataframe, [tax]) -> (Pandas dataframe, [discount])
(Pandas dataframe, [linestatus]) -> (Pandas dataframe, [orderstatus])
(Pandas dataframe, [commitdate]) -> (Pandas dataframe, [shipdate])
(Pandas dataframe, [commitdate]) -> (Pandas dataframe, [receiptdate])
(Pandas dataframe, [orderkey]) -> (Pandas dataframe, [orderkey])
(Pandas dataframe, [custkey]) -> (Pandas dataframe, [partkey])

But it’s hard to tell from the printout which data frame is which (the names `lineitem` and `orders` have been lost). Thanks again, Ryan

Anton Chizhov

May 12, 2026, 4:04:46 AM
to Desbordante Q&A
At the moment, Desbordante doesn't let you attach a user-defined name to a pandas DataFrame, so you see the default string representation in the output. But there are several ways to solve your problem.

One option is to use `ind.to_short_string()`, which prints table indices and column indices:

for ind in inds:
    print(ind.to_short_string())

Example output:
(0, [2]) -> (1, [0])
(2, [2]) -> (1, [0])
(3, [2]) -> (1, [0])

This at least makes it possible to identify which table is being referenced.
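
If only the short strings are available (for example, in a saved log), they can be mapped back to names by parsing them. The sketch below is plain Python and does not call Desbordante. It assumes the `(table, [columns]) -> (table, [columns])` format shown in the example output above, with multiple column indices separated by commas. `parse_side`, `name_short_ind`, and the column lists are hypothetical and would need to match your own tables.

```python
import re

# Assumed format of one side, taken from the example output: "(0, [2])".
SIDE_PATTERN = re.compile(r"\((\d+), \[([\d, ]*)\]\)")


def parse_side(text):
    """Split one side such as "(0, [2])" into (table_index, column_indices)."""
    match = SIDE_PATTERN.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"unexpected side format: {text!r}")
    table_index = int(match.group(1))
    column_indices = [int(i) for i in match.group(2).split(",") if i.strip()]
    return table_index, column_indices


def name_short_ind(short, table_names, column_names):
    """Rewrite a short-string IND with table and column names."""
    lhs, rhs = short.split("->")
    parts = []
    for side in (lhs, rhs):
        table_index, column_indices = parse_side(side)
        columns = [column_names[table_index][i] for i in column_indices]
        parts.append(f"({table_names[table_index]}, [{', '.join(columns)}])")
    return " -> ".join(parts)


# Hypothetical names, listed in the same order as the tables were loaded.
TABLE_NAMES = ["lineitem", "orders"]
COLUMN_NAMES = [["orderkey", "partkey", "suppkey"], ["orderkey", "custkey"]]

print(name_short_ind("(0, [2]) -> (1, [0])", TABLE_NAMES, COLUMN_NAMES))
# prints: (lineitem, [suppkey]) -> (orders, [orderkey])
```

The lookup only works if `TABLE_NAMES` and `COLUMN_NAMES` follow the order of the list passed to `tables=`, since the indices in the short string refer to positions in that list.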

You can also define a custom print method for INDs. IND objects are not just strings - they expose their internal structure.
Each IND has two sides (lhs and rhs), and each side is a column combination that holds a table index and a list of column indices.

So you can map those indices back to your table names and column names yourself.

TABLE_NAMES = ["lineitem", "orders"]
TABLES = [pd.read_sql(table_name, engine) for table_name in TABLE_NAMES]

# run algorithm
# ...

def format_cc(cc):
    table = TABLE_NAMES[cc.table_index]
    columns = [
        TABLES[cc.table_index].columns[column_index]
        for column_index in cc.column_indices
    ]
    return f"({table}, {columns})"

for ind in inds:
    print(f"{format_cc(ind.get_lhs())} -> {format_cc(ind.get_rhs())}")