Correctness of parallel read_sql

Steffen Kläbe

Sep 24, 2020, 6:46:57 AM
to modin-dev
Hi all,
For the parallel read_sql implementation, Modin uses LIMIT and OFFSET to paginate the data and then unions the partial results. Consider a partitioned table in the database: how is the correctness of the result ensured here? If no ORDER BY clause is specified, the database does not guarantee any row order, so each LIMIT/OFFSET query may see the rows in a different order and the union of the pages can contain duplicated or missing rows. Running the same query multiple times can therefore yield different result sets. (Even if an ORDER BY is specified, the order of tuples sharing the same sort key is not guaranteed, which is a problem whenever the sort key is not unique.)

As a consequence, I get a different result when comparing the parallel read_sql and the ordinary pandas read_sql.
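
To illustrate what I would expect instead, here is a minimal sketch of deterministic pagination. It is not Modin's actual code: it uses the standard-library sqlite3 module, and the helper name read_sql_paginated, its order_by parameter, and the table t are hypothetical names chosen for the example. The assumption is that the caller supplies a column list that ends in a unique key, so that ties on a non-unique sort key are broken consistently on every page.

```python
import sqlite3


def read_sql_paginated(conn, query, order_by, page_size):
    """Fetch `query` page by page under a total order (hypothetical helper).

    `order_by` is assumed to be a comma-separated column list that includes
    a unique key, so every LIMIT/OFFSET window sees the same row order.
    """
    # Wrap the user query so the ORDER BY applies to its full result.
    paged = (
        f"SELECT * FROM ({query}) AS q "
        f"ORDER BY {order_by} LIMIT ? OFFSET ?"
    )
    rows, offset = [], 0
    while True:
        page = conn.execute(paged, (page_size, offset)).fetchall()
        rows.extend(page)
        if len(page) < page_size:  # a short (or empty) page is the last one
            break
        offset += page_size
    return rows


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, grp INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [(i, i % 3) for i in range(10)])

# grp alone is not unique, so id is appended as a tiebreaker.
paged_rows = read_sql_paginated(
    conn, "SELECT id, grp FROM t", order_by="grp, id", page_size=4
)
full_rows = conn.execute("SELECT id, grp FROM t ORDER BY grp, id").fetchall()
```

With a total order, the union of the pages matches the single full read. This only covers the ordering problem; if the table is modified between the page queries, the pages would additionally have to be read from the same snapshot, which the sketch does not attempt.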

Best regards,
Steffen

Devin Petersohn

Sep 25, 2020, 9:56:00 AM
to Steffen Kläbe, modin-dev
Hi Steffen,

It would be great if you could open an issue for this on GitHub. We should definitely get this fixed.

Devin
