If your workspace is enabled for the serverless compute public preview, your query history will also contain all SQL and Python queries run on serverless compute for notebooks and jobs. See Connect to serverless compute.
The time recorded in query history for a SQL query is only the time the SQL warehouse spends actually executing the query. It does not record any additional overhead associated with getting ready to execute the query, such as internal queuing, or additional time related to the data upload and download process.
Queries shared by a user with Run as Owner permissions with another user who has CAN RUN permissions appear in the query history of the user executing the query, not in that of the user who shared it.
Databricks SQL Query History is a feature of Databricks that keeps a record of the SQL queries executed in a Databricks workspace. It provides details about each query such as the query text, execution time, user who ran the query, and more. This information can be useful for auditing, performance tuning, and understanding the usage patterns of the Databricks workspace.
The databricks_sql_query_history table provides insights into the SQL queries executed within a Databricks workspace. As a data analyst or data engineer, you can explore the details of past queries through this table, including the query text, execution time, user who ran the query, and more. Utilize it to audit the usage of the Databricks workspace, tune the performance of your SQL queries, and understand the usage patterns of your team.
Explore which data modifications have been made in your Databricks warehouse. This is useful for tracking changes and understanding the impact of various insertions, updates, and deletions within your data.
Explore the list of queries that have completed their updates to get insights into any potential errors or issues. This is useful for identifying and troubleshooting problematic queries within your Databricks account.
Determine the performance of each executed query by assessing factors such as compilation time, execution time, and total time taken. This can help in identifying inefficient queries and optimizing them for better performance.
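As a rough sketch of how these timings can be pulled programmatically, the snippet below uses the Databricks SDK for Python to list recent queries along with their metrics. It assumes the databricks-sdk package is installed and that workspace authentication (host and token) is already configured; the metric field names follow the Query History API and may differ between SDK versions.

```python
# Sketch: list recent queries with timing metrics via the Databricks SDK for
# Python (assumes `pip install databricks-sdk` and that DATABRICKS_HOST /
# DATABRICKS_TOKEN or a config profile are set for authentication).
from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

# include_metrics=True asks the Query History API to return timing details
# such as compilation_time_ms, execution_time_ms and total_time_ms.
for q in w.query_history.list(include_metrics=True, max_results=50):
    if q.metrics is not None:
        print(
            q.executed_as_user_name,
            q.metrics.compilation_time_ms,
            q.metrics.execution_time_ms,
            q.metrics.total_time_ms,
            (q.query_text or "")[:80],
        )
```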
I would like to know if there is direct access to the Databricks query history tables. For compliance reasons, I would like to be able to create reports such as: who has accessed a particular column in a table in the past 6 months. The query history web interface is quite limited. I would ideally like to use SQL to query the history table. Is this possible?
I would like to be able to query the query history tables by running my own queries. I do not want to use the Query History interface supplied by Databricks; I want to be able to create Python scripts that access the underlying tables/views for TAC and query history. From your response, it seems like this is not possible. Can you confirm that?
For posterity: there is a query history system table that contains all of this information, which is in preview at the time of writing. If you're reading this later than May 2024, please check the documentation for the query metrics table.
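As an example, once the system schema is enabled, the table can be queried with ordinary SQL from a notebook or a Python script. A minimal sketch, assuming the preview table is exposed as system.query.history and that it carries columns along the lines of executed_by, statement_text, start_time, and total_duration_ms (check the current documentation for the actual names):

```python
# Sketch: query the (preview) query history system table from a Databricks
# notebook. The table name and column names are assumptions based on the
# preview and should be verified against the current documentation.
slow_queries = spark.sql("""
    SELECT executed_by,
           statement_text,
           total_duration_ms
    FROM system.query.history
    WHERE start_time >= add_months(current_date(), -6)
    ORDER BY total_duration_ms DESC
    LIMIT 20
""")
slow_queries.show(truncate=False)
```

The same pattern covers the compliance question above: filter on statement_text and executed_by to report who touched a given table or column over the past six months.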
A quick question on this... (First of all, thanks so much for the sample code!) I'm playing around with this and I would like to get the statement_type and status. I see that duration, query_start_time_ms, and query_end_time_ms are int data types defined as LongType(); executed_as_user_name and query_text are str data types defined as StringType(). statement_type and status are listed with the data types QueryStatementType and QueryStatus respectively. How would I define the StructType for these fields?
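One way to handle this, sketched below under the assumption that the API returns enum-like objects for those two fields: model them as plain strings in the Spark schema and convert each enum to its string value when building the rows.

```python
# Sketch: Spark has no enum type, so statement_type and status can be modelled
# as strings; the enum-like values returned by the API are converted to their
# string form before the DataFrame is created.
from pyspark.sql.types import StructType, StructField, StringType, LongType

schema = StructType([
    StructField("query_start_time_ms", LongType(), True),
    StructField("query_end_time_ms", LongType(), True),
    StructField("duration", LongType(), True),
    StructField("executed_as_user_name", StringType(), True),
    StructField("query_text", StringType(), True),
    StructField("statement_type", StringType(), True),  # e.g. "SELECT"
    StructField("status", StringType(), True),          # e.g. "FINISHED"
])

def to_row(q):
    """Convert one query-history record into a tuple matching the schema.

    q is assumed to expose the attributes referenced below; the .value access
    assumes enum-like objects and falls back to str() otherwise.
    """
    def as_str(v):
        return getattr(v, "value", str(v)) if v is not None else None

    return (
        q.query_start_time_ms,
        q.query_end_time_ms,
        q.duration,
        q.executed_as_user_name,
        q.query_text,
        as_str(q.statement_type),
        as_str(q.status),
    )

# df = spark.createDataFrame([to_row(q) for q in queries], schema)
```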
Thanks @josh_melton! I was wondering about this right now (one day after your post!) since I only found the UI and API in the documentation and was really puzzled that there is no equivalent in Unity Catalog to the Snowflake query_history table.
Prior to setting up the SpotApp, a data pipeline must create the tables that will be referenced via Embrace. This pipeline could be established within the Databricks platform via a workspace and scheduled job as part of the Data Science & Engineering platform. Alternatively, third-party tools can be leveraged to complete this. The key activities are to:
An example Databricks workspace archive is available as a reference (SpotAppDatabricksPythonAPIFetch). This includes example Python code to query the identified endpoint APIs and create the Delta tables. Within the Data Science and Engineering platform this workspace can be scheduled as a recurring job.
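A minimal sketch of that pattern is shown below, with the caveat that the record fields and target table name are illustrative assumptions rather than the contents of the actual archive; the fetch itself is covered by the Query History API discussed elsewhere in this document.

```python
# Sketch: turn query-history records fetched from the API into a Delta table
# that Embrace can reference. The record fields and target table name are
# illustrative assumptions; `spark` is the SparkSession provided in a
# Databricks notebook or job.
records = [
    {
        "query_id": "abc-123",                        # example values only
        "query_text": "SELECT 1",
        "executed_as_user_name": "analyst@example.com",
        "duration": 1234,
    },
]

df = spark.createDataFrame(records)

# Overwrite (or append to) the Delta table on each run of the scheduled job.
df.write.format("delta").mode("overwrite").saveAsTable("query_history_raw")
```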
If you want to run this notebook yourself, you need to create a Databricks personal access token, store the access token using our secrets API, and pass it in through the Spark config, such as spark.pattoken {{secrets/queryhistory_etl/user}}, or use Azure Key Vault.
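For reference, reading that secret back inside the notebook might look like the sketch below; dbutils is available in Databricks notebooks, and the scope and key names follow the example above.

```python
# Sketch: read the personal access token from the secret scope referenced
# above and use it as a bearer token when calling the REST APIs.
token = dbutils.secrets.get(scope="queryhistory_etl", key="user")
headers = {"Authorization": f"Bearer {token}"}
```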
This is the fifth article in our series on parsing different SQL dialects. We explored SQL parsing on Snowflake, MS SQL Server, Oracle, and Redshift in our earlier blog posts. In this blog post we cover SQL parsing on Databricks. We take table and column audit logging as a use case for parsing SQL on Databricks.
We provide practical examples of interpreting SQL from the Databricks query history. Additionally, we will present some code that utilises the FlowHigh SQL parser SDK to programmatically parse SQL from Databricks. The parsing of Databricks SQL can be automated using the SDK.
In another post on the Sonra blog, I go into great depth on the benefits of using an SQL parser. In this post we cover the use of an SQL parser for both data engineering and data governance.
One example use case for an SQL parser is table and column audit logging. Audit logging refers to the detailed recording of access and operations performed on specific tables and columns in a database, including the execution of SQL queries. Such logging can be essential for ensuring security, compliance with regulatory standards, and understanding data access patterns.
Databricks records every SQL query executed and retains this information in the query execution log for a period of 30 days. To retrieve this data, users can utilise a dedicated API endpoint: /api/2.0/sql/history/queries. This endpoint includes details such as the timestamp of execution and the user, among other metrics. You can extract the content of each query via the query_text attribute and also get the query_id as a unique identifier.
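A small sketch of that call is shown below, assuming the workspace URL and a personal access token are available as environment variables and that the response body lists the queries under a res key:

```python
# Sketch: fetch recent entries from the Query History API and keep the
# query_id / query_text pairs for later parsing. DATABRICKS_HOST and
# DATABRICKS_TOKEN are assumed to be set in the environment.
import os
import requests

resp = requests.get(
    f"{os.environ['DATABRICKS_HOST']}/api/2.0/sql/history/queries",
    headers={"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"},
    params={"max_results": 100},
)
resp.raise_for_status()

queries = [
    (q.get("query_id"), q.get("query_text"))
    for q in resp.json().get("res", [])  # "res" is assumed to hold the entries
]
```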
System tables are governed by Unity Catalog; you must have at least one Unity Catalog-enabled workspace in your account in order to enable and access system tables. System tables contain information from every workspace in the account, but only workspaces with Unity Catalog enabled can access them.
An SQL query is ingested by the FlowHigh SQL parser for Databricks, which then returns the processed result either as a JSON or XML message. For instance, the parser produces a full JSON message of the SQL query from the query history we collected using the API. This output includes information on the filter conditions, fields fetched, aliases used, join conditions, tables and other components of the SQL statement.
The analysis of the JSON representation reveals two inner join conditions. The initial join condition establishes a connection based on the attributes C1 and C5. Meanwhile, the subsequent join condition associates the attributes C6 and C7.
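Purely as an illustration of consuming such a payload downstream (the JSON structure and key names below are invented for this example and are not FlowHigh's actual output schema), the join pairs could be extracted like this:

```python
# Illustrative only: walk a parser-output payload and print the join
# conditions. The structure and key names are invented for this example
# and do NOT reflect FlowHigh's actual JSON schema.
import json

payload = json.loads("""
{
  "joins": [
    {"type": "inner", "left": "C1", "right": "C5"},
    {"type": "inner", "left": "C6", "right": "C7"}
  ]
}
""")

for j in payload["joins"]:
    print(f'{j["type"]} join on {j["left"]} = {j["right"]}')
```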
You can also access the FlowHigh SQL parser through the web-based user interface. The figure below shows how FlowHigh provides information about the tables in a SQL query by grouping them into types of tables.
We have an intermittent issue where occasionally a partition in our Power BI PPU Import Dataset times out at 5 hours. This model is importing data from Databricks via the Databricks SQL Warehouse. When I look at Query History in Databricks SQL, I see a query that failed with the following error message: "Query has been timed out due to inactivity". In one example, that query's info screen shows a total Wall-clock duration of 15.587 seconds, but the Result fetching by client row shows 1.04 hours. Power BI doesn't seem to know that the query failed, as it continues to wait for results until it times out at 5 hours (it is a Power BI Premium Per User dataset). In this example the query failed 2 hours before the refresh timed out at 5 hours. In other examples, the query wall-clock sometimes runs for a few minutes, but the Result fetching by client row always shows a very long time (over an hour to a few hours) relative to the wall-clock.
This error is intermittent as the model does not usually time out, but when it does, I always find a query with this error message in query history during the time of the load. It is also not always on the same partition, but it is always on a partition on one of the larger tables. The full refresh on this particular data model usually takes around 2 hours. When this error occurs, it times out at 5 hours. Then the retry will complete in the normal 2 hours.
It seems there is a disconnect between Power BI and Databricks that is causing Power BI not to receive or grab the rows from Databricks. The query fails in Databricks, but Power BI doesn't know that it failed and continues to wait for the results until it hits the 5 hour timeout.