Presto transfers the data through the network. Due to a connection timeout or a problem in worker nodes, this network data transfer may fail occasionally (PAGE_TRANSPORT_TIMEOUT, WORKER_RESTARTED). Presto is designed for faster query processing when compared to Hive, so it sacrifices fault-tolerance to some extent. Typically, more than 99.5% of Presto queries finish without any error on the first run. In addition, Treasure Data provides a query retry mechanism on query failures, so nearly 100% of queries finish successfully after being retried.
Because of the nature of network-based distributed query processing, if your query tries to process billions of records, the chance of hitting network failures increases. If you see the PAGE_TRANSPORT_TIMEOUT error frequently, try to reduce the input data size by narrowing down the TD_TIME_RANGE or reducing the number of columns in SELECT statements.
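As a sketch, narrowing the scanned range with TD_TIME_RANGE and selecting only the needed columns looks like this (the table and column names are illustrative):

```sql
-- Scan only one day of data instead of the full table,
-- and project only the columns the query actually needs
SELECT user_id, path
FROM web_logs
WHERE TD_TIME_RANGE(time, '2024-01-01', '2024-01-02', 'JST')
```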
For Treasure Data customers, if problems persist, see Support Channels or send an e-mail to sup...@treasuredata.com with the job IDs of your queries. If you can, include information about the expected results and the meaning of your data set.
If you have a huge number of rows in a 1-hour partition, processing this partition can be the performance bottleneck. To check the number of rows contained in each partition, run a query similar to the following:
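For example, a query along these lines counts rows per 1-hour partition and surfaces the largest ones (my_table is a placeholder for your table name):

```sql
-- Count rows in each 1-hour partition; the biggest partitions
-- are candidates for the processing bottleneck
SELECT TD_TIME_FORMAT(time, 'yyyy-MM-dd HH', 'UTC') AS hour,
       COUNT(1) AS row_count
FROM my_table
GROUP BY 1
ORDER BY row_count DESC
```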
Presto tracks the memory usage of each query. While the available memory varies according to your price plan, in most cases it is possible to rewrite your query to resolve this issue. Here is a list of memory-intensive operations:
This stores the entire set of columns c1, c2, and c3 in the memory of a single worker node to check the uniqueness of the tuples. The amount of required memory increases with the number of columns and their size. Remove DISTINCT from your query, or apply it after reducing the number of input rows with a subquery.
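A sketch of the subquery approach (my_table and the time range are illustrative): shrink the input first, then deduplicate the smaller result.

```sql
-- Reduce the input rows in a subquery before applying DISTINCT,
-- so less data has to be held in worker memory
SELECT DISTINCT c1, c2, c3
FROM (
  SELECT c1, c2, c3
  FROM my_table
  WHERE TD_TIME_RANGE(time, '2024-01-01', '2024-01-02')
)
```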
approx_distinct(x) returns an approximate result of the true value. It ensures that returning a value far from the true value happens only with low probability. If you need to summarize the characteristics of your data set, use this approximate version.
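For example, counting unique users approximately instead of exactly (web_logs and user_id are illustrative names):

```sql
-- approx_distinct is a memory-friendly alternative to COUNT(DISTINCT ...)
SELECT approx_distinct(user_id) AS approx_unique_users
FROM web_logs
```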
The distributed join algorithm partitions both the left and right side tables by using hash values of the join keys as a partitioning key. It works even if the right side table is large. However, it can increase the number of network data transfers and is usually slower than the broadcast join.
You can parallelize the query result output process by using the CREATE TABLE AS SELECT statement. If you DROP the table before running the query, your performance is significantly better. The result output performance will be 5x faster than running SELECT *. Treasure Data Presto skips the JSON output process and directly produces a 1-hour partitioned table.
Without using DROP TABLE, Presto uses JSON text to materialize query results. If the result table contains 100GB of data, the coordinator transfers more than 100GB of JSON text to save the query result. So, even if the query computation is almost finished, the output of the JSON results takes a long time.
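Putting the two steps together, the DROP-then-CTAS pattern looks like this (my_result, my_table, and the column names are placeholders):

```sql
-- Drop the result table first so the CTAS below can write
-- the 1-hour partitioned output in parallel, bypassing JSON output
DROP TABLE IF EXISTS my_result;

CREATE TABLE my_result AS
SELECT c1, c2, c3
FROM my_table;
```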
Treasure Data uses a column-oriented storage format, so accessing a small set of columns is fast, but query performance deteriorates as the number of columns in your query increases. Be selective in choosing columns.
This impacts performance even though the query results are ready; TD Presto is waiting for the worker node to complete its sequential operations. To mitigate this, Treasure Data now uses result_output_direct, which redirects the query result to S3 in parallel, thereby improving the performance of the queries.
The building block of geospatial functions is the Geometry data type. Geometry data types are expressed using the WKT format. Well-known text (WKT) is a text markup language for representing vector geometry objects on a map. Examples include:
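Some standard WKT representations of common geometry objects:

```
POINT (30 10)
LINESTRING (30 10, 10 30, 40 40)
POLYGON ((30 10, 40 40, 20 40, 10 20, 30 10))
```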
All Presto geospatial functions require the Geometry data type. The data is stored in TD as a WKT formatted string. As a result, the WKT string data has to be converted into a Geometry type before it can be used in geospatial functions. The conversion from WKT string data to Geometry type is done using geospatial constructor functions. These functions accept WKT formatted text as arguments and return the Geometry data type required by geospatial functions. Examples of constructor functions include:
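Two such constructor functions in Presto are ST_Point and ST_GeometryFromText; a minimal sketch of their use (the literal coordinates are illustrative):

```sql
-- ST_Point builds a point Geometry from longitude/latitude values;
-- ST_GeometryFromText parses a WKT string into a Geometry value
SELECT ST_Point(-71.064544, 42.28787) AS pt,
       ST_GeometryFromText('POLYGON ((0 0, 1 0, 1 1, 0 0))') AS poly
```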
For Presto, ST_DISTANCE returns a value in the units of the spatial reference, which can be impractical in use cases where you need the distance between two points in kilometers. If you need to calculate distance in kilometers, you have to use an SQL query. Assuming you have two locations, A(lat1, long1) and B(lat2, long2), the SQL query to find the distance between the two points in kilometers is:
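A sketch of such a query using the haversine formula with Presto's built-in math functions (the table name locations and the lat1/long1/lat2/long2 columns are illustrative):

```sql
-- Haversine distance in kilometers between A(lat1, long1) and B(lat2, long2);
-- 6371 is the mean Earth radius in km
SELECT
  2 * 6371 * ASIN(
    SQRT(
      POWER(SIN(RADIANS(lat2 - lat1) / 2), 2)
      + COS(RADIANS(lat1)) * COS(RADIANS(lat2))
        * POWER(SIN(RADIANS(long2 - long1) / 2), 2)
    )
  ) AS distance_km
FROM locations
```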
The UDF TD_SESSIONIZE_WINDOW() was introduced in 2016 to replace TD_SESSIONIZE. It is a Presto window function with equivalent functionality, more consistent results, and faster, more reliable performance.
Sessionization of a table of event data groups a series of event rows associated with users into individual sessions for analysis. As long as the series of events is associated with the same user identifier (typically IP address, email, cookie, or similar identifier) and events are separated by no more than a chosen timeout interval, those events can be grouped into a session. Each session is then assigned a unique session ID.
However, TD_SESSIONIZE_WINDOW is implemented as a Presto window function, so in calling it you use an OVER clause to specify the window over which events are aggregated, rather than passing that into the function directly as with TD_SESSIONIZE.
Like TD_SESSIONIZE, TD_SESSIONIZE_WINDOW generates a unique session ID (UUID) for the first row in a session, and returns that value for all rows in the session, until it finds a separation in time between rows greater than the timeout value.
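A sketch of a typical call, using a 1-hour (3600-second) timeout and partitioning by IP address (web_logs and its columns are illustrative names):

```sql
-- TD_SESSIONIZE_WINDOW(time, timeout) assigns a session UUID per row;
-- the OVER clause defines the per-user ordering the sessions are built over
SELECT
  TD_SESSIONIZE_WINDOW(time, 3600)
    OVER (PARTITION BY ip_address ORDER BY time) AS session_id,
  time,
  ip_address,
  path
FROM web_logs
```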
Once you have created a Presto connection, you can select data from the available databases and tables and load that data into your app. If you select multiple databases, data will be loaded from all of them at the same time.
If enabled, the driver uses the MIT Kerberos library for Kerberos authentication. The MIT Kerberos library must be installed on the same machine where the Presto driver and ODBC connector are installed.
There are two types of credentials that can be used when making a connection in Qlik Sense SaaS. If you leave the User defined credentials check box deselected, then only one set of credentials will be used for the connection. These credentials belong to the connection and will be used by anyone who can access it. For example, if the connection is in a shared space, every user in the space will be able to use these credentials. This one-to-one mapping is the default setting.
If you select User defined credentials, then every user who wants to access this connection will need to input their own credentials before selecting tables or loading data. These credentials belong to a user, not a connection. User defined credentials can be saved and used in multiple connections of the same connector type.
In the Data load editor, you can click the edit icon underneath the connection to edit your credentials. In the hub or Data manager, you can edit credentials by right-clicking on the connection and selecting Edit Credentials.
Maximum length of string fields. This can be set from 256 to 16384 characters. The default value is 4096. Setting this value close to the maximum length may improve load times, as it limits the need to allocate unnecessary resources. If a string is longer than the set value, it will be truncated, and the exceeding characters will not be loaded.
Follow the instructions below to install and configure this check for an Agent running on a host. For containerized environments, see the Autodiscovery Integration Templates for guidance on applying these instructions.
The Presto check is included in the Datadog Agent package. No additional installation is needed on your server. Install the Agent on each Coordinator and Worker node from which you wish to collect usage and performance metrics.
This check has a limit of 350 metrics per instance. The number of returned metrics is indicated in the status page. You can specify the metrics you are interested in by editing the configuration below. To learn how to customize the metrics to collect, see the JMX Checks documentation for more detailed instructions. If you need to monitor more metrics, contact Datadog support.
presto.can_connect
Returns CRITICAL if the Agent is unable to connect to and collect metrics from the monitored Presto instance, WARNING if no metrics are collected, and OK otherwise.
Statuses: ok, critical, warning
For this tutorial, we will be analyzing a dataset of Peripheral Blood Mononuclear Cells (PBMC) freely available from 10X Genomics. There are 2,700 single cells that were sequenced on the Illumina NextSeq 500. The raw data can be found here.
We start by reading in the data. The Read10X() function reads in the output of the cellranger pipeline from 10X, returning a unique molecular identifier (UMI) count matrix. The values in this matrix represent the number of molecules for each feature (i.e. gene; row) that are detected in each cell (column). Note that more recent versions of cellranger now also output using the h5 file format, which can be read in using the Read10X_h5() function in Seurat.
We next use the count matrix to create a Seurat object. The object serves as a container that contains both data (like the count matrix) and analysis (like PCA, or clustering results) for a single-cell dataset. For more information, check out our [Seurat object interaction vignette], or our GitHub Wiki. For example, in Seurat v5, the count matrix is stored in pbmc[["RNA"]]$counts.