Exploring Two Columns Code.org

Jarrell Campbell

Aug 4, 2024, 9:03:07 PM

to mantireva

I am facing a challenge in implementing field-level encryption for the "rating" column in my MySQL InnoDB database. The codebase is distributed across various repositories, and I am unable to make changes at the application code level. Therefore, I am exploring options to achieve this encryption at the database level.

Given these constraints, I am seeking advice on alternative approaches or strategies to implement field-level encryption for the "rating" column in MySQL InnoDB without modifying the application code as there are more than 70 rating columns. Are there any database-level solutions or configurations that could help achieve this goal seamlessly, ensuring that the rating data remains encrypted at rest?

Triggers: I attempted to use triggers to handle encryption and decryption during insert and update operations. However, triggers do not seem to work seamlessly with select statements, making this approach unsuitable for my use case.

AST (Abstract Syntax Tree) for Query Modification: I experimented with parsing MySQL queries into an AST and rewriting them to perform encryption/decryption, using aes_encrypt() and aes_decrypt(). While this approach worked for some scenarios, it is not viable for my PHP application, which relies heavily on raw queries. Modifying these raw queries would be a substantial undertaking.
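To make the trigger limitation concrete, here is a minimal sketch. It is not MySQL: it uses Python's built-in sqlite3 module, and a base64 stand-in takes the place of aes_encrypt()/aes_decrypt(), so it provides no real security. The table reviews, the view reviews_plain and the helpers toy_enc/toy_dec are all hypothetical names. It shows that a trigger can transform the value on write, that a plain SELECT still returns the stored form, and that a view can decode on read.

```python
import base64
import sqlite3


def toy_enc(value):
    # NOT real encryption: a reversible base64 stand-in for illustration only.
    return base64.b64encode(str(value).encode()).decode()


def toy_dec(value):
    # Reverses toy_enc.
    return base64.b64decode(value.encode()).decode()


conn = sqlite3.connect(":memory:")
conn.create_function("toy_enc", 1, toy_enc)
conn.create_function("toy_dec", 1, toy_dec)
# Allow application-defined functions inside triggers and views.
conn.execute("PRAGMA trusted_schema = ON")

conn.executescript("""
CREATE TABLE reviews (id INTEGER PRIMARY KEY, rating TEXT);

-- Write path: the trigger rewrites the value after each insert.
CREATE TRIGGER enc_rating AFTER INSERT ON reviews
BEGIN
    UPDATE reviews SET rating = toy_enc(NEW.rating) WHERE id = NEW.id;
END;

-- Read path: triggers never fire on SELECT, so decoding needs a view.
CREATE VIEW reviews_plain AS
    SELECT id, toy_dec(rating) AS rating FROM reviews;
""")

conn.execute("INSERT INTO reviews (rating) VALUES ('5')")

stored = conn.execute("SELECT rating FROM reviews").fetchone()[0]
decoded = conn.execute("SELECT rating FROM reviews_plain").fetchone()[0]
print("stored :", stored)
print("decoded:", decoded)
```

Whether a comparable view-over-table arrangement would fit the 70-plus rating columns in the MySQL schema is untested here; the sketch only illustrates why triggers alone do not cover reads.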

Tidyverse packages are designed to simplify and streamline the data science process of load, prep, train and visualize by providing a more consistent development experience across the various libraries. A good analogy would be how jQuery simplified Web development by creating a more consistent programming surface across the DOM, event handling and more. While not a language per se, jQuery made JavaScript more productive by lessening the friction of its most common tasks.

The readr package provides a fast and easy way to read rectangular data files, such as .csv files. It can flexibly parse many types of data files, while handling errors robustly. To get started, create a new R language Jupyter Notebook. For details on Jupyter Notebooks, refer to my February 2018 article on the topic at msdn.com/magazine/mt829269. In the first blank cell, enter the following code to load the .csv file data and display it:

Note that above the tabular output with the contents of the .CSV file is text that highlights how each record was parsed and that the output is a tibble with 183 rows and four columns. Base R uses data frames to store tabular data. In the tidyverse, a tibble is the equivalent structure. In fact, tibbles are data frames, but they modify some default data frame behaviors to meet the needs of modern data analytics. More information about tibbles can be found at tibble.tidyverse.org and in-depth documentation resides at r4ds.had.co.nz/tibbles.html.

With the dplyr package loaded, I will use it to view only the months with 100 or more posts by using the pipe operator %>% to pass the tibble to the filter method. Enter the following code into a new cell and execute it:

I could further analyze the data by adding an additional pipe to a summarize function. Summarize functions create one value summarizing the values in a table. Enter the following code to view the number of rows, the mean post count and the mean PPD values, like so:
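As a rough illustration in Python rather than R, and not the article's own listing, the filter-then-summarize steps described above can be sketched with the standard library. The rows below are made-up sample values, and the Month column name is an assumption; only Year, PostCount and PPD are named in the text.

```python
from statistics import mean

# Hypothetical sample rows; the real file has 183 rows and is not shown here.
posts = [
    {"Year": 2016, "Month": 1, "PostCount": 120, "PPD": 4.0},
    {"Year": 2016, "Month": 2, "PostCount": 40, "PPD": 1.0},
    {"Year": 2017, "Month": 3, "PostCount": 100, "PPD": 3.0},
    {"Year": 2017, "Month": 4, "PostCount": 15, "PPD": 0.5},
]

# Filter step: keep only months with 100 or more posts.
busy_months = [row for row in posts if row["PostCount"] >= 100]

# Summarize step: collapse the filtered rows into one value per statistic.
summary = {
    "n": len(busy_months),
    "mean_posts": mean(row["PostCount"] for row in busy_months),
    "mean_ppd": mean(row["PPD"] for row in busy_months),
}
print(summary)
```

The list comprehension plays the role of the filter call, and the dictionary plays the role of the one-row summary.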

Fortunately, the ggplot2 package makes creating graphs from data straightforward, as it allows for creating graphics declaratively. Simply provide the data and instructions on mapping data columns to graphic elements, as well as which graph type to employ, and ggplot2 handles the rendering. For instance, to create a scatter plot of PostCount by Year, enter the following code to generate the graph as seen in Figure 1.

To further explore the data, I can generate a histogram to explore the distribution of the data. For example, I want to get an idea of the distribution of how many posts there have been across all 16 years. Enter the following code to use data from the fwposts tibble to build out a histogram:

As the graph shows, most months have 50 posts or fewer, with one very noticeable outlier. In statistical terms, the number of posts is skewed right. To get some finer granularity, I will set the binwidth to 10. Enter the following code and run it to create the graph as shown in Figure 2:
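To show what a bin width of 10 means for the counts, here is a small Python sketch (not the article's ggplot2 listing). The post counts are made-up values, and the bins are simple half-open ranges starting at multiples of 10; ggplot2's exact bin boundaries may differ.

```python
from collections import Counter

# Hypothetical monthly post counts, for illustration only.
post_counts = [5, 12, 13, 20, 30, 45, 50, 100, 120]
BIN_WIDTH = 10


def bin_counts(values, width):
    """Count values per half-open bin [start, start + width)."""
    return Counter((v // width) * width for v in values)


histogram = bin_counts(post_counts, BIN_WIDTH)

# Print a text histogram, one row per non-empty bin.
for start in sorted(histogram):
    label = f"{start}-{start + BIN_WIDTH - 1}"
    print(f"{label:>8} {'#' * histogram[start]}")
```

A narrower width splits the same values across more bins, which is where the finer granularity comes from.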

Another useful visualization for understanding distribution of numeric values is the box plot. A box plot is a standardized way of displaying the distribution of data based on a five-number summary: the minimum value, first quartile, median, third quartile and maximum value. Fortunately, generating a box plot is simple in ggplot2. Enter the following code and execute it to see the box plot for Posts:

The generated plot shows that the first and third quartiles are around 13 and 50, with a number of outliers at or above 100. For more information about box plots, read this excellent article on the topic: bit.ly/2IbqkmX.
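The five-number summary behind a box plot can be computed directly. The following is a minimal Python sketch using made-up post counts, not the article's data. It uses statistics.quantiles with the inclusive method, and other quartile conventions can give slightly different values.

```python
from statistics import quantiles

# Hypothetical monthly post counts, for illustration only.
post_counts = [5, 12, 13, 20, 30, 45, 50, 100, 120]


def five_number_summary(values):
    """Return (minimum, first quartile, median, third quartile, maximum)."""
    ordered = sorted(values)
    q1, median, q3 = quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, median, q3, ordered[-1]


summary = five_number_summary(post_counts)
print(summary)
```

The box spans the first to third quartile, and the line inside it marks the median.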

While base R is perfectly acceptable for most data science-related tasks, many R developers prefer to use the tidyverse suite of libraries for increased productivity. In this article, I walked through the most common steps in a typical data science pipeline: loading, exploring, manipulating and visualizing data.

This topic uses the Sample - Superstore data source to walk through how to create basic views and explore your data. It shows how your view of data in Tableau evolves through your process of exploration.

Open Tableau. On the start page, under Connect, click Microsoft Excel. In the Open dialog box, navigate to the Sample - Superstore Excel file on your computer. Go to /Documents/My Tableau Repository/Datasources/version number/[language]. Select Sample - Superstore, and then click Open.

In the worksheet, the columns from your data source are shown as fields on the left side in the Data pane. The Data pane contains a variety of fields organized by table. For each table or folder in a data source, dimension fields appear above the gray line and measure fields appear below the gray line. Dimension fields typically hold categorical data such as product types and dates, while measure fields hold numeric data such as sales and profit. Sometimes a table or folder will contain only dimensions, or only measures to start with. For more information, see Dimensions and Measures, Blue and Green.

If you have related dimension fields, sometimes you might want to group them in a folder, or as a hierarchy. For example, in this data source, Country, State, City, and Postal Code are grouped into a hierarchy named Location. You can drill down into a hierarchy by clicking the + sign in a field, or drill back up by clicking the - sign in a field.

Every time you drag a field into the view or onto a shelf, you're asking a question about the data. The question varies depending on the field you choose, where you place it, and the order in which you add it to the view.

Select one or more fields in the Data pane and then choose a chart type from Show Me, which identifies the chart types that are appropriate for the fields you selected. For details, see Use Show Me to Start a View.

As you start exploring data in Tableau, you'll find there are many ways to build a view. Tableau is extremely flexible, and also very forgiving. As you build a view, if you ever take a path that isn't answering your question, you can always undo to a previous point in your exploration.

The default date level is determined by the highest level that contains more than one distinct value (for example, multiple years and multiple months). That means that if [Order Date] contained data for only one year but had multiple months, the default level would be month. You can change the date level using the field menu.
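The rule above can be sketched in a few lines of Python. This is an illustration of the stated rule only, not Tableau's implementation: the function name default_date_level is hypothetical, and the level list is simplified to year, month and day.

```python
from datetime import date

# Simplified assumption: only three levels, checked from highest to lowest.
LEVELS = (
    ("year", lambda d: d.year),
    ("month", lambda d: d.month),
    ("day", lambda d: d.day),
)


def default_date_level(dates):
    """Return the highest level that has more than one distinct value."""
    for name, part in LEVELS:
        if len({part(d) for d in dates}) > 1:
            return name
    # Fallback when every date is identical (an assumption, not from the text).
    return LEVELS[-1][0]


single_year = [date(2023, 1, 15), date(2023, 4, 2), date(2023, 4, 20)]
multi_year = [date(2022, 6, 1), date(2023, 6, 1)]

print(default_date_level(single_year))
print(default_date_level(multi_year))
```

With a single year and several months the sketch picks month, matching the [Order Date] example in the text.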

If you're wondering why there are two sets of date levels (from Year down to Day), the first set of options uses date parts and the second set of options uses date values. For more information, see Change Date Levels.

The new dimension divides the view into separate panes for each year. Each pane, in turn, has columns for quarters. This view is called a nested table because it displays multiple headers, with quarters nested within years. The word "headers" is a little misleading in this example because the year headers are displayed at the top of the chart, while the quarter headers are displayed at the bottom.

Optionally, you can achieve the same result by dropping Segment just to the left of the Profit axis in the view (shown in the image below). Tableau often supports multiple ways to add fields to the view.

The new dimension divides the view into 12 panes, one for each combination of year and segment. This view is a more complex example of a nested table. Any view that contains this sort of grid of individual charts is referred to as a small multiples view.

Placing a dimension on Color separates the marks according to the members in the dimension, and assigns a unique color to each member. The color legend displays each member name and its associated color.

For more information on the Marks card and level of detail, see Shelves and Cards Reference, Marks, and How dimensions affect the level of detail in the view. Also see Understanding the grain in your data from Tableau Tim.

Watch a video: You can see many Tableau concepts and product features discussed and demonstrated on the Tableau Tim website and YouTube channel.

Datasette offers flexible tools for exploring data tables. It's always worth spending time familiarizing yourself with data in its raw, tabular form before thinking about ways to apply more sophisticated analysis or visualization.

We'll be using an example database of Members of the United States Congress, 1789 to present. I built this example using data from the unitedstates/congress-legislators project on GitHub, maintained by Joshua Tauberer, Eric Mill and over 100 other contributors.