The data we looked at earlier was hard to interpret because it had already been aggregated.
This time, let's look at the "raw" data before it is aggregated.
From here on, we will examine the data while walking through each query.
Executing a SELECT statement (getting all columns): Query No.2
SELECT
*
FROM
`bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_20210118`
Run the above query, and you should get the following results:
The query we just ran is written in a syntax for extracting data called a SELECT statement.
The basic syntax is as follows:
SELECT
[Specify columns to retrieve]
FROM
[Specify table to retrieve from]
In this example, "*" was used for [Specify columns to retrieve]; this symbol retrieves all columns.
In a database, the vertical fields are called columns and each row of data is called a record.
You can also specify a column name, in which case the query will be as follows:
Executing a SELECT statement (retrieving columns): Query No.3
SELECT
event_date
FROM
`bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_20210118`
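Several columns can be retrieved at once by separating their names with commas. The following is a minimal sketch rather than one of the numbered queries. The LIMIT 10 line is an addition for illustration: it only caps the number of rows returned and should not be assumed to reduce the amount of data scanned.

```sql
-- Sketch: retrieve two named columns instead of all columns.
SELECT
  event_date,
  event_name
FROM
  `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_20210118`
-- Illustrative addition: return at most 10 rows.
LIMIT 10
```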
[Side note] What does the "*" in the FROM clause mean?
In Query No. 1, the data source after the FROM clause is written as
bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_*
whereas in Query No. 2 and Query No. 3 it is written as
bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_20210118
This difference exists because GA4 data has a special data structure called "sharding."
This is a complicated topic, so feel free to skip it, but I'll explain it for those who are interested.
Sharding splits data across multiple tables so that a query only has to read the tables it needs, which reduces the amount of data processed during extraction.
In the case of GA4 data, the data is split by date: there is an independent table for each day, with the date appended to the end of the table name. Querying the table for a specific date reads only that single day's data.
On the other hand, there may be cases where you want to run a query across multiple days, in which case you can use "*" to treat multiple tables as if they were a single table.
However, because every matching table will be searched, the amount of data processed will increase, as shown below.
Incidentally, there is another method of dividing data to reduce the amount processed, called partitioning. It is generally easier to use, so I don't think there is much need to create a sharded structure yourself.
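As a sketch of the wildcard form described in this side note, the query below reads three consecutive daily tables at once. The date range is an arbitrary choice for illustration, and the filter relies on the WHERE clause, which is explained in the next section. _TABLE_SUFFIX is BigQuery's pseudo-column that holds the part of the table name matched by "*".

```sql
-- Sketch: treat several daily shards as one table.
SELECT
  event_date,
  event_name
FROM
  `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_*`
WHERE
  -- Illustrative range: only tables events_20210118 to events_20210120.
  _TABLE_SUFFIX BETWEEN '20210118' AND '20210120'
```

Filtering on _TABLE_SUFFIX with constant values like this lets BigQuery skip the tables outside the range, so less data is processed than with an unfiltered events_* query.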
"Clause" to be combined with SELECT statement
A "clause" is a building block that can be added to a SQL statement; examples include the WHERE clause and the ORDER BY clause.
Specifying records with the WHERE clause: Query No.4
The WHERE clause is used when you want to narrow down the data to be retrieved.
It returns only the records for which the condition written after WHERE evaluates to true.
In this example, only records where event_name is "page_view" are retrieved.
SELECT
event_date,event_timestamp,event_name
FROM
`bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_20210118`
WHERE
event_name = "page_view"
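The condition does not have to be a single equality check. As a rough sketch, the variation below uses the IN operator to keep records whose event_name matches any value in a list. The "purchase" event name is an assumption chosen for illustration; if no such events exist for that day, the query simply returns only the "page_view" records.

```sql
-- Sketch: keep records matching either of two event names.
SELECT
  event_date,
  event_timestamp,
  event_name
FROM
  `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_20210118`
WHERE
  -- "purchase" is an illustrative assumption, not taken from the text above.
  event_name IN ("page_view", "purchase")
```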
Specifying record order using the ORDER BY clause: Query No.5
The ORDER BY clause is used when you want to sort the data you are retrieving.
You sort by writing the name of the column you want to sort by after ORDER BY.
Add ASC for ascending order or DESC for descending order; if neither is given, ascending order is used.
In this example, we will sort event_timestamp in ascending order.