One of the unique characteristics that make Snowflake stand out is its native support for semi-
structured data. Because of their versatility, these semi-structured data have become the go-to
method for sending and storing data. When it comes to analyzing semi-structured data, however, the
flexibility and expressiveness of the data added levels of complication. As a result, many businesses
have struggled to combine structured and semi-structured data in order to provide timely business
insights.
Snowflake is a cutting-edge technology that can help you with your semi-structured data.
Semi-structured data could be a little perplexing if you're just starting out on your snowflake
adventure. That's why I decided to publish a little blog to assist you with it.
What is semi-structured data?
Semi-structured data comprises tags or other forms of mark-up that identify specific, distinct entities
within the data, while not conforming to the requirements of typical structured data.
Nested data structures and the lack of a predefined schema are two major characteristics that
separate semi-structured data from structured data:
-
Semi-structured data does not need the creation of a schema in advance and may change
overtime, with new properties being added at any moment. -
Semi-structured data can include n-level hierarchies of nested information, as opposed to
structured data, which depicts data as a flat table.
Email is a well-known example of semi-structured data. Although more complex analytical tools are
required for thread tracking, near-deduplication, and concept searching, emails inbuilt metadata
allows categorization and keyword searches without the need for any extra tools.
Snowflake currently supports semi-structured data formats in the form of XML, JSON, ORC, Avro,
and Parquet. Snowflake's document has an up-to-date list of supported file types.
Semi-structured data in Snowflake
Snowflake is unlike any other data warehousing technology on the market today since it was built
from the ground up to load and query semi-structured data (e.g. JSON and XML) without requiring
any transformation. Because an increasing proportion of mission-critical data is semi-structured
today, Snowflake makes it simple to expose key insights to enable strategic decision-making.
Snowflake allows semi-structured data to be directly saved in the database in its natural format. This
makes use of a VARIANT
column, which is a special data type. The columnated metadata for the
document is recorded and saved as semi-structured data is inserted into a Snowflake VARIANT
column. To load semi-structured data, simply construct a table with a single VARIANT
column and
then use Snowflake's COPY
command to load data from one or more files containing the semi-
structured data. You can also use commands like INSERT
and CREATE TABLE AS SELECT
to populate tables with VARIANT
columns, as well as create tables with numerous columns that are a mix of standard and VARIANT
data types. Here’s a CREATE TABLE AS SELECT
illustration:
CREATE OR REPLACE TABLE <table_name>
( <column_name> VARIANT)
AS SELECT <column_name>
FROM <table_name>
DESC TABLE <table_name>
It's then quite simple to run queries on the semi-structured VARIANT
column. SQL extensions make it
possible to access characteristics inside semi-structured data once it has been put into Snowflake. A
single colon (‘:') separates the standard dot notation path from the relational path when describing
characteristics within a semi-structured data item. Subscript notation can even be used to refer to
items in recurring arrays. Insert a colon : between the VARIANT
column name and any first-level
element:
<column_name>:<level1_element>
Unfortunately, it is not possible to dynamically retrieve all objects in an array with : and [ ] notation
alone. This is where flattening comes in. FLATTEN
is a table function that turns semi-structured data
into a relational structure and makes it act like a table by using a VARIANT
column. Here’s a LATERAL FLATTEN
example:
SELECT <column_name>, <column_name>
FROM <table_name>
, LATERAL FLATTEN (input = > <column_name>:<level1_element> );
To illustrate, imagine that you have a data set of orders stored in JSON format. The following could
be an example of a typical query:
CREATE OR REPLACE TABLE variant_example (
json_string variant
)
;
INSERT INTO variant_example
SELECT parse_json(
'{"order_id": "1",
"first_name": "John",
"last_name": "Smith",
"products": [
{"product_name": "laptop", "quantity": 1},
{"product_name": "usb drive", "quantity": 2}
]}'
)
;
SELECT
json_string:order_id::integer as order_id,
json_string:first_name::string as first_name,
json_string:last_name::string as last_name,
prod.value:product_name::string as product,
prod.value:quantity::integer as quantity
FROM variant_example,
LATERAL FLATTEN (input => json_string:products) AS prod
;
Because of Snowflake's design, SQL can be used to query both semi-structured and structured data.
In a single query, you may join, window, compare, and compute structured and semi-structured data.
This allows for the elimination of redundant systems and stages, as well as the simplification of data
pipelines, improved performance, and a shorter period between when data is created and when it
can be evaluated and accessed.
Hopefully, this little introduction blog has helped you realize that you can use semi-structured data in
Snowflake and why you should care about it.
- Saba