One of the unique characteristics that make Snowflake stand out is its native support for semi-
structured data. Because of their versatility, these semi-structured data have become the go-to
method for sending and storing data. When it comes to analyzing semi-structured data, however, the
flexibility and expressiveness of the data added levels of complication. As a result, many businesses
have struggled to combine structured and semi-structured data in order to provide timely business
insights.
Snowflake is a cutting-edge technology that can help you with your semi-structured data.

Semi-structured data could be a little perplexing if you're just starting out on your snowflake
adventure. That's why I decided to publish a little blog to assist you with it.

What is semi-structured data?

Semi-structured data comprises tags or other forms of mark-up that identify specific, distinct entities
within the data, while not conforming to the requirements of typical structured data.

Nested data structures and the lack of a predefined schema are two major characteristics that
separate semi-structured data from structured data:

  • Semi-structured data does not need the creation of a schema in advance and may change
    overtime, with new properties being added at any moment.

  • Semi-structured data can include n-level hierarchies of nested information, as opposed to
    structured data, which depicts data as a flat table.

Email is a well-known example of semi-structured data. Although more complex analytical tools are
required for thread tracking, near-deduplication, and concept searching, emails inbuilt metadata
allows categorization and keyword searches without the need for any extra tools.

Snowflake currently supports semi-structured data formats in the form of XML, JSON, ORC, Avro,
and Parquet. Snowflake's document has an up-to-date list of supported file types.

Semi-structured data in Snowflake

Snowflake is unlike any other data warehousing technology on the market today since it was built
from the ground up to load and query semi-structured data (e.g. JSON and XML) without requiring
any transformation. Because an increasing proportion of mission-critical data is semi-structured
today, Snowflake makes it simple to expose key insights to enable strategic decision-making.

Snowflake allows semi-structured data to be directly saved in the database in its natural format. This
makes use of a VARIANT column, which is a special data type. The columnated metadata for the
document is recorded and saved as semi-structured data is inserted into a Snowflake VARIANT
column. To load semi-structured data, simply construct a table with a single VARIANT column and
then use Snowflake's COPY command to load data from one or more files containing the semi-
structured data. You can also use commands like INSERT and CREATE TABLE AS SELECT to populate tables with VARIANT columns, as well as create tables with numerous columns that are a mix of standard and VARIANT data types. Here’s a CREATE TABLE AS SELECT illustration:


CREATE OR REPLACE TABLE <table_name>
( <column_name> VARIANT)
AS SELECT <column_name>
FROM <table_name>
DESC TABLE <table_name>


It's then quite simple to run queries on the semi-structured VARIANT column. SQL extensions make it
possible to access characteristics inside semi-structured data once it has been put into Snowflake. A
single colon (‘:') separates the standard dot notation path from the relational path when describing
characteristics within a semi-structured data item. Subscript notation can even be used to refer to
items in recurring arrays. Insert a colon : between the VARIANT column name and any first-level
element:

<column_name>:<level1_element>


Unfortunately, it is not possible to dynamically retrieve all objects in an array with : and [ ] notation
alone. This is where flattening comes in. FLATTEN is a table function that turns semi-structured data
into a relational structure and makes it act like a table by using a VARIANT column. Here’s a LATERAL FLATTEN example:

SELECT <column_name>, <column_name>
FROM <table_name>
, LATERAL FLATTEN (input = > <column_name>:<level1_element> );

 


To illustrate, imagine that you have a data set of orders stored in JSON format. The following could
be an example of a typical query:


CREATE OR REPLACE TABLE variant_example (
json_string variant
)
;
INSERT INTO variant_example
SELECT parse_json(
'{"order_id": "1",
"first_name": "John",
"last_name": "Smith",
"products": [
{"product_name": "laptop", "quantity": 1},
{"product_name": "usb drive", "quantity": 2}
]}'
)
;

SELECT
json_string:order_id::integer as order_id,
json_string:first_name::string as first_name,
json_string:last_name::string as last_name,
prod.value:product_name::string as product,
prod.value:quantity::integer as quantity
FROM variant_example,
LATERAL FLATTEN (input => json_string:products) AS prod
;

Because of Snowflake's design, SQL can be used to query both semi-structured and structured data.
In a single query, you may join, window, compare, and compute structured and semi-structured data.
This allows for the elimination of redundant systems and stages, as well as the simplification of data
pipelines, improved performance, and a shorter period between when data is created and when it
can be evaluated and accessed.

Hopefully, this little introduction blog has helped you realize that you can use semi-structured data in
Snowflake and why you should care about it.


- Saba

Author
Saba Gebreyohannes

Saba Gebreyohannes

I help you get insights from your data. Easier. Faster.

Read more articles of this author
Let's discuss your data challenges

Join our community of data enthusiasts

Get industry insights, expert tips and Biztory news sent straight to your inbox with our monthly newsletter.