Auto ingestion into Snowflake: an alternative to Snowpipe using External Tables with Snowflake Streams

Snowflake’s auto ingestion service Snowpipe has been around since Dec 2018 and, in my experience, has proven to be an excellent method for automatically ingesting data into Snowflake as it lands in cloud storage.

Snowpipe with auto ingestion requires additional cloud services over and above the simple cloud storage containers your data lands in.

For Snowpipe auto ingestion on Azure (for example) you need:

  1. Event Grid
  2. Storage Queue

I recently had a situation where provisioning these additional Azure components was problematic and we needed to come up with a workaround.

At the time, Snowflake had just made External Tables generally available (GA) (Jan 2021) and, in addition, External Tables could also have Streams on them, which got me thinking about how I could leverage this new functionality for my needs.

Below is the pattern I ended up with.

This pattern involves the following 5 steps in Snowflake:

  1. Create an External Table with auto_refresh=false
  2. Create a Stream on the External Table
  3. Create Snowflake Task #1 (manual refresh of the External Table metadata)
  4. Create Snowflake Task #2 (run ‘myAwesomeDataMunchingStoredProcedure’ from step 5)
  5. Create SQL code or a Stored Procedure to process the stream data
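
To make steps 1 and 2 concrete, here’s a minimal sketch; the object names, the Azure location and the Parquet file format are illustrative assumptions only.

  -- Step 1: External Table over the landing location, with auto refresh disabled
  CREATE OR REPLACE STAGE my_lake_stage
    URL = 'azure://myaccount.blob.core.windows.net/landing/sales/'
    CREDENTIALS = (AZURE_SAS_TOKEN = '<sas-token>');

  CREATE OR REPLACE EXTERNAL TABLE ext_sales
    WITH LOCATION = @my_lake_stage
    FILE_FORMAT = (TYPE = PARQUET)
    AUTO_REFRESH = FALSE;

  -- Step 2: insert-only Stream on the External Table
  CREATE OR REPLACE STREAM ext_sales_stream
    ON EXTERNAL TABLE ext_sales
    INSERT_ONLY = TRUE;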

 

So, what’s going on here? To understand this pattern, the main things to focus on are the two Snowflake Tasks and what they’re doing.
Task #1 is set to run at x minute intervals and performs an External Table refresh.

Figure 1. Task #1
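
A minimal sketch of what Task #1 might contain (the task name, warehouse and schedule are my own placeholders):

  -- Task #1: refresh the External Table metadata on a schedule.
  -- The refresh is a metadata operation, so the warehouse shouldn't need to resume for it.
  CREATE OR REPLACE TASK t_refresh_ext_sales
    WAREHOUSE = my_wh
    SCHEDULE = '15 MINUTE'
  AS
    ALTER EXTERNAL TABLE ext_sales REFRESH;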


When the External Table is refreshed by Task #1 any new files/data are bookmarked in the Snowflake Stream object from step 2.
Task #2 runs after Task #1 (see the AFTER clause in Figure 2 below) and only runs the Stored Procedure (see the CALL statement in Figure 2) when there is data in the Stream object, i.e. when data files have newly arrived in the Data Lake / Storage folder(s).
 
Figure 2. Task #2
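
A corresponding sketch of Task #2, showing the AFTER clause, the WHEN condition on the Stream, and the CALL (again, the names are placeholders):

  -- Task #2: only does any work when the Stream has captured new files
  CREATE OR REPLACE TASK t_process_ext_sales
    WAREHOUSE = my_wh
    AFTER t_refresh_ext_sales
  WHEN SYSTEM$STREAM_HAS_DATA('EXT_SALES_STREAM')
  AS
    CALL myAwesomeDataMunchingStoredProcedure();

  -- Tasks are created suspended; resume the child first, then the root
  ALTER TASK t_process_ext_sales RESUME;
  ALTER TASK t_refresh_ext_sales RESUME;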

So, the $64,000 question here is how does this pattern compare to using Snowpipe from a cost and performance perspective?  The answer is the consultant’s standard, “it depends”.
Things to consider when evaluating this pattern:

  • Running Task #1 is a metadata operation and doesn’t require a running Warehouse, so it incurs no direct cost (you may incur some minimal Cloud Services cost)
  • Running Task #2 depends on there being data to process in the Stream (new data).
    • No data = no run = no cost.
  • When Task #2 does run you will incur a cost of at least 1 minute’s worth of Snowflake Credits.
  • Snowpipe has somewhat limited transformation capability, and you often end up with a Stream/Task pattern in addition to Snowpipe anyway to process the data downstream, so you end up paying essentially the same cost as this pattern on top of the Snowpipe cost (a sketch of that downstream processing follows this list).
  • You have control over the size of the Warehouse that runs the Stored Procedure.
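
For completeness, here is one possible shape for the step 5 Stored Procedure, written in Snowflake Scripting; the target table, columns and procedure body are purely illustrative, and real ‘data munching’ would be more involved:

  CREATE OR REPLACE PROCEDURE myAwesomeDataMunchingStoredProcedure()
  RETURNS VARCHAR
  LANGUAGE SQL
  AS
  $$
  BEGIN
    -- Consuming the Stream in DML advances its offset, so each batch of new files is processed once
    INSERT INTO raw.sales_landing (payload, loaded_at)
    SELECT value, CURRENT_TIMESTAMP()
    FROM ext_sales_stream;
    RETURN 'Stream processed';
  END;
  $$;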

This pattern could be considered when you have bursts of data (every 15 mins for example) that are reasonably large and/or need some complexity when processing the raw data into other layers of your solution, such as delete detection, SCD 2, late/early arriving facts, etc.
And of course, when you have constraints like those that drove the creation of this pattern in the first place.
Have you thought about the data collaboration possibilities?

Interestingly, you could easily implement this pattern with 3rd party data. Get a SAS key from the data provider and away you go. Again, I’m not suggesting this as a replacement for Snowflake’s awesome Data Sharing capabilities, but I can see plenty of opportunities to share data using this pattern, especially with data providers/partners who haven’t got Snowflake on their roadmap.
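
As an example, the external stage over a provider’s container might look something like this (the URL and token are placeholders for whatever the provider issues); the External Table, Stream and Tasks are then created over that stage exactly as above:

  CREATE OR REPLACE STAGE partner_stage
    URL = 'azure://partneraccount.blob.core.windows.net/shared-data/'
    CREDENTIALS = (AZURE_SAS_TOKEN = '<sas-token-from-provider>');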

In Conclusion

Hopefully this blog gives you some food for thought and another tool to add to your Snowflake toolbox. While I’m not necessarily advocating this pattern over others, I can think of many situations where it could be utilized. Can you?

I’d love to hear your comments.
