What's in it for you?
If you read until the end, you will be able to build your own custom connector for Fivetran, host it on the Google Cloud Platform and use it to continuously feed data from any application with an API into a database like Snowflake.
In my last blog post about Fivetran, I introduced the custom connector as a solution for when Fivetran does not offer a native connector for the application you want to connect to. I mentioned Slack as one of those very popular business SaaS applications for which Fivetran has not yet created a native connector. After reading part 1 and this step-by-step guide to “Building your own Custom Connector for Fivetran”, that will no longer stop you from enjoying all of Fivetran’s benefits. If you want to recap Fivetran’s benefits and the concept of custom connectors in general, jump back to part 1. If you’re good to go for the step-by-step guide on how to build one, keep on reading.
Here at Biztory we always wanted to do some analytics on our Twitter account. What have we posted so far? And who likes our posts? Those were some of the questions we wanted answered, and since we’re constantly posting new things, we wanted those answers on an ongoing basis. We realised we would need a data pipeline that feeds the relevant information from Twitter into a database and lets us visualise it in a fancy Tableau dashboard. At the time, there was no native Fivetran connector for Twitter, so we decided to develop a custom connector and use Fivetran to schedule the data pipeline and feed the data into our Snowflake database. Let me show you how we did it.
The Twitter API
It always starts with the API of the application you are retrieving the data from. Enter the Twitter API v2, which lets you programmatically interact with all sorts of data on the platform, from retrieving user information to posting content. If you need to refresh your memory about APIs in general, read our blog post about Application Programming Interfaces (APIs).
To perform analytics on our Twitter account and to answer those questions mentioned in the previous chapter, we need to retrieve all of Biztory’s Twitter posts and the liking users per post from Twitter’s API. We want the posts to be stored in one table and the liking users in another. A third table will be used to store our account details. It looks like the following in Snowflake.
Image 2: Screenshot of the final tables in Snowflake that will be fed by Fivetran using the custom Twitter connector
After we have identified the data we need and the endpoints we need to connect to, we can start writing the code.
Let’s Get Into it - Four Steps to Your Own Custom Connector!
A custom connector can be written in Node.js, Python or Java. We decided to go for Node.js and to host the function on the Google Cloud Platform. A sample function for each of the three major cloud platforms can be downloaded from Fivetran’s website here. We use the four steps depicted in the image below to set up a custom function with Node.js. It is important that you follow them precisely, as otherwise Fivetran will not feed the data properly into your warehouse.
Image 3: The four steps for setting up a custom connector in Fivetran
Step 0 - Checking Fivetran’s Request
First we initialise the user_id (row 17) and the path_tweets (row 18). Then we need to check the request coming from Fivetran (rows 21-24). The purpose of this is to provide a since_id that gets attached to the path we are going to call later to retrieve the data. The since_id tells the Twitter API to only return tweets posted after the tweet whose id equals since_id. This parameter only has a value after the initial sync. The second check is on the next_token, another parameter that gets added to the path to paginate through the response pages from the Twitter API (rows 27-30). The next_token only has a value when the function gets called back by Fivetran because the hasMore flag is true. No worries, this will be explained in more detail in step 3.
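For readers who prefer text over screenshots, here is a minimal sketch of this step. It assumes Fivetran POSTs its saved state in the request body; the user_id value and the query parameters on the path are illustrative, not requirements.

```javascript
// Hypothetical Twitter user id and base path for the tweets endpoint.
const user_id = "1234567890";
const base_path = `/2/users/${user_id}/tweets?tweet.fields=created_at&max_results=10`;

function applyState(path, state) {
  // A next_token means we are mid-pagination, so we must not also
  // narrow the window with since_id.
  if (state && state.next_token) {
    return `${path}&pagination_token=${state.next_token}`;
  }
  // After the initial sync, since_id limits the call to tweets newer
  // than the last one Fivetran has already stored.
  if (state && state.since_id) {
    return `${path}&since_id=${state.since_id}`;
  }
  return path;
}

// In the cloud function, state would come from Fivetran's request body,
// e.g. applyState(base_path, req.body.state).
const path_tweets = applyState(base_path, { since_id: "1500000000000000000" });
```

Note how a next_token takes precedence over a since_id; the reason for that ordering is explained in step 3.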
Image 4: Screenshot of the code for step 0 “checking Fivetran’s request”
Step 1 - Calling the API Endpoint
Logically, to get the data from Twitter we need to call the relevant endpoints of the Twitter API. For that to work, we first needed to register a developer account with Twitter to receive our API access keys. You can do that here.
The API endpoints that we need for our case are:
- /2/users/<user_id>/tweets for the tweets of any account with user_id
- /2/tweets/<t.id>/liking_users for the liking users of any tweet with t.id
The code for “step 1 - calling the API endpoints” looks like the following:
Image 5: Screenshot of the code for step 1 “calling the API endpoints”
We set up the connection to the endpoint (rows 33-38) and get the tweets as a response via getRes in row 39. The async function starting in row 46 takes the tweets (timeline) from before and loops over each one to retrieve the liking users for that tweet. The tweets are stored as JSON objects in the timeline variable and the likes as JSON objects in likes. Both are handed over to the withTweetsAndLikes function to be flattened into tables.
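A hedged sketch of this step, using Node’s built-in fetch (Node 18+) rather than the exact HTTP client in our screenshot; the bearer token comes from the Twitter developer account mentioned above, and the helper names here are illustrative.

```javascript
const API_HOST = "https://api.twitter.com";

// Pure helper: build the URL and auth header for any v2 endpoint.
// Every v2 call we make is a plain GET with a Bearer token.
function buildRequest(path, bearerToken) {
  return {
    url: `${API_HOST}${path}`,
    options: { headers: { Authorization: `Bearer ${bearerToken}` } },
  };
}

async function getTimelineWithLikes(userId, bearerToken) {
  // First call: the account's tweets (the "timeline").
  const tweetsReq = buildRequest(`/2/users/${userId}/tweets?max_results=10`, bearerToken);
  const timeline = await (await fetch(tweetsReq.url, tweetsReq.options)).json();

  // Second round of calls: the liking users for each tweet,
  // kept next to the tweet id they belong to.
  const tweet_likes = [];
  for (const t of timeline.data || []) {
    const likesReq = buildRequest(`/2/tweets/${t.id}/liking_users`, bearerToken);
    const likes = await (await fetch(likesReq.url, likesReq.options)).json();
    tweet_likes.push({ tweet_id: t.id, likes });
  }
  return { timeline, tweet_likes };
}
```

The timeline and tweet_likes objects returned here are what gets handed over to withTweetsAndLikes in the next step.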
Step 2 - Creating Flat Tables
After the data has been retrieved from the API, we need to flatten it into tables for Fivetran to handle. The code for that looks like the following:
Image 6: Screenshot of the code for step 2 “creating flat tables”
Within the withTweetsAndLikes function, we first initialise a few important variables that you partially know from Step 0 - Checking Fivetran’s Request. The next_token and the since_id are initialised along with empty arrays for tweets, users and likes. The arrays will be filled with the respective data from the JSON objects coming from Step 1 - Calling the API Endpoints.
First, we go through the tweets (rows 97-109) with a for loop and store the id (tweet id), user_id, created_at and text inside the tweets array using the push function. After that we loop over the user information, which is also included in the timeline variable, and store it in the users array (rows 112-119). Last but not least, and this was the trickiest for loop, we go inside the nested tweet_likes only when it contains data and push the retrieved information to the likes array.
After all that is done, we look for the new value for since_id in the meta object within timeline, but we only set the since_id if no previous_token exists in the meta object. This corresponds to the first page of the paginated response and makes sure Fivetran only saves the newest_id of the latest 10 posts.
Note: Twitter’s timeline response provides a maximum of 10 tweets per page and a next_token to retrieve the next 10 tweets through a follow-up call.
We also check whether a next_token exists in the meta object and, if so, assign it to the next_token variable.
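Put together, withTweetsAndLikes could look roughly like this. The response shapes (data, includes.users, meta.newest_id, meta.next_token) follow Twitter’s v2 format, but the exact column names are assumptions based on the tables described above.

```javascript
function withTweetsAndLikes(timeline, tweet_likes) {
  const tweets = [], users = [], likes = [];
  let since_id = null, next_token = null;

  // One flat row per tweet for the tweets table.
  for (const t of timeline.data || []) {
    tweets.push({ id: t.id, user_id: t.author_id, created_at: t.created_at, text: t.text });
  }

  // User details travel in the includes part of the timeline response.
  for (const u of (timeline.includes && timeline.includes.users) || []) {
    users.push({ id: u.id, name: u.name, username: u.username });
  }

  // The likes responses are nested per tweet and may contain no data at all.
  for (const tl of tweet_likes) {
    for (const u of (tl.likes && tl.likes.data) || []) {
      likes.push({ tweet_id: tl.tweet_id, user_id: u.id });
    }
  }

  const meta = timeline.meta || {};
  // Only the first page (no previous_token) carries the newest_id we keep.
  if (!meta.previous_token && meta.newest_id) since_id = meta.newest_id;
  if (meta.next_token) next_token = meta.next_token;

  return { tweets, users, likes, since_id, next_token };
}
```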
Step 3 - Preparing Response to Fivetran
In this final step we prepare the JSON object that gets sent back to Fivetran. The code for this step looks like this:
Image 7: Screenshot of the code for step 3 “preparing response to Fivetran”
This section must always have the same structure. It is really important that you stick to it, as otherwise Fivetran will fail during the initial sync run.
Fivetran expects four nested objects - state, schema, insert and hasMore. The state object can take any variables you want to use in your function to describe the state of the next request. In our case we use it to save the since_id and the next_token described in Step 2.
Note: In Step 0 - Checking Fivetran’s Request we check if the state variable contains a since_id and a next_token. If it contains a next_token, we do not attach the since_id to the path. Otherwise, instead of getting the next page of Twitter’s paginated response, we would get the tweets after the tweet where id equals the since_id.
The schema object is used by Fivetran to create the tables inside your destination. In our case, Fivetran will create three tables in Snowflake, the tweets table, users table and the likes table. We also set primary keys for every table which Fivetran uses to deduplicate the data automatically.
In the insert object, we simply provide the flat tables created in Step 2 - Creating Flat Tables, which Fivetran uses to insert the data into Snowflake.
Last but not least, we set the hasMore flag to true if we have set a next_token in Step 2 - Creating Flat Tables so that Fivetran knows to call back the function to retrieve more data.
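As a sketch, building that response object could look like the following; the helper name buildFivetranResponse is ours for illustration, and the primary-key choices match the three tables described above.

```javascript
function buildFivetranResponse(flat) {
  return {
    // state is sent back to us by Fivetran on the next request (see step 0).
    state: { since_id: flat.since_id, next_token: flat.next_token },
    // schema tells Fivetran which tables to create in the destination and
    // which primary keys to use for automatic deduplication.
    schema: {
      tweets: { primary_key: ["id"] },
      users: { primary_key: ["id"] },
      likes: { primary_key: ["tweet_id", "user_id"] },
    },
    // insert carries the flat rows built in step 2.
    insert: { tweets: flat.tweets, users: flat.users, likes: flat.likes },
    // hasMore tells Fivetran to call the function back for the next page.
    hasMore: Boolean(flat.next_token),
  };
}
```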
Setting up the Cloud Function
Provided you have followed the instructions above precisely, the next step is simply to publish the function to a Google Cloud project. You can follow the instructions on Fivetran’s website here. It is important to set the cloud function trigger to ‘HTTP’ and to uncheck “allow unauthenticated invocations”. This makes sure your function can only be invoked by authorised parties while the invocation happens via the HTTP protocol.
The last step is to set up the connector in Fivetran. Two things must be taken care of here. First, you have to allow the Fivetran user to invoke your cloud function; that can be done in the settings of your cloud function on the Google Cloud Platform. Second, you have to provide the HTTP URL of your Google Cloud function to the Fivetran connector. These steps can also be reviewed in more detail here.
If you follow these steps and Fivetran does not present you with any error messages after testing the connection, you’re ready to go ahead with the initial sync. Provided your code is correct (Google Cloud already checks the syntax when you publish the function), you will see your data popping up in the destination you selected for the connector. You don’t have to build any tables or schemas in your database yourself: as long as Fivetran has access to it, it will create the schemas and tables for you as defined in your cloud function. And then you’ll be able to come up with a fancy dashboard like ours about tweets and likes on Biztory’s Twitter account.
Image 8: Screenshot of our Twitter Analytics dashboard for demonstration of Fivetran Custom Connector
We'll gladly show you how Fivetran will save you time and resources, make your data engineer's life easier and finally provide a stable and scalable solution to data pipelining. You can always start a free trial, just get in touch with us and we'll hook you up!