Aditya Rachman Putra

Scraping the Surface of Airtable (Part 2)

Sky Photo by jimmy teoh from Pexels

Before we start, let’s appreciate the fact that the first part is two years old! Anyway, after two years of working on other things, I came back around to the same problem today. It’s still the same idea as part one: how can I connect to multiple bases without the need of logging into Airtable? This part builds on top of the first one, or to summarize: it turns out I was on the right track, and requestId doesn’t really matter too much (maybe I made a mistake copying it in the last post?). Anyway, in this post I will talk about several things:

If you want to follow along, you can find the notebook and source code on my github, or you can run it on Google Colab. If you just want to use the package, you can install it from pip. Use the code at your own risk, as this behavior is not guaranteed to be consistent over time (I don’t know if the Airtable team will decide to change it anytime soon). Okay, first let’s address the elephant in the room.

Why Not Airtable API

Two years ago, when I was writing the first part, the API itself was still a work in progress by the Airtable team, so I basically had no choice but to scrape my data. But today, now that Airtable has official API support, why did I decide not to use the API anyway? I have three main reasons:

So, because of those three reasons, I decided to create a solution that works using only the shared view link. For any other purpose I really recommend using their official API, as that’s the way it’s intended to be used :D.

Why not Headless Browser

Using a headless browser, I could simply head over to the link, click on the “Download CSV” button, and be done with it. And to be honest it seems to be the better option, but I want the solution to be as “light” as possible, and a headless browser is not the smallest program out there. Thus I decided to work using only a minimal amount of Python packages. In this case, the solution practically only needs beautifulsoup4 (plus requests) to work.

Current Solution

If you want to try this out without creating a new base, you could use my experimentation base to follow along.

I’m still using the same method as in the first part, and the first thing I need to do is create the secret shared view link. To do so, just head over to the table of interest (in this case we will use the Stories table and the All stories view), then click on the Share view button and copy the generated grid view link. Oh, and don’t forget to keep Allow viewers to copy data out of this view activated.

Shared Table

For the rest of this article we’ll use the shared view link generated from that base:

Get the necessary Information From the Response

What Necessary Info?

I’ll explain a bit about how I decided what information is necessary here. The first thing I did was fire up the devtools, look at the Network tab, then click that “Download CSV” button. I found this to be the request fired by my browser to download the CSV.

What does that even mean? To find out, I headed over to Postman and let it parse the parameters for me. The result: the request hits the endpoint {viewId}/downloadCsv?{query_params}, with viewId as the id of our shared view, and the query params as shown below.

Query Parameter
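If you’d rather not reach for Postman, the standard library can do the same parsing. This is just a sketch, and the URL below is a made-up example (the domain, ids, and parameter values are placeholders, not a real capture), with requestId and accessPolicy standing in for the kind of params the request carries:

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical downloadCsv request URL copied from the network tab
url = ("https://example.com/viwAbc123/downloadCsv"
       "?accessPolicy=abc%7B123&requestId=req456")

parts = urlsplit(url)
params = parse_qs(parts.query)

print(parts.path)  # the {viewId}/downloadCsv part
print(params)      # note: parse_qs percent-decodes the values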

Okay, the next task is finding out which parameters are necessary and how to generate them. I did it by eliminating one parameter at a time. After an hour of trial and error, I found that practically no headers need to be explicitly set, and for the request to succeed we need the following parameters:

With those in mind let’s move on to retrieve those data from the shared view page.

Retrieve the Information

After trying another GET request to the shared view URL and searching for the parameter names, I found that the script tag right after the title tag contains all the necessary information.

Source File

After finding out where the wanted information is, we can use some regex magic to extract it (if you want to know how the regex works, an interactive regex tester will help a lot in understanding what the patterns do).

import re

view_id = re.search(r"(viw[a-zA-Z0-9]+)", str(script)).group(1)
access_policy = re.search(r"accessPolicy=([a-zA-Z0-9%*]+)", str(script)).group(1)
app_id = re.search(r"\"x-airtable-application-id\":\"(app[a-zA-Z0-9]+)", str(script)).group(1)
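To see what those patterns pull out, here’s a tiny runnable example on a fabricated script snippet (the ids, key names, and access policy below are made-up placeholders, not real Airtable values; real pages embed them inside a much larger inline JSON blob):

```python
import re

# Fabricated stand-in for str(script)
script = ('{"id":"viwAbc123","urlWithParams":"?accessPolicy=abc%7B123",'
          '"x-airtable-application-id":"appXyz789"}')

view_id = re.search(r"(viw[a-zA-Z0-9]+)", script).group(1)
access_policy = re.search(r"accessPolicy=([a-zA-Z0-9%*]+)", script).group(1)
app_id = re.search(r"\"x-airtable-application-id\":\"(app[a-zA-Z0-9]+)", script).group(1)

print(view_id, access_policy, app_id)  # viwAbc123 abc%7B123 appXyz789
```

Note the `%` and `*` inside the accessPolicy character class: the value is already percent-encoded in the page source, so the pattern has to accept those characters.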

Build URL to Download CSV and Fire!

With those parameters the last step is building the URL needed to download the CSV, and use it to send the GET requests.

# build the download URL from the view id and the collected query parameters
self.csv_url = f"{self.view_id}/downloadCsv?"
for (k, v) in self.params.items():
	self.csv_url += k + "=" + v + "&"
self.csv_url = self.csv_url[:-1]  # drop the trailing "&"
r = requests.get(self.csv_url)
r.encoding = "utf-8"
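If you’d rather not concatenate the query string by hand, urllib.parse.urlencode can build it. One catch: the accessPolicy value is already percent-encoded when we scrape it, so the default quoting would double-encode it. A sketch with made-up values:

```python
from urllib.parse import urlencode

# Made-up values standing in for what was scraped from the page
view_id = "viwAbc123"
params = {"accessPolicy": "abc%7B123"}  # already percent-encoded

# pass the value through untouched so the encoding isn't doubled
query = urlencode(params, quote_via=lambda v, *a, **kw: v)
csv_url = f"{view_id}/downloadCsv?{query}"
print(csv_url)  # viwAbc123/downloadCsv?accessPolicy=abc%7B123
```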

It will return the CSV data that we want as a string. You can save it into a file, use StringIO to feed it into pandas, and so on. By the way, I believe Airtable uses UTF-8 to encode the CSV data, so make sure to set the response encoding after receiving the response.
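For example, the returned string can be parsed with the standard library alone (the CSV content here is a made-up sample standing in for `r.text`, not real table data):

```python
import csv
from io import StringIO

# Made-up stand-in for the r.text returned by the downloadCsv request
csv_text = "Name,Status\nFirst story,Done\nSecond story,In progress\n"

rows = list(csv.DictReader(StringIO(csv_text)))
print(rows[0]["Name"])  # First story
print(len(rows))        # 2
```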

Finished Script

Again, if you just want the finished product, you can clone the finished package from my github repository, or install it using pip and use it as a CLI.

# install the package
pip install airscraper

# pull data and put it into csv
airscraper {url}

Where to go from Here?

Well, there are several things we can build on top of this idea

For the next part, I think I’ll go over how to use this as part of an ETL pipeline on Google Cloud Platform using Functions and BigQuery, but seeing that it took me two whole years to continue the first one, let’s see how long the next part will take :) For now, thanks for reading, and if you’re interested in further discussion, feel free to hit me up in the comments or on my twitter @BanditelolRP

#Learning #Scraping #Airtable