How to bulk download Google Search Console data automatically
You can download up to 16 months worth of Google Search Console data using an API and some free software called RStudio. If you’ve used R before the whole set up process should only take a few minutes, and if not – it’s fairly easy to pick up once you know what to look for.
Having access to more than five times the amount of Search Query data gives you the ability to make year on year comparisons and spot trends, which is a vital piece of context for any Organic Search performance review.
The fact that any webmaster now has access to 16 months worth of data for free is a huge leap forwards compared to previous iterations of the Google Webmaster Tools API. The expansion to a 16 month window brings the opportunity to retroactively add to existing data you already have – or a second chance to grab the data that should have been recorded already. But bigger data sets mean you have to get smarter with how you go about downloading your data.
No one wants to download 480+ individual days of data, or try dealing with the many levels of human error that happen when combining all that data back together again. The simplest solution I have found is to use the Google Search Console API with RStudio to make batch calls and automatically download everything you want into nice neat CSV files.
If you’re looking for a way to get Google Search Console data off the internet and into your database, then this is for you.
What is RStudio?
You might have come across R or RStudio before if you’ve ever tried to work out the internal PageRank for pages of a site, or tried to tackle anything partciularly ambitious with maths. Essentially, it’s a free code studio that lets you easily install additional packages to do pretty much anything you like when it comes to data and dashboarding. If you’ve ever hit a wall with spreadsheets, R usually has a solution.
How do I run this from scratch?
- Download R and then RStudio, installing them both
- Install some additional packages needed to run Search Console API (see the subsection at the end)
- Create a new script using the code below (or just download and run from this script)
- Create a Google API service token, if you don’t have one already
- Update the script with your own parameters
What script do I need download Search Query data?
The exact code you need is open to tweaks. The Gist panel below has an efficient 37 lines of code I’ve written which you can use to download data between two points and is flexible enough that you can take this for one-off bulk downloads or stitch multiple versions of this script to download all combinations of metrics and dimensions for a single site in one sitting.
This is what the code looks like:
Explanation of what this code does:
The language used in R is pretty clear once you wrap your head around some of the basics. Below is a line by line explanation of what my script is doing, and why.
- Lines 1-3: Open up the three packages needed to run this script successfully.
- Line 5: Sets the location of your service token – make sure this is somewhere memorable that won’t accidentally get deleted and add in your own filepath. Remember that R needs folder structures to have / rather than the \ windows uses, so flip them round.
- Line 7: Uses the token to authorise your session with Google Search Console.
- Line 8: The List Websites command is an optional extra here, but useful as it’s one of the basic commands that auto refreshes your Oauth token. (You’ll know if this happens because you’ll get the “Error: Invalid Credentials” response).
- Line 10: Establishes a value called “brand”, which can be anything you like as long as you use it to indicate what this site is. Since you can’t put a URL into a filename this “brand” will be written in instead.
- Line 11: Establishes a value called “website”, which should be the exact URL of the site you have access to in Search Console
- Line 13: Sets your download dimensions. Date must always be included by default, but the second dimension can be one of the four, such as ‘query’. Beware that combining more than one dimension will trigger further sampling and you will have less granular data.
- Lines 14 + 15: These are the settings for start and end dates. I prefer to use times based on the current date rather than type in new dates each time. You should not set the end date any lower than -3 days, as this is usually the earliest time frame that GSC releases data. -487 days ago is approximately 16 months and is the maximum (or thereabouts) that you can go back to at this time.
- Line 17: This sets the search type. Options are ‘web’, ‘video’ and ‘image’. We typically think of web searches by default, but the option is there if you want to explore how well your site does in video and image search results
- Line 19 – 25: This is the core part of the query, which basically takes the different values you have established previously and applies them in a format expected by GSC. Some points to note are that your row limit specified can now be greater than 5000, and to make sure that your data returned is greater than 5,000 you should also use the walk_data parameter, which splits out your requests by date to make sure it’s as granular as possible.
- Line 27: This last section is optional. Basically in order to distinguish which site your data has come from (which can get confusing if you do this for multiple sites) we need to add in an extra column to the table of results. This line appends a new column to data1 called ‘site’
- Line 28: This column changes the value of every row in the newly built ‘site’ column to be populated with the ‘website’ value you set up in line 11.
- Line 29: Creates another new column, this time called ‘searchType’. We will use this to identify whether the data returned is from web, image or video Search Query data – as this is not clearly returned in the results.
- Line 30: This sets the default value of ‘searchType’ to be the ‘type’ value set in line 17.
- Line 31: This line re-arranges the columns of the table created to make sure that they are in a logical order. This line is optional, but beneficial for data planning.
Writing the Data to CSV
- Line 34: Creates a value called “filepath”. You should make sure that this is a logical directory you have write access to.
- Line 35: Creates a value called “filename” and makes sure that the defined brand, start and end dates are included within the name (separated by ‘-‘). Again this is optional, but for ease when creating multiple exports over many sites this is what is needed to accurately distinguish files from one another.
- Line 36: Combines the filepath and filename values into a single string that ends in “.csv”. You don’t have to do it this way, but keeping filepath and filename makes it a little more obvious when you come to make tweaks to where a file is saved or how it might be named.
- Line 37: This is the final command to write the stored table data1 as a CSV. The additional ‘row.names = FALSE’ command tells R not to output row numbers as an additional column.
What packages will I need to install?
This is a Google API Client Library for R with OAuth2 to help make authentication easier than other API solutions. This is also Shiny compatible which is really useful for making decent dashboards.
To install just run this line in RStudio:
Also made by Mark Edmondson, this is an R interface with Google Search Console API v3 compatibility, including Search Analytics. Using this in conjunction with AuthR above will let you authorise and stay logged in while running queries Search Analytics queries. This can be installed by running:
or the github dev version using:
(This runs installation via github rather than via CRAN, but you’ll need the devtools package installed first. This version usually has more bug fixes in.)
This package allows you to re-arrange data that’s returned from your queries, and can be installed by running:
Depending on how you prefer to authorise your account you’ll either need to continuously log in to your Google account via a web page that will pop up, or you’ll have to set some credentials to use permanently. In this example I have a service account .json file stored (the location of which is set in line 5), you can create your own service account key by visiting the Google Cloud platform credentials page and setting up a new Service Account key. This is pretty straightforward to do, just save it somewhere memorable.
Also, when running queries that take an extensive amount of time you may find that the credentials expire. If this happens run a non-search analytic command such as “list_websites()” and this will auto refresh. You may find it useful to only run queries over a 60 day periods and then bind all the data together at the end. E.g. if you had 5 tables of Search Query data you could merge them into a single table called “compiledqueries” as an extra step before exporting to CSV:
compiledqueries <- as.data.frame(rbind(query1,query2,query3,query4,query5))
Written by:Alex Darwin
Category:What we think
You may also like
/ 12 May 2019
Diversity in Digital with #GIRLTECHLCR
Funded in part by Leeds City Council, #GIRLTECHLCR is an amazing initiative to get more female students inspired to work in digital and tech. I was able to take part in the scheme as a part of Leeds Digital Festival 2019, delivering four 30-minute woRead more
/ 10 Apr 2019
Social Media. Not so Lush anymore?
Social media is an integral part of most of our daily lives – more than 3 billion people use some form of social (that’s over 40% of the current global population!), and brands are constantly on the lookout for new social strategies to help conneRead more