How to bulk download Google Search Console data automatically
Did you know that thanks to an update back in June, you can now download up to 16-months’ worth of Google Search Console data using an API and some free software called RStudio?
Pretty handy, right! Having access to more than five times the amount of Search Query data gives you the ability to make year-on-year comparisons and spot trends, which is a vital piece of context for any Organic Search performance review.
It’s also a huge leap forwards compared to previous iterations of the Google Webmaster Tools API. The expansion to a 16-month window brings the opportunity to retroactively add to any existing data you already have – or a second chance to grab the data that should have been recorded already.
However, bigger data sets mean you have to get smarter with how you go about downloading them. No one wants to download 480+ individual days of data, or deal with human error when combining all that data back together again.
So, what’s a webmaster to do? The simplest solution I have found is to use the Google Search Console API with RStudio to make batch calls and automatically download everything you want into nice neat CSV files. If you’ve used R before, then the whole set-up process should only take a few minutes, and if not – it’s fairly easy to pick up once you know what to look for.
If you’re an analyst looking for a way to get Google Search Console data off the internet and into your database, then this is for you.
What is RStudio?
You might have come across R or RStudio before if you’ve ever tried to work out the internal PageRank for pages of a site or tried to tackle any particularly ambitious maths. Essentially, it’s a free code studio for pretty much everything to do with data and dashboarding. When the time comes, and you hit a wall with spreadsheets, R usually has a solution.
How do I run this from scratch?
1. Download R and then RStudio, installing them both.
2. Install some additional packages needed to run Search Console API (see the subsection at the end).
3. Create a new script using the code below (or just download and run from this script).
4. Create a Google API service token, if you don’t have one already.
5. Update the script with your own parameters.
What script do I need to download Search Query data?
The exact code you need is open to tweaks. Below are 37 efficient lines of code I’ve written which you can use to download data between two points. It’s flexible enough that you can take this for one-off bulk downloads, or stitch multiple versions of this script to download all combinations of metrics and dimensions for a single site in one sitting.
So what does this code actually do?
The language used in R is pretty clear once you wrap your head around some of the basics. Below is a line-by-line explanation of what my script is doing, and why. At the very bottom of this post, you can find the script in full.
• Lines 1 – 3: Open up the three packages needed to run this script successfully.
library(googleAuthR) library(searchConsoleR) library(dplyr)
• Line 5: Sets the location of your service token – make sure this is somewhere memorable that won’t accidentally get deleted and add in your own filepath. Remember that R needs folder structures to have / rather than the \ windows uses, so flip them round.
service_token <- gar_auth_service(json_file="C:/Users/YOU/Documents/GSC-Downloads/JSON Service account/GSC-Darwin-605272efbc0b.json")
• Line 7: Uses the token to authorise your session with Google Search Console.
• Line 8: The List Websites command is an optional extra here, but it’s useful as it’s one of the basic commands that auto refreshes your OAuth token (you’ll know if this happens because you’ll get the “Error: Invalid Credentials” response).
• Line 10: Establishes a value called “brand”, which can be anything you like as long as you use it to indicate what this site is. Since you can’t put a URL into a filename this “brand” will be written in instead.
brand <- "YOUR-BRAND"
• Line 11: Establishes a value called “website”, which should be the exact URL of the site you have access to in Search Console.
website <- "https://YOUR-SITE.com/"
• Line 13: Sets your download dimensions. Date must always be included by default, but the second dimension can be one of the four, such as ‘query’. Beware that combining more than one dimension will trigger further sampling and you will have less granular data.
download_dimensions <- c('date','query')
• Lines 14 – 15: These are the settings for start and end dates. I prefer to use times based on the current date rather than type in new dates each time. You should not set the end date any lower than -3 days, as this is usually the earliest time frame that GSC releases data. -487 days ago is approximately 16-months, and is the maximum (or thereabouts) that you can go back to at the time of writing this.
start <- Sys.Date() - 487 end <- Sys.Date() - 3
• Line 17: This sets the search type. Options are ‘web’, ‘video’ and ‘image’. We typically think of web searches by default, but the option is there if you want to explore how well your site does in video and image search results.
type <- c('web')
• Line 19 – 25: This is the core part of the query, which basically takes the different values you have established previously and applies them in a format expected by GSC. Some points to note are that your row limit specified can now be greater than 5000, and to make sure that your data returned is greater than 5,000 you should also use the walk data parameter, which splits out your requests by date to make sure it’s as granular as possible.
data1 <- search_analytics(siteURL = website, startDate = start, endDate = end, dimensions = download_dimensions, searchType = type, rowLimit = 1000000, walk_data = c('byDate'))
• Line 27: This last section is optional. To distinguish which site your data has come from (which can get confusing if you’re doing this for multiple sites), we need to add in an extra column to the table of results. This line appends a new column to data1 called ‘site’.
data1$site <- NA
• Line 28: This column changes the value of every row in the newly built ‘site’ column to be populated with the ‘website’ value you set up in line 11.
data1$site <- website
• Line 29: Creates another new column, this time called ‘searchType’. We will use this to identify whether the data returned is from web, image or video Search Query data – as this is not clearly returned in the results.
data1$searchType <- NA
• Line 30: This sets the default value of ‘searchType’ to be the ‘type’ value set in line 17.
data1$searchType <- type
• Line 31: This line re-arranges the columns of the table created to make sure that they are in a logical order. This line is optional, but beneficial for data planning.
data1 <- data1 %>% select("site", download_dimensions,"clicks","impressions","ctr","position","searchType" )
Writing the Data to CSV
• Line 34: Creates a value called “filepath”. You should make sure that this is a logical directory you have write access to.
• Line 35: Creates a value called “filename” and makes sure that the defined brand, start and end dates are included within the name (separated by ‘-‘). Again this is optional, but for ease when creating multiple exports over many sites, this is what is needed to accurately distinguish files from one another.
filename <- paste("GSC",brand,start, end, sep = "_")
• Line 36: Combines the filepath and filename values into a single string that ends in “.csv”. You don’t have to do it this way, but keeping filepath and filename makes it a little more obvious when you come to make tweaks to where a file is saved or how it might be named.
output <- paste(filepath, filename, ".csv", sep = "")
• Line 37: This is the final command to write the stored table data1 as a CSV. The additional ‘row.names = FALSE’ command tells R not to output row numbers as an additional column.
write.csv(data1, output, row.names = FALSE)
What packages will I need to install?
This is a Google API Client Library for R with OAuth2 to help make authentication easier than other API solutions. This is also Shiny compatible which is really useful for making decent dashboards.
To install just run this line in RStudio:
Also made by Mark Edmondson, this is an R interface with Google Search Console API v3 compatibility, including Search Analytics. Using this in conjunction with AuthR above will let you authorise and stay logged in while running Search Analytics queries. This can be installed by running:
or the github dev version using:
(This runs installation via github rather than via CRAN, but you’ll need the devtools package installed first. This version usually has more bug fixes in.)
This package allows you to re-arrange data that’s returned from your queries, and can be installed by running:
Depending on how you prefer to authorise your account you’ll either need to continuously log in to your Google account via a web page that will pop up, or you’ll have to set some credentials to use permanently. In this example I have a service account .json file stored (the location of which is set in line 5), you can create your own service account key by visiting the Google Cloud platform credentials page and setting up a new Service Account key. This is pretty straightforward to do, just save it somewhere memorable.
Also, when running queries that take an extensive amount of time you may find that the credentials expire. If this happens run a non-search analytic command such as “list_websites()” and this will auto refresh. You may find it useful to only run queries over a 60 day periods and then bind all the data together at the end. E.g. if you had 5 tables of Search Query data you could merge them into a single table called “compiledqueries” as an extra step before exporting to CSV:
compiledqueries <- as.data.frame(rbind(query1,query2,query3,query4,query5))
The full script: