Google Dataprep: a rookie technician & the elegance of simplicity

It sometimes seems that every day heralds another announcement by Google of another feature or tool that gives those of us who work in data a sense of both excitement and dread for the security of our profession. The release of Dataprep in Beta may have slipped somewhat under the radar, but its associated promise of useable data more quickly should have us all sitting up in our seats.

In the video announcement for Dataprep, Google barely stopped short of forecasting mass redundancies amongst data engineers. Whilst I naturally believe that it is going a little far, I can see that certain features could come in handy, especially for those of us who are still spending a significant proportion of time on constructing elaborate SQL scripts to clean up messy data. That being said, here are a few features of Dataprep that may somewhat adjust the structure of a typical day.

The Flow

In the world of Google, you wrangle your data in a flow and then run a job on it. To the rest of us, what that means is you access your dataset, create a process in which you perform operations to change the data and then transfer that data into a new or existing table.

The process of creating a Flow is extremely easy, so long as you do your storage on a Google platform. Look at this handy “Import Data” function. You can either drag and drop a file to upload, or just simply select a table from Google Cloud Storage or BigQuery.

Here the file I’ll be working on is the free BigQuery table Internet Archive Books. This is basically an online record of historical books that have been scanned in and made free to the public. It contains a bunch of missing entries, duplications, and columns filled with giant lists that make little sense the perfect dataset for a demonstration of Dataprep’s capabilities.

The Recipe

I don’t know if you’ve noticed, but almost every tool you use to work with data has its own language (or at least its own version of a common language). This is also true of Dataprep, but thankfully you don’t really need to learn it to perform most basic tasks. You also don’t need to go in there blind, as you might do when constructing queries in BigQuery. What Dataprep does is take a sample of the first 9.77 Mb of your data and shows this to you, usually after already performing some basic operations, such as splitting your file into table format, taking the first row as headers, and so on.

Above each column is a histogram with the number of unique values. Any entries that don’t match the typical format will be shown as a bar in a different colour above the histogram (see BookMeta_Subjects above, which has a grey section due to the missing values). When you click on a column, Dataprep suggests functions that may be appropriate for the style of data within that column. To apply these, you merely select one and click “add to recipe”. The recipe appears on the right hand side of the window (you expand it by clicking on the scroll-shaped icon), and operations may be easily added, edited, and removed.

For more complex operations, the bottom left hand corner of the window contains transformations. Examples of these are deduplication, aggregations, joins, splitting, extractions of lists and key variables into arrays, and replacements. When creating transformations, you can choose to use either Dataprep language, Regex or text strings, meaning that there is high versatility in what may be done. Here I show the “quotations” column. Originally, on the left there was a variety of quotes and random numbers separated by the pipe symbol, with several rows of missing values. With a few short commands I was able to remove numbers, replace pipe symbols with semi-colons, turn the column into an array, and flatten the array so that each quotation is on a new row, giving you what you see on the right.

Highlighting a word within a specific cell gives you yet more available functions to perform on your data. This could be a count of occurrences, extraction, splitting, converting to an array, or deleting rows where the word is present.

You can see that by now my recipe is pretty long. Can you imagine how long it would take to do all that with blind scripting? And if you decide that you’ve changed your mind about what needs to be done, all you need to do is click a step in the recipe and the data will revert to the state it was at on that step. Of course, what I’ve done is only the start, we need to apply this to our whole dataset, and we do this by clicking the “Run Job” button in the top right hand corner.

The Job Run

There’s a reason that the set of instructions we give the data is called a recipe. It’s meant to be used again and again. In the job run, Dataprep takes the operations performed on your sample and applies it to your whole dataset. You have a few different options here – you can create a new csv file in Google Cloud Storage, create a new table in BigQuery, or select an existing table for more advanced options such as appending or replacing the data.

After you’ve run your job, you see a report which tells you whether there are any new errors or missing values encountered when applying the recipe to your larger table. This lets you know whether you need to make any changes to the recipe, and you can make several iterations of the job run to get everything right.

Of course, nothing stops you from creating a new flow from your resulting table, so you may end up with a series of datasets with different operations performed on them. For example, you could delete some columns in a later flow to protect confidential information.

The Service

We have been privileged enough to be given access to Dataprep whilst it is still in beta, and although it can do a lot more than I’ve covered here, you can bet there’s still more to come. I’m personally hoping for more job run automation in the future, which would make a simple job even simpler. Pricing is also still not completely determined, so currently you only get charged for resource usage (that might change in the future). I do believe, however, that once established, any costs will be compensated for by the lower demand on man-hours.

So does Dataprep mean I need to start planning for a career change? Not quite, but it does mean that I can spend less time doing the boring data preparation and more time exploring what my data is telling me. And really, isn’t that what it’s all about?