BigML’s upcoming release on Thursday, October 25, 2018, will bring our latest addition to the platform: Data Transformations. In this post, we’ll give a quick introduction to Data Transformations before moving on to the rest of this series of 6 blog posts (including this one), which will give you a detailed perspective on what’s behind the new capabilities. Today’s post explains the basic concepts and will be followed by an example use case. Then, three more blog posts will focus on how to use Data Transformations through the BigML Dashboard, API, and WhizzML for automation. Finally, we will complete the series with a technical look at how Data Transformations work behind the scenes.
Understanding Data Transformations
Transforming your data is one of the most important, yet time-consuming and difficult, tasks in any Machine Learning workflow. Of course, “data transformations” is a loaded phrase, and entire books have been written on the topic. In the Machine Learning context, what we mean is a collection of actions that can be performed compositionally on your input data to make it better suited to various modeling tasks; if you will, these are the methods to optimally prepare or pre-process your data.
As a reminder, BigML already offers several automatic data preparation options (missing values treatment, categorical field encoding, date-time field expansion, NLP capabilities, and even a full domain-specific language for feature generation in Flatline), as well as useful dataset operations such as sampling, filtering, and the addition of new fields. Even so, we’ve been looking to add more capabilities for full-fledged feature engineering within the platform.
Well, the time has come! This means the powerful set of supervised and unsupervised learning techniques we’ve built from scratch over the last 7 years all stand to benefit from better-prepared data. Without further ado, let’s see what goodies made it into this release:
- Aggregating instances: at times you may need to aggregate highly granular data at a higher level. When that happens, you can group your instances by a given field and perform various operations on the other fields. For example, you may want to aggregate sales figures by product and perform further operations on the resulting dataset before applying Machine Learning techniques such as Time Series (see the aggregation sketch after this list).
- Joining datasets: if your data comes from different sources and lives in multiple datasets, you need to join those datasets by defining a join field. For instance, imagine you have one dataset containing user profile information such as account creation date, age, sex, and country, and another dataset containing those users’ transactions, with critical fields like transaction date, payment type, amount, and more. If you’d rather have all those fields in a single dataset, you can join the two based on a common field such as customer_id (see the join sketch after this list).
- Merging datasets: if you have multiple datasets with the same fields, you may want to concatenate them before you continue your workflow. Take, for example, a situation where daily files of sensor data need to be collated into a single monthly file before you can proceed. This is a breeze with the new merge capability built into the BigML Dashboard (the merge sketch after this list shows the idea in SQL terms).
- SQL support: this is big! BigML now supports all the operations from PostgreSQL, which means you have the full power of SQL at your disposal through the BigML REST API. You can choose between writing a free-form SQL query or using the JSON-like formulas that the BigML API supports. You can also easily see the SQL queries that created a given dataset and even apply them to other datasets; more on those in the subsequent blog posts.
- Feature Library and Flatline Editor: this addition lets you easily create and reuse new features thanks to Feature Library and Flatline Editor improvements. Simply save the formulas of the new features you create and reuse them on other datasets; view your most recent and most frequently used formulas for any use case; visualize the fields involved in a Flatline formula in the Flatline Editor’s preview mode; and save time with Formula Autocompletion based on your past formulas and other frequently used examples.
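To make the aggregation scenario above concrete, here is a minimal SQL sketch of the idea; the daily_sales table and its columns are purely hypothetical names used for illustration, not something taken from the BigML API:

```sql
-- Hypothetical daily_sales table: one row per individual sale.
-- Roll it up to one row per product before further modeling.
SELECT
  product_id,
  COUNT(*)    AS num_sales,      -- number of transactions per product
  SUM(amount) AS total_revenue,  -- aggregated revenue per product
  AVG(amount) AS avg_ticket      -- average sale size per product
FROM daily_sales
GROUP BY product_id;
```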
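The join scenario can be sketched the same way; again, the users and transactions tables and the customer_id key below are illustrative assumptions rather than actual BigML resources:

```sql
-- Attach each user's profile fields to every one of that user's transactions.
SELECT
  u.customer_id,
  u.account_creation_date,
  u.age,
  u.sex,
  u.country,
  t.transaction_date,
  t.payment_type,
  t.amount
FROM users u
JOIN transactions t
  ON t.customer_id = u.customer_id;
```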
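Finally, merging datasets with identical fields is conceptually a concatenation, which in SQL terms amounts to a UNION ALL; the per-day sensor tables below are, once more, just hypothetical placeholders:

```sql
-- Stack daily sensor readings that share the same schema into one monthly set.
SELECT sensor_id, reading_time, temperature, humidity FROM sensor_2018_10_01
UNION ALL
SELECT sensor_id, reading_time, temperature, humidity FROM sensor_2018_10_02
UNION ALL
SELECT sensor_id, reading_time, temperature, humidity FROM sensor_2018_10_03;
```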
NOTE: keep this under wraps for now, but before you know it, the Dashboard will also support other capabilities such as ordering instances, removing duplicate instances, and more!
Want to know more about Data Transformations?
To learn more about BigML’s upcoming release, please join our free, live webinar on Thursday, October 25, 2018, at 10:00 AM PDT. Register today, as space is limited! Stay tuned for the remaining five posts in this series, which will show step by step how to transform your data with the BigML platform.