Preparing Your Dataset for Machine Learning: 8 Basic Techniques That Make Your Data Better

There’s a good story about bad data told by Martin Goodson, a data science consultant. A healthcare project was aimed to cut costs in the treatment of patients with pneumonia. It employed machine learning (ML) to automatically sort through patient records to decide who has the lowest death risk and should take antibiotics at home and who’s at a high risk of death from pneumonia and should be in the hospital. The team used historic data from clinics, and the algorithm was accurate.

But there was with an important exception. One of the most dangerous conditions that may accompany pneumonia is asthma, and doctors always send asthmatics to intensive care resulting in minimal death rates for these patients. So, the absence of asthmatic death cases in the data made the algorithm assume that asthma isn’t that dangerous during pneumonia, and in all cases the machine recommended sending asthmatics home, while they had the highest risk of pneumonia complications.

ML depends heavily on data. It’s the most crucial aspect that makes algorithm training possible and explains why machine learning became so popular in recent years. But regardless of your actual terabytes of information and data science expertise, if you can’t make sense of data records, a machine will be nearly useless or perhaps even harmful.

The thing is, all datasets are flawed. That’s why data preparation is such an important step in the machine learning process. In a nutshell, data preparation is a set of procedures that helps make your dataset more suitable for machine learning. In broader terms, the preparation also includes establishing the right data collection mechanism. And these procedures consume most of the time spent on machine learning. Sometimes it takes months before the first algorithm is built!

Dataset preparation is sometimes a DIY project

If you were to consider a spherical machine-learning cow, all data preparation should be done by a dedicated data scientist. And that’s about right. If you don’t have a data scientist on board to do all the cleaning, well… you don’t have machine learning. But as we discussed in our story on data science team structures, life is hard for companies that can’t afford data science talent and try to transition existing IT engineers into the field. Besides, dataset preparation isn’t narrowed down to a data scientist’s competencies only. Problems with datasets can stem from the way an organization is built, workflows that are established, and whether instructions are adhered to or not among those charged with recordkeeping.

Yes, you can rely completely on a data scientist in dataset preparation, but by knowing some techniques in advance there’s a way to meaningfully lighten the load of the person who’s going to face this Herculean task.

So, let’s have a look at the most common dataset problems and the ways to solve them.

0. You don’t have data

The line dividing those who can play with ML and those who can’t is drawn by years of collecting information. Some organizations have been hoarding records for decades with such great success that now they need trucks to move it to the cloud as conventional broadband is just not broad enough.

For those who’ve just come on the scene, lack of data is expected, but fortunately, there are ways to turn that minus into a plus.

First, rely on open source data to initiate ML execution. There are mountains of data around and some companies (like Google) are ready to give it away. We’ll talk about public dataset opportunities a bit later. While those opportunities exist, usually the real value comes from internally collected golden data nuggets mined from the business decisions and activities of your own company.

Second — and not surprisingly — now you have a chance to collect data the right way. The companies that started data collection with paper ledgers and ended with .xlsx and .csv files will likely have a harder time with data preparation than those who have a small but proud ML-friendly dataset. If you know the tasks that machine learning should solve, you can tailor a data-gathering mechanism in advance.

What about big data? It’s so buzzed, it seems like the thing everyone should be doing. Aiming at big data from the start is a good mindset, but big data isn’t about petabytes. It’s all about the ability to process them the right way. The larger your dataset, the harder it gets to make the right use of it and yield insights. Having tons of lumber doesn’t necessarily mean you can convert it to a warehouse full of chairs and tables. So, the general recommendation for beginners is to start small and reduce the complexity of their data.

1. Articulate the problem early

Knowing what you want to predict will help you decide which data may be more valuable to collect. When formulating the problem, try to think in the categories of classification, clustering, regression, and ranking that we talked about in our whitepaper on business application of machine learning. In layman’s terms, these tasks are differentiated in the following way:

Classification. You want an algorithm to answer binary yes-or-no questions (cats or dogs, good or bad, sheep or goats, you get the idea) or you want to make a multiclass classification (grass, trees, or bushes; cats, dogs, or birds etc.) You also need the right answers labeled, so an algorithm can learn from them.

Clustering. You want an algorithm to find the rules of classification and the number of classes. The main difference from classification tasks is that you don’t actually know what the groups and the principles of their division are. For instance, this usually happens when you need to segment your customers and tailor a specific approach to each segment depending on its qualities.

Regression. You want an algorithm to yield some numeric value. For example, if you spend too much time coming up with the right price for your product since it depends on many factors, regression algorithms can aid in estimating this value.

Ranking. Some machine learning algorithms just rank objects by a number of features. Ranking is actively used to recommend movies in video streaming services or show the products that a customer might purchase with a high probability based on his or her previous search and purchase activities.

It’s likely, that your business problem can be solved within this simple segmentation and you may start adapting a dataset accordingly. The rule of thumb on this stage is to avoid over-complicated problems.

2. Establish data collection mechanisms

Creating a data-driven culture in an organization is perhaps the hardest part of the entire initiative. We briefly covered this point in our story on machine learning strategy. If you aim to use ML for predictive analytics, the first thing to do is combat data fragmentation.

For instance, if you look at travel tech — one of AltexSoft’s key areas of expertise — data fragmentation is one of the top analytics problems here. In hotel businesses, the departments that are in charge of physical property get into pretty intimate details about their guests. Hotels know guests’ credit card numbers, types of amenities they choose, sometimes home addresses, room service use, and even drinks and meals ordered during a stay. The website where people book these rooms, however, may treat them as complete strangers.

This data gets siloed in different departments and even different tracking points within a department. Marketers may have access to a CRM but the customers there aren’t associated with web analytics. It’s not always possible to converge all data streams if you have many channels of engagement, acquisition, and retention, but in most cases it’s manageable.

Another point here is the human factor. Data collection may be a tedious task that burdens your employees and overwhelms them with instructions. If people must constantly and manually make records, the chances are they will consider these tasks as yet another bureaucratic whim and let the job slide. For instance, Salesforce provides a decent toolset to track and analyze salespeople activities but manual data entry and activity logging alienates salespeople.

3. Format data to make it consistent

Data formatting is sometimes referred to as the file format you’re using. And this isn’t much of a problem to convert a dataset into a file format that fits your machine learning system best.

We’re talking about format consistency of records themselves. If you’re aggregating data from different sources or your dataset has been manually updated by different people, it’s worth making sure that all variables within a given attribute are consistently written. These may be date formats, sums of money (4.03 or $4.03, or even 4 dollars 3 cents), addresses, etc. The input format should be the same across the entire dataset.

And there are other aspects of data consistency. For instance, if you have a set numeric range in an attribute from 0.0 to 5.0, ensure that there are no 5.5s in your set.

4. Reduce data

It’s tempting to include as much data as possible, because of… well, big data! That’s wrong-headed. Since you know what the target attribute (what value you want to predict) is, common sense will guide you further. You can assume which values are critical and which are going to add more dimensions and complexity to your dataset without any predictive contribution. This approach is called attribute sampling.

For example, you want to predict which customers are prone to make large purchases in your online store. The age of your customers, their location, and gender can be better predictors than their credit card numbers. But this also works another way. Consider which other values you may need to collect to uncover more dependencies. For instance, adding bounce rates may increase accuracy in predicting conversion.

That’s the point where domain expertise plays a big role. Returning to our beginning story, not all data scientists know that asthma can cause pneumonia complications. The same works with reducing datasets. If you haven’t employed a unicorn who has one foot in healthcare basics and the other in data science, it’s likely that a data scientist might have a hard time understanding which values are of real significance to a dataset.

Another approach is called record sampling. This implies that you simply remove records (objects) with missing, erroneous, or less representative values to make prediction more accurate. The technique can also be used in the later stages when you need a model prototype to understand whether a chosen machine learning method yields expected results.

You can also reduce data by aggregating it into broader records by dividing the entire attribute data into multiple groups and drawing the number for each group. Instead of exploring the most purchased products of a given day through five years of online store existence, aggregate them to weekly or monthly scores. This will help reduce data size and computing time without tangible prediction losses.

5. Clean data

Since missing values can tangibly reduce prediction accuracy, make this issue a priority. In terms of machine learning, assumed or approximated values are “more right” for an algorithm than just missing ones. Even if you don’t know the exact value, methods exist to better “assume” which value is missing or bypass the issue. Choosing the right approach also heavily depends on data and the domain you have:

If you use some ML as a service platform, data cleaning can be automated. For instance, Azure Machine Learning allows you to choose among available techniques, while Amazon ML will do it without your involvement at all. Have a look at our MLaaS systems comparison to get a better idea about systems available on the market.

6. Decompose data

Some values in your data set can be complex and decomposing them into multiple parts will help in capturing more specific relationships. This process is actually the opposite to reducing data as you have to add new attributes based on the existing ones.

For example, if your sales performance varies depending on the day of a week, segregating the day as a separate categorical value from the date (Mon; 06.19.2017) may provide the algorithm with more relevant information.

7. Rescale data

Data rescaling belongs to a group of data normalization procedures that aim at improving the quality of a dataset by reducing dimensions and avoiding the situation when some of the values overweight others. What does this mean?

Imagine that you run a chain of car dealerships and most of the attributes in your dataset are either categorical to depict models and body styles (sedan, hatchback, van, etc.) or have 1–2 digit numbers, for instance, for years of use. But the prices are 4–5 digit numbers ($10000 or $8000) and you want to predict the average time for the car to be sold based on its characteristics (model, years of previous use, body style, price, condition, etc.) While the price is an important criterion, you don’t want it to overweight the other ones with a larger number.

In this case, min-max normalization can be used. It entails transforming numerical values to ranges, e.g. from 0.0 to 5.0 where 0.0 represents the minimal and 5.0 the maximum values to even out the weight of the price attribute with other attributes in a dataset.

A bit simpler approach is decimal scaling. It consists of scaling data by moving a decimal point in either direction for the same purposes.

8. Discretize data

Sometimes you can be more effective in your predictions if you turn numerical values into categorical values. This can be achieved, for example, by dividing the entire range of values into a number of groups.

If you track customer age figures, there isn’t a big difference between the age of 13 and 14 or 26 and 27. So these can be converted into relevant age groups. Making the values categorical, you simplify the work for an algorithm and essentially make prediction more relevant.

Public datasets

Your private datasets capture the specifics of your unique business and potentially have all relevant attributes that you might need for predictions. But when can you use public datasets?

Public datasets come from organizations and businesses that are open enough to share. The sets usually contain information about general processes in a wide range of life areas like healthcare records, historical weather records, transportation measurements, text and translation collections, records of hardware use, etc. Though these won’t help capture data dependencies in your own business, they can yield great insight into your industry and its niche, and, sometimes, your customer segments. You can find a great public datasets compilation on GitHub. Some of the public datasets are commercial and will cost you money.

Another use case for public datasets comes from startups and businesses that use machine learning techniques to ship ML-based products to their customers. If you recommend city attractions and restaurants based on user-generated content, you don’t have to label thousands of pictures to train an image recognition algorithm that will sort through photos sent by users. There’s an Open Images dataset from Google. Similar datasets exist for speech and text recognition.

So, even if you haven’t been collecting data for years, go ahead and search. There may be sets that you can use right away.

Final word: you still need a data scientist

The dataset preparation measures described here are basic and straightforward. So, you still must find data scientists and data engineers if you need to automate data collection mechanisms, set the infrastructure, and scale for complex machine learning tasks.

But the point is, deep domain and problem understanding will aid in relevant structuring values in your data. If you are only at the data collection stage, it may be reasonable to reconsider existing approaches to sourcing and formatting your records.

Being a Technology & Solution Consulting company, AltexSoft co-builds technology products to help companies accelerate growth.