Bad data, bad results

The vast majority of Artificial Intelligence projects fail for the same reason: data quality. Many are abandoned before they have had time to demonstrate a return on investment, giving rise to phrases such as “this AI thing does not work” or “too much hype about AI”. But the truth is that the reason for failure rarely has to do with the AI itself.

Knowing why a project fails is a great way to learn.

As the saying goes, “garbage in, garbage out”. If you start a project with bad data, your AI project is not going to succeed. The heart of AI is data quality. The vast majority of AI projects are actually data engineering projects, and although selecting algorithms and building models is a very important step, you must first ensure that you have good quality data. If the data is not of good quality, we have to make it good, and that is a very tedious job: cleaning the data, preparing it, augmenting it and labeling it.

Everything we have said so far may seem incredibly obvious, but in the vast majority of projects we do not realize that our data is of poor quality until we see that our models do not obtain the expected results.

The main challenge therefore lies in acquiring and processing the data, whether public or private. Once we have the data, we have to reshape it in a way that makes sense, since it comes from very different sources. Unfortunately, most of the time the data does not arrive in the form we would like. First we have to clean the database and remove all the records we do not need and that are not useful for our project. We also have to make sure that we have enough data to carry the project out.
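As a rough illustration of the cleaning step described above, here is a minimal sketch in Python. The function name `clean_records`, the field names and the `min_rows` threshold are all hypothetical, chosen only for the example; a real project would define its own rules for what counts as a useless record and how much data is enough.

```python
def clean_records(records, required_fields, min_rows):
    """Drop incomplete and duplicate records, then check the remaining volume.

    All names here are illustrative assumptions, not a fixed recipe.
    """
    seen = set()
    cleaned = []
    for record in records:
        # Skip records where a required field is missing or empty.
        if any(record.get(field) in (None, "") for field in required_fields):
            continue
        # Skip records that duplicate an earlier one on the required fields.
        key = tuple(record[field] for field in required_fields)
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(record)
    # Report whether enough data is left to carry the project out.
    enough = len(cleaned) >= min_rows
    return cleaned, enough


raw = [
    {"customer_id": 1, "email": "a@example.com"},
    {"customer_id": 1, "email": "a@example.com"},  # duplicate
    {"customer_id": 2, "email": ""},               # missing email
    {"customer_id": 3, "email": "c@example.com"},
]
cleaned, enough = clean_records(raw, ["customer_id", "email"], min_rows=3)
print(len(cleaned), enough)
```

In this toy dataset only two records survive, so the check reports that there is not enough data, which is exactly the kind of problem it is better to discover before modeling starts.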

In every project there is a series of steps to follow to make sure we have the quality data our models need in order to succeed. All of this is data engineering.

In general, customer payment transaction data tends to be the most up to date, while data collected through web forms tends to be of the worst quality. We have to take this data from its different sources and merge it whenever it refers to the same client, and in the vast majority of cases this becomes a daunting task.
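One possible way to approach the merge described above is to rank the sources by how much we trust them and let the more trusted source win field by field. The sketch below assumes a hypothetical ranking (`SOURCE_PRIORITY`) in which payment data beats web-form data, and assumes that records for the same client already share a `customer_id`; in practice, matching clients across sources is often the hardest part.

```python
# Hypothetical ranking: a higher number means a more trusted source.
SOURCE_PRIORITY = {"payments": 2, "web_form": 1}


def merge_customer_records(records):
    """Merge records per customer, letting higher-priority sources win."""
    merged = {}
    # Process low-priority sources first so trusted sources overwrite them.
    for record in sorted(records, key=lambda r: SOURCE_PRIORITY[r["source"]]):
        customer = merged.setdefault(record["customer_id"], {})
        for field, value in record.items():
            if field == "source":
                continue
            # Empty values never overwrite information we already have.
            if value not in (None, ""):
                customer[field] = value
    return merged


records = [
    {"customer_id": 7, "source": "payments",
     "email": "new@example.com", "phone": ""},
    {"customer_id": 7, "source": "web_form",
     "email": "old@example.com", "phone": "555-0100"},
]
merged = merge_customer_records(records)
print(merged[7])
```

Here the email comes from the payment record, while the phone number, which the payment record lacks, is kept from the web form.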

That is why it is so important to have a working methodology that addresses data quality as early as possible. If a data source does not work for us, we may need to consider replacing it or dropping it so that our models give results as close as possible to those expected. The sooner we find out, the more options we have.

Labeling the data is very important for supervised learning projects. The models need to be fed good, clean and well-labeled data, because supervised learning is basically just that: learning by example.
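A simple check that can reveal labeling problems is to look for identical examples that received different labels. The sketch below is only an illustration: the function name `find_label_conflicts` and the sample data are made up, and the text normalization (trimming and lowercasing) is an assumption that will not suit every dataset.

```python
from collections import defaultdict


def find_label_conflicts(examples):
    """Return examples that appear with more than one distinct label."""
    labels_by_text = defaultdict(set)
    for text, label in examples:
        # Normalize lightly so trivial differences do not hide conflicts.
        labels_by_text[text.strip().lower()].add(label)
    return {
        text: sorted(labels)
        for text, labels in labels_by_text.items()
        if len(labels) > 1
    }


examples = [
    ("Great product", "positive"),
    ("great product ", "negative"),  # same text, conflicting label
    ("Terrible", "negative"),
]
conflicts = find_label_conflicts(examples)
print(conflicts)
```

Any example reported here should be reviewed by a person, since a model that learns by example will learn the contradiction too.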

We also need to work out how the data must be transformed to meet the specific requirements of each project.

There are always a series of questions that we will have to ask ourselves:

How will we implement data cleansing, data transformation and data manipulation?
What will our data engineering process look like?
By what means will data quality be continuously monitored and evaluated, so that it remains at the required level at all times?
Will all internal data be used or will the data we have be expanded with third-party data?
How can we obtain this data that we do not yet have? Will we carry out that process internally or externally?
Does the project need a third-party data labeling service?
How is this whole labeling process going to be controlled?
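To make the monitoring question above a little more concrete, here is a minimal sketch of a recurring quality check. It measures only one thing, the share of missing values per field, and compares it with a threshold; the function name `quality_report`, the field names and the 20% limit are assumptions made for the example. A real project would track more metrics and run the check on every new batch of data.

```python
def quality_report(records, fields, max_missing_rate):
    """Compute the missing-value rate per field and flag fields over the limit."""
    total = len(records)
    report = {}
    for field in fields:
        missing = sum(1 for r in records if r.get(field) in (None, ""))
        # An empty batch is treated as fully missing so it never passes silently.
        rate = missing / total if total else 1.0
        report[field] = {"missing_rate": rate, "ok": rate <= max_missing_rate}
    return report


batch = [
    {"email": "a@example.com", "label": "churn"},
    {"email": "", "label": "stay"},
    {"email": "c@example.com", "label": "stay"},
    {"email": "d@example.com", "label": "churn"},
]
report = quality_report(batch, ["email", "label"], max_missing_rate=0.2)
for field, result in report.items():
    print(field, result)
```

In this batch one email out of four is missing, so the `email` field exceeds the assumed limit and is flagged, while `label` passes.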

None of these questions should be underestimated because low-quality data can sink a project, even for the most experienced professionals.

Nennisiwok AI Lab Blog
