
If you had all the data in the world at your feet, would you still build a deep learning model on it?

 


Suppose you solved a crucial business problem by building a complex deep neural network model to generate predictions on a new set of business data.

Deploying and maintaining this model costs your business a fortune, but the decision-makers consider it a small price to pay for the greater good.

It is all well and good. But could you have optimized the cost by opting for a simple regression model instead of a deep neural network?

Of course, you tried regression as a baseline model, and you went for a deep neural network only after you noticed that it outperformed the regression model by miles.

But if you suddenly had to build this model again with an enormous volume of training data, would you notice any difference between the performance metrics of the baseline model and the complex one?

Or, for that matter, would an increase in the volume of training data improve the performance of a baseline model and bring it on par with a complex model?

Two Microsoft researchers, Michele Banko and Eric Brill, tried to find an answer to a problem similar to yours.

They were trying to develop an improved grammar checker, specifically one that could help a person choose the correct word for a sentence from a list of commonly confused alternatives.

As an example, imagine a task where one needs to choose the correct word from the list to, too, and two to complete the sentence below.

“He has __ hands.”

This problem is known as confusion set disambiguation, since we are trying to resolve the ambiguity within a set of easily confused words.
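To make the task concrete, here is a minimal sketch (in Python, and not the authors' implementation) of how a simple learner could tackle a confusion set: treat the words surrounding the blank as bag-of-words features and train a classifier to predict which member of {to, too, two} fits. The training sentences below are made up purely for illustration.

```python
# A minimal, illustrative sketch of confusion set disambiguation.
# NOT the authors' setup: we use the words around the blank as
# bag-of-words features and train a naive Bayes classifier on toy data.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny hypothetical training contexts; the real experiment used corpora
# of up to a billion words.
contexts = [
    "he wants __ go home",
    "she has __ hands",
    "that is __ much sugar",
    "we walked __ the station",
    "they bought __ tickets",
    "it is far __ late now",
]
labels = ["to", "two", "too", "to", "two", "too"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(contexts, labels)

print(model.predict(["he has __ hands"]))  # ['two'] on this toy data
```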

Banko and Brill asked themselves how to approach this task to attain the highest performance improvement.

The possibilities that came to their mind were tweaking the existing algorithms, exploring new learning techniques, and using sophisticated features.

But since all of these were expensive options, they decided to see what would happen if they simply trained the existing methods on a much larger amount of training data.

Hence, they chose four existing learners for this task: winnow, perceptron, naive Bayes, and a memory-based learner.

They collected a training corpus of one billion words from numerous news articles, English literature, scientific texts, etc.

To keep the test data separate from the training set, they collected one million words of Wall Street Journal text.

They then trained each of the four learners at different cutoff points in the corpus, i.e., on the first one million words, the first five million words, and so on, until all one billion words were used for training.
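As a rough illustration (again, not the authors' code), the learning-curve procedure can be sketched as follows: retrain the same learner on increasingly large slices of the training examples and record its accuracy on the held-out test set at each cutoff.

```python
# A hypothetical sketch of the learning-curve experiment, not the authors' code:
# retrain the same simple learner on growing slices of the training data and
# record held-out accuracy at each cutoff.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score

def learning_curve(train_texts, train_labels, test_texts, test_labels, cutoffs):
    """Return (cutoff, test accuracy) pairs for a naive Bayes learner."""
    results = []
    for cutoff in cutoffs:
        model = make_pipeline(CountVectorizer(), MultinomialNB())
        model.fit(train_texts[:cutoff], train_labels[:cutoff])
        accuracy = accuracy_score(test_labels, model.predict(test_texts))
        results.append((cutoff, accuracy))
    return results

# In the paper the cutoffs grew from one million up to one billion words;
# for a toy run you would pass much smaller numbers, e.g. [100, 500, 1000].
```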

They published the results of their experimentation in a paper titled “Scaling to Very Very Large Corpora for Natural Language Disambiguation” in 2001.

It included the graph depicted below, in which three of the four learners produce almost the same accuracy when trained on the full corpus of one billion words.

Figure 1. Learning Curves for Confusion Set Disambiguation from the paper “Scaling to Very Very Large Corpora for Natural Language Disambiguation” by Michele Banko and Eric Brill.

Without ignoring the caveats and practical limits of this research, it can sometimes be helpful to ask ourselves whether adding more data could improve a baseline model’s performance.

The authors wrote in the conclusion of their paper:

“We propose that a logical next step for the research community would be to direct efforts towards increasing the size of annotated training collections while deemphasizing the focus on comparing different learning techniques trained only on small training corpora.”

Of course, we cannot ignore that increasing the volume of training data carries its own cost and memory constraints. But when it is feasible, it can be an effective way to approach model building.

Click here to view the original research paper.

