When it comes to advanced textual analytics such as machine learning, there are a few key ideas and phrases everyone should be familiar with. We should all know where machine learning is used and the different types of machine learning that exist.
Continue reading to learn how machine learning affects search, what search engines are doing with it, and how to recognize machine learning at work. Let’s start with a few definitions. Then we’ll get into machine learning algorithms and models.
Components of Advanced Textual Analytics
Machine learning terms
Here are several key machine learning definitions, the majority of which will be covered later in the article.
This vocabulary of machine learning terms is not meant to be exhaustive.
If you want that, Google provides a good one here.
Algorithm: A mathematical process run on data to produce an output. There are different types of algorithms for different machine-learning problems.
Artificial Intelligence (AI): A field of computer science focused on equipping computers with skills or abilities that replicate or are inspired by human intelligence.
Corpus: A collection of written text. Usually organized in some way.
Entity: A thing or concept that is unique, singular, well-defined, and distinguishable. You can loosely think of it as a noun, though it’s a bit broader than that. A specific hue of red would be an entity: it is unique and singular in that nothing else is exactly like it, and it is distinguishable from all other colors because it is clearly defined (see its hex code).
Machine Learning: A branch of artificial intelligence concerned with developing algorithms, models, and systems that perform tasks and generally improve at those tasks without being explicitly programmed.
Model: Models and algorithms are frequently used interchangeably. The distinction can get blurry (unless you’re a machine learning engineer). Essentially, the difference is that where an algorithm is simply a formula that produces an output value, a model is a representation of what that algorithm has produced after being trained for a specific task. So, when we say “BERT model” we are referring to the BERT that has been trained for a specific NLP task (which task and model size will dictate which specific BERT model).
Natural Language Processing (NLP): A general term to describe the field of work in processing language-based information to complete a task.
Neural Network: A model architecture that, taking inspiration from the brain, includes an input layer (where the signals enter – in a human, think of the signal sent to the brain when an object is touched), a number of hidden layers (providing different paths along which the input can be adjusted to produce an output), and an output layer. As signals enter, various "paths" are tested, and the network is programmed to gravitate toward ever-better outputs.
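The input-hidden-output structure described above can be sketched in a few lines of code. This is a minimal, illustrative forward pass only: the layer sizes are invented, and the weights are random placeholders that a real network would learn during training.

```python
import numpy as np

rng = np.random.default_rng(42)

def relu(x):
    # A common hidden-layer activation: negative signals are zeroed out
    return np.maximum(0, x)

x = np.array([0.5, -0.2, 0.1])   # input layer: the incoming signal
W1 = rng.normal(size=(4, 3))     # weights from input layer to 4 hidden neurons
W2 = rng.normal(size=(1, 4))     # weights from hidden layer to the output

hidden = relu(W1 @ x)            # hidden layer: weighted sum, then activation
output = W2 @ hidden             # output layer: a single value

print(output.shape)              # (1,)
```

Training is the process of nudging `W1` and `W2` so that the output gets closer to the desired answer; this sketch only shows how a signal flows forward through the layers.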
Artificial intelligence vs. machine learning: What’s the difference?
Often we hear the words artificial intelligence and machine learning used interchangeably. They are not exactly the same.
Artificial intelligence is the field of making machines mimic intelligence, whereas machine learning is the pursuit of systems that can learn without being explicitly programmed for a task.
Where else machine learning is used
A few significant areas where machine learning is being used:
In Ads, what drives the systems behind automated bidding strategies and ad automation?
In News, how does the system know how to group stories?
In Images, how does the system identify specific objects and types of objects?
In Email, how does the system filter spam?
In Translation, how does the system learn new words and phrases?
In Video, how does the system learn which videos to recommend next?
All of these questions, and thousands more like them, have the same answer: machine learning.
Types of machine learning algorithms and models
Now let’s walk through two supervision levels of machine learning algorithms and models – supervised and unsupervised learning.
Understanding the type of algorithm we’re looking at, and where to look for them, is important.
Supervised learning
Simply put, in supervised learning the algorithm is provided with fully labeled training and test data.
This is to say, someone has gone through the effort of labeling thousands (or millions) of examples to train a model on reliable data.
For example, labeling red shirts in x number of photos of people wearing red shirts.
Supervised learning is useful in classification and regression problems.
1) Classification problems are fairly straightforward: determining whether something is or is not part of a group.
An easy example is Google Photos. Google has classified me, as well as stages. They have not manually labeled each of these pictures, but the model will have been trained on manually labeled data for stages. And anyone who has used Google Photos knows they periodically ask you to confirm photos and the people in them. We are the manual labelers.
Ever used reCAPTCHA? Guess what you’re doing? That’s right: you’re regularly helping to train machine learning models.
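The red-shirt example can be made concrete with a toy classifier. This sketch assumes each photo has been reduced to a single made-up feature, the fraction of red pixels, and uses a 1-nearest-neighbor rule: a new photo simply copies the label of the most similar labeled example. All values here are invented for illustration.

```python
# Labeled training data: (fraction of red pixels, label)
labeled_photos = [
    (0.72, "red shirt"), (0.65, "red shirt"), (0.58, "red shirt"),
    (0.08, "no red shirt"), (0.12, "no red shirt"), (0.03, "no red shirt"),
]

def classify(red_fraction):
    # 1-nearest-neighbor: find the labeled example closest to this photo
    nearest = min(labeled_photos, key=lambda ex: abs(ex[0] - red_fraction))
    return nearest[1]

print(classify(0.61))  # -> "red shirt" (closest example is 0.58)
print(classify(0.05))  # -> "no red shirt" (closest example is 0.03)
```

The point is the workflow, not the algorithm: someone had to label the training examples first, and the model's only job is to decide group membership for new, unlabeled inputs.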
2) Regression problems, on the other hand, deal with problems where there is a set of inputs that need to be mapped to an output value. A simple example is to think of a system for estimating the sale price of a house with the input of square feet, number of bedrooms, number of bathrooms, distance from the ocean, etc.
Can you think of any other systems that might take in a wide array of features/signals and then need to assign a value to the entity (/site) in question?
Even though it is undoubtedly more complicated and includes a vast number of unique algorithms for different purposes, regression is likely one of the algorithm types that drives the core functions of search.
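The house-price example above can be sketched as the simplest possible regression: fitting a straight line with ordinary least squares. This version assumes, purely for illustration, that price depends on a single feature (square footage), and the training figures are invented; real systems use many more features and far more sophisticated algorithms.

```python
# Invented training data: (square feet, sale price)
sqft  = [1000, 1500, 2000, 2500, 3000]
price = [200_000, 290_000, 410_000, 500_000, 610_000]

n = len(sqft)
mean_x = sum(sqft) / n
mean_y = sum(price) / n

# Ordinary least squares for one feature: price ~ slope * sqft + intercept
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(sqft, price))
         / sum((x - mean_x) ** 2 for x in sqft))
intercept = mean_y - slope * mean_x

def predict(square_feet):
    # Map an input (square footage) to an output value (estimated price)
    return slope * square_feet + intercept

print(round(predict(1800)))  # -> 360800
```

Adding bedrooms, bathrooms, and distance from the ocean just means more input columns and a coefficient per feature; the idea of mapping inputs to a single output value stays the same.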
Unsupervised learning
In unsupervised learning, a system is given a set of unlabeled data and left to determine for itself what to do with it. No end goal is specified. The system may cluster similar items together, look for outliers, find correlations, etc.
Unsupervised learning is used when you have a lot of data, and you can’t or don’t know in advance how it should be used.
A good example might be Google News. Google clusters similar news stories and also surfaces news stories that didn’t previously exist (thus, they are news).
These tasks would best be performed mainly (though not exclusively) by unsupervised models: models that have “seen” how successful or unsuccessful previous clustering or surfacing has been, but cannot fully apply that experience to the current data, which (like the previous news) is unlabeled, and must make decisions anyway.
It’s an incredibly important area of machine learning as it relates to search, especially as things expand.
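The news-clustering idea can be sketched with a tiny k-means loop. This toy assumes, for illustration only, that each story has been reduced to a single number (say, a topic score); nobody tells the algorithm what the groups mean, yet it still finds the two natural clusters in the data.

```python
# Unlabeled data: two natural groups, but no labels saying so
points = [1.0, 1.2, 0.8, 8.0, 8.3, 7.9]
centers = [points[0], points[3]]  # naive initialization: two seed points

for _ in range(10):  # a few refinement passes are enough for this toy data
    clusters = [[], []]
    for p in points:
        # Assign each point to its nearest center
        nearest = min(range(2), key=lambda i: abs(p - centers[i]))
        clusters[nearest].append(p)
    # Move each center to the mean of the points assigned to it
    centers = [sum(c) / len(c) for c in clusters]

print(sorted(centers))  # two centers, one per natural group
```

Real clustering runs in many dimensions (word embeddings rather than single scores) and with far more robust algorithms, but the core loop of assign-then-update is the same.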
Google Translate is another good example. It no longer relies on the one-to-one translation that used to exist, where the system was trained to understand that word x in English equals word y in Spanish. Newer techniques seek out patterns in how both languages are used, improving translation through semi-supervised learning (some labeled data, much unlabeled) and unsupervised learning, even translating from one language into one that is completely unknown to the system. And it is just the beginning.
Thanks for reading.