What is semi-supervised learning?

Semi-supervised learning is a machine learning approach that trains models on a combination of labeled and unlabeled data.

Setting reinforcement learning aside, machine learning algorithms have traditionally fallen into two categories: supervised and unsupervised. Supervised learning requires a large amount of labeled data to train the model. Unsupervised learning, on the other hand, does not require labeled data – it learns patterns and trends directly from unlabeled data.

Disadvantages

Both of these methods have drawbacks. For supervised learning, labeled data is hard to find, so it mostly has to be labeled by hand – a tedious, time-consuming job. For unsupervised learning, gathering data is easier because no labels are needed, but the approach supports only a limited range of applications.

Solution

To solve these problems, semi-supervised learning was introduced. It is the middle ground between supervised and unsupervised learning. The amount of labeled data is generally much less as compared to unlabeled data, but the presence of even a small amount of labeled data makes the model perform much better.

How it works

  • First, the labeled data is used to train the model (just as in supervised learning). Training continues until the model produces acceptable results.
  • Next, the trained model is used to predict outputs (pseudo-labels) for the unlabeled data. These predictions may not be 100% accurate, but they are generally reasonable because the model has already been trained on the labeled data.
  • Finally, the labeled data, together with the pseudo-labeled data, is used to retrain the model, making it more accurate.
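The three steps above can be sketched as a minimal self-training loop. The toy 1-D dataset and the nearest-centroid "model" below are illustrative assumptions, not a prescribed method – any classifier can fill the same role:

```python
import numpy as np

# Toy 1-D dataset (hypothetical): two classes centered near 0.0 and 5.0.
labeled_x = np.array([0.1, 0.3, 4.8, 5.2])
labeled_y = np.array([0, 0, 1, 1])
unlabeled_x = np.array([0.2, 0.4, 4.9, 5.1, 0.0, 5.3])

def train_centroids(x, y):
    # "Training" here just computes one centroid per class.
    return np.array([x[y == c].mean() for c in (0, 1)])

def predict(centroids, x):
    # Assign each point to the class of the nearest centroid.
    return np.argmin(np.abs(x[:, None] - centroids[None, :]), axis=1)

# Step 1: train on the labeled data alone.
centroids = train_centroids(labeled_x, labeled_y)

# Step 2: pseudo-label the unlabeled data with the trained model.
pseudo_y = predict(centroids, unlabeled_x)

# Step 3: retrain on labeled + pseudo-labeled data together.
all_x = np.concatenate([labeled_x, unlabeled_x])
all_y = np.concatenate([labeled_y, pseudo_y])
centroids = train_centroids(all_x, all_y)

print(predict(centroids, np.array([0.15, 5.05])))  # → [0 1]
```

In practice, the pseudo-labeling and retraining steps are often repeated, keeping only the predictions the model is most confident about at each round.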

Assumptions

Semi-supervised learning is used in cases where the labeled and unlabeled data share some sort of connection, i.e., a relationship based on patterns and trends must exist. To exploit this, some assumptions may be made. These are:

Continuity Assumption
It is assumed that points that lie close to each other are likely to share the same output. This assumption is also used in supervised learning, where it yields a simpler decision boundary.

Cluster Assumption
It may be assumed that the data can be divided into distinct clusters and that points within the same cluster share the same output. This is similar to unsupervised learning, where data is separated based on patterns such as distance to cluster centroids.
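Under this assumption, a single labeled example per cluster can label the entire cluster. The sketch below uses hypothetical well-separated 1-D data and a hand-rolled 2-means loop purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical unlabeled data: two well-separated 1-D clusters.
data = np.concatenate([rng.normal(0.0, 0.2, 20), rng.normal(5.0, 0.2, 20)])

# A single labeled example per cluster.
labeled_x = np.array([0.1, 5.1])
labeled_y = np.array([0, 1])

# Simple 2-means clustering of the unlabeled points.
centroids = np.array([data.min(), data.max()])
for _ in range(10):
    assign = np.argmin(np.abs(data[:, None] - centroids[None, :]), axis=1)
    centroids = np.array([data[assign == k].mean() for k in (0, 1)])

# Cluster assumption: every point inherits the label of the
# labeled example that falls in the same cluster.
labeled_cluster = np.argmin(np.abs(labeled_x[:, None] - centroids[None, :]), axis=1)
label_of_cluster = np.empty(2, dtype=int)
label_of_cluster[labeled_cluster] = labeled_y
pseudo = label_of_cluster[assign]
```

If the clusters overlap or a cluster contains no labeled example, this propagation step breaks down – the assumption only pays off when the cluster structure genuinely reflects the output classes.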

Manifold Assumption
Another assumption is that the data can be modeled in a space of much lower dimension than the full input space. This helps avoid the problems that arise when dealing with high-dimensional data. In speech recognition, for example, the input space of all possible sound waves is far larger than the space of waves that actually occur.
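A quick way to see this assumption in action is to generate hypothetical high-dimensional data that actually lies near a 1-D line, then recover that low-dimensional structure with PCA (here computed via an SVD of the centered data):

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical data: 100 points in 10-D that lie near a 1-D line.
direction = rng.normal(size=10)
direction /= np.linalg.norm(direction)
coords = rng.normal(size=(100, 1))                 # 1-D "manifold" coordinates
data = coords * direction + 0.01 * rng.normal(size=(100, 10))  # tiny noise

# PCA via SVD: singular values of the centered data give the
# variance captured along each principal direction.
centered = data - data.mean(axis=0)
s = np.linalg.svd(centered, compute_uv=False)
explained = s**2 / (s**2).sum()
print(f"variance explained by first component: {explained[0]:.3f}")
```

Because the data was built around a single direction, the first principal component captures nearly all of the variance – the 10-D inputs are effectively 1-D, which is exactly what the manifold assumption posits.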

Applications

  1. Speech Recognition: Labeling audio files for speech recognition is a very time-consuming task, so semi-supervised learning is used to improve efficiency.
  2. Text Document Analyzer: It is very difficult to find a large number of labeled text documents, but semi-supervised learning can help with document classification – useful for classifying and ranking web pages in response to a query.
  3. Protein Sequence Classification: Protein sequences derived from DNA are very long, so a combination of human labeling and model inference is ideal.
