Semi-supervised learning is a machine learning approach that combines supervised and unsupervised learning. The algorithm learns from both labelled and unlabelled data, typically a small amount of labelled data and a much larger amount of unlabelled data, so it falls between the supervised and unsupervised approaches.
Supervised learning needs labelled datasets to perform its task, and more data generally means better accuracy (setting the under-fitting and over-fitting problems aside). However, this is a very costly process, because every example in such a large dataset has to be labelled. On the other hand, the basic disadvantage of unsupervised learning is that its range of real-world applications is limited.
To counter these drawbacks, scientists and engineers introduced semi-supervised learning. As mentioned in the definition above, it is a combined algorithmic approach that draws on both supervised and unsupervised learning. Basically, semi-supervised learning combines a small amount of labelled data with a large amount of unlabelled data (most of the data is unlabelled).
To make any use of unlabelled data alongside labelled data, semi-supervised learning algorithms rely on at least one of the following three assumptions:
1)- Continuity Assumption – The continuity assumption rests on a simple idea: points that are close to each other are more likely to share a label. Supervised learning makes the same assumption, which is why geometrically simple decision boundaries are preferred there. In semi-supervised learning, smoothness matters as well: decision boundaries are preferred in low-density regions, so that few points lie close to each other but carry different labels.
2)- Cluster Assumption – A cluster is a group of similar things positioned or occurring closely together. Under this assumption, the data tend to form distinct clusters, and points in the same cluster are likely to share a label (the output label). This gives rise to the idea of feature learning with clustering algorithms.
3)- Manifold Assumption – Here a manifold is a lower-dimensional surface embedded in the input space. Under this assumption, the data lie approximately on a manifold of much lower dimension than the input space. The manifold can be learned from both the labelled and the unlabelled data, which helps avoid the curse of dimensionality.
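The continuity and cluster assumptions can be made concrete with a small sketch. The Python code below is only an illustration, not a standard algorithm from any library: the function name `propagate_labels` and the toy points are hypothetical. It repeatedly takes the unlabelled point that is closest to an already labelled point and gives it that neighbour's label, so labels spread through each cluster one nearby point at a time.

```python
import math


def propagate_labels(labelled, unlabelled):
    """Spread labels to unlabelled points through their nearest labelled neighbour.

    labelled:   list of (point, label) pairs, each point a tuple of floats
    unlabelled: list of points (tuples of floats)
    Returns a dict mapping every unlabelled point to its assigned label.
    """
    known = list(labelled)
    remaining = list(unlabelled)
    assigned = {}
    while remaining:
        # Pick the unlabelled point that lies closest to any labelled point.
        _, idx, label = min(
            ((math.dist(u, p), i, lab)
             for i, u in enumerate(remaining)
             for p, lab in known),
            key=lambda t: t[0],
        )
        point = remaining.pop(idx)
        assigned[point] = label
        # The newly labelled point can now pass its label on to its neighbours.
        known.append((point, label))
    return assigned


# Toy data (an assumption for illustration): two labelled points, six unlabelled.
labelled = [((0.0, 0.0), "a"), ((10.0, 10.0), "b")]
unlabelled = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0), (4.0, 4.0), (8.0, 8.0), (9.0, 9.0)]
assigned = propagate_labels(labelled, unlabelled)
print(assigned)
```

With this data, the four points near the origin end up with label "a" and the two points near (10, 10) with label "b", even though only one point per group was labelled at the start.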
- Features: The distinct traits that can be used to describe each item in a quantitative manner.
- Feature Vector: An n-dimensional vector of numerical features that represents an object.
- Instance Space X: The set of all possible objects describable by the features.
- Concept C: A subset of objects from X.
- Training Data S: The collection of examples from which the algorithm learns.
- Target Function f: Maps each instance x in X to a target label y in Y.
- Example: An instance x together with its label y = f(x).
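The terms above can be tied together in a few lines of Python. This is a toy sketch under stated assumptions: the two binary features, the rule inside `f`, and the split into labelled and unlabelled data are all invented for illustration and do not come from any real dataset.

```python
from itertools import product

# Features: two binary traits describe each object (hypothetical choice).
FEATURES = ("has_wings", "lays_eggs")

# Instance space X: every feature vector these features can describe.
X = list(product([0, 1], repeat=len(FEATURES)))


def f(x):
    """Hypothetical target function: label 1 if the object has wings and lays eggs."""
    has_wings, lays_eggs = x
    return 1 if has_wings and lays_eggs else 0


# Concept C: the subset of X that the target function maps to label 1.
C = {x for x in X if f(x) == 1}

# Training data S in the semi-supervised setting:
# a few labelled examples (x, f(x)) plus instances whose labels are unknown.
labelled = [(x, f(x)) for x in X[:2]]
unlabelled = X[2:]

print("X:", X)
print("C:", C)
print("labelled:", labelled)
print("unlabelled:", unlabelled)
```

In practice the target function is unknown and only the labelled examples reveal its values; it is written out here only so that the sketch runs on its own.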
There are four basic families of methods used in semi-supervised learning:
- Generative Models
- Low-density Separation
- Graph-based Methods
- Heuristic Approaches
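As one illustration of the heuristic family, the sketch below shows self-training, a common wrapper technique: train on the labelled data, label the unlabelled points the model is confident about, add them to the training set, and repeat. Everything here is an assumption made for the example: the one-dimensional data, the nearest-centroid classifier, the helper names, and the `min_margin` confidence threshold. It is a minimal sketch, not a reference implementation.

```python
def centroids(labelled):
    """Mean feature value per label, from a list of (x, label) pairs."""
    sums, counts = {}, {}
    for x, y in labelled:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}


def predict(cents, x):
    """Return (label, margin) for x; needs at least two classes.

    The margin is the distance to the second-closest centroid minus the
    distance to the closest one, used here as a crude confidence score.
    """
    ranked = sorted(cents, key=lambda y: abs(x - cents[y]))
    best, second = ranked[0], ranked[1]
    return best, abs(x - cents[second]) - abs(x - cents[best])


def self_train(labelled, unlabelled, min_margin=1.0, max_rounds=10):
    """Grow the labelled set with confident predictions, then refit."""
    labelled = list(labelled)
    pool = list(unlabelled)
    for _ in range(max_rounds):
        cents = centroids(labelled)
        scored = [(x, *predict(cents, x)) for x in pool]
        confident = [(x, y) for x, y, margin in scored if margin >= min_margin]
        if not confident:
            break  # nothing left that the model is sure about
        labelled.extend(confident)
        taken = {x for x, _ in confident}
        pool = [x for x in pool if x not in taken]
    return centroids(labelled), pool


# Toy data: two labelled points and seven unlabelled ones.
labelled = [(1.0, "low"), (9.0, "high")]
unlabelled = [0.5, 1.5, 2.0, 8.0, 8.5, 9.5, 5.0]
cents, leftover = self_train(labelled, unlabelled)
print(cents)
print(leftover)
```

The point at 5.0 sits exactly between the two groups, so its margin is zero and it stays unlabelled. Leaving ambiguous points out is a deliberate design choice in this sketch, since adding wrong pseudo-labels would reinforce the model's own mistakes.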
Practical Applications of Semi-supervised Learning –
- Speech Analysis
- Internet Content Classification
- Protein Sequence Classification