SVM: Support Vector Machines

Introduction
Many data mining techniques often suffer from under fitting or over fitting the data. Under fitting occurs when the algorithm used has not the capacity to express the variability in the data. Over fitting is the opposite case, the algorithm has too much capacity and therefore it also fits noise present in the data. The cause of under and over fitting is the complexity which the model represents which determines how much variability in the data the model will express. If too much complexity is allowed, the variability due to noise in the data is modelled as well which causes a model which over fits patterns in the data. If the complexity is too low the models will fail to account for the true variability in the data and therefore under fit the data. One wants to use models which represent the optimal complexity and one wants to be able to control it to obtain good generalisation properties. Good generalisation means the model's ability to predict unseen data based on the known learning data sets. This is the focus of machine learning algorithms, being able to control the accuracy and robustness of the models which are generated by the machine learning algorithms, which represents patterns in the data.

Where classical statistics deal with large sample size problems, statistical learning theory is the first theory which is able to address also small sample learning problems. The complexity of models generated with machine learning algorithms using empirical data is dependent on the sample size. By taking the sampling size into account one can obtain better results than by applying asymptotic results from classical statistics.

Vladimir Vapnik developed in the mid 1980's the statistical learning theory (SLT), from which a new kind of learning machine, called Support Vector Machine (SVM) was introduced.

The basic idea of support vector machines is to determine a classifier or regression machine which minimizes the empirical risk (that is, the training set error) and the confidence interval (which corresponds to the generalization or test set error).

In SVMs, the idea is to fix the empirical risk associated with an architecture and then to use a method to minimize the generalization error. The primary advantage of SVM as adaptive models for binary classification and regression is that they provide a classifier with minimal VC dimension which implies low expected probability of generalization errors. SVM can be used to classify linearly separable data and nonlinearly separable data. They can be used as nonlinear classifiers and regression machines by mapping the input space to a high dimensional feature space. In this high dimensional feature space, linear classification can be performed.

Text Classification Techniques
Text classification techniques can be divided into rule based oriented approaches, machine learning techniques and techniques based on natural language processing. The first group will only work on simple well defined cases but in general there is the risk of not being able to generalize well for real world data. The last approach requires very detailed knowledge of natural languages which is still a research area on its own and will require detailed input knowledge for each text classification case. Machine learning techniques have proven to be able to produce very good results and in many cases at least equally good or much better then other techniques. It is also know that machine learning techniques generalize very well for many different cases with different requirements. These features are even truer for support vector machines (svm), which are a subset of machine learning techniques. The support vector machine algorithms are discovered by Vladimir Vapnik in 1975 and from 1990-1995 till now there it has become a fast growing research field. Support vector machines algorithms basically focus on descriptive problems such classification, clustering and segmentation and predictive problems such as regression. The algorithms work for multivariate data, such as structured data, table data in databases, and also for text data which is in most cases considered unstructured.

The machine learning task for text classification has a set of requirements which can be characterized by the following properties:

Large input space The input space consists of the documents and the words within those documents whereby the text data is representation using the vector space model. For each document a vector representation is used which contains the frequency distribution of all words within the document. The dimensionality of the data is then easy larger than 30.000 words even with strong simplifications.
Little training data For most learning algorithms, the required number of training examples to produce accurate classifications scales with the dimensionality of the input space. For accurate text classifiers one normally has fewer training examples then dimensions in the feature spaces and therefore it is important to experiment to determine the optimal text classifier given certain classification requirements.
Noise Most documents contain spelling mistakes, words which do not contribute to the information contained in the document. This is considered noise to the machine learning algorithms. It is therefore important to improve the signal to noise ratio of the feature vectors in the pre-processing phase.
Complex learning tasks Classification of text is generally based the semantic understanding of natural language by humans whereby the interpretation of the text and the context of the words is very important. Text classifiers must be able to approximate such complex tasks whereby they have to be accurate and robust.
Computational efficiency Training text classifiers in a high dimensional input space can be computational difficult and for practical approaches it is important to be able to handle large number of features efficiently.

Support vector machines are currently the only approach which is computational efficient and for which there is a well-defined theory that describes the mechanisms with respect to accuracy and robustness.

Text Classification using Support Vector Machines 
Support Vector Machines are a method for creating functions from a set of labelled training data. The function can be a classification function or a general regression function.

For classification, SVMs operate by finding a hyper surface in the space of possible inputs. This hyper surface will attempt to split the positive examples from the negative examples. The split will be chosen to have the largest distance from the hyper surface to the nearest of the positive and negative examples. Intuitively, this makes the classification correct for testing data that is near, but not identical to the training data.

 

Home Contact Sitemap Disclaimer