CLASSIFICATION OF CANCER OF THE LUNGS USING SVM AND ANN

Accurate cancer diagnosis plays an important role in saving human lives. Diagnostic results reported by medical experts often differ, depending on each expert's experience, and this variability can put the lives of cancer patients at risk. A fast and effective method to detect lung nodules and to separate cancer images from those of other lung diseases, such as tuberculosis, is increasingly needed: the incidence of lung cancer has risen dramatically in recent years, and early detection can save thousands of lives each year. The focus of this paper is to compare the performance of the ANN and SVM classifiers on cancer datasets acquired online. The performance of both classifiers is evaluated using several measures, namely accuracy, sensitivity, specificity, true positives, true negatives, false positives and false negatives.


INTRODUCTION
The challenge facing medical practitioners gives this study much of its significance: lung cancer is difficult to detect in its early stages because symptoms appear only in the advanced stages, which is why its mortality rate is the highest among all types of cancer. Accurate diagnosis of the different types of cancer plays an important role in helping doctors determine and choose the proper treatment. Undeniably, the decisions made by doctors are the most important factor in diagnosis, but the application of various AI classification techniques has lately proven helpful in facilitating their decision-making. Classification techniques can minimize possible errors that might otherwise be made by inexperienced doctors, and they can examine medical data in a shorter time and more precisely.
The critical task is to define and specify a good feature space, that is, the type of features that will discriminate between nodules and non-nodules, malignant and benign, and so on. Interpretation of a chest radiograph is extremely challenging: superimposed anatomical structures make the image complicated. Even experienced radiologists have trouble distinguishing infiltrates from the normal pattern of branching blood vessels in the lung fields, or detecting subtle nodules that indicate lung cancer. When radiologists rate the severity of abnormal findings, large inter-observer and even intra-observer differences occur. The clinical importance of chest radiographs, combined with their complicated nature, explains the interest in developing computer algorithms to assist radiologists in reading chest images. Some problems cannot be corrected with current methods of training and high levels of clinical skill and experience: these include the miss rate in the detection of small pulmonary nodules, the detection of minimal interstitial lung disease and the detection of changes in pre-existing interstitial lung disease. Although traditional classifiers have long been used to classify images, the back-propagation algorithm of the ANN is a good choice for the classification of cancer and tuberculosis images; this supervised training algorithm produces results faster than other traditional classifiers [1]. Where still more efficient results are needed, the support vector machine maps images into a hyperplane, separating them into two linearly distinct regions and thereby enabling classification.
Hence, the problem to solve for early diagnosis of lung cancer is the reduction of the number of false-positive (FP) classifications while maintaining a high degree of true-positive (TP) diagnoses, i.e., sensitivity. Several methods have been proposed to reduce the number of FPs while maintaining a high sensitivity.
Conventional projection radiography is a simple, cheap, and widely used clinical test. Unfortunately, its capability to detect lung cancer in its early stages is limited by several factors, both technical and observer-dependent. Lesions are relatively small and usually contrast poorly with the surrounding anatomical structure. This partially explains why radiologists are commonly credited with low sensitivity in nodule detection, ranging from 60 to 70% [2]. Lung cancer is the primary cause of tumor deaths for both sexes in most countries. There are four stages of lung cancer, I to IV, of rising gravity. If the cancer is detected at stage I, while it is no more than 30 mm in diameter, the survival rate is about 67%; at stage IV, less than a 1% chance is left. Early detection and treatment at stage I therefore give a high survival rate. Unfortunately, lung cancer is usually detected late because of the lack of symptoms in its early stages. This is why lung screening programs have been investigated to detect pulmonary nodules: small lesions, calcified or not, almost spherical in shape or with irregular borders. The nodule definition for thoracic CT of the Fleischner Society is "a round opacity, at least moderately well margined and no greater than 3 cm in maximum diameter" [3].
Approximately 40% of lung nodules are malignant, that is, cancerous; the rest are usually associated with infections. Because malignancy depends on many factors, such as patient age, nodule shape, doubling time and the presence of calcification, further exams are necessary after the initial nodule detection to obtain a diagnosis. In computer vision, segmentation refers to the process of partitioning a digital image into multiple regions or sets of pixels. The pixels in a region are similar with respect to some characteristic or computed property, such as color, intensity, or texture, while adjacent regions differ significantly with respect to the same characteristics [4], [5], [6].
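The segmentation step described here can be illustrated with a minimal Python sketch (the paper's own implementation is in Matlab): a simple intensity threshold partitions a toy grayscale array into a bright, nodule-like region and the background. The function name and the toy values are illustrative, not taken from the paper.

```python
import numpy as np

def threshold_segment(image, threshold):
    """Partition a grayscale image into two regions by intensity:
    pixels at or above the threshold form the foreground mask."""
    return image >= threshold

# Toy 4x4 "scan": a bright, nodule-like blob on a dark background.
scan = np.array([[10, 12, 11, 10],
                 [11, 90, 95, 12],
                 [10, 92, 96, 11],
                 [12, 11, 10, 10]])

mask = threshold_segment(scan, 50)
print(mask.sum())  # number of foreground pixels -> 4
```

Real segmentation pipelines use adaptive thresholds or region growing, but the principle of grouping pixels by a shared property is the same.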
Early diagnosis has an important prognostic value and a huge impact on treatment planning (Cancer Facts and Figures 2001). As nodules are the most common sign of lung cancer, nodule detection in CT scan images is a main diagnostic problem.
A thorough review of the drawbacks affecting conventional chest radiography is given, for example, by [7]. However, several long-term studies carried out in the 1980s on large clinical data sets have shown that up to 90% of nodules may be correctly perceived retrospectively [8]. In addition, detection sensitivity can be increased to more than 80% when a radiograph is read independently by two radiologists. Furthermore, sensitivity is expected to increase with the widespread use of digital radiography systems, which are characterized by an extended dynamic range and a better contrast resolution than conventional film radiography. In view of this, the availability of efficient and effective computer-aided diagnosis (CAD) systems is highly desirable [9]. Such systems are usually conceived to provide the physician with a second opinion [10], focusing his or her attention on suspicious image zones and playing the role of a "second reader".
The aim of this paper is to compare the performance of the ANN and SVM classifiers on cancer datasets acquired online. The performance of both classifiers is evaluated using several measures: accuracy, sensitivity, specificity, true positives, true negatives, false positives and false negatives. [11] proposed a fully automated method for multiple sclerosis (MS) lesion segmentation in T1-weighted MR imaging. [12] proposed abnormality detection from CT images of different diseases; the model was built on a decision tree classifier able to predict general abnormality in the human brain, was evaluated using the hold-out method and N-fold cross-validation, and reached an SVM classification accuracy of 96%. [13] opined that the K-Nearest Neighbor (K-NN) classification technique is conceptually and computationally the simplest technique that provides good classification accuracy; the k-NN algorithm is based on a distance function and a voting function over the k nearest neighbors, the metric employed being the Euclidean distance. This paper evaluates and compares the performance of SVM and ANN in analyzing Idiopathic Pulmonary Fibrosis (IPF) and Chronic Obstructive Pulmonary Disease (COPD), two major diseases of the lungs.

METHODOLOGY
The stages of the system development are as shown in figure 3.1 and these include image acquisition, image preprocessing, image segmentation, feature extraction and classification. The implementation is carried out on Matlab 7.10a.

Image (Data) Acquisition
The images used for this paper were obtained from an online database of lung images. The database provides a repository of images that can be downloaded and regenerated in the Matlab environment; some of them are stored for research purposes and for other image-processing analyses. After the images were obtained from the online source, a database containing them was created in the Matlab environment, and the images were loaded from this database by a Matlab routine.

Image Pre-processing
Pre-processing of the images is necessary before any image analysis can be carried out. It involves conversion to gray-scale and removal of objects that could affect the proper processing of the images. The main aim of image pre-processing is to suppress unwanted noise and to enhance the image features important for further analysis; it is usually specific to the type of noise present in the image. (For example, for an image with poor brightness and contrast, histogram equalization can be used to improve both.) In the analysis of medical images, pre-processing is avoided unless it is strictly necessary, because it typically decreases the information content of the image.
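The histogram-equalization example mentioned above can be sketched in a few lines of Python: each gray level is mapped through the normalized cumulative histogram, stretching a low-contrast image over the full dynamic range. The helper name and the toy image are assumptions for illustration, not the paper's code.

```python
import numpy as np

def hist_equalize(image, levels=256):
    """Classic histogram equalization: map each gray level through the
    normalized cumulative histogram to spread out intensities."""
    hist = np.bincount(image.ravel(), minlength=levels)
    cdf = hist.cumsum()
    cdf = cdf / cdf[-1]                       # normalize to [0, 1]
    return np.round(cdf[image] * (levels - 1)).astype(np.uint8)

# A low-contrast image whose values cluster in a narrow band (100..104).
low_contrast = np.array([[100, 100, 101],
                         [101, 102, 102],
                         [103, 103, 104]], dtype=np.uint8)
eq = hist_equalize(low_contrast)
print(eq.min(), eq.max())  # the 5-level range is stretched toward 0..255
```

Note how the output range is far wider than the original spread of 4 gray levels, which is exactly the contrast improvement the text describes.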

Pre-processing to Grey Scale
A major pre-processing step is conversion to grayscale. Most acquired images are in color, and the way to process such images here is to convert them to gray scale first. An RGB image is a three-dimensional array whose dimensions are the rows, the columns and the three color channels.
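The conversion can be sketched as a weighted sum over the color-channel axis. This is a minimal Python illustration using the common BT.601 luminance weights; the paper does not state which weighting its Matlab conversion used, so the weights here are an assumption.

```python
import numpy as np

def rgb_to_gray(rgb):
    """Convert an M x N x 3 RGB array to grayscale using the common
    BT.601 luminance weights (an assumed, not paper-specified, choice)."""
    weights = np.array([0.299, 0.587, 0.114])
    return rgb @ weights  # weighted sum over the color-channel axis

# 2x2 RGB image: pure red, pure green, pure blue, and white pixels.
rgb = np.array([[[255, 0, 0], [0, 255, 0]],
                [[0, 0, 255], [255, 255, 255]]], dtype=float)
gray = rgb_to_gray(rgb)
print(gray.shape)  # (2, 2) -- one intensity value per pixel
```

The white pixel maps to 255 because the three weights sum to 1, confirming that the transform preserves the full intensity scale.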

3.3.1 Texture Features
Texture is a very useful characterization for a wide range of images. It is generally believed that human visual systems use texture for recognition and interpretation. In general, color is a pixel property, while texture can only be measured over a group of pixels. A large number of techniques have been proposed to extract texture features. Based on the domain from which the texture feature is extracted, they can be broadly classified into spatial and spectral texture feature extraction methods. In the former, texture features are extracted by computing pixel statistics or finding local pixel structures in the original image domain, whereas the latter transforms the image into the frequency domain and then calculates features from the transformed image.
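As a concrete example of the spatial, statistics-based approach, the following Python sketch computes first-order texture features of a region: mean, variance, and gray-level entropy from the pixel histogram. The function and the toy patches are illustrative, not the paper's actual feature set.

```python
import numpy as np

def first_order_texture(region, levels=8):
    """Spatial (first-order) texture features computed from the pixel
    statistics of a region: mean, variance, and gray-level entropy."""
    region = np.asarray(region, dtype=float)
    mean = region.mean()
    var = region.var()
    # Histogram-based entropy over a coarse gray-level quantization.
    hist, _ = np.histogram(region, bins=levels, range=(0, 256))
    p = hist / hist.sum()
    p = p[p > 0]
    entropy = -(p * np.log2(p)).sum()
    return mean, var, entropy

flat = np.full((8, 8), 128)                                # uniform patch
noisy = np.random.default_rng(0).integers(0, 256, (8, 8))  # rough patch
print(first_order_texture(flat)[2], "<", first_order_texture(noisy)[2])
```

The uniform patch has zero variance and zero entropy, while the rough patch scores high on both, which is what lets such statistics separate smooth from textured tissue.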

Feature Selection
Feature selection (also known as subset selection) is a process commonly used in machine learning, wherein a subset of the features available from the data is selected for application of a learning algorithm. The best subset contains the least number of dimensions that contributes to high accuracy; we discard the remaining, unimportant dimensions.

3.4.1 Forward Selection
This selection process starts with no variables and adds them one by one, at each step adding the one that decreases the error the most, until any further addition does not significantly decrease the error. We use a simple ranking-based feature selection criterion, a two-tailed t-test, which measures the significance of a difference of means between two distributions and therefore evaluates the discriminative power of each individual feature in separating two classes.
The features are assumed to come from normal distributions with unknown but equal variances. Since the correlation among features is completely ignored by this ranking method, redundant features can inevitably be selected, which ultimately affects the classification results. Therefore, we use the ranking method to select the more discriminative features, e.g. by applying a cut-off ratio (p-value < 0.1), and then apply a feature subset selection method on the reduced feature space.
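The ranking criterion described above can be sketched as follows: a pooled-variance two-sample t-statistic is computed per feature, and the feature with the largest absolute statistic is the most discriminative. The synthetic data and function name are assumptions for illustration only.

```python
import numpy as np

def t_statistic(a, b):
    """Two-sample t-statistic assuming equal (pooled) variances,
    matching the ranking criterion described in the text."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled * (1 / na + 1 / nb))

rng = np.random.default_rng(1)
# Two classes, three features; only feature 0 actually differs in mean.
class1 = rng.normal([5.0, 0.0, 0.0], 1.0, size=(30, 3))
class2 = rng.normal([0.0, 0.0, 0.0], 1.0, size=(30, 3))

scores = [abs(t_statistic(class1[:, j], class2[:, j])) for j in range(3)]
best = int(np.argmax(scores))
print(best)  # -> 0, the discriminative feature
```

In a full pipeline one would convert these statistics to p-values, apply the cut-off, and then run the forward-selection loop on the surviving features.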

3.5 Classification with Artificial Neural Networks (ANN)
Different types of Neural Networks (NN) have been proposed, but all of them have three things in common: the individual neurons, the connections between them (the architecture), and the learning algorithm; each type restricts the kinds of connections that are possible. Figure 3.4 shows an artificial neuron. The input to the neuron can come from the actual environment or from other neurons, and its output can be fed into other neurons or directly into the environment. The output of the neuron is constructed by taking the weighted sum of the inputs, called the net input or combination function (a vector-to-scalar function), transformed by a transfer function F, also called the activation function (a scalar-to-scalar function). This transfer function introduces nonlinearity into the system, which is what makes it so powerful. One of the most important methods for training neural networks is the Back Propagation Algorithm, a systematic method for training multilayer artificial neural networks that is built on a sound mathematical base. Back propagation is a gradient descent method in which the gradient of the error with respect to the weights is calculated for a given input by propagating the error backwards from the output layer to the hidden layer and further to the input layer.

Figure 3.4: Artificial Neuron
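The neuron just described, together with one back-propagation (delta-rule) update for it, can be sketched in a few lines of Python. The sigmoid transfer function, the learning rate and the toy values are illustrative choices, not parameters taken from the paper.

```python
import math

def neuron(inputs, weights, bias):
    """Single artificial neuron: weighted sum of inputs (the net input)
    passed through a sigmoid transfer function."""
    net = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-net))  # sigmoid introduces nonlinearity

def backprop_step(inputs, weights, bias, target, lr=0.5):
    """One gradient-descent update of the weights for squared error."""
    out = neuron(inputs, weights, bias)
    # dE/dnet for squared error with a sigmoid output unit
    delta = (out - target) * out * (1.0 - out)
    new_w = [w - lr * delta * x for w, x in zip(weights, inputs)]
    new_b = bias - lr * delta
    return new_w, new_b

x, w, b, t = [1.0, 0.5], [0.2, -0.4], 0.1, 1.0
before = abs(neuron(x, w, b) - t)
w, b = backprop_step(x, w, b, t)
after = abs(neuron(x, w, b) - t)
print(after < before)  # -> True: the error shrinks after one update
```

In a multilayer network the same delta is propagated backwards through the hidden layers, which is the "back propagation" the text refers to.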
In this paper, a total of 80 images was acquired: 48 images were passed into the neural network for training, and the remaining 32 images were used for testing and validation. These images were first treated, pre-processed and segmented before being passed into the neural network. The neural network parameters used are shown in table 3.1 (network structure 5-2-1).

3.6 Classification with Support Vector Machine (SVM)
SVM uses an optimum linear separating hyperplane to separate two sets of data in feature space, as shown in figure 9. This optimum hyperplane is produced by maximizing the minimum margin between the two sets; the resulting hyperplane therefore depends only on the border training patterns, called support vectors. The standard SVM is a linear classifier composed of a set of given support vectors z and a set of weights w.
The computation of the output of a given SVM with N support vectors z1, z2, ..., zN and weights w1, w2, ..., wN is then given by

F(x) = sign( Σ_{i=1}^{N} w_i K(z_i, x) + b ),        (2)

where K is the kernel function and b is the bias term. SVM maps input vectors to a higher-dimensional vector space where an optimal hyperplane is constructed. Linearly separable data may be analyzed with a hyperplane, while linearly non-separable data are analyzed with kernel functions such as the Gaussian RBF. The output of an SVM is thus a linear combination of the training examples projected onto a high-dimensional feature space through the kernel function. For this work, SVMs with linear and RBF (Radial Basis Function) kernels are used to classify the images into two classes, "Idiopathic Pulmonary Fibrosis" and "Chronic Obstructive Pulmonary Disease", labeled "1" and "2" respectively. Classification performance results are discussed in detail in the results section.
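A minimal Python sketch of this decision function follows: the output is the sign of the weighted kernel sum over the support vectors, shown here with a Gaussian RBF kernel and hand-set support vectors and weights. The paper's SVM was trained in Matlab; these values are purely illustrative.

```python
import math

def rbf_kernel(u, v, gamma=0.5):
    """Gaussian RBF kernel K(u, v) = exp(-gamma * ||u - v||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))

def svm_output(x, support_vectors, weights, bias, kernel):
    """SVM decision: sign of the weighted kernel sum over the support
    vectors plus a bias term, as in equation (2)."""
    score = sum(w * kernel(z, x) for w, z in zip(weights, support_vectors)) + bias
    return 1 if score >= 0 else 2   # class labels "1" and "2" as in the text

# Two hand-set support vectors, one per class (weight signs encode class).
svs = [(0.0, 0.0), (4.0, 4.0)]
ws = [1.0, -1.0]
print(svm_output((0.5, 0.5), svs, ws, 0.0, rbf_kernel))  # -> 1 (near class-1 SV)
print(svm_output((3.5, 3.5), svs, ws, 0.0, rbf_kernel))  # -> 2 (near class-2 SV)
```

In a trained SVM the weights and bias come from the margin-maximization optimization; here they are fixed by hand only to show how the decision function is evaluated.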

4. RESULTS AND DISCUSSION
The system user interface was designed and developed in Matlab 7.10a and run on a 4 GHz machine. The user interface is flexible: it loads an image and returns the results, which are displayed for inspection and further study. The interface works hand in hand with the artificial neural network tool, which is generated, simulated and loads the results.

Training Data
Chronic Obstructive Pulmonary Disease (COPD) is a preventable and treatable disease that makes it difficult to empty air out of the lungs. This difficulty in emptying air out of the lungs (airflow obstruction) can lead to shortness of breath or feeling tired because you are working harder to breathe. COPD is a term that is used to include chronic bronchitis, emphysema, or a combination of both conditions while Idiopathic Pulmonary Fibrosis (IPF) is a chronic progressive pulmonary disease of unknown etiology. It is primarily diagnosed on the basis of clinical, physiologic, and radiologic criteria.

Training with Neural Network
The artificial neural network extracts features such as texture and roughness from the images and, together with the training sets, performs the classification.

Results with Artificial Neural Network
The artificial neural network classifies with the given number of neurons it is trained with. The results obtained with it are shown in table 4.1 below. The neural network gives an accurate classification of the results, with only minor misclassification. These results differ from those of the support vector machine.

Results with Support Vector Machine
For the support vector machine, the same total of 80 images containing COPD and IPF was used: 48 images were passed into the database for training and 32 images were used for testing. The support vector machine's classification results generate the confusion matrix shown below.

Performance Evaluation Metrics
The performance evaluation metrics used are: Accuracy, Specificity, Sensitivity, True Positive, True Negative, False Positive and False Negative.
Accuracy is defined as the proportion of correct classifications: the number of correct classifications divided by the total number of correct and incorrect classifications, multiplied by 100.
Sensitivity is defined as true positives divided by the total of true positives and false negatives, while specificity is defined as true negatives divided by the total of true negatives and false positives.
A true positive (Tp) occurs when a diseased image is correctly classified as diseased, while a false negative (Fn) occurs when a diseased image is incorrectly classified as normal.
A true negative (Tn) occurs when a normal image is correctly classified as normal, while a false positive (Fp) occurs when a normal image is incorrectly classified as diseased.
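These definitions can be turned into a short Python helper that computes accuracy, sensitivity and specificity from the four confusion-matrix counts. The counts below are hypothetical, chosen only to exercise the formulas; they are not the paper's results.

```python
def metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity and specificity from confusion-matrix
    counts, with positive = diseased (standard definitions)."""
    accuracy = 100.0 * (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)   # fraction of diseased cases caught
    specificity = tn / (tn + fp)   # fraction of normal cases cleared
    return accuracy, sensitivity, specificity

# Hypothetical counts for a 32-image test set (not the paper's data).
acc, sens, spec = metrics(tp=14, tn=15, fp=1, fn=2)
print(round(acc, 2), round(sens, 2), round(spec, 2))
```

With these counts the accuracy is (14 + 15) / 32 x 100 = 90.625%, illustrating how the percentages reported in the results tables are derived.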

Total Recognition time
Recognition occurs when the classifier correctly identifies an image of the kind it has been trained on. Recognition time plays an important role in the classification of medical images, because a higher recognition time can lead to more memory consumption, which can affect the corresponding classification performance. The Support Vector Machine posts a shorter training time of 0.284 seconds, as opposed to 0.468 seconds for the Artificial Neural Network. The result is shown in table 4.3 above.

5. CONCLUSION AND RECOMMENDATION
In this paper, a performance evaluation of the Artificial Neural Network and the Support Vector Machine was carried out on both COPD and pulmonary fibrosis. The results obtained show that the ANN outperforms the SVM, with a classification accuracy of 98.76% as opposed to 90.00% for the SVM. This is a result of the image cell tissue of COPD and pulmonary fibrosis, which is a good criterion for the neural network. The following are recommended:

- Classification of COPD and pulmonary fibrosis could be attempted with other classifiers, such as KNN and Bayes classifiers.
- Images of other diseases of the lungs, such as asthma, could also be evaluated and classified.
- Second- and higher-order features could also be extracted from the COPD and pulmonary fibrosis images.
- Future work should also address the training time of the neural network in the classification of COPD and pulmonary fibrosis.