Fundamentals of Artificial Intelligence @ UVT: October 2022

Thursday, 20 October 2022

Common issues with ML

Machine learning (ML), a form of artificial intelligence (AI), lets software programs predict outcomes more accurately without being explicitly programmed to do so. Machine learning algorithms use past data as input to forecast new output values.

Despite being applied in multiple industries, machine learning still has a lot of issues that cannot be overlooked. Here are some of them:

1. Inadequate data

The most common issue is insufficient quantity or quality of the data used to train the model. While simple tasks may require only a few thousand samples, a more advanced task such as image recognition may need millions. Data quality problems include:

noisy data (data that machines cannot understand and interpret correctly)
incorrect data (caused, for example, by human error)
poor generalization (the data is too complex for the model to generalize from)

2. Overfitting

Overfitting occurs when a model learns the noise and erroneous entries in its training data set along with the underlying pattern. As a result, the model's performance on new data suffers. A related problem is biased training data. Let's take an example of a training data set where we have 1000 apples, 1000 bananas and 8000 papayas. The chances that the model will identify a lot of apples as papayas are pretty high, because papayas make up 80% of the training data set. Overfitting usually happens with flexible non-linear methods, so using simpler linear and parametric algorithms can reduce it. There are several other methods to reduce overfitting:

increase the amount of training data
use early stopping during the training phase
reduce the noise in the data set
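
To make the idea of fitting noise concrete, here is a minimal sketch on invented toy data (the points, the mislabeled sample at x = 2 and the function names are all assumptions for illustration). A model that memorizes the training set scores perfectly on it but worse on new points than a simple one-threshold rule:

```python
# Toy 1-D data: the true rule is "label 1 when x >= 5".
# The training point x = 2 carries a wrong label on purpose (noise).
train = [(0, 0), (1, 0), (2, 1), (3, 0), (4, 0),
         (5, 1), (6, 1), (7, 1), (8, 1), (9, 1)]
test = [(0.4, 0), (1.8, 0), (2.3, 0), (3.6, 0),
        (4.4, 0), (5.2, 1), (6.7, 1), (8.1, 1)]


def memorizer(x):
    """1-nearest-neighbour: return the label of the closest training point."""
    nearest = min(train, key=lambda point: abs(point[0] - x))
    return nearest[1]


def threshold(x):
    """Simple parametric rule with a single threshold."""
    return 1 if x >= 5 else 0


def accuracy(model, data):
    """Fraction of samples in `data` that `model` labels correctly."""
    correct = sum(1 for x, label in data if model(x) == label)
    return correct / len(data)


results = {
    "memorizer": (accuracy(memorizer, train), accuracy(memorizer, test)),
    "threshold": (accuracy(threshold, train), accuracy(threshold, test)),
}
for name, (train_acc, test_acc) in results.items():
    print(f"{name}: train={train_acc:.2f} test={test_acc:.2f}")
```

The memorizer reproduces the noisy label and repeats the mistake for nearby test points, while the simpler rule ignores it. The gap between training and test accuracy is the usual symptom to look for.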

3. Underfitting

Underfitting is the opposite of overfitting: the model is too simple to capture the pattern in the data. It occurs, for example, when we have limited data and try to fit a linear model to non-linear data. Some methods to reduce underfitting:

increase the number of features
increase the number of epochs
reduce the noise in the data set
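
As a small illustration of a linear model on non-linear data, the sketch below fits a straight line to points on a parabola and then fits the same kind of line on an engineered squared feature. The data and the helper names fit_line and mse are made up for this example:

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a * x + b with a single feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


def mse(xs, ys, a, b):
    """Mean squared error of the line a * x + b on the given points."""
    return sum((a * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]  # non-linear target

# Linear model on the raw feature x: too simple, so it underfits.
a_raw, b_raw = fit_line(xs, ys)
mse_raw = mse(xs, ys, a_raw, b_raw)

# The same linear model on the engineered feature x ** 2.
squared = [x ** 2 for x in xs]
a_sq, b_sq = fit_line(squared, ys)
mse_squared = mse(squared, ys, a_sq, b_sq)

print(f"raw feature: mse={mse_raw:.2f}, squared feature: mse={mse_squared:.2f}")
```

On the raw feature the best line is flat and leaves a large error, while the richer feature lets the same algorithm fit the data. This is the intuition behind giving an underfitting model more informative features.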

4. Lack of explainability

Many machine learning models suffer from a lack of explainability: it is hard to say why the model produced a given result. As models grow more complex, their results get increasingly difficult to comprehend and the model becomes very hard to reverse-engineer, which reduces trust in its validity. Unfortunately, sophisticated machine learning techniques often don't offer the required transparency or clarity.

5. Slow implementation

Although machine learning models are quite effective at producing accurate predictions, producing them sometimes takes a long time. The most common causes are slow programs, data overload, and excessive requirements. To get the best results, a model also needs ongoing maintenance and monitoring.

Bibliography:

https://www.geeksforgeeks.org/7-major-challenges-faced-by-machine-learning-professionals/

https://www.javatpoint.com/issues-in-machine-learning

https://www.hyperon.io/blog/common-problems-with-machine-learning-that-companies-face

Computer Aided Diagnosis for Diabetes Diagnostic using Machine Learning algorithms

In medical imaging, Computer Aided Diagnosis (CAD) is a rapidly growing, dynamic area of research. In recent years, significant efforts have been made to improve computer aided diagnosis applications, because errors in medical diagnostic systems can result in seriously misleading medical treatments. Machine learning is important in Computer Aided Diagnosis: objects such as organs may not be described accurately by a simple equation, so pattern recognition fundamentally involves learning from examples. In the biomedical field, pattern recognition and machine learning promise improved accuracy in the perception and diagnosis of disease. They also promote the objectivity of the decision-making process. For the analysis of high-dimensional and multimodal biomedical data, machine learning offers a worthy approach for building sophisticated and automatic algorithms.

Many researchers have worked on different machine learning algorithms for disease diagnosis, and they have accepted that machine learning algorithms work well in the diagnosis of different diseases. In this survey paper, the diseases diagnosed by machine learning techniques (MLT) are heart disease, diabetes, liver disease, dengue and hepatitis.

Iyer et al. [11] predicted diabetes using decision tree and Naive Bayes classifiers. Diabetes occurs when the production of insulin is insufficient or insulin is used improperly. The data set used in this work is the Pima Indian diabetes data set. Various tests were performed using the WEKA data mining tool. On this data set, the percentage split (70:30) predicts better than cross validation. J48 shows 74.8698% and 76.9565% accuracy using cross validation and percentage split, respectively. Naive Bayes achieves 79.5652% accuracy using percentage split (PS). Both algorithms show their highest accuracy with the percentage split test.
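
The two evaluation protocols compared above differ only in how the samples are divided between training and testing. The sketch below shows that difference on sample indices. It is not WEKA's implementation: the function names, the seed and the choice of 10 folds are assumptions, since the text does not state the number of folds.

```python
import random


def percentage_split(n_samples, train_fraction=0.7, seed=0):
    """Shuffle indices once and cut them into a train part and a test part."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    cut = round(n_samples * train_fraction)
    return indices[:cut], indices[cut:]


def k_fold_splits(n_samples, k=10, seed=0):
    """Yield (train, test) index lists; every sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    parts = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = parts[i]
        train = [idx for j, part in enumerate(parts) if j != i for idx in part]
        yield train, test


# Hypothetical data set of 100 samples.
train_idx, test_idx = percentage_split(100)
folds = list(k_fold_splits(100, k=10))

print(len(train_idx), len(test_idx))   # sizes of the 70:30 split
print(len(folds), len(folds[0][1]))    # number of folds, test size per fold
```

A percentage split yields one accuracy figure that depends on which samples land in the test part. Cross validation averages over k test parts, which is one possible reason the two protocols report different accuracies for the same classifier.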

Meta-learning algorithms for diabetes diagnosis have been discussed by Sen and Dash [12]. The data set employed is the Pima Indians diabetes data set, obtained from the UCI Machine Learning Repository. WEKA is used for analysis. CART, Adaboost, Logiboost and grading learning algorithms are used to predict whether a patient has diabetes or not. Experimental results are compared on the basis of correct and incorrect classification. CART offers 78.646% accuracy, Adaboost 77.864%, Logiboost 77.479%, and Grading has a correct classification rate of 66.406%. CART offers the highest accuracy, 78.646%, and a misclassification rate of 21.354%, which is smaller than that of the other techniques.
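
The misclassification rate quoted for CART is simply the complement of its accuracy. A short sketch using only the accuracy figures reported above (the variable names are illustrative):

```python
# Accuracy values as reported in the text, in percent.
accuracy_percent = {
    "CART": 78.646,
    "Adaboost": 77.864,
    "Logiboost": 77.479,
    "Grading": 66.406,
}

# Misclassification rate = 100% - accuracy.
misclassification_percent = {
    name: round(100 - acc, 3) for name, acc in accuracy_percent.items()
}

# The best technique is the one with the smallest misclassification rate.
best = min(misclassification_percent, key=misclassification_percent.get)

for name, err in misclassification_percent.items():
    print(f"{name}: {err:.3f}% misclassified")
print("lowest misclassification rate:", best)
```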

An experimental work to predict diabetes was done by Kumari and Chitra [13]. The machine learning technique used in this experiment is SVM, with an RBF kernel used for classification. The Pima Indian diabetes data set is provided by the machine learning repository at the University of California, Irvine. MATLAB 2010a was used to conduct the experiment. SVM offers 78% accuracy.
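
For readers unfamiliar with the RBF kernel, here is a minimal sketch of the kernel function itself, K(x, y) = exp(-gamma * ||x - y||^2). The value of gamma and the sample vectors are arbitrary assumptions, and this is not the MATLAB code used in [13]:

```python
import math


def rbf_kernel(x, y, gamma=0.5):
    """Radial basis function kernel: exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


# Hypothetical feature vectors.
same = rbf_kernel([1.0, 2.0], [1.0, 2.0])   # identical points
near = rbf_kernel([0.0, 0.0], [1.0, 1.0])   # squared distance 2
far = rbf_kernel([0.0, 0.0], [5.0, 5.0])    # squared distance 50

print(same, near, far)
```

The kernel equals 1 for identical points and decays towards 0 as the points move apart, which is what lets an SVM draw non-linear decision boundaries.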

Sarwar and Sharma [14] proposed using Naive Bayes to predict Type-2 diabetes. Diabetes has three types: Type-1 diabetes, Type-2 diabetes and gestational diabetes. Type-2 diabetes results from the development of insulin resistance. The data set consists of 415 cases and, for the sake of variety, the data were gathered from different sectors of society in India. MATLAB with SQL Server is used to develop the model. Naive Bayes achieves 95% correct prediction.
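
To show how a Naive Bayes classifier makes a prediction, here is a minimal Gaussian Naive Bayes sketch. The text does not say which Naive Bayes variant was used, so the Gaussian form is an assumption. The two features (glucose, BMI) and all the numbers are invented toy data, not taken from any of the cited studies.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(X, y):
    """Estimate a prior and per-feature mean and variance for each class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # A small constant keeps the variance strictly positive.
        variances = [sum((v - m) ** 2 for v in col) / n + 1e-9
                     for col, m in zip(columns, means)]
        model[label] = (n / len(X), means, variances)
    return model


def predict_gaussian_nb(model, row):
    """Return the class with the highest log posterior for one sample."""
    def log_posterior(label):
        prior, means, variances = model[label]
        score = math.log(prior)
        for v, m, var in zip(row, means, variances):
            score += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        return score
    return max(model, key=log_posterior)


# Invented toy samples: [glucose, BMI]; label 1 = diabetic, 0 = not diabetic.
X = [[85, 22], [90, 25], [95, 24], [150, 33], [160, 35], [170, 38]]
y = [0, 0, 0, 1, 1, 1]

model = fit_gaussian_nb(X, y)
print(predict_gaussian_nb(model, [88, 23]), predict_gaussian_nb(model, [165, 36]))
```

The "naive" part is the inner loop: each feature contributes its own term to the score, as if the features were independent given the class.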

Ephzibah [15] has constructed a model for diabetes diagnosis. The proposed model combines a genetic algorithm (GA) and fuzzy logic. It is used to select the best subset of features and to improve classification accuracy. For the experiment, the data set is taken from the UCI Machine Learning Repository and has 8 attributes and 769 cases. MATLAB is used for implementation. Using the genetic algorithm, only the three best features/attributes are selected. These three attributes are used by the fuzzy logic classifier and provide 87% accuracy. The cost is around 50% less than the original cost. Table 2 provides a comprehensive view of machine learning techniques for diabetes diagnosis.

Analysis:

A Naive Bayes based system is helpful for the diagnosis of diabetes. Naive Bayes offers the highest accuracy, 95%, in 2012. The results show that this system can predict well with minimal error, and that this technique is important for diagnosing diabetes. In 2015, however, the accuracy offered by Naive Bayes is lower: 79.5652%, or about 79.57%. Note that the two results were obtained on different data sets (see Table 2), so they are not directly comparable. This proposed model for the detection of diabetes would require more training data for creation and testing. Figure 4 shows the accuracy of the algorithms for the diagnosis of diabetes over time.

Advantages and Disadvantages of Naive Bayes:

Advantages: It enhances the classification performance by eliminating the unrelated features. Its performance is good. It takes less computational time.

Machine Learning Technique	Author	Year	Disease	Source of Data Set	Tool	Accuracy
Naive Bayes	Iyer et al.	2015	Diabetes	Pima Indian Diabetes dataset	WEKA	79.5652%
J48	Iyer et al.	2015	Diabetes	Pima Indian Diabetes dataset	WEKA	76.9565%
CART	Sen and Dash	2014	Diabetes	Pima Indian Diabetes dataset from UCI	WEKA	78.646%
Adaboost	Sen and Dash	2014	Diabetes	Pima Indian Diabetes dataset from UCI	WEKA	77.864%
Logiboost	Sen and Dash	2014	Diabetes	Pima Indian Diabetes dataset from UCI	WEKA	77.479%
Grading	Sen and Dash	2014	Diabetes	Pima Indian Diabetes dataset from UCI	WEKA	66.406%
SVM	Kumari and Chitra	2013	Diabetes	UCI	MATLAB 2010a	78%
Naive Bayes	Sarwar and Sharma	2012	Type-2 diabetes	Different sectors of society in India	MATLAB with SQL Server	95%
GA + Fuzzy Logic	Ephzibah	2011	Diabetes	UCI	MATLAB	87%

https://html.scirp.org/file/1-9601348x5.png

Disadvantages: This algorithm needs a large amount of data to attain good outcomes. It is lazy, as it stores all of the training examples [16].

Wednesday, 19 October 2022

The most suitable multi-label classification method(s) for predicting antimicrobial resistance

The research question. The aim of this work is to identify which multi-label classification methods are appropriate for predicting antimicrobial resistance. To address this problem, we analyzed multi-drug resistance in the case of E. coli. The data set used is public and accessible at NCBI [5]. It consists of 363 records, and each record contains a set of 140 genes that underwent different types of mutations and therefore induce resistance to different classes of antibiotics. There are 34 antibiotics, and for each antibiotic it is specified whether the bacterium is susceptible or resistant to it. Five types of changes that might influence the role of a gene are taken into consideration: point-type mutation, partial mutation, partial end of contig, mistranslation and complete gene (no mutation).

The problem. Predicting the susceptibility or resistance of a microbe to several drugs is a multi-label classification problem. Unlike traditional classification, where an input can belong to only one class, in multi-label classification one input can belong to several classes [1]. This makes the classification problem more challenging, and traditional classification models cannot be applied directly.

Methodology and results. The methodology we followed consists of several steps:

Step 1. Analysis of the data characteristics with respect to the degree of multilabelness. The metrics used to characterize the distribution of labels over instances are: Cardinality (the mean number of labels per instance), Density (Cardinality divided by the total number of labels, i.e. how well the labels are represented in each instance) and PMin (the percentage of instances with only one active label) [1,6,7]. The values obtained for the investigated data set are 7.80 for Cardinality, 0.229 for Density and 1.49% for PMin, showing that the data set is indeed multi-labeled and cannot be treated as a single-label data set.
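
The three Step 1 metrics can be computed directly from a binary label matrix. The sketch below uses the definitions given in the text plus the usual definition of Density as Cardinality divided by the number of labels. The toy matrix and the function name are illustrative, not the authors' code.

```python
def multilabel_stats(Y):
    """Return (cardinality, density, pmin) for a 0/1 label matrix.

    Y is a list of rows, one per instance, with one 0/1 entry per label.
    """
    n_instances = len(Y)
    n_labels = len(Y[0])
    labels_per_instance = [sum(row) for row in Y]
    cardinality = sum(labels_per_instance) / n_instances
    density = cardinality / n_labels
    single_label = sum(1 for count in labels_per_instance if count == 1)
    pmin = 100 * single_label / n_instances  # in percent
    return cardinality, density, pmin


# Toy label matrix: 4 instances, 4 labels.
Y = [
    [1, 0, 0, 0],
    [1, 1, 0, 0],
    [0, 1, 1, 1],
    [1, 1, 1, 1],
]
cardinality, density, pmin = multilabel_stats(Y)
print(cardinality, density, pmin)

# Consistency check on the reported values: 7.80 labels on average
# out of 34 antibiotics gives the reported Density.
reported_density = round(7.80 / 34, 3)
print(reported_density)
```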

Step 2. Comparison of several multi-label classification methods. The selected methods are: the Binary Relevance (BR) method with Random Forest as the classifier algorithm [8], the Binary Relevance method using the kNN algorithm [8], Multi-label kNN (MLkNN) [9] and Classifier Chain (CC) using the Random Forest algorithm [10]. Hamming loss was used as the accuracy metric (smaller values denote higher accuracy), and the following results were obtained: 0.107 for BR with Random Forest, 0.117 for BR with kNN, 0.117 for MLkNN and 0.106 for CC with Random Forest. This suggests that the CC approach is slightly better in this context than the BR strategy.
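
Hamming loss, the metric used in Step 2, is the fraction of individual label decisions that are wrong. A minimal sketch on invented predictions (the function name and the toy matrices are assumptions):

```python
def hamming_loss(Y_true, Y_pred):
    """Fraction of (instance, label) pairs where prediction and truth differ."""
    wrong = 0
    total = 0
    for true_row, pred_row in zip(Y_true, Y_pred):
        for t, p in zip(true_row, pred_row):
            wrong += int(t != p)
            total += 1
    return wrong / total


# Toy example: 2 instances, 3 labels (1 = resistant, 0 = susceptible).
Y_true = [[1, 0, 1],
          [0, 1, 0]]
Y_pred = [[1, 1, 1],
          [0, 1, 1]]

loss = hamming_loss(Y_true, Y_pred)
print(loss)  # 2 wrong decisions out of 6
```

Read this way, a Hamming loss of 0.106 means that roughly one label decision in ten is wrong.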

Step 3. Analysis of the degree of data imbalance. One of the main challenges in constructing a multi-label classifier is the presence of data imbalance. Unlike traditional classification tasks, where it is straightforward to analyze the distribution of labels and estimate the degree of imbalance, multi-label classification has required several specific metrics [11], e.g. IRLbl (imbalance ratio per label) and CVIR (coefficient of variation of the imbalance ratio). For the analyzed data set the IRLbl value is 324 and the mean value is 7.055, while the value of CVIR is 1.61. Based on these metrics and a comparison with the corresponding values for other data sets (genbase and yeast) [1], it follows that the E. coli data set is imbalanced; therefore a specific technique, e.g. the Synthetic Minority Oversampling Technique (SMOTE) [12], should be applied.
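
The imbalance measures of Step 3 can be sketched as follows. The sketch assumes the commonly used definitions: IRLbl of a label is the count of the most frequent label divided by the count of that label, the mean value is the average IRLbl over all labels, and CVIR is the sample standard deviation of IRLbl divided by that mean. The toy matrix and names are illustrative, and every label is assumed to occur at least once.

```python
import statistics


def imbalance_measures(Y):
    """Return (irlbl_per_label, mean_ir, cvir) for a 0/1 label matrix."""
    counts = [sum(column) for column in zip(*Y)]  # occurrences per label
    largest = max(counts)
    irlbl = [largest / count for count in counts]
    mean_ir = statistics.mean(irlbl)
    cvir = statistics.stdev(irlbl) / mean_ir
    return irlbl, mean_ir, cvir


# Toy label matrix: the three labels occur 4, 2 and 1 times.
Y = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [1, 0, 0],
]
irlbl, mean_ir, cvir = imbalance_measures(Y)
print(irlbl, round(mean_ir, 3), round(cvir, 3))
```

An IRLbl of 1 marks the most frequent label, and larger values mark rarer labels. A high mean combined with a high CVIR indicates that a few labels are much rarer than the rest.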

Step 4. Application of SMOTE. After applying the Synthetic Minority Oversampling Technique, the results obtained (Hamming loss) are: 0.103 for BR with Random Forest, 0.113 for BR with kNN, 0.116 for MLkNN and 0.106 for CC with Random Forest. This suggests that applying SMOTE improved the classifiers' accuracy only slightly, and left the CC result unchanged.
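
The core step of SMOTE is interpolation between a minority sample and one of its nearest minority neighbours. The sketch below shows only that step on invented numeric vectors. How the neighbours are found and how labels are assigned to synthetic samples in the multi-label setting are not described in the text, so they are left out, and the names and seed are assumptions.

```python
import random


def smote_sample(x, neighbours, rng):
    """Create one synthetic point on the segment between x and a neighbour."""
    neighbour = rng.choice(neighbours)
    gap = rng.random()  # uniform in [0, 1)
    return [a + gap * (b - a) for a, b in zip(x, neighbour)]


rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical minority-class point and its nearest minority neighbours.
minority_point = [1.0, 0.0, 2.0]
nearest_neighbours = [[2.0, 1.0, 2.0], [0.0, 0.0, 3.0]]

synthetic = smote_sample(minority_point, nearest_neighbours, rng)
print(synthetic)
```

Because the new point always lies between two existing minority samples, SMOTE adds variety without copying records exactly. This plain version assumes numeric features, so categorical mutation types would need an encoding first.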

Conclusions and further work. The comparative analysis conducted on the E. coli data set illustrated that Classifier Chains are a promising approach for multi-drug resistance prediction. On the other hand, further work is required to address the problem of imbalanced data.

References:

1. F. Herrera, F. Charte, A. J. Rivera, M. J. del Jesus. Multilabel Classification: Problem Analysis, Metrics and Techniques. Springer Verlag, 2016

2. D. Heider, R. Senge, W. Cheng, E. Hullermeier. Multilabel classification for exploiting cross-resistance information in HIV-1 drug resistance and prediction. In: Bioinformatics, 2013, pp. 29(16):1946-52

3. P. Elkafrawy, A. Mausad, H. Esmail. Experimental comparison of methods for multi-label classification in different application domains. In: International Journal of Computer Applications, 2015, pp. 114(19):1-9

4. P. Boerlin, R. Travis, et al. Antimicrobial Resistance and Virulence Genes of Escherichia coli Isolates from Swine in Ontario. In: Applied Environmental Microbiology, 2005, pp. 71(11):6753-61

5. NCBI Antimicrobial Resistance Resources https://www.ncbi.nlm.nih.gov/pathogens/antimicrobial-resistance/resources/

6. G. Tsoumakas, I. Katakis. Multi-label classification: An Overview. In: International Journal of Data Warehousing and Mining, 2007, pp. 3(3):1-13

7. M.D. Turner, C. Chakrabarti, et al. Automated annotation of functional imaging experiments via multi-label classification. In: Front. Neuroscience, 2013, pp. 7:240

8. Scikit-multilearn - http://scikit.ml

9. M.L. Zhang, Z.H. Zhou. ML-kNN: A Lazy Learning Approach to Multi-Label Learning. In: Pattern Recognition, 2007, pp. 40(7):2038-2048

10. J. Read, B. Pfahringer, et al. Classifier Chains for Multi-label Classification. In: W. Buntine, M. Grobelnik, D. Mladenic, J. Shawe-Taylor (eds) Machine Learning and Knowledge Discovery in Databases. ECML PKDD, LNCS 5782, 2009

11. F. Charte, A.J. Rivera, M.J. del Jesus, F. Herrera. Addressing imbalance in multilabel classification: measures and random resampling algorithms. In: Neurocomputing, 2015, pp. 163:3-16

12. A.F. Giraldo-Forero, J.A. Jaramillo-Garzon, J.F. Ruiz-Muniz, C.G. Castellanos-Dominguez. Managing Imbalanced Data Sets in Multi-label Problems: A Case Study with the SMOTE Algorithm. In: J. Ruiz-Shulcloper, J. Sanniti di Baja (eds.) Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications. CIARP 2013, LNCS 8258, 2013

Sunday, 16 October 2022

Biomimetics in ML and comparing BNN to ANN

In order to discuss this topic we need to know all of the terms:

Biomimetics applied: Kingfisher bird (left) and Shinkansen 500 Series (right)

Biomimetics or biomimicry is the emulation of the models, systems, and elements of nature for the purpose of solving complex human problems.

Machine learning (ML) is a field of inquiry devoted to understanding and building methods that 'learn', that is, methods that leverage data to improve performance on some set of tasks.

Biological neural network (left) and artificial neural network (right)

A biological neural network (BNN) is a structure that consists of synapses, dendrites, cell bodies, and axons.

An artificial neural network (ANN) is based on a collection of connected units or nodes called artificial neurons, which loosely model the neurons in a BNN.

The BNN is composed of several synaptically coupled processing units called neurons. These neurons take in either external input or the output of other neurons. The output produced by the individual neurons propagates its effect through the whole network to the last layer, where the results can be presented to the outside world.

When the network is being trained, each synapse is assigned a processing value and a weight. The number of neurons in the network, the connections between them (i.e., the topology), and the weights assigned to each synapse all have a significant impact on the network's performance and potency.

Similarly, the ANN is also made up of a variety of processing components, called (artificial) neurons, that are connected by weighted paths to create networks. Each element's output is calculated by applying a non-linear function to its weighted inputs. Networks that combine these processing components can perform arbitrarily complicated non-linear operations, such as tasks involving classification, prediction, or optimization.
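
The computation of a single artificial neuron described above can be sketched in a few lines. The sigmoid is used here as one common choice of non-linear function, and the weights, biases and inputs are arbitrary illustrative values:

```python
import math


def sigmoid(z):
    """Non-linear activation that squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron(inputs, weights, bias):
    """Apply the non-linear function to the weighted sum of the inputs."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(weighted_sum)


def layer(inputs, weight_rows, biases):
    """A layer is several neurons that all read the same inputs."""
    return [neuron(inputs, w, b) for w, b in zip(weight_rows, biases)]


# Two inputs feeding a layer of three neurons (illustrative weights).
inputs = [0.5, -1.0]
weight_rows = [[0.4, 0.1], [-0.3, 0.8], [1.0, 1.0]]
biases = [0.0, 0.1, -0.5]

outputs = layer(inputs, weight_rows, biases)
print(outputs)
```

Stacking such layers, with the outputs of one layer used as the inputs of the next, gives the network its ability to represent complicated non-linear functions.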

These artificial neural networks can recover important data from a noisy environment and learn from experiences and examples, just like the human brain.

The main differences between BNN and ANN:

Basis for comparison	BNN	ANN
Processing	Parallel and distributed	Sequential and centralised
Speed (in processing information)	Slow	Fast
Size	Large	Small
Allocation for storage to a new process	Easy as it is added just by adjusting the interconnection strengths	Strictly irreplaceable as the old location is saved for the previous process
Control mechanism	Activities are not monitored by a central control unit	Activities are monitored by a control unit
Fault tolerance	Implicitly fault tolerant	Intolerant to the failure

In conclusion, the ANN is the outcome of implementing the BNN approach in software. Although there are some similarities between the two, the differences are apparent.

Bibliography:

https://www.youtube.com/watch?v=iMtXqTmfta0

https://en.wikipedia.org/wiki/Biomimetics

https://en.wikipedia.org/wiki/Artificial_neural_network

https://techdifferences.com/difference-between-artificial-neural-network-and-biological-neural-network.html

https://www.geeksforgeeks.org/difference-between-ann-and-bnn/
