Wednesday, 19 October 2022

The most suitable multi-label classification method(s) for predicting antimicrobial resistance

The research question. The aim of this work is to identify which multi-label classification
methods are appropriate for predicting antimicrobial resistance. To address this problem we
analyzed multi-drug resistance in E. coli. The dataset is public and accessible at NCBI [5]. It
consists of 363 records, and each record describes a set of 140 genes that underwent different
types of changes and may therefore induce resistance to different antibiotic classes. There are 34
antibiotics, and for each antibiotic the record specifies whether the bacterium is susceptible or
resistant to it. Five types of changes that might influence the role of a gene are taken into
consideration: point mutation, partial mutation, partial end of contig, mistranslation and
complete gene (no mutation).

The problem. Predicting the susceptibility or resistance of a microbe to several drugs is a
multi-label classification problem. Unlike traditional classification, where an input instance
belongs to exactly one class, in multi-label classification an instance can belong to several
classes at once [1]. This makes the problem more challenging, and traditional classification models
cannot be applied directly.
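
To make the difference concrete, the following minimal sketch shows how one record of such a dataset could be turned into a feature vector and a multi-label target. The gene and antibiotic names, the integer codes and the `encode_record` helper are hypothetical and introduced only for illustration; the encoding actually used in the study is not stated here.

```python
from typing import Dict, List, Set, Tuple

# The five gene states named in the text; the integer codes are an assumption.
CHANGE_TYPES = ["complete", "point", "partial", "partial_end_of_contig", "mistranslation"]


def encode_record(
    gene_state: Dict[str, str],
    resistant_to: Set[str],
    genes: List[str],
    antibiotics: List[str],
) -> Tuple[List[int], List[int]]:
    """Return (features, labels) for one isolate.

    features[i] is the code of the change type observed for genes[i];
    labels[j] is 1 if the isolate is resistant to antibiotics[j], else 0.
    """
    features = [CHANGE_TYPES.index(gene_state[g]) for g in genes]
    labels = [1 if a in resistant_to else 0 for a in antibiotics]
    return features, labels


# Hypothetical toy vocabulary (the real dataset has 140 genes and 34 antibiotics).
genes = ["gene_a", "gene_b", "gene_c"]
antibiotics = ["drug_1", "drug_2", "drug_3", "drug_4"]

features, labels = encode_record(
    {"gene_a": "point", "gene_b": "complete", "gene_c": "partial"},
    {"drug_1", "drug_3"},
    genes,
    antibiotics,
)
print(features)  # one code per gene
print(labels)    # several labels can be active at once
```

A single-label classifier would have to pick exactly one of the four drugs, whereas here the target is a binary vector with one entry per antibiotic.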

Methodology and results. The methodology we followed consists of several steps:

Step 1. Analysis of the data characteristics with respect to the degree of multilabelness. The
metrics used to characterize the distribution of labels over instances are: Cardinality (the mean
number of labels per instance), Density (Cardinality divided by the total number of labels) and
PMin (the percentage of instances with only one active label) [1,6,7]. The values obtained for the
investigated dataset are 7.80 for Cardinality, 0.229 for Density and 1.49% for PMin, showing that
the dataset is indeed multi-labeled and cannot be treated as a single-label dataset.
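
The three measures can be computed directly from a binary label matrix. The sketch below is a minimal illustration on toy data, not the code used in the study; the function name `multilabel_stats` is an assumption, and Density is taken as Cardinality divided by the number of labels, which is the usual definition.

```python
from typing import List, Tuple


def multilabel_stats(Y: List[List[int]]) -> Tuple[float, float, float]:
    """Return (cardinality, density, pmin) for a binary label matrix Y.

    Y[i][j] is 1 if instance i carries label j, else 0.
    """
    n_instances = len(Y)
    n_labels = len(Y[0])
    labels_per_instance = [sum(row) for row in Y]

    cardinality = sum(labels_per_instance) / n_instances
    density = cardinality / n_labels
    # Percentage of instances that carry exactly one active label.
    pmin = 100.0 * sum(1 for c in labels_per_instance if c == 1) / n_instances
    return cardinality, density, pmin


# Toy label matrix: 4 instances, 4 labels (not the E. coli data).
Y_toy = [
    [1, 0, 1, 0],
    [1, 1, 1, 0],
    [0, 1, 0, 0],
    [1, 1, 0, 1],
]
cardinality, density, pmin = multilabel_stats(Y_toy)
print(cardinality, density, pmin)

# Consistency check on the reported values: 7.80 labels on average out of 34.
reported_density = round(7.80 / 34, 3)
print(reported_density)
```

The last line shows that the reported Density of 0.229 agrees with the reported Cardinality of 7.80 over 34 antibiotics.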

Step 2. Comparison of several multi-label classification methods. The selected methods are: Binary
Relevance (BR) with Random Forest as the base classifier [8], Binary Relevance with kNN as the base
classifier [8], Multi-label kNN (MLkNN) [9] and Classifier Chain (CC) with Random Forest as the
base classifier [10]. Hamming loss was used as the evaluation metric (smaller values denote higher
accuracy), and the following results were obtained: 0.107 for BR with Random Forest, 0.117 for BR
with kNN, 0.117 for MLkNN and 0.106 for CC with Random Forest. This suggests that, in this context,
the CC approach is slightly better than the BR strategy.
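
The structural difference between BR and CC, and the way Hamming loss scores them, can be sketched in a few lines. This is a minimal illustration under stated assumptions: a toy 1-nearest-neighbour learner stands in for Random Forest, the data are invented, and the helper names are hypothetical. It does not reproduce the experiments, which relied on scikit-multilearn [8].

```python
from typing import Callable, List, Sequence

Predictor = Callable[[Sequence[float]], int]


def hamming_loss(y_true: List[List[int]], y_pred: List[List[int]]) -> float:
    """Fraction of (instance, label) pairs that are predicted incorrectly."""
    wrong = sum(t != p for yt, yp in zip(y_true, y_pred) for t, p in zip(yt, yp))
    return wrong / (len(y_true) * len(y_true[0]))


def fit_1nn(X: List[List[float]], y: List[int]) -> Predictor:
    """Toy base learner (1-nearest neighbour), standing in for Random Forest or kNN."""
    def predict(x: Sequence[float]) -> int:
        nearest = min(
            range(len(X)),
            key=lambda i: sum((a - b) ** 2 for a, b in zip(X[i], x)),
        )
        return y[nearest]
    return predict


def fit_binary_relevance(X, Y):
    """Binary Relevance: one independent binary classifier per label."""
    models = [fit_1nn(X, [row[j] for row in Y]) for j in range(len(Y[0]))]
    return lambda x: [m(x) for m in models]


def fit_classifier_chain(X, Y):
    """Classifier Chain: classifier j also sees labels 0..j-1 as extra features."""
    models = [
        fit_1nn([list(x) + list(row[:j]) for x, row in zip(X, Y)], [row[j] for row in Y])
        for j in range(len(Y[0]))
    ]

    def predict(x):
        chain: List[int] = []
        for model in models:
            # Earlier predictions are appended to the input of later classifiers.
            chain.append(model(list(x) + chain))
        return chain
    return predict


# Toy data (not the E. coli dataset): two features, two labels.
X_train = [[0, 0], [0, 1], [1, 0], [1, 1]]
Y_train = [[0, 0], [0, 1], [1, 0], [1, 1]]
X_test = [[0.1, 0.9], [0.9, 0.2]]
Y_test = [[0, 1], [1, 1]]

br = fit_binary_relevance(X_train, Y_train)
cc = fit_classifier_chain(X_train, Y_train)
br_loss = hamming_loss(Y_test, [br(x) for x in X_test])
cc_loss = hamming_loss(Y_test, [cc(x) for x in X_test])
print(br_loss, cc_loss)
```

BR ignores any dependence between labels, while CC lets each classifier condition on the labels predicted earlier in the chain, which is why it can help when resistance to one drug is informative about resistance to another.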

Step 3. Analysis of the degree of data imbalance. One of the main challenges in constructing a
multi-label classifier is the presence of data imbalance. Unlike traditional classification tasks,
where it is straightforward to analyze the distribution of labels and estimate the degree of
imbalance, multi-label classification requires specific metrics, and several have been proposed
[11], e.g. IRLbl (imbalance ratio per label) and CVIR (coefficient of variation of the imbalance
ratio). For the analyzed dataset the IRLbl value is 324 and the mean value is 7.055, while the
value of CVIR is 1.61. Based on these metrics, and comparing with the corresponding values for
other datasets (genbase and yeast) [1], it follows that the E. coli dataset is imbalanced;
therefore a specific technique, e.g. the Synthetic Minority Oversampling Technique (SMOTE) [12],
should be applied.
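
As an illustration of how these imbalance measures are obtained, the sketch below computes the per-label imbalance ratio, its mean and its coefficient of variation on a toy label matrix. It assumes the definitions commonly attributed to [11]: the ratio for a label is the count of the most frequent label divided by that label's count, and the coefficient of variation uses the sample standard deviation. The function name is hypothetical.

```python
from statistics import mean, stdev
from typing import List, Tuple


def imbalance_measures(Y: List[List[int]]) -> Tuple[List[float], float, float]:
    """Return (IRLbl per label, MeanIR, CVIR) for a binary label matrix Y.

    Assumes every label is active in at least one instance.
    """
    n_labels = len(Y[0])
    counts = [sum(row[j] for row in Y) for j in range(n_labels)]
    most_frequent = max(counts)

    irlbl = [most_frequent / c for c in counts]  # 1.0 for the majority label
    mean_ir = mean(irlbl)
    cvir = stdev(irlbl) / mean_ir                # sample standard deviation (assumption)
    return irlbl, mean_ir, cvir


# Toy label matrix (not the E. coli data): label counts are 4, 2 and 1.
Y_toy = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [1, 0, 0],
]
irlbl, mean_ir, cvir = imbalance_measures(Y_toy)
print(irlbl, mean_ir, cvir)
```

A perfectly balanced label set would give a ratio of 1 for every label, a mean of 1 and a coefficient of variation of 0; larger values indicate stronger imbalance.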

Step 4. Application of SMOTE. After applying SMOTE, the results obtained are: 0.103 for BR with
Random Forest, 0.113 for BR with kNN, 0.116 for MLkNN and 0.106 for CC with Random Forest. This
suggests that applying SMOTE improved the classifiers' accuracy only slightly.
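
The core of SMOTE is interpolation between a minority instance and one of its nearest minority neighbours. The sketch below illustrates that step for the instances carrying one rare label. It is a simplified illustration, not the procedure of [12] or the code used in the study: the function names are hypothetical, features are assumed to be numeric, and the choice of label vector for each synthetic instance is left open because it is not described here.

```python
import random
from typing import List


def squared_distance(a: List[float], b: List[float]) -> float:
    return sum((p - q) ** 2 for p, q in zip(a, b))


def smote_for_label(
    X: List[List[float]],
    Y: List[List[int]],
    label: int,
    n_new: int,
    k: int = 2,
    seed: int = 0,
) -> List[List[float]]:
    """Create n_new synthetic feature vectors for instances carrying `label`."""
    rng = random.Random(seed)
    minority = [x for x, y in zip(X, Y) if y[label] == 1]
    if len(minority) < 2:
        raise ValueError("need at least two instances of the minority label")

    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        seed_point = minority[i]
        others = [m for j, m in enumerate(minority) if j != i]
        # Keep the k nearest minority neighbours of the chosen instance.
        neighbours = sorted(others, key=lambda m: squared_distance(seed_point, m))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment seed_point -> neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(seed_point, neighbour)])
    return synthetic


# Toy data (not the E. coli dataset): label 0 is carried by the first three instances.
X_toy = [[0, 0], [2, 2], [4, 0], [9, 9]]
Y_toy = [[1, 0], [1, 0], [1, 1], [0, 1]]
new_points = smote_for_label(X_toy, Y_toy, label=0, n_new=5)
print(new_points)
```

Every synthetic point lies on a segment between two existing minority instances, so oversampling fills in the region already occupied by the rare label rather than extending it.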


Conclusions and further work. The comparative analysis conducted on the E. coli dataset
illustrated that Classifier Chains are a promising approach for multi-drug resistance prediction.
On the other hand, further work is required to address the problem of imbalanced data.


References:

1. F. Herrera, F. Charte, A. J. Rivera, M. J. del Jesus. Multilabel Classification: Problem
Analysis, Metrics and Techniques. Springer, 2016

2. D. Heider, R. Senge, W. Cheng, E. Hüllermeier. Multilabel classification for exploiting
cross-resistance information in HIV-1 drug resistance prediction. In: Bioinformatics, 2013,
29(16):1946-52

3. P. Elkafrawy, A. Mausad, H. Esmail. Experimental comparison of methods for multi-label
classification in different application domains. In: International Journal of Computer
Applications, 2015, 114(19):1-9

4. P. Boerlin, R. Travis, et al. Antimicrobial Resistance and Virulence Genes of Escherichia coli
Isolates from Swine in Ontario. In: Applied and Environmental Microbiology, 2005,
71(11):6753-61

5. NCBI Antimicrobial Resistance Resources - https://www.ncbi.nlm.nih.gov/pathogens/antimicrobial-resistance/resources/

6. G. Tsoumakas, I. Katakis. Multi-label classification: An Overview. In: International Journal of
Data Warehousing and Mining, 2007, 3(3):1-13

7. M.D. Turner, C. Chakrabarti, et al. Automated annotation of functional imaging experiments via
multi-label classification. In: Frontiers in Neuroscience, 2013, 7:240

8. Scikit-multilearn - http://scikit.ml

9. M.L. Zhang, Z.H. Zhou. ML-kNN: A Lazy Learning Approach to Multi-Label Learning. In:
Pattern Recognition, 2007, 40(7):2038-2048

10. J. Read, B. Pfahringer, et al. Classifier Chains for Multi-label Classification. In: W. Buntine,
M. Grobelnik, D. Mladenic, J. Shawe-Taylor (eds.) Machine Learning and Knowledge Discovery in
Databases. ECML PKDD 2009, LNCS 5782, 2009

11. F. Charte, A.J. Rivera, M.J. del Jesus, F. Herrera. Addressing imbalance in multilabel
classification: measures and random resampling algorithms. In: Neurocomputing, 2015, 163:3-16

12. A.F. Giraldo-Forero, J.A. Jaramillo-Garzón, J.F. Ruiz-Muñoz, C.G. Castellanos-Domínguez.
Managing Imbalanced Data Sets in Multi-label Problems: A Case Study with the SMOTE Algorithm.
In: J. Ruiz-Shulcloper, G. Sanniti di Baja (eds.) Progress in Pattern Recognition, Image
Analysis, Computer Vision, and Applications. CIARP 2013, LNCS 8258, 2013
