The research question. The aim of this work is to identify which multi-label
classification methods are appropriate for predicting antimicrobial resistance. To address this
question, we analyzed multi-drug resistance in E. coli. The dataset used is a public one,
accessible at NCBI [5]. It consists of 363 records, and each record describes a set of 140
genes that have undergone different types of changes and can therefore induce resistance to different
antibiotic classes. There are 34 antibiotics, and for each antibiotic the record specifies whether the
bacterium is susceptible or resistant to it. Five types of changes that might influence the role of a gene
are taken into consideration: point mutation, partial mutation, partial end of contig, mistranslation
and complete gene (no mutation).
The problem. Predicting the susceptibility or resistance of a microbe to several drugs is a
multi-label classification problem. Unlike traditional classification, where an instance
belongs to exactly one class, in multi-label classification an instance can belong to several classes
[1]. This makes the classification problem more challenging, and traditional classification models
cannot be applied directly.
Methodology and results. The methodology we followed consists of several steps:
Step 1. Analysis of the data characteristics with respect to the degree of multilabelness. The metrics
used to characterize the distribution of labels over instances are: Cardinality (the mean number of
labels per instance), Density (Cardinality divided by the total number of labels) and PMin
(the percentage of instances with only one active label) [1,6,7]. The values obtained for the investigated
dataset are 7.80 for Cardinality, 0.229 for Density and 1.49% for PMin, showing that the dataset is
indeed a multi-label one and cannot be treated as a single-label dataset.
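The three metrics of Step 1 can be illustrated with a minimal sketch. It assumes the labels are stored as a binary indicator matrix (one row per record, one column per antibiotic); the function name label_stats and the toy matrix are hypothetical and are not taken from the analyzed dataset.

```python
from typing import Dict, List


def label_stats(Y: List[List[int]]) -> Dict[str, float]:
    """Cardinality, Density and PMin of a binary label matrix.

    Y[i][j] == 1 means that label j is active for instance i
    (assumed encoding, e.g. 1 = resistant, 0 = susceptible).
    """
    n_instances = len(Y)
    n_labels = len(Y[0])
    labels_per_instance = [sum(row) for row in Y]

    # Cardinality: mean number of active labels per instance.
    cardinality = sum(labels_per_instance) / n_instances
    # Density: Cardinality normalized by the total number of labels.
    density = cardinality / n_labels
    # PMin: percentage of instances with exactly one active label.
    pmin = 100.0 * sum(1 for k in labels_per_instance if k == 1) / n_instances

    return {"cardinality": cardinality, "density": density, "pmin": pmin}


# Toy example: 4 instances, 4 labels (illustrative values only).
Y_toy = [
    [1, 0, 0, 0],
    [1, 1, 0, 0],
    [0, 1, 1, 1],
    [1, 1, 1, 0],
]
stats = label_stats(Y_toy)
print(stats)
```

With 34 labels, the reported Cardinality of 7.80 corresponds to a Density of 7.80/34, i.e. about 0.229.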
Step 2. Comparison of several multi-label classification methods. The selected methods are: Binary
Relevance (BR) with Random Forest as the base classifier [8], Binary Relevance
with the kNN algorithm [8], Multi-label kNN (MLkNN) [9] and Classifier Chain (CC) with the Random Forest
algorithm [10]. Hamming loss was used as the evaluation metric (smaller values denote higher accuracy),
and the following results were obtained: 0.107 for BR with Random Forest, 0.117 for BR with kNN,
0.117 for MLkNN, 0.106 for CC with Random Forest. This suggests that the CC approach is marginally
better in this context than the BR strategy.
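For reference, the Hamming loss is the fraction of instance-label pairs that are predicted incorrectly. The sketch below assumes binary indicator matrices for the true and predicted labels; the function name and the toy matrices are hypothetical.

```python
from typing import List


def hamming_loss(Y_true: List[List[int]], Y_pred: List[List[int]]) -> float:
    """Fraction of (instance, label) pairs on which prediction and truth differ."""
    n_instances = len(Y_true)
    n_labels = len(Y_true[0])
    mismatches = sum(
        t != p
        for row_true, row_pred in zip(Y_true, Y_pred)
        for t, p in zip(row_true, row_pred)
    )
    return mismatches / (n_instances * n_labels)


# Toy example: 3 instances, 4 labels, 3 wrong label assignments.
Y_true = [[1, 0, 1, 0], [0, 1, 1, 0], [1, 1, 0, 0]]
Y_pred = [[1, 0, 0, 0], [0, 1, 1, 1], [1, 0, 0, 0]]
loss = hamming_loss(Y_true, Y_pred)
print(loss)
```

Under this definition, a Hamming loss of 0.106 means that roughly one label assignment in ten is wrong, averaged over all records and antibiotics.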
Step 3. Analysis of the degree of data imbalance. One of the main challenges in constructing a
multi-label classifier is the presence of data imbalance. Unlike traditional classification tasks,
where it is straightforward to analyze the distribution of labels and estimate the degree of imbalance,
multi-label classification requires specific metrics, several of which have been proposed [11], e.g.
IRLbl (imbalance ratio per label) and CVIR (coefficient of variation of the imbalance ratio). For the analyzed
dataset the IRLbl value is 324 and the mean value is 7.055, while the value of CVIR is 1.61. Based on
these metrics and on a comparison with the corresponding values for other datasets (genbase and yeast) [1],
it follows that the E. coli dataset is imbalanced; therefore a specific technique, e.g. the Synthetic Minority
Oversampling Technique (SMOTE) [12], should be applied.
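The imbalance measures of [11] can be sketched as follows, assuming the usual definitions: IRLbl of a label is the count of the most frequent label divided by the count of that label, MeanIR is the mean of IRLbl over all labels, and CVIR is the sample standard deviation of IRLbl divided by MeanIR. The function name and the toy matrix are hypothetical, and the sketch is not claimed to reproduce the values reported for the E. coli dataset.

```python
from statistics import mean, stdev
from typing import List, Tuple


def imbalance_metrics(Y: List[List[int]]) -> Tuple[List[float], float, float]:
    """Return (IRLbl per label, MeanIR, CVIR) for a binary label matrix Y."""
    n_labels = len(Y[0])
    # Number of instances in which each label is active.
    counts = [sum(row[j] for row in Y) for j in range(n_labels)]
    if min(counts) == 0:
        # IRLbl is undefined for a label that never occurs.
        raise ValueError("every label needs at least one positive instance")

    max_count = max(counts)
    irlbl = [max_count / c for c in counts]  # 1.0 for the most frequent label
    mean_ir = mean(irlbl)
    cvir = stdev(irlbl) / mean_ir  # sample standard deviation over labels
    return irlbl, mean_ir, cvir


# Toy example: label counts are 4, 2 and 1.
Y_toy = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [1, 0, 0],
]
irlbl, mean_ir, cvir = imbalance_metrics(Y_toy)
print(irlbl, mean_ir, cvir)
```

A larger MeanIR indicates that rare labels are, on average, further from the most frequent label, and a larger CVIR indicates that the imbalance differs more strongly from label to label.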
Step 4. Application of SMOTE. After applying the Synthetic Minority Oversampling Technique, the
results obtained are: 0.103 for BR with Random Forest, 0.113 for BR with kNN, 0.116 for MLkNN,
0.106 for CC with Random Forest. This suggests that applying SMOTE improved the classifiers'
accuracy only slightly and left the result for CC unchanged.
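The core idea of SMOTE is to create a synthetic minority sample by interpolating between a minority sample and one of its nearest minority neighbours. The sketch below shows only this interpolation step, assuming numeric feature vectors and Euclidean distance. The function name, the parameters and the toy points are hypothetical, and the way SMOTE was adapted to the multi-label setting in the experiments is not shown here.

```python
import random
from typing import List


def smote_samples(minority: List[List[float]], n_new: int,
                  k: int = 2, seed: int = 0) -> List[List[float]]:
    """Generate n_new synthetic points from the minority samples (needs >= 2 samples)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # The k nearest minority neighbours of x (x itself is excluded).
        others = [p for p in minority if p is not x]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)))
        neighbour = rng.choice(others[:k])
        # Random point on the segment between x and the chosen neighbour.
        gap = rng.random()
        synthetic.append([a + gap * (b - a) for a, b in zip(x, neighbour)])
    return synthetic


# Toy example: four minority points at the corners of the unit square.
minority_toy = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote_samples(minority_toy, n_new=5)
print(new_points)
```

Because every synthetic point lies on a segment between two existing minority samples, the technique adds no information from outside the region already covered by those samples, which may be one reason for the small gains.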
Conclusions and further work. The comparative analysis conducted on the E. coli dataset
indicates that Classifier Chains are a promising approach for multi-drug resistance prediction. On
the other hand, further work is required to address the problem of imbalanced data.
References:
1. F. Herrera, F. Charte, A. J. Rivera, M. J. del Jesus. Multilabel Classification: Problem Analysis,
Metrics and Techniques. Springer Verlag, 2016
2. D. Heider, R. Senge, W. Cheng, E. Hüllermeier. Multilabel classification for exploiting
cross-resistance information in HIV-1 drug resistance prediction. In: Bioinformatics, 2013,
29(16):1946-52
3. P. Elkafrawy, A. Mausad, H. Esmail. Experimental comparison of methods for multi-label
classification in different application domains. In: International Journal of Computer
Applications, 2015, 114(19):1-9
4. P. Boerlin, R. Travis, et al. Antimicrobial Resistance and Virulence Genes of Escherichia coli
Isolates from Swine in Ontario. In: Applied and Environmental Microbiology, 2005,
71(11):6753-61
5. NCBI Antimicrobial Resistance Resources.
https://www.ncbi.nlm.nih.gov/pathogens/antimicrobial-resistance/resources/
6. G. Tsoumakas, I. Katakis. Multi-label classification: An Overview. In: International Journal of
Data Warehousing and Mining, 2007, 3(3):1-13
7. M.D. Turner, C. Chakrabarti, et al. Automated annotation of functional imaging experiments via
multi-label classification. In: Frontiers in Neuroscience, 2013, 7:240
8. Scikit-multilearn - http://scikit.ml
9. M.L. Zhang, Z.H. Zhou. ML-kNN: A Lazy Learning Approach to Multi-Label Learning. In:
Pattern Recognition, 2007, 40(7):2038-2048
10. J. Read, B. Pfahringer, et al. Classifier Chains for Multi-label Classification. In: W. Buntine, M.
Grobelnik, D. Mladenic, J. Shawe-Taylor (eds.) Machine Learning and Knowledge Discovery in
Databases. ECML PKDD, LNCS 5782, 2009
11. F. Charte, A.J. Rivera, M.J. del Jesus, F. Herrera. Addressing imbalance in multilabel
classification: measures and random resampling algorithms. In: Neurocomputing, 2015,
163:3-16
12. A.F. Giraldo-Forero, J.A. Jaramillo-Garzón, J.F. Ruiz-Muñoz, C.G. Castellanos-Domínguez.
Managing Imbalanced Data Sets in Multi-label Problems: A Case Study with the SMOTE
Algorithm. In: J. Ruiz-Shulcloper, G. Sanniti di Baja (eds.) Progress in Pattern Recognition, Image
Analysis, Computer Vision, and Applications. CIARP 2013, LNCS 8258, 2013