Resampling Logistic Regression for Handling Class Imbalance in Software Defect Prediction

research • 05 Apr 2023


High-quality software is software that supports a company's business processes effectively and efficiently, and in which no defects are found during testing, inspection, and implementation. Fixing software after delivery and implementation costs far more than fixing it during development. Software testing consumes more than 50% of the development cost. A software defect prediction model is therefore needed to reduce these costs. At present, no software defect prediction model is generally applicable in practice. Logistic Regression is considered the most effective and efficient model for software defect prediction. Its weakness is that it is prone to underfitting on datasets with imbalanced classes, which results in low accuracy. The NASA MDP datasets are commonly used in software defect prediction. One characteristic of software defect prediction datasets, including NASA MDP, is class imbalance. To handle class imbalance in software defect datasets, this study proposes a resampling method. Experiments were conducted to compare the performance of Logistic Regression before and after resampling was applied. Further experiments compared the proposed method with other classifiers: Naïve Bayes, Linear Discriminant Analysis, C4.5, Random Forest, Neural Network, and k-Nearest Neighbor. The results show that Logistic Regression with resampling achieves higher accuracy than Logistic Regression without resampling, and also higher accuracy than the other classifiers. From these results it can be concluded that resampling is effective in addressing class imbalance in software defect prediction with the Logistic Regression algorithm.
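The resampling step described above can be sketched in a few lines of Python. The abstract does not say which resampling variant the study used, so this sketch assumes plain random oversampling (duplicating minority-class rows with replacement until the classes are the same size). The function name `random_oversample` and the toy data are hypothetical and are not taken from the study or from NASA MDP.

```python
import random
from collections import Counter


def random_oversample(features, labels, seed=0):
    """Duplicate randomly chosen minority-class rows until every class
    has as many rows as the largest class (assumed resampling variant)."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())  # size of the largest class

    out_features = list(features)
    out_labels = list(labels)
    for cls, count in counts.items():
        # Indices of the original rows that belong to this class.
        idx = [i for i, label in enumerate(labels) if label == cls]
        # Draw with replacement to fill the gap up to the majority size.
        for _ in range(target - count):
            j = rng.choice(idx)
            out_features.append(features[j])
            out_labels.append(labels[j])
    return out_features, out_labels


# Hypothetical toy data: each row is a module with two made-up metrics;
# label 1 = defective module, label 0 = non-defective module.
X = [[10, 1], [12, 1], [95, 7], [11, 2], [9, 1], [88, 9], [14, 2], [13, 1]]
y = [0, 0, 1, 0, 0, 1, 0, 0]

X_bal, y_bal = random_oversample(X, y, seed=42)
print(Counter(y))      # class counts before resampling
print(Counter(y_bal))  # class counts after resampling
```

The balanced `X_bal` and `y_bal` could then be passed to any Logistic Regression implementation in place of the original data. In a before-and-after comparison like the one described, resampling would normally be applied to the training split only, so that duplicated rows do not leak into the data used for evaluation.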

