An Accurate Model for Text Document Classification Using Machine Learning Techniques
Abstract
Text document classification (TDC) is an approach for determining whether a document of any kind belongs to a target category. Text classification algorithms have recently faced significant challenges because of the exponential growth of digital text documents; the large number of words in each document reduces the effectiveness of existing text classifiers. Feature selection (FS) is a key method for improving classification accuracy and removing redundant data. In this work, the proposed model was built and evaluated in several phases. First, the Reuters-21578 dataset was chosen to train and test the applied machine learning algorithms. Second, the dataset was prepared through data cleaning, label encoding, tokenizing, text cleaning, and finally TF-IDF vectorization. Third, four machine learning algorithms, Extreme Gradient Boosting (XGBoost), K-Nearest Neighbor (KNN), Random Forest (RF), and Decision Tree (DT), were used to build a new machine learning-based text document classification model (ML-TDCM). Finally, the proposed model was assessed using several metrics: accuracy, precision, recall, and F1 score. XGBoost was the best-performing of the four algorithms, with a classification accuracy of 91%. The results were also compared with those reported in past studies, which supports the performance of the proposed models and identifies them as candidate methods for future work on document categorization.
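The preparation and evaluation phases described in the abstract can be illustrated with a minimal sketch that uses only the Python standard library. It is not the authors' ML-TDCM implementation: the toy documents, the two category names, and every function name are hypothetical, and a small cosine-similarity KNN stands in for the four classifiers, which in practice would come from a machine learning library.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Text cleaning and tokenizing: lowercase, keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())

def fit_idf(token_docs):
    """Smoothed inverse document frequency for every term in the training set."""
    n_docs = len(token_docs)
    doc_freq = Counter(term for doc in token_docs for term in set(doc))
    return {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}

def tfidf_vector(tokens, idf):
    """L2-normalised sparse TF-IDF vector (term -> weight); unseen terms are dropped."""
    counts = Counter(t for t in tokens if t in idf)
    vec = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}

def cosine(a, b):
    """Cosine similarity of two L2-normalised sparse vectors."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())

def knn_predict(query, train_vecs, train_ids, k=3):
    """Majority label id among the k most similar training documents."""
    ranked = sorted(zip(train_vecs, train_ids),
                    key=lambda pair: cosine(query, pair[0]), reverse=True)
    return Counter(i for _, i in ranked[:k]).most_common(1)[0][0]

def evaluate(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall, and F1."""
    per_class = []
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        per_class.append((prec, rec, f1))
    n = len(per_class)
    return {
        "accuracy": sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true),
        "precision": sum(c[0] for c in per_class) / n,
        "recall": sum(c[1] for c in per_class) / n,
        "f1": sum(c[2] for c in per_class) / n,
    }

# Hypothetical toy corpus (not taken from Reuters-21578).
train_docs = [
    "Company reports higher quarterly profit and dividend",
    "Net profit rises as quarterly earnings beat forecast",
    "Bank posts record earnings and raises dividend",
    "Wheat harvest falls after dry weather",
    "Corn and wheat exports rise on strong demand",
    "Farmers expect record corn harvest this season",
]
train_labels = ["earn", "earn", "earn", "grain", "grain", "grain"]
test_docs = ["Quarterly dividend and profit increase", "Wheat and corn harvest outlook"]
test_labels = ["earn", "grain"]

# Label encoding: map each category name to an integer id and back.
label_to_id = {label: i for i, label in enumerate(sorted(set(train_labels)))}
id_to_label = {i: label for label, i in label_to_id.items()}
train_ids = [label_to_id[label] for label in train_labels]

# TF-IDF vectorization: IDF is fitted on the training documents only.
train_tokens = [tokenize(doc) for doc in train_docs]
idf = fit_idf(train_tokens)
train_vecs = [tfidf_vector(tokens, idf) for tokens in train_tokens]
test_vecs = [tfidf_vector(tokenize(doc), idf) for doc in test_docs]

# Classification and evaluation.
predictions = [id_to_label[knn_predict(v, train_vecs, train_ids)] for v in test_vecs]
metrics = evaluate(test_labels, predictions)
print(predictions, metrics)
```

Fitting the IDF weights on the training documents alone keeps test-set vocabulary from leaking into the features, which matters when the reported accuracy is meant to reflect unseen documents.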