Classifying Breast Cancer as Benign or Malignant Using RTextTools

[This article was first published on RTextTools: a machine learning library for text classification - Blog, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here)
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

RTextTools has largely been used for topic classification in the social sciences. However, recent discussions with researchers at various universities have demonstrated that the package can be applied to a host of problems in the natural sciences as well.

One such application is using text classification to identify breast cancer masses as benign or malignant. Using the Wisconsin Diagnostic Breast Cancer Dataset from UC Irvine, we wrote a script that trains eight classifiers on characteristics such as clump thickness, uniformity of cell size, uniformity of cell shape, marginal adhesion, single epithelial cell size, bare nuclei, bland chromatin, normal nucleoli, and mitoses. When run on the data, the classifiers were able to achieve up to 96% recall accuracy on a randomly sampled training set of 200 patients and test set of 400 patients.

The source code is available below, and the dataset is automatically downloaded from UC Irvine’s servers. If you’ve found RTextTools useful in your research, we’d love to hear about it!

To leave a comment for the author, please follow the link and comment on their blog: RTextTools: a machine learning library for text classification - Blog.

R-bloggers.com offers daily e-mail updates about R news and tutorials about learning R and many other topics. Click here if you're looking to post or find an R/data-science job.
Want to share your content on R-bloggers? click here if you have a blog, or here if you don't.

Never miss an update!
Subscribe to R-bloggers to receive
e-mails with the latest R posts.
(You will not see this message again.)

Click here to close (This popup will not appear again)