MIS 655 Grand Canyon University Neural Network Models Responses
Discussion 1: Arcelia
Imbalanced data is not just a problem afflicting neural networks; it is also a concern for other supervised machine learning algorithms that use labeled training data, such as decision trees (Kumar et al., 2021). When we have insufficient data to properly represent the minority class, our classification model is built with a bias toward the majority class (Kumar et al., 2021). While a model trained on data whose minority class falls below 5% may seem accurate, it will not represent the real world. In the case listed above, where the minority class is only 2.5% of the data, overfitting would be an issue: the final model would be biased toward the majority set.
There are several ways to correct the imbalance issue. As noted by Elsinghorst (2017), two popular methods are over-sampling and under-sampling. In under-sampling, a subset of samples is randomly selected from the majority class to match the number of samples in the minority class. In over-sampling, we duplicate samples from the minority class to match the number of samples in the majority class. For example, if we had an imbalanced dataset with five minority samples and 300 majority samples, we would use over-sampling to create 295 duplicate samples in our minority set; if under-sampling, we would randomly select five samples from our majority class to use (a sketch of both approaches appears below). As you can see, both under- and over-sampling can introduce overestimation and variance into our model. Additionally, Elsinghorst (2017) writes that other techniques like SMOTE or ROSE can be used as alternatives. These techniques “are hybrid methods that combine under-sampling with the generation of additional data” (Elsinghorst, 2017, ROSE section). Lastly, Shmueli et al. (2017) write that within the neuralnet() function, the momentum parameter can be utilized to ensure that the model is willing to change and avoid “getting stuck in a local optimum” (p. 283), which would assist in ensuring a more global fit.
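As a rough illustration (not Elsinghorst’s own code), the base-R sketch below applies both ideas to the 5-versus-300 example above, using a hypothetical data frame `df` with a binary `default` column; the caret package wraps the same operations as `upSample()` and `downSample()`:

```r
# Hypothetical imbalanced data: 5 defaulters vs. 300 non-defaulters,
# mirroring the example in the paragraph above.
set.seed(42)
df <- data.frame(
  income  = rnorm(305, mean = 50, sd = 10),
  default = c(rep("yes", 5), rep("no", 300))
)

minority <- df[df$default == "yes", ]
majority <- df[df$default == "no", ]

# Over-sampling: draw minority rows with replacement until the
# minority count matches the majority count.
over_idx      <- sample(nrow(minority), nrow(majority), replace = TRUE)
balanced_over <- rbind(majority, minority[over_idx, ])

# Under-sampling: keep only a random subset of majority rows equal
# in size to the minority class (discards 295 majority rows).
under_idx      <- sample(nrow(majority), nrow(minority))
balanced_under <- rbind(minority, majority[under_idx, ])

table(balanced_over$default)   # no: 300, yes: 300
table(balanced_under$default)  # no: 5,   yes: 5
```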
References
Elsinghorst, S. (2017, April 2). Dealing with unbalanced data in machine learning. GitHub. https://shiring.github.io/machine_learning/2017/04/02/unbalanced
Kumar, P., Bhatnagar, R., Gaur, K., & Bhatnagar, A. (2021). Classification of imbalanced data: Review of methods and applications. IOP Conference Series: Materials Science and Engineering, 1099(1), 1–8. https://doi.org/10.1088/1757-899x/1099/1/012077
Shmueli, G., Bruce, P. C., Yahav, I., Patel, N. R., & Lichtendahl, K. C., Jr. (2017). Data mining for business analytics: Concepts, techniques, and applications in R (1st ed.). Wiley.
Discussion 2: Tyler
Neural networks greatly benefit from large amounts of data; while this is true for most machine learning algorithms, it is especially true for neural networks. Since there are so few defaulters in the data we are working with, we likely do not have the data necessary to produce correct predictions. This negative effect is amplified by the black box nature of a neural network (Donges, 2019). Simply stated, we will know whether the algorithm determines that someone will default, but we won’t be able to see why, which can be critical when explaining why a borrower was rejected. This particular problem could be avoided by using methods such as decision trees, which are very transparent about their solutions, or a method such as Naïve Bayes, which tends to do better with less data and carries no black box effect.
Ignoring the black box effect, there are some strategies one could implement to overcome the minority class problem. A few of these strategies include penalized models, resampling the dataset, and changing the performance metric. A penalized model would “impose an additional cost for making classification mistakes on minority classes during training” (Brownlee, 2020). Providing these penalties or weights can cause an algorithm to pay more attention to the minority classes. Resampling a dataset could mean removing over-represented data (under-sampling) or adding copies of under-represented classes (over-sampling). We can also move our performance metric away from the traditional accuracy measurement toward measures derived from the confusion matrix, such as precision, recall, and the F1 score, or toward other methods such as Kappa or ROC curves (Brownlee, 2020); a brief sketch of these metrics appears below.
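To make the metric shift concrete, here is a minimal base-R sketch (the `actual` and `predicted` vectors are hypothetical) that derives precision, recall, and the F1 score from a confusion matrix, showing how accuracy can look respectable while defaulters are still being missed:

```r
# Hypothetical labels; "yes" is the minority (defaulter) class.
actual    <- factor(c("no","no","no","no","no","no","no","no","yes","yes"))
predicted <- factor(c("no","no","no","no","no","no","no","yes","yes","no"))

cm <- table(Predicted = predicted, Actual = actual)

tp <- cm["yes", "yes"]  # defaulters correctly flagged
fp <- cm["yes", "no"]   # non-defaulters wrongly flagged
fn <- cm["no",  "yes"]  # defaulters missed

precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f1        <- 2 * precision * recall / (precision + recall)
accuracy  <- sum(diag(cm)) / sum(cm)

# Accuracy is 80%, yet half of the actual defaulters were missed;
# recall and F1 (both 0.5) expose what accuracy hides.
c(accuracy = accuracy, precision = precision, recall = recall, f1 = f1)
```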
There are many strategies to overcome these issues, but one must always consider whether the downsides of a particular machine learning algorithm are directly opposed to what you are looking for in a solution. If so, looking for a new machine learning algorithm may be the best option, since there is no one-size-fits-all in the world of machine learning.
References
Brownlee, J. (2020, August 15). 8 tactics to combat imbalanced classes in your machine learning dataset. Machine Learning Mastery. Retrieved November 19, 2021, from https://machinelearningmastery.com/tactics-to-combat-imbalanced-classes-in-your-machine-learning-dataset/.
Donges, N. (2019, July 24). 4 reasons why deep learning and neural networks aren’t always the right choice. Built In. Retrieved November 19, 2021, from https://builtin.com/data-science/disadvantages-neural-networks.
Discussion 3: Meredith
Having too few samples of any given class leads to data imbalance, an issue that standard classification algorithms do not handle well. When the data is balanced, the majority and minority classes contribute equally to the error. With imbalanced data, the majority class is a much larger contributor to the loss value and creates a biased result (Wang et al., 2016). This problem is not unique to neural networks; it can occur in most machine learning classification models. To overcome data imbalance, the performance metric can be changed, the algorithm can be changed, resampling techniques can be applied (oversampling the minority class or undersampling the majority class), or synthetic samples can be created, as sketched below (Boyle, 2019).
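To show what creating a synthetic sample might look like, the base-R sketch below (using a made-up numeric matrix `x_min` of minority-class rows) interpolates between a minority point and its nearest neighbor, which is the core move of SMOTE-style methods:

```r
# SMOTE-style synthetic point: minimal base-R sketch.
# `x_min` is a hypothetical numeric matrix of minority-class rows.
set.seed(1)
x_min <- matrix(rnorm(10 * 3), nrow = 10, ncol = 3)

i <- sample(nrow(x_min), 1)  # pick one minority sample at random

# Squared Euclidean distance from sample i to every minority row.
d <- rowSums((x_min - matrix(x_min[i, ], nrow(x_min), ncol(x_min),
                             byrow = TRUE))^2)
j <- order(d)[2]             # nearest neighbor (order(d)[1] is i itself)

# New synthetic point: a random spot on the segment between i and j.
gap       <- runif(1)
synthetic <- x_min[i, ] + gap * (x_min[j, ] - x_min[i, ])
synthetic
```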
Neural network models are also susceptible to overfitting. This problem is not unique to neural networks, but it can be more common with them. Obtaining more data is one way to prevent overfitting, but that is not always possible. Regularization is another technique that can be implemented to avoid overfitting. Skalski (2018) describes a few ways to regularize a model, two of which are L1 regularization (least absolute deviations) and L2 regularization (least squared errors). The former is the more preferable of the two: it reduces the weight values of less important features to zero, possibly eliminating them from the calculations entirely. A short sketch contrasting the two penalties follows.
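As a rough illustration (assuming the glmnet package; a plain linear model is used only because it makes the zeroed-out coefficients easy to see, while the same L1/L2 penalties can be added to a network’s loss), the sketch below fits lasso (L1) and ridge (L2) models to data where only two of ten features matter:

```r
# L1 vs. L2 regularization with the glmnet package.
library(glmnet)

set.seed(7)
n <- 200; p <- 10
x <- matrix(rnorm(n * p), n, p)
# Only the first two features actually drive the outcome.
y <- 1.5 * x[, 1] - 2 * x[, 2] + rnorm(n)

lasso <- glmnet(x, y, alpha = 1)  # alpha = 1: L1 penalty (lasso)
ridge <- glmnet(x, y, alpha = 0)  # alpha = 0: L2 penalty (ridge)

# At the same moderate penalty strength, L1 drives the eight
# irrelevant coefficients exactly to zero; L2 only shrinks them.
cbind(L1 = coef(lasso, s = 0.1)[, 1],
      L2 = coef(ridge, s = 0.1)[, 1])
```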
References
Boyle, T. (2019, February 3). Dealing with imbalanced data. Towards Data Science. Retrieved November 16, 2021, from https://towardsdatascience.com/methods-for-dealing-with-imbalanced-data-5b761be45a18
Skalski, P. (2018, September 7). Preventing deep neural network from overfitting. Towards Data Science. Retrieved November 16, 2021, from https://towardsdatascience.com/preventing-deep-neural-network-from-overfitting-953458db800a
Wang, S., Liu, W., Wu, J., Cao, L., Meng, Q., & Kennedy, P. J. (2016). Training deep neural networks on imbalanced data sets. 2016 International Joint Conference on Neural Networks (IJCNN), 4368–4374.