
San Jose State University K Means Discussion Responses

 

Answer 1

1.What is K-means from a basic standpoint?

K-means is an algorithm used in cluster analysis, which is a type of unsupervised machine learning (Tan et al., 2019). Clustering is described as unsupervised because no labels are applied to the data. When class labels must be applied to the data, the process is supervised, and it becomes a classification method rather than a type of clustering (Tan et al., 2019).

The difference between unsupervised clustering and supervised classification can be shown with an example. Suppose a fruit farm is growing one variety each of grapes, peaches, and nectarines and using semi-automated, semi-intelligent picking equipment. A grapevine grows its fruit in clusters on a stem (a bunch of grapes) in a manner that is unique among the three fruits, so the machine has no trouble distinguishing grapes from either of the other two. The grapes can be harvested in clusters with little chance of being confused with the peaches or nectarines, which is analogous to unsupervised clustering: the groups emerge from the characteristics of the fruit itself, without any labels. However, the peaches and nectarines must be picked and then manually classified by a human (assuming the trees are mixed in together), since the characteristics of the two fruits are so similar. The machine that picks them cannot tell the difference between the two (and sometimes neither can I). So some sort of labeling is necessary for the peaches and nectarines, which is what makes it supervised learning.

K-means uses a prototype, or representative point, for each cluster (Tan et al., 2019). The prototype is the centroid, which is computed as the mean of the data points in the cluster (Tan et al., 2019). There is one centroid per cluster, and the number of clusters is the K in the name K-means (Garbade, 2018). K-means works iteratively to find the best placement of the centroids: each point is assigned to its nearest centroid, and each centroid is then recomputed from the points assigned to it (Garbade, 2018). When the centroids no longer move, or the desired number of iterations has been reached, the algorithm stops (Garbade, 2018).

What are some examples of how K-means could be used in real life? According to Raghupathi (2018), the K-means algorithm can be used for document classification, identifying high-risk crime locations, customer segmentation, insurance fraud detection, profiling cyber criminals, and more. Can anybody think of other common uses of K-means clustering?
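To make the iterative description above concrete, here is a minimal sketch of K-means in plain Python. It is not taken from Tan et al. (2019) or Garbade (2018); the function names (`k_means`, `squared_distance`, `mean_point`), the random choice of starting centroids, and the sample points are all assumptions made for illustration. Real libraries use more careful initialization and handle edge cases that this sketch ignores.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Coordinate-wise mean (the centroid) of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, k, max_iterations=100, seed=0):
    """Illustrative K-means: returns (centroids, labels).

    Assumption: starting centroids are k distinct points picked at random.
    """
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iterations):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # If no point changed cluster, the centroids would not move either.
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = mean_point(members)
    return centroids, labels


if __name__ == "__main__":
    sample = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5),
              (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
    found_centroids, found_labels = k_means(sample, k=2)
    print(found_centroids, found_labels)
```

The loop ends on either of the two stopping conditions mentioned above: the assignments stop changing, or `max_iterations` is reached.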

2. Is a binary variable the same as a dichotomous variable? Provide scholarly justification.

In general terms, yes, binary and dichotomous variables are the same (Glen, 2014). However, digging into their definitions in more detail reveals subtle differences. Glen (2014) described a binary variable as a subtype of the larger category of dichotomous variables. A dichotomous variable has two possible values, such as pass or fail, which represent nominal categories (Glen, 2014). A binary variable also has two values, but it is typically represented either as a Boolean (such as False or True) or as an integer (such as 0 or 1) (Karabiber, 2021). In addition, dichotomous variables can be either discrete or continuous (Glen, 2014). For example, the pass or fail dichotomous variable can rest on an underlying continuous score: a score of 69.5 falls short of a cutoff of 70, but if the professor is nice and rounds up to 70, it is passing (Glen, 2014). Binary variables, on the other hand, are discrete variables with no option for a range (Tan et al., 2019). So what are some other examples for thought? Glen (2014) explained that a person can be dead or alive, so that is a discrete dichotomous variable. What happens if a person is on a life support system, however? Would that person then be considered in a continuous dichotomous state of living? And we know the binary descriptor of male or female describes the majority of the population. However, what about those individuals who describe themselves as non-binary? If they are non-binary, are they then considered continuous dichotomous, or would they be discrete dichotomous?
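The pass or fail example can be sketched in a few lines of Python. This is only an illustration: the cutoff of 70, the helper names `dichotomize` and `round_half_up`, and the 0/1 coding are assumptions, not something prescribed by Glen (2014) or Karabiber (2021).

```python
import math


def round_half_up(score):
    """Round to the nearest whole number, with .5 always rounding up.

    Python's built-in round() rounds halves to the nearest even number,
    so a small helper is used here instead.
    """
    return math.floor(score + 0.5)


def dichotomize(score, cutoff=70.0):
    """Collapse a continuous score into a binary code: 1 = pass, 0 = fail."""
    return 1 if score >= cutoff else 0


raw_score = 69.5                                   # underlying continuous value
strict = dichotomize(raw_score)                    # no rounding applied
lenient = dichotomize(round_half_up(raw_score))    # the "nice professor" case
passed = bool(lenient)                             # same outcome as a Boolean
print(strict, lenient, passed)
```

The continuous score carries a range of values, while the binary code that comes out of `dichotomize` can only ever be 0 or 1.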

References

Garbade, M. (2018). Understanding K-means clustering in machine learning. Retrieved from https://towardsdatascience.com/understanding-k-mea…

Glen, S. (2014). Dichotomous variable: Definition. StatisticsHowTo.com: Elementary Statistics for the rest of us! Retrieved from https://www.statisticshowto.com/dichotomous-variab…

Karabiber, F. (2021). Binary variable. Retrieved from https://www.learndatasci.com/glossary/binary-varia…

Raghupathi, K. (2018). Ten interesting use cases for the K-Means algorithm. Retrieved from https://dzone.com/articles/10-interesting-use-case…

Tan, P., Steinbach, M., Karpatne, A., & Kumar, V. (2019). Introduction to Data Mining (2nd Edition). Pearson Education (US).

—————————————————————————————————————————————-

Answer 2

K-means is an unsupervised learning algorithm designed to partition unlabeled data into a given number of distinct clusters. It is simply a way of finding observations with similar characteristics and grouping them together. In a good clustering, the observations within a cluster are more similar to each other than they are to observations in other clusters. The algorithm’s aim is to minimize an objective function (Caruso et al., 2021).
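The answer above does not say which objective function is meant. For standard K-means it is usually the within-cluster sum of squared distances between each observation and the centroid of its cluster, so the sketch below assumes that. The function name and the sample data are hypothetical.

```python
def within_cluster_sum_of_squares(points, labels, centroids):
    """Sum of squared Euclidean distances from each point to its centroid.

    points:    list of coordinate tuples
    labels:    labels[i] is the cluster index of points[i]
    centroids: centroids[j] is the centroid of cluster j
    """
    return sum(
        sum((a - b) ** 2 for a, b in zip(point, centroids[label]))
        for point, label in zip(points, labels)
    )


points = [(0.0, 0.0), (2.0, 0.0), (10.0, 0.0), (12.0, 0.0)]

# A sensible grouping: the two left points together, the two right together.
good = within_cluster_sum_of_squares(
    points, [0, 0, 1, 1], [(1.0, 0.0), (11.0, 0.0)])

# A poor grouping that mixes left and right points in each cluster.
poor = within_cluster_sum_of_squares(
    points, [0, 1, 0, 1], [(5.0, 0.0), (7.0, 0.0)])

print(good, poor)
```

A lower value means tighter clusters, which is why the sensible grouping scores lower than the mixed one.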

2

The dichotomous variable is a natural choice for this analysis as it gives us more information on how people interact. If people want to communicate, there are two things that they can do. The first is to use a personal website, and the second is a public website to communicate with others. There are two types of public communication sites: a site for the whole group or small groups, or an entire organization. Both of these sites are important and are described in a follow-up (Caruso et al., 2021).

The binary variable is the same as a dichotomous variable because there is a chance of choosing the positive answer with the smallest sample size. The sample sizes available for binary variables include a sample of n records and a random sample of size n. If there are no constraints on sample sizes, then samples of size n will always be chosen by a priori probability. The sample sizes are not shown (Caruso et al., 2021).

Reference

Caruso, G., Gattone, S. A., Fortuna, F., & Di Battista, T. (2021). Cluster Analysis for mixed data: An application to credit risk evaluation. Socio-Economic Planning Sciences, 73, 100850.