Computer Science Homework Help

San Jose State University Handling Categorical Attributes Discussion Responses

 

Answer 1

Two (2) techniques in handling categorical attributes are as follows:

One-Hot Encoding: One-Hot Encoding is the most common way to deal with non-ordinal categorical data. It creates an additional binary feature for each category of the categorical feature and marks each observation as belonging (value = 1) or not belonging (value = 0) to that category (Zuccarelli, 2020).
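A minimal sketch of one-hot encoding in plain Python (the `one_hot_encode` helper and the color values are illustrative assumptions, not part of the cited source):

```python
def one_hot_encode(values):
    """Return (categories, rows): one 0/1 column per distinct category."""
    categories = sorted(set(values))
    # Each observation becomes a binary vector: 1 for its own category, 0 elsewhere.
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows

colors = ["red", "green", "red", "blue"]
cats, encoded = one_hot_encode(colors)
# cats    -> ["blue", "green", "red"]
# encoded -> [[0, 0, 1], [0, 1, 0], [0, 0, 1], [1, 0, 0]]
```

Libraries such as pandas (`get_dummies`) or scikit-learn (`OneHotEncoder`) provide the same transformation with more options.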

Target Encoding: Target Encoding substitutes each category in a categorical feature with the average of the target variable for that category. The process is: group the data by category, calculate the average of the target variable for each group, and assign that average to each observation belonging to the group (Zuccarelli, 2020).
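The three steps above can be sketched as follows (the `target_encode` helper and the city/price data are illustrative assumptions):

```python
from collections import defaultdict

def target_encode(categories, targets):
    """Replace each category with the mean target value of its group."""
    sums, counts = defaultdict(float), defaultdict(int)
    # Step 1 & 2: group by category and accumulate the per-group average.
    for c, t in zip(categories, targets):
        sums[c] += t
        counts[c] += 1
    means = {c: sums[c] / counts[c] for c in sums}
    # Step 3: assign each observation its group's average.
    return [means[c] for c in categories]

cities = ["SF", "SF", "NY", "NY"]
prices = [100.0, 300.0, 50.0, 150.0]
# target_encode(cities, prices) -> [200.0, 200.0, 100.0, 100.0]
```

Note that target encoding computed on the full dataset can leak target information into the features; in practice the averages are usually computed on training data only.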

Two (2) ways in which continuous attributes differ from categorical attributes are as follows:

A continuous (quantitative) attribute is one whose values can change continuously, so we cannot count the number of distinct values. A categorical attribute, in contrast, distinguishes between different groups, and we can typically list a small number of categories (Kosara, 2013).

Examples of continuous attributes include weight, price, profit, and counts; basically, anything that can be measured or counted is quantitative. Examples of categorical attributes include product type, gender, and age group.

References

Kosara, R. (2013). Data: Continuous vs. Categorical.

Zuccarelli, E. (2020). Handling Categorical Data, The Right Way.

—————————————————————————————————————————————-

Answer 2

  • Discuss four (4) techniques for handling categorical attributes.

Categorical attributes include symmetric binary attributes, such as gender, and nominal attributes, such as state and level of education. To apply association rule mining and extract patterns from categorical attributes, they must first be transformed into items. One technique is to create a new binary item for each attribute-value pair; for example, gender can be replaced with Gender=Male and Gender=Female, and education with Education=Graduate, Education=College, etc. One of these items will have a value of 1 and the rest a value of 0. A second technique is to group related attribute values into a smaller number of categories, or to group less frequent values into an "Others" category. This works well with nominal attributes that have many infrequent values, such as State. A third technique is to remove some high-frequency items before applying standard association rule algorithms, because they correspond to typical values of an attribute and seldom carry new information about the pattern. A fourth technique is to avoid generating candidate itemsets that contain more than one item from the same attribute, because the support count of such an itemset will be zero. This helps reduce computation time (Tan et al., 2019).
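The first and fourth techniques can be sketched together in Python (the helper names and the example record are illustrative assumptions; the "attribute=value" item format follows the convention described above):

```python
def to_items(record):
    """Map one record (dict of attribute -> value) to binary items,
    one item per attribute-value pair (first technique)."""
    return {f"{attr}={val}" for attr, val in record.items()}

def same_attribute(item_a, item_b):
    """True if two items come from the same attribute. Such pairs can be
    skipped during candidate generation (fourth technique), since an
    itemset containing both would always have support zero."""
    return item_a.split("=")[0] == item_b.split("=")[0]

record = {"Gender": "Male", "Education": "Graduate", "State": "CA"}
# to_items(record) -> {"Gender=Male", "Education=Graduate", "State=CA"}
# same_attribute("Gender=Male", "Gender=Female") -> True  (skip this pair)
# same_attribute("Gender=Male", "State=CA")      -> False (a valid candidate)
```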

  • Discuss two (2) ways in which continuous attributes differ from categorical attributes.

Examples of continuous attributes are annual income and age. They must be handled differently from categorical attributes. One method is discretization, where adjacent values of a continuous attribute are grouped into a finite number of intervals; the discrete intervals can then be mapped to asymmetric binary attributes, and existing association analysis algorithms can be applied. Another way of handling a continuous attribute is to transform the data into a 0/1 matrix: if a count exceeds a certain threshold, the entry is 1, and otherwise 0. By transforming the continuous attributes into a binary dataset, existing frequent itemset generation algorithms can be applied. However, the accuracy of the associations depends on the threshold value: some associations will be missed if the threshold is too high, while a low threshold can produce many spurious associations. A third approach is the statistics-based method, where the target variable of interest is withheld and the remaining categorical and continuous attributes are binarized. Finally, existing algorithms such as Apriori and FP-growth can be applied to the binarized data to extract frequent itemsets (Tan et al., 2019).
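The discretization step can be sketched in Python (the `discretize` helper, the interval edges, and the age labels are illustrative assumptions):

```python
def discretize(value, edges, labels):
    """Map a continuous value to the label of its interval.
    `edges` holds the upper bound of every interval except the last,
    which is open-ended."""
    for edge, label in zip(edges, labels):
        if value <= edge:
            return label
    return labels[-1]

ages = [23, 37, 61]
labels = ["Age=0-30", "Age=31-50", "Age=51+"]
binned = [discretize(a, [30, 50], labels) for a in ages]
# binned -> ["Age=0-30", "Age=31-50", "Age=51+"]
```

Each resulting interval label can then be treated as an asymmetric binary item, exactly like the attribute-value items used for categorical data.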

Reference:

Tan, P.-N., Steinbach, M., Kumar, V., & Karpatne, A. (2019). Introduction to data mining (2nd ed.). Pearson.