Writing Homework Help
MIS 600 Grand Canyon University IBM Approach to Text Analytics Discussion
Read “About IBM SPSS Modeler Text Analytics” and view “Text Analytics in IBM SPSS Modeler 18.2,” both located in the study materials, and compare them to Section 5.5 in Chapter 5 of the textbook.
In 100-150 words, discuss whether the IBM approach is consistent with what is in the textbook. Provide examples to support your rationale.
Read “About IBM SPSS Modeler Text Analytics,” located on the IBM website.
View “Text Analytics in IBM SPSS Modeler 18.2,” from DTE (2019), located on the YouTube website.
5.5 Text Mining Process

To be successful, text mining studies should follow a sound methodology based on best practices. A standardized process model is needed, similar to the Cross-Industry Standard Process for Data Mining (CRISP-DM), which is the industry standard for data mining projects (see Chapter 4). Even though most parts of CRISP-DM are also applicable to text mining projects, a specific process model for text mining would include much more elaborate data preprocessing activities. Figure 5.5 depicts a high-level context diagram of a typical text mining process (Delen & Crossland, 2008). This context diagram presents the scope of the process, emphasizing its interfaces with the larger environment. In essence, it draws boundaries around the specific process to explicitly identify what is included in (and excluded from) the text mining process.

As the context diagram indicates, the input (inward connection to the left edge of the box) into the text-based knowledge-discovery process is the unstructured as well as structured data collected, stored, and made available to the process. The output (outward extension from the right edge of the box) of the process is the context-specific knowledge that can be used for decision making. The controls, also called the constraints (inward connection to the top edge of the box), of the process include software and hardware limitations, privacy issues, and the difficulties related to processing of the text that is presented in the form of natural language. The mechanisms (inward connection to the bottom edge of the box) of the process include proper techniques, software tools, and domain expertise. The primary purpose of text mining (within the context of knowledge discovery) is to process unstructured (textual) data (along with structured data, if relevant to the problem being addressed and available) to extract meaningful and actionable patterns for better decision making.

[Figure 5.5: Context diagram of the text mining process. Inputs: unstructured data (text), structured data (databases). Output: context-specific knowledge. Controls: software/hardware limitations, privacy issues, linguistic limitations. Mechanisms: tools and techniques, domain expertise.]
At a very high level, the text mining process can be broken down into three consecutive tasks, each of which has specific inputs to generate certain outputs (see Figure 5.6). If, for some reason, the output of a task is not that which is expected, a backward redirection to the previous task execution is necessary.

[Figure 5.6: The Three-Step/Task Text Mining Process. Task 1 (Establish the Corpus) collects and organizes the domain-specific unstructured data; its output is a collection of documents in some digitized format for computer processing. Task 2 (Create the Term–Document Matrix) introduces structure to the corpus; its output is a flat file called a term–document matrix, where the cells are populated with the term frequencies. Task 3 (Extract Knowledge) discovers novel patterns from the T–D matrix; its output is a number of problem-specific classification, association, and clustering models and visualizations. The inputs to the process include a variety of relevant unstructured (and semi-structured) data sources such as text, XML, and HTML; feedback loops connect each task back to the previous one.]

Task 1: Establish the Corpus

The main purpose of the first task is to collect all the documents related to the context (domain of interest) being studied. This collection may include textual documents, XML files, e-mails, Web pages, and short notes. In addition to the readily available textual data, voice recordings may also be transcribed using speech-recognition algorithms and made a part of the text collection.

Once collected, the text documents are transformed and organized in a manner such that they are all in the same representational form (e.g., ASCII text files) for computer processing. The organization of the documents can be as simple as a collection of digitized text excerpts stored in a file folder, or it can be a list of links to a collection of Web pages in a specific domain. Many commercially available text mining software tools can accept these as input and convert them into a flat file for processing. Alternatively, the flat file can be prepared outside the text mining software and then presented as the input to the text mining application.

Task 2: Create the Term–Document Matrix

In this task, the digitized and organized documents (the corpus) are used to create the term–document matrix (TDM). In the TDM, rows represent the documents and columns represent the terms. The relationships between the terms and documents are characterized by indices (i.e., a relational measure that can be as simple as the number of occurrences of the term in respective documents). Figure 5.7 is a typical example of a TDM.

The goal is to convert the list of organized documents (the corpus) into a TDM where the cells are filled with the most appropriate indices. The assumption is that the essence of a document can be represented with a list and frequency of the terms used in that document. However, are all terms important when characterizing documents? Obviously, the answer is “no.” Some terms, such as articles, auxiliary verbs, and terms used in almost all the documents in the corpus, have no differentiating power and, therefore, should be excluded from the indexing process. This list of terms, commonly called stop terms or stop words, is specific to the domain of study and should be identified by the domain experts. On the other hand, one might choose a set of predetermined terms under which the documents are to be indexed (this list of terms is conveniently called include terms or dictionary). In addition, synonyms (pairs of terms that are to be treated the same) and specific phrases (e.g., “Eiffel Tower”) can also be provided so that the index entries are more accurate.
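As a hedged illustration of Tasks 1 and 2, the sketch below reads a folder of plain-text files into a corpus and builds a raw term–document matrix with a stop-term list and a simple synonym mapping. Neither the textbook nor IBM SPSS Modeler prescribes this tooling; Python with scikit-learn, the folder name, the stop terms, and the synonym pair are illustrative assumptions.

# Minimal sketch of Tasks 1 and 2, assuming a folder of plain-text files.
# The folder name, stop terms, and synonym pair are made up for illustration.
from pathlib import Path
from sklearn.feature_extraction.text import CountVectorizer

# Task 1: establish the corpus -- collect documents into one representational form.
corpus_dir = Path("corpus")                      # hypothetical folder of .txt files
documents = [p.read_text(encoding="utf-8") for p in sorted(corpus_dir.glob("*.txt"))]

# Task 2: create the term-document matrix (TDM).
stop_terms = ["the", "a", "an", "and", "or", "is", "are"]   # illustrative stop list
synonyms = {"automobile": "car"}                            # illustrative synonym pair

def normalize_terms(text: str) -> str:
    # Replace each synonym with its canonical term so both index identically.
    for variant, canonical in synonyms.items():
        text = text.replace(variant, canonical)
    return text

vectorizer = CountVectorizer(preprocessor=lambda t: normalize_terms(t.lower()),
                             stop_words=stop_terms)
tdm = vectorizer.fit_transform(documents)        # rows = documents, columns = terms
print(tdm.shape, vectorizer.get_feature_names_out()[:10])

The resulting matrix has one row per document and one column per term, with occurrence counts in the cells, matching the TDM layout described above.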
Another filtration that should take place to accurately create the indices is stemming, which refers to the reduction of words to their roots so that, for example, different grammatical forms or declinations of a verb are identified and indexed as the same word. For example, stemming will ensure that modeling and modeled will be recognized as the word model.

The first generation of the TDM includes all the unique terms identified in the corpus (as its columns), excluding the ones in the stop term list; all the documents (as its rows); and the occurrence count of each term for each document (as its cell values). If, as is commonly the case, the corpus includes a rather large number of documents, then there is a very good chance that the TDM will have a very large number of terms. Processing such a large matrix might be time-consuming and, more important, might lead to extraction of inaccurate patterns. At this point, one has to decide the following: (1) What is the best representation of the indices? and (2) How can we reduce the dimensionality of this matrix to a manageable size?

Representing the Indices. Once the input documents are indexed and the initial word frequencies (by document) computed, a number of additional transformations can be performed to summarize and aggregate the extracted information. The raw term frequencies generally reflect on how salient or important a word is in each document. Specifically, words that occur with greater frequency in a document are better descriptors of the contents of that document. However, it is not reasonable to assume that the word counts themselves are proportional to their importance as descriptors of the documents. For example, if a word occurs one time in document A, but three times in document B, then it is not necessarily reasonable to conclude that this word is three times as important a descriptor of document B as compared to document A. To have a more consistent TDM for further analysis, these raw indices need to be normalized. As opposed to showing the actual frequency counts, the numerical representation between terms and documents can be normalized using a number of alternative methods, such as log frequencies, binary frequencies, and inverse document frequencies, among others.
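The normalization alternatives named above can be sketched as follows. This is a minimal illustration in Python/scikit-learn with made-up documents, not the textbook's prescribed formulas; scikit-learn's TfidfTransformer is one common implementation of inverse document frequency weighting.

# Illustrative normalization of a raw count TDM (documents and formulas are assumptions).
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

documents = [
    "text mining extracts knowledge from text",
    "data mining extracts patterns from data",
    "the weather was pleasant today",
]
tdm = CountVectorizer().fit_transform(documents)    # raw term counts (documents x terms)
counts = tdm.toarray().astype(float)

# Binary frequencies: 1 if the term occurs in the document at all, 0 otherwise.
binary = (counts > 0).astype(int)

# Log frequencies: log(1 + count) dampens raw counts, so a term used three times
# is not treated as three times as important as a term used once.
log_freq = np.log1p(counts)

# Inverse document frequency weighting (TF-IDF): down-weights terms that appear
# in most documents and therefore have little differentiating power.
tfidf = TfidfTransformer().fit_transform(tdm)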
Reducing the Dimensionality of the Matrix. Because the TDM is often very large and rather sparse (most of the cells filled with zeros), another important question is, “How do we reduce the dimensionality of this matrix to a manageable size?” Several options are available for managing the matrix size:

• A domain expert goes through the list of terms and eliminates those that do not make much sense for the context of the study (this is a manual, labor-intensive process).
• Eliminate terms with very few occurrences in very few documents.
• Transform the matrix using singular value decomposition (SVD).

Singular value decomposition (SVD), which is closely related to principal components analysis, reduces the overall dimensionality of the input matrix (number of input documents by number of extracted terms) to a lower-dimensional space, where each consecutive dimension represents the largest degree of variability (between words and documents) possible (Manning & Schutze, 1999). Ideally, the analyst might identify the two or three most salient dimensions that account for most of the variability (differences) between the words and documents, thus identifying the latent semantic space that organizes the words and documents in the analysis. Once such dimensions are identified, the underlying “meaning” of what is contained (discussed or described) in the documents has been extracted.

Task 3: Extract the Knowledge

Using the well-structured TDM, potentially augmented with other structured data elements, novel patterns are extracted in the context of the specific problem being addressed. The main categories of knowledge extraction methods are classification, clustering, association, and trend analysis. A short description of these methods follows.

Classification. Arguably the most common knowledge-discovery topic in analyzing complex data sources is the classification (or categorization) of certain objects. The task is to classify a given data instance into a predetermined set of categories (or classes). As it applies to the domain of text mining, the task is known as text categorization, where for a given set of categories (subjects, topics, or concepts) and a collection of text documents, the goal is to find the correct topic (subject or concept) for each document using models developed with a training data set that includes both the documents and the actual document categories. Today, automated text classification is applied in a variety of contexts, including automatic or semiautomatic (interactive) indexing of text, spam filtering, Web page categorization under hierarchical catalogs, automatic generation of metadata, detection of genre, and many others.

The two main approaches to text classification are knowledge engineering and machine learning (Feldman & Sanger, 2007). With the knowledge-engineering approach, an expert’s knowledge about the categories is encoded into the system either declaratively or in the form of procedural classification rules. With the machine-learning approach, a general inductive process builds a classifier by learning from a set of preclassified examples. As the number of documents increases at an exponential rate and as knowledge experts become harder to come by, the popularity trend between the two is shifting toward the machine-learning approach.
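To tie the last two ideas together (SVD-based reduction to a latent semantic space and machine-learning text categorization), the following sketch chains TF-IDF weighting, truncated SVD, and a simple classifier. The training documents, the category labels, and the choice of two latent dimensions are hypothetical assumptions; the textbook does not specify an implementation.

# Illustrative sketch only: TF-IDF -> truncated SVD (latent semantic space) -> classifier.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression

# Hypothetical labeled training documents (categories made up for illustration).
train_docs = [
    "stock markets rallied after the quarterly earnings report",
    "the central bank left interest rates unchanged",
    "the home team won the championship final",
    "the striker scored twice in the match",
]
train_labels = ["finance", "finance", "sports", "sports"]

classifier = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    TruncatedSVD(n_components=2, random_state=0),   # keep a few salient latent dimensions
    LogisticRegression(),                           # inductive learner trained on labeled examples
)
classifier.fit(train_docs, train_labels)
print(classifier.predict(["bond yields fell as investors expected a rate cut"]))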
Clustering. Clustering is an unsupervised process whereby objects are classified into “natural” groups called clusters. Compared to categorization, where a collection of preclassified training examples is used to develop a model based on the descriptive features of the classes to classify a new unlabeled example, in clustering the problem is to group an unlabeled collection of objects (e.g., documents, customer comments, Web pages) into meaningful clusters without any prior knowledge.

Clustering is useful in a wide range of applications, from document retrieval to enabling better Web content searches. In fact, one of the prominent applications of clustering is the analysis and navigation of very large text collections, such as Web pages. Clustering documents by the similarity of their content can improve search effectiveness in two ways:

• Improved search recall. Because clustering is based on overall similarity rather than the presence of a single term, it can improve the recall of a query-based search in such a way that when a query matches a document, its whole cluster is returned.
• Improved search precision. Clustering can also improve search precision. As the number of documents in a collection grows, it becomes difficult to browse through the list of matched documents. Clustering can help by grouping the documents into a number of much smaller groups of related documents, ordering them by relevance, and returning only the documents from the most relevant group (or groups).

The two most popular clustering methods are scatter/gather clustering and query-specific clustering:

• Scatter/gather. This document browsing method uses clustering to enhance the efficiency of human browsing of documents when a specific search query cannot be formulated. In a sense, the method dynamically generates a table of contents for the collection and adapts and modifies it in response to the user selection.
• Query-specific clustering. This method employs a hierarchical clustering approach where the most relevant documents to the posed query appear in small, tight clusters that are nested in larger clusters containing less-similar documents, creating a spectrum of relevance levels among the documents. This method performs consistently well for document collections of realistically large sizes.
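A minimal clustering sketch, under the same assumptions (Python/scikit-learn, made-up documents): an unlabeled collection is grouped into clusters purely by content similarity, with no predefined categories.

# Illustrative sketch: grouping an unlabeled document collection into clusters.
# The documents and the number of clusters are assumptions, not from the textbook.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

documents = [
    "stock markets rallied after the earnings report",
    "the central bank left interest rates unchanged",
    "the home team won the championship final",
    "the striker scored twice in the match",
]
vectors = TfidfVectorizer(stop_words="english").fit_transform(documents)

# Group the documents into two "natural" clusters by content similarity.
cluster_ids = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
print(cluster_ids)   # e.g., [0 0 1 1]: finance-like vs. sports-like documents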
Association. A formal definition and detailed description of association was provided in the chapter on data mining (Chapter 4). Association (or association rule learning) in data mining is a popular and well-researched technique for discovering interesting relationships among variables in large databases. The main idea in generating association rules (or solving market-basket problems) is to identify the frequent sets that go together.

In text mining, associations specifically refer to the direct relationships between concepts (terms) or sets of concepts. The concept-set association rule A ⇒ C, relating two frequent concept sets A and C, can be quantified by the two basic measures of support and confidence. In this case, confidence is the percentage of documents that include all the concepts in C within the same subset of those documents that include all the concepts in A. Support is the percentage (or number) of documents that include all the concepts in A and C.

For instance, in a document collection the concept “Software Implementation Failure” may appear most often in association with “Enterprise Resource Planning” and “Customer Relationship Management” with significant support (4%) and confidence (55%), meaning that 4% of the documents had all three concepts represented together in the same document, and of the documents that included “Software Implementation Failure,” 55% of them also included “Enterprise Resource Planning” and “Customer Relationship Management.”

Text mining with association rules was used to analyze published literature (news and academic articles posted on the Web) to chart the outbreak and progress of the bird flu (Mahgoub et al., 2008). The idea was to automatically identify the associations among the geographic areas, spreading across species, and countermeasures (treatments).

Trend Analysis. Recent methods of trend analysis in text mining have been based on the notion that the various types of concept distributions are functions of document collections; that is, different collections lead to different concept distributions for the same set of concepts. It is, therefore, possible to compare two distributions that are otherwise identical except that they are from different subcollections. One notable direction of this type of analysis is having two collections from the same source (such as from the same set of academic journals) but from different points in time. Delen and Crossland (2008) applied trend analysis to a large number of academic articles (published in the three highest-rated academic journals) to identify the evolution of key concepts in the field of information systems.
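The trend analysis idea (comparing concept distributions across subcollections from different time periods) can be sketched as follows. The article snippets and the simple relative-frequency comparison are illustrative assumptions, not the specific method used by Delen and Crossland (2008).

# Illustrative sketch: compare term distributions in two subcollections from
# the same source but different time periods (all data made up for illustration).
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

early_articles = [
    "decision support systems and expert systems in organizations",
    "expert systems for managerial decision support",
]
recent_articles = [
    "machine learning and text mining for business analytics",
    "text mining social media data with machine learning",
]

vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(early_articles + recent_articles).toarray()
terms = vectorizer.get_feature_names_out()

# Concept distribution of each subcollection: share of all term occurrences.
early_dist = counts[:2].sum(axis=0) / counts[:2].sum()
recent_dist = counts[2:].sum(axis=0) / counts[2:].sum()

# Terms whose relative frequency rose the most between the two periods.
change = recent_dist - early_dist
for i in np.argsort(change)[::-1][:3]:
    print(terms[i], round(change[i], 3))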