2.1 INTRODUCTION
Natural language processing (NLP) refers to the convergence of computer science, linguistics and machine learning. The field is concerned with enabling machines and people to communicate in natural language. NLP methods are used in voice assistants such as Amazon's Alexa and Apple's Siri, as well as in other areas such as text filtering and machine translation. The field of Text Mining (TM) has received a great deal of attention in recent years due to the large amount of text data generated in a variety of forms such as social networks, patient records, health insurance data, news outlets, etc. In an EMC-sponsored report, IDC predicted that the data volume would grow to 40 zettabytes by 2020, a 50-fold increase from the beginning of 2010 [2]. Text is a prime example of unstructured information and one of the simplest types of data to generate in most scenarios. Unstructured text is easily understood and perceived by humans, but it is much harder for computers to process. At the same time, this volume of text is an invaluable source of information and knowledge. As a result, there is a pressing need to design methods and algorithms that efficiently process this avalanche of text in a wide range of applications. Text mining techniques are similar to traditional data mining and knowledge discovery approaches, with some of the specificities mentioned below.
2.1.1 Knowledge Discovery vs Data Mining
There are various definitions in the literature for knowledge discovery, or Knowledge Discovery in Databases (KDD), and data mining. KDD is commonly defined as the non-trivial extraction of implicit, previously unknown and potentially useful information from data [4, 5, 8]. Data mining is the application of specific algorithms to extract patterns from data. KDD aims to discover hidden patterns and relationships in the data. Based on these definitions, KDD refers to the overall process of discovering useful knowledge from data, while data mining refers to a specific step of this process. Data may be organized in databases, but is mostly unstructured, such as the data in a plain text file. Extracting useful patterns from a dataset involves applying several measures to the dataset of interest, and the process is iterative and interactive, with decisions made by the user along the way. The CRISP-DM model is a well-known data mining standard. Databases are essential for evaluating large quantities of data, and data mining algorithms can dramatically increase data analysis capability; integration with databases is therefore necessary for the sake of data integrity and management.
For more detail on how databases can be used when mining data, see [28]. In machine learning (ML), researchers search for patterns in data in order to predict future data. The automatic extraction of information is a key concept in machine learning, and data mining relies heavily on machine learning algorithms; for more detail, see [10, 26]. Statistics is the branch of mathematics concerned with the collection, analysis, interpretation and presentation of data. Today, there are many kinds of statistical and probability-based data mining algorithms, and a large amount of research connecting data mining and statistical learning has already been carried out.
2.2 BACKGROUND
Recent improvements in computational power and algorithms have made it easier to perform data-driven analysis with machine learning methods. Adaptation and compatibility with different operating platforms have become more important when developing new extraction techniques for larger datasets, as the key goal has shifted to suitability and universality [12]. The two main classes of data-driven approaches are predictive and descriptive. The former focuses on building a model of the underlying phenomena that drive a dataset, from which information and results are then produced. Research on such methods has already been performed in connection with the estimation of aircraft accident risk [13, 14] and predictive operations in the airline industry [15, 16]. In contrast to prediction, the descriptive approach prioritises identifying non-trivial conclusions and patterns rather than using existing trends to make future predictions.
There has been an increase in the use of text mining in the aviation industry, given the growing number of subjective text reports that need to be evaluated quickly. NLP techniques with a variety of approaches have been developed to search and classify records, as well as to uncover the dynamic relationships that exist within them. The approach to automated linguistic analysis of aviation safety reports put forward by Pimm et al. [20] addresses several problems specific to such reports, including the challenge of dealing with acronyms and of grouping terms that use distinct words but have similar meanings. In addition, the study of Tanguy et al. [21] deals with the classification of aviation accidents: how events can be detected, described and recorded, as well as their history and evolution over time and space.
That study focused on identifying and classifying the various categories of flight safety triggers appearing in narrative text. Another work, which emphasizes the importance of arranging events in their respective subject areas, classifies events on the basis of their subjects [23]. Determining subject labels with the help of subject matter experts leaves the door open for further study of all the types of metadata found in the ASRS archives; by assigning these labels, we can examine how closely the label themes map to the subjects of their associated narratives. In a different but complementary approach, El Ghaoui et al. [24] examine the relationship between metadata and patterns in text reports. Since such research has only covered two airports, the effect of external flight conditions on each narrative still needs to be examined. Similar clustering techniques have been used by Srivastava et al. [26] to classify recurring anomalies in aerospace studies. While that research focuses on repeated anomalies and anomaly classifications, it provides several innovative approaches to the unsupervised classification of safety narratives and serves as a valuable framework for the later portion of this work, which discusses different types of metadata-centered anomaly triggers. Robinson's research also found that isometric mapping provides a more intuitive way of visually representing the effect of various event categories on text-based narrative documents.
2.3 DATASET
Instead of large datasets that take a long time to fit models, it is better to use smaller datasets that can be downloaded and processed quickly. It is also recommended to use well-understood and widely used datasets so that progress can be assessed; such datasets appear throughout the machine-learning research literature. Datasets are an essential part of machine learning research. Major advances in this field are typically driven by advances in computer hardware, by learning algorithms (such as deep learning) and, more practically, by large-scale training datasets. Data labeling is a long and time-consuming process, so high-quality datasets for supervised and semi-supervised machine learning algorithms are rarely affordable. Good datasets for unsupervised learning, although unlabeled, are also difficult and costly to produce.
2.4 RELATED WORK
In previous approaches, clustering algorithms have been divided by the type of dataset used. While some studies use either real-world or artificial data, others use both dataset types to compare the performance of several clustering methods. A comparative analysis using real-world datasets is presented in several works [20, 21, 24, 25], some of which are briefly reviewed below. In [20], the authors propose an evaluation approach based on multiple decision-making criteria in the field of financial risk analysis, over three real-world credit risk and bankruptcy risk datasets. In particular, clustering algorithms are evaluated against a combination of clustering measurements, including a collection of external and internal validation indices. Their results show that no algorithm achieves the best performance on every dataset under all measurements; for this reason, more than one performance measure must be used to evaluate clustering algorithms. According to Barker and Cornacchia (2010), extracting noun phrases from a text and ranking them by their length and frequency is an effective way of discovering useful ideas. The method proposed by Muñoz (2017) is an unsupervised algorithm that relies on adaptive resonance theory to identify bi-term key phrases. El-Beltagy and Rafea (2010) introduced KP-Miner, a state-of-the-art TF-IDF-based implementation. The main drawbacks of KP-Miner are that it treats phrases as nothing more than information in the document and, furthermore, fails to account for the fact that, according to studies conducted over recent years, the average percentage of stop words in a piece of text is around 15%.
Several measures have been proposed to ensure that this does not become a problem. Danesh et al. (2015) present a hybrid statistical- and graph-based approach, using TF-IDF and the position of the first occurrence of a phrase to measure its weight. Once phrases have been collected and weights assigned, they are used to create a phrase popularity graph, and the weights are recalculated using a centrality metric to produce the final phrase ranking. Wan and Xiao (2018) introduce additional edge weighting from neighbouring documents, incorporating their co-occurrence information into the graph representation of the input text. To find documents in a large corpus that are relevant to the input document, the algorithm uses a cosine similarity test. Identifying and ranking phrases that relate to the themes in the document is achieved through this retrieval of related records. This, however, results in an exceptionally high cost when retrieving subject-relevant information from large corpora.
After matching a certain linguistic pattern, the method then selects keywords or phrases that contain one or more cluster centroids. While Key-Cluster has been shown to outperform other prominent AKE methods, it is premature to say that all the clusters generated are sufficiently comprehensive to contain all key phrases in a document. Heuristics, such as using the cluster centroid average, work in the vast majority of situations but can yield incorrect results when used to recognize and extract key phrases. More recently, the idea of integrating topic analysis into the AKE task has become common; this newer line of work aims to ensure that key phrases representing the subject from an up-to-date point of view appear prominently. Clustering-based approaches show great improvement in AKE performance, but suffer from empirical problems with topic analysis. A significant example arises when LDA and other domain-agnostic text processing models are applied to new domains: they incur high computational complexity and require hyperparameter (re)tuning, which makes them difficult to extend to new domains.
Figure: Comparison of identified BPMN modeling tools
This proposal draws on a literature review and on the reported benefits and disadvantages of other approaches. It uses an n-gram technique (as in KP-Miner) and incorporates background information sources (similar to SG-Rank, Sem-Graph, Expand-Rank, and Key-Cluster). Since the knowledge base of SemCluster alone would be too narrow, it is designed to be extensible by connecting to other sources of information. As a clustering-based technique, SemCluster uses latent semantic relations between words as well as overall word frequency to rank terms on a scale of 1 to N, where N is the total number of terms. In addition, SemCluster does not require a topic analysis to find thematically unimportant clusters.
2.4.1 Classifications
The method was first developed by Feldman et al. [26] and covers structured resources such as RDBMS data [28, 33], semi-structured resources such as XML and JSON [11, 12], and unstructured text resources such as word documents, videos, and images. It supports several different communities (such as information retrieval, natural language processing, data mining, machine learning, network science and the biomedical sciences).
2.4.2 Information Retrieval (IR):
Information retrieval is the activity of searching through a variety of unstructured data sets to find information resources (usually documents) that satisfy an information need. Historically it aimed at facilitating access to information rather than at uncovering hidden patterns, which is the focus of text mining. Information retrieval focuses less on the encoding or transformation of text; text mining, in contrast, enables users to make sense of and gain understanding from the information they access.
2.4.3 Natural language processing (NLP):
Natural Language Processing is a subfield at the intersection of computer science, artificial intelligence and linguistics devoted to the computational representation of natural language. NLP techniques such as part-of-speech (POS) tagging, syntactic parsing, and other types of linguistic analysis are widely used in text mining algorithms (see [8, 11] for more information).
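POS tagging can be illustrated with a toy lexicon-and-suffix tagger; the tag set and rules below are purely illustrative assumptions (real systems use trained statistical taggers such as those in NLTK or spaCy):

```python
# Hypothetical mini-lexicon; unknown words fall back to suffix rules.
LEXICON = {"the": "DET", "a": "DET", "is": "VERB"}

def tag(word):
    if word.lower() in LEXICON:
        return LEXICON[word.lower()]
    if word.endswith("ing") or word.endswith("ed"):
        return "VERB"   # crude morphological cue
    if word.endswith("ly"):
        return "ADV"
    return "NOUN"       # default fallback tag

def pos_tag(tokens):
    return [(t, tag(t)) for t in tokens]

print(pos_tag("the parser is parsing the text".split()))
```

The same token ("parsing" vs "parser") receives different tags purely from its suffix, which is why real taggers also condition on context.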
2.4.4 Text Extraction (TE):
Information extraction is the task of obtaining facts and structured information from documents that are unstructured or only partly organized. It serves as a basic building block for other text mining approaches. Useful semantic information can be obtained from entity extraction, Named Entity Recognition (NER) and the relationships between entities in the text.
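A minimal gazetteer-based NER sketch makes the idea concrete; the entity lists here are invented for illustration (production NER uses trained sequence models):

```python
# Hypothetical gazetteer mapping surface forms to entity types.
GAZETTEER = {
    "aspirin": "DRUG",
    "diabetes": "DISEASE",
}

def extract_entities(text):
    """Return (token, type) pairs for tokens found in the gazetteer."""
    entities = []
    for token in text.lower().replace(",", " ").split():
        if token in GAZETTEER:
            entities.append((token, GAZETTEER[token]))
    return entities

print(extract_entities("Aspirin is not a standard treatment for diabetes"))
```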
2.4.5 Text summarization:
Text summarization condenses a large document, or a collection of documents on a subject, in order to help locate relevant details. Extractive summarization uses techniques that select the relevant information directly from the source text, while abstractive summarization is a synthesis method whose output may contain "synthesized" information not present verbatim in the source [6, 38].
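A frequency-based extractive summarizer is one of the simplest instances of the extractive approach; the sentence splitter and scoring below are deliberately crude assumptions for the sketch:

```python
from collections import Counter

def summarize(text, n_sentences=1):
    """Pick the sentences whose words are most frequent corpus-wide."""
    # Crude sentence splitting on periods (real systems use proper splitters).
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    freqs = Counter(w for s in sentences for w in s.lower().split())
    # Score a sentence by the summed frequency of its words.
    scored = sorted(sentences,
                    key=lambda s: sum(freqs[w] for w in s.lower().split()),
                    reverse=True)
    return ". ".join(scored[:n_sentences])

text = "Text mining extracts patterns. Text mining uses text data. Birds fly."
print(summarize(text))
```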
2.4.6 Unsupervised learning methods:
Unsupervised learning approaches seek hidden structure in unlabeled data without supervision. Because they require no labeled training data, they can be applied to any text data without human intervention. Clustering and topic modeling are the unsupervised learning techniques most commonly used with text data. In document clustering, we split a collection of documents into groups of documents that are more similar to each other than to the rest. Topic models, by contrast, model topics probabilistically: each document has a distribution across all topics, rather than being hard-assigned to a single cluster. Topic models associate topics with word probabilities and documents with topic probabilities. To put it another way, a topic is a cluster and the membership of a document in it is probabilistic.
2.4.7 Supervised methods of learning:
Supervised learning methods are machine learning techniques that use labeled training examples to infer a function or learn a classifier able to predict labels for unseen data. A broad spectrum of such supervised machine learning methods exists.
2.4.8 Text Mining:
Unsupervised topic models, such as probabilistic Latent Semantic Analysis (pLSA) [6] and Latent Dirichlet Allocation (LDA) [16], are commonly used for text mining. Sites such as Instagram, LinkedIn and Pinterest create text streams and aggregate social media data; in addition to search engines, the web is home to a variety of applications that generate vast volumes of text data streams. News stream applications and aggregators such as Reuters and Google News produce vast volumes of text content and provide an invaluable source for information extraction. Many people use social networks, particularly Facebook and Twitter, which generate vast amounts of constantly accumulating text data and give users a platform to express themselves on a variety of topics. Mining social network text requires special skill in handling poor and non-standard language, which hinders text mining. Finally, as a consequence of e-commerce and online shopping, a vast number of new texts take the form of customer reviews of different products or consumer opinions. We mine such data to find crucial information and insight, which is essential to marketing and advertising [10].
2.4.9 Representation and encoding of text
One of the most time-consuming aspects of text mining is the analysis of a broad variety of documents. To make this easier, it is important to have a text data representation that aids further document analysis. Latent Semantic Indexing (LSI), Probabilistic Latent Semantic Indexing (PLSA) [6] and topic models [16] are three main dimensionality reduction techniques used in text mining. Information retrieval (IR) requests, particularly over large collections, often require documents to be ranked before the search results can be returned [13]. Texts are represented as vectors of words, with each word associated with a numerical value. The three most widely used models are the Vector Space Model (VSM), the Probabilistic Model [9] and the Inference Network Model [9, 13].
Preprocessing is one of the key components of many text mining algorithms. Take the traditional text categorization pipeline as an example: preprocessing includes tasks such as data cleaning, feature extraction derives features from the cleaned text, and classification then uses these features to build a classifier. Feature extraction, feature selection, and the classification algorithm are all known to have a significant effect on the classification process, but it has been less clear whether preprocessing has a comparable impact. Uysal et al. [14] explored how preprocessing choices affect text classification outcomes. Standard preprocessing involves tasks such as tokenization, filtering, stemming and lemmatization.
2.4.10 Tokenization:
The practice of separating a character sequence into pieces (words or phrases), called tokens, while discarding certain characters (such as punctuation), is known as tokenization. The resulting list of tokens is used for further processing. During document processing, filtering is typically carried out to delete certain words. Removing stop-words is a common filtering technique: these terms appear frequently in the text but carry little information content. Similarly, words that contribute no meaning in a given text may be omitted from the records. Lemmatization is the method of mapping the various inflected forms of a word onto a single entity so that they can be analyzed together; lemmatization methods reduce, for instance, inflected verb forms and plural nouns to a single base form. Because a lemmatizer must first determine the POS of each word, which is error-prone and costly, many applications prefer stemming.
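The tokenization and stop-word filtering steps described above can be sketched as follows; the stop-word list is a tiny illustrative subset, not a real one:

```python
import re

# Illustrative stop-word subset (real lists contain hundreds of entries).
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "and", "in", "to", "for"}

def tokenize(text):
    # Split on runs of non-word characters, discarding punctuation.
    return [t for t in re.split(r"\W+", text.lower()) if t]

def filter_stopwords(tokens):
    return [t for t in tokens if t not in STOP_WORDS]

tokens = tokenize("The tokens, in turn, are used for further processing.")
print(filter_stopwords(tokens))
```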
2.4.11 Stemming
The purpose of stemming is to obtain the stem (root) of derived words. Stemming algorithms, like most natural language processing (NLP) methods, are language-dependent [9]; the Porter stemmer [11] is the most widely used stemming algorithm for English [8]. In order to allow for more formal descriptions of the algorithms, we first define some terms and variables that will be used in what follows, such as the document collection, the individual documents, and the variables defined over them.
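A drastically simplified suffix-stripping stemmer can illustrate the idea; this is not the Porter algorithm (which applies ordered rule phases with measure conditions), just a sketch under that simplifying assumption:

```python
def stem(word):
    """Strip the first matching suffix, keeping a stem of >= 3 letters."""
    for suffix in ("ational", "ization", "ing", "ness", "es", "s", "ed"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print([stem(w) for w in ["connected", "connecting", "connections"]])
```

Note that "connections" reduces only to "connection" here: without ordered rule phases, related forms do not always conflate to one stem, which is precisely what the full Porter algorithm addresses.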
2.5 RESULTS ANALYSIS
These results will be presented and analyzed in this chapter. The systematic review of the literature was carried out in two stages:
· Search by applying a research query to each academic database
· Manual application of the inclusion/exclusion criteria to the preceding step resulting in a set of documents
Out of the papers indexed in Scopus, IEEE Xplore, ACM Digital Library, and Springer Link, the distribution is 54.5% IEEE Xplore, 33.3% Scopus, 18.18% ACM Digital Library, and 18.18% Springer Link (papers may be indexed in more than one database, so the percentages overlap).
This table shows that, of the papers considered, the majority (52 per cent) engage in both identification and discovery activities, while none (0 per cent) work on discovery alone. The second most common approach is to perform analysis first and then implement the strategy, followed by the combination of discovery, identification, and analysis. Leopold (2013) stated that NLP plays a key role in BPM by supporting model generation and analysis, as demonstrated by the results of the literature review. Because the design process is time-consuming and expensive, both applications have a significant impact on industry.
Table: Papers selected in the systematic literature review.
| Paper | Author | Title | Year |
| 1. | Charu C Aggarwal and ChengXiang Zhai | Mining text data | 2012 |
| 2. | Mehdi Allahyari and Krys Kochut | Automatic topic labeling using ontology-based topic models. In Machine Learning and Applications | 2015 |
| 3. | Mehdi Allahyari and Krys Kochut | Discovering Coherent Topics with Entity Topic Models | 2016 |
| 4. | Mehdi Allahyari and Krys Kochut | Semantic Context-Aware Recommendation via Topic Models Leveraging Linked Open Data | 2016 |
| 5. | Mehdi Allahyari and Krys Kochut | Semantic Tagging Using Topic Models Exploiting Wikipedia Category Network | 2016 |
| 6. | M. Allahyari, S. Pouriyeh, M. Assefi, S. Safaei, E. D. Trippe, J. B. Gutierrez, and K. Kochut. | Text Summarization Techniques: A Brief Survey | 2017 |
| 7. | Sophia Ananiadou, Sampo Pyysalo, Jun’ichi Tsujii, and Douglas B Kell | Event extraction for systems biology by text mining the literature | 2010 |
| 8. | Sofia J Athenikos and Hyoil Han | Biomedical question answering: A survey. Computer methods and programs in biomedicine | 2010 |
| 9. | Bob Carpenter | Integrating out multinomial parameters in latent Dirichlet allocation and naive bayes for collapsed Gibbs sampling | 2010 |
| 10. | Yee Seng Chan and Dan Roth | Exploiting background knowledge for relation extraction. In Proceedings of the 23rd International Conference on Computational Linguistics | 2010 |
| 11. | Yee Seng Chan and Dan Roth | Exploiting background knowledge for relation extraction. In Proceedings of the 23rd International Conference on Computational Linguistics | 2010 |
| 12. | Yee Seng Chan and Dan Roth | Exploiting syntactico-semantic structures for relation extraction. | 2011 |
| 13. | Richard O Duda, Peter E Hart, and David G Stork | Pattern classification | 2012 |
| 14. | Guozhong Feng, Jianhua Guo, Bing-Yi Jing, and Lizhu Hao | A Bayesian feature selection paradigm for text classification | 2012 |
| 15. | John Gantz and David Reinsel | THE DIGITAL UNIVERSE IN 2020: Big Data, Bigger Digital Shadows, and Biggest Grow th in the Far East | 2012 |
| 16. | Pritam Gundecha and Huan Liu | Mining social media: a brief introduction. In New Directions in Informatics, Optimization, Logistics, and Production | 2012 |
| 17. | Juan B Gutierrez, Mary R Galinski, Stephen Cantrell, and Eberhard O Voit | From within host dynamics to the epidemiology of infectious disease: scientific overview and challenges | 2015 |
| 18. | Xianpei Han and Le Sun | An entity-topic model for entity linking | 2012 |
| 19. | Payam Porkar Rezaeiye, Mojtaba Sedigh Fazli | Use HMM and KNN for classifying corneal data | 2014 |
| 20. | Hassan Saif, Miriam Fernández, Yulan He, and Harith Alani | On stopwords, filtering and data sparsity for sentiment analysis of twitter | 2014 |
| 21. | Prithviraj Sen | Collective context-aware topic models for entity disambiguation | 2015 |
| 22. | E. D. Trippe, J. B. Aguilar, Y. H. Yan, M. V. Nural, J. A. Brady, M. Assefi, S. Safaei, M. Allahyari, S. Pouriyeh, M. R. Galinski, J. C. Kissinger, and J. B. Gutierrez | A Vision for Health Informatics: Introducing the SKED Framework An Extensible Architecture for Scientific Knowledge Extraction from Data | 2017 |
2.6 CATEGORIZATIONS
Text classification has been thoroughly studied in the data mining, database, machine learning, and information retrieval communities, and is used in a wide range of fields, such as image processing, medical diagnosis, document organization, and so on. The purpose here is the classification of text documents. The classification problem is set up as follows: to measure the efficiency of the classification model, we randomly hold out a certain percentage of the labelled documents as a test set. After training the classifier on the training set, we classify the test set, compare the predicted labels with the actual labels, and thereby check the accuracy of the classifier. Accuracy is the percentage of test documents that have been classified correctly.
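The train-then-evaluate procedure above can be sketched with a minimal multinomial Naive Bayes text classifier; the training and test documents are invented toy examples:

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing (a sketch)."""

    def fit(self, docs, labels):
        self.classes = set(labels)
        self.priors = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(doc.lower().split())
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, doc):
        def log_prob(c):
            total = sum(self.word_counts[c].values()) + len(self.vocab)
            score = math.log(self.priors[c])
            for w in doc.lower().split():
                score += math.log((self.word_counts[c][w] + 1) / total)
            return score
        return max(self.classes, key=log_prob)

def accuracy(clf, docs, labels):
    # Accuracy: fraction of test documents whose predicted label is correct.
    return sum(clf.predict(d) == l for d, l in zip(docs, labels)) / len(docs)

clf = NaiveBayes().fit(
    ["engine failure on approach", "smoke in cabin", "great movie", "fun plot"],
    ["safety", "safety", "review", "review"])
print(accuracy(clf, ["smoke detected", "boring movie"], ["safety", "review"]))
```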
Clustering algorithms can be implemented with software tools such as Lemur [9] and BOW. Text data can be organized using a variety of clustering algorithms. In the simplest representation, a text document is a binary vector recording, for each word in the vocabulary, whether it is present or absent; more refined representations incorporate weighting schemes such as TF-IDF. However, the algorithms required for text clustering can vary greatly depending on the particular characteristics of the text data. Text representations have special properties: although the data is sparse, the representation has a very large dimensionality, so the volume of data is small relative to the size of the feature space. As a result, we need algorithms that factor in the relationships between words when designing the clustering task.
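The TF-IDF weighting mentioned above can be sketched directly from its definition; the example documents are invented, and the formula uses the plain tf × log(N/df) variant:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Map each document to a {term: tf-idf weight} dictionary (a sketch)."""
    tokenized = [d.lower().split() for d in docs]
    n = len(docs)
    df = Counter()                      # document frequency of each term
    for tokens in tokenized:
        df.update(set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)            # term frequency within the document
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

vecs = tfidf_vectors(["the cat sat", "the dog sat", "the cat ran"])
print(vecs[0])
```

A term occurring in every document ("the") gets weight 0, while rarer terms are weighted up, which is exactly the sparsity-aware behaviour the text describes.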
The various clustering algorithms have contrasting efficiency trade-offs and are typically evaluated by experimental comparison [13, 14, 15]. In the following, we explain some of the most common text clustering algorithms. Hierarchical clustering algorithms are so called because they create a hierarchy of clusters that can be visualized as a tree-structured grouping; they use similarity functions to assess the proximity between documents. Clustering algorithms for text data are surveyed in [10, 13, 14]. Distance-based partitioning algorithms such as k-means, by contrast, build clusters around a set of representatives. Finding a globally optimal solution for k-means clustering is computationally hard, so heuristics such as [18] are used to find a local solution quickly. One of the major drawbacks of k-means clustering is that it is highly sensitive to the choice of k and of the initial centroids.
The k-means clustering algorithm begins by selecting k starting centroids at random. Until the algorithm converges, each document is assigned to the centroid it is most similar to, and each centroid is then recomputed as the mean of the documents in its cluster; clustering ends when the assignments no longer change. Turning to topic models, the pLSA model has no generative probabilistic model at the document level, which limits its ability to generalise to new, unseen documents. Blei et al. [16] generalised pLSA by placing a Dirichlet prior on the topic mixture weights, calling the resulting model Latent Dirichlet Allocation (LDA).
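The assign-then-recompute loop of k-means can be sketched on plain numeric vectors; for determinism this sketch seeds with the first k points rather than random ones, and uses squared Euclidean distance in place of a text similarity:

```python
def dist2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(xs) / len(points) for xs in zip(*points)]

def kmeans(points, k, iterations=10):
    """Plain k-means; deterministic seeding with the first k points."""
    centroids = points[:k]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assignment step: attach p to its nearest centroid.
            i = min(range(k), key=lambda c: dist2(p, centroids[c]))
            clusters[i].append(p)
        # Update step: recompute each centroid as its cluster mean.
        centroids = [mean(cl) if cl else centroids[i]
                     for i, cl in enumerate(clusters)]
    return clusters

print(kmeans([[0, 0], [0, 1], [10, 10], [10, 11]], 2))
```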
BASE 11 contains details such as dosing guidance and labelling of medications that help health care providers ensure proper treatment; these data can also include clinically actionable gene-drug interactions and genotype-phenotype relationships. Many of the techniques discussed earlier in this chapter, such as information extraction and clustering, rely on such ontologies and knowledge bases. Information extraction can also be described as the automated extraction of structured information from unstructured text. The unstructured text in the biomedical domain consists primarily of scientific articles and clinical records from clinical information systems. Information extraction is typically performed prior to analysis in biomedical text mining applications such as question answering, knowledge extraction, hypothesis generation, and summarization.
There are several different approaches to biomedical relationship extraction. The simplest technique is based on entity co-occurrence: if two entities are frequently mentioned together, there is a high probability that they are connected in some way, although statistics alone cannot reveal the form and direction of the relationship. Co-occurrence approaches typically provide high recall and low precision. Rule-based techniques are another group of methods used for biomedical relationship extraction; rules can be specified either manually by domain experts or learned automatically from an annotated corpus by machine learning techniques. Classification-based approaches are also very common for relation extraction in the biomedical field. Notable work uses supervised machine learning algorithms to detect and classify different types of relationships, such as [20], which defines and classifies the relationships between diseases and treatments extracted from PubMed abstracts and between genes and diseases in the human GeneRIF database.
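The co-occurrence approach can be sketched as simple pair counting over sentences; the entity gazetteer and the example sentences are invented for illustration:

```python
from collections import Counter
from itertools import combinations

# Hypothetical entity gazetteer (real systems use NER to find entities).
ENTITIES = {"aspirin", "ibuprofen", "headache", "fever"}

def cooccurrence_counts(sentences):
    """Count entity pairs appearing together in the same sentence."""
    pairs = Counter()
    for sent in sentences:
        found = sorted({w for w in sent.lower().split() if w in ENTITIES})
        for a, b in combinations(found, 2):
            pairs[(a, b)] += 1
    return pairs

counts = cooccurrence_counts([
    "Aspirin relieves headache",
    "Aspirin reduces fever",
    "Ibuprofen relieves headache",
    "Aspirin relieves headache quickly",
])
print(counts.most_common(1))
```

High pair counts suggest a relationship, but, as the text notes, they say nothing about its type or direction.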
The primary steps of the question answering system are as follows:
· The system receives the text of the natural language as input.
· Using linguistic analysis and question categorization algorithms, the system determines the type of the question posed and its expected answer type.
· It then produces a query and passes it to the document processing phase.
· In the document processing phase, the system sends the query to the search engine, retrieves the returned documents, extracts the appropriate text snippets as candidate answers, and passes them to the answer processing stage.
· The answer processing stage analyses the candidate answers and rates them according to how well they conform to the expected answer type determined in the question processing stage.
· The top-ranked answer is returned as the output of the question answering system.
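The stages above can be sketched as a toy pipeline; the question-word list, the term-overlap retrieval and the ranking are all simplifying assumptions standing in for real question analysis, search and answer scoring:

```python
def answer(question, documents):
    """Toy QA pipeline mirroring the stages above (a sketch)."""
    # Question processing: strip question words to form a crude query.
    q_tokens = [w for w in question.lower().strip("?").split()
                if w not in {"what", "is", "the", "a", "of"}]

    def overlap(text):
        # Retrieval/ranking score: query-term overlap with a snippet.
        return len(set(text.lower().split()) & set(q_tokens))

    # Document processing: split documents into candidate snippets.
    candidates = [s.strip() for d in documents
                  for s in d.split(".") if s.strip()]
    # Answer processing: return the best-scoring candidate.
    return max(candidates, key=overlap)

docs = ["Tokenization splits text into tokens. Stemming strips suffixes.",
        "LDA is a topic model."]
print(answer("What is tokenization?", docs))
```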
2.7 DISCUSSION
In this chapter, we gave a brief introduction to the field of text mining and provided an overview of some of the most fundamental algorithms and techniques widely used in the text domain, as well as some of the important text mining approaches in the biomedical field. Although it is impossible to describe all the different methods and algorithms within these limits, this should provide a rough overview of current developments in the field of text mining. Text mining is essential for scientific research, given the very high volume of scientific literature produced every year [16]. These large archives of online scientific articles grow considerably as many new articles are added daily. While this growth has given researchers access to more scientific information, it has also made it difficult for them to identify the articles most relevant to their interests. The processing and mining of this massive amount of text is therefore of great interest to researchers.
2.8 SUMMARY
A great deal of attention has recently been paid to the huge rise in unstructured information in the biomedical field, such as scientific papers and clinical records. Depending on their purpose, different kinds of summaries can be produced, such as single-document summaries that target the content of an individual document and multi-document summaries that consider the information content of multiple documents. Evaluating summarization methods is a real challenge in the biomedical field: judging whether a summary is "good" is subjective, and manual assessment of summaries is costly. A common automated evaluation technique for summaries is ROUGE (Recall-Oriented Understudy for Gisting Evaluation). ROUGE measures the quality of an automatically generated summary by comparing it with an ideal summary produced by humans; the score is determined by counting the overlapping units, such as words, between the computer-generated summary and the ideal human-produced summary.
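The word-overlap counting behind ROUGE can be illustrated with ROUGE-1 recall, one member of the ROUGE family: the fraction of the reference summary's unigrams that also appear in the candidate summary (with counts clipped so a repeated word cannot be credited more times than it occurs in the reference). The two example summaries are invented for illustration.

```python
from collections import Counter

def rouge_1_recall(candidate, reference):
    """ROUGE-1 recall: clipped unigram overlap between the candidate
    and reference summaries, divided by the reference length."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum(min(cand[word], n) for word, n in ref.items())
    return overlap / sum(ref.values())

reference = "the patient was treated with antibiotics"
candidate = "the patient received antibiotics"
score = rouge_1_recall(candidate, reference)
# 3 of the 6 reference words appear in the candidate -> 0.5
```

A real evaluation would use a standard ROUGE implementation with tokenization and optional stemming, and would typically also report ROUGE-2 (bigrams) and ROUGE-L (longest common subsequence).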
2.9 CONCLUSION
Overall, we conclude that NLP is still at an early stage of development: thousands of small details and complexities that are vital to language remain to be addressed. Experts argue that many of the remaining machine learning problems can be overcome thanks to increased investment in related areas, such as feature engineering. We have done our best to make this complex field a little less complicated.
REFERENCES
[1] Charu C Aggarwal and ChengXiang Zhai. 2012. Mining text data. Springer.
[2] Mehdi Allahyari and Krys Kochut. 2015. Automatic topic labeling using ontology-based topic models. In Machine Learning and Applications (ICMLA), 2015 IEEE 14th International Conference on. IEEE, 259–264.
[3] Mehdi Allahyari and Krys Kochut. 2016. Discovering Coherent Topics with Entity Topic Models. In Web Intelligence (WI), 2016 IEEE/WIC/ACM International Conference on. IEEE, 26–33.
[4] Mehdi Allahyari and Krys Kochut. 2016. Semantic Context-Aware Recommendation via Topic Models Leveraging Linked Open Data. In International Conference on Web Information Systems Engineering. Springer, 263–277.
[5] Mehdi Allahyari and Krys Kochut. 2016. Semantic Tagging Using Topic Models Exploiting Wikipedia Category Network. In Semantic Computing (ICSC), 2016 IEEE Tenth International Conference on. IEEE, 63–70.
[6] M. Allahyari, S. Pouriyeh, M. Assefi, S. Safaei, E. D. Trippe, J. B. Gutierrez, and K. Kochut. 2017. Text Summarization Techniques: A Brief Survey. ArXiv e-prints (2017). arXiv:1707.02268
[7] Sophia Ananiadou, Sampo Pyysalo, Jun’ichi Tsujii, and Douglas B Kell. 2010. Event extraction for systems biology by text mining the literature. Trends in biotechnology 28, 7 (2010), 381–390.
[8] Sofia J Athenikos and Hyoil Han. 2010. Biomedical question answering: A survey. Computer methods and programs in biomedicine 99, 1 (2010), 1–24.
[9] Bob Carpenter. 2010. Integrating out multinomial parameters in latent Dirichlet allocation and naive bayes for collapsed Gibbs sampling. Technical Report. Technical report, LingPipe.
[10] Yee Seng Chan and Dan Roth. 2010. Exploiting background knowledge for relation extraction. In Proceedings of the 23rd International Conference on Computational Linguistics. Association for Computational Linguistics, 152–160.
[11] Yee Seng Chan and Dan Roth. 2011. Exploiting syntactico-semantic structures for relation extraction. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1. Association for Computational Linguistics, 551–560.
[12] Richard O Duda, Peter E Hart, and David G Stork. 2012. Pattern classification. John Wiley & Sons.
[13] Guozhong Feng, Jianhua Guo, Bing-Yi Jing, and Lizhu Hao. 2012. A Bayesian feature selection paradigm for text classification. Information Processing & Management 48, 2 (2012), 283–302.
[14] John Gantz and David Reinsel. 2012. THE DIGITAL UNIVERSE IN 2020: Big Data, Bigger Digital Shadows, and Biggest Growth in the Far East. Technical Report 1. IDC, 5 Speen Street, Framingham, MA 01701 USA. Accessed online on May, 2017. https://www.emc.com/collateral/analyst-reports/idc-the-digital-universe-in-2020.pdf.
[15] Pritam Gundecha and Huan Liu. 2012. Mining social media: a brief introduction. In New Directions in Informatics, Optimization, Logistics, and Production. Informs, 1–17.
[16] Juan B Gutierrez, Mary R Galinski, Stephen Cantrell, and Eberhard O Voit. 2015. From within host dynamics to the epidemiology of infectious disease: scientific overview and challenges. (2015).
[17] Xianpei Han and Le Sun. 2012. An entity-topic model for entity linking. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning. Association for Computational Linguistics, 105–115.
[18] Mehmed Kantardzic. 2011. Data mining: concepts, models, methods, and algorithms. John Wiley & Sons.
[19] Anthony ML Liekens, Jeroen De Knijf, Walter Daelemans, Bart Goethals, Peter De Rijk, and Jurgen Del-Favero. 2011. BioGraph: unsupervised biomedical knowledge discovery via automated hypothesis generation. Genome biology 12, 6 (2011), R57.
[20] Payam Porkar Rezaeiye, Mojtaba Sedigh Fazli, et al. 2014. Use HMM and KNN for classifying corneal data. arXiv preprint arXiv:1401.7486 (2014).
[21] Hassan Saif, Miriam Fernández, Yulan He, and Harith Alani. 2014. On stopwords, filtering and data sparsity for sentiment analysis of twitter. (2014).
[22] Prithviraj Sen. 2012. Collective context-aware topic models for entity disambiguation. In Proceedings of the 21st international conference on World Wide Web. ACM, 729–738.
[23] Songbo Tan, Yuefen Wang, and Gaowei Wu. 2011. Adapting centroid classifier for document categorization. Expert Systems with Applications 38, 8 (2011), 10264– 10273.
[24] E. D. Trippe, J. B. Aguilar, Y. H. Yan, M. V. Nural, J. A. Brady, M. Assefi, S. Safaei, M. Allahyari, S. Pouriyeh, M. R. Galinski, J. C. Kissinger, and J. B. Gutierrez. 2017. A Vision for Health Informatics: Introducing the SKED Framework An Extensible Architecture for Scientific Knowledge Extraction from Data. ArXiv e-prints (2017). arXiv:1706.07992.
[25] Stéphane Tufféry. 2011. Data mining and statistics for decision making. John Wiley & Sons.
[26] Yu Usami, Han-Cheol Cho, Naoaki Okazaki, and Jun’ichi Tsujii. 2011. Automatic acquisition of huge training data for bio-medical named entity recognition. In Proceedings of BioNLP 2011 Workshop. Association for Computational Linguistics, 65–73.
[27] Alper Kursat Uysal and Serkan Gunal. 2014. The impact of preprocessing on text classification. Information Processing & Management 50, 1 (2014), 104–112.
[28] Christopher C Yang, Haodong Yang, Ling Jiang, and Mi Zhang. 2012. Social media mining for drug safety signal detection. In Proceedings of the 2012 international workshop on Smart health and wellbeing. ACM, 33–40.
[29] Liyang Yu. 2011. A developer’s guide to the semantic Web. Springer.
