University of Houston University of Houston-Clear Lake ISSO Annual Report Y2005 45-49
A Text-Mining Technique for Literature Profiling and Information Extraction from Biomedical Literature
Abstract--Massive amounts of biomedical literature are readily available online in many forms. Huge amounts of valuable knowledge and relationships are embedded in these resources and need to be properly extracted, discovered, and utilized. Recognizing and classifying biomedical entity names and terms are important steps for developing efficient knowledge/information extraction techniques from these repositories. This research investigates and develops effective computational methods for literature profiling for the biomedical field. Specifically, this paper presents new techniques for biomedical term identification and classification. We utilize the advances in feature selection techniques (e.g., MI, X2) in IR in this task to select the key features for term identification and classification. We evaluated the method using Genia 3.0 corpus with about 3,000 to more than 34,000 biomedical terms and entity names. The outcome of this project can be applied in various fields including the aerospace domain. In the aerospace field, there is a great interest in discovering the relations between certain changes in the body of astronauts and changes in structure at the levels of genes, proteins, and bindings.
Massive amounts of biomedical literature are readily available online to researchers in many forms: text abstracts (PubMed contains over 14 million biomedical abstracts1), full text research articles, databases of protein interactions, dictionaries of gene and protein names, and much more. Huge amounts of valuable knowledge and useful information are embedded in these resources, waiting to be properly extracted, discovered, and utilized. There is a great need for computational techniques that extract and utilize the useful knowledge in these resources. A number of systems and software tools have been developed to exploit these overwhelming resources.2-6 Biomedical research has shown that text mining can be effective in this field, making text mining increasingly important and necessary for biology and medicine.
The purpose of this research is to investigate and design an effective computational method for literature profiling to extract and organize important information and relationships from biomedical literature. To that end, we implemented new methods to identify and classify technical terms and entity names in biomedical texts. The methods are based on machine learning and can be viewed as a word classification task. We utilized feature selection techniques such as MI (mutual information) and X2 (chi-square) to select the key features in the contexts of the terms of interest. The methods were evaluated extensively with a large number of experiments. The outcome of this project can be applied in various fields including the aerospace domain. In the field of aerospace, there is a great interest in discovering the relations between certain changes in the body of the astronauts (due to radiation, reduced gravity, and isolation) and the structural changes at the levels of genes, proteins, and bindings. Moreover, in aerospace, certain symptoms need to be explained at the level of genes or proteins, so that the consequences and future complications can be known and treated in a timely manner.
Related Work
In the biomedical domain, the majority of term identification and recognition techniques
target certain specific entities and terms (mostly gene and protein names); this way term
identification and term classification are integrated as one task.7 A number of
machine learning and statistically-based approaches have been proposed for term
identification and classification in the past.7-9 For example, Morgan et al.8
used HMMs based on local context and simple orthographic and case variations and reported
F-measure of 75% for the recognition of Drosophila gene names. Moreover, Shen et
al.10 used POS tags and noun heads as features and achieved F-scores of 16.7%
to 80%, depending on the class, and reported that POS tags proved to be among the most
useful features. A number of approaches employed SVM for term identification and
recognition. For example, Kazama et al.11 used SVMs for multi-class
classification. They annotated the training data with B, I, and O class labels to
indicate whether a word is at the beginning of, inside, or outside a term.11
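As a small illustration of the B/I/O labeling idea, the following sketch tags the tokens of a sentence given the positions of its terms. The function name `bio_labels`, the span representation, and the example sentence are assumptions made for illustration; they are not taken from Kazama et al.

```python
def bio_labels(tokens, term_spans):
    """Label each token B, I, or O.

    term_spans is a list of (start, end) token indices, end exclusive
    (a hypothetical representation chosen for this sketch).
    """
    labels = ["O"] * len(tokens)          # default: outside any term
    for start, end in term_spans:
        labels[start] = "B"               # first token of the term
        for i in range(start + 1, end):
            labels[i] = "I"               # remaining tokens of the term
    return labels


# Illustrative sentence; two terms at token positions 1-3 and 6.
tokens = ["the", "nf", "kappa", "b", "protein", "binds", "dna"]
labels = bio_labels(tokens, [(1, 4), (6, 7)])
```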
The JNLPBA-2004 competition12 included eight systems for the Bio-Entity
recognition task.9 The competition was an open challenge, and the participants
were allowed to use whatever techniques and data resources they preferred. However, the
systems were evaluated using a common evaluation methodology and a common dataset. Four
types of classification models were used: SVM, HMM, MEMM, and CRFs.
The overall results (Table 1) showed the recall ranges from 50.8% to 76.0%, precision
from 43.6% to 69.4%, and F-score from 47.7% to 72.6%.9,12
Table 1. Results of the JNLPBA-2004 competition of Bio-Entity recognition: (recall/precision/F-score) results of each one of the participating systems and the baseline (BL), taken from Kim et al. (2004)9
The Techniques
A number of previous related methods utilized the words in terms of interest as features
for term identification or classification.13,14 We also use word features to
represent the biomedical terms, but the words in the context of the term are not used
directly as features. Instead we select, as features, only those words having high 'discriminating'
capabilities between the various classes of terms. These word features are used
to represent each instance (example) of the terms in the training and testing. The method
then employs machine learning (SVMs) to train classifiers with labeled (training)
examples. So, some already labeled terms (annotated with class labels) are used
as training examples. The classifiers will then be used to classify unseen and unlabeled
examples (term instances) in the testing (classification) phase. One of the main
contributions of this work is the way we select features for learning and classification.
Feature Selection
Assume that we have two classes, C1 and C2, of
labeled examples extracted from biomedical texts. Let C1 be examples
of biomedical term instances and their contexts from one category (C1),
whereas C2 includes examples with their contexts from another category
(C2). We want to classify terms from C1 and C2
into their correct classes. The term, which belongs to either C1
or C2, is what is to be classified in this case, and the words
preceding and following the term are its context words. Consequently each example
in the set C1 or C2 can be represented as:
pn...p3 p2 p1 <term> f1 f2 f3...fn,
where the words p1, p2, p3,...,
pn and f1, f2, f3,...,
fn are the preceding and following words (context words) surrounding
the term, and n is called the window size (w). We
extract all the context words W = {w1, w2,...,wm}
from the examples in the sets C1 and C2. Now, each
such context word wi ∈ W
may occur in contexts from either C1 or C2 or both
with different frequency distributions. When a context word
wi appears in an ambiguous example, we want to determine the extent to which this occurrence
suggests that the example belongs to C1 or C2.
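The context representation can be sketched as follows: a minimal example that, for a term occupying a known position in a tokenized sentence, returns up to n preceding and n following words. The function name `context_window`, the whitespace tokenization, and the sample sentence are assumptions for illustration only.

```python
def context_window(tokens, term_start, term_end, n):
    """Return (preceding, following) context words around a term.

    The term occupies tokens[term_start:term_end]; n is the window size.
    Fewer than n words are returned near the sentence boundaries.
    """
    preceding = tokens[max(0, term_start - n):term_start]
    following = tokens[term_end:term_end + n]
    return preceding, following


# Illustrative sentence; the term of interest is "il-2 gene" (tokens 3-4).
tokens = "expression of the il-2 gene in activated t cells".split()
preceding, following = context_window(tokens, 3, 5, n=3)
```

The union of all such preceding and following words over the training examples would give the candidate set W.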
Thus, we select those words wi from W which are highly
associated with either C1 or C2 (the highly
discriminating words) as features. We utilize feature selection techniques like mutual
information (MI) and chi-square (X2)14,15
to select the highly discriminating words from W. We now explain how we implement
and use MI and X2. Let us first define the notions of a,
b, c, and d: From the training examples, we calculate a,
b, c, and d for each context word wi ∈ W as follows:
a = number of occurrences of wi in C1
b = number of occurrences of wi in C2
c = number of examples of C1 that do not contain wi
d = number of examples of C2 that do not contain wi.
Then, the mutual information (MI) is defined as:
$$\mathrm{MI} = \frac{N\,a}{(a + b)(a + c)}, \tag{1}$$
where N is the total number of examples in C1 and C2. Chi-square (X2) is computed as:
$$X^2 = \frac{N\,(ad - cb)^2}{(a + c)(b + d)(a + b)(c + d)}. \tag{2}$$
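The two scores can be computed directly from the four counts. The sketch below is one possible reading, under a stated assumption: each example is treated as a set of context words, so a and b count the examples that contain the word (at most one occurrence per example), which makes N = a + b + c + d. The function names and the toy data are hypothetical.

```python
def contingency(word, c1_examples, c2_examples):
    """Return (a, b, c, d) for a word; each example is a set of context words.

    Assumption of this sketch: a and b count examples containing the word.
    """
    a = sum(word in ex for ex in c1_examples)
    b = sum(word in ex for ex in c2_examples)
    c = len(c1_examples) - a   # C1 examples without the word
    d = len(c2_examples) - b   # C2 examples without the word
    return a, b, c, d


def mi_score(a, b, c, d):
    """MI as in Equation (1): N*a / ((a + b) * (a + c))."""
    n = a + b + c + d
    denom = (a + b) * (a + c)
    return n * a / denom if denom else 0.0


def chi_square(a, b, c, d):
    """X2 as in Equation (2)."""
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - c * b) ** 2 / denom if denom else 0.0


# Toy data (invented): three examples per class, already stemmed.
c1 = [{"bind", "site"}, {"bind", "residu"}, {"residu"}]
c2 = [{"domain"}, {"bind", "domain"}, {"region"}]
counts = contingency("bind", c1, c2)   # a=2, b=1, c=1, d=2
mi = mi_score(*counts)
x2 = chi_square(*counts)
```

Returning 0.0 when a denominator is zero is a design choice of this sketch, so that words seen in only one degenerate configuration do not raise an error.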
When using the MI technique for feature selection, we calculate MI
values for each wi ∈ W;
then we choose the top v words wi ∈ W with the highest MI values
as features in the feature vectors. In our experiments, we tested v
values of 10, 20, 30, 50, and 100. For example, if v = 10, then each training
example is represented by a vector size of 10 entries (thus, v: vector size)
such that the first entry represents the word with the highest MI value, the
second entry represents the word with the second highest MI value, and so on.
Then for a given training example, the feature vector entry is set to "1" if the
corresponding feature word occurs in that training example and set to "0"
otherwise.
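The construction of the binary feature vectors can be sketched as follows. The scores, the words, and the alphabetical tie-breaking rule are assumptions for illustration; the text does not say how ties between equal scores are resolved.

```python
def top_v_words(scores, v):
    """Return the v highest-scoring words, best first.

    Ties are broken alphabetically (an assumption of this sketch).
    """
    return sorted(scores, key=lambda w: (-scores[w], w))[:v]


def feature_vector(context_words, feature_words):
    """Entry i is 1 if feature word i occurs in the example's context, else 0."""
    present = set(context_words)
    return [1 if w in present else 0 for w in feature_words]


# Invented scores for four context words.
scores = {"bind": 1.33, "residu": 2.0, "domain": 0.0, "site": 2.0}
features = top_v_words(scores, v=3)
vector = feature_vector(["the", "bind", "site"], features)
```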
Learning and Classification
We generate feature vectors from the training examples using the top words selected using MI
or X2. Then, we use a well-established learning technique, Support
Vector Machines (SVM),16 to train classifiers with the training
vectors. SVM is an inductive learning technique for two-class classification.
Significant theoretical and empirical justifications exist in the literature to support SVM.16
We construct, for each class, one feature vector for each training example. Then we take
two classes at a time and train an SVM on them, which produces a classifier (model).
The classifier is then used in the testing/classification phase to
classify testing instances. We use SVMlight (<http://svmlight.joachims.org>)
with the default parameters except that we adjust the cost factor (j parameter)
by which training errors on positive examples outweigh errors on negative examples (default
j = 1).
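To hand such vectors to SVMlight, each example has to be written as one line of its sparse input format: a target value followed by `index:value` pairs with indices starting at 1 and in increasing order. The helper below is a hypothetical sketch of that conversion, not the code used in this work; zero-valued entries are simply omitted.

```python
def to_svmlight_line(label, vector):
    """Format one example as an SVMlight input line.

    label > 0 is written as +1 (first class of the pair), otherwise -1.
    Only non-zero features are listed, with 1-based indices.
    """
    target = "+1" if label > 0 else "-1"
    pairs = [f"{i}:{x}" for i, x in enumerate(vector, start=1) if x != 0]
    return " ".join([target] + pairs)


line_pos = to_svmlight_line(+1, [0, 1, 1])
line_neg = to_svmlight_line(-1, [1, 0, 0])
```

A file of such lines could then be passed to SVMlight's `svm_learn` program, whose `-j` option sets the cost factor described above.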
Evaluation and Discussion
The proposed techniques have been evaluated with a large variety of experiments using data
from the Genia 3.0 corpus. In this section, we describe the datasets, the
experimental design, and then we discuss the results.
Dataset
Data for training and testing are taken from the Genia corpus version 3.0.17
This corpus is used as a benchmark for most biomedical term/entity name related
problems.9,12 The Genia corpus was developed at the University of
Tokyo and constructed from Medline1 by querying the terms 'human,'
'blood cells,' and 'transcription factors.' From this search process,
2,000 abstracts were selected for the corpus. The identified terms in these selected
documents were hand-annotated with 36 classes/types, which are shown in Table 2.
The corpus contains a total of 75,108 term occurrences.
Table 2. 36 Terminal Classes of Genia 3.0
Experimental design
We selected 30 pairs of classes for testing from the classes in Table 2. Because of space
constraints, Table 3 contains only some of the selected class pairs. We used five-fold cross
validation: we divided the data into five equal folds and repeated each
experiment five times, each time leaving one fold (20%) out for testing and using the
remaining four folds (80%) for training. The training and
testing texts were preprocessed as follows: (1) case folding: we changed all letters to lower
case; (2) word stemming: we converted all words to their stems using Porter's
stemming algorithm;18 (3) stopword removal: we removed all
function words (stopwords) such as 'the', 'of', 'in', 'for', and
'on'. For performance metrics, we use accuracy, precision,
recall, and F1-score.
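The evaluation protocol can be sketched as below: a simple five-fold split and the four metrics computed from gold and predicted labels for one class pair. How the folds were actually formed (for example, whether the data were shuffled or stratified) is not stated, so the interleaved split here is an assumption, as are the function names and toy labels.

```python
def five_fold_splits(examples, k=5):
    """Yield (train, test) pairs; fold i holds every k-th example from index i."""
    for i in range(k):
        test = examples[i::k]
        train = [ex for j, ex in enumerate(examples) if j % k != i]
        yield train, test


def binary_metrics(gold, predicted):
    """Return (accuracy, precision, recall, F1) for labels in {1, 0}."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == 1 and p == 1 for g, p in pairs)
    tn = sum(g == 0 and p == 0 for g, p in pairs)
    fp = sum(g == 0 and p == 1 for g, p in pairs)
    fn = sum(g == 1 and p == 0 for g, p in pairs)
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1


# Ten dummy examples: each fold tests on 2 and trains on 8.
folds = list(five_fold_splits(list(range(10))))

# Invented labels: 3 true positives, 3 true negatives, 1 FP, 1 FN.
gold = [1, 1, 1, 0, 0, 0, 1, 0]
pred = [1, 1, 0, 0, 0, 1, 1, 0]
accuracy, precision, recall, f1 = binary_metrics(gold, pred)
```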
Table 3. Main Selected Class Pairs for Our Evaluations
Results
First, we conducted a variety of experiments using feature selection techniques MI
and X2 to compare their performance. In these experiments, we changed
window size w (number of neighboring context words) with varying vector size v
as well. Consider the first class pair in Table 3 [amino_acid_monomer, protein_domain_or_region].
The first class (amino_acid_monomer) includes 780 annotated terms,
whereas the second class contains 990 annotated terms, for a total of 1,770
terms. Of these 1,770 instances, 80% (1,416 instances) were used for training, and the
remaining 20% (354 instances) were used for testing. This step is repeated five times by
changing the training/testing folds. We record accuracy, precision, and recall for each
round, and then we take the average accuracy, precision, and recall of the five rounds.
Finally, we take the microaverage of accuracy, precision, and recall over all 30 testing pairs. Table 4 shows the results of the first set of experiments, in which we varied the window size w and vector size v with the two feature selection techniques, MI and X2. In this table, window size w = 3 and vector size v = 30 with X2 for feature selection produce the highest accuracy (75.17%) and F1 (75.55%), while the best precision (81.48%) was produced with MI when w = 3 and v = 30. In the second set of experiments, we examined the performance after applying the preprocessing steps. Table 6 contains the results when the preprocessing steps were applied one at a time, and Table 7 contains the results when combinations of preprocessing steps were applied.
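Microaveraging is usually understood as pooling the per-pair decision counts before computing the metrics, so that larger class pairs weigh more than smaller ones. The sketch below follows that common definition; whether the reported figures were pooled in exactly this way is an assumption, and the counts are invented.

```python
def micro_average(counts):
    """Pool (tp, fp, fn, tn) tuples, one per class pair, then compute metrics.

    Returns (accuracy, precision, recall) over the pooled counts.
    """
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    tn = sum(c[3] for c in counts)
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall


# Two invented class pairs: a small one (20 tests) and a larger one (100 tests).
pair_counts = [(8, 2, 2, 8), (45, 5, 15, 35)]
micro_acc, micro_prec, micro_rec = micro_average(pair_counts)
```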
Table 4. Results of the first set of experiments using different feature selection (f.s.) techniques, window size (w), and vector sizes (v)
Table 5. Results of the main, larger datasets, using w = 5,10, v = 20, 30, and feature selection (f.s) is X2
Table 6. Results of the main larger datasets, w = 5, v = 20, 30, and feature selection (f.s) is X2 with preprocessing steps
Table 7. Results of the third set of experiments using combinations of preprocessing steps. In these experiments, window size w = 5, vector size v = 20, 30, and f.s. is X2
These results, obtained over a large number and variety of experiments, show that the technique performs well. Moreover, the strength of the proposed method lies mostly in the feature selection techniques and the learning/classification process: the preprocessing steps (Table 6 and Table 7) did not improve on the results of Table 5. Furthermore, we conclude that the best overall performance is achieved when X2 feature selection is used.
Conclusion
Interest in bioinformatics and biotechnology is rising for many reasons, among which are
the massive amounts of biomedical information and data and the significant knowledge
embedded in them. The objective of this research is to devise effective computational
techniques to extract and discover useful and significant knowledge from the existing
repositories of biomedical literature. This paper presents new techniques for identifying
and classifying biomedical terms and entity names, which constitute an important
component of an effective computational system. Experimental results showed that the
method is effective in dealing with ambiguous biomedical terms using few surrounding
context words as features. The strength of the method lies in the way we select these
context features. We borrowed from the IR and TC domains two successful feature selection
techniques (viz. mutual information and chi-square) and demonstrated with a
variety of experiments the effectiveness of the approach. The outcome of this research can
be applied in various fields. For example, in the aerospace domain, certain
symptoms in the body of the astronauts need explanations at the level of genes or proteins,
so that the consequences and future complications can be known and treated in
a timely manner.
References
1Medline, accessed using Entrez PubMed Interface <http://www.ncbi.nlm.nih.gov/entrez/query.fcgi>
2D. Chaussabel and A. Sher, "Mining Microarray Expression Data by
Literature Profiling," Genome Biology 3.10 (2002): research0055.1-0055.16.
3C. Creighton and S. Hanash, "Mining Gene Expression Databases for
Association Rules," Bioinformatics 19.1 (2003): 79-86.
4E. M. Marcotte, I. Xenarios, and D. Eisenberg, "Mining Literature for
Protein-Protein Interactions," Bioinformatics 17.4 (2001): 359-63.
5Medminer, Genomics and Bioinformatics Group and SRA International Inc. <http://discover.nci.nih.gov/textmining/filters.html>
6T. Ono, H. Hishigaki, A. Tanigami, and T. Takagi, "Automated Extraction of
Information on Protein-Protein Interactions from the Biological Literature," Bioinformatics
17.2 (2001): 155-61.
7M. Krauthammer and G. Nenadic, "Term Identification in the Biomedical
Literature," J. Biomed. Info. 37.6 (2004): 512-26.
8A. Morgan, A. Yeh, L. Hirschman, and M. Colosimo, "Gene Name Extraction
Using FlyBase Resources," Proc., of NLP in Biomedicine, ACL 2003, Sapporo,
Japan, 2003. 1-8.
9J.-D. Kim, T. Ohta, Y. Tsuruoka, Y. Tateisi, and N. Collier,
"Introduction to the Bio-Entity Recognition Task at JNLPBA," Proc.,
Intl. Workshop on Natural Language Processing in Biomedicine and its Applications, 2004.
10D. Shen, J. Zhang, G. Zhou, J. Su, and C. Tan, "Effective Adaptation of
Hidden Markov Model-based Named Entity Recognizer for Biomedical Domain," Proc.,
NLP in Biomedicine, ACL 2003, Sapporo, Japan, (2003): 49-56.
11J. Kazama, T. Makino, Y. Ohta, and J. Tsujii, "Tuning Support Vector
Machines for Biomedical Named Entity Recognition," Proc., Workshop on NLP in
the Biomedical Domain, ACL 2002.
12JNLPBA-04 Workshop: <http://www.genisis.ch/~natlang/JNLPBA04/>. Shared task homepage:
<http://research.nii.ac.jp/~collier/workshops/JNLPBA04st.htm>
13F. Ginter, J. Boberg, J. Jarvinen, and T. Salakoski, "New Techniques for
Disambiguation in Natural Language and Their Application to Biological Text," JMLR
5 (2004): 605-21.
14L. Galavotti, F. Sebastiani, and M. Simi, "Experiments on the Use of
Feature Selection and Negative Evidence in Automated Text Categorization," Proc.,
4th European Conf. on Research and Advanced Technology for Digital Libraries, 2000.
15Y. Yang and J. O. Pedersen, "A Comparative Study on Feature Selection in
Text Categorization," in The 14th Intl. Conf. on Machine Learning, Ed. D. H.
Fisher, Jr. San Francisco: Morgan Kaufmann Pub., 1997. 412-20.
16V. Vapnik, The Nature of Statistical Learning Theory, N.Y.:
Springer, 1995.
17J.-D. Kim, T. Ohta, Y. Tateisi, and J. Tsujii, "GENIA Corpus--A
Semantically Annotated Corpus for Bio-Textmining," Bioinformatics 19 Suppl.
1 (2003): i180-82.
18M. F. Porter, "An Algorithm for Suffix Stripping," Program
14 (1980): 130-7.
19H. Al-Mubaid and M. Siddiqui, "Automatic Text Categorization with
Learning Logic," Proc., 16th Intl. Conf. for Computer Applications in
Industry and Engineering, Las Vegas, NV, Nov. 11-13, 2003.
20G. Boetticher, H. Al-Mubaid, and K. Frasier-Scott, "Automated
Hybridization of Machine Learners for Recursive Spot Identification, Optimization, and Gel
Matching of 2-Dimensional Gel Electrophoresis," International Space Systems Annual
Report, 2003.
Publications
Al-Mubaid, H. "Context-Based Technique for Biomedical Term Classification," IEEE
GrC-06. (Submitted paper, 2006.)
Al-Mubaid, H. and N. Ghaffari. "A New Gene Selection Technique Using Feature
Selection Methodology Gene Selection," CATA-2006. (Accepted paper.)
Presentations
See "Natural Language Interface Models for Fast
Responsiveness Applications"
Institute for Space Systems Operations - Y2005 Annual Report
Copyright © 2006