Protein Name Tagging Guidelines: Lessons Learned

Mani, Inderjeet; Hu, Zhangzhi; Jang, Seok Bae; Samuel, Ken; Krause, Matthew; Phillips, Jon; Wu, Cathy H.

doi:https://doi.org/10.1002/cfg.452

International Journal of Genomics

Conference paper | Open Access

Volume 6 | Article ID 930160 | https://doi.org/10.1002/cfg.452

Inderjeet Mani,¹ Zhangzhi Hu,¹ Seok Bae Jang,¹ Ken Samuel,² Matthew Krause,¹ Jon Phillips,¹ and Cathy H. Wu¹

Received 09 Dec 2004

Accepted 14 Dec 2004

Abstract

Interest in information extraction from the biomedical literature is motivated by the need to speed up the creation of structured databases representing the latest scientific knowledge about specific objects, such as proteins and genes. This paper addresses the lack of a standard definition of the protein name tagging problem. We describe the lessons learned in developing a set of guidelines and present the first set of inter-coder results, viewed as an upper bound on system performance. Problems that coders face include: (a) the ambiguity of names that can refer to either genes or proteins; (b) the difficulty of determining the exact extents of long protein names; and (c) the complexity of the guidelines. These problems have been addressed in two ways: (a) defining the tagging targets as protein named entities used in the literature to describe proteins or protein-associated or -related objects, such as domains, pathways, expression or genes; and (b) using two types of tags, protein tags and long-form tags, with the latter used optionally to extend the boundaries of the protein tag when the name boundary is difficult to determine. Inter-coder consistency across three annotators on protein tags on 300 MEDLINE abstracts is 0.868 F-measure. The guidelines and annotated datasets, along with automatic tools, are available for research use.
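As an illustration of how an inter-coder F-measure over protein tags might be computed, the following is a minimal sketch and not the procedure used in the paper. It assumes that each coder's tags are represented as a set of (start, end) character offsets, that two tags agree only when their extents match exactly, and that the three-coder figure is the mean of the pairwise scores. The abstract does not state the matching criterion or the averaging scheme, so all of these choices, along with the names `pairwise_f_measure`, `mean_pairwise_f`, and the toy spans, are hypothetical.

```python
from itertools import combinations

# A tag is a (start, end) character span; end is exclusive. (Assumed representation.)
Span = tuple[int, int]


def pairwise_f_measure(spans_a: set[Span], spans_b: set[Span]) -> float:
    """F-measure between two coders, counting only exact-extent matches.

    Coder A is treated as the reference and coder B as the response;
    with exact matching the resulting F-measure is symmetric in A and B.
    """
    if not spans_a and not spans_b:
        return 1.0  # both coders tagged nothing: treat as full agreement
    matches = len(spans_a & spans_b)
    if matches == 0:
        return 0.0
    precision = matches / len(spans_b)
    recall = matches / len(spans_a)
    return 2 * precision * recall / (precision + recall)


def mean_pairwise_f(annotations: dict[str, set[Span]]) -> float:
    """Average the pairwise F-measure over every pair of coders (assumed scheme)."""
    pairs = list(combinations(annotations.values(), 2))
    return sum(pairwise_f_measure(a, b) for a, b in pairs) / len(pairs)


# Toy data for one abstract: three coders who disagree on some extents.
coders = {
    "coder_1": {(0, 4), (10, 18), (25, 30)},
    "coder_2": {(0, 4), (10, 18)},
    "coder_3": {(0, 4), (10, 20), (25, 30)},
}

score = mean_pairwise_f(coders)
print(f"mean pairwise F-measure: {score:.3f}")
```

In this toy data, coder_3 extends the second name by two characters, so under exact matching that tag counts as a disagreement with both other coders. This is the kind of boundary problem that the optional long-form tag is meant to absorb.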

Copyright

Copyright © 2005 Hindawi Publishing Corporation. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
