Natural Language Understanding Laboratory

We use a high-throughput phenotyping system to rapidly assign ontology terms to available content or to an expression entered by a subject matter expert or potential trial participant.

Ontology is the modeling of reality with as much precision as the available information supports. In this project, we will use the Basic Formal Ontology as an upper-level ontology, and the Ontology for Biomedical Investigations and Systematized Nomenclature of Medicine Clinical Terms (SNOMED CT) as our main ontologies. Where these lack content coverage, we will look to Medical Subject Headings (MeSH), RxNorm and Logical Observation Identifiers Names and Codes (LOINC). These ontologies are used to index data from individual trials.

Attempting to match text with ontological terms requires a level of syntactic processing. The linguistic representation is specified in language models. Of primary concern to us is an English-language model to identify sentences, phrases, words and parts of speech. Terms from the input ontologies are then assigned to spans of text. Matching is done using string-matching techniques, which allow for inexact matches influenced by the underlying language model.
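As a rough illustration of inexact matching, the sketch below slides a window over whitespace tokens and compares each window with ontology labels using Python's standard difflib. The lexicon, the identifiers and the 0.85 threshold are made up for the example, and the tokenization is far simpler than a real language model; it is not a description of our production matcher.

```python
from difflib import SequenceMatcher

# Hypothetical mini-lexicon: surface label -> term identifier (made-up IDs).
LEXICON = {
    "myocardial infarction": "EX:0001",
    "diabetes mellitus": "EX:0002",
    "hypertension": "EX:0003",
}


def match_terms(text, lexicon=LEXICON, threshold=0.85):
    """Return (start_token, end_token, term_id, score) for each token span
    whose similarity to a lexicon label meets the threshold."""
    tokens = text.lower().split()
    matches = []
    for label, term_id in lexicon.items():
        width = len(label.split())
        # Compare every window of `width` tokens with the label.
        for start in range(len(tokens) - width + 1):
            candidate = " ".join(tokens[start:start + width])
            score = SequenceMatcher(None, candidate, label).ratio()
            if score >= threshold:
                matches.append((start, start + width, term_id, round(score, 2)))
    return sorted(matches)


if __name__ == "__main__":
    for match in match_terms("patient with myocardial infarctions and hypertension"):
        print(match)
```

Here the plural "infarctions" still maps to the singular label because the similarity ratio stays above the threshold, which is the kind of tolerance that inexact matching is meant to provide.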

When terms are matched to text, the goal is to select the term that has the meaning intended by the original text’s author. Term Sense Disambiguation (TSD) is the process by which a single term is selected from several that cover the same span of text in the input document. To employ TSD, we use the ontology in which the ambiguous terms are present. Excluding the is-a relation, we attempt to find a common ancestor in the ontology of each codified word in the surrounding text and, in turn, each candidate ambiguous term. The term most related to the surrounding text is selected.
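The selection step can be sketched on a toy hierarchy. In the example below, every name, the single generic parent link and the scoring rule (summed depth of the deepest shared ancestor) are assumptions chosen for illustration; the sketch does not model which relation types our system traverses or how it measures relatedness.

```python
# Hypothetical toy hierarchy: child -> parent. All names are illustrative.
PARENT = {
    "cold_temperature": "physical_property",
    "common_cold": "respiratory_infection",
    "respiratory_infection": "disease",
    "cough": "respiratory_finding",
    "respiratory_finding": "clinical_finding",
    "disease": "clinical_finding",
    "physical_property": "root",
    "clinical_finding": "root",
}


def ancestors(term):
    """Return the path from a term up to the top of the hierarchy, term first."""
    path = [term]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def depth(term):
    """Number of parent links between a term and the top of the hierarchy."""
    return len(ancestors(term)) - 1


def common_ancestor_depth(a, b):
    """Depth of the deepest ancestor shared by a and b (0 if only the root is shared)."""
    shared = set(ancestors(a)) & set(ancestors(b))
    return max((depth(t) for t in shared), default=0)


def disambiguate(candidates, context_terms):
    """Pick the candidate with the highest summed common-ancestor depth
    against the terms found in the surrounding text."""
    return max(
        candidates,
        key=lambda c: sum(common_ancestor_depth(c, t) for t in context_terms),
    )


if __name__ == "__main__":
    # "cold" is ambiguous; "cough" nearby favors the disease sense.
    print(disambiguate(["cold_temperature", "common_cold"], ["cough"]))
```

In this toy case the disease sense wins because it shares a deeper ancestor with "cough" than the temperature sense does.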

Negated and uncertain phrases are identified using a modified version of the NegEx algorithm, which is capable of identifying not only negated words and phrases but negated subwords (e.g., “steroidal” in “nonsteroidal”).
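The idea of trigger-based negation with subword handling can be sketched as follows. This is a heavily simplified illustration, not the published NegEx algorithm and not our modified version: the trigger list, the five-token scope window, the prefix list and the sentence splitting are all assumptions made for the example.

```python
import re

# Small illustrative trigger list; real NegEx trigger sets are far larger.
PRE_NEGATION_TRIGGERS = ("no", "not", "denies", "without")
NEGATING_PREFIXES = ("non",)  # assumption: subword negation via prefix stripping
WINDOW = 5                    # assumption: tokens after a trigger that are in scope


def find_negated(text, concepts):
    """Return the concepts (lowercase single words) that appear negated in text."""
    negated = set()
    # Scope never crosses a sentence boundary in this sketch.
    for sentence in re.split(r"[.!?]", text.lower()):
        tokens = re.findall(r"[a-z]+", sentence)
        for i, token in enumerate(tokens):
            if token in PRE_NEGATION_TRIGGERS:
                scope = tokens[i + 1:i + 1 + WINDOW]
                negated.update(c for c in concepts if c in scope)
            # Subword negation: "nonsteroidal" negates "steroidal".
            for prefix in NEGATING_PREFIXES:
                stem = token[len(prefix):]
                if token.startswith(prefix) and stem in concepts:
                    negated.add(stem)
    return negated


if __name__ == "__main__":
    note = "Patient denies fever. Takes a nonsteroidal agent."
    print(find_negated(note, {"fever", "steroidal", "agent"}))
```

In the sample note, "fever" is negated by the trigger "denies" and "steroidal" by the prefix "non", while "agent" remains affirmed.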

Compositional expressions are identified using a database of transformations that map the surface features of noun phrases to graphs, where the nodes in the graph are terms and the edges are relations.
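A minimal sketch of such a transformation table is shown below. The two regular-expression rules and the relation names has_site and has_severity are hypothetical, and plain strings stand in for ontology terms; the actual database and its surface features are not described here.

```python
import re

# Hypothetical rule table: surface pattern -> relation label for the edge.
RULES = [
    (re.compile(r"^(?P<head>.+) of (?:the )?(?P<mod>.+)$"), "has_site"),
    (re.compile(r"^(?P<mod>severe|mild|moderate) (?P<head>.+)$"), "has_severity"),
]


def to_graph(noun_phrase):
    """Return (nodes, edges) for the first rule that matches the phrase.
    Each edge is a (head, relation, modifier) triple."""
    phrase = noun_phrase.lower().strip()
    for pattern, relation in RULES:
        match = pattern.match(phrase)
        if match:
            head, mod = match.group("head"), match.group("mod")
            return {head, mod}, [(head, relation, mod)]
    # No rule applies: a single node and no edges.
    return {phrase}, []


if __name__ == "__main__":
    print(to_graph("Fracture of the femur"))
    print(to_graph("severe headache"))
```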

We have machinery to parse clinical notes and reports from a variety of formats and also biomedical research articles from the Journal Article Tag Suite (JATS) format.
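For readers unfamiliar with JATS, the following sketch pulls the title, abstract and section paragraphs out of a small JATS-style document with Python's standard xml.etree.ElementTree. The sample document and the function name extract_text are invented for illustration, and real articles typically need more handling (namespaces, nested sections, tables, references) than is shown here.

```python
import xml.etree.ElementTree as ET

# Tiny invented document using standard JATS element names.
SAMPLE = """<article>
  <front><article-meta>
    <title-group><article-title>An Example Title</article-title></title-group>
    <abstract><p>Short abstract text.</p></abstract>
  </article-meta></front>
  <body>
    <sec><title>Methods</title>
      <p>First paragraph.</p>
      <p>Second <italic>styled</italic> paragraph.</p>
    </sec>
  </body>
</article>"""


def _plain(element):
    """Flatten an element's text, including inline markup, to one clean string."""
    return " ".join("".join(element.itertext()).split())


def extract_text(jats_xml):
    """Return the title, abstract paragraphs and (heading, paragraphs) per section."""
    root = ET.fromstring(jats_xml)
    title = root.findtext(".//article-title", default="")
    abstract = [_plain(p) for p in root.iterfind(".//abstract//p")]
    sections = []
    for sec in root.iterfind(".//body//sec"):
        heading = sec.findtext("title", default="")
        paragraphs = [_plain(p) for p in sec.iterfind("p")]
        sections.append((heading, paragraphs))
    return {"title": title, "abstract": abstract, "sections": sections}


if __name__ == "__main__":
    print(extract_text(SAMPLE))
```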

Resources and Equipment

Our lab has dedicated Windows and Linux servers and access to high-performance computers (16,000 nodes) at UB’s Center for Computational Research (CCR).

Data Security

Secured data feeds from our clinical partners deliver raw data to the CCR, the National Science Foundation-funded supercomputing center located within UB’s New York State Center of Excellence in Bioinformatics and Life Sciences (CBLS). This data repository is isolated and maintained within administrative, physical and technical safeguard layers designed to ensure that individually identifiable information never leaves the facility.

Once our project receives Institutional Review Board (IRB) approval, our study data will be held on a separate, secured infrastructure also located in the CCR in a second, firewalled area of the data center.

Investigators can access the data via a virtualized desktop environment that will offer shared analysis tools accessible from any network location with appropriate role-based authentication and authorization.

Investigators may also locate approved analysis tools and equipment within the center. They will receive assistance to ensure that these tools are deployed in a manner conforming to the stipulated data security plan requirements.

Nonidentifiable data resulting from analysis may be stored and transferred to the researcher outside of the center’s secured infrastructure in an aggregated form, as required by the IRB-approved protocol.

Role-based access control and mechanisms for authentication and nonrepudiation of the translational research data warehouse are implemented in this security architecture. Our departmental chair and UB’s director of HIPAA compliance serve as the human keys to unlock our trusted broker technical infrastructure. No one individual has the authority to unlock the individually identifiable records from the data warehouse.
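The two-person rule described above can be expressed as a simple check. The sketch below is only a conceptual illustration with hypothetical role names; it does not describe the trusted broker infrastructure itself.

```python
# Hypothetical role names for the two required approvers.
REQUIRED_APPROVERS = frozenset({"department_chair", "hipaa_compliance_director"})


def can_unlock(approvals):
    """approvals maps a role name to True/False.
    Unlocking is allowed only when every required role has approved."""
    return all(approvals.get(role, False) for role in REQUIRED_APPROVERS)


if __name__ == "__main__":
    print(can_unlock({"department_chair": True}))
    print(can_unlock({"department_chair": True, "hipaa_compliance_director": True}))
```

A single approval is never sufficient, which reflects the principle that no one individual can unlock identifiable records.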

Data marts for IRB-approved protocols are created for security-trained investigators and for public health emergencies. Data in the warehouse have pseudoidentifiers added after deidentifying the patient’s health records or other research data (e.g., genomic and/or image data).
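One common way to derive a stable pseudoidentifier is a keyed hash of the original record identifier. The sketch below uses HMAC-SHA256 from the Python standard library; the field names, the "P-" prefix and the choice of HMAC are assumptions for illustration and do not describe how the warehouse actually generates its pseudoidentifiers. Real deidentification also involves far more than dropping a few fields.

```python
import hashlib
import hmac

# Hypothetical list of directly identifying fields to drop.
IDENTIFYING_FIELDS = ("mrn", "name", "dob")


def pseudo_id(record_id, secret_key):
    """Derive a stable pseudoidentifier from a record ID with a keyed hash.
    Without the secret key, the original ID cannot be recomputed or verified."""
    digest = hmac.new(secret_key, record_id.encode("utf-8"), hashlib.sha256)
    return "P-" + digest.hexdigest()[:16]


def deidentify(record, secret_key):
    """Drop identifying fields and attach the pseudoidentifier."""
    cleaned = {k: v for k, v in record.items() if k not in IDENTIFYING_FIELDS}
    cleaned["pseudo_id"] = pseudo_id(record["mrn"], secret_key)
    return cleaned


if __name__ == "__main__":
    key = b"example-key-kept-by-the-trusted-broker"
    record = {"mrn": "12345", "name": "Jane Doe", "dob": "1970-01-01", "hba1c": 6.1}
    print(deidentify(record, key))
```

Because the same record ID and key always yield the same pseudoidentifier, records for one patient can still be linked across data sets without exposing the original identifier.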

The research environment has portability ports disabled so that the end user cannot attach an external device to capture data. In addition, emailing from the system is disabled, and intrusion detection and protection systems are in place with monitoring and notification based on predefined service levels. Authorized personnel, however, can request that data be migrated off the data store, provided standard procedures are followed.

UB is strongly committed to the protection of its data, computer systems and networks. It provides for monitoring, protection and incident response with a variety of proactive measures to ensure the confidentiality, integrity and availability of data held within our university’s infrastructure.

UB’s Office for Research and Economic Development, in consultation with the Research Data Security Oversight Committee, is responsible for research data security. The office has implemented a comprehensive data security plan driven by risk assessment and risk management plans that are mindful of the importance of secure data sharing.

  • All human studies (including those of our clinical partner, Great Lakes Health) go through the common UB/Roswell Park Comprehensive Cancer Center IRB (accredited by the Association for the Accreditation of Human Research Protection Programs).
  • Data security is driven by a risk-assessment and risk-management paradigm guided by best practice standards such as HIPAA, International Organization for Standardization (ISO) 27001/2 and National Institute of Standards and Technology (NIST) 800 to protect privacy while ensuring data integrity and availability.
  • Tiered, risk-based security requirements and incident response plans are based on data element identification and risk categorization, program needs and federal and state regulatory requirements.
  • The security measures include policy-based identity assurance and data access controls.
  • This security structure is integrated into the existing IRB structure.

Data Governance and Implementation

UB has implemented a strong data governance policy. UB’s director of HIPAA compliance has institutionwide scope and works with the IRB to ensure all human subject research data is handled in a HIPAA-compliant manner. With the consent of both the director and the departmental chair, the translational research data warehouse (TRDW) can be unlocked to create a research data mart for IRB-approved protocols by security-trained investigators and for public health emergencies.

The TRDW has a pseudoidentifier added after fully deidentifying the patient’s health records or other research data (e.g., deep sequencing and other genomic data and image data).

Detailed data security policy is in place based on HIPAA, ISO-27001, ISO-27002 and relevant NIST SP 800 series best practices. Role-based access control and mechanisms are implemented for authentication and nonrepudiation.


We provide Natural Language Processing and Natural Language Understanding services on our website. These allow interested investigators to submit medical text and get back codes and graphs that represent the knowledge from the documents in a computable form.

We are also building a bibliome enrichment and systems biology resource that will be available to investigators.


Elkin, Peter


Professor and Chair, Department of Biomedical Informatics & Professor of Internal Medicine

UB Downtown Gateway 77 Goodell Buffalo, NY 14203

Phone: 716 888 4854



77 Goodell Street, 5th Floor
Buffalo, NY 1420