Biomedical Information Extraction

How to extract information from unstructured biomedical data and text.
What is BioIE? It includes any effort to extract structured information from unstructured (or at least inconsistently structured) biological, clinical, or other biomedical data. The data source is often a collection of text documents written in technical language. If the resulting information is verifiable and consistent across sources, we may then consider it knowledge. Extracting information and producing knowledge from biomedical data requires adapting methods developed for other types of unstructured data.

BioIE has undergone massive changes since the introduction of language models like BERT and, more recently, Large Language Models (LLMs; e.g., GPT-3/4, Llama 2/3, Gemini).

Resources included here are preferentially those available at no monetary cost and with limited license requirements. Methods and datasets should be publicly accessible and actively maintained.

See also awesome-nlp, awesome-biology and Awesome-Bioinformatics.

Please read the contribution guidelines before contributing. Please add your favourite resource by raising a pull request.

Contents

Research Overviews
Groups Active in the Field
Organizations
Journals and Events
Journals
Conferences and Other Events
Challenges
Tutorials
Guides
Video Lectures and Online Courses
Code Libraries
Repos for Specific Datasets
Tools, Platforms, and Services
Annotation Tools
Techniques and Models
Datasets
Biomedical Text Sources
Annotated Text Data
Protein-protein Interaction Annotated Corpora
Other Datasets
Ontologies and Controlled Vocabularies
Data Models
Credits

Research Overviews

LLMs in Biomedical IE

Large language models in healthcare: A comprehensive benchmark - a statistical and human evaluation of sixteen different LLMs applied to medical language tasks.

Assessing the research landscape and clinical utility of large language models: a scoping review - a high-level review of LLM applications in medicine as of March 2024.

Ethical and regulatory challenges of large language models in medicine - a review of ethical issues arising from applications of LLMs in biomedicine.

On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜 - a frequently cited and still relevant work concerning the roles, applications, and risks of language models.

Pre-LLM Overviews

Biomedical Informatics on the Cloud: A Treasure Hunt for Advancing Cardiovascular Medicine - An overview of how BioIE and bioinformatics workflows can be applied to questions in cardiovascular health and medicine research.

Clinical information extraction applications: A literature review - A review of clinical IE papers published as of September 2016. From the Mayo Clinic group (see below).

Literature Based Discovery: Models, methods, and trends - A review of Literature Based Discovery (LBD), or the philosophy that meaningful connections may be found between seemingly unrelated scientific literature.

For some historical context on LBD, see papers by University of Chicago's Don Swanson and Neil Smalheiser, including Undiscovered Public Knowledge (paywalled) and Rediscovering Don Swanson: the Past, Present and Future of Literature-Based Discovery.

Mining Electronic Health Records (EHRs): A Survey - A review of the methods and philosophy behind mining electronic health records, including using them for adverse event detection. See Table 2 for a list of relevant papers as of mid-2017.
