Biomedical Text Mining Unit

OTG Sanidad | Secretaría de Estado para la Agenda Digital


The Text Mining Unit (TEMU) at the Barcelona Supercomputing Center (BSC) focuses on the application and development of biomedical text mining technologies, which are becoming a key tool for the efficient exploitation of information contained in unstructured data repositories, including the scientific literature, electronic health records (EHRs), patents, biobank metadata, clinical trials and social media. The unit has a particular interest in processing health-related clinical documents written in Spanish and the other co-official languages, and in integrating molecular and biological information derived from the literature.

The unit is fully funded through the “Plan de Impulso de las Tecnologías del Lenguaje de la Agenda Digital (Plan TL)”, in the framework of an agreement (“encomienda”) between BSC and the Secretary of State of Telecommunications of the Spanish Ministry of Economic Affairs and Digital Transformation.

Aims & Objectives

The strategic goals of the Text Mining Unit are:

  • To design and develop biomedical language-processing resources, with an emphasis on oncology.
  • To provide consultancy and technical advice on language technologies in the biomedical domain.
  • To define requirements and standards for the interoperability of biomedical language technologies.
  • To coordinate community assessment and evaluation challenges for biomedical text mining tasks.
  • To promote the uptake of biomedical text mining technologies and relevant standards.

One of the main aims of the unit is to provide biomedical text mining and language processing infrastructures that can be maintained efficiently over time and integrated into biomedical analysis platforms comprising data from experimental outcomes of patient-derived information.

Tools & Components

NLP components

  • Medical PoS tagger: a part-of-speech tagger for Spanish medical-domain corpora, based on FreeLing.

Other tools

Resources & Corpora

This page provides links to various types of resources, developed both by TEMU and externally.

Annotated corpora

    SPACCC_SPLIT - A collection of 1,000 clinical cases in Spanish in which sentence boundaries are marked up.

    SPACCC_TOKEN - A collection of 1,000 clinical cases in Spanish in which sentence tokens are marked up.

    SPACCC_POS - A collection of 1,000 clinical cases in Spanish annotated with part-of-speech tags.

Terminological resources

Spanish Medical Abbreviation DataBase - The database is created automatically by detecting abbreviations and their potential definitions when both are explicitly mentioned in the same sentence. The abbreviations are extracted from the metadata (titles and abstracts) of biomedical publications written in Spanish. The sources of these publications are SciELO, IBECS and PubMed.

Bilingual medical glossaries - Bilingual medical glossaries for various language pairs generated from free online medical glossaries and dictionaries made by professional translators.
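The same-sentence detection described for the Spanish Medical Abbreviation DataBase can be illustrated with a minimal sketch. It is not the unit's actual pipeline, which this page does not describe. It assumes the simplest pattern only: a long form immediately followed by its short form in parentheses, with each letter of the short form matching the initial of one content word. The function name `find_abbreviation_pairs` and the stop-word list are hypothetical.

```python
import re

# Hypothetical stop-word list: Spanish function words that usually
# contribute no letter to an abbreviation.
STOPWORDS = {"de", "del", "la", "el", "los", "las", "y", "en"}

# A short form in parentheses: 2 to 10 uppercase letters or digits,
# starting with a letter.
SHORT_FORM = re.compile(r"\(([A-ZÁÉÍÓÚÑ][A-ZÁÉÍÓÚÑ0-9]{1,9})\)")


def find_abbreviation_pairs(sentence):
    """Return (short_form, long_form) pairs found in one sentence.

    Assumption: the long form directly precedes "(SHORT)" and each letter
    of the short form is the initial of one content word, in order.
    """
    pairs = []
    for match in SHORT_FORM.finditer(sentence):
        short = match.group(1)
        words = re.findall(r"\w+", sentence[: match.start()])
        letters = list(short.lower())
        candidate = []
        # Walk backwards through the preceding words, consuming one
        # abbreviation letter per content word.
        for word in reversed(words):
            if not letters:
                break
            if word.lower() in STOPWORDS and candidate:
                # Function words inside the long form are kept but
                # do not consume a letter.
                candidate.append(word)
                continue
            if word[0].lower() == letters[-1]:
                letters.pop()
                candidate.append(word)
            else:
                break
        # Accept the candidate only if every letter was matched.
        if not letters:
            pairs.append((short, " ".join(reversed(candidate))))
    return pairs


if __name__ == "__main__":
    text = "Infección por virus de la inmunodeficiencia humana (VIH)."
    print(find_abbreviation_pairs(text))
```

This sketch misses many real cases, for example definitions that follow the abbreviation, accented initials, or short forms that take more than one letter from a word. A production system would need to handle those.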

Translation models for neural machine translation

These are the translation models required to run the Neural Machine Translation (NMT) system for the Biomedical Domain. The available translation directions are: English to Spanish, Spanish to English, English to Portuguese, Portuguese to English, Spanish to Portuguese and Portuguese to Spanish.

Online demos


Talks & Presentations

  • Gonzalez-Agirre, A.; Vivanco-Hidalgo, R.M.; Abilleira, S.; Gallofré, M.; Valencia, A.; Villegas, M. and Krallinger, M. Mining Spanish and Catalan Electronic Health Records: Extraction of Information on Diagnosis of Stroke from Discharge Reports. In 3rd European Conference on Translational Bioinformatics: Biomedical Big Data Supporting Precision Medicine, 2018.


Martin Krallinger: head of the Biological Text Mining Unit.
Aitor Gonzalez-Agirre: postdoctoral researcher.
Ander Intxaurrondo: postdoctoral researcher.
Montserrat Marimon: senior researcher.
Jesus Santamaria: postdoctoral researcher.
Felipe Soares: research engineer.
Marta Villegas: senior researcher.



  • Villegas, M., Intxaurrondo, A., Gonzalez-Agirre, A., Marimon, M. and Krallinger, M. The MeSpEN resource for English-Spanish medical machine translation and terminologies: Census of parallel corpora, glossaries and term translations. In Proceedings of the LREC 2018 Workshop “MultilingualBIO: Multilingual Biomedical Text Processing”, pp. 32–39. ISBN: 979-10-95546-03-0, EAN: 9791095546030. [PDF]
  • Intxaurrondo, A., Marimon, M., Gonzalez-Agirre, A., Lopez-Martin, J.A., Rodriguez, H., Santamaria, J., Villegas, M. and Krallinger, M. Finding mentions of abbreviations and their definitions in Spanish Clinical Cases: the BARR2 shared task evaluation results. In Proceedings of the Third Workshop on Evaluation of Human Language Technologies for Iberian Languages (IberEval-2018), co-located with the 34th Conference of the Spanish Society for Natural Language Processing (SEPLN-2018), pp. 280-289. [PDF]
  • Santamaría, J. and Krallinger, M. Construcción de recursos terminológicos médicos para el español: el sistema de extracción de términos CUTEXT y los repositorios de términos biomédicos. In Procesamiento del Lenguaje Natural, nº 61, pp. 49-56. [PDF]
  • Corvi, J., Fernandez, J.M., Intxaurrondo, A., Krallinger, M., Valencia, A. and Capella-Gutierrez, S. Updating the LimTox Content provider workflow. In XIV Symposium on Bioinformatics (JBI-2018). [Abstract booklet (page 111)]


  • Villegas, M., de la Peña, S., Intxaurrondo, A., Santamaria, J. and Krallinger, M. Esfuerzos para fomentar la minería de textos de biomedicina más allá del inglés: el plan estratégico nacional español para las tecnologías del lenguaje. In Procesamiento del Lenguaje Natural, nº 59, pp. 141-144. ISBN: 979-10-95546-03-0, EAN: 9791095546030. [PDF]
  • Intxaurrondo, A., Perez-Perez, M., Perez-Rodriguez, G., Lopez-Martin, J.A., Santamaria, J., de la Peña, S., Villegas, M., Ahmad-Akhondi, S., Valencia, A., Lourenço, A. and Krallinger, M. The Biomedical Abbreviation Recognition and Resolution (BARR) Track: Benchmarking, Evaluation and Importance of Abbreviation Recognition Systems Applied to Spanish Biomedical Abstracts. In Proceedings of the Second Workshop on Evaluation of Human Language Technologies for Iberian Languages (IberEval 2017), co-located with the 33rd Conference of the Spanish Society for Natural Language Processing (SEPLN 2017), pp. 230-246. [PDF]
  • Intxaurrondo, A. and Krallinger, M. CNIO at BARR IberEval 2017: Exploring Three Biomedical Abbreviation Identifiers for Spanish Biomedical Publications. In Proceedings of the Second Workshop on Evaluation of Human Language Technologies for Iberian Languages (IberEval 2017), co-located with the 33rd Conference of the Spanish Society for Natural Language Processing (SEPLN 2017), pp. 278-285. [PDF]