
Healthc Inform Res. 2018 Jul;24(3):242-246. English.
Published online July 31, 2018.
© 2018 The Korean Society of Medical Informatics
Construction of an Electrocardiogram Database Including 12 Lead Waveforms
Dahee Chung, PhD,1 Junggu Choi, BE,1 Jong-Hwan Jang, BA,1 Tae Young Kim, BE,1 JungHyun Byun, BS,1 Hojun Park, BS,1 Hong-Seok Lim, MD, PhD,2 Rae Woong Park, MD, PhD,1,3 and Dukyong Yoon, MD, PhD1,3
1Department of Biomedical Informatics, Ajou University School of Medicine, Suwon, Korea.
2Department of Cardiology, Ajou University School of Medicine, Suwon, Korea.
3Department of Biomedical Sciences, Ajou University Graduate School of Medicine, Suwon, Korea.

Corresponding Author: Dukyong Yoon, MD, PhD. Department of Biomedical Sciences, Ajou University Graduate School of Medicine, 206 World cup-ro, Yeongtong-gu, Suwon 16499, Korea. Tel: +82-31-219-4476, Email:
Received June 30, 2018; Revised July 19, 2018; Accepted July 24, 2018.

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License, which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.



Electrocardiogram (ECG) data are important for the study of cardiovascular disease and adverse drug reactions. Although the development of analytical techniques such as machine learning has improved our ability to extract useful information from ECGs, there is a lack of easily available ECG data for research purposes. We previously published an article on a database of ECG parameters and related clinical data (ECG-ViEW), which we have now updated with additional 12-lead waveform information.


All ECGs stored in portable document format (PDF) were collected from a tertiary teaching hospital in Korea over a 23-year study period. We developed software that extracts all ECG parameters and waveform information from the ECG reports in PDF format and stores them in a database (metadata) and text files (raw waveforms).


Our database includes all parameters (ventricular rate, PR interval, QRS duration, QT/QTc interval, P-R-T axes, and interpretations) and 12-lead waveforms (for leads I, II, III, aVR, aVL, aVF, V1, V2, V3, V4, V5, and V6) from 1,039,550 ECGs (from 447,445 patients). Demographics, drug exposure data, diagnosis history, and laboratory test results (serum calcium, magnesium, and potassium levels) were also extracted from electronic medical records and linked to the ECG information.


Electrocardiogram information, including 12-lead waveforms, was extracted and transformed into a form that can be analyzed. The description and program code in this case report could serve as a reference for other researchers building ECG databases from their own local ECG repositories.

Keywords: Electrocardiogram; Waveform; QT Interval; Database; Adverse Drug Reaction

I. Introduction

Electrocardiogram (ECG) has been widely used to diagnose various cardiovascular diseases including arrhythmia and acute coronary syndrome [1, 2, 3] because it is a non-invasive and convenient tool for measuring the continuous wave sequence characterizing the heart activity [2, 4].

Information from ECGs is also used to detect a prolonged QT interval, which is one of the life-threatening adverse drug reactions (ADRs). A prolonged QT interval leads to an irregular heart beat and can result in various types of cardiac arrest including ventricular fibrillation, ventricular tachyarrhythmia, Torsades de Pointes, and sudden death [5, 6, 7]. Due to its importance as a drug-induced adverse reaction, a prolonged QT interval is strictly monitored and regulated [5].

To address the needs of ECG data analysis, we previously constructed the ECG databases, Electrocardiogram Vigilance with Electronic data Warehouse I (ECG-ViEW I) and ECG-ViEW II, using ECG data measured in a tertiary teaching hospital located in Korea [8, 9]. However, these previous versions of the database had limitations because they did not provide the ECG waveform. To overcome these limitations, we designed an update of the ECG database.

II. Case Description

This study was a retrospective review of Electronic Health Records and was approved by the Ajou University Hospital Institutional Review Board (No. AJIRB-MED-MDB-18-075), which also waived the requirement for informed consent.

1. Data Resources and Patient Characteristics

ECG-ViEW I and II covered three data sources: scanned images of paper-based ECGs, ECGs in portable document format (PDF) from the MUSE system (GE Healthcare, Waukesha, WI, USA), and image files stored in the hospital's Electronic Medical Records (EMRs). In contrast, in this study we used only the PDF ECGs from the MUSE system and no images, for the following reasons. The images scanned from paper-based ECGs and the image files from the EMRs are saved as pixel images; the quality of information extracted from them by optical character recognition (OCR) depends on the image quality, and there is no appropriate way to extract waveforms from them with high accuracy. On the other hand, the quality of data extracted from PDF files from the MUSE system is stable and well controlled. Moreover, because the waveforms in the PDF files are saved as vector graphics that can be exported in scalable vector graphics (SVG) format, it is possible to export the waveforms while maintaining the quality of the raw data [10].

2. ECG Data Extraction

An ECG report typically contains both alphanumeric values and waveform graphs (Figure 1). The upper part of the ECG report is a list of alphanumeric values including demographic information, patient ID, evaluation date, ECG parameters, and interpretations (e.g., normal sinus rhythm). Demographic information refers to basic patient information including name, age, sex, and ethnicity. ECG parameters include ventricular rate, PR interval, QRS duration, QT/QTc, and P-R-T axes. The waveform graphs, which typically cover the middle and bottom part of the ECG, are time series of graphs representing the sensor measurement data.

Figure 1
Example of an electrocardiogram (ECG) report. Alphanumeric values (demographic information, ECG parameter values, and interpretation) are located in the upper part of the report. Waveform data are given as time-series graphs with a grid, covering the middle and lower part of the report. One grid unit (1 mm × 1 mm square) corresponds to 0.1 mV × 0.04 seconds.

The alphanumeric data were converted from PDF to eXtensible Markup Language (XML) format to increase the accuracy of the parsing results. XML offers two advantages over parsing the PDF directly. First, it simplifies the handling of irrelevant data. In PDF format, data are saved and parsed according to object type, and deletion of unnecessary data requires careful manual revision. In contrast, the XML format enables conditional parsing, and thus, relevant data can be extracted using automated code. Second, the XML format provides x- and y-coordinates for each piece of data. The origin of the coordinate axes is the upper left corner of the ECG report. Thus, alphanumeric information can be extracted based on position (i.e., the location of each piece of data on the ECG report).
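The position-based extraction described above can be sketched as follows. This is a minimal illustration rather than the authors' code (which is in Supplementary 1): the XML excerpt only imitates the `<text top=… left=…>` layout that pdftohtml writes, and the labels, values, coordinates, and the helper name `text_in_box` are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical excerpt in the style of pdftohtml XML output.
# Labels, values, and coordinates are invented for illustration.
SAMPLE_XML = """<pdf2xml>
  <page number="1" height="918" width="1188">
    <text top="60" left="40" width="90" height="12" font="0">Vent. rate</text>
    <text top="60" left="200" width="20" height="12" font="0">72</text>
    <text top="60" left="240" width="30" height="12" font="0">BPM</text>
    <text top="76" left="40" width="90" height="12" font="0">QT/QTc</text>
    <text top="76" left="200" width="50" height="12" font="0">400/438</text>
    <text top="76" left="260" width="20" height="12" font="0">ms</text>
  </page>
</pdf2xml>"""


def text_in_box(root, left, top, right, bottom):
    """Return text fragments whose upper-left corner lies inside the box.

    Coordinates are measured from the upper left corner of the page.
    Results are ordered top to bottom, then left to right.
    """
    hits = []
    for node in root.iter("text"):
        x, y = int(node.get("left")), int(node.get("top"))
        if left <= x < right and top <= y < bottom:
            hits.append((y, x, "".join(node.itertext()).strip()))
    return [text for _, _, text in sorted(hits)]


root = ET.fromstring(SAMPLE_XML)
# Select only the column that holds the parameter values.
value_column = text_in_box(root, left=190, top=50, right=230, bottom=100)
print(value_column)
```

Selecting by bounding box keeps labels and units out of the result without any manual clean-up, which is the benefit of conditional parsing noted above. The box limits would have to be calibrated against the actual report layout.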

The part of the PDF file containing the waveform data is stored as vector image data. To extract the image data, we used INKSCAPE (open-source software) in Linux and converted the waveforms in PDF to SVG format images. The waveforms in the SVG format were processed using the svgpathtools Python library to classify the parsed data into three categories: path, attribute, and svg_attribute. We used the ‘path’ data to convert the waveform into a numeric series. Each ‘path’ segment is described by a starting and an ending point expressed as complex numbers, whose real and imaginary parts give the position on a complex plane. After transforming this complex plane into a Cartesian coordinate system (Figure 2A), the starting point of each waveform was adjusted to ‘(0,0)’ (Figure 2B). The baseline of each waveform had to be reset because the starting point of the waveforms in the raw data corresponds to a certain position on the ECG report. Then, we rescaled the values of the time series data to millivolts (mV); 1 mV corresponds to 10 times the height of the grid unit (square) used in the ECG report (Figure 2C). Finally, we converted the x- and y-coordinates of the vector images to an equidistant time series similar to that obtained from the sensor (Figure 2D). We set the frequency of the data at 500 Hz and identified data points on the waveform using linear interpolation between points with known coordinates (the frequency of the raw data varied from about 200 to 420 Hz).
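The four transformation steps can be sketched in plain Python as below. This is an illustrative sketch under stated assumptions, not the code from Supplementary 1. The value of 28 y-units per mV and the 500 Hz target come from the article; the x scale of 70 units per second is an assumption (the same 2.8 units per mm applied to a paper speed of 25 mm/s, which follows from 0.04 s per 1 mm grid unit), as is the sign flip for the downward-pointing SVG y-axis. The function name `to_time_series` is hypothetical, and the input is assumed to be a list of complex path points with strictly increasing x.

```python
from bisect import bisect_right

UNITS_PER_MV = 28.0       # from the article: 1 mV spans 28 y-units in the raw data
UNITS_PER_SECOND = 70.0   # assumption: 2.8 units/mm on x as well, at 25 mm/s
TARGET_HZ = 500           # from the article


def to_time_series(points, hz=TARGET_HZ):
    """Convert complex path points (x + yj) into an equidistant series in mV."""
    x0, y0 = points[0].real, points[0].imag
    # Steps A and B: complex plane to Cartesian, start point moved to (0, 0).
    # Step C: scale to seconds and mV. The y sign is flipped on the
    # assumption that SVG y grows downward.
    t = [(p.real - x0) / UNITS_PER_SECOND for p in points]
    mv = [(y0 - p.imag) / UNITS_PER_MV for p in points]

    # Step D: linear interpolation onto a regular grid of 1/hz seconds.
    n_samples = int(t[-1] * hz + 1e-9) + 1
    series = []
    for i in range(n_samples):
        ti = i / hz
        # Index of the known point to the right of ti, clamped to the ends.
        j = max(1, min(bisect_right(t, ti), len(t) - 1))
        t_a, t_b = t[j - 1], t[j]
        w = 0.0 if t_b == t_a else (ti - t_a) / (t_b - t_a)
        series.append(mv[j - 1] + w * (mv[j] - mv[j - 1]))
    return series


# Synthetic triangle: rises by 1 mV over 0.1 s, then returns to baseline.
raw_points = [100 + 200j, 107 + 172j, 114 + 200j]
series = to_time_series(raw_points)
print(len(series), series[25], series[50])
```

With real data, the point list would be gathered from the start and end points of the parsed path segments, and both scale constants should be checked against the grid of the actual report.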

Figure 2
Data transformation process for the waveform data. (A) Raw data in SVG format contains information regarding the exact position of the data point on the electrocardiogram. (B) The start point of all 12-lead waveforms was set to 0. (C) Because 1 mV in the raw data corresponds to y-axis values of 28, we divided all y-axis values by 28 to adjust the scale of the y-axis to mV. (D) Using linear interpolation (500 Hz), vector image data given as x- and y-coordinates were converted into equidistant time series data. Although the resulting waveform data consist of a series of values without timestamps, the timestamps could be calculated by counting the data points from the starting point because the starting point was provided in the file name or database, and one data point corresponds to 1/500 seconds.

There are 13 waveforms in a single ECG report: 3-second strips for each of the 12 leads, plus a 10-second strip, usually for lead II. Each waveform was saved in a separate file in comma-separated value (CSV) format, resulting in 13 waveform CSV files per ECG report. The CSV files were compressed using gzip, while their metadata were stored in a database to link the waveform data with the corresponding alphanumeric values from the ECG report as well as with the clinical data from the EMR.
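As a rough sketch of this storage scheme, the snippet below writes one waveform to a gzip-compressed CSV file and reads it back, reconstructing timestamps from the sample index at 500 Hz. The one-value-per-row layout, the file name, the start time, and the function names are assumptions made for illustration; the article does not specify the exact file layout.

```python
import csv
import gzip
import os
import tempfile
from datetime import datetime, timedelta


def save_waveform(path, samples_mv):
    """Write one waveform as a gzip-compressed CSV, one value per row."""
    with gzip.open(path, "wt", newline="") as fh:
        writer = csv.writer(fh)
        for value in samples_mv:
            writer.writerow([f"{value:.4f}"])


def load_waveform(path, start, hz=500):
    """Read a waveform and attach timestamps derived from the sample index."""
    with gzip.open(path, "rt", newline="") as fh:
        values = [float(row[0]) for row in csv.reader(fh)]
    step = timedelta(seconds=1 / hz)  # one data point corresponds to 1/500 s
    return [(start + i * step, v) for i, v in enumerate(values)]


tmp_dir = tempfile.mkdtemp()
# Hypothetical file name; in practice the start time would come from the
# file name or from the metadata database.
path = os.path.join(tmp_dir, "ecg_0001_lead_II.csv.gz")
save_waveform(path, [0.0, 0.25, 0.5])
rows = load_waveform(path, start=datetime(2018, 1, 1, 9, 0, 0))
print(rows[2])
```

Storing values without timestamps keeps the files small, since the time of each sample follows from its position and the recording start.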

3. Software Tools

We used a Java programming tool to extract the PDF files from the MUSE system, and a Linux-based program (pdftohtml) to convert PDF to the XML format. INKSCAPE for Linux was used to convert PDF to the SVG format. The parsing of XML and SVG files was performed using Python, and the svgpathtools library was used to extract the waveform data. The ElementTree library was used to parse the XML formats. All software tools and codes used during the parsing process are available in Supplementary 1.
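For orientation, the two conversion steps could be driven from Python roughly as follows. This sketch only builds the command lines and does not run them. The `-xml` option of pdftohtml is standard, but the Inkscape invocation shown uses `--export-filename`, which belongs to Inkscape 1.x; the article does not state which Inkscape version was used, and older releases take different export options, so treat that line as an assumption. The paths and the function name are hypothetical.

```python
import shlex
from pathlib import Path


def conversion_commands(pdf_path, out_dir):
    """Build (but do not run) the PDF-to-XML and PDF-to-SVG commands."""
    pdf = Path(pdf_path)
    out = Path(out_dir)
    xml_base = (out / pdf.stem).as_posix()            # pdftohtml appends ".xml"
    svg_file = (out / (pdf.stem + ".svg")).as_posix()
    return [
        # Alphanumeric data: PDF to XML with positional attributes.
        ["pdftohtml", "-xml", pdf.as_posix(), xml_base],
        # Waveforms: PDF to SVG (Inkscape 1.x syntax, assumed).
        ["inkscape", pdf.as_posix(), f"--export-filename={svg_file}"],
    ]


commands = conversion_commands("reports/ecg_0001.pdf", "converted")
for cmd in commands:
    print(shlex.join(cmd))
    # To execute: subprocess.run(cmd, check=True)
```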

4. Data Validation and Quality Control

The accuracy of the ECG data extraction was validated by comparing the extracted QTc values with QTc values calculated from the QT and RR intervals using Bazett's formula. Approximately 99.94% of the extracted and calculated QTc values matched within ±2 ms. We attribute the remaining differences to rounding: the QT and RR values that we used to calculate QTc had already been rounded to integers, so they might differ slightly from the original values used when calculating QTc in the ECG machine. For the extracted results in which the difference from the calculated QTc value was relatively high (> 2 ms), we manually reviewed the results and confirmed that there was no error in the data extraction process.
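The validation check can be illustrated with a short sketch. Bazett's formula is QTc = QT / √RR, with RR expressed in seconds; the ±2 ms tolerance comes from the text above. The example records and function names are invented for illustration and are not taken from the database.

```python
import math


def bazett_qtc(qt_ms, rr_ms):
    """Bazett's formula: QTc = QT / sqrt(RR), with RR in seconds."""
    return qt_ms / math.sqrt(rr_ms / 1000.0)


def within_tolerance(extracted_qtc_ms, qt_ms, rr_ms, tol_ms=2.0):
    """True if the extracted QTc agrees with the recalculated value."""
    return abs(extracted_qtc_ms - bazett_qtc(qt_ms, rr_ms)) <= tol_ms


# Hypothetical records: (QT in ms, RR in ms, extracted QTc in ms).
records = [
    (400, 1000, 400),  # RR of 1 s leaves QT unchanged
    (400, 833, 438),   # small difference explained by rounding
    (380, 800, 440),   # large difference, would be sent for manual review
]
flagged = [r for r in records if not within_tolerance(r[2], r[0], r[1])]
print(flagged)
```

Records that fall outside the tolerance are the ones that would be reviewed manually, as described above.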

Because the frequency of the waveforms was adjusted to 500 Hz, the converted waveform could differ slightly from the raw data at locations where the time series data points do not coincide with the original x- and y-coordinates. To validate the quality of the linear interpolation, we compared the original waveform to the converted waveform for randomly chosen waveforms. The difference was not noticeable, as shown in Figure 3.

Figure 3
Comparison between the raw waveform and converted waveform data. Due to high-density interpolation, there was no significant difference between the raw data (which are based on x- and y-coordinates) and time series data.

5. Data in the Database

The ECG database contains a total of 1,039,550 ECGs from 447,445 patients (Table 1). The mean follow-up period per person was 717 ± 1,534 days.

Table 1
Summary of demographic and ECG data covered by ECG-ViEW III (n = 447,445)

6. Software Availability

All programming codes for extracting both the alphanumeric and waveform data are provided in Supplementary 1.

III. Discussion

The update of the ECG databases includes a complete dataset in which the relevant data (ECG values, ECG waveforms, and demographic, diagnosis, medication, laboratory, and any other information related to the hospital visit) are provided for all patients covered by the database. Therefore, the database could be used as a data source in various studies including comprehensive clinical evaluation to determine the potential associations between ECG values or patterns and specific diagnoses, medications, or hospital visit characteristics.

We are currently working on collecting biosignal data from patient monitoring devices from more than 100 beds in an emergency room, intensive care units, and an operating room [11]. All biosignals including ECG lead II, peripheral capillary oxygen saturation, respiration, arterial blood pressure, central venous pressure, and end-tidal CO2 data are collected onto a local server. By constructing the ECG database from the ECG reports, we could expand coverage of the biosignal collection into general wards.

The ECG database described in this article is one of the largest ECG databases linked to relevant clinical data. It integrates all 12-lead ECG waveforms, not only the numeric ECG parameters, patient demographics, diagnosis data, and drug prescription data. Although the full dataset cannot be made publicly available due to legal restrictions imposed by the Korean government in relation to the Personal Information Protection Act, we expect that the description of the process for constructing the database and the program code provided in Supplementary 1 could be a good reference for other researchers building ECG databases from their own local ECG repositories.

Supplementary Materials

Supplementary materials can be found via

Supplement 1

Software for extracting information from an ECG report in PDF format



Conflict of Interest: No potential conflict of interest relevant to this article was reported.


This research was supported by a grant of the Korea Health Technology R&D Project through the Korea Health Industry Development Institute (KHIDI), funded by the Ministry of Health & Welfare, Republic of Korea (No. HI16C0982, HI17C0970, and HI6C0992).

1. Ovreiu M, Simon D. Biogeography-based optimization of neuro-fuzzy system parameters for diagnosis of cardiac disease; Proceedings of the 12th Annual Conference on Genetic and Evolutionary Computation; 2010 Jul 7–11; Portland, OR. pp. 1235-1242.
2. Martis RJ, Acharya UR, Adeli H. Current methods in electrocardiogram characterization. Comput Biol Med 2014;48:133–149.
3. O'Neil BJ, Hoekstra J, Pride YB, Lefebvre C, Diercks D, Frank Peacock W, et al. Incremental benefit of 80-lead electrocardiogram body surface mapping over the 12-lead electrocardiogram in the detection of acute coronary syndromes in patients without ST-elevation myocardial infarction: results from the Optimal Cardiovascular Diagnostic Evaluation Enabling Faster Treatment of Myocardial Infarction (OCCULT MI) trial. Acad Emerg Med 2010;17(9):932–939.
4. Laguna P, Jane R, Caminal P. Automatic detection of wave boundaries in multilead ECG signals: validation with the CSE database. Comput Biomed Res 1994;27(1):45–60.
5. Lynch DR Jr, Washam JB, Newby LK. QT interval prolongation and torsades de pointes in a patient undergoing treatment with vorinostat: a case report and review of the literature. Cardiol J 2012;19(4):434–438.
6. Li XB, Tang YL, Zheng W, Wang CY, de Leon J. QT interval prolongation associated with intramuscular ziprasidone in Chinese patients: a case report and a comprehensive literature review with meta-analysis. Case Rep Psychiatry 2014;2014:489493.
7. Tarapues M, Cereza G, Arellano AL, Montane E, Figueras A. Serious QT interval prolongation with ranolazine and amiodarone. Int J Cardiol 2014;172(1):e60–e61.
8. Park MY, Yoon D, Choi NK, Lee J, Lee K, Lim HS, et al. Construction of an open-access QT database for detecting the proarrhythmia potential of marketed drugs: ECG-ViEW. Clin Pharmacol Ther 2012;92(3):393–396.
9. Kim YG, Shin D, Park MY, Lee S, Jeon MS, Yoon D, et al. ECG-ViEW II, a freely accessible electrocardiogram database. PLoS One 2017;12(4):e0176222.
10. Ortigosa N, Gimenez VM. Raw data extraction from electrocardiograms with Portable Document Format. Comput Methods Programs Biomed 2014;113(1):284–289.
11. Yoon D, Lee S, Kim TY, Ko J, Chung WY, Park RW. System for collecting biosignal data from multiple patient monitoring systems. Healthc Inform Res 2017;23(4):333–337.