This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research, is properly cited. The complete bibliographic information, a link to the original publication on http://www.jmir.org/, as well as this copyright and license information must be included.
The beginning of the coronavirus disease (COVID-19) epidemic dates back to December 31, 2019, when the first cases were reported in the People’s Republic of China. In the Czech Republic, the first three cases of infection with the novel coronavirus were confirmed on March 1, 2020. The joint effort of state authorities and researchers gave rise to a unique team, which combines methodical knowledge of real-world processes with the know-how needed for effective processing, analysis, and online visualization of data.
Due to an urgent need for a tool that presents important reports based on valid data sources, a team of government experts and researchers focused on the design and development of a web app intended to provide a regularly updated overview of COVID-19 epidemiology in the Czech Republic to the general population.
The cross-industry standard process for data mining (CRISP-DM) model was chosen for the complex solution of analytical processing and visualization of data that provides validated information on the COVID-19 epidemic across the Czech Republic. Great emphasis was put on understanding and correctly implementing all six steps of the process (business understanding, data understanding, data preparation, modeling, evaluation, and deployment), including the infrastructure of a nationwide information system; the methodological setting of communication channels between all involved stakeholders; and data collection, processing, analysis, validation, and visualization.
The web-based overview of the current spread of COVID-19 in the Czech Republic has been developed as an online platform providing a set of outputs in the form of tables, graphs, and maps intended for the general public. On March 11, 2020, the first version of the web portal, containing fourteen overviews divided into five topical sections, was released. The web portal’s primary objective is to publish a well-arranged visualization and clear explanation of basic information consisting of the overall numbers of performed tests, confirmed cases of COVID-19, COVID-19-related deaths, the daily and cumulative overviews of people with a positive COVID-19 case, performed tests, location and country of infection of people with a positive COVID-19 case, hospitalizations of patients with COVID-19, and distribution of personal protective equipment.
The online interactive overview of the current spread of COVID-19 in the Czech Republic was launched on March 11, 2020, and immediately became the primary communication channel employed by the health care sector to present the current situation regarding the COVID-19 epidemic. This complex reporting of the COVID-19 epidemic in the Czech Republic also shows an effective way to interconnect knowledge held by various specialists, such as regional and national methodology experts (who report positive cases of the disease on a daily basis), with knowledge held by developers of central registries, analysts, developers of web apps, and leaders in the health care sector.
In early 2020, the pandemic of the coronavirus disease (COVID-19) started to spread all over the world. COVID-19 is caused by a novel type of coronavirus, referred to as severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). This highly infectious disease is mainly manifested by fever, respiratory difficulty (cough, dyspnea), muscle pain, and fatigue. The disease can be rather serious for people who are older or have chronic illnesses and can even be fatal. The beginning of the COVID-19 epidemic dates back to December 31, 2019, when the first cases were reported in the city of Wuhan in Hubei Province in the People’s Republic of China. Over the next 3 months, more than 1,120,000 people became infected across the world and almost 60,000 patients died from the disease [
The aim of the research team was to adopt one of the time-tested methodologies for data mining, analytics, knowledge discovery, and data science projects, and to apply it in the process of mapping the current COVID-19 epidemic situation in the Czech Republic. This paper describes all essential steps from methodological as well as technical points of view. The Czech approach to the design, development, and implementation of online monitoring of the COVID-19 epidemic is based on a verified methodology for the acquisition, processing, and presentation of information. The methodology discussed in this paper made it possible to interconnect knowledge held by various specialists such as regional and national methodology experts from the National Institute of Public Health and regional public health authorities, who report positive cases of the disease on a daily basis, with knowledge held by developers of central registries, analysts, developers of web apps, and leaders in the health care sector.
When the urgent need emerged to map the current situation regarding the COVID-19 epidemic, it was essential to consider several factors that might have a significant impact on the resulting reports. The infrastructure of a nationwide information system run by the public administration is one of the most important factors in this regard; data from basic registries as well as data from health service providers are collected and processed in this information system. The cooperation between organizations involved in the process of data reporting, collection, processing, validation, analysis, evaluation, and visualization is another key aspect. The methodological setting of communication channels, duties resulting from legal measures, clearly defined competence, the sequence of steps to be made, and the overall management of the health care sector are complex at the time of an unexpected pandemic. In addition, it is important to select appropriate procedures for knowledge mining from database structures and for an undistorted interpretation of data provided to individual target groups. These groups involve not only the general public and health care professionals but also the media, as the resulting reports are published on behalf of the Ministry of Health of the Czech Republic and can, therefore, be considered as guaranteed and entirely reliable.
Among the known methodological recommendations and standardized procedures that implement knowledge discovery in databases, the Sample, Explore, Modify, Model, Assess (SEMMA) process and the cross-industry standard process for data mining (CRISP-DM) are used most frequently in practice [
In the initial stage, maximum attention is paid to mapping the situation from the managerial point of view. In this case, the task is focused on online visualization of data on the current state of the COVID-19 epidemic in the Czech Republic. The Ministry of Health of the Czech Republic is responsible for the methodological setting of regular reports and processing data on newly identified cases of COVID-19 across the Czech Republic (see
Simplified schema of newly identified cases of COVID-19. COVID-19: coronavirus disease.
For a correct collection and processing of information on the COVID-19 epidemiology in the Czech Republic, it is essential that all reporting processes are well understood, that communication with all involved institutions runs smoothly, that the architecture of central databases is well designed from the technical point of view, and that the resulting reports are correctly interpreted. At the same time, all interactive outputs available to the general public, whether in the form of graphs, maps, or tables, must meet the condition that they will never lead to a direct identification of any patient. It is therefore not possible to show detailed information on a district (or even a town) in combination with sex or age of a given person. What can be published, on the other hand, are summary data on the entire Czech Republic (or possibly on individual regions or districts), as the population of the entire country (or region or district) is large enough and the total number of positive cases is higher than 10, which cannot lead to the direct identification of a particular patient. Other examples of outputs that can be published involve daily reports on cumulative numbers and on the increase or decrease in the number of confirmed cases or division according to sex or age groups. When processing personal data in the various parts of the NHIS, every individual record must be processed in accordance with Regulation (EU) 2016/679 of the European Parliament and of the Council of April 27, 2016, on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46/EC (General Data Protection Regulation), Act No 372/2011 Coll, on Health Services and Conditions of Their Provision (Act on Health Services), as subsequently amended, and the Act No 110/2019 Coll, on the Processing of Personal Data, as subsequently amended.
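The publication rule described above (aggregate first, then release a cell only when it holds more than 10 positive cases) can be illustrated with a minimal Python sketch. The threshold follows the text; the function name `suppress_small_counts`, the record layout, and the use of `None` to mark a withheld cell are illustrative assumptions and do not describe the portal's actual implementation.

```python
from collections import Counter
from typing import Optional

# Threshold taken from the text: a cell is publishable only when it
# contains more than 10 positive cases.
MIN_PUBLISHABLE = 10


def suppress_small_counts(
    cases: list[dict], keys: tuple[str, ...]
) -> dict[tuple, Optional[int]]:
    """Aggregate cases by `keys` and mask cells too small to publish."""
    counts = Counter(tuple(case[k] for k in keys) for case in cases)
    return {
        cell: (n if n > MIN_PUBLISHABLE else None)  # None marks a withheld cell
        for cell, n in counts.items()
    }


# Hypothetical input: 12 cases in one region, 3 in another.
cases = [{"region": "CZ064", "sex": "F"}] * 12 + [{"region": "CZ010", "sex": "M"}] * 3
published = suppress_small_counts(cases, keys=("region",))
```

In this sketch the region with 12 cases keeps its count, while the region with 3 cases is withheld. Grouping by a finer combination such as `("region", "sex")` would mask more cells, which mirrors why district-level data are not combined with sex or age.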
The interactive online reporting is based on the following two data sets, which, combined, characterize the COVID-19 epidemic in the Czech Republic. First, reports by regional public health authorities (RPHAs) contain daily records of persons with confirmed COVID-19 (eg, date of report, age, sex, region, location, and country of infection). Second, reports by laboratories (LAB) contain deidentified records on persons with confirmed COVID-19 that have not yet been taken over by an RPHA in the respective region. In relation to reports by RPHAs, these are disjoint sets of records (each record goes to the LAB repository and is subsequently either approved or directly confirmed by the respective RPHA); in other words, no record is present in both data sets, and undesirable duplicates are, therefore, avoided.
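The disjointness property of the RPHA and LAB data sets can be checked before the two are combined. The following Python sketch assumes that every record carries a unique key, here called `record_id`; that field name and the function `merge_disjoint` are hypothetical and serve only to illustrate the idea.

```python
def merge_disjoint(rpha_records: list[dict], lab_records: list[dict]) -> list[dict]:
    """Combine both exports, refusing to proceed if a record is in both."""
    rpha_ids = {r["record_id"] for r in rpha_records}  # hypothetical key
    lab_ids = {r["record_id"] for r in lab_records}
    overlap = rpha_ids & lab_ids
    if overlap:
        # A shared identifier would be counted twice in the totals.
        raise ValueError(f"records present in both data sets: {sorted(overlap)}")
    return rpha_records + lab_records


# Hypothetical example: two RPHA records and one LAB record, no overlap.
rpha = [{"record_id": 1, "region": "CZ064"}, {"record_id": 2, "region": "CZ010"}]
lab = [{"record_id": 3, "region": "CZ064"}]
all_cases = merge_disjoint(rpha, lab)
```

Failing loudly on an overlap, rather than silently dropping one copy, keeps the total number of confirmed cases traceable to the two sources.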
Diagram of data transfer within the Information System of Infectious Diseases.
Several different data sources are used as input at this stage; these data sources must be validated thoroughly and then transformed into a form suitable for computer processing and subsequent visualization. In particular, the validation process involves a set of control mechanisms ensuring the completeness of individual records. This stage involves data cleaning (incomplete records are detected, corrected, or removed), construction of particular data views (selected attributes with primary identification are derived and merged), and data integration (final data sets are automatically generated). This process ensures that no invalid records are included in the stage of further processing and web visualization. Syncing tests, which are launched at regular intervals, are among the main control mechanisms. These tests provide information on whether or not data presented in the end report are consistent with data on the input. Their objective is to detect possible inconsistencies in the input records, such as a patient’s incomplete or incorrectly entered birth certificate number; their permanent address; or distinction between a Czech citizen, a foreigner, or even a person who is homeless. If possible, standardized lists of values (ie, those valid on a nationwide level) are integrated, making it possible to name values in a unified way. As an example, a standardized classification of territorial units in the Czech Republic has been used (Nomenclature of Units for Territorial Statistics) based on unique codes for each of the 14 Czech regions (eg, CZ064 was used as a code for the South Moravian Region). Valid records in the ISID form the basis for the selection of descriptive attributes and their final adaptation into a format in which they are sent by export tools at regular intervals and in a secured manner to the web server.
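The cleaning step described above can be sketched in Python as a completeness check combined with a lookup in a standardized code list. The 14 NUTS 3 region codes are those of the official classification (CZ064 is the South Moravian Region, as in the text); the attribute names in `REQUIRED` and the functions `validate` and `clean` are assumptions made for illustration, not the actual ISID rules.

```python
# NUTS 3 codes of the 14 Czech regions.
CZ_NUTS3 = {
    "CZ010", "CZ020", "CZ031", "CZ032", "CZ041", "CZ042", "CZ051",
    "CZ052", "CZ053", "CZ063", "CZ064", "CZ071", "CZ072", "CZ080",
}

# Hypothetical names of attributes that must be filled in.
REQUIRED = ("date_of_report", "age", "sex", "region")


def validate(record: dict) -> list[str]:
    """Return a list of problems found in one record (empty if valid)."""
    problems = [f"missing {f}" for f in REQUIRED if record.get(f) in (None, "")]
    region = record.get("region")
    if region and region not in CZ_NUTS3:
        problems.append(f"unknown region code {region}")
    return problems


def clean(records: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split records into valid ones and ones that need correction."""
    valid, rejected = [], []
    for record in records:
        (rejected if validate(record) else valid).append(record)
    return valid, rejected


# Hypothetical input: one valid, one incomplete, one with a bad code.
records = [
    {"date_of_report": "2020-03-01", "age": 44, "sex": "F", "region": "CZ064"},
    {"date_of_report": "2020-03-01", "age": 61, "sex": "", "region": "CZ010"},
    {"date_of_report": "2020-03-02", "age": 30, "sex": "M", "region": "CZ999"},
]
valid, rejected = clean(records)
```

Only the first record passes; the other two would be returned for correction rather than forwarded to the web server.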
Deployment diagram. CSV: comma-separated values; cURL: Client URL; ISID: Information System of Infectious Diseases; JSON: JavaScript Object Notation; LAB: laboratories; ODBC: Open Database Connectivity; PDO: PHP Data Objects; RPHA: regional public health authority; SQL: Structured Query Language; SSH: Secure Shell.
In this step, detailed static instructions were prepared for each report (graph, map, table), always specifying its data source, computational algorithm, description of visualization, explanation of its meaning, and information on the last update. Afterward, the entire process of data processing and dynamic drawing of the online report was fully automated. All predefined data sets are transferred into an internal data storage, which is dedicated to epidemiological analyses and reporting; individual overviews are subsequently generated in the form of interactive graphs, maps, and data tables. The resulting presentation is a set of interactive graph visualizations and tabular outputs, which can be viewed online by anyone using only a standard web browser. The portal has been developed using the Symfony 4.4 PHP framework. Graphical outputs are rendered by the NVD3 JavaScript library, which is based on d3.js. Graphs are slightly adjusted to meet the report’s needs, mainly with regard to responsive rendering on contemporary display devices. Tabular data are processed using the DataTables library, which internally employs jQuery (The jQuery Team). The data layer mainly consists of preprocessed and saved files in the JavaScript Object Notation (JSON) format. This approach significantly accelerates access to data and, in turn, the app’s response. On each update, data are transformed straight away and the source files are replaced by new ones. The number of computational operations carried out at the app layer is kept to a minimum. The presentation layer deals with the graphical appearance of the user interface with control elements as well as the interactivity of the visualization. All reports prepared in this way are first implemented on development servers, where functionality and data correctness are thoroughly verified.
Development servers are also used to check syncing with external data sources. Individual versions are subsequently published, always within the logical structure of the web app, which is divided into thematic sections. The app itself has been designed to be responsive and is fully supported on all types of devices (mobile, tablet, and desktop).
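The idea of a data layer made of preprocessed JSON files that are replaced on each update can be sketched as follows. The portal itself is written in PHP; Python is used here only for illustration. The function `write_daily_counts`, the field names, and the write-then-rename step are assumptions (the text states only that source files are replaced by new ones).

```python
import json
import os
import tempfile
from collections import Counter
from pathlib import Path


def write_daily_counts(records: list[dict], target: Path) -> None:
    """Precompute daily case counts and replace the JSON file in one step."""
    counts = Counter(r["date_of_report"] for r in records)  # hypothetical field
    payload = [{"date": d, "new_cases": n} for d, n in sorted(counts.items())]
    # Write to a temporary file in the same directory, then swap it in,
    # so that a reader never sees a half-written file (an assumption of
    # this sketch, not a documented feature of the portal).
    fd, tmp_name = tempfile.mkstemp(dir=str(target.parent), suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as tmp:
        json.dump(payload, tmp)
    os.replace(tmp_name, target)
```

Because the aggregation happens once per update, the app layer only has to serve a small static file, which is consistent with keeping computation at that layer to a minimum.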
It is extremely important to validate the results before their publication, particularly when providing information about the spread of an epidemic. Despite several optimization and transformation processes performed in the preceding stages, data must remain consistent and all reports must correspond with the original input as well as with calculations carried out to check agreement with primary data. User testing not only revealed some inconsistencies in original values obtained by static calculations, but also brought new ideas on how to improve the intelligibility, technical implementation, and user interface. Experts on web accessibility were also consulted. The target group was taken into consideration as well, as it includes, among others, users with disabilities, for whom navigation on the internet might be difficult. A correct structure of headings, adequate contrast between text and background, overall legibility, properly defined alternative text for images, and the availability of complementary tabular overviews next to graphs and maps are the most important accessibility attributes that have been thoroughly applied. One must keep in mind that web apps guaranteed by the state administration are required to provide maximum accessibility for all citizens without exception.
Primary data are entered into the ISID by LAB and RPHAs. After their validation and the unification of data formats, two data exports are created, and these are periodically sent from the ISID to a location from where they can be finally used for the purposes of the web portal. The periodicity of updates is set to 2 hours, with a nighttime shutdown between 2 am and 6 am. The cron tool launches the updating script on the side of the “Disease at the Moment” (“Onemocnění aktuálně” in Czech) app at predefined times, which results in the transformation of provided data into preprocessed data sets. The previously mentioned syncing tests are launched during this process as well. The entire process is launched on the production instance of the “Disease at the Moment” app, as well as on the development instance and the stage instance. The latter two instances are automated by Jenkins, which is a continuous integration tool. In case of any error in data assembly or in the update of the entire app, an email notification is sent to the development team, allowing it to react appropriately. At the time of writing this paper (April 2020), primary data are updated three times a day on the production instance (12:30 am, 8:30 am, and 5:30 pm), and secondary data are usually updated once a day.
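A syncing test of the kind mentioned above can be sketched as a comparison between the input records and the preprocessed report. This is a minimal Python illustration under stated assumptions: the function `sync_errors` and the field names are hypothetical, and the notification step is only indicated in a comment because the text does not describe how emails are sent.

```python
from collections import Counter


def sync_errors(input_records: list[dict], report: list[dict]) -> list[str]:
    """Compare per-day counts in the end report with the input records."""
    expected = Counter(r["date_of_report"] for r in input_records)
    reported = {row["date"]: row["new_cases"] for row in report}
    errors = []
    for day in sorted(set(expected) | set(reported)):
        if expected.get(day, 0) != reported.get(day, 0):
            errors.append(
                f"{day}: input={expected.get(day, 0)} report={reported.get(day, 0)}"
            )
    return errors


# Hypothetical data: the report misses one case on the second day.
input_records = [
    {"date_of_report": "2020-03-01"},
    {"date_of_report": "2020-03-02"},
    {"date_of_report": "2020-03-02"},
]
report = [
    {"date": "2020-03-01", "new_cases": 1},
    {"date": "2020-03-02", "new_cases": 1},
]
errors = sync_errors(input_records, report)
# A non-empty list would be the trigger for notifying the development team.
```

Running such a check inside the scheduled update script means that an inconsistent report is detected at the same moment the data are refreshed.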
Science is built on data, namely their collection, analysis, publication, reanalysis, critique, and reuse. Barriers include inability to access data, restrictions on usage applied by publishers or data providers, and publication of data that are difficult to reuse, for example, because they are poorly annotated or “hidden” in unmodifiable tables like PDF documents [
Monitoring and further analysis of web usage is one of the crucial points in terms of reflecting users’ behavior and requirements. Systematic tracking and web analysis significantly improve efficiency and quality during the long-term design and development of robust web apps. Google Analytics is a third-party service that measures and generates up-to-date statistics, reports, and analyses based on website traffic and on the behavior of its visitors. It tracks the visitors’ activity, collects statistical data in real time, and stores them for later analysis. Using different types of metrics, we can easily determine the number of visitors over any period, which pages they viewed, and how long their visit lasted. The flow of visitors is an important factor, showing user transitions between pages and the rate of abandonment of every single page [
On March 11, 2020, the first version of the web portal was released [
Because published data are based on several independent sources, updates must be performed at different times. Nevertheless, fixed time intervals have been set at which updated values are presented on the web. The latest numbers summing up the incidence of COVID-19 in the Czech Republic are updated 3 times a day: at 12:30 am, at 8:30 am, and at 5:30 pm. The overviews showing total numbers from the previous day are prepared each morning at 8:30 am, and the latest numbers of recovered persons and COVID-19-related deaths are published at 8:30 am and at 5:30 pm.
In the stages of design and development of this web app, particular emphasis was placed on the character of presented data, which might be viewed by users anywhere and anytime. That was one of the reasons why responsive web design was among the main requirements, together with the overall optimization for mobile devices. The majority of line graphs and bar charts, therefore, primarily display an overview from the last 14 days, which is always adjusted even to small screens of mobile phones without the user having to manipulate the graph in any way. An additional complex view of the entire period of epidemic follow-up means that users of smaller devices may need to manipulate the graph if they want to display all values (
Daily trend in the percentage of persons with confirmed coronavirus disease in the overall number of persons tested on a given day.
Overall (cumulative) number of persons with laboratory-confirmed coronavirus disease according to reports by regional public health authorities and laboratories.
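The two quantities named in the captions above, together with the default 14-day view, reduce to simple arithmetic on daily series. The following Python sketch shows one way to compute them; the function names and the sample numbers are illustrative assumptions, and rounding to two decimal places is a choice made here, not one stated in the text.

```python
from itertools import accumulate
from typing import Optional

WINDOW_DAYS = 14  # default window shown in most graphs, per the text


def daily_positive_share(
    confirmed: list[int], tested: list[int]
) -> list[Optional[float]]:
    """Percentage of confirmed cases among persons tested on each day."""
    return [
        round(100 * c / t, 2) if t else None  # no tests: share is undefined
        for c, t in zip(confirmed, tested)
    ]


def cumulative(confirmed: list[int]) -> list[int]:
    """Running total of confirmed cases."""
    return list(accumulate(confirmed))


def last_window(series: list, days: int = WINDOW_DAYS) -> list:
    """Keep only the most recent `days` values for the compact mobile view."""
    return series[-days:]


# Hypothetical daily values.
confirmed = [3, 5, 2]
tested = [100, 200, 0]
shares = daily_positive_share(confirmed, tested)
totals = cumulative(confirmed)
```

The full series would feed the complex view of the entire follow-up period, while `last_window` yields the subset that fits a small screen without user interaction.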
Open data sets intended for further processing are an integral part of this information website. Source data in .CSV (comma-separated values) and .JSON formats are published daily at regular intervals, and can be downloaded by anyone who wants to process them either by a computer or manually. Data set schemas are also included, describing the structure of records. Normalized schemas contain the list and definitions of all descriptive attributes as well as the technological representation of the data schema. All data sets published in this way are linked to records in the National Catalogue of Open Data (provided by the Ministry of Health of the Czech Republic), which is administered by the IHIS CR.
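Publishing the same data set in both CSV and JSON can be sketched with the Python standard library alone. The attribute list `FIELDS` stands in for a data set schema and is hypothetical, as are the function names; the actual published schemas are not reproduced here.

```python
import csv
import io
import json

# Hypothetical schema: the ordered list of attributes in every record.
FIELDS = ["date", "new_cases"]


def to_csv(rows: list[dict]) -> str:
    """Serialize rows as CSV with a header line that follows the schema."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def to_json(rows: list[dict]) -> str:
    """Serialize the same rows as a JSON array."""
    return json.dumps(rows, ensure_ascii=False)


rows = [{"date": "2020-03-01", "new_cases": 3}]
csv_text = to_csv(rows)
json_text = to_json(rows)
```

Deriving both formats from one list of rows and one field list keeps the two downloads consistent with each other and with the schema description.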
The Google Analytics component is used to monitor and to analyze the users’ behavior on the website. The objective of the development team was to monitor all basic characteristics such as the number of sessions, page views, ways of user acquisition (direct, organic, referral), type of device used (mobile, tablet, desktop), display resolution, and web browser. An overview of sessions on a given day and time makes it possible for the development team to plan timely releases of new versions of the portal and to update information. The website was officially launched on March 11, 2020, and immediately became the primary communication channel employed by the health care sector to present the current situation regarding the COVID-19 epidemic. In the period between March 11, 2020, and April 12, 2020, the web portal registered 13,634,325 sessions and 19,662,351 page views. Returning visitors accounted for 16,286,551 (more than 80%) of users. This trend can also be confirmed by the graph showing how visitors landed on the website. A total of 7,158,020 (more than 50%) of visits were direct (ie, the website URL was typed into a browser) or were the result of an organic search (ie, users employed search engines such as Seznam.cz or Google.com). From the beginning, the development team designed the website as mobile-first. With regard to the portal contents, we assumed that its visitors would want to see the information fast and at any time. A total of 8,248,766 (more than 60%) of visitors used a mobile device or a tablet. A focus on graphs being drawn on displays with a certain resolution was no less important. The most usual width of mobile devices was 360 px, which comfortably accommodates a summary graph containing information from the last 14 days. Despite the clear preference for mobile devices, we also had to bear in mind that all graphs had to be well displayed in desktop browsers.
Although Chrome was the most frequently used browser (8,112,423 of users, almost 60%), we also had to consider that some users prefer Internet Explorer (version 11.0 or lower), which accounted for more than 160,000 sessions. We used the Google Data Studio tool to create a dashboard that presents all needed information (see
The web-based app introducing an overview of the current spread of COVID-19 in the Czech Republic has been designed, developed, and implemented in accordance with the CRISP-DM methodology. All interactive graphs, maps, and tables fully respect strict rules of data management in the health care sector, where data reporting, collection, processing, validation, analysis, evaluation, and final publishing are under the supervision of the Ministry of Health of the Czech Republic. The online interactive overview of the current spread of COVID-19 in the Czech Republic [
Diagram of the cross-industry standard process for data mining reference model.
Selected visualization of coronavirus disease epidemiology in the Czech Republic. A: Incidence of confirmed cases of coronavirus disease in the Czech Republic by region per 100,000 people. B: Overall number of persons who tested positive for coronavirus disease in the Czech Republic. C: Daily overview of the number of persons with newly confirmed coronavirus disease over the last 2 weeks. D: Overall number of persons with laboratory-confirmed coronavirus disease by age group.
Overview of monitored statistics regarding the number of visits (period: March 11-April 12, 2020).
COVID-19: coronavirus disease
CRISP-DM: cross-industry standard process for data mining
CSV: comma-separated values
IHIS CR: Institute of Health Information and Statistics of the Czech Republic
ISID: Information System of Infectious Diseases
JSON: JavaScript Object Notation
LAB: laboratories
NHIS: National Health Information System
PPE: personal protective equipment
RPHA: regional public health authority
SARS-CoV-2: severe acute respiratory syndrome coronavirus 2
This app was directly supported by the leadership of the Ministry of Health of the Czech Republic and the Institute of Health Information and Statistics of the Czech Republic. In particular, we would like to thank all regional public health authorities, laboratories, and hospitals for their enormous commitment during the COVID-19 epidemic.
None declared.