This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research, is properly cited. The complete bibliographic information, a link to the original publication on https://www.jmir.org/, as well as this copyright and license information must be included.
Survival analysis is a cornerstone of medical research, enabling the assessment of clinical outcomes for disease progression and treatment efficacy. Despite its central importance, no commonly used spreadsheet software can handle survival analysis and there is no web server available for its computation.
Here, we introduce a web-based tool capable of performing univariate and multivariate Cox proportional hazards survival analysis using data generated by genomic, transcriptomic, proteomic, or metabolomic studies.
We implemented different methods to establish cut-off values for the trichotomization or dichotomization of continuous data. The false discovery rate is computed to correct for multiple hypothesis testing. A multivariate analysis option enables comparing omics data with clinical variables.
We established a registration-free web-based survival analysis tool capable of performing univariate and multivariate survival analysis using any custom-generated data.
This tool fills a gap and will be an invaluable contribution to basic medical and clinical research.
Bioinformatic programs include databases, algorithms, services, and software tools. These not only span a wide range of utility but have also gained increased value in scientific research in recent years; approximately 80% of papers published in biology and 60% of papers published in medicine report the use of at least one bioinformatic tool [
The assessment of survival following the onset of a disease or of a treatment is a fundamental analysis in medical research. In an optimal scenario, the differential survival of two cohorts can be compared by employing a simple Mann-Whitney test. However, survival times do not follow a normal distribution and it is common for numerous subjects to lack associated event data at the end of follow-up. Kaplan and Meier [
Despite its widespread use, there is no online tool available for survival analysis. Therefore, it is necessary to acquire specialized software packages, as none of the general office packages (OpenOffice, LibreOffice, MS Office) is suitable for analyzing follow-up data. We previously established an online platform capable of linking survival outcome in various cancer types to mRNA [
The website runs on an Apache 2.4 web server hosted on a Linux-based machine. The user interface is written in PHP 7 and JavaScript using jQuery. The backend is written in PHP 7 and R, and the repository layer is built on a PostgreSQL 12 database, which temporarily stores the uploaded data and the generated results. The analysis platform is accessible via any standard browser (Firefox, Edge, Chrome, Safari).
Multiple R packages are used for the statistical computations and for generating the output graphs. The
When comparing two cohorts, the significance is computed using the Cox-Mantel (log rank) test [
Kaplan-Meier curves showing main concepts used in survival analysis, including the (A) hazard rate (high/low) and (B) median survival. The green arrow shows the visually determined median survival and the blue arrow shows the survival probability at 50 months.
The generated results also include the median survival time, which is the time at which the survival probability in a cohort reaches 0.5. The median time can also be determined visually by drawing a horizontal line from the selected probability on the Y axis to the curve and then a vertical line from that point down to the X axis. Of note, performing the steps backward can determine the cumulative probability of survival at a given time point (
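The reading of the median survival and of the survival probability at a given time point can be illustrated with a minimal sketch of the Kaplan-Meier estimator. The function names are hypothetical and the code is not taken from the platform; the follow-up times and events are those of the sample input file, pooled into a single cohort purely for illustration.

```python
def kaplan_meier(times, events):
    """Return [(time, survival_probability)] at each distinct event time."""
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all subjects that share the same follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            survival *= 1.0 - deaths / n_at_risk
            curve.append((t, survival))
        # Both events and censored subjects leave the risk set.
        n_at_risk -= removed
    return curve


def median_survival(curve):
    """First time at which the estimated survival drops to 0.5 or below."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None  # median not reached during follow-up


def survival_at(curve, time_point):
    """Estimated survival probability at a given time point."""
    prob = 1.0
    for t, s in curve:
        if t > time_point:
            break
        prob = s
    return prob


times = [95, 66, 70, 26, 13, 67, 96, 67, 95, 1]
events = [1, 0, 0, 1, 0, 0, 1, 0, 1, 1]
curve = kaplan_meier(times, events)
```

With these ten samples, the estimated curve stays above 0.5 until month 95, so `median_survival(curve)` returns 95, and `survival_at(curve, 50)` gives the probability read off the curve at 50 months.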
To enable visualization in the Kaplan-Meier plot, it is necessary to establish a cut-off value and assign the samples to one of two cohorts. We implemented three different options for this task: using a predefined quantile (including the median, upper, and lower quartiles), trichotomizing the data (ie, assigning the data into three cohorts and then omitting the middle cohort), and using the best available cut-off value.
To find the best cutoff, we iterate over the input variable values from the lower quartile to the upper quartile and compute the Cox regression [
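The scan over candidate cut-off values can be sketched as follows. This is an illustration under stated assumptions rather than the platform's code: a two-cohort log-rank test is swapped in for the Cox regression to keep the example dependency-free, the quartiles use linear interpolation, and all function names are hypothetical. The data are the Gene_1 column and the survival columns of the sample input file.

```python
import math


def logrank_p(times, events, groups):
    """Two-cohort log-rank test (groups coded 0/1); returns the p value."""
    observed_minus_expected = 0.0
    variance = 0.0
    for t in sorted({x for x, e in zip(times, events) if e == 1}):
        at_risk = [g for x, g in zip(times, groups) if x >= t]
        n, n1 = len(at_risk), sum(at_risk)
        died = [g for x, e, g in zip(times, events, groups) if x == t and e == 1]
        d, d1 = len(died), sum(died)
        observed_minus_expected += d1 - d * n1 / n
        if n > 1:
            variance += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    if variance == 0:
        return 1.0
    chi2 = observed_minus_expected ** 2 / variance
    # Survival function of the chi-square distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2.0))


def quantile(sorted_values, q):
    """Quantile by linear interpolation (an assumed convention)."""
    pos = q * (len(sorted_values) - 1)
    low = int(pos)
    high = min(low + 1, len(sorted_values) - 1)
    return sorted_values[low] + (pos - low) * (sorted_values[high] - sorted_values[low])


def scan_cutoffs(values, times, events):
    """Test every observed value between the lower and upper quartile."""
    ordered = sorted(values)
    q1, q3 = quantile(ordered, 0.25), quantile(ordered, 0.75)
    results = []
    for cutoff in sorted({v for v in values if q1 <= v <= q3}):
        groups = [1 if v > cutoff else 0 for v in values]
        if 0 < sum(groups) < len(groups):  # both cohorts must be non-empty
            results.append((cutoff, logrank_p(times, events, groups)))
    return results


times = [95, 66, 70, 26, 13, 67, 96, 67, 95, 1]
events = [1, 0, 0, 1, 0, 0, 1, 0, 1, 1]
gene_1 = [1441, 3064, 2529, 19, 3573, 2977, 2777, 4606, 1209, 1894]
scan = scan_cutoffs(gene_1, times, events)
best = min(scan, key=lambda item: item[1])  # cutoff with the lowest p value
```

The list `scan` holds one (cutoff, p value) pair per candidate, which is also the information needed to draw a cut-off plot.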
A cut-off plot can be used to visualize the correlation between the used cut-off values and the achieved
During the computation of multiple cut-off values, multiple hypotheses are generated. Therefore, the false discovery rate (FDR) is computed by default in this setting using the Benjamini-Hochberg method [
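The Benjamini-Hochberg adjustment can be sketched in a few lines. The function name and the example p values are illustrative only and do not come from the platform.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down so that the adjusted values
    # stay monotone (each one is capped by the next larger rank).
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical p values, e.g. one per tested cut-off value.
fdr = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

Each adjusted value is the smallest FDR level at which the corresponding test would still be called significant.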
A requirement for Cox regression is that the hazard ratio between the cohorts is independent of time (the proportional hazards assumption). To fulfill this requirement, the censoring should be independent of the prognosis, samples entering at different time points in the analysis should have the same prognosis, and the time should be measured as a continuous variable (not in bins). We employed the coxph function of the
In some cases, one might want to compare clinical and genomic variables. To enable this, clinical data can be selected not only as filters but also as variables to be included in the multivariate analysis. In these analyses, the “Results” page displays the
We implemented multiple options to simultaneously use and combine multiple variables. Each of these settings uses the original variable values as input and basic mathematical functions to calculate the new joint values.
The simplest option is to select multiple variables and then use each variable separately. In this case, the same analysis is performed for each selected marker using the exact same filtering settings. This option is identical to running the analysis for each variable consecutively.
In the second feature, one can use the mean expression of a panel of variables; in this case, any variable can be inverted and a weight can be added to each. Using the mean expression of a set of genes can be termed a “signature analysis,” as the expression of each included variable will influence the value of the final “composite variable.” This feature can also be used to validate previously published gene expression signatures utilizing a preselected panel of genes.
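A composite "signature" value of this kind can be sketched as a weighted mean. How the platform inverts a variable and applies the weights is not specified here, so the sketch assumes that inversion means negating the value and that the weighted sum is divided by the sum of the weights; the function name is hypothetical.

```python
def signature_score(sample, genes, weights=None, inverted=()):
    """Weighted mean of the selected variables for one sample.

    Assumptions: an inverted variable is negated, and a missing weight
    defaults to 1.
    """
    weights = weights or {}
    total = 0.0
    weight_sum = 0.0
    for gene in genes:
        weight = weights.get(gene, 1.0)
        value = -sample[gene] if gene in inverted else sample[gene]
        total += weight * value
        weight_sum += weight
    return total / weight_sum


# Sample 1 of the sample input file.
sample_1 = {"Gene_1": 1441, "ABC123": 4474, "DE45": 1.13}
plain = signature_score(sample_1, ["Gene_1", "ABC123"])
weighted = signature_score(sample_1, ["Gene_1", "ABC123"], weights={"Gene_1": 3})
flipped = signature_score(sample_1, ["Gene_1", "ABC123"], inverted={"ABC123"})
```

The resulting value per sample then takes the place of a single variable in the survival analysis.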
A third option is utilization of the ratio of two genes; in this case, one variable is used as the numerator, the other variable is used as the denominator, and a new value is computed for each sample. This setting is useful when one wants to compare the expression values to a reference gene such as
The fourth option enables the stratification of all patients based on the median expression level of a selected variable and then the use of another variable in the high or low cohort only. This enables the investigation of a selected variable in an already stratified cohort and ultimately the setup of a decision tree–like classification for the investigated cohort.
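The third and fourth options can be illustrated with a short sketch using the first four rows of the sample input file. The function names are hypothetical, and assigning samples equal to the median to the low cohort is an assumption, not a documented behavior of the platform.

```python
import statistics


def ratio_variable(samples, numerator, denominator):
    """New per-sample value: one variable divided by another."""
    return [s[numerator] / s[denominator] for s in samples]


def stratify_by_median(samples, stratifier, cohort="high"):
    """Keep only the samples above (or at/below) the stratifier's median."""
    median = statistics.median([s[stratifier] for s in samples])
    if cohort == "high":
        return [s for s in samples if s[stratifier] > median]
    return [s for s in samples if s[stratifier] <= median]


samples = [
    {"id": "Sample 1", "Gene_1": 1441, "ABC123": 4474},
    {"id": "Sample 2", "Gene_1": 3064, "ABC123": 421},
    {"id": "Sample 3", "Gene_1": 2529, "ABC123": 2974},
    {"id": "Sample 4", "Gene_1": 19, "ABC123": 3346},
]
ratios = ratio_variable(samples, "ABC123", "Gene_1")
high_gene_1 = stratify_by_median(samples, "Gene_1", cohort="high")
```

A second variable such as ABC123 can then be analyzed within `high_gene_1` only, which mirrors the first branch of a decision tree.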
In each of the settings where multiple variables are combined, a new value based on the equation is generated for each sample, which is then used when performing the survival analysis, including the cut-off selection. Of note, one might want to directly compare two or more selected continuous variables to each other. For this purpose, we implemented an option to compute Spearman and Pearson correlation coefficients between the variables using the
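For orientation, the two correlation coefficients can be computed as in the following dependency-free sketch. It is not the routine used by the platform; the function names and the toy data are illustrative.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy data: a monotone but nonlinear relationship.
x = [1, 2, 3, 4]
y = [1, 4, 9, 16]
r_pearson = pearson(x, y)
r_spearman = spearman(x, y)
```

For a monotone but nonlinear relationship such as this one, the Spearman coefficient reaches 1 while the Pearson coefficient stays slightly below it.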
We established an online survival analysis platform that accepts a user-generated tab-separated or semicolon-separated file as input. The table headers are case-insensitive and can include letters of the English alphabet, numbers, spaces, underscores, colons, round brackets, and exclamation marks. The content within the table cells can be numeric or text values. Some columns can be used as filters, and a maximum of three filters are allowed.
Quick start guide for setting up an input file.
Header name | Sample ID | Survival time | Survival event | Filter | Gene |
Automatically recognized | Yes | Yes | Yes | Yes | No |
Maximal number of different values | No limit | No limit | 2 (0 or 1) | 10 | No limit |
Can be text | Yes | No | No | Yes | No |
Can be binary | No | No | Yes | Yes | Yes |
Can be continuous | | Yes | No | No | Yes |
Sample input file.
Sample ID | Survival time | Survival event | Filter_A | Filter_B | Filter_C | Gene_1 | ABC123 | DE45 |
Sample 1 | 95 | 1 | 2 | 2 | 3 | 1441 | 4474 | 1.13 |
Sample 2 | 66 | 0 | 3 | 3 | 3 | 3064 | 421 | 2.395 |
Sample 3 | 70 | 0 | 3 | 1 | 1 | 2529 | 2974 | 1.363 |
Sample 4 | 26 | 1 | 3 | 1 | 3 | 19 | 3346 | 4.818 |
Sample 5 | 13 | 0 | 1 | 2 | 3 | 3573 | 1244 | 2.058 |
Sample 6 | 67 | 0 | 2 | 3 | 2 | 2977 | 962 | 4.431 |
Sample 7 | 96 | 1 | 3 | 3 | 3 | 2777 | 4367 | 2.015 |
Sample 8 | 67 | 0 | 3 | 3 | 1 | 4606 | 4190 | 1.05 |
Sample 9 | 95 | 1 | 3 | 1 | 2 | 1209 | 3930 | 1.980 |
Sample 10 | 1 | 1 | 2 | 3 | 2 | 1894 | 4897 | 4.073 |
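Before uploading, a table can be checked locally against the format described above. The following sketch is an assumption-laden illustration and does not reproduce the platform's own validation: the function name is hypothetical, the survival columns are matched by the header names used in the sample file, and the size limits are applied to the data rows.

```python
import csv
import io
import re

# Limits and allowed header characters as stated for the platform; how
# they are enforced server-side is not known, so this is an assumption.
MAX_COLUMNS = 100
MAX_ROWS = 8000
HEADER_PATTERN = re.compile(r"^[A-Za-z0-9 _:()!]+$")


def load_survival_table(text):
    """Parse a tab- or semicolon-separated table into a list of row dicts."""
    lines = text.strip().splitlines()
    delimiter = "\t" if "\t" in lines[0] else ";"
    rows = list(csv.reader(io.StringIO("\n".join(lines)), delimiter=delimiter))
    header = [name.strip() for name in rows[0]]
    body = rows[1:]

    if len(header) > MAX_COLUMNS or len(body) > MAX_ROWS:
        raise ValueError("table exceeds 100 columns or 8000 rows")
    for name in header:
        if not HEADER_PATTERN.match(name):
            raise ValueError(f"unsupported character in header: {name!r}")

    # Compare lowercased names so that header matching ignores case.
    lowered = [name.lower() for name in header]
    time_col = lowered.index("survival time")
    event_col = lowered.index("survival event")

    records = []
    for row in body:
        if row[event_col].strip() not in ("0", "1"):
            raise ValueError("survival event must be coded 0 or 1")
        record = dict(zip(header, (cell.strip() for cell in row)))
        record[header[time_col]] = float(row[time_col])
        record[header[event_col]] = int(row[event_col])
        records.append(record)
    return records


example = (
    "Sample ID;Survival time;Survival event;Filter_A;Gene_1\n"
    "Sample 1;95;1;2;1441\n"
    "Sample 2;66;0;3;3064\n"
)
records = load_survival_table(example)
```

A `ValueError` from this check points to the kind of formatting problem that would otherwise only surface after the upload.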
Currently, genomics, transcriptomics, proteomics, and metabolomics enable the simultaneous investigation of multiple markers related to patient prognosis in experimental and clinical studies. Multiple online tools make survival analysis possible using previously published datasets such as those employing data from The Cancer Genome Atlas [
A major advantage of our platform is the inclusion of multiple choices to select a cut-off value to be used in the analysis. To generate a Kaplan-Meier plot, one must first determine a cutoff; a convenient and widespread option for this task is the median expression value [
The analysis automatically checks the proportional hazards assumption to evaluate the independence from time. This can also be assessed by a simple visual inspection of the graph: if the two cohorts appear to differ significantly but their curves cross at multiple time points, then the hazard ratio is clearly not independent of time [
When interpreting the results, one has to be aware of some common caveats of survival analysis. First, the
A second important caveat is the proportion of recorded events within a study. As only the actual events contribute to the drops in survival curves, it is not possible to perform a meaningful survival analysis when the number of events is very low. A low event count not only prevents the computation of the median (or upper quartile) survival but can also lead to an infinite HR when all events accidentally fall into one of the cohorts. For example, The Cancer Genome Atlas Network published a breast cancer dataset with approximately 1000 patient samples [
We also have to discuss some limitations of the software. The input file has to be carefully formatted, and a maximum of 100 columns and 8000 rows are allowed. Only full columns are acceptable as variables, a maximum of three filters can be defined, and the survival event can only be coded “0” or “1.” Although these restrictions can make the setup of the analysis challenging, when a correctly formatted table is uploaded, the system can automatically recognize columns representing a survival event or survival time. A second limitation is the exclusive use of the Cox proportional-hazards model to compute significance, and other tests such as the Cochran-Mantel-Haenszel test [
In summary, we established an online survival analysis tool capable of performing univariate and multivariate survival analysis using any custom-generated data. We believe that this registration-free online platform simultaneously integrating multiple different analysis and quality-control options will be a valuable tool for biomedical researchers.
FDR: false discovery rate
HR: hazard rate
The research was financed by the 2018-2.1.17-TET-KR-00001, 2020-1.1.6-JÖVŐ-2021-00013 and 2018-1.3.1-VKE-2018-00032 grants and by the Higher Education Institutional Excellence Program (2020-4.1.1.-TKP2020) of the Ministry for Innovation and Technology (MIT) in Hungary, within the framework of the Bionic thematic program of the Semmelweis University. The authors wish to acknowledge the support of ELIXIR Hungary.
None declared.