Insights from Structured SARS-2 Diagnostics Data

A. James Phillips

doi:10.21428/9610ddb2.116a6b09

Version 1.0

This is the third document in our series exploring SARS-CoV-2 diagnostics. Our first exploratory documents [1][2] were produced in mid-April, after a significant increase in the number of new diagnostics being issued Emergency Use Authorisations (EUAs) [3]. In the intervening 6 months many new EUAs have been released, and several similar projects have extracted structured data from EUAs to help make sense of the evolving availability and quality of diagnostics [4][5][6].

As we previously reported, the information disclosed in EUAs is very heterogeneous [1], both in what data is or is not included and in the quality of that data. This document has three aims. First, it quantifies and makes clear some of the differences between EUAs. Second, it explains why those differences may matter, from both a public health perspective and an efficient markets perspective, in assisting or hampering an effective response to this and future infectious diseases. Finally, it invites suggestions of uses and additional dimensions for which structured EUA data could be valuable.

Information Heterogeneity

Despite the FDA having provided templates for EUAs which list all the fields of expected data, there remains a large difference in the quality and disclosure of important diagnostic test characteristics and performance data.

For example, the FDA template document for in vitro diagnostics includes a request to “include the nucleic acid sequences for all primers and probes used in the test” [7]. As the SARS-CoV-2 virus mutates, this data is important: it lets clinicians and public health officials appraise how the performance of the tests they use may change, given sequencing information on the new mutations they encounter. This is particularly pressing because the probability of mutations that evade existing diagnostics increases as the United States enters the winter season and the number of infections continues to rise. Disclosing this information enables the test to be used correctly, maximising patient clinical care and the effectiveness of public health campaigns.

As of mid-August 2020, approximately 34% of EUAs include sequence information, dropping to approximately 5.8% of the EUAs for the most frequently used tests.

There is an important caveat with the sequence data: proving its absence is, anecdotally, more error-prone than for other dimensions of data (see the Methods section towards the end of this article). If this particular field describing diagnostic tests were submitted in a structured data format, this error-prone and expensive search would be avoidable.
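To illustrate why a structured field would make this search unnecessary, here is a minimal Python sketch. The record layout, the field name `primer_probe_sequences`, and the example entries are all hypothetical; they are not taken from any EUA or from our data set. The point is only that disclosure becomes a field lookup, and that "not stated" can be kept distinct from "stated as absent".

```python
from typing import List, Optional

# Hypothetical structured records; test names and sequences are placeholders.
RECORDS = [
    {"test": "Example Test A", "primer_probe_sequences": ["ACGTACGTACGTACGTACGT"]},
    {"test": "Example Test B", "primer_probe_sequences": []},    # declared: none disclosed
    {"test": "Example Test C", "primer_probe_sequences": None},  # field not filled in
]


def discloses_sequences(record: dict) -> Optional[bool]:
    """Return True/False when the field is filled in, None when it is missing."""
    sequences = record.get("primer_probe_sequences")
    if sequences is None:
        return None
    return len(sequences) > 0


def disclosure_rate(records: List[dict]) -> Optional[float]:
    """Share of records that disclose sequences, among records with a known answer."""
    known = [d for d in map(discloses_sequences, records) if d is not None]
    if not known:
        return None
    return sum(known) / len(known)


if __name__ == "__main__":
    print(disclosure_rate(RECORDS))
```

Keeping `None` separate from an empty list is a deliberate choice in this sketch: it avoids counting an unfilled field as a confirmed absence, which is the ambiguity the manual search suffers from.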

Another important metric for a diagnostic's performance is its analytical sensitivity, represented by its limit of detection (LOD) for specific samples. At present the lack of a molecular standard makes comparison of the values reported in EUAs prone to significant error. Even when a standard material becomes available, using it to reproduce the LOD of every diagnostic EUA will not be possible, because some manufacturers have chosen to report LOD in non-SI units. TCID50 ml^-1 is the most frequent unit of LOD among the top 10 tests used in the United States (by 100 CLIA labs in AMP’s August survey). TCID50 is the Median Tissue Culture Infectious Dose. Unlike PFU (Plaque Forming Units) or genome copies, TCID50 depends on many experimental variables that have not been disclosed in the EUAs, such as the cell line used, the incubation conditions, and how long the culture was left before measurement. This renders these values incomparable and irreproducible.
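As a rough illustration of how structured LOD data could flag this problem automatically, the Python sketch below separates reports whose units depend on undisclosed culture conditions from the rest. The class name, the unit spellings, and the example values are assumptions made for illustration; they are not figures from any EUA.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Assumption: TCID50 per ml is the only culture-dependent unit we flag here.
CULTURE_DEPENDENT_UNITS = {"tcid50/ml"}


@dataclass
class LodReport:
    test_name: str  # hypothetical test label
    value: float    # reported LOD value
    unit: str       # unit string as written in the EUA


def normalise_unit(unit: str) -> str:
    """Lower-case the unit, drop spaces, and rewrite 'ml^-1' as '/ml'."""
    return unit.lower().replace(" ", "").replace("ml^-1", "/ml")


def split_by_comparability(
    reports: List[LodReport],
) -> Tuple[List[LodReport], List[LodReport]]:
    """Return (comparable, not_comparable) lists of reports."""
    comparable, not_comparable = [], []
    for report in reports:
        if normalise_unit(report.unit) in CULTURE_DEPENDENT_UNITS:
            not_comparable.append(report)
        else:
            comparable.append(report)
    return comparable, not_comparable


if __name__ == "__main__":
    # Made-up values, for illustration only.
    reports = [
        LodReport("Example Test A", 100.0, "copies/mL"),
        LodReport("Example Test B", 0.01, "TCID50 ml^-1"),
    ]
    comparable, not_comparable = split_by_comparability(reports)
    print([r.test_name for r in comparable])
    print([r.test_name for r in not_comparable])
```

Values in the comparable group may still need the missing molecular standard before they can be ranked against each other; the sketch only removes the reports that could not be compared even then.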

Uniform Information Disclosure

In contrast to primer / probe sequence disclosure and the choice of LOD units, the LOD values themselves are reported in almost 100% of the EUAs analysed. The materials used to represent the virus and measure the LOD were also disclosed in the vast majority of EUAs; for example, the majority used unpackaged naked RNA to perform their LOD measurements. Although naked RNA has various limitations and important interactions with the clinical matrix used, declaring its use in EUAs allows for a more robust understanding and reproducibility of diagnostic test LOD performance.

Open Structured Diagnostic Data

New tests continue to be published at a roughly constant rate of approximately 1 per day [5]. This steady growth places an increasing information burden on all parties looking to install new tests, validate their performance, ensure they maintain their claimed performance, and get tests performed for their communities, businesses or families. As the data becomes increasingly structured, we envision that it will help with these use cases, as well as permit clinicians, researchers, public health officials, and policy makers to better conduct their work.

Below is the current structured EUA data set pulled from the open source code and data repository we have developed to support this work.

Methods

EUA primer / probe sequence data search protocol

In the ‘Prevalence of Primer & Probe Sequences’ section above, “Not specified” is a probable but not certain classification: even a meticulous search of up to 100 pages of each EUA leaves the possibility of missing the sequence information. The protocol was first to look for raw sequences by searching for “tg”, “cg”, “gg”, “gc”, “gt”, and “aa” within the EUA. If this found no raw sequence information, a search for “sequence” was conducted to focus the manual parsing on a subset of the EUA for primer and probe sequence information. Finally, if this was also negative, a search for “Center for Disease Control” or “CDC” was performed, as the CDC primers and probes were by far the most frequently referenced sequence information. If the data was still not obviously present, the “Inclusivity” section, which discusses primer / probe sequence homology to known mutant genomes, was most often annotated as the place where sequence disclosure would have been most appropriate.
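The search cascade described above was carried out manually. A minimal Python sketch of the same order of checks is given here as an illustration, not as the tooling actually used. The function name, the returned labels, and the rule that a run of 15 or more nucleotide letters counts as a raw sequence are assumptions; real EUAs may also use degenerate base codes or spaced groups that this simple pattern would miss.

```python
import re

# Dinucleotide search terms named in the protocol.
DINUCLEOTIDES = ("tg", "cg", "gg", "gc", "gt", "aa")

# Assumption: 15 or more consecutive A/C/G/T letters look like a raw sequence.
RAW_SEQUENCE = re.compile(r"\b[acgt]{15,}\b", re.IGNORECASE)


def classify_eua_text(text: str) -> str:
    """Apply the search steps in order and return a hint for manual review."""
    lowered = text.lower()

    # Step 1: dinucleotide hits, confirmed by a long nucleotide-only run.
    if any(d in lowered for d in DINUCLEOTIDES) and RAW_SEQUENCE.search(text):
        return "raw sequence found"

    # Step 2: mentions of "sequence" narrow down where to read manually.
    if "sequence" in lowered:
        return "review 'sequence' mentions"

    # Step 3: references to the CDC primers and probes.
    if "center for disease control" in lowered or re.search(r"\bCDC\b", text):
        return "review CDC references"

    # Nothing matched: probable, but not certain, absence.
    return "not specified"


if __name__ == "__main__":
    # Placeholder sequence, not taken from any EUA.
    print(classify_eua_text("Forward primer: 5'-ACGTACGTACGTACGTACGT-3'"))
```

Every outcome other than "raw sequence found" still requires a person to read the flagged passages, which is why the final classification remains probable rather than certain.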

Error rate: while producing and reviewing this article, 1 additional declaration of sequence information was discovered in an EUA that had initially been classified as “not specified”. This further reinforces the need for structured diagnostic test characterisation data to enable accurate and timely assessment, purchasing and use of diagnostic tests to best serve patients.

Request for comment

We are actively seeking feedback on the data collected and on additional dimensions that would help laboratories or researchers continue to make sense of changes in the diagnostics EUA landscape. Please contact us or see the following repositories for the annotation, data store and processing tools supporting this project. If you find any errors, please contact us or open an issue.