Study evaluation requirements
When a study is entered into the HAWC database for use in an assessment, risk of bias ratings can be entered for each metric for that study. Risk of bias metrics are organized by domain. The following questions are required for evaluation in this assessment.
Requirements by Study Type
Domain | Metric | Bioassay | Epidemiology | In Vitro |
---|---|---|---|---|
Reporting Quality (4319) | Reporting of information necessary for study evaluation (7310) | ✔ | - | - |
Selection or Performance (4320) | Allocation of animals to experimental groups? (7311) | ✔ | - | - |
Selection or Performance (4320) | Blinding of investigators, particularly during outcome assessment (7312) | ✔ | - | - |
Selection or Performance (4320) | Participant selection (7313) | - | ✔ | - |
Confounding/Variable Control (4321) | Control for variables across experimental groups (7314) | ✔ | - | - |
Reporting or Attrition (4322) | Lack of selective data reporting and unaccounted for loss of animals (7315) | ✔ | - | - |
Exposure Methods Sensitivity (4323) | Characterization of the exposure to the compound of interest (7316) | ✔ | - | - |
Exposure Methods Sensitivity (4323) | Exposure measures (7318) | - | ✔ | - |
Exposure Methods Sensitivity (4323) | Utility of the exposure design for the endpoint of interest (7317) | ✔ | - | - |
Outcomes Measures and Results Display Sensitivity (4324) | Outcome measures (7321) | - | ✔ | - |
Outcomes Measures and Results Display Sensitivity (4324) | Sensitivity, specificity, and usability/interpretation of the endpoint evaluations (7319) | ✔ | - | - |
Outcomes Measures and Results Display Sensitivity (4324) | Presentation of results (7320) | ✔ | - | - |
Confounding (4325) | Confounding (7322) | - | ✔ | - |
Analysis (4326) | Does the analysis strategy and presentation convey the necessary familiarity with the data and assumptions? (7323) | - | ✔ | - |
Selective Reporting (4327) | Selective Reporting (7324) | - | ✔ | - |
Sensitivity (4329) | Is there a concern that sensitivity of the study is not adequate to detect an effect? (7326) | - | ✔ | - |
Overall Study Confidence Domain (4330) | Overall study confidence domain (epi) (7328) | - | ✔ | - |
Overall Study Confidence Domain (4330) | Overall confidence domain (animal) (7327) | ✔ | - | - |
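The table can also be read as a lookup from study type to the metrics that must be completed. The following is a minimal sketch of that lookup, assuming the numbers attached to each metric name are metric identifiers; the names `ROWS`, `required_metric_ids`, and `missing_metrics` are hypothetical and are not part of HAWC.

```python
# Rows transcribed from the requirements table: (metric_id, domain, study_type).
# Assumption: the number attached to each metric name is its identifier.
ROWS = [
    (7310, "Reporting Quality", "bioassay"),
    (7311, "Selection or Performance", "bioassay"),
    (7312, "Selection or Performance", "bioassay"),
    (7313, "Selection or Performance", "epidemiology"),
    (7314, "Confounding/Variable Control", "bioassay"),
    (7315, "Reporting or Attrition", "bioassay"),
    (7316, "Exposure Methods Sensitivity", "bioassay"),
    (7318, "Exposure Methods Sensitivity", "epidemiology"),
    (7317, "Exposure Methods Sensitivity", "bioassay"),
    (7321, "Outcomes Measures and Results Display Sensitivity", "epidemiology"),
    (7319, "Outcomes Measures and Results Display Sensitivity", "bioassay"),
    (7320, "Outcomes Measures and Results Display Sensitivity", "bioassay"),
    (7322, "Confounding", "epidemiology"),
    (7323, "Analysis", "epidemiology"),
    (7324, "Selective Reporting", "epidemiology"),
    (7326, "Sensitivity", "epidemiology"),
    (7328, "Overall Study Confidence Domain", "epidemiology"),
    (7327, "Overall Study Confidence Domain", "bioassay"),
]


def required_metric_ids(study_type):
    """Return the sorted metric IDs required for one study type."""
    return sorted(mid for mid, _domain, stype in ROWS if stype == study_type)


def missing_metrics(study_type, completed_ids):
    """Return required metric IDs that have not been completed yet."""
    done = set(completed_ids)
    return [mid for mid in required_metric_ids(study_type) if mid not in done]


if __name__ == "__main__":
    print(required_metric_ids("bioassay"))
    print(missing_metrics("epidemiology", [7313, 7318, 7321]))
```

No metric in the table is marked as required for in vitro studies, so `required_metric_ids("in vitro")` returns an empty list.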
Reporting Quality (4319)
Reporting of information necessary for study evaluation (7310)
EXAMPLE: Adequate. Key information is reported, but some outcome-specific information is missing (e.g., methods used to measure urinary, blood biochemical parameters, histopathological methodology). [Do not use the "not reported" option]
RATING GUIDANCE: Key information necessary for study evaluation (study would be deemed critically deficient if not reported; see footnote 1 below):
- Species; test article description; levels and duration of exposure; route of administration; endpoints investigated; qualitative or quantitative results.
A study with good or adequate reporting quality will provide the following information for the endpoints of interest. Notably, procedures for allocating animals to groups and for blinding are NOT evaluated under reporting quality; they are separately considered below. Important: The delineation between good and adequate is a judgement about the sufficiency of the methodological details for reproducing the experiments of interest. However, it is inefficient to spend time on this delineation before evaluating the metrics below, as it will typically become clear when reviewing the bias and sensitivity domains.
- Test animal – strain; sex; source (e.g., vendor); general husbandry procedures.
- Exposure methods – test article source; description of vehicle control; route of administration. The specifics of the exposure methods will be evaluated in detail below.
- Experimental design – periodicity of exposure; sample size, animal age/life-stage during exposure and at endpoint evaluations (these may be explicit or inferable).
- Endpoint evaluations – general assays used to evaluate the endpoints of interest (e.g., histopathology; Morris water maze). The specifics of the assays will be evaluated in detail below.
- Results presentation – presents qualitative or quantitative findings for endpoints of interest. The details of the reported information are considered below.
1 Although such decisions should be made on an assessment-specific basis, if this information is not reported, it is generally not useful to reach out to study authors. However, for other missing study details that might change study confidence conclusions if they were available, efforts should be made to reach out to study authors.
Note: Studies adhering to GLP (good laboratory practices) or to testing guidelines established by (inter)national agencies are assumed to be of good reporting quality.
Selection or Performance (4320)
Allocation of animals to experimental groups? (7311)
EXAMPLE: Not reported. Study authors did not indicate whether animals were randomly allocated to treatment groups (assumed animals were not randomly assigned). [Do not use the "poor" option, use "not reported" instead. These are functionally equivalent ratings]
RATING GUIDANCE: Ideally, animal studies are randomized, with each animal or litter having an equal chance of being assigned to any experimental group, including controls, and allocation procedures sufficiently described. Less ideally, but generally adequate or good, are studies indicating normalization of experimental groups prior to exposure, for example according to body weight or litter, but without indication of randomization. Studies which fail to report information on allocation should be indicated as ‘not reported’, which is inferred as ‘poor’. However, the evaluator should draw a judgement (and provide rationale) regarding the expected impact of this limitation on the endpoints of interest when drawing the overall confidence judgement below.
Blinding of investigators, particularly during outcome assessment (7312)
EXAMPLE: Not reported. Study authors did not indicate whether investigators were blinded during outcome assessment (assumed not conducted). In addition, there was no indication that independent assessments of histopathology were conducted to offset concern for potential bias due to lack of blinding at outcome assessment. [Do not use the "poor" option, use "not reported" instead. These are functionally equivalent ratings]
RATING GUIDANCE: Good studies will conceal the treatment groups from the researchers conducting the endpoint evaluations (and, in rare but ideal situations, from all research personnel and technicians). Concern regarding blinding may be attenuated when outcome measures are more objective (e.g., as is the case of obtaining organ weights) or measurement is automated using computer-driven systems (e.g., as is the case in many behavioral assessments). Studies which fail to report information on blinding should be indicated as ‘not reported’, which is inferred as ‘poor’. However, the evaluator should draw a judgement (and provide rationale) regarding the expected impact of this limitation on the endpoints of interest when drawing the overall confidence judgement below.
Confounding/Variable Control (4321)
Control for variables across experimental groups (7314)
EXAMPLE: Good. Vehicle was assumed to be the same in controls and treatment groups ("clean air"). The experimental conditions described provided no indication of different practices across treatment groups. [Do not use the "not reported" option]
RATING GUIDANCE: In a good study, all variables, outside of the dependent variable in question, will be controlled for and consistent across experimental groups. Concern regarding additional variables, introduced intentionally or unintentionally, may be mitigated by knowledge or inferences regarding the likelihood and extent to which the variable can influence the endpoint(s) of interest.
A very important example to consider is whether the exposure was sufficiently controlled to attribute the effects of exposure to the compound of interest alone. Generally, well-conducted exposures will not have any evidence of co-exposures and will include experimental controls that minimize the potential for confounding (e.g., use of a suitable vehicle control).
Other examples of variables that may be uncontrolled or inconsistent across experimental groups include: protective or toxic factors that could mask or exacerbate effects; diet composition; surgical procedures (e.g., ovariectomy).
Note: This metric does not consider in-study toxicity, which is considered within ‘usability of the presented data’.
Reporting or Attrition (4322)
Lack of selective data reporting and unaccounted for loss of animals (7315)
EXAMPLE: Good. Data are complete for most endpoints. Ten animals per group were treated. For blood biochemistry (Table 3), results were presented for only 9 animals in the group of female mice exposed to 12 ppm. Considered to pose a low risk of bias because the potential impact on the overall interpretation of study results is minimal. There are explanations for death while on the study, with necropsy information provided (although it is not noted which lesions were detected in deceased animals). [Do not use the "not reported" option]
RATING GUIDANCE: In a good study, information is reported on all pre-specified outcomes and comparisons for all animals, across treatment groups and scheduled sacrifices.
Aspects to consider include whether all study animals were accounted for in the results (if not, are explanations, such as death while on study, provided), and whether expected comparisons or certain groups were excluded from the analyses. In some studies, the outcomes evaluated must be inferred (e.g., a suite of standard measures in a guideline study).
Note: This metric does not address whether quantitative data were reported, nor does it consider statistics. It also does not consider the impact of any observed toxicity on interpretations (that is considered below under “Sensitivity, specificity, and usability/interpretation of the endpoint evaluations”), but only whether all tested animals are accounted for in some manner.
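The accounting check described for this metric (were all treated animals accounted for in each reported result?) can be sketched as a simple comparison of allocated and reported group sizes. This is an illustrative sketch only: the function name `unaccounted_animals` and the dictionary layout are hypothetical, and the counts come from the example above (10 animals per group, 9 reported for blood biochemistry in female mice at 12 ppm).

```python
def unaccounted_animals(allocated, reported):
    """Return {(group, endpoint): missing_count} for every result
    that covers fewer animals than were allocated to the group.

    allocated: {group: number of animals treated}
    reported:  {(group, endpoint): number of animals with results}
    """
    gaps = {}
    for (group, endpoint), n_reported in reported.items():
        n_missing = allocated[group] - n_reported
        if n_missing > 0:
            gaps[(group, endpoint)] = n_missing
    return gaps


if __name__ == "__main__":
    allocated = {"female mice, 12 ppm": 10, "female mice, control": 10}
    reported = {
        ("female mice, 12 ppm", "blood biochemistry"): 9,
        ("female mice, control", "blood biochemistry"): 10,
    }
    print(unaccounted_animals(allocated, reported))
```

A gap flagged this way is only a prompt for the evaluator. Whether it matters still depends on whether the study explains the loss and on its likely impact on interpretation.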
Exposure Methods Sensitivity (4323)
Characterization of the exposure to the compound of interest (7316)
EXAMPLE: Good. Source was reported (Wako Pure Chemical Industries, who report a reagent grade purity of 99%). In addition, authors note "Each lot of chloroform was analyzed for its chemical purity and stability by gas chromatography and infrared spectrometry, but no impurities or decomposition products were observed." The technique for generating chloroform vapor-air mixture and the exposure system were described in detail in the previous paper [Kano H, Goto K, Suzuki M. 2002. An exposure system for combined administration of an organic solvent to rodents by inhalation and water-drinking and its operational performance. J Occup Health. 44:119-124.] [Do not use the "not reported" option]
RATING GUIDANCE: In good studies, there are no notable issues that raise doubt about the reliability of the exposure levels, or of exposure to the compound of interest.
Depending on the chemical being assessed, this may include considering factors such as: the stability and composition (e.g., purity; isomeric composition) of the test article; exposure generation and analytic verification methods (including whether the tested levels and spacing between exposure groups are resolvable using current methods); and details of exposure methods (e.g., inhalation chamber type; gavage volume). In some cases, exposure biomarkers in blood, urine, or tissues of treated animals can mitigate concerns regarding inaccurate dosing (dependent on the validity of the biomarker for the chemical of interest).
Note: While this identifies uncertainties in dose-response, it is typically not a valid reason for exclusion from Hazard ID.
Utility of the exposure design for the endpoint of interest (7317)
EXAMPLE: Good. Study uses a standard subchronic exposure design for evaluating toxicological effects. The authors state: "we conducted experiments on inhalation exposure of rats and mice of both sexes to chloroform vapor for 2 and 13 wk, with reference to the Organization of Economic Cooperation and Development's (OECD) Guidelines for Testing of Chemicals 412 and 413, respectively." [Do not use the "not reported" option]
RATING GUIDANCE: Based on the known or presumed biological progression of the outcomes being evaluated, consider whether there are notable concerns regarding the timing, frequency, or duration of exposure.
For example, better developmental studies will cover a greater proportion of the developmental window thought to be critical to the system of interest, while better studies for assessing cancer or other chronic outcomes will be of longer duration. Exposure design or duration considerations that raise notable concerns may need to be confirmed as PECO-relevant, e.g., an acute duration study may not be considered directly applicable to a cancer assessment. In these cases, the study should be excluded at the full-text level. Studies that expose animals infrequently or sporadically, or, conversely, on a continuous basis (which, depending on the exposure level, can impact food/water consumption, sleep cycles, or pregnancy/maternal care), might introduce additional complications.
Outcomes Measures and Results Display Sensitivity (4324)
Sensitivity, specificity, and usability/interpretation of the endpoint evaluations (7319)
[Do not use the "not reported" option]
EXAMPLE:
Histopathology (Poor - females): Significant concerns were raised regarding the procedural methods for histological evaluations. Specifically, no information was provided on the number of slides evaluated, how slides were selected for evaluation, and how lesions were judged. The sampling of 10 animals/group is typical for subchronic studies, although larger sample sizes are typically preferred. (Poor/Critically deficient - males): In addition to the issues above, significant concerns were raised regarding the usability/interpretation of the data for males: "In the 13-wk exposure study, almost all the exposed male mice died after the first day of exposure".
Biochemical and urinalysis measures (Poor - females): Significant concerns were raised regarding the procedural methods for biochemical and urinalysis measurements. Specifically, no information was provided on the diagnostic kits or whether serum, plasma or whole blood were used. The sampling of 10 animals/group is typical for subchronic studies, although larger sample sizes are typically preferred. (Poor/Critically deficient - males): In addition to the issues above, significant concerns were raised regarding the usability/interpretation of the data for males: "In the 13-wk exposure study, almost all the exposed male mice died after the first day of exposure".
Organ weights (Adequate - females): Procedural methods for organ weight measures were not described in detail, but since methodology is fairly straightforward and qualitative results were provided for relative and absolute weights, the organ weight procedures were considered 'adequate'. The sampling of 10 animals/group is typical for many subchronic studies, although larger sample sizes are typically preferred. Note: relative organ weights are often preferred for the liver and other organ weights as changes in absolute weights are considered less indicative of a histological effect, particularly when there is evidence of changes in body weight [Bailey SA, Zidell RH, Perry RW. 2004. Relationship between organ weight and body/brain weight in the rat: What is the best analytical endpoint? Toxicol Pathol. 32(4):448-66. PMID: 15204968]. However, for kidney, absolute weights were more predictive of kidney histopathology compared to relative weight, even in animals that suffered exposure-related body weight loss [Craig EA, Yan Z, Zhao QJ. 2015. The relationship between chemical-induced kidney weight increases and kidney histopathology in rats. J Appl Toxicol. 35:729-36. PMID: 25092041]. (Poor - males): Significant usability/interpretation concerns were raised for males: "In the 13-wk exposure study, almost all the exposed male mice died after the first day of exposure".
RATING GUIDANCE: In good studies, there are no notable concerns about aspects of the procedures for, or the timing of, the endpoint evaluations.
Based on the endpoint evaluation protocol used for the endpoints of interest, specific considerations will typically include:
- Concerns regarding the sensitivity of the specific protocols for evaluating the endpoint of interest (i.e., assays can differ dramatically in terms of their ability to detect effects), and/or their timing (i.e., the age of animals or time of day at assessment can be critical to the appropriateness and sensitivity of the evaluation). This includes both overestimates and underestimates of the true effect, as well as a much higher (or lower) probability for detecting the effect(s) being assessed.
- Concerns regarding the specificity and validity of the protocols. This includes the use of appropriate protocol controls to rule out non-specific effects, which can often be inferred from established guidelines or historical assay data. It may be considered useful for insensitive, complex, or novel protocols to include positive and/or negative controls. This also considers the utility of the specific procedural details (e.g., how lesions were characterized) for measuring effects.
- Concerns regarding adequate sampling. This includes both the experimental unit (e.g., litter; animal) and endpoint (e.g., number of slides evaluated). This is typically inferred from historical knowledge of the assay or comparable assays.
Note: Human relevance of the endpoint is not addressed during study evaluation. Undersampling without blinding (e.g., sampling bias) will typically lead to gross overestimates of effect. Sample size is generally not a reason for exclusion.
Presentation of results (7320)
[Do not use the "not reported" option]
EXAMPLE:
Histopathology: (Good): Full quantitative presentation of results.
Biochemical and urinalysis measures: (Adequate): Central tendency values and statistical significance were presented but not variance.
Organ weights (Poor): Only qualitative reporting of results (no summary values).
Note on statistical analysis: Bartlett's test was used to test whether the variance was homogeneous or not. When homogeneous, one-way ANOVA was performed; when not homogeneous the Kruskal-Wallis rank sum test was used. Dunnett's multiple comparison test was used also. Males and females were analyzed separately.
RATING GUIDANCE: In good studies, there are no notable concerns about the way results are analyzed or presented.
Items that will typically be important to consider include:
- Concern that the level of detail provided does not allow for an informed interpretation of the results (e.g., authors’ conclusions without quantitative data; discussing neoplasms without distinguishing between benign and malignant tumors; not presenting variability).
- Concern that the way in which the data were analyzed, compared, or presented is inappropriate or misleading. Examples include: failing to control for litter effects (e.g., when presenting pup data rather than the preferred litter data); pooling results from males and females or across lesion types; failing to address observed or presumed toxicity (e.g., in assessed animals; in dams) when exposure levels are known or expected to be highly toxic; incomplete presentation of the data (e.g., presenting continuous data as dichotomized); or non-preferred display of results (e.g., using a different readout than is expected for that assay). The evaluator should support how or why, and to what extent, this might mislead interpretations.
Note: Concerns regarding the statistical methods applied are not addressed during study evaluation, but should be flagged for review by the SWG. Missing information related to this metric should typically be requested from study authors.
Overall Study Confidence Domain (4330)
Overall confidence domain (animal) (7327)
[Do not use the "not reported" option]
EXAMPLE: Histopathology: (Low – females): The lack of blinding or independent assessments of histopathology is a concern, which would typically be expected to overestimate effects, as is the lack of procedural details provided. (Uninformative - males): The lack of blinding or independent assessments of histopathology is a concern, as is the lack of procedural details provided. In addition, overt acute toxicity in males precludes interpretations regarding effects on the liver or kidneys following subchronic exposure. These findings may still provide some qualitative support for analyses of acute effects.
Biochemical and urinalysis measures: (Medium – females): Concerns were raised regarding the procedural methods for biochemical and urinalysis measurements, and the data are presented without a measure of variability. However, the overall risk of bias is considered low. (Uninformative – males): Concerns were raised regarding the procedural methods for biochemical and urinalysis measurements, and the data are presented without a measure of variability. In addition, overt toxicity in males precludes interpretations regarding effects on the liver or kidneys following subchronic exposure.
Organ weights (Low/Medium - females): Because this is a more objective outcome measure, there was less concern for the impact of the procedural limitations identified for histopathology. However, findings were only described in narrative (e.g., “significantly increased” or “not changed”) without presentation of numerical results. (Uninformative - males): Because this is a more objective outcome measure, there was less concern for the impact of the procedural limitations identified for histopathology. However, overt toxicity in males precludes interpretations regarding effects on the liver or kidneys following subchronic exposure.
Overall, there is less concern for lack of blinding for the more objectively measured outcomes (mortality, body weight, organ weight, hematological, and blood biochemical measures) compared to the histopathology evaluations.
RATING GUIDANCE: Once the evaluation domains have been classified, these ratings will be combined to reach an overall study confidence classification of High, Medium, Low, or Uninformative.
This classification will be based on the classifications in the evaluation domains, and will include consideration of the likely impact of the noted deficiencies in bias and sensitivity on the results. Studies with critical deficiencies in any evaluation domain will be classified as Uninformative. Other classifications will generally follow a sorting such that High Confidence studies would have the highest evaluation (“Good”) for all or most domains; Low Confidence studies would have a “Poor” evaluation for one or more domains (unless the impact of the particular limitation(s) is judged to be unlikely to be severe), and Medium Confidence studies are in between these groups (e.g., most domains receiving a mid-level Adequate evaluation, with no limitations judged to be severe). Once initial evaluation has been performed with consensus between reviewers, the classifications will be re-evaluated, looking at the variability “within” and “between” levels to ensure that the separation between the levels of confidence is appropriate and that no additional criteria need to be considered.
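The general sorting rule in the guidance above can be sketched as a first-pass default. This is only an illustration, not HAWC's implementation: the function name `overall_confidence` and the flag `poor_judged_not_severe` are hypothetical, "most domains" is assumed to mean a strict majority, and 'not reported' is treated as 'poor' as stated for the allocation and blinding metrics. The reviewer's judgement about the impact of each limitation still overrides any mechanical result.

```python
def overall_confidence(ratings, poor_judged_not_severe=False):
    """First-pass overall confidence from per-domain ratings.

    ratings: {domain name: rating string}, e.g. {"Confounding": "Good"}.
    poor_judged_not_severe: reviewer judgement that the "Poor" ratings
        present are unlikely to have a severe impact (assumed input).
    """
    if not ratings:
        raise ValueError("at least one domain rating is required")

    # 'not reported' is inferred as 'poor' per the guidance.
    normalized = [
        "poor" if r.strip().lower() == "not reported" else r.strip().lower()
        for r in ratings.values()
    ]

    # A critical deficiency in any domain makes the study uninformative.
    if "critically deficient" in normalized:
        return "Uninformative"

    # One or more "Poor" domains gives Low, unless judged not severe.
    if "poor" in normalized and not poor_judged_not_severe:
        return "Low"

    # Assumption: "all or most" means a strict majority of domains.
    if normalized.count("good") > len(normalized) / 2:
        return "High"

    return "Medium"


if __name__ == "__main__":
    example = {
        "Reporting Quality": "Good",
        "Selection or Performance": "Not reported",
        "Confounding/Variable Control": "Good",
    }
    print(overall_confidence(example))
    print(overall_confidence(example, poor_judged_not_severe=True))
```

The example ratings are invented for illustration. The ordering of the checks reflects the text: critical deficiencies are decisive, "Poor" ratings come next, and the High/Medium split is applied last.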
Selection or Performance (4320)
Participant selection (7313)
EXAMPLE TEXT: Adequate. Nested case-control design in Mexico City birth cohort with 30 cases of preterm birth and 30 controls selected randomly from same population of women who were recruited during prenatal visits at one of four clinics (serving low to moderate income population). Recruitment and eligibility criteria (inclusion/exclusion criteria) discussed. Little discussion of participants versus nonparticipants but the available information indicates that differential selection is possible but not likely. Participation rate reported to be low (36%). Evaluates the vulnerable population of low-moderate income pregnant women.
Add other concerns or limitations.
Add impact and direction to effect estimate, if applicable.
RATING GUIDANCE: Is there evidence that selection into or out of the study (or analysis sample) was jointly related to exposure and to outcome?
Study design, where and when was the study conducted, and who was included? Recruitment process, exclusion and inclusion criteria, type of controls, total eligible, comparison between participants and nonparticipants (or followed and not followed), final analysis group. Does the study include potential vulnerable/susceptible groups or lifestages?
LINK to prompting questions.
Exposure Methods Sensitivity (4323)
Exposure measures (7318)
EXAMPLE TEXT: Poor for long-chained (DEHP, DiNP) and adequate for short-chained (DEP, DBP, DiBP) phthalate metabolites based on number of samples. A single spot (second morning void) urine sample was collected from each woman during a third-trimester visit to the project’s research center; third trimester sample is relevant to later term preterm births. Analytical approach described and appropriate. High percent >LOD.
Add other concerns or limitations.
Add impact and direction to effect estimate, if applicable.
RATING GUIDANCE: Does the exposure measure reliably distinguish between levels of exposure in a time window considered most relevant for a causal effect with respect to the development of the outcome?
Source(s) of exposure (consumer products, occupational, an industrial accident) and source(s) of exposure data, blinding to outcome, level of detail for job history data, when measurements were taken, type of biomarker(s), assay information, reliability data from repeat measures studies, validation studies.
LINK to prompting questions.
Outcomes Measures and Results Display Sensitivity (4324)
Outcome measures (7321)
EXAMPLE TEXT: Adequate. Preterm birth defined by length of gestation (< 37 weeks), a standard measure of birth outcome, estimated by maternal recall of the date of last menstrual period, rather than the preferred early ultrasound. Potential misclassification of preterm cases due to maternal recall of last menstrual period to estimate gestational age which may be nondifferential with respect to exposure; however, differential misclassification is still possible but unlikely.
Add other concerns or limitations.
Add impact and direction to effect estimate, if applicable.
RATING GUIDANCE: Does the outcome measure reliably distinguish the presence or absence (or degree of severity) of the outcome?
Source of outcome (effect) measure, blinding to exposure status or level, how measured/classified, incident versus prevalent disease, evidence from validation studies, prevalence (or distribution summary statistics for continuous measures).
LINK to prompting questions.
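The example text above classifies preterm birth as gestation of less than 37 weeks, with gestational age estimated from the recalled date of the last menstrual period (LMP). The sketch below shows that calculation and how a small recall error can flip the classification. The function names and dates are hypothetical and chosen only for illustration.

```python
from datetime import date, timedelta

PRETERM_THRESHOLD_WEEKS = 37  # definition used in the example text


def gestational_age_weeks(lmp, birth):
    """Gestational age in weeks, counted from the LMP date."""
    days = (birth - lmp).days
    if days < 0:
        raise ValueError("birth date precedes LMP date")
    return days / 7


def is_preterm(lmp, birth):
    """True when gestation is shorter than 37 completed weeks."""
    return gestational_age_weeks(lmp, birth) < PRETERM_THRESHOLD_WEEKS


if __name__ == "__main__":
    true_lmp = date(2020, 1, 1)               # hypothetical date
    birth = true_lmp + timedelta(days=258)    # about 36.9 weeks
    recalled_lmp = true_lmp - timedelta(days=3)  # recall error of 3 days

    print(is_preterm(true_lmp, birth))      # True
    print(is_preterm(recalled_lmp, birth))  # False
```

Births close to the 37-week threshold are the ones most exposed to this kind of outcome misclassification, which is why early ultrasound is described as the preferred measure.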
Confounding (4325)
Confounding (7322)
EXAMPLE TEXT: Adequate. Information on key confounders was collected through questionnaire. The strategy for evaluating confounding and the process for retaining variables in the models was described. Rationale for selecting confounders not provided. Inclusion in model not solely based on statistical significance. Adjustment for relevant co-exposures.
Add other concerns or limitations.
Add impact and direction to effect estimate, if applicable.
RATING GUIDANCE: Is confounding of the effect of the exposure unlikely?
Background research on key confounders for specific populations or settings; participant characteristic data, by group; strategy/approach for consideration of potential confounding; strength of associations between exposure and potential confounders and between potential confounders and outcome; degree of exposure to the confounder in the population.
LINK to prompting questions.
Analysis (4326)
Does the analysis strategy and presentation convey the necessary familiarity with the data and assumptions? (7323)
EXAMPLE TEXT: Adequate. Multivariable (multivariate) logistic regression used to take into account potential confounding variables; quantitative results presented (ORs and 95% CIs with ORs adjusted for confounders). Imputation techniques used when phthalate metabolite concentrations were below the LOD (i.e., values were filled in where measurements were unavailable); amount of missing data not noted; dichotomous exposure (reduced sensitivity) and use of median as the cut-off, adjusted for urine creatinine and specific gravity to assess effect of method used.
Add other concerns or limitations.
Add impact and direction to effect estimate, if applicable.
RATING GUIDANCE: Does the analysis strategy and presentation convey the necessary familiarity with the data and assumptions?
Extent (and if applicable, treatment) of missing data for exposure, outcome, and confounders, approach to modeling, classification of exposure and outcome variables (continuous versus categorical), testing of assumptions, sample size for specific analyses, relevant sensitivity analyses.
An ideal study would convey a thoughtful and thorough description of the analytical approach, and descriptive data for key variables (e.g., exposure measures, outcome measures), including the amount of missing data (or proportion less than the limit of detection [LOD]). The ideal analysis would use an appropriate and well thought out modeling approach for the study design (e.g., logistic regression for case-control data) and specify the covariates used in the final model; the methods should be described in enough detail such that they could be applied to the data from another study. In addition, the results should be presented with sufficient detail to enable estimation of effect estimates and precision of the estimates (e.g., standard error [SE] or confidence interval [CI]).
LINK to prompting questions.
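To illustrate what "sufficient detail to enable estimation of effect estimates and precision" means in practice, the sketch below computes a crude (unadjusted) odds ratio and a Wald-type 95% confidence interval from a 2x2 case-control table. This is not the adjusted logistic regression described in the example text, which needs individual-level covariate data. The function name `odds_ratio_ci` and the cell counts are hypothetical.

```python
import math


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Crude odds ratio with a Wald confidence interval.

    a: exposed cases      b: unexposed cases
    c: exposed controls   d: unexposed controls
    z: normal quantile (1.96 gives an approximate 95% interval)
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cell counts must be positive")

    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR) from the four cell counts.
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


if __name__ == "__main__":
    # Hypothetical counts for 30 cases and 30 controls, dichotomized
    # exposure (above versus below the median).
    or_, lo, hi = odds_ratio_ci(a=20, b=10, c=10, d=20)
    print(f"OR = {or_:.2f}, 95% CI = ({lo:.2f}, {hi:.2f})")
```

With groups this small the interval is wide, which is the kind of imprecision an evaluator would note when a study reports only point estimates.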
Selective Reporting (4327)
Selective Reporting (7324)
EXAMPLE TEXT: Not applicable. Information on subgroups not provided; comparison of cases and controls discussed in population domain.
Add other concerns or limitations.
Add impact and direction to effect estimate, if applicable.
RATING GUIDANCE: Is there concern for selective reporting?
Are results presented with adequate detail for all the endpoints of interest? Are results presented for the full sample as well as for specified subgroups? Were stratified analyses (effect modification) motivated by a specific hypothesis?
LINK to prompting questions.
Sensitivity (4329)
Is there a concern that sensitivity of the study is not adequate to detect an effect? (7326)
EXAMPLE TEXT: Poor/Good. Small sample size/ Potential nondifferential misclassification of outcome and exposure. Low exposure levels. Range of exposure is narrow. Healthy worker effect.
Add other concerns or limitations.
Add impact and direction to effect estimate, if applicable.
RATING GUIDANCE: Are there concerns for study sensitivity?
What exposure range is spanned in this study? What are the ages of participants (e.g., not too young in studies of pubertal development)? What is the length of follow-up (for outcomes with long latency periods)? Choice of referent group and the level of exposure contrast between groups (i.e., the extent to which the “unexposed group” is truly unexposed, and the prevalence of exposure in the group designated as “exposed”). Is the study relevant to the exposure and outcome of interest?
LINK to prompting questions.
Overall Study Confidence Domain (4330)
Overall study confidence domain (epi) (7328)
EXAMPLE TEXT: Low confidence. Give brief rationale for rating.
Add other concerns or limitations.
Add impact and direction to effect estimate, if applicable.
RATING GUIDANCE: Once the evaluation domains have been classified, these ratings will be combined to reach an overall study confidence classification of High, Medium, Low, or Uninformative.
This classification will be based on the classifications in the evaluation domains, and will include consideration of the likely impact of the noted deficiencies in bias and sensitivity on the results. Studies with critical deficiencies in any evaluation domain will be classified as Uninformative. Other classifications will generally follow a sorting such that High Confidence studies would have the highest evaluation (“Good”) for all or most domains; Low Confidence studies would have a “Poor” evaluation for one or more domains (unless the impact of the particular limitation(s) is judged to be unlikely to be severe), and Medium Confidence studies are in between these groups (e.g., most domains receiving a mid-level Adequate evaluation, with no limitations judged to be severe). Once initial evaluation has been performed with consensus between reviewers, the classifications will be re-evaluated, looking at the variability “within” and “between” levels to ensure that the separation between the levels of confidence is appropriate and that no additional criteria need to be considered.
LINK to prompting questions.