Study evaluation requirements
Follow link to see attachments that contain example prompting and follow-up questions for epidemiological studies.
Requirements by Study Type
| Domain | Metric | Bioassay | Epidemiology | In Vitro |
|---|---|---|---|---|
| Selection and Performance100000937 | Participant selection100001521 | - | ✔ | - |
| Exposure Methods100000940 | Exposure measures100001526 | - | ✔ | - |
| Outcome Methods/Results Presentation100000941 | Outcome measures100001529 | - | ✔ | - |
| Confounding100000942 | Confounding100001530 | - | ✔ | - |
| Analysis100000943 | Does the analysis strategy and presentation convey the necessary familiarity with the data and assumptions?100001531 | - | ✔ | - |
| Selective Reporting100000944 | Selective Reporting100001532 | - | ✔ | - |
| Sensitivity100000945 | Are there concerns for study sensitivity100001533 | - | ✔ | - |
| Methodological quality100000948 | Allocation to different tests100001598 | ✔ | - | - |
| Methodological quality100000948 | Route of administration100001592 | ✔ | - | - |
| Methodological quality100000948 | Timing and duration of administration100001593 | ✔ | - | - |
| Methodological quality100000948 | Untreated/vehicle control100001585 | - | - | ✔ |
| Methodological quality100000948 | Individual identification100001587 | ✔ | - | - |
| Methodological quality100000948 | Allocation to treatments100001591 | ✔ | - | - |
| Methodological quality100000948 | Duration of exposure100001601 | - | - | ✔ |
| Methodological quality100000948 | Cytotoxicity100001608 | - | - | ✔ |
| Methodological quality100000948 | Solubility of compound100001581 | - | - | ✔ |
| Methodological quality100000948 | Vehicle (in vitro)100001582 | - | - | ✔ |
| Methodological quality100000948 | Concentrations used100001602 | - | - | ✔ |
| Methodological quality100000948 | Replicates/repetitions100001605 | - | - | ✔ |
| Methodological quality100000948 | Statistical methods and software100001609 | ✔ | - | ✔ |
| Methodological quality100000948 | Purity100001580 | ✔ | - | ✔ |
| Methodological quality100000948 | Negative control group100001584 | ✔ | - | - |
| Methodological quality100000948 | Animal model100001586 | ✔ | - | - |
| Methodological quality100000948 | Frequency of administration100001597 | ✔ | - | - |
| Methodological quality100000948 | Number of animals100001607 | ✔ | - | - |
| Methodological quality100000948 | Vehicle (in vivo)100001583 | ✔ | - | - |
| Methodological quality100000948 | Number of animals in cage100001589 | ✔ | - | - |
| Methodological quality100000948 | Contaminants100001590 | ✔ | - | - |
| Methodological quality100000948 | Test conditions100001603 | - | - | ✔ |
| Methodological quality100000948 | Collection of measurements100001606 | ✔ | - | ✔ |
| Methodological quality100000948 | Blinding or similar measures100001612 | ✔ | - | ✔ |
| Methodological quality100000948 | Test methods100001604 | ✔ | - | ✔ |
| Methodological quality100000948 | Attrition100001611 | ✔ | - | ✔ |
| Methodological quality100000948 | Conditions for cultivation100001600 | - | - | ✔ |
| Methodological quality100000948 | Test system100001599 | - | - | ✔ |
| Methodological quality100000948 | Housing conditions100001588 | ✔ | - | - |
| Reporting quality100000947 | Purity of compound100001537 | ✔ | - | ✔ |
| Reporting quality100000947 | Bedding material100001551 | ✔ | - | - |
| Reporting quality100000947 | Results presentation100001575 | ✔ | - | ✔ |
| Reporting quality100000947 | Solubility of compound100001538 | - | - | ✔ |
| Reporting quality100000947 | Test system100001542 | - | - | ✔ |
| Reporting quality100000947 | Control group100001540 | ✔ | - | ✔ |
| Reporting quality100000947 | Vehicle100001539 | ✔ | - | ✔ |
| Reporting quality100000947 | Individual identification100001543 | ✔ | - | - |
| Reporting quality100000947 | Animal model (reporting)100001541 | ✔ | - | - |
| Reporting quality100000947 | Number of animals100001547 | ✔ | - | - |
| Reporting quality100000947 | Physical enrichment100001549 | ✔ | - | - |
| Reporting quality100000947 | Allocation100001555 | ✔ | - | - |
| Reporting quality100000947 | Water bottle material100001550 | ✔ | - | - |
| Reporting quality100000947 | Frequency and duration of dosing100001559 | ✔ | - | - |
| Reporting quality100000947 | Route of administration100001557 | ✔ | - | - |
| Reporting quality100000947 | Sex and age of animals at dosing100001558 | ✔ | - | - |
| Reporting quality100000947 | Source of test system100001561 | - | - | ✔ |
| Reporting quality100000947 | Metabolic competence100001562 | - | - | ✔ |
| Reporting quality100000947 | Competing interests100001579 | ✔ | - | ✔ |
| Reporting quality100000947 | Animals subjected to separate tests100001572 | ✔ | - | - |
| Reporting quality100000947 | Allocation to different tests100001571 | ✔ | - | - |
| Reporting quality100000947 | Replicates/repetitions100001569 | - | - | ✔ |
| Reporting quality100000947 | Test/analytical methods100001570 | ✔ | - | ✔ |
| Reporting quality100000947 | Media composition100001564 | - | - | ✔ |
| Reporting quality100000947 | Statistical unit100001577 | ✔ | - | - |
| Reporting quality100000947 | Statistical methods and software100001576 | ✔ | - | ✔ |
| Reporting quality100000947 | Cytotoxicity100001574 | - | - | ✔ |
| Reporting quality100000947 | Data collection timepoints100001573 | - | - | ✔ |
| Reporting quality100000947 | Contamination100001566 | - | - | ✔ |
| Reporting quality100000947 | Funding100001578 | ✔ | - | ✔ |
| Reporting quality100000947 | Number of cell passages100001563 | - | - | ✔ |
| Reporting quality100000947 | Incubation parameters100001565 | - | - | ✔ |
| Reporting quality100000947 | Cell density100001567 | - | - | ✔ |
| Reporting quality100000947 | Treatment duration100001568 | - | - | ✔ |
| Reporting quality100000947 | Chemical name100001536 | ✔ | - | ✔ |
| Reporting quality100000947 | Housing temperature100001544 | ✔ | - | - |
| Reporting quality100000947 | Relative humidity100001545 | ✔ | - | - |
| Reporting quality100000947 | Light-dark cycle100001546 | ✔ | - | - |
| Reporting quality100000947 | Cage material100001548 | ✔ | - | - |
| Reporting quality100000947 | Feed100001552 | ✔ | - | - |
| Reporting quality100000947 | Drinking water100001553 | ✔ | - | - |
| Reporting quality100000947 | Dose levels100001554 | ✔ | - | ✔ |
| Reporting quality100000947 | Number of animals in dose groups100001556 | ✔ | - | - |
| Overall Study Confidence (Epidemiology)100000946 | Overall confidence (epi)100001535 | - | ✔ | - |
| Overall Study Confidence (In vivo / in vitro)100000949 | Overall confidence (in vivo / in vitro)100001610 | ✔ | - | ✔ |
Methodological quality100000948
Important note:
If multiple endpoints or groups of endpoints are presented in the study, consider evaluating each of them separately and explaining the evaluation in the comment.
Allocation to different tests100001598
The allocation of animals to different tests and measurements was randomized.
Guidance:
Animals should be randomly assigned to different tests and/or measurements. Randomization is one measure applied to reduce the risk that statistically significant results arise by chance alone.
How to judge this criterion:
Good – animals were randomly assigned to different tests and measurements using an appropriate method
Deficient – not clearly reported if animals were randomized but it can be inferred from other information reported for the study design and conduct.
Critically deficient – animals were not randomly assigned to tests and measurements.
Route of administration100001592
The route of administration was appropriate and not likely to interfere with the study results.
Guidance:
NOTE: The relevance of the administration route for human exposure scenarios and health risk assessment should not be specifically considered here. Application of this criterion for the evaluation of study reliability should consider any influence of the administration route on the validity, accuracy and robustness of the study results.
In general, for repeated dose and long term/chronic studies it is recommended that the test compound is administered orally, by dietary admixture or in drinking water, by gavage or in capsules (for non-rodents). If another route of exposure was used in such a study, the scientific and/or practical reasons for this should generally be justified by the study authors. Gavage is often used as it allows for delivery of a precise and consistent dose. Administration in feed or drinking water may be chosen as it is less invasive than gavage and sometimes better mimics human exposure scenarios (i.e. is more relevant). It is important to note that the toxicokinetics of the test compound, and the consequent toxicity, may differ between administration via gavage and administration in feed or water. If administration is via feed or water in perinatal studies, offspring may be exposed both indirectly via maternal milk (if the test compound is transferred) and directly from feed/water in the last part of the lactation period, as they gradually start to consume feed and water (from around postnatal day 14 in rats), and they may actually be consuming a higher dose per kg body weight than the adults (OECD 2008).
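The point about pups receiving a higher dose per kg body weight than adults follows from simple arithmetic: dose equals feed concentration times feed intake, divided by body weight. The sketch below illustrates this under stated assumptions. The feed concentration, intakes and body weights are hypothetical round numbers chosen for illustration only and are not taken from any guideline or study, and the calculation ignores any additional exposure via maternal milk.

```python
def dose_mg_per_kg_bw(feed_conc_mg_per_kg, feed_intake_kg_per_day, body_weight_kg):
    """Daily dose (mg/kg bw/day) from dietary administration."""
    return feed_conc_mg_per_kg * feed_intake_kg_per_day / body_weight_kg

# Hypothetical illustration values (assumptions, not guideline figures).
FEED_CONC = 100.0  # mg test compound per kg feed, same diet for dam and pups

adult_dose = dose_mg_per_kg_bw(FEED_CONC, feed_intake_kg_per_day=0.020, body_weight_kg=0.300)
pup_dose = dose_mg_per_kg_bw(FEED_CONC, feed_intake_kg_per_day=0.006, body_weight_kg=0.050)

print(f"adult: {adult_dose:.1f} mg/kg bw/day, pup: {pup_dose:.1f} mg/kg bw/day")
```

With these assumed values the pup receives roughly twice the adult dose per kg body weight from the same diet, because its feed intake relative to body weight is higher.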
Dermal administration and inhalation may often pose additional practical and technical difficulties compared to oral administration or injection. For this reason, dermal administration is specifically not recommended for reproductive toxicity studies according to OECD test guidelines and guidance documents (OECD 2008). For inhalation, many factors can affect e.g. deposition and retention of the test compound in the respiratory tract, and the dose depends on how much of the compound is delivered to the exposure chamber in a respirable form (OECD 2002). It should be noted that if pups are exposed in inhalation chambers they may receive inhaled and dermal doses simultaneously, depending on the equipment (OECD 2008).
Importantly, in reproductive/developmental studies, dams and pups should not be separated for long periods of time (several hours) to allow for exposure of one or the other, e.g. in inhalation studies (OECD 2008).
How to judge this criterion:
Good – the recommended route of administration was used and is considered to not influence study results in this case. In case an alternative administration route was used this was explicitly and correctly justified by the study authors.
Deficient – an administration route other than that recommended in test guidelines was used and was not justified by the study authors. However, the chosen administration route is unlikely to have influenced study results.
Critically deficient – an alternative administration route was used that is suspected to have influenced study results.
Timing and duration of administration100001593
The timing and duration of administration were appropriate for investigating the included endpoints.
Guidance:
OECD test guidelines and corresponding guidance provide recommendations for timing and duration of administration of the test compound for different types of studies. In general, the dosing regimen should “maximise the sensitivity of the test without significantly altering the accuracy and interpretability of the biological data obtained” (OECD 2002).
Timing and duration should be considered specifically in terms of covering sensitive periods of development (e.g. “period of male sexual differentiation in late gestation” (OECD 2008)).
In certain cases, it is also relevant to consider timing of administration in relation to when measurements of toxicological outcomes are conducted. For example, when investigating effects on behavior the potential of the administration to produce acute effects on behavioral measures should be considered, especially where the test substance is administered directly to offspring daily (OECD 2008).
How to judge this criterion:
Good – the timing and duration of administration of the test compound are in line with general recommendations for the study type, are not likely to interfere with the measurements conducted, and cover sensitive periods of development, where relevant.
Deficient – the timing and duration of administration of the test compound deviate somewhat from standard recommendations; however, a scientific or practical justification is provided and sensitive periods of development are covered.
Critically deficient – the timing and duration of administration of the test compound are significantly different from general recommendations for the study type without being justified, and/or are likely to directly interfere with toxicological outcomes/measurements, and/or do not cover sensitive periods of development, where relevant.
Individual identification100001587
Animals were individually identified.
Guidance:
In order to ensure reliable administration of the test compound, allocation to treatment groups and different tests, as well as recording of observations and test results, it is important that animals are individually identified.
How to judge this criterion:
Good – it is stated that animals were individually identified; the specific method for identification does not have to be described.
Deficient – it is not clearly stated whether or not animals were individually identified, but it may be inferred from other information reported for the study design and conduct.
Critically deficient – it is stated that animals were not individually identified, or this can be inferred from other information reported for the study design and conduct.
Allocation to treatments100001591
The allocation of animals to different treatments was randomized.
Guidance:
Animals should be randomly assigned to control and treatment groups. Randomization is one measure applied to reduce the risk that statistically significant results arise by chance alone.
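As an illustration of what an "appropriate method" of randomization can look like in practice, the following is a minimal sketch of a seeded random allocation to equally sized groups. The animal IDs, group names and seed are hypothetical, and the approach (shuffle, then deal out in turn) is only one of several acceptable schemes. Studies may also use stratified or block randomization, e.g. by body weight or litter.

```python
import random

def allocate_to_groups(animal_ids, group_names, seed):
    """Randomly assign animals to groups of (near-)equal size."""
    rng = random.Random(seed)          # recorded seed keeps the allocation reproducible
    shuffled = list(animal_ids)
    rng.shuffle(shuffled)              # random order of animals
    allocation = {name: [] for name in group_names}
    for index, animal in enumerate(shuffled):
        # deal shuffled animals out to the groups in turn
        allocation[group_names[index % len(group_names)]].append(animal)
    return allocation

# Hypothetical example: 40 animals, control plus three dose groups.
animals = [f"A{n:02d}" for n in range(1, 41)]
groups = ["control", "low", "mid", "high"]
allocation = allocate_to_groups(animals, groups, seed=2024)

for name, members in allocation.items():
    print(name, len(members))
```

A study report that names the method and, ideally, the software or seed used gives the evaluator direct evidence for a "Good" judgment rather than requiring inference from other study details.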
How to judge this criterion:
Good – animals were randomly assigned to different treatment groups using an appropriate method
Deficient – not clearly reported if animals were randomized but it can be inferred from other information reported for the study design and conduct
Critically deficient – animals were clearly not randomly assigned to treatment groups
Statistical methods and software100001609
The statistical methods were clearly described and do not seem inappropriate, unusual or unfamiliar.
Guidance:
The choice of statistical analyses will depend on the type of study and the nature of the endpoints measured.
OECD test guidelines and corresponding guidance documents provide some recommendations for statistical tests (e.g. OECD’s Guidance notes for analysis and evaluation of chronic toxicity and carcinogenicity studies and Guidance Notes for Analysis and Evaluation of Repeat-Dose Toxicity Studies) as well as for considerations to be made in statistical analyses of different types of tests.
Evaluation of this criterion also includes considering whether the correct statistical unit was used. For example, in in vivo toxicity studies it is generally recommended that the litter (or dam) is the statistical unit in developmental toxicity studies, to account for litter effects. Correlations across litter mates due to genetic and/or prenatal conditions can have considerable influence on the statistical significance of results (e.g. Holson et al. 2008; Li et al. 2008). To control for litter effects, either only one pup per sex and litter is submitted to each test/measurement in the study, or all pups are examined and litter effects are accounted for in the statistical analyses. For certain endpoints, e.g. malformations, it might be warranted to examine all pups, as this increases the statistical power and not all pups are identical. Similarly, examining many pups per litter greatly enhances the ability to detect low dose effects (OECD 2008). The size of the litter effect varies depending on the endpoint measured, the dose (being larger at high dose levels), and the chemical mode of action.
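One simple way of treating the litter as the statistical unit when all pups are examined is to reduce pup-level measurements to one value per litter before any group comparison. The sketch below shows this under stated assumptions. The record layout, group names and values are hypothetical, sex is ignored for brevity, and litter means are only one option (mixed-effects models with litter as a random effect are another).

```python
from collections import defaultdict
from statistics import mean

def litter_means(records):
    """Collapse pup-level records (group, litter_id, value) to one mean per litter."""
    by_litter = defaultdict(list)
    for group, litter_id, value in records:
        by_litter[(group, litter_id)].append(value)
    by_group = defaultdict(list)
    for (group, _litter_id), values in sorted(by_litter.items()):
        by_group[group].append(mean(values))   # one data point per litter
    return dict(by_group)

# Hypothetical pup-level data: (treatment group, litter, measured value).
pups = [
    ("control", "L1", 6.1), ("control", "L1", 6.3), ("control", "L2", 5.8),
    ("treated", "L3", 5.2), ("treated", "L3", 5.0), ("treated", "L3", 5.1),
    ("treated", "L4", 5.9),
]

means = litter_means(pups)
n_pups = len(pups)
n_litters = sum(len(values) for values in means.values())
print(f"{n_pups} pups, but only {n_litters} independent units (litters)")
```

When evaluating a study, a reported sample size that matches the number of pups rather than the number of litters is a signal that litter effects may not have been accounted for.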
In general, normality of the data should have been checked and the choice of parametric or non-parametric tests should have been based upon that result.
How to judge this criterion:
Good – the statistical methods have been clearly described and do not seem inappropriate, unusual or unfamiliar.
Deficient – unusual or unfamiliar methods were applied in the statistical analyses but do not seem clearly inappropriate.
Critically deficient – no statistical tests were used, or the tests used are clearly inappropriate for the study type and/or endpoints measured.
Purity100001580
The test compound or mixture was unlikely to contain any impurities that may significantly have affected the results of the study.
Guidance:
The purity of the test compound, or the composition of substances in a mixture, can potentially affect study results. Purity and composition are also important aspects to consider in terms of the relevance of the test compound to the compound being risk assessed. Ideally, in the case of single compounds, the test chemical should be of the highest available purity.
Significant impurities, or isomers of the test compound, are more likely to be present, and/or to impact toxicity for certain compounds. For example, PCBs (individual or in mixtures) are often contaminated with low levels of potentially highly toxic dioxins. The measured toxicity of the test compound may then be due to the contaminant. In such cases information about the level of purity and composition is critical.
How to judge this criterion:
Good – The test compound has been clearly identified and characterized and is of sufficient purity. In cases of mixtures, the composition of substances is well characterized and their individual purities are sufficient.
Deficient – The purity of the test compound has not been described but it is unlikely that impurities are present that would significantly affect the results of the study.
Critically deficient – The test compound or mixture is likely to contain impurities that can affect study results.
Negative control group100001584
A concurrent negative control group was included.
Guidance:
A concurrent negative control group should always be included as it is critical for determining treatment-related effects. The negative control group can be either untreated or vehicle-treated. However, in studies where a vehicle is used to administer the test compound it is critical that a vehicle-treated control group is included. In certain cases, it may be useful to also include a completely untreated group for identification of any influence on results from the vehicle. Control animals should be handled in the same way as treated animals. It is also important that animals in the control and treatment groups are the same age since some toxicological effects are age-dependent, e.g. may represent acceleration and/or enhancement of age-related changes.
Historical control data from the same laboratory, obtained using the same methods and relating to animals of the same strain, age, sex and supplier as those used in the study, may be very useful. However, such data should not provide the only negative control data for statistical analyses, as biological parameters in laboratory animals can vary significantly over time. Therefore, if a study includes only historical negative control data, this criterion should be judged as “Critically deficient”.
How to judge this criterion:
Good – a concurrent negative control group was included.
Critically deficient – no negative control was included or only a historical negative control was referred to.
Animal model100001586
A reliable and sensitive animal model was used for investigating the test compound and selected endpoints.
Guidance:
The choice of animal model (test species, strain, sex, etc.) is based on a number of considerations, including knowledge regarding species differences in terms of pharmacology, repeat-dose toxicology, metabolism, toxicokinetics and route of administration. Rodents (rats or mice) are commonly recommended for in vivo testing in current OECD test guidelines and are well characterized in terms of the reliability and sensitivity, as well as relevance to humans of different biological parameters and endpoints. Thus, it is specifically important that the study authors have justified their choice of animal model if other species have been used. It should be noted that, for investigation of certain endpoints, other species may be more sensitive and preferable. For example, rabbits are commonly recommended for teratology studies (OECD 2008). Similarly, available information about species differences in the toxicokinetics of a compound may warrant testing in a specific species. The evaluator is referred to regulatory test guidelines (e.g. OECD or US EPA) for discussions of the most appropriate test species for different study types.
Reliability, in this context, refers to whether the animal model has been shown to generate reproducible results for the type of endpoints investigated.
The sensitivity of the animal model relates to its ability to detect changes in the endpoints investigated. For example, different strains of rats may differ in their sensitivity to the effects of estrogenic compounds.
How to judge this criterion:
Good – The animal model used is not suspected to be insensitive or unreliable.
Deficient – a species other than that preferred in relevant guidance has been used without justification, but the animal model does not seem unreliable or insensitive.
Critically deficient – there is available information indicating that the animal model is either insensitive or clearly unreliable for studying the test compound or for investigating the endpoints considered, or the expected outcome is lacking from concurrent positive controls, if included, indicating that the test methods or animal model are insensitive.
Frequency of administration100001597
The frequency of administration was appropriate for investigating the included endpoints.
Guidance:
The appropriate frequency of administration depends on the dose, physicochemical properties and toxicokinetics of the test compound, as well as practical considerations.
“Responses produced by chemicals in humans and experimental animals may differ according to the quantity of the substance received and the duration and frequency of exposure, e.g. responses to acute exposures (a single exposure or multiple exposures occurring within twenty-four hours or less) may be different from those produced by subchronic and chronic exposures.” (OECD 2002)
How to judge this criterion:
Good – the frequency of administration of the test compound is in line with general recommendations for the study type and considering the inherent properties of the test compound. Or, if not in line with general recommendations, adjustments have been justified by the study authors.
Critically deficient – the frequency of administration of the test compound is significantly different from general recommendations for the study type without being justified, or seems inappropriate considering the inherent properties of the test compound.
Number of animals100001607
A sufficient number of animals per dose group were subjected to separate tests/data collection/measurements to generate reliable and valid results.
Guidance:
Sample size should be large enough to ensure sufficient statistical power to detect any effects in the endpoints measured. This includes considerations of the background incidence and variability of the measured effects, as well as the method of analysis. Excessive losses of animals in treatment groups that could affect statistical power should be noted.
OECD test guidelines provide recommendations for number of animals per treatment group for different study types and endpoint measurements. However, primary consideration should be given to justifications for sample size provided by study authors, if stated.
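To make the link between sample size, variability and statistical power concrete, the following is a minimal sketch of the standard normal-approximation formula for a two-sided comparison of two group means with equal group sizes, n = 2((z_{1-α/2} + z_{power})·σ/δ)². The function name, default values and example inputs are assumptions for illustration. The formula slightly underestimates the number needed for a t-test in small groups, and it is not a substitute for the group sizes recommended in the relevant test guideline.

```python
from math import ceil
from statistics import NormalDist

def animals_per_group(sd, difference, alpha=0.05, power=0.80):
    """Approximate animals per group needed to detect `difference` between two means.

    Normal approximation, two-sided test, equal group sizes, common standard deviation `sd`.
    """
    standard_normal = NormalDist()
    z_alpha = standard_normal.inv_cdf(1 - alpha / 2)   # critical value for the test
    z_power = standard_normal.inv_cdf(power)           # quantile for the desired power
    return ceil(2 * ((z_alpha + z_power) * sd / difference) ** 2)

# Hypothetical inputs: effect sizes of 1.0 and 0.5 standard deviations.
n_large_effect = animals_per_group(sd=1.0, difference=1.0)
n_small_effect = animals_per_group(sd=1.0, difference=0.5)
print(n_large_effect, n_small_effect)
```

Halving the detectable difference roughly quadruples the required group size, which is why high background variability or substantial animal losses can quickly erode a study's sensitivity.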
How to judge this criterion:
Good – a sufficient number of animals was included in the different treatment groups and loss of animals during the study is not likely to have substantially affected statistical power.
Deficient – a lower than usual number of animals was used, which may have caused lower sensitivity/statistical power of the study.
Critically deficient – the number of animals in each treatment group was clearly insufficient or there was substantial loss of animals during the study that may have affected statistical power.
Vehicle (in vivo)100001583
An appropriate vehicle was used that is not expected to interfere with the absorption, distribution, metabolism, excretion or toxicity of the test compound.
Guidance:
If information concerning the potential toxicity of the vehicle is missing from the study it should not automatically lead to the judgment “not fulfilled” or “cannot be determined” for this criterion. The approach to evaluation may be to consider if the vehicle is clearly inappropriate, e.g. potentially toxic or affecting skin permeability or toxicokinetics.
If information about which vehicle was used is completely missing the criterion should be judged as “cannot be determined” and a comment motivating this judgment should be included.
How to judge this criterion:
Good – water or another common and historically well characterized vehicle was used, the vehicle was appropriate considering the solubility of the test compound, and there are no other aspects that raise concern.
Deficient – the vehicle was not well characterized or is not commonly used in this context but there are no obvious concerns that it interferes with the absorption, distribution, metabolism, excretion or toxicity of the test compound.
Critically deficient – the vehicle used is inappropriate considering the solubility of the test compound or otherwise, or is clearly toxic.
Number of animals in cage100001589
The number of animals per sex in each cage was appropriate for the study type and animal model.
Guidance:
The number of animals housed together may have an effect on behavior and other biological parameters. Generally, laboratory animals should be housed in pairs or groups, unless the species is naturally solitary. Crowding should also be avoided as it induces stress that affects e.g. hormone levels and development.
Scientific and practical aspects connected to the type of study influence how animals are housed together. Recommendations and requirements for the number of animals per cage relevant for different study types can be found in OECD test guidelines and corresponding guidance documents. Single housing may be recommended in some cases, e.g. in acute toxicity tests and in inhalation studies using aerosol exposure. Individual housing may also be necessary e.g. for pregnant dams and for males after mating, as well as during certain procedures, such as the use of metabolism cages. According to general guidelines for laboratory animal science, single housing should be restricted to the shortest time possible.
Standardization of litter size by culling is sometimes conducted. Descriptions and recommendations for this procedure are provided in OECD test guidelines for developmental toxicity studies.
How to judge this criterion:
Good – the number of animals per sex and cage was in line with standard recommendations relevant to the study type and animal model.
Deficient – the number of animals per cage deviated somewhat from standard recommendations relevant to the study type and animal model; however, scientific and/or practical justifications for these deviations were provided.
Critically deficient – the number of animals per cage deviated significantly from standard recommendations relevant to the study type and animal model, no scientific or practical justification was provided, and this has clearly affected the results.
Contaminants100001590
The test system was unlikely to contain contaminants that could affect study results, such as organic pollutants, pesticide residues, heavy metals, and mycotoxins, as well as phytoestrogens.
Guidance:
Materials used in cages, water bottles and any physical enrichment should be considered, e.g. in terms of releasing substances that may affect study results.
It should be ensured as far as possible that feed and drinking water are free from contaminants, such as pesticide residues, persistent organic pollutants, heavy metals and mycotoxins, as well as phytoestrogens. Phytoestrogen content is specifically critical in studies where endocrine activity/disruption is being investigated. For guidance on appropriate phytoestrogen levels in feed see e.g. OECD TG 440. Ideally, feed and water should be tested for the presence of contaminants and phytoestrogens.
Similarly, the bedding material should be considered, especially if endocrine activity/disruption is being investigated, since it may contain naturally occurring estrogenic or antiestrogenic substances. E.g. corn cob appears to be antiestrogenic and affects cyclicity in rats (OECD 2007). Specifically, phytoestrogen content should be minimized in the bedding material in these cases.
A full report of possible contaminants is seldom provided in studies published in the peer-reviewed literature. This criterion may therefore often be judged as “Deficient” for such studies, and the impact of the lack of reporting on total study reliability should be carefully considered.
How to judge this criterion:
Good – no contaminants that could have influenced study results are suspected and/or feed, water, bedding and other materials have been analyzed and controlled for relevant contaminants.
Deficient – there may potentially be contaminants present.
Critically deficient – it is likely that the test system was contaminated in a way that could affect study results, e.g. a bedding material known to contain estrogenic or antiestrogenic substances was used in a study investigating endocrine endpoints.
Collection of measurements100001606
Measurements were collected at suitable time points in order to generate sensitive, valid and reliable data.
Guidance:
This criterion covers several aspects concerning the timing of measurements and collection of data. Overall, to avoid introducing potential bias and to generate robust data, it is most important that tests and/or measurements are performed under the same conditions in all treatment groups.
For in vivo tox studies:
1. Data should be collected at the correct time point in relation to the time needed to detect treatment related effects. In regard to specific developmental effects, these may only become apparent at a certain age, relating e.g. to behavioral ontogeny or onset of puberty. In addition, the time point for measurements and data collection should be chosen to avoid influence from any acute effects of the test substance administration (OECD 2008). OECD test guidelines provide recommendations for the timing of measurements and data collection in different study types.
2. Data should be collected at the same age across treated and control animals. For some developmental effects in rats and mice investigated during pregnancy or early after birth the time of day when measurements are performed is critical since development is rapid and differences between controls and treated animals may otherwise only represent differences related to (gestational) age (OECD 2008).
3. Data should be collected so that the time of day does not influence measurements. For example, behavioral testing of nocturnal animals such as mice and rats is likely to produce different responses during the day than during the night. For this reason, reversed lighting conditions may be applied so that nocturnal animals can be tested during the day.
For in vitro studies, the time points for the measurements should be adapted to the test system and endpoint.
How to judge this criterion:
Good – The timing of tests and measurements was appropriate to detect sensitive effects and there are no related aspects that are likely to influence the reliability of the results. For in vivo tox studies, conditions have been the same for treated and control animals.
Deficient – Some, but not all, aspects of timing were appropriate. Importantly, there are no critical issues that raise concern, such as control and treated animals being tested at different ages or time points in in vivo tox experiments.
Critically deficient – The timing of tests and measurements was not appropriate, e.g. it is likely that sensitive treatment-related effects have been missed, or there are other aspects that are likely to have influenced the reliability of the results, and/or control and treated groups were tested at different ages or time points.
Blinding or similar measures100001612
Did the study implement measures to reduce observational bias?
Prompting questions:
· Does the study report blinding or other methods/procedures for reducing observational bias?
· If not, did the study use a design or approach for which such procedures can be inferred?
· What is the expected impact of failure to implement (or report implementation) of these methods/procedures on results?
How to judge this criterion:
Good - The study reported that blinding or other similar procedures were performed wherever this was appropriate.
Deficient - The study did not report that blinding or similar procedures were performed where this would have been appropriate; however, the expected impact of failure to implement these methods is small.
Critically deficient - The study did not report that blinding or similar procedures were performed where this would have been appropriate (or explicitly stated that no blinding or similar measures were conducted) AND the expected impact of failure to implement them is significant.
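The three-level rule above can be read as a small decision function. The Python sketch below is illustrative only: the function name `rate_blinding` and its two boolean inputs are assumptions introduced here, and the judgment of expected impact remains with the evaluator.

```python
def rate_blinding(measures_reported_or_inferred: bool, impact_significant: bool) -> str:
    """Map two evaluator judgments to a rating (hypothetical helper, not part of the guidance)."""
    if measures_reported_or_inferred:
        return "Good"
    # Measures not reported (or explicitly not performed):
    # the rating then depends on the expected impact on the results.
    return "Critically deficient" if impact_significant else "Deficient"


print(rate_blinding(False, impact_significant=False))
```

The second input only matters when no blinding or similar measure was reported or can be inferred, which mirrors the order of the prompting questions.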
Test methods100001604
Reliable and sensitive test methods were used for investigating the selected endpoints.
Guidance:
The reliability of the methods refers to whether they are known to generate reproducible results for the type of endpoints investigated, e.g. if the methods have been validated across different laboratories.
The sensitivity of the methods relates to the ability to detect changes in the endpoints investigated.
In vivo toxicity studies conducted according to standardized and validated test guidelines (such as OECD test guidelines) are often considered to be reliable and adequate for risk assessment. However, it is important to keep in mind that adherence to standardized test guidelines does not automatically ensure the sensitivity of the methods applied. Further, sensitivity of the methods may in some cases be influenced by how the protocols are utilized (discussed in e.g. OECD 2008).
In many cases, details may be missing from the description of test methods hampering a full evaluation of their reliability and sensitivity. If most of the methods have been sufficiently reported, and it does not seem reasonable to set this criterion to 'not determined', it is suggested to judge it as 'partially fulfilled' and make a note in the comments field.
How to judge this criterion:
Good – there is no information that suggests that the test methods are insensitive or unreliable in this context.
Deficient – it is suspected that one or more of the methods applied may be insensitive or unreliable, OR test methods have not been sufficiently reported to fully evaluate their reliability and sensitivity.
Critically deficient – there is available information that indicates that one or more of the methods applied is either insensitive or clearly unreliable for studies of the test compound or for investigating the endpoints considered. Or the expected outcome is lacking from (concurrent) positive controls, if included, indicating that the methods (or animal model) are insensitive.
Attrition100001611
Did the study report results for all tested animals/replicates/repetitions?
Prompting questions:
- Are all animals/replicates/repetitions accounted for in the results?
- If there are discrepancies, do authors provide an explanation (e.g., in in vivo tox studies: death or unscheduled sacrifice during the study)?
- If unexplained results omissions and/or attrition are identified, what is the expected impact on the interpretation of the results?
How to judge this criterion:
Good - it is stated or can be inferred that there are no losses in animals/replicates/repetitions in the study
Deficient - there are discrepancies, however, the authors provide an explanation OR the authors do not provide an explanation but the expected impact on the interpretation of the results is small OR it is unclear/not reported whether there are any discrepancies
Critically deficient - There are substantial, unexplained losses of animals/replicates/repetitions that might have had a significant impact on the interpretation of the results
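One way to answer the first prompting question is to compare the number of animals or replicates allocated per group with the number appearing in the results. The Python sketch below assumes the evaluator has extracted both sets of counts by hand; the function name `unaccounted` and all numbers are made up for illustration.

```python
def unaccounted(allocated: dict[str, int], reported: dict[str, int]) -> dict[str, int]:
    """Return, per group, how many allocated animals/replicates lack reported results."""
    return {
        group: n - reported.get(group, 0)
        for group, n in allocated.items()
        if n != reported.get(group, 0)
    }


allocated = {"control": 10, "low": 10, "high": 10}  # made-up group sizes from the methods section
reported = {"control": 10, "low": 9, "high": 7}     # made-up n taken from the results tables
missing = unaccounted(allocated, reported)
print(missing)  # {'low': 1, 'high': 3}
```

Any non-empty result is a discrepancy to check against the authors' explanations; whether the loss is substantial is still a judgment for the evaluator.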
Housing conditions100001588
Housing conditions (temperature, relative humidity, light-dark cycle) were appropriate for the study type and animal model.
Guidance:
Housing conditions and handling may influence animal behavior and physiological response to stress and, consequently, study results. Importantly, variability in housing conditions may lead to increased variability in results and decreased sensitivity of the tests conducted.
Different housing conditions apply to different species and different types of studies. Descriptions of standard conditions may for example be found in OECD test guidelines relevant to different types of studies and in corresponding guidance documents. Guidance is also provided in the US National Research Council’s “Guide for the Care and Use of Laboratory Animals”.
Housing conditions are often incompletely reported in studies published in the peer-reviewed literature. It may therefore be useful to keep in mind that this criterion may often be judged as partially fulfilled for such studies, and the impact of lack of reporting on total study reliability should be carefully considered. If no housing conditions were reported, this criterion should be judged as “cannot determine” and a comment justifying the judgment should be made.
How to judge this criterion:
Good – housing conditions have been fully described and were in line with standard recommendations relevant to the study type and animal model.
Deficient – some of the housing conditions were in line with standard recommendations relevant to the study type and animal model. Others deviated from standard recommendations or were not reported.
Critically deficient – all housing conditions deviated from standard recommendations relevant to the study type and animal model and/or it is clear that deviations have influenced the results.
Reporting quality100000947
Please use the following guidance for rating reporting quality:
Good - the criterion is completely fulfilled
Deficient - the criterion is partially fulfilled (e.g. one aspect is fulfilled but another is not)
Critically deficient - the criterion is not fulfilled
Purity of compound100001537
The purity of the test compound was stated or is traceable according to information given regarding manufacturer and lot/batch number. In case of mixtures, the composition of different constituents was stated.
Bedding material100001551
The bedding material used was described.
Results presentation100001575
All results for the investigated endpoints were clearly reported. The most critical results were presented in tables and figures, including description of variation and statistically significant results.
Control group100001540
It was stated that a negative control / untreated group / vehicle control was included.
Vehicle100001539
The vehicle was described.
Individual identification100001543
The method for individual identification of animals was described.
Animal model (reporting)100001541
The animal model (species, strain, age or life stage and sex) was described.
Number of animals100001547
The number of animals per sex in each cage was stated.
Physical enrichment100001549
Any materials used for physical enrichment were described.
Allocation100001555
The method for allocating animals to different treatments was stated.
Water bottle material100001550
Water bottle materials were described.
Frequency and duration of dosing100001559
The frequency and duration of dosing/administration of the test compound were stated.
Route of administration100001557
The route of administration was stated.
Sex and age of animals at dosing100001558
The sex and age (or life stage) of the animals at start of dosing was stated or is obvious from the information given, e.g. “pregnant rats were used” is enough information that animals are female and sexually mature/adult.
Competing interests100001579
Any competing interests were disclosed or it was explicitly stated that the authors did not have any competing interests.
Animals subjected to separate tests100001572
The sex, age and number of animals per dose group subjected to separate tests and measurements were stated.
Allocation to different tests100001571
The method for allocating animals to different tests and measurements (e.g. tissue collection or evaluation of functional or behavioural endpoints) was stated.
Test/analytical methods100001570
The tests and/or analytical methods used were sufficiently described to allow for evaluation of reliability of results.
Statistical unit100001577
The statistical unit, e.g. the individual or the litter, was stated.
Statistical methods and software100001576
The statistical methods and software used were described.
Funding100001578
The funding sources for the study were stated.
Chemical name100001536
The chemical name, ID or CAS number of the test compound was given.
Housing temperature100001544
The housing temperature was stated.
Relative humidity100001545
The relative humidity was stated.
Light-dark cycle100001546
The light-dark cycle was described.
Cage material100001548
The cage materials were described.
Feed100001552
The type and source of feed were reported.
Drinking water100001553
The source of drinking water was reported.
Dose levels100001554
The administered dose levels or concentrations were stated.
Number of animals in dose groups100001556
The total number of animals per dose group was stated.
Overall Study Confidence (In vivo / in vitro)100000949
Overall confidence (in vivo / in vitro)100001610
RATING GUIDANCE: Once the evaluation domains have been classified, these ratings will be combined to reach an overall study confidence classification of High, Medium, Low, or Uninformative. In case of multiple endpoints or groups of endpoints, a rationale for each one should be given separately.
This classification will be based on the classifications in the evaluation domains (reporting, methodology), and will include consideration of the likely impact of the noted deficiencies in bias and sensitivity on the results. In general, individual criteria in the reporting quality domain are regarded as less important compared to those in the methodological quality domain and this should be taken into account when arriving at the final confidence rating.
Studies with a high number of critical deficiencies, especially on criteria that are deemed very important (e.g. failure to report the chemical name, route of exposure, concentration used) will be classified as Uninformative.
Other classifications will generally follow a sorting such that High Confidence studies would have the highest evaluation ('Good') for all or most domains; Low Confidence studies would have a 'Deficient' evaluation for a high number of criteria (unless the impact of the particular limitation(s) is judged to be unlikely to be severe), and Medium Confidence studies are in between these groups.
Example ratings:
High confidence:
Reproductive and developmental effects other than behavior [High Confidence]: The study was well-designed for the evaluation of reproductive and developmental toxicity induced by chemical exposure. The study applied established approaches, recommendations, and best practices, and employed an appropriate exposure design for these endpoints. Evidence was presented clearly and transparently.
Behavioral measures [Low Confidence]: The cursory cage-side observations of motor-related activity are considered to be insensitive methods for detecting behavioral effects, with a strong bias towards the null.
Medium confidence:
Developmental effects [all Medium Confidence]: The study was adequately designed for the evaluation of developmental toxicity. Although the authors failed to describe randomized allocation of animals to exposure groups and some concerns were raised regarding the sensitivity (i.e., timing) and sample sizes (i.e., n=6 litters/group) used for the evaluation of potential effects on male reproductive system development with gestational exposure, these limitations are expected to have a minimal impact on the results.
Lower confidence
Developmental effects [Low Confidence]: Substantial concerns were raised regarding quantitative analyses without addressing potential litter effects. Other significant limitations included incomplete data presentation (sample sizes for outcome assessment were unclear; no information on maternal toxicity was provided), and methods for selection of animals for outcome assessment.
Histopathology [Medium Confidence]: The study authors did not report information on the severity of histological effects for which this is routinely provided. The authors also failed to describe use of methods to reduce potential observational bias.
Sperm Measures [Uninformative]: Issues were identified with the methods used to prepare samples for analysis, which are likely to introduce artifacts. Concerns were also raised regarding results presentation (i.e., lack of group variability), missing information on sample sizes and loss of animals, and a lack of information on the timing of these evaluations. Taken together, the evaluation of this endpoint was considered uninformative.
Uninformative
Example 1: Critical information was not reported. Specifically, the study authors did not report the duration of the exposure or the results (qualitative or quantitative). Given this critical deficiency, the other domains were not evaluated.
Example 2: Concerns were raised over the lack of information on test animal strain and allocation, and chemical source/purity. The lack of information on blinding or other methods to reduce observational bias is also of significant concern for the endpoints of interest (i.e., follicle counts, ova counts, and evaluation of estrous cyclicity). Finally, concerns were also raised over the apparent self-plagiarism in similar chromium studies published in 1996 by this group of authors. Taken together, this combination of limitations resulted in an interpretation that the results were unreliable.
Selection and Performance100000937
Participant selection100001521
EXAMPLE TEXT: Adequate. Nested case-control design in Mexico City birth cohort with 30 cases of preterm birth and 30 controls selected randomly from the same population of women who were recruited during prenatal visits at one of four clinics (serving low to moderate income population). Recruitment and eligibility criteria (inclusion/exclusion criteria) discussed. Little discussion of participants versus nonparticipants but the available information indicates that differential selection is possible but not likely. Participation rate reported to be low (36%). Evaluates the vulnerable population of low-moderate income pregnant women.
Add other concerns or limitations.
Add impact and direction to effect estimate, if applicable.
RATING GUIDANCE: Is there evidence that selection into or out of the study (or analysis sample) was jointly related to exposure and to outcome?
Study design, where and when was the study conducted, and who was included? Recruitment process, exclusion and inclusion criteria, type of controls, total eligible, comparison between participants and nonparticipants (or followed and not followed), final analysis group. Does the study include potential vulnerable/susceptible groups or lifestages?
Exposure Methods100000940
Exposure measures100001526
EXAMPLE TEXT: Poor for long-chained (DEHP, DiNP) and adequate for short-chained (DEP, DBP, DiBP) phthalate metabolites based on number of samples. A single spot (second morning void) urine sample was collected from each woman during a third-trimester visit to the project's research center; third trimester sample is relevant to later term preterm births. Analytical approach described and appropriate. High percent >LOD.
Add other concerns or limitations.
Add impact and direction to effect estimate, if applicable.
RATING GUIDANCE: Does the exposure measure reliably distinguish between levels of exposure in a time window considered most relevant for a causal effect with respect to the development of the outcome?
Source(s) of exposure (consumer products, occupational, an industrial accident) and source(s) of exposure data, blinding to outcome, level of detail for job history data, when measurements were taken, type of biomarker(s), assay information, reliability data from repeat measures studies, validation studies.
Outcome Methods/Results Presentation100000941
Outcome measures100001529
EXAMPLE TEXT: Adequate. Preterm birth defined by length of gestation (< 37 weeks), a standard measure of birth outcome, estimated by maternal recall of the date of last menstrual period, rather than the preferred early ultrasound. Potential misclassification of preterm cases due to maternal recall of last menstrual period to estimate gestational age which may be nondifferential with respect to exposure; however, differential misclassification is still possible but unlikely.
Add other concerns or limitations.
Add impact and direction to effect estimate, if applicable.
RATING GUIDANCE: Does the outcome measure reliably distinguish the presence or absence (or degree of severity) of the outcome?
Source of outcome (effect) measure, blinding to exposure status or level, how measured/classified, incident versus prevalent disease, evidence from validation studies, prevalence (or distribution summary statistics for continuous measures).
Confounding100000942
Confounding100001530
EXAMPLE TEXT: Adequate. Information on key confounders was collected through questionnaire. The strategy for evaluating confounding and the process for retaining variables in the models was described. Rationale for selecting confounders not provided. Inclusion in model not solely based on statistical significance. Adjustment for relevant co-exposures.
Add other concerns or limitations.
Add impact and direction to effect estimate, if applicable.
RATING GUIDANCE: Is confounding of the effect of the exposure unlikely?
Background research on key confounders for specific populations or settings; participant characteristic data, by group; strategy/approach for consideration of potential confounding; strength of associations between exposure and potential confounders and between potential confounders and outcome; degree of exposure to the confounder in the population.
Analysis100000943
Does the analysis strategy and presentation convey the necessary familiarity with the data and assumptions?100001531
EXAMPLE TEXT: Adequate. Multivariable (multivariate) logistic regression used to take into account potential confounding variables; quantitative results presented (ORs and 95% CIs with ORs adjusted for confounders). Imputation techniques used when phthalate metabolite concentrations were below the LOD (filling in values where none were available); amount of missing data not noted; dichotomous exposure (reduced sensitivity) and use of median as the cut-off adjusted for urine creatinine and specific gravity to assess effect of method used.
Add other concerns or limitations.
Add impact and direction to effect estimate, if applicable.
RATING GUIDANCE: Does the analysis strategy and presentation convey the necessary familiarity with the data and assumptions?
Extent (and if applicable, treatment) of missing data for exposure, outcome, and confounders, approach to modeling, classification of exposure and outcome variables (continuous versus categorical), testing of assumptions, sample size for specific analyses, relevant sensitivity analyses.
An ideal study would convey a thoughtful and thorough description of the analytical approach, and descriptive data for key variables (e.g., exposure measures, outcome measures), including the amount of missing data (or proportion less than the limit of detection [LOD]). The ideal analysis would use an appropriate and well thought out modeling approach for the study design (e.g., logistic regression for case-control data) and specify the covariates used in the final model; the methods should be described in enough detail such that they could be applied to the data from another study. In addition, the results should be presented with sufficient detail to enable estimation of effect estimates and precision of the estimates (e.g., standard error [SE] or confidence interval [CI]).
Selective Reporting100000944
Selective Reporting100001532
Selective Reporting
EXAMPLE TEXT: Adequate. No concerns for selective reporting.
RATING GUIDANCE: Is there concern for selective reporting?
Rating should be 2-level - Adequate or Deficient.
Are results presented with adequate detail for all the endpoints of interest? Are results presented for the full sample as well as for specified subgroups? Were stratified analyses (effect modification) motivated by a specific hypothesis?
Sensitivity100000945
Are there concerns for study sensitivity100001533
Sensitivity
EXAMPLE TEXT: Deficient. Small sample size/ Potential nondifferential misclassification of outcome and exposure. Low exposure levels. Range of exposure is narrow. Healthy worker effect.
Add other concerns or limitations.
Add impact and direction to effect estimate, if applicable.
RATING GUIDANCE: Are there concerns for study sensitivity?
What exposure range is spanned in this study? What are the ages of participants (e.g., not too young in studies of pubertal development)? What is the length of follow-up (for outcomes with long latency periods)? Choice of referent group and the level of exposure contrast between groups (i.e., the extent to which the 'unexposed group' is truly unexposed, and the prevalence of exposure in the group designated as 'exposed'). Is the study relevant to the exposure and outcome of interest?
Overall Study Confidence (Epidemiology)100000946
Overall confidence (epi)100001535
EXAMPLE TEXT: Low confidence. Give brief rationale for rating.
Add other concerns or limitations.
Add impact and direction to effect estimate, if applicable.
RATING GUIDANCE: Once the evaluation domains have been classified, these ratings will be combined to reach an overall study confidence classification of High, Medium, Low, or Uninformative.
This classification will be based on the classifications in the evaluation domains, and will include consideration of the likely impact of the noted deficiencies in bias and sensitivity on the results. Studies with critical deficiencies in any evaluation domain will be classified as Uninformative. Other classifications will generally follow a sorting such that High Confidence studies would have the highest evaluation ('Good') for all or most domains; Low Confidence studies would have a 'Poor' evaluation for one or more domains (unless the impact of the particular limitation(s) is judged to be unlikely to be severe), and Medium Confidence studies are in between these groups (e.g., most domains receiving a mid-level Adequate evaluation, with no limitations judged to be severe). Once initial evaluation has been performed with consensus between reviewers, the classifications will be re-evaluated, looking at the variability 'within' and 'between' levels to ensure that the separation between the levels of confidence is appropriate and that no additional criteria need to be considered.
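The sorting described above can be sketched as a first-pass rule that reviewers then adjust by judgment. The Python below is an assumption-laden illustration, not a prescribed algorithm: the function name, the reading of "all or most" as strictly more than half of the domains, the treatment of 'Deficient' as equivalent to 'Poor', and the `poor_impact_minor` flag are all introduced here for the example.

```python
LOW_LABELS = {"Poor", "Deficient"}  # assumption: treated as the same level


def overall_confidence(domain_ratings: dict[str, str], poor_impact_minor: bool = False) -> str:
    """First-pass overall classification from per-domain ratings (hypothetical helper)."""
    ratings = list(domain_ratings.values())
    if "Critically deficient" in ratings:
        return "Uninformative"
    # A 'Poor' domain gives Low confidence unless its impact is judged unlikely to be severe.
    if any(r in LOW_LABELS for r in ratings) and not poor_impact_minor:
        return "Low"
    # Assumption: "all or most" domains means strictly more than half.
    if ratings.count("Good") > len(ratings) / 2:
        return "High"
    return "Medium"


example = {
    "Participant selection": "Adequate",
    "Exposure measures": "Poor",
    "Outcome measures": "Adequate",
    "Confounding": "Adequate",
}
print(overall_confidence(example))  # Low
```

The output is only a starting point for the consensus step and the later re-evaluation across studies.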
Methodological quality100000948
Important note:
In case multiple endpoints or groups of endpoints are presented in the study, consider evaluating each of them separately and provide explanations in the comment.
Untreated/vehicle control100001585
An untreated or vehicle control was included.
Guidance:
An untreated or vehicle control should always be included as it is critical for determining treatment-related effects. Control samples should be handled in the same way as treated samples.
How to judge this criterion:
Good – An untreated or vehicle control was included.
Deficient – It is not explicitly stated that an untreated or vehicle control was included, but it is likely that it was included.
Critically deficient – no untreated or vehicle control was included.
Duration of exposure100001601
The duration of exposure was suitable for the test system and investigated endpoints.
Guidance:
The duration of exposure should be adapted to the test system and endpoint.
How to judge this criterion:
Good – the duration of exposure of the test compound is clearly suitable for the test system and endpoint.
Deficient – the duration of exposure of the test compound is suitable for the test system and endpoint with some deviation.
Critically deficient - the duration of exposure of the test compound is not suitable for the test system and endpoint.
Cytotoxicity100001608
Cytotoxicity was measured and the test compound did not cause cytotoxicity that significantly affected the results.
Guidance:
Cytotoxicity might significantly affect study results and conclusions should only be made based on conditions (concentration of test compound and exposure duration) that do not cause significant cytotoxicity.
How to judge this criterion:
Good – cytotoxicity was measured and the test compound did not cause cytotoxicity at the relevant concentrations and exposure time.
Deficient – cytotoxicity was measured and the test compound caused minor cytotoxicity at the relevant concentrations and exposure time that did not affect the results.
Critically deficient - Cytotoxicity was not measured. Or cytotoxicity was measured and the test compound caused cytotoxicity at the relevant concentrations and exposure time.
Solubility of compound100001581
It was likely that the test compound was soluble at the concentrations used.
Guidance:
Solubility of the test compound determines the maximum concentration of the compound that can be dissolved in the vehicle used for the test system. Above that concentration the substance might precipitate, so that the real concentration is lower than anticipated. Solubility of the test compound can be determined by visually inspecting the solution. More advanced methods such as HPLC and UV spectroscopy can also be used (OECD 2017).
How to judge this criterion:
Good – The solubility of the test compound at the concentration used has been determined, by visual inspection or other methods. Or the evaluator knows that the test compound is soluble at the concentration used.
Deficient – The solubility of the test compound at the concentration used has not been determined and the evaluator has no previous knowledge of the solubility of the test compound, but based on the methods description it is likely that the test compound was soluble at the concentrations used.
Critically deficient – It is stated that the test compound was insoluble or the evaluator knows that the test compound is insoluble at the concentration used.
Vehicle (in vitro)100001582
An appropriate vehicle was used that is not expected to interfere with the results of the study at the concentration used.
Guidance:
A vehicle is “any agent which serves as a carrier used to mix, disperse, or solubilize the test item or reference item to facilitate the administration/application to the test system” (OECD 1998). The choice of vehicle will be determined by the solubility of the test compound, as well as the test system used.
The test compound is usually dissolved in ethanol or DMSO as vehicle. The final vehicle concentration in the test system is commonly <=1% (OECD 2017).
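When a study reports only pipetting volumes, the final vehicle concentration can be worked out with simple arithmetic. The Python sketch below assumes a volume/volume percentage and uses made-up volumes; the function name `vehicle_percent_vv` is hypothetical.

```python
def vehicle_percent_vv(vehicle_volume_ul: float, final_volume_ul: float) -> float:
    """Final vehicle concentration as % (v/v) in the test system."""
    return 100.0 * vehicle_volume_ul / final_volume_ul


# Made-up example: 2 uL of a DMSO stock added to give 1000 uL final volume per well.
pct = vehicle_percent_vv(2.0, 1000.0)
status = "within" if pct <= 1.0 else "above"
print(f"{pct:.2f}% v/v, {status} the commonly used 1% level")
```

A value above 1% does not by itself settle the rating; it signals that the evaluator should check whether the vehicle could have interfered with the results.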
How to judge this criterion:
Good – ethanol, DMSO or water or another common and historically well-characterized vehicle was used, the vehicle concentration was appropriate and it is not expected to interfere with the results.
Deficient – the vehicle was not well characterized or is not commonly used in this context, or the vehicle concentration was not clearly stated, but there are no obvious concerns that it interferes with the results.
Critically deficient – the vehicle used was clearly interfering with the results or the vehicle concentration was too high.
Concentrations used100001602
The concentrations used were suitable for the test system and investigated endpoints.
Guidance:
The concentrations used should be adapted to the test system and endpoint.
How to judge this criterion:
Good – the concentrations of the test compound are clearly suitable for the test system and endpoint.
Deficient – the concentrations of the test compound are suitable for the test system and endpoint with some deviation.
Critically deficient - the concentrations of the test compound are not suitable for the test system and endpoint.
Replicates/repetitions100001605
Sufficient numbers of replicates or repetitions of the experiment were used to generate reliable and valid results.
Guidance:
Sample size should be large enough to ensure sufficient statistical power to detect any effects in the endpoints measured.
How to judge this criterion:
Good – a sufficient number of replicates or repetitions of the experiment were included.
Deficient – a lower than usual number of replicates were used, which may have caused lower sensitivity/statistical power of the study.
Critically deficient – the number of replicates was clearly insufficient.
Statistical methods and software100001609
The statistical methods were clearly described and do not seem inappropriate, unusual or unfamiliar.
Guidance:
The choice of statistical analyses will depend on the type of study and the nature of the endpoints measured.
OECD test guidelines and corresponding guidance documents provide some recommendations for statistical tests (e.g. OECD’s Guidance notes for analysis and evaluation of chronic toxicity and carcinogenicity studies and Guidance Notes for Analysis and Evaluation of Repeat-Dose Toxicity Studies) as well as for considerations to be made in statistical analyses of different types of tests.
Evaluation of this criterion also includes considering if the correct statistical unit was used. For example in in vivo tox studies, it is generally recommended that the litter (or dam) is the statistical unit in developmental toxicity studies to account for litter effects. Correlations across litter mates due to genetic and/or prenatal conditions can have considerable influence on the statistical significance of results (e.g. Holson et al. 2008; Li et al. 2008). To control for litter effects, either only one pup per sex and litter is submitted to each test/measurement in the study, or all pups are examined and litter effects are accounted for in the statistical analyses. For certain endpoints, e.g. malformations, it might be warranted to examine all pups as it increases the statistical power and not all pups are identical. Similarly, examining many pups per litter greatly enhances the ability to detect low dose effects (OECD 2008). The size of litter effect varies depending on endpoint measured, dose (being larger at high dose levels), and chemical mode of action.
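To make the statistical-unit point concrete, the Python sketch below collapses pup-level values to one mean per litter, so that the sample size is the number of litters rather than the number of pups. The body weights are made up, and litter means are only one simple way of accounting for litter effects; a study may equally use other approaches, such as models with litter as a random effect.

```python
from statistics import mean

# Made-up pup body weights (g), keyed by (dose group, litter ID).
pups = {
    ("control", "L1"): [6.1, 6.3, 5.9],
    ("control", "L2"): [6.8, 6.6],
    ("treated", "L3"): [5.2, 5.4, 5.0, 5.2],
    ("treated", "L4"): [5.9, 6.1],
}

# Collapse pups to one value per litter, so litters (not pups) are the units analysed.
litter_means: dict[str, list[float]] = {}
for (group, _litter), weights in pups.items():
    litter_means.setdefault(group, []).append(mean(weights))

for group, means in litter_means.items():
    n_pups = sum(len(w) for (g, _), w in pups.items() if g == group)
    print(f"{group}: n = {len(means)} litters (from {n_pups} pups)")
```

If a paper reports n as the number of pups while dosing was applied to dams, this kind of recount shows how much smaller the effective sample size is.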
In general, normality of the data should have been checked and the choice of parametric or non-parametric tests should have been based upon that result.
How to judge this criterion:
Good – the statistical methods have been clearly described and do not seem inappropriate, unusual or unfamiliar.
Deficient – unusual or unfamiliar methods were applied in the statistical analyses but do not seem clearly inappropriate.
Critically deficient – no statistical tests were used, or the tests used are clearly inappropriate for the study type and/or endpoints measured.
Purity100001580
The test compound or mixture was unlikely to contain any impurities that may significantly have affected the results of the study.
Guidance:
Purity of the test compound, or the composition of substances in a mixture, can potentially affect study results. Purity and composition are also important aspects to consider in terms of the relevance of the test compound to the compound being risk assessed. Ideally, in the case of single compounds, the test chemical should be of the highest available purity.
Significant impurities, or isomers of the test compound, are more likely to be present, and/or to impact toxicity for certain compounds. For example, PCBs (individual or in mixtures) are often contaminated with low levels of potentially highly toxic dioxins. The measured toxicity of the test compound may then be due to the contaminant. In such cases information about the level of purity and composition is critical.
How to judge this criterion:
Good – The test compound has been clearly identified and characterized and is of sufficient purity. In cases of mixtures, the composition of substances is well characterized and their individual purities are sufficient.
Deficient – The purity of the test compound has not been described but it is unlikely that impurities are present that would significantly affect the results of the study.
Critically deficient - The test compound or mixture is likely to contain impurities that can affect study results.
Test conditions100001603
The test conditions during and after exposure to the test compound were suitable (media and serum used, cell density, incubation temperature, humidity, CO2 concentration).
Guidance:
Test conditions during and after exposure to the test compound should be appropriate. This includes e.g. media and serum used, cell density, incubation temperature, humidity, CO2 concentration.
How to judge this criterion:
Good – Test conditions during and after exposure to the test compound have been fully described and were appropriate.
Deficient – Most of the test conditions during and after exposure to the test compound were described and were appropriate. Others deviated from standard recommendations or were not reported.
Critically deficient – Most of the test conditions during and after exposure to the test compound were not described or were not in line with standard recommendations.
Collection of measurements100001606
Measurements were collected at suitable time points in order to generate sensitive, valid and reliable data.
Guidance:
This criterion covers several aspects concerning the timing of measurements and collection of data. Overall, to avoid introducing potential bias and to generate robust data, it is most important that tests and/or measurements are performed under the same conditions in all treatment groups.
For in vivo tox studies:
1. Data should be collected at the correct time point in relation to the time needed to detect treatment related effects. In regard to specific developmental effects, these may only become apparent at a certain age, relating e.g. to behavioral ontogeny or onset of puberty. In addition, the time point for measurements and data collection should be chosen to avoid influence from any acute effects of the test substance administration (OECD 2008). OECD test guidelines provide recommendations for the timing of measurements and data collection in different study types.
2. Data should be collected at the same age across treated and control animals. For some developmental effects in rats and mice investigated during pregnancy or early after birth the time of day when measurements are performed is critical since development is rapid and differences between controls and treated animals may otherwise only represent differences related to (gestational) age (OECD 2008).
3. Data should be collected so that the time of day does not influence measurements. For example, behavioral testing in nocturnal animals such as mice and rats is likely to produce different responses during the day than during the night. For this reason, reversed lighting conditions may be applied so that nocturnal animals can be tested during the day.
For in vitro studies, the time points for the measurements should be adapted to the test system and endpoint.
How to judge this criterion:
Fulfilled – The timing of tests and measurements was appropriate to detect sensitive effects and there are no related aspects that are likely to influence the reliability of the results. For in vivo tox studies, conditions have been the same for treated and control animals.
Deficient – Some, but not all, aspects of timing were appropriate. Importantly, there are no critical issues that raise concern, such as control and treated animals being tested at different ages/time points in in vivo animal tox experiments.
Critically deficient - The timing of tests and measurements was not appropriate, e.g. it is likely that sensitive treatment-related effects have been missed, or there are other aspects that are likely to have influenced the reliability of the results, and/or control and treated groups were tested at different ages/time points.
Blinding or similar measures100001612
Did the study implement measures to reduce observational bias?
Prompting questions:
· Does the study report blinding or other methods/procedures for reducing observational bias?
· If not, did the study use a design or approach for which such procedures can be inferred?
· What is the expected impact of failure to implement (or report implementation) of these methods/procedures on results?
How to judge this criterion:
Good - The study reported that blinding or other similar procedures were performed wherever they were adequate.
Deficient - The study did not report that blinding or similar procedures were performed where they would have been adequate; however, the expected impact of failure to implement these methods is small.
Critically deficient - The study did not report that blinding or similar procedures were performed where they would have been adequate (or explicitly stated that no blinding or similar measures were conducted) AND the expected impact of failure to implement them is significant.
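The three rating definitions for this criterion amount to a two-question decision rule. The sketch below expresses it in code. The function name `rate_blinding` and its parameters are hypothetical, and both inputs remain reviewer judgments rather than values that can be derived automatically.

```python
def rate_blinding(reported, impact_significant):
    """Map the two reviewer judgments to a rating for this criterion.

    reported: blinding or similar procedures were reported where adequate.
    impact_significant: expected impact of not implementing them is significant.
    """
    if reported:
        return "Good"
    if impact_significant:
        # Not reported (or explicitly not done) AND significant expected impact.
        return "Critically deficient"
    # Not reported, but the expected impact is small.
    return "Deficient"


print(rate_blinding(reported=False, impact_significant=False))
```

The expected impact only matters when blinding was not reported, which is why `impact_significant` is checked second.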
Test methods100001604
Reliable and sensitive test methods were used for investigating the selected endpoints.
Guidance:
The reliability of the methods refers to whether they are known to generate reproducible results for the type of endpoints investigated, e.g. if the methods have been validated across different laboratories.
The sensitivity of the methods relates to the ability to detect changes in the endpoints investigated.
In vivo toxicity studies conducted according to standardized and validated test guidelines (such as OECD test guidelines) are often considered to be reliable and adequate for risk assessment. However, it is important to keep in mind that adherence to standardized test guidelines does not automatically ensure the sensitivity of the methods applied. Further, sensitivity of the methods may in some cases be influenced by how the protocols are utilized (discussed in e.g. OECD 2008).
In many cases, details may be missing from the description of test methods hampering a full evaluation of their reliability and sensitivity. If most of the methods have been sufficiently reported, and it does not seem reasonable to set this criterion to 'not determined', it is suggested to judge it as 'partially fulfilled' and make a note in the comments field.
How to judge this criterion:
Good – there is no information that suggests that the test methods are insensitive or unreliable in this context.
Deficient – it is suspected that one or more of the methods applied may be insensitive or unreliable, OR test methods have not been sufficiently reported to fully evaluate their reliability and sensitivity.
Critically deficient – there is available information that indicates that one or more of the methods applied is either insensitive or clearly unreliable for studies of the test compound or for investigating the endpoints considered. Or the expected outcome is lacking from (concurrent) positive controls, if included, indicating that the methods (or animal model) are insensitive.
Attrition100001611
Did the study report results for all tested animals/replicates/repetitions?
Prompting questions:
- Are all animals/replicates/repetitions accounted for in the results?
- If there are discrepancies, do authors provide an explanation (e.g., in in vivo tox studies: death or unscheduled sacrifice during the study)?
- If unexplained results omissions and/or attrition are identified, what is the expected impact on the interpretation of the results?
How to judge this criterion:
Good - it is stated or can be inferred that there are no losses in animals/replicates/repetitions in the study
Deficient - there are discrepancies, however, the authors provide an explanation OR the authors do not provide an explanation but the expected impact on the interpretation of the results is small OR it is unclear/not reported whether there are any discrepancies
Critically deficient - There are substantial, unexplained losses of animals/replicates/repetitions that might have had a significant impact on the interpretation of the results
Conditions for cultivation100001600
Conditions for cultivation and/or maintenance of the cell line / cells / tissue / organ /embryo (incubation temperature, humidity, CO2 concentration, media used, number of cell passages, control of contamination) were appropriate.
Guidance:
Conditions for cultivation and/or maintenance of the cell line / cells / tissue / organ /embryo should be appropriate. This includes e.g. incubation temperature, humidity, CO2 concentration, media used, number of cell passages, control of contamination. Guidance for Good Cell Culture Practice (GCCP) describes the best practice.
How to judge this criterion:
Good – Conditions for cultivation and/or maintenance have been fully described and were in line with standard recommendations for the test system.
Deficient – Most of the conditions for cultivation and/or maintenance were described and are in line with standard recommendations for the test system. Others deviated from standard recommendations or were not reported.
Critically deficient – Most of the conditions for cultivation and/or maintenance were not described or were not in line with standard recommendations for the test system.
Test system100001599
A reliable and sensitive test system (cell line / cells / tissue / organ /embryo) with metabolic competence, if relevant, was used for investigating the test compound and endpoints.
Guidance:
The choice of test system (cell line / cells / tissue / organ /embryo) is based on a number of considerations, including knowledge regarding metabolism and mode of action.
Reliability, in this context, refers to whether the test system has been shown to generate reproducible results for the type of endpoints investigated.
The sensitivity of the test system relates to the ability to detect changes in the endpoints investigated in the model.
How to judge this criterion:
Good – The test system used is well-established or the reliability and sensitivity of the test system are clearly described. In case metabolism of the test compound is relevant for the toxicity, the metabolic competence of the test system is well-established or it is clearly stated that the test system has the appropriate metabolic competence.
Deficient – It is likely that the test system is reliable and sensitive but it is not a well-established system and the reliability and sensitivity of the test system are not clearly described. In case metabolism of the test compound is relevant for the toxicity, it is likely that the test system has the appropriate metabolic competence, although it is not clearly stated.
Critically deficient – there is available information that indicates that the test system is either insensitive or clearly unreliable for studying the test compound or for investigating the endpoints considered. Or the expected outcome is lacking from positive control, if included. Or the test system lacks metabolic competence required to assess the toxicity of the test compound.
Reporting quality100000947
Please use the following guidance for rating reporting quality:
Good - the criterion is completely fulfilled
Deficient - the criterion is partially fulfilled (e.g. one aspect is fulfilled but another is not)
Critically deficient - the criterion is not fulfilled
Purity of compound100001537
The purity of the test compound was stated or is traceable according to information given regarding manufacturer and lot/batch number. In case of mixtures, the composition of different constituents was stated.
Results presentation100001575
All results for the investigated endpoints were clearly reported. The most critical results were presented in tables and figures, including description of variation and statistically significant results.
Solubility of compound100001538
The solubility of the test compound was described.
Test system100001542
The test system (cell line / cells/ tissue / organ / embryo) was described.
Control group100001540
It was stated that a negative control / untreated group / vehicle control was included.
Vehicle100001539
The vehicle was described.
Source of test system100001561
The source of the test system was stated.
Metabolic competence100001562
Metabolic competence of the test system was described.
Competing interests100001579
Any competing interests were disclosed or it was explicitly stated that the authors did not have any competing interests.
Replicates/repetitions100001569
The number of replicates per dose level/concentration or the number of times the experiment was repeated was stated.
Test/analytical methods100001570
The tests and/or analytical methods used were sufficiently described to allow for evaluation of reliability of results.
Media composition100001564
Composition of media was described, including use of serum, antibiotics, etc.
Statistical methods and software100001576
The statistical methods and software used were described.
Cytotoxicity100001574
It was stated that the effect of the test compound on cytotoxicity was measured.
Data collection timepoints100001573
The time points for data collection were stated.
Contamination100001566
Measures taken for avoiding or screening for contamination by mycoplasma, bacteria, fungi and virus were described.
Funding100001578
The funding sources for the study were stated.
Number of cell passages100001563
The number of cell passages of the cell line used was stated. (Rate with 'good' if no cell lines were used and include a comment.)
Incubation parameters100001565
Incubation temperature, humidity, and CO2 concentration were described.
Cell density100001567
Cell density or number of cells used during treatment was described. (Remove this criterion if the study was not conducted in a cell line.)
Treatment duration100001568
The duration of treatment was stated.
Chemical name100001536
The chemical name, ID or CAS-number of the test compound was given.
Dose levels100001554
The administered dose levels or concentrations were stated.
Overall Study Confidence (In vivo / in vitro)100000949
Overall confidence (in vivo / in vitro)100001610
RATING GUIDANCE: Once the evaluation domains have been classified, these ratings will be combined to reach an overall study confidence classification of High, Medium, Low, or Uninformative. In case of multiple endpoints or groups of endpoints, a rationale for each one should be given separately.
This classification will be based on the classifications in the evaluation domains (reporting, methodology), and will include consideration of the likely impact of the noted deficiencies in bias and sensitivity on the results. In general, individual criteria in the reporting quality domain are regarded as less important compared to those in the methodological quality domain and this should be taken into account when arriving at the final confidence rating.
Studies with a high number of critical deficiencies, especially on criteria that are deemed very important (e.g. failure to report the chemical name, route of exposure, concentration used) will be classified as Uninformative.
Other classifications will generally follow a sorting such that High Confidence studies would have the highest evaluation ('Good') for all or most domains; Low Confidence studies would have a 'Deficient' evaluation for a high number of criteria (unless the impact of the particular limitation(s) is judged to be unlikely to be severe), and Medium Confidence studies are in between these groups.
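Because the overall classification rests on how many criteria fall into each rating within each domain, a per-domain tally can help a reviewer document the basis for that judgment. The following is a minimal sketch of such a tally. The function name `tally_ratings`, the tuple layout, and the example ratings are hypothetical. It deliberately does not assign a confidence class, since this guidance sets no numeric thresholds and the final call remains a judgment.

```python
from collections import Counter

RATINGS = ("Good", "Deficient", "Critically deficient")


def tally_ratings(evaluations):
    """Count ratings per domain.

    `evaluations` is a list of (domain, criterion, rating) tuples
    (hypothetical layout). Returns {domain: Counter of ratings}.
    """
    counts = {}
    for domain, _criterion, rating in evaluations:
        if rating not in RATINGS:
            raise ValueError(f"unknown rating: {rating!r}")
        counts.setdefault(domain, Counter())[rating] += 1
    return counts


# Hypothetical ratings for a single in vitro study.
evaluations = [
    ("Methodological quality", "Purity", "Good"),
    ("Methodological quality", "Blinding or similar measures", "Deficient"),
    ("Methodological quality", "Test methods", "Good"),
    ("Reporting quality", "Chemical name", "Good"),
    ("Reporting quality", "Funding", "Critically deficient"),
]

summary = tally_ratings(evaluations)
for domain, counter in summary.items():
    print(domain, {rating: counter[rating] for rating in RATINGS})
```

When reading such a tally, criteria in the reporting quality domain would generally carry less weight than those in the methodological quality domain, as described above.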
Example ratings:
High confidence:
Reproductive and developmental effects other than behavior [High Confidence]: The study was well-designed for the evaluation of reproductive and developmental toxicity induced by chemical exposure. The study applied established approaches, recommendations, and best practices, and employed an appropriate exposure design for these endpoints. Evidence was presented clearly and transparently.
Behavioral measures [Low Confidence]: The cursory cage-side observations of motor-related activity are considered to be insensitive methods for detecting behavioral effects, with a strong bias towards the null.
Medium confidence:
Developmental effects [all Medium Confidence]: The study was adequately designed for the evaluation of developmental toxicity. Although the authors failed to describe randomized allocation of animals to exposure groups and some concerns were raised regarding the sensitivity (i.e., timing) and sample sizes (i.e., n=6 litters/group) used for the evaluation of potential effects on male reproductive system development with gestational exposure, these limitations are expected to have a minimal impact on the results.
Lower confidence
Developmental effects [Low Confidence]: Substantial concerns were raised regarding quantitative analyses without addressing potential litter effects. Other significant limitations included incomplete data presentation (sample sizes for outcome assessment were unclear; no information on maternal toxicity was provided), and methods for selection of animals for outcome assessment.
Histopathology [Medium Confidence]: The study authors did not report information on the severity of histological effects for which this is routinely provided. The authors also failed to describe use of methods to reduce potential observational bias.
Sperm Measures [Uninformative]: Issues were identified with the methods used to prepare samples for analysis, which are likely to introduce artifacts. Concerns were also raised regarding results presentation (i.e., lack of group variability), missing information on sample sizes and loss of animals, and a lack of information on the timing of these evaluations. Taken together, the evaluation of this endpoint was considered uninformative.
Uninformative
Example 1: Critical information was not reported. Specifically, the study authors did not report the duration of the exposure or the results (qualitative or quantitative). Given this critical deficiency, the other domains were not evaluated.
Example 2: Concerns were raised over the lack of information on test animal strain and allocation, and chemical source/purity. The lack of information on blinding or other methods to reduce observational bias is also of significant concern for the endpoints of interest (i.e., follicle counts, ova counts, and evaluation of estrous cyclicity). Finally, concerns were also raised over the apparent self-plagiarism in similar chromium studies published in 1996 by this group of authors. Taken together, this combination of limitations resulted in an interpretation that the results were unreliable.