Screening tests are generally designed in two forms: objective and subjective. In objective tests, examiners directly observe and assess an infant’s behavior. Examples include the Denver II, the early learning milestone scale for screening language skills, and the Brigance, Battelle, and Bayley scales (
24). Subjective tests are developmental questionnaires completed by parents, such as the parent evaluation of developmental status and the ages and stages questionnaire (ASQ). Parents’ views of the developmental status of their infants have been considered appropriate and reliable for years (
25,
26). However, subjective tests have some weaknesses. For example, poorly educated parents may have difficulty reading a questionnaire, although this difficulty can be overcome by asking parents in an appropriate way. Furthermore, some physicians have opined that highly educated parents may be oversensitive to their infants’ development and that the use of parent-based questionnaires can lead to increased referrals (
25,
26). Thus, there are some doubts about the credibility of information provided by parents. For this reason, questionnaires are used as the first step of two-stage screening, and children who are suspect or who fail the questionnaire should then be assessed through diagnostic tests or objective screenings, which require a greater amount of time and skill (
25,
26).
There are no comprehensive tools applicable to all societies and all age groups, and there are no culturally compatible screening tools in many developing countries (
25,
26). Thus, unstandardized tools should first be standardized according to the population of each country (
25,
26). Developmental screening tools that have been translated into the Persian language include the Denver test and ASQ, whose criterion validity has not been verified. Denver II is an objective test for developmental screening of children from birth to the age of 8 years (
27,
28). The sensitivity and specificity of this test have been reported to range from 40% to 83% and from 40% to 80%, respectively (
27,
28). In Iran, the psychometric properties of the Denver test were compared with those of the ASQ in a sample of 197 children (
29,
30). The authors reported poor Kappa agreement between the two tests (0.21) and an agreement of 0.17 with the results of a physical examination, and they concluded that the Kappa agreement coefficient of the Denver test was poor. Because of its wide range of sensitivity and specificity, the Denver II test is not recommended (
27).
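For readers less familiar with the indices cited above, the following minimal Python sketch shows how sensitivity, specificity, and Cohen's Kappa are derived from a 2 x 2 table that cross-classifies a screening test against a reference assessment. The function name and the counts are hypothetical and serve only as an illustration; they are not data from the studies cited.

```python
def two_by_two_stats(tp, fp, fn, tn):
    """Sensitivity, specificity, and Cohen's Kappa from a 2 x 2 table.

    tp: screen positive, reference positive
    fp: screen positive, reference negative
    fn: screen negative, reference positive
    tn: screen negative, reference negative
    """
    n = tp + fp + fn + tn
    sensitivity = tp / (tp + fn)  # delayed children correctly flagged
    specificity = tn / (tn + fp)  # typically developing children correctly passed

    # Observed agreement: proportion of children on whom both methods agree.
    p_observed = (tp + tn) / n
    # Agreement expected by chance, from the marginal totals.
    p_expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n ** 2
    kappa = (p_observed - p_expected) / (1 - p_expected)
    return sensitivity, specificity, kappa


# Hypothetical counts for 100 children (assumed for illustration only).
sens, spec, kappa = two_by_two_stats(tp=20, fp=10, fn=5, tn=65)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}, kappa={kappa:.3f}")
```

Because Kappa subtracts the agreement expected by chance, two tests can agree on most children and still yield a low coefficient when nearly all children fall into the same category.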
The ASQ has been standardized in Iran. However, as a standard Iranian diagnostic developmental test is not available, the criterion validity, sensitivity, and specificity of the Persian version have not been determined (
31). The ASQ has several strengths. Unlike objective tests, it does not require the cooperation of the infant, and it is designed around developmental indices that can be taught to parents. In addition, the ASQ is economical and can be administered in a short time. Its weaknesses include the storage space it requires, as each questionnaire runs to 4 - 5 pages (
27). Furthermore, poorly educated parents may find it difficult to complete, and it is unable to detect developmental delay in 13% of children (
27). Thus, its use in high-risk groups is uncertain.
The Bayley screening test items are a subtest of the cognitive, language, and motor items of the Bayley diagnostic test (
18). In the U.S., the validity of the Bayley screening test was evaluated by examining the relation between performance on the Bayley diagnostic test and performance on the Bayley screening test. Bayley diagnostic test scores of 1 - 4 were equivalent to the criterion used to define the at-risk category in the Bayley screening test, and Bayley diagnostic test scores of 5 - 7 were equivalent to the criterion used to define the emerging category (
18). In that study, for children with Bayley diagnostic test scores of 1 - 4 (very low), the classification accuracy was moderate. The percentage of such children correctly identified by the Bayley screening test as being at risk ranged from 41.82% on the fine motor subtest to 65.91% on the receptive communication subtest, and none of these children was incorrectly classified as proficient. In the same study, for children with Bayley diagnostic test scores of 5 - 7, the Bayley screening test was even more accurate. The percentage of these children correctly identified as “emerging” ranged from 63.87% for the cognitive subtest to 77.78% for the receptive communication subtest. The percentage of such children misidentified as at risk was very low, ranging from 0.82% to 5.21%. For children with Bayley diagnostic test scores of 8 - 19, the Bayley screening test was very accurate, with 83.84% correctly identified as proficient in the cognitive subtest and 92.11% identified as proficient in the receptive communication subtest (
18). Furthermore, none of these children was incorrectly identified as at risk. Of note, in this classification, no child had a Bayley diagnostic test score of 1 - 4 (very low) and a Bayley screening test score in the proficient (competent) category, and no child had a Bayley diagnostic test scaled score of 8 - 19 (high) and a Bayley screening test score in the at-risk category. The test has also been shown to be valid in Taiwan, Canada, and the U.K. (
32-
34). In a study in the U.S., compared to the Alberta motor development scale, the Bayley motor subscale showed a higher correlation in early referral of high-risk infants to interventional service centers (
34), confirming the suitability of its application in such cases.
An examination of the relation between a test’s content and the construct it is intended to measure provides a major source of evidence for the validity of the test. Evidence of content validity is not based on empirical or statistical testing; rather, it is the degree to which the test items adequately represent and relate to the trait or function that is being measured. The test content also involves the wording and format of the items, as well as the procedures for administering and scoring the test. In the present study, the content validity of the test was confirmed by eight experts in child development, and the construct validity was confirmed using a factor analysis and a comparison of the scores of the different age groups.
In the present study, in terms of the cultural and linguistic appropriateness of the items for Persian-speaking children, several items were modified. Other studies of screening tests, such as the ASQ, performed a similar process of item modification (
35-
37). According to the Scheffe post hoc test, the mean scores differed significantly between the age groups in the five domains, with higher scores associated with increasing age. These results supported the construct validity of the test. To confirm the reliability of the instrument, its internal consistency was determined, in addition to its test-retest and inter-rater reliability. As shown by the assessment of internal consistency using Cronbach’s alpha, the reliability of the cognitive, receptive communication, and expressive communication scales was 0.79, 0.76, and 0.81, respectively, and the reliability of the fine motor and gross motor scales was 0.80 and 0.81, respectively, with a small SEM (< 2). A study in the U.S. reported similar reliability results, with good internal consistency (0.82 - 0.88) and test-retest reliability of 0.80 - 0.83 (
18).
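As an illustration of the reliability indices reported above, the following minimal Python sketch computes Cronbach's alpha from item-level scores and the standard error of measurement (SEM) from a scale's standard deviation and reliability, using the conventional formula SEM = SD x sqrt(1 - reliability). The function names, the item scores, and the standard deviation are hypothetical; the sketch does not reproduce the analysis or the data of the present study.

```python
from math import sqrt
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a list of rows (one row per child, one column per item)."""
    k = len(scores[0])  # number of items
    # Sample variance of each item across children.
    item_variances = [variance([row[i] for row in scores]) for i in range(k)]
    # Sample variance of the total score across children.
    total_variance = variance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


def standard_error_of_measurement(sd, reliability):
    """SEM = SD * sqrt(1 - reliability)."""
    return sd * sqrt(1 - reliability)


# Hypothetical scores for five children on four items (0 = fail, 1 = pass).
items = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
]
alpha = cronbach_alpha(items)
# Assumed SD of 3.0, chosen only to show the calculation.
sem = standard_error_of_measurement(sd=3.0, reliability=0.81)
print(f"alpha={alpha:.2f}, SEM={sem:.2f}")
```

With a fixed standard deviation, the SEM shrinks as reliability rises, which is why a high alpha and a small SEM are usually reported together.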
In this study, the inter-rater and intra-rater reliability coefficients of the test were excellent for all the subtests. The results indicate that raters who receive training in how to administer the Bayley scales can reliably assess Persian infants. The reliability data also suggest that the scores for the subtests reflect a high degree of internal consistency in the items and that this version of the Bayley screening test is equally reliable for assessing individuals with different levels of development. In the present study, the Bayley screening test scores showed very good stability over time across the age groups. Thus, the results of the test provide a reliable measurement, and the scores a child obtains in the test can be interpreted with a high level of confidence.
One of the first cross-cultural psychometric studies of the application of the Bayley scales to infants in an Eastern setting was a study of term and preterm Taiwanese infants. In that study, the correlations between the BSID-II and Bayley-III raw scores were good-to-excellent for the cognitive and motor items and low-to-excellent for the language items. In addition, both intra- and inter-rater reliability showed good-to-excellent correlations (> 0.75) and small SEMs (< 2) for term and preterm Taiwanese infants aged 6 - 24 months (
32).
The major strengths of the present study were the use of an objective assessment of five developmental domains and the inclusion of children aged 1 - 42 months. The revised scale is appropriate for follow-ups of at-risk infants.
5.1. Conclusions
This study demonstrated high reliability, content validity, and construct validity for all the subtests of the Bayley screening test. The results indicate that the Bayley screening test is a reliable and valid tool for the assessment of child development in the Middle East.