Accurate monitoring of prevalence and trends in population levels of physical activity (PA) is a fundamental public health need. Test-retest reliability (repeatability) was assessed in population samples for four self-report PA measures: the Active Australia survey (AA, N=356), the short International Physical Activity Questionnaire (IPAQ, N=104), the physical activity items in the Behavioral Risk Factor Surveillance System (BRFSS, N=127) and in the Australian National Health Survey (NHS, N=122). Percent agreement and Kappa statistics were used to assess reliability of classification of activity status as 'active', 'insufficiently active' or 'sedentary'. Intraclass correlations (ICCs) were used to assess agreement on minutes of activity reported for each item of each survey and for total minutes. Percent agreement scores for activity status were very good on all four instruments, ranging from 60% for the NHS to 79% for the IPAQ. Corresponding Kappa statistics ranged from 0.40 (NHS) to 0.52 (AA). For individual items, ICCs were highest for walking (0.45 to 0.78) and vigorous activity (0.22 to 0.64) and lowest for the moderate questions (0.16 to 0.44). All four measures provide acceptable levels of test-retest reliability for assessing both activity status and sedentariness, and moderate reliability for assessing total minutes of activity.
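The agreement statistics reported above can be illustrated with a short sketch. The data below are invented for illustration (they are not from the study): percent agreement is the share of matching classifications across the two administrations, and Cohen's kappa corrects that share for the agreement expected by chance.

```python
from collections import Counter

def percent_agreement(a, b):
    """Fraction of paired classifications that match."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(a)
    po = percent_agreement(a, b)                      # observed agreement
    ca, cb = Counter(a), Counter(b)
    pe = sum(ca[k] * cb[k] for k in set(a) | set(b)) / (n * n)  # chance agreement
    return (po - pe) / (1 - pe)

# Hypothetical test/retest activity classifications for six respondents
test1 = ["active", "active", "insufficient", "sedentary", "active", "sedentary"]
test2 = ["active", "insufficient", "insufficient", "sedentary", "active", "active"]

print(round(percent_agreement(test1, test2), 3))  # 0.667
print(round(cohens_kappa(test1, test2), 3))       # 0.478
```

Kappa is lower than raw percent agreement because some matches would occur by chance alone, which is why the abstract reports both.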

Test-retest reliability measures the consistency of results when you repeat the same test on the same sample at a different point in time. You use it when you are measuring something that you expect to stay constant in your sample.
You devise a questionnaire to measure the IQ of a group of participants (a property that is unlikely to change significantly over time). You administer the test to the same group of people twice, two months apart. The results are significantly different, so the test-retest reliability of the IQ questionnaire is low.
To measure interrater reliability, different researchers conduct the same measurement or observation on the same sample. Then you calculate the correlation between their different sets of results. If all the researchers give similar ratings, the test has high interrater reliability.
A team of researchers observes the progress of wound healing in patients. To record the stages of healing, they use rating scales with a set of criteria for assessing various aspects of the wounds. The results of the different researchers assessing the same set of patients are compared; there is a strong correlation between all sets of results, so the test has high interrater reliability.
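One common way to quantify agreement between raters on a continuous scale is an intraclass correlation. A minimal one-way ICC(1,1) sketch, using hypothetical wound ratings from two raters (these data and this simple formulation are illustrative assumptions, not from any study above):

```python
def icc_oneway(ratings):
    """One-way random-effects ICC(1,1).
    ratings: one row per subject, one column per rater."""
    n = len(ratings)          # number of subjects
    k = len(ratings[0])       # number of raters
    grand = sum(sum(row) for row in ratings) / (n * k)
    subj_means = [sum(row) / k for row in ratings]
    # Between-subjects and within-subject mean squares
    msb = k * sum((m - grand) ** 2 for m in subj_means) / (n - 1)
    msw = sum((x - m) ** 2 for row, m in zip(ratings, subj_means) for x in row) / (n * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)

# Hypothetical wound-severity ratings: 5 patients, 2 raters
ratings = [[8, 9], [4, 5], [6, 6], [9, 8], [3, 4]]
print(round(icc_oneway(ratings), 2))  # 0.93
```

High values mean that differences between patients dominate differences between raters, i.e. the raters largely agree.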
Parallel forms reliability measures the correlation between two equivalent versions of a test. You use it when you have two different assessment tools or sets of questions designed to measure the same thing.
If you want to use multiple different versions of a test (for example, to avoid respondents repeating the same answers from memory), you first need to make sure that all the sets of questions or measurements give reliable results.
A set of questions is formulated to measure financial risk aversion in a group of respondents. The questions are randomly divided into two sets, and the respondents are randomly divided into two groups. Both groups take both tests: group A takes test A first, and group B takes test B first. The results of the two tests are compared, and the results are almost identical, indicating high parallel forms reliability.
When you devise a set of questions or ratings that will be combined into an overall score, you have to make sure that all of the items really do reflect the same thing. If responses to different items contradict one another, the test might be unreliable.
One way to reduce this type of bias is to ensure that participants take both tests under identical conditions: at the same time of day, in the same general lighting and environment, and with the same amount of time to complete the test.
Hi there, I'm having a similar situation and am not particularly tech-savvy.
I need to establish an average amount of time to take the online survey AND need to establish a reliability coefficient for the survey. In my naïve brain, I see this as a test-retest situation. I tried to check how I'd do this by setting up a dummy survey, taking it, and then retaking it as a new survey, getting a new link through the Data & Analysis tab. But I found my first answers already on the "new" test; I wanted it to be blank so that it would truly be a "retest," not a revision.
Ideally, I'd like to set it up so that an anonymous respondent to my first link would be able to get a 2nd, blank test after a 2-week break, to meet what I understand as "best practices" for this process.
Help??? I welcome all comments, and TIA!
The purpose of this study was to evaluate an updated list of digitally recorded Speech Recognition Threshold (SRT) materials for test-retest reliability. Chipman (2003) identified 33 psychometrically equated spondaic words that occur frequently in English today. These digitally recorded words were used to determine the SRT of 40 participants following the American Speech-Language-Hearing Association guidelines. The participants were between the ages of 19 and 83 years and presented with hearing ranging from normal to severely impaired. Pure-tone averages classified 16 participants as having normal hearing to slight loss, 12 as having mild loss, and 12 as having moderate to severe hearing loss. The speech materials were presented to each participant in one randomly selected ear, and the SRT was measured for the same ear in both the test and retest conditions. The average SRT was 22.7 dB HL in the test condition and 22.8 dB HL in the retest condition, a 0.1 dB difference that was not statistically significant. A modified variance equation for test-retest reliability yielded a coefficient of 0.98, indicating almost perfect reliability. The test-retest reliability of the new SRT words was therefore determined to be exceptional.
M.W. Dul, W.H. Swanson, J.H. Sohn; Evaluation of Diffuse Loss and Test-retest Variability in Patients with Advanced Glaucoma Using Full-Threshold and SITA 10-2 Strategies . Invest. Ophthalmol. Vis. Sci. 2003;44(13):73.
Abstract: Purpose: To evaluate diffuse loss and test-retest variability in patients with advanced glaucoma using two different macular perimetric algorithms. Methods: We tested one eye each of 11 patients with stable, advanced glaucoma and a control group of 10 age-similar normal volunteers. All subjects were experienced and reliable visual field testers with good visual acuity, clear ocular media and no concomitant conditions affecting visual function. Differential light sensitivities were assessed using the Full Threshold (FT) and SITA Standard algorithms with the Humphrey Field Analyzer 10-2 pattern. All tests were repeated twice (t1, t2) within 5 ± 6 days. To assess diffuse loss, we compared average differential light sensitivities at the 10 most sensitive points for patients and controls. To assess test-retest variability for each group, we used the standard deviation of (t1-t2) at all seeing points. To evaluate the effects of sensitivity on test-retest variability in the patient group, we performed linear regression on t1-t2 vs. mean sensitivity. Finally, we compared mean test-retest variability for two subsets of points: "normal sensitivity" (26 to 35 dB) and "reduced sensitivity" (5 to 25 dB). Results: Mean diffuse loss for patients was -4.7 dB (FT) and -5.2 dB (SITA) (t > 3.8, p < 0.001). The number of patients with significant diffuse loss was 7 for FT and 8 for SITA (chi squared > 8.4, p < 0.004). For both FT and SITA, test-retest variability was higher for the patient group than the control group (F > 4.2, p < 0.0005), and linear regression on the patient data showed increased variability with depth of defect (r > 0.23, p < 0.00005). Direct comparison of "normal sensitivity" and "reduced sensitivity" points confirmed that variability was greater in abnormal areas (t > 5, p < 0.0005).
Conclusions: Results were quite similar for both FT and SITA algorithms: patients showed diffuse loss on the order of 5 dB and test-retest variability was greater in damaged vs. normal areas.
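The variability measure used in the study above, the standard deviation of (t1 - t2) across test points, can be sketched with hypothetical sensitivities (the values below are invented for illustration):

```python
from statistics import stdev

# Hypothetical sensitivities (dB) at the same seven points on test (t1) and retest (t2)
t1 = [30, 28, 25, 18, 12, 31, 22]
t2 = [29, 30, 22, 14, 17, 30, 25]

diffs = [a - b for a, b in zip(t1, t2)]
s = stdev(diffs)  # larger SD of differences = greater test-retest variability
print(round(s, 2))
```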
Maximum phonation time has been widely used as a simple clinical evaluation of vocal function; its importance has been emphasized by Van Riper (1954), Westlake and Rutherford (1961), Boone (1971), and other authors. A review of the literature revealed that most researchers have used three trials of sustained phonation to determine maximum duration of phonation. The review also revealed a lack of test-retest reliability data for maximum phonation time in children.
The present study was designed to determine the test-retest variability of the maximum duration of sustained /a/ among prepubescent male and female children. Eighty subjects, twenty at each of four age levels (seven, eight, nine, and ten years), were selected from a larger pool using a random-order table. Each age level was further divided into two groups of ten male and ten female subjects. A tape recording of twenty maximum phonations of /a/ was obtained for each subject. A second measure of maximum phonation time was recorded between two weeks and a month after the original session. The essential questions of this investigation were: