Questionnaires are a self-reporting data collection technique. The questions (items) in a questionnaire are usually closed-ended and presented as multiple-choice. Respondents have to choose from a set of alternatives or points on a rating scale (e.g., very satisfied to very dissatisfied, strongly agree to strongly disagree).
A standardized questionnaire is a questionnaire that is written and administered so all participants are asked precisely the same questions in an identical format, and responses are recorded and scored in a specific, consistent manner (Boynton et al., 2004). Standardizing a questionnaire takes effort; it requires repeated testing with a large sample and extensive data analysis.
Validity: This refers to whether a questionnaire measures what it is intended to measure. For example, a survey designed to explore learnability that actually measures system capabilities would not be considered valid.
Economy: The development of standardized measures requires a substantial amount of work as discussed earlier. However, once developed, they can be reused multiple times without the need to re-standardize.
Communication: Findings from standardized measures are easier to communicate and interpret because the metrics are consistent across studies. For example, the score of a questionnaire can be compared to scores reported in previous studies.
The QUIS was developed by Chin, Diehl, and Norman in 1988 and is one of the earliest questionnaires for evaluating user satisfaction. The QUIS is organized around general categories such as screen, terminology and system information, learning, and system capabilities. Practitioners often use only the categories relevant to the product they are testing and can supplement the QUIS with some of their own questions, specific to the design being evaluated. The questionnaire has been updated multiple times since its release; the current version is QUIS 7, which is available in five languages (English, German, Italian, Brazilian Portuguese, and Spanish) and two lengths, short (41 items) and long (122 items), using nine-point bipolar scales for each item. The shorter version is the most popular one. The questionnaire is licensed and can be found here.
The SUMI is currently available in 12 languages (Dutch, English, Finnish, French, German, Greek, Italian, Norwegian, Polish, Portuguese, Swedish, and Spanish) and is licensed. To view more information about SUMI and buy a license, see here. You can view the English version of the questionnaire here.
A few rounds of improvements have resulted in PSSUQ Version 3, which is the one used today. The original version had 18 questions, but Version 3 consists of 16 questions with a Likert scale (ranging from Strongly Agree to Strongly Disagree).
The PSSUQ can be used with both large sample sizes (more than 100) and with smaller sample sizes (fewer than 15). The main difference is the level of precision obtained. In a 2004 study, Tullis and Stetson used the CSUQ to compare two financial websites, and they found a sample size of 12 generated the same results as a larger sample size 90% of the time.
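PSSUQ scores are simple averages of the raw item responses. A minimal sketch of the scoring, assuming the subscale item groupings commonly reported for Version 3 (System Usefulness: items 1–6, Information Quality: items 7–12, Interface Quality: items 13–15) and a 7-point scale where 1 = Strongly Agree, so lower scores are better; the response data below is hypothetical:

```python
# PSSUQ v3 scoring sketch. Assumption: standard subscale groupings
# (SysUse 1-6, InfoQual 7-12, IntQual 13-15); overall = mean of all 16 items.
# Items use a 7-point scale where 1 = Strongly Agree (lower is better).
def pssuq_scores(responses):
    """responses: list of 16 raw answers on a 1-7 scale (item 1 first)."""
    mean = lambda xs: sum(xs) / len(xs)
    return {
        "overall": mean(responses),         # items 1-16
        "sysuse": mean(responses[0:6]),     # items 1-6
        "infoqual": mean(responses[6:12]),  # items 7-12
        "intqual": mean(responses[12:15]),  # items 13-15
    }

# Hypothetical single respondent:
answers = [2, 1, 2, 3, 2, 2, 3, 4, 3, 2, 3, 3, 2, 1, 2, 2]
print(pssuq_scores(answers))
```

Reporting the subscales alongside the overall mean makes it easier to see whether, say, information quality is dragging down an otherwise well-rated product.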
The PSSUQ scores correlate significantly with task-based measures and completion rates (r = .4) (Sauro, 2019). However, the PSSUQ should be used carefully as it is susceptible to acquiescence bias (also known as agreement bias), the tendency for survey respondents to agree with research statements. This is because all the items in the PSSUQ are positively worded.
While the SUS was initially intended to measure perceived ease-of-use (a single dimension), Lewis and Sauro found that it provides a global measure of system satisfaction and sub-scales of usability and learnability. Items 4 and 10 provide the learnability dimension and the other 8 items provide the usability dimension.
Scores for individual questions can also be calculated to give us more insight into usability issues. This is achieved by normalizing the score of each question and multiplying it by 25 to align with the 0–100 scale used for the overall SUS score.
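A minimal sketch of this scoring, using the standard SUS formula (odd-numbered items are positively worded, so their contribution is the raw score minus 1; even-numbered items are negatively worded, so it is 5 minus the raw score); the respondent's answers below are hypothetical:

```python
# SUS scoring sketch: each item's contribution is normalized to 0-4, then
# multiplied by 25 so every item sits on the same 0-100 scale as the overall score.
def sus_item_scores(responses):
    """responses: list of 10 raw answers on a 1-5 scale (item 1 first)."""
    scores = []
    for i, r in enumerate(responses):
        # Odd-numbered items are positively worded, even-numbered negatively.
        normalized = (r - 1) if (i + 1) % 2 == 1 else (5 - r)
        scores.append(normalized * 25)  # 0-100 per item
    return scores

def sus_overall(responses):
    # Equivalent to the classic formula: sum of normalized items times 2.5.
    return sum(sus_item_scores(responses)) / 10

answers = [4, 2, 5, 1, 4, 2, 5, 2, 4, 1]  # hypothetical respondent
print(sus_item_scores(answers))  # → [75, 75, 100, 100, 75, 75, 100, 75, 75, 100]
print(sus_overall(answers))      # → 85.0
```

With per-item scores on the same 0–100 scale, an unusually low item (e.g., item 4, one of the learnability items) stands out immediately.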
Evidence suggests that SUS scores can predict customer loyalty. In particular, there is a significant positive correlation (r=.61) between SUS scores and Net Promoter Score (NPS). The NPS has become a popular metric of customer loyalty in the industry.
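For reference, the NPS is computed from a single 0–10 "likelihood to recommend" rating: the percentage of promoters (9–10) minus the percentage of detractors (0–6). A minimal sketch with hypothetical ratings:

```python
# Standard NPS computation: % promoters (9-10) minus % detractors (0-6).
# Passives (7-8) count toward the total but neither add nor subtract.
def nps(ratings):
    promoters = sum(1 for r in ratings if r >= 9)
    detractors = sum(1 for r in ratings if r <= 6)
    return 100 * (promoters - detractors) / len(ratings)

ratings = [10, 9, 9, 8, 7, 6, 10, 3, 9, 8]  # hypothetical ratings
print(nps(ratings))  # 5 promoters, 2 detractors out of 10 → 30.0
```

The resulting score ranges from −100 (all detractors) to +100 (all promoters), which is why it is reported as a number rather than a percentage.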
The SUPR-Q contains four factors: usability (can users accomplish what they want to do?), trust (do users trust the product?), appearance (how do users feel about the UI?), and loyalty (are users loyal to the brand?).
The Chatbot Usability Questionnaire (CUQ): It includes 16 balanced questions related to different aspects of chatbot usability. Eight of these relate to positive aspects of chatbot usability, and eight relate to negative aspects. Scores are calculated out of 100.
There is not a simple answer to this question. Determining which questionnaire to use depends on various factors such as the nature of the project, the stage of the research, the goal of the study, and the budget.
Are any of the sub-scores for either questionnaire especially interesting or relevant for your research? For example, if you are interested in the learnability of a product, then the SUS is a good choice.
How long is the usability session you are running? Consider tester fatigue. Some questionnaires, such as the PSSUQ, are longer and more complex, thus increasing tester fatigue. Shorter questionnaires are more likely to be completed and are better for benchmarking studies.
I'm going to suggest that people not use QUIS at this point. I am one of the co-inventors and the scale was developed to be tied closely to the UI technology. This made it nicely diagnostic when it was fresh, but it hasn't been refreshed since Laura Slaughter and I did it in the 1990's.
Reliability: This refers to how consistent responses are to the questions. If a measure has high reliability, the same or similar users are expected to give similar responses when we evaluate the same product. The most common measure of reliability in questionnaires is Cronbach's alpha, a measure of internal reliability. It ranges from 0 (poor reliability) to 1 (perfect reliability). Measures with a score over .70 are considered sufficiently reliable.
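Cronbach's alpha is computed from the item variances and the variance of respondents' total scores: alpha = k/(k−1) × (1 − Σ item variances / total-score variance), where k is the number of items. A minimal sketch with a small, made-up response matrix:

```python
# Cronbach's alpha sketch. Rows = respondents, columns = questionnaire items.
def cronbach_alpha(responses):
    k = len(responses[0])  # number of items
    # Sample variance (n - 1 denominator) of a list of numbers.
    def var(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
    item_vars = [var([row[i] for row in responses]) for i in range(k)]
    total_var = var([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# Hypothetical 5-point Likert responses: four respondents, three items.
data = [
    [4, 5, 4],
    [3, 4, 3],
    [5, 5, 5],
    [2, 3, 2],
]
print(round(cronbach_alpha(data), 2))  # → 0.98
```

The toy data is deliberately consistent (each respondent rates all items similarly), which is why alpha comes out far above the .70 threshold.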
Post-task questionnaires: These measures are completed immediately after users finish a task and capture their impressions of that task (e.g., "Overall, this task was…?"). A question is usually presented after the end of each task, which results in multiple answers collected within a session.
Post-task and post-test questionnaires are not incompatible; both can be used in the same usability study if required. In this article, I'll be focusing on post-test questionnaires. Part 2 of this series covers post-task measures.
The SUMI was developed by the Human Factors Research Group (HFRG) at University College Cork in Ireland, led by Jurek Kirakowski. It is a 50-item questionnaire with a Global scale based on 25 of the items and five subscales for Efficiency, Affect, Helpfulness, Control, and Learnability (10 items each). As shown in the figure below, users can choose one of three options (Agree, Undecided, Disagree). The SUMI contains a mixture of positive and negative statements (e.g., "The instructions and prompts are helpful"; "I sometimes don't know what to do next with this system").