It must be understood that all measurement involves sampling. A sample is a small subset of a large and usually indefinable population. During development, the user sample sizes available to most design teams are extremely small: usually on the order of 3 to 10 users. Similarly, testing procedures will usually focus on a small number of user tasks -- presumably those judged to be the most critical. Typically, a testing session will last up to 60 minutes and test the execution of 4 to 5 tasks, whereas the product will typically support a large and possibly indefinable number of tasks. Thus task sample sizes will also be extremely small.
Compared to the projected sale and use of a product, these are extremely risky samples on which to base major decisions. If we have a measure whose standard deviation (spread) is as low as 4.00, fig. 1 shows that with small sample sizes the 95% confidence interval (the range within which we can be 95% sure the true population mean lies, on the basis of our sample mean) is extraordinarily large, and only becomes reasonable once we have measured approximately 40 users or investigated usability over approximately 40 tasks.
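To see why the interval only settles down at around 40 observations, here is a minimal sketch (our illustration, not from the original source) that computes the t-based 95% confidence-interval half-width for the scenario of fig. 1, assuming a sample standard deviation of 4.0:

```python
# Minimal sketch: half-width of the 95% confidence interval for a sample mean,
# assuming the sample standard deviation of 4.0 used in fig. 1.
from math import sqrt
from scipy import stats

SD = 4.0  # assumed sample standard deviation

for n in (3, 5, 10, 20, 40):
    t_crit = stats.t.ppf(0.975, df=n - 1)  # two-sided 95% critical value
    half_width = t_crit * SD / sqrt(n)     # interval: sample mean +/- half_width
    print(f"n={n:2d}: 95% CI = mean +/- {half_width:.2f}")
```

With three users the interval spans roughly plus or minus 10 units around the sample mean; at forty users it narrows to roughly plus or minus 1.3, which is the pattern the graph shows.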
The particular difficulty is that we do not know where within the 95% confidence interval the true (population) mean actually lies.
The situation is not quite as bad as the graph suggests, because testing during development usually attempts to diagnose as well as measure, and the user data gathered is usually qualitative rather than quantitative. Even so, an inappropriate choice of user testers and tasks, with small samples of both, may give extremely optimistic results or extremely pessimistic ones - and we have no way of telling which.
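The "no way of telling" problem is easy to demonstrate by simulation. The sketch below is a hypothetical illustration (the skewed population of task-completion times is invented for the example): it repeatedly draws small and moderately sized samples from the same population and reports how far the sample means stray from the truth.

```python
# Hypothetical simulation: the spread of sample means at two sample sizes,
# drawn from an invented, skewed population of task-completion times.
import random

random.seed(1)
population = [random.lognormvariate(3.0, 0.6) for _ in range(100_000)]
true_mean = sum(population) / len(population)

for n in (5, 40):
    means = [sum(random.sample(population, n)) / n for _ in range(1_000)]
    print(f"n={n:2d}: sample means ranged {min(means):.1f} to {max(means):.1f} "
          f"(true mean {true_mean:.1f})")
```

Typical runs show five-user means landing well above or well below the true mean, while forty-user means stay close to it; any single small-sample result could be either the optimistic or the pessimistic extreme.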
However, there is usually no way to test with large samples until the product is released.
The development of applications for the web may prove to be an exception: a web site can be launched as a beta version to a limited but usually sizeable sample of users, and testing can be carried out on that basis to achieve a well-tested final version. However, this is only possible if the measures used in testing are themselves reliable - not something one can take for granted.
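Reliability can be checked rather than assumed. For multi-item instruments such as satisfaction questionnaires, a standard check is internal consistency; the following sketch (our illustration, not part of the original text) implements Cronbach's alpha, the usual statistic for this purpose:

```python
# Minimal sketch: Cronbach's alpha, a standard internal-consistency check
# for multi-item questionnaires.
import numpy as np

def cronbach_alpha(items):
    """items: 2-D array-like, rows = respondents, columns = questionnaire items."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1).sum()  # sum of per-item variances
    total_variance = items.sum(axis=1).var(ddof=1)    # variance of respondents' totals
    return (k / (k - 1)) * (1 - item_variances / total_variance)

# Example: five respondents answering a four-item satisfaction scale (1-5).
ratings = [[4, 5, 4, 4], [2, 2, 3, 2], [5, 4, 5, 5], [3, 3, 2, 3], [4, 4, 4, 5]]
print(f"alpha = {cronbach_alpha(ratings):.2f}")
```

By convention, values of alpha above roughly 0.7 are taken as acceptable for an instrument used in evaluation work.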
Thus there is a considerable need to continue testing and measurement after the product has been released, to gain an increasingly accurate picture of how the product is performing against the usability goals set for it. Data from such an activity can be used by the organisation developing the product in a number of ways.
Selecting the right method: discussion of principles
A frequently asked question is: what are the right methods to use for evaluation? Which methods should an organisation acquire first?
It is extremely important to note that unless an organisation has mature quality processes, at least to level 2 or 3 of the CMM hierarchy, for instance, usability testing will do very little for the organisation as a whole, although it may demonstrate the gains to be made in small, localised projects, which may become an incentive to raise the general CMM level. In fact, unless carefully handled, usability testing in an immature organisation may actually suggest that usability testing is neither cost-effective nor useful, setting the entire organisation back in its attempts to develop better quality standards.
The ISO 9241-11 definition of usability is well worth remembering:
The effectiveness, efficiency and satisfaction with which a well-defined sample of users carry out a fixed set of tasks in a particular environment with a particular release of the software.
Four conclusions can be drawn from this definition, leading to an acquisition strategy of four main classes of usability methods.
Follow the entries in the methods table for more information.
Selecting the right method: a quick approach for newbies
Here is a kit of measurement methods designed to get the novice practitioner off to a quick start. The emphasis is on ‘guerrilla methods’: that is, methods which can be used on their own to make a point. Later on, you’ll want to connect the methods together to show there is an underlying thread behind all this usability work, and to use the more complex methods which yield more data.
Effectiveness: Use ‘participatory evaluation’
Efficiency: Use ‘Time on Task’
Satisfaction: Use SUS, or SUMI if you can afford it (see ‘subjective assessment’).
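As a concrete example of the satisfaction measures, SUS scoring is simple enough to implement in a few lines. The sketch below follows the standard published scoring rule (Brooke, 1996): ten items rated 1 to 5, odd items positively worded, even items negatively worded, and the raw total scaled to 0-100.

```python
# Standard SUS scoring (Brooke, 1996): ten items, each rated 1 (strongly
# disagree) to 5 (strongly agree); the result is a score from 0 to 100.
def sus_score(responses):
    """responses: the ten item ratings, in questionnaire order."""
    if len(responses) != 10:
        raise ValueError("SUS requires exactly ten item responses")
    total = 0
    for i, r in enumerate(responses):
        # Items 1, 3, 5, 7, 9 are positively worded: contribution = rating - 1.
        # Items 2, 4, 6, 8, 10 are negatively worded: contribution = 5 - rating.
        total += (r - 1) if i % 2 == 0 else (5 - r)
    return total * 2.5  # scale the 0-40 raw total to 0-100

print(sus_score([4, 2, 5, 1, 4, 2, 4, 1, 5, 2]))  # prints 85.0
```

Remember that a single SUS score inherits all the small-sample caveats discussed above: report it with its confidence interval, not as a bare number.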
Reporting and documentation
No amount of testing and measurement is of any use if you cannot make an impact on your organisation with your results; and if you don’t keep records, you won’t improve your processes. Reporting is therefore useful for two reasons: it is how you make that impact, and it is how you build the record that lets you improve your processes.
The fundamental principle in any method of reporting is the 5-minute rule: the most important people you hope to influence will not spend more than five minutes reading your report. This means you have to get your message across within the first few pages. Both the schemes mentioned below allow you to present a report following this principle.
The second principle is the principle of traceability. You must lay down an audit trail in your report so that you, your reader, and your reviewer can always see how you got from your findings to your conclusions. This is harder than in science, where a paper comes under scrutiny from expert peer reviewers and will therefore, if published, carry their imprimatur. Your reports will be issued under your own or your manager’s authority, so a sceptical reviewer must be able to follow your trail.
There are two standards you can follow.
©UsabilityNet 2006. Reproduction permitted provided the source is acknowledged.