A User Rating Recession

Online rating systems are ubiquitous. Born with the “user-generated” web of the early 2000s, user ratings remain the most prevalent, and oddly unchanged, remnant of that bygone Web 2.0 era.

Varying implementations

User ratings are typically implemented in one of several ways.

  • Four- or five-star rating scale

    Netflix 5-star rating scale

  • Vote up / vote down

    Reddit up/down vote

  • Vote up only

    Facebook like

These rating implementations vary depending on what each site is attempting to get from users:

  • Making users feel like they’re part of a democratic system where an up or down vote contributes to how content appears to other users on the site. Digg and Reddit were early purveyors of this approach and this is what makes the StackOverflow network of sites so successful.
  • Using ratings to compile a comprehensive personal recommendation profile à la Amazon and Netflix.
  • Making the site feel active and popular, which, in turn, makes users more willing to spend time and contribute high-quality content on the site.

User expectations

Yet from a user’s perspective, rating systems are often opaque and confusing. How these systems work varies wildly from site to site, depending on each site’s business requirements, and it’s rarely clear what happens to a rating once it’s submitted.

When I provide a rating, will the site…

  • Prompt me to log in?
  • Ask me to provide additional feedback?
  • Actually use my rating to improve or better-surface its content, or to simply build a recommendation or advertising profile against me?

Many sites follow general conventions to keep ratings as unobtrusive as possible, but for every five sites that follow the same simple conventions, there’s one that does its own crazy, unexpected thing in response to user ratings.

Beyond the ambiguity about what happens after a rating is submitted, there are other factors that lead to confusion. Several years ago, I had the opportunity to take part in a usability study in which people recruited off the street came in and gave feedback on a website. The site design being evaluated contained a typical five-star rating system under a “How helpful was this article?” heading.

When a user submitted a rating below a certain threshold (three stars), a dialog appeared prompting the user to provide additional feedback. This rating system was consistently skipped over by user after user. When prompted, one participant explained that the article content in question was horrible; it wasn’t even deserving of a single star, so she didn’t provide a rating. The article content was intentionally horrible, since we were really trying to gauge the reaction to the feedback dialog that appeared for low ratings. But the insight that bad content isn’t worthy of even a single star led us to change the rating system to binary Yes / No buttons under a more direct “Was this article helpful?” heading. [1]

Rating scales imply a greater degree of granularity than a typical Up/Down or Yes/No binary rating mechanism. Yet that granularity is ambiguous and misunderstood. I wish Netflix offered half-star ratings, because, in my mind, there’s a big difference between a 3-star film and a 4-star film. But adding half-star ratings adds five more discrete ratings; would that increase the cognitive load on users, raising the complexity of every rating decision? Would a rating slider that allows every decimal value between zero and five simply be too much, causing users to skip ratings entirely?

Consider the binary alternative. If Netflix replaced its star rating scale with a simple “Would you watch this movie again?” Yes/No button, its valued recommendation engine might be able to generate more accurate profiles. I might give a film a 4-star rating but never wish to see it again, yet re-watch a three-starrer because it’s fun and I can play it in the background while doing something else.

Nuanced approaches

Some sites attempt a different approach, where, rather than supplying a single catch-all rating, multiple rating scales are supplied in order to gain insight into different factors.

Goodfil.ms attempts to address the discrepancy between how much you liked a film and whether you’d watch it again.

Goodfil.ms rating scale

REI and Zappos provide the ability to rate different product factors, such as comfort and durability.

REI’s BazaarVoice-powered rating system

Zappos’ multiple star rating scales

Since gdgt has ratings at its core, it does a great job of providing meaningful rating values in a variety of product categories.

Five different rating scales, one for each important category for the product

An overall rating is generated based on the separate category ratings
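How gdgt actually combines its category ratings isn’t described here. As a rough sketch only, assuming the overall value is a weighted average of the category ratings, the idea might look like this (the function name, category names, and weights are all hypothetical):

```python
def overall_rating(category_ratings, weights=None):
    """Combine per-category ratings into one overall score.

    Assumption: a weighted mean. Real sites may use a different formula.
    category_ratings: dict mapping category name -> numeric rating.
    weights: optional dict mapping category name -> relative importance.
    """
    if not category_ratings:
        raise ValueError("at least one category rating is required")
    if weights is None:
        # With no weights given, every category counts equally.
        weights = {name: 1.0 for name in category_ratings}

    total_weight = sum(weights[name] for name in category_ratings)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")

    weighted_sum = sum(
        rating * weights[name] for name, rating in category_ratings.items()
    )
    return weighted_sum / total_weight


# Hypothetical categories for a pair of hiking boots.
boots = {"comfort": 4, "durability": 5, "value": 3}
print(overall_rating(boots))
```

Weighting lets a site decide that, say, durability matters more than value for a given product category, but it also adds one more thing the user can’t see.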

These implementations do a good job of splitting a complex set of product factors and satisfaction feelings out from a single rating scale into ratings with several different facets. Yet they require more work from the user. A user has to really want to provide a rating in order to fill in multiple rating scales. Even though they’re opaque and often ambiguous, drive-by, single-click ratings are a lot easier to process.

Online rating systems are hard to get right. It’s hard for sites to provide the correct metaphor for ratings or to convey how the rating data that users provide will be used. It’s often hard for users to distill complicated feelings like satisfaction into a single rating. Some sites attempt to address this ambiguity by providing more fine-grained rating systems, but these approaches require more effort from users.

User ratings are a problem with no right answer and no single, elegant solution. Existing solutions are clumsily implemented and feel out of place.

Maybe the solution is to stop providing these rating systems altogether. With more aggressive A/B testing and better big data analysis, why not determine the quality and popularity of content based on what users actually do rather than what they say?

Which movies does this user tend to watch and re-watch? Which films languish at the bottom of her queue? What types of products does this user buy? All of the big-time sites that keep recommendation profiles already do this type of analysis, but still provide rating systems as a check and an algorithmic tuner, because these programmatic insights are so new and unrefined. Sites that don’t use ratings for personal recommendations can use other factors to determine the quality and popularity of content (number of page views, social media sharing, time spent on the page, etc.). Over time, as user-activity and data-analysis techniques continue to improve, the ubiquitous, yet clunky, user rating may finally get downvoted into obscurity.
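To make the “what users do” idea a little more concrete, here is a minimal sketch of scoring content from behavior alone. It assumes only the three signals mentioned above (page views, shares, time on page); the class name, the caps, and the weights are made up for illustration and would need tuning against real data.

```python
from dataclasses import dataclass


@dataclass
class PageActivity:
    """Hypothetical behavioral signals collected for one piece of content."""
    page_views: int
    shares: int
    avg_seconds_on_page: float


def implicit_score(activity, max_views=10_000, max_shares=500, max_seconds=300.0):
    """Return a score between 0 and 1 derived from behavior, not ratings.

    Each signal is capped and scaled to 0..1, then blended with
    illustrative weights (0.3 / 0.3 / 0.4) that are assumptions, not
    values taken from any real site.
    """
    def capped(value, cap):
        # Clamp to the cap so one runaway signal cannot dominate.
        return min(value, cap) / cap

    return (
        0.3 * capped(activity.page_views, max_views)
        + 0.3 * capped(activity.shares, max_shares)
        + 0.4 * capped(activity.avg_seconds_on_page, max_seconds)
    )


skimmed = PageActivity(page_views=8_000, shares=5, avg_seconds_on_page=12.0)
read_closely = PageActivity(page_views=2_000, shares=120, avg_seconds_on_page=240.0)
print(implicit_score(skimmed) < implicit_score(read_closely))
```

In this toy example the heavily viewed but quickly abandoned page scores lower than the one people actually read and share, which is the kind of distinction a single star rating struggles to capture.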

  1. The feedback dialog is perhaps a topic for another day, but, spoiler alert: it tested poorly. The additional dialog that appeared after submitting a low rating was unexpected and was met with hostile reactions. “Who’s going to actually read this?” “Why should I have to tell this site that it’s doing a poor job?”