Category Archives: user engagement

How engaged are Wikipedia users?

Recently, we were asked: “How engaged are Wikipedia users?” To answer this question, we visited Alexa, a web analytics site, and learned that Wikipedia is one of the most visited sites in the world (ranked 6th), that users spend on average around 4 minutes and 35 seconds per day on Wikipedia, and that many visits to Wikipedia come from search engines (43%). We also found studies about readers’ preferences, Wikipedia growth, and Wikipedia editors. There is, however, little about how users engage with Wikipedia, in particular about those not contributing content to Wikipedia.

Can we do more?

Besides reading and editing articles, users perform many other actions: they look at the revision history, search for specific content, browse through Wikipedia categories, visit portal sites to learn about specific topics, or visit the community portal. Although discussing an article is a sign of a highly engaged user, performing several actions within the same visit to Wikipedia is also a sign of a highly engaged user. It is this latter type of engagement we looked into.

Action networks

We collected 13 months (September 2011 to September 2012) of browsing data from an anonymized sample of approximately 1.3M users. We identified 48 actions such as reading an article, editing, opening an account, donating, visiting a special page. We then built a weighted action network: nodes are the actions and two nodes are connected by an edge if the two corresponding actions were performed during the same visit to Wikipedia. Each node has a weight representing the number of users performing the corresponding action (the node traffic). Each edge has a weight representing the number of users that performed the two corresponding actions (the traffic between the two nodes).
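To make the construction concrete, here is a minimal Python sketch of how such a weighted action network could be built from visit logs. The input format (a list of user and action-list pairs), the helper name `build_action_network` and the toy action names are assumptions made for illustration; this is not the code used in the study.

```python
from itertools import combinations


def build_action_network(visits):
    """Build node and edge weights from (user_id, actions_in_visit) pairs.

    Node weight: number of distinct users who performed the action.
    Edge weight: number of distinct users who performed both actions
    within the same visit.
    """
    node_users = {}  # action -> set of users
    edge_users = {}  # (action_a, action_b) -> set of users
    for user, actions in visits:
        distinct = sorted(set(actions))  # sorted so each pair has one canonical order
        for action in distinct:
            node_users.setdefault(action, set()).add(user)
        for pair in combinations(distinct, 2):
            edge_users.setdefault(pair, set()).add(user)
    node_weight = {a: len(users) for a, users in node_users.items()}
    edge_weight = {e: len(users) for e, users in edge_users.items()}
    return node_weight, edge_weight


# Toy data (hypothetical): four visits by three users.
visits = [
    ("u1", ["read", "search"]),
    ("u1", ["read"]),
    ("u2", ["read", "edit", "history"]),
    ("u3", ["read", "search"]),
]
node_weight, edge_weight = build_action_network(visits)
print(node_weight)
print(edge_weight)
```

Using sets of users rather than raw counts means a user who repeats an action across several visits is counted once, which matches the "number of users" wording above.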

Engagement over time

We use the following metrics to measure engagement on Wikipedia based on actions:

  • TotalNodeTraffic: total number of actions (sum of all node weights)
  • TotalEdgeTraffic: total number of pairwise actions (sum of all edge weights)
  • TotalTrafficRecirculation: actual network traffic with respect to maximum possible traffic (TotalEdgeTraffic/TotalNodeTraffic).
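As a small illustration, the three metrics can be computed directly from the node and edge weights. The weights below are made-up numbers and `engagement_metrics` is a hypothetical helper, so treat this as a sketch of the definitions rather than our actual pipeline.

```python
def engagement_metrics(node_weight, edge_weight):
    """Return (TotalNodeTraffic, TotalEdgeTraffic, TotalTrafficRecirculation)."""
    total_node_traffic = sum(node_weight.values())
    total_edge_traffic = sum(edge_weight.values())
    # Guard against an empty network to avoid dividing by zero.
    recirculation = (
        total_edge_traffic / total_node_traffic if total_node_traffic else 0.0
    )
    return total_node_traffic, total_edge_traffic, recirculation


# Hypothetical weights for one month.
node_weight = {"read": 100, "search": 40, "edit": 10}
edge_weight = {("read", "search"): 30, ("edit", "read"): 6}

total_node, total_edge, recirculation = engagement_metrics(node_weight, edge_weight)
print(total_node, total_edge, round(recirculation, 2))  # 150 36 0.24
```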

We calculated these metrics for the 13 months under consideration and plotted their variations over time. An increase in TotalNodeTraffic means that more users visited Wikipedia. An increase in TotalTrafficRecirculation means that more users performed at least two actions while on Wikipedia, our chosen indicator of high engagement in Wikipedia. We observe that TotalNodeTraffic first increased and then became more or less stable. By contrast, TotalTrafficRecirculation mostly decreased, but we see a small peak in January 2012.

Two important events happened in our 13-month period. During the donation campaign (November to December 2011) more users visited Wikipedia (higher TotalNodeTraffic value). We speculate that many users became interested in Wikipedia during the campaign. However, TotalTrafficRecirculation actually decreased over the same period: although more users visited Wikipedia, they did not perform two (or more) actions while visiting Wikipedia; they did not become more engaged with Wikipedia. By contrast, during the SOPA/PIPA protest (January 2012), we see a peak in both TotalNodeTraffic and TotalTrafficRecirculation. More users visited Wikipedia and many users became more engaged with Wikipedia; they also read articles, gathered information about the protest, and donated money while visiting Wikipedia.

We detected different engagement patterns on weekdays and weekends. Whereas more users visited Wikipedia during weekdays (high value of TotalNodeTraffic), users that visited Wikipedia during the weekend were more engaged (high value of TotalTrafficRecirculation). On weekends, users performed more actions during their visits.

People behave differently on weekdays compared to weekends. The same happens with Wikipedia.

Did the donation campaign make Wikipedia more engaging?

So which actions became more frequent as a result of the donation campaign? As expected, we observed a significant traffic increase on the “donate” node during the two months; many users made a donation. In addition, the traffic between some nodes increased, but only slightly. Additional actions were performed; for instance, more users created a user account or visited community-related pages, all within the same session. However, overall, users mostly performed individual actions, since TotalTrafficRecirculation decreased during that time period.

So the campaign was successful in terms of donations, but less so in terms of making Wikipedia more engaging.

This is a write-up of the presentation given by Janette Lehmann at TNETS Satellite, ECCS, Barcelona, September 2013.

Measuring user engagement for the “average” users and experiences: Can psychophysiological measurement help?

I recently attended the Input-Output conference in Brighton, UK. The theme of the conference was “Interdisciplinary approaches to Causality in Engagement, Immersion, and Presence in Performance and Human-Computer Interaction”. I wanted to learn about psychophysiological measurement.

I am myself on a quest: to understand what user engagement is and how to measure it, with a focus on web applications with thousands to millions of users. To this end, I am looking at three measurement approaches: self-reporting (e.g., questionnaires); observational methods (e.g., facial expression analysis, mouse tracking); and of course web analytics (dwell time, page views, absence time).

Observational methods include measurement from psychophysiology, a branch of physiology that studies the relationship between physiological processes and thoughts, emotions, and behaviours. Indeed, the body responds to physiological processes: when we exercise, we sweat; when we get embarrassed, our cheeks get red and warm.

Common measurements include:

  • Event-related potentials – the electroencephalogram (EEG) is based on recordings of electrical brain activity measured at the surface of the scalp.
  • Functional magnetic resonance imaging (fMRI) – this technique involves imaging blood oxygenation using an MRI machine.
  • Cardiovascular measures – heart rate (HR); beats per minute (BPM); heart rate variability (HRV).
  • Respiratory sensors – monitor oxygen intake and carbon dioxide output.
  • Electromyographic (EMG) sensors – measure electrical activity in muscles.
  • Pupillometry – measures variations in the diameter of the pupillary aperture of the eye in response to psychophysical and/or psychological stimuli.
  • Galvanic skin response (GSR) – measures perspiration/sweat gland activity, also called Skin Conductance Level (SCL).
  • Temperature sensors – measure changes in blood flow and body temperature.

I learned how these measures are used, why, and some of the outcomes. But I started to ask myself a question. Yes, these measures can help us understand engagement (and other related phenomena) in extreme cases, for example:

  • patient with a psychiatric disorder (such as depersonalisation disorder),
  • strong emotion caused by an intense experience (a play where the audience is part of the stage, or when on a roller coaster ride), or
  • total immersion (while playing a computer game), which actually goes beyond engagement.

In my work, I am measuring user engagement for the “average” users and experiences; millions of users who visit a news site on a daily basis to consume the latest news. Can these measures tell me something?

Some recent work published in the journal Cyberpsychology, Behavior, and Social Networking explored many of the above measures to study the body responses of 30 healthy subjects during a 3-minute exposure to a slide show of natural panoramas (relaxation condition), their personal social network account (Facebook), and a mathematical task (stress condition). They found differences in the measures depending on the condition. Neither the subjects nor the experiences were “extreme”. However, the experiences were different enough. Can a news portal experiment with three comparably distinct conditions?

Psychophysiological measurement can help us understand user engagement and other phenomena. But to be able to do so for the average users or experiences, we are likely to need to conduct “large-ish scale” studies to obtain significant insights.

How large-ish? I do not know.

This is in itself an interesting and important question to ask, a question to keep in mind when exploring these types of measurement, as they are still expensive to conduct, cumbersome, and obtrusive. This is a fascinating area to dive into.

Image/photo credits: The Cognitive Neuroimaging Laboratory, and Image Editor and benarent (Creative Commons BY).

Today I am giving a keynote at the 18th International Conference on Application of Natural Language to Information Systems (NLDB2013), which is held at MediaCityUK, Salford.

I have now started to think about which questions to ask when evaluating user engagement. In the talk, I discuss these questions through five studies we did. Also included are questions asked when

  • evaluating serendipitous experience in the context of entity-driven search using social media such as Wikipedia and Yahoo! Answers.
  • evaluating the news reading experience when links to related articles are automatically generated using “light weight” understanding techniques.

The slides are available on Slideshare.

Relevant published papers include:

I will write about these two works in later posts.

What can absence time tell about user engagement?

Two widely employed engagement metrics are click-through rate and dwell time. These are particularly used for services where user engagement is about clicking, for example in the context of search where presumably users click on relevant results, and/or spending time on a site, for example consuming content in the context of a news portal.

In search, both have been used as indicators of relevance, and have been exploited to infer user satisfaction with search results and to improve ranking functions. However, how to properly interpret the relations between these metrics, retrieval quality and long-term user engagement with the search application is not straightforward. Also, relying solely on clicks and time spent can lead to contradictory if not erroneous conclusions. Indeed, with the current trend of displaying rich information on web pages, for instance the phone number of restaurants or weather data in search results, users do not need to click to access the information and the time spent on a website is shorter.

Measure: Absence time
The absence time measures the time it takes a user to decide to return to a site to accomplish a new task. Taking a news site as an example, a good experience associated with quality articles might motivate the user to come back to that news site on a regular basis. On the other hand, if the user is disappointed, for example because the articles were not interesting or the site was confusing, he or she may return less often and even switch to an alternative news provider. Another example is a visit to a community questions and answers website. If the questions of a user are well and promptly answered, the odds are that he or she will be enticed to raise new questions and return to the site soon.

Our assumption is that if users find a site interesting, engaging or useful, they will return to it sooner.

This assumption has the advantage of being simple, intuitive and applicable to a large number of settings.
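As a rough illustration of the measure, absence times can be derived from the start and end timestamps of a user's visits. The visit representation and the helper name `absence_times` below are assumptions made for this sketch; how visits are delimited in practice (for example with an inactivity timeout) is a separate decision not covered here.

```python
from datetime import datetime


def absence_times(visits):
    """Return the gaps, in hours, between consecutive visits of one user.

    visits: list of (start, end) datetime pairs, in any order.
    """
    ordered = sorted(visits)
    return [
        (next_start - prev_end).total_seconds() / 3600
        for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:])
    ]


# Hypothetical visits by a single user.
user_visits = [
    (datetime(2013, 1, 1, 9, 0), datetime(2013, 1, 1, 9, 10)),
    (datetime(2013, 1, 1, 15, 10), datetime(2013, 1, 1, 15, 20)),
    (datetime(2013, 1, 3, 15, 20), datetime(2013, 1, 3, 15, 30)),
]
gaps = absence_times(user_visits)
print(gaps)  # [6.0, 48.0]
```

Under the assumption stated above, the first return (after 6 hours) would be read as a better sign than the second (after 48 hours).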

Case study: Yahoo! Answers Japan
We used a popular community question answering website hosted by Yahoo! Japan, where users are given the possibility to ask questions about any topic of their interest. Other users may respond by writing an answer. These answers are recorded and can be searched by any user through a standard search interface. We studied the actions of approximately one million users during two weeks. A user action happens every time a user interacts with Yahoo! Answers: every time he or she issues a query or clicks on a link, be it an answer, an ad or a navigation button. We compared the behaviour of users exposed to six functions used to rank past answers, both in terms of traditional metrics and of absence time.

Methodology: Survival analysis
We use Survival Analysis to study absence time. Survival Analysis has been used in applications concerned with the death of biological organisms, each receiving different treatments. An example is throat cancer treatment where patients are administered one of several drugs and the practitioner is interested in seeing how effective the different treatments are. The analogy with our analysis of absence time is unfortunate but nevertheless useful. We treat the user's exposure to one of the ranking functions as a “treatment” and the absence time as his or her survival time. In other words, a Yahoo! Answers user dies each time he or she visits the site … but hopefully resuscitates instantly as soon as his or her visit ends.

Survival analysis makes use of a hazard rate, which reflects the probability that a user dies at a given time. It can be very loosely understood as the speed of death of a population of patients at that time. Returning to our example, if the hazard rate of throat cancer patients given, say, drug A is higher than the hazard rate of patients under drug B treatment, then drug B patients have a higher probability of surviving until that time. A higher hazard rate implies a lower survival rate.
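To give a feel for what a hazard rate looks like on return data, here is a deliberately simplified, discrete-time (life-table style) estimate in Python. It is only a sketch under assumptions: absence times are rounded to whole days, the two groups and their numbers are invented, and `discrete_hazard` is a hypothetical helper. The actual study relied on proper survival analysis, described in the paper referenced at the end of this post.

```python
def discrete_hazard(absences, horizon):
    """Estimate the hazard for each day 0..horizon-1.

    absences: list of (days_absent, returned) pairs. returned=False means
    the user had not come back when observation stopped (censored).
    The hazard for day t is the share of users still absent at the start
    of day t who return during day t.
    """
    hazards = []
    for t in range(horizon):
        at_risk = sum(1 for days, _ in absences if days >= t)
        events = sum(1 for days, returned in absences if days == t and returned)
        hazards.append(events / at_risk if at_risk else 0.0)
    return hazards


# Invented data for two ranking functions ("treatments").
ranking_a = [(0, True), (0, True), (1, True), (2, True)]
ranking_b = [(0, True), (2, True), (3, True), (3, False)]

hazard_a = discrete_hazard(ranking_a, horizon=3)
hazard_b = discrete_hazard(ranking_b, horizon=3)
print(hazard_a)
print(hazard_b)
```

In this toy example, users exposed to ranking A have a higher hazard on the first two days, meaning they tend to come back sooner.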

We use hazard rates to compare the different ranking functions for Yahoo! Answers: a higher hazard rate translates into a shorter absence time and a prompter return to Yahoo! Answers, which is a sign of higher engagement. What did we find?

A better ranking does not imply more engaged users
Ranking algorithms are compared with a number of measures; a widely used one is DCG (Discounted Cumulative Gain), which rewards ranking algorithms retrieving relevant results at high ranks. The higher the DCG, the better the ranking algorithm. We saw that, for the six ranking functions we compared, a higher DCG did not always translate to a higher hazard rate, that is, to users returning to Yahoo! Answers sooner.

Returning relevant results is important, but is not the only criterion to keep users engaged with the search application.
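For readers unfamiliar with DCG, the sketch below shows one common formulation, in which the relevance of the result at rank i is divided by log2(i + 1). This is an assumption for illustration: several variants of DCG exist, and the exact one used in the study is not stated here. The relevance grades are invented.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain of a ranked list of relevance grades."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances, start=1)
    )


# The same three results, ranked well and ranked badly.
good_ranking = [3, 2, 0]
bad_ranking = [0, 2, 3]
print(dcg(good_ranking), dcg(bad_ranking))
```

Placing the most relevant result first yields the larger score, which is the sense in which DCG "rewards" relevant results at high ranks.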

More clicks is not always good, but no click is bad
A common assumption is that a higher number of clicks is a reflection of a higher user satisfaction with the search results. We observe that up to 5 clicks, each new click is associated with a higher hazard rate, but the increases from the third click are small. A fourth or fifth click has a very similar hazard rate. From the sixth click, the hazard rate decreases slowly.

This suggests that on average, clicks after the fifth one reflect a poorer user experience; users cannot find the information they are looking for.

We also observed that the hazard rate with five clicks or more is always higher compared with no click at all; when users search on Yahoo! Answers, no click means a bad user experience.

A click at rank 3 is better than a click at rank 1
The hazard rate is larger for clicks at ranks 2, 3 and 4, the maximum arising at rank 3, when compared to a click at rank 1. For lower ranks, the trend is toward decreasing hazard. Only the click at rank 10 was found to be clearly less valuable than a click at rank 1. It seems that users unhappy with results at earlier ranks simply click on the last displayed result, for no apparent reason apart from its being the last one on the search result page.

Clicking lower in the ranking suggests a more careful choice from the user, while clicking at the bottom is a sign that the overall ranking is of low quality.

Clicking fast on a result is a good sign
We found that the shorter the time between the search results of a query being displayed and the first click, the higher the hazard rate.

Users who find their answers quickly return sooner to the search application.

More views is worse than more queries
When users are returned search results, they may click on a result, then return to the search result page, and then click on another result. Each display of search results generates a view. At any time, the user may submit a new query. Both returning to the search result page several times and a higher number of query reformulations are signs that the user is not satisfied with the current search results. Which one is worse? We could see that having more views than queries was associated on average with a low hazard rate, meaning a longer absence time.

This suggests that returning to the same search result page is a worse user experience than reformulating the query.

Without the absence time, it would have been harder to observe this, unless we explicitly asked users to tell us what was going on.

A small warning
A user might decide to return sooner or later to a website for reasons unrelated to the previous visits (being on holidays for example). It is important to have a large sample of interaction data to detect coherent signals and to take systematic effects into account.

Take away message

Using absence time to measure user engagement is easy to interpret and less ambiguous than many of the commonly employed metrics. Use it and get new insights with it.

This work was done in collaboration with Georges Dupret. More details about the study can be found in Absence time and user engagement: Evaluating Ranking Functions, which was published at the 6th ACM International Conference on Web Search and Data Mining in Rome, 2013.

Photo credits: tanfelisa and kaniths (Creative Commons BY).

We need a taxonomy of web user engagement

There are lots and lots of metrics that can be used to assess how users engage with a website. Ones widely used by the web-analytics community include click-through rates, number of page views, time spent on a website, how often users return to a site, and number of users.


Although these metrics cannot explicitly explain why users engage with a site, they can act as proxies for online user engagement: two million users accessing a website daily is a strong indication of a high engagement with that site.

Metrics, metrics and metrics

There are three main types of web-analytics metrics:

  • Popularity metrics measure how much a website is used (for example, by counting the total number of users on the site in a week). The higher the number, the more popular the website.
  • How a website is used when visited is measured with activity metrics, for example, the average number of clicks per visit across all users.
  • Loyalty metrics are concerned with how often users return to a website. An example is the return rate, calculated as the average number of times users visited a website within a month.

Loyalty and popularity metrics can be calculated on a daily, weekly or monthly basis. Activity metrics are calculated at visit level.

So one would think that a highly engaging website is one with a high number of visits (very popular), where users spend lots of time and click often (lots of activity), and return frequently (high loyalty). But not all websites, whether popular or not, have both active and loyal users.

This does not mean that user engagement on such websites is lower; it is simply different.

What did we do?

We collected one month of browsing data from an anonymized sample of approximately 2M users. For 80 websites, encompassing a diverse set of services such as news, weather, movies and mail, we calculated the average values of the following eight metrics:

  • Popularity metrics: number of distinct users, number of visits, and number of clicks (also called page views) for that month.
  • Activity metrics: average number of page views per visit and average time per visit (also called dwell time).
  • Loyalty metrics: number of days a user visited the site, number of times a user visited the site, and average time a user spends on the site, for that month.
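To show how these eight metrics relate to raw data, here is a minimal Python sketch that derives them from a visit log for a single site and a single month. The log format (user, day, page views, seconds), the function name `site_metrics` and the dictionary keys are assumptions for illustration, and the loyalty metrics are averaged per user; this is not the pipeline used in the study.

```python
def site_metrics(visits):
    """Compute popularity, activity and loyalty metrics for one site.

    visits: list of (user_id, day, page_views, seconds) tuples, one per visit.
    """
    if not visits:
        raise ValueError("need at least one visit")
    n_visits = len(visits)
    page_views = sum(pv for _, _, pv, _ in visits)
    seconds = sum(s for _, _, _, s in visits)
    days_by_user = {}
    for user, day, _, _ in visits:
        days_by_user.setdefault(user, set()).add(day)
    n_users = len(days_by_user)
    return {
        # Popularity
        "users": n_users,
        "visits": n_visits,
        "page_views": page_views,
        # Activity (per visit)
        "page_views_per_visit": page_views / n_visits,
        "dwell_time_per_visit": seconds / n_visits,
        # Loyalty (per user)
        "active_days_per_user": sum(len(d) for d in days_by_user.values()) / n_users,
        "visits_per_user": n_visits / n_users,
        "time_per_user": seconds / n_users,
    }


# Invented log: user u1 visits three times on two days, u2 once.
log = [
    ("u1", 1, 4, 120),
    ("u1", 1, 2, 60),
    ("u1", 3, 6, 180),
    ("u2", 2, 4, 240),
]
metrics = site_metrics(log)
print(metrics)
```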

Websites differ widely in terms of their engagement

Some websites are very popular (for example, news portals) whereas others are visited by small groups of users (as was the case for many specific-interest websites). Visit activity also depends on the website. For instance, search sites tend to have a much shorter dwell time than sites related to entertainment (where people play games). Loyalty per website differed as well. Media (news, magazines) and communication (messenger, mail) sites have many users returning to them much more regularly than sites containing information of temporary interest (an e-commerce site selling cars). Loyalty is also influenced by the frequency with which new content is published. Indeed, some sites produce new content once per week.

High popularity did not entail high activity. Many sites have many users spending little time on them. A good example is a search site, where users come, submit a query, get the result, and if satisfied, leave the site.

This results in a low dwell time even though user expectations were entirely met.

The same holds for a site on Q&A, or a weather site. What matters for such sites is their popularity.

Any patterns? Yes … 

To identify engagement patterns, we grouped the 80 sites using clustering approaches applied to the eight engagement metrics. We also extracted for each group which metrics and their values (whether high or low) were specific to that group. This process generated five groups with clear engagement patterns, and a sixth group with none:

  • Sites where the main factor was their high popularity (for example as measured by the high numbers of users). Examples of sites following this pattern include media sites providing daily news and search sites. Users interact with these sites in various ways; what they have in common is that they are used by many users.
  • Sites with low popularity, for instance having a low number of visits. Many interest-specific sites followed this pattern. Those sites center around niche topics or services, which do not attract a large number of users.
  • Sites with a high number of clicks per visit. This pattern was followed by e-commerce and configuration (accessed by users to update their profiles for example) sites, where the main activity is to click.
  • Sites with high dwell time and low clicks per visit, and with low loyalty. This pattern was followed by domain-specific media sites of periodic nature (new content published on a weekly basis), which are therefore not often accessed. However, when they are accessed, users spend more time consuming their content. The design of such sites (compared to mainstream media sites) leads to this type of engagement, since new content was typically published on their homepage. Thus users are not enticed to reach additional content, if there is any.
  • Sites with high loyalty, small dwell time and few clicks. This pattern was followed by navigational sites (the front page of an Internet company), whose role is to direct users to interesting content or services in other sites (of that same company); what matters is that users come regularly to them.

This simple study (80 sites and 8 metrics) identified several patterns of user engagement.

However, sites of the same type do not necessarily follow the same engagement pattern.

For instance, not all mainstream media sites followed the first pattern (high popularity). It is likely that, among others, the structure of the site has an effect.
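The grouping step described above can be sketched in a few lines of Python. This is only an illustration under assumptions: it uses a plain k-means on standardized metrics, whereas the post only says "clustering approaches" without naming one; the four sites, their three metrics and all function names are invented.

```python
import math
from statistics import mean, pstdev


def standardize(rows):
    """Rescale each metric (column) to zero mean and unit variance."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    stdevs = [pstdev(col) or 1.0 for col in columns]  # avoid dividing by zero
    return [[(x - m) / s for x, m, s in zip(row, means, stdevs)] for row in rows]


def kmeans(points, k, iterations=50):
    """Plain k-means with a deterministic farthest-point initialisation."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        farthest = max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        centroids.append(list(farthest))
    labels = [0] * len(points)
    for _ in range(iterations):
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        updated = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            updated.append(
                [mean(col) for col in zip(*members)] if members else centroids[c]
            )
        if updated == centroids:  # converged
            break
        centroids = updated
    return labels


# Invented sites with (users, clicks per visit, visits per user).
sites = {
    "search": (900, 2, 20),
    "news": (800, 3, 18),
    "niche_blog": (20, 3, 2),
    "hobby_forum": (30, 4, 3),
}
labels = kmeans(standardize(list(sites.values())), k=2)
print(dict(zip(sites, labels)))
```

Standardizing first matters because the metrics live on very different scales; without it, the number of users would dominate the distance computation.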

… So what now?

We must study way more sites and include lots more engagement metrics. This is the only way to build, if we want, and we should, a taxonomy of web user engagement. With a taxonomy, we will know the best metrics to measure engagement on a site.

Counting clicks may be totally useless for some sites. But if not, and the number of clicks is for instance way too low, knowing which engagement pattern a site follows helps in making the appropriate changes to the site.

This work was done in collaboration with Janette Lehmann, Elad Yom-Tov and Georges Dupret. More details about the study can be found in Models of User Engagement, a paper presented at the 20th conference on User Modeling, Adaptation, and Personalization (UMAP), 2012.

Photo credits: Denis Vrublevski and matt hutchinson (Creative Commons BY).

Together with Heather O’Brien and Elad Yom-Tov, we will be giving a tutorial at the International World-Wide Web Conference (WWW), 13-17 May 2013, Rio de Janeiro.

The slides are now available on Slideshare.
You can also access the two-slides per page format (PDF) here: MeasuringUserEngagement or one-slide per page format (PDF) here.
The references can be found here: References_Tutorial.

We will continue updating the slides, correct any errors and so on. Feedback very welcome.