Monthly Archives: June 2013

Measuring user engagement for the “average” users and experiences: Can psychophysiological measurement help?

I recently attended the Input-Output conference in Brighton, UK. The theme of the conference was “Interdisciplinary approaches to Causality in Engagement, Immersion, and Presence in Performance and Human-Computer Interaction”. I wanted to learn about psychophysiological measurement.

I am myself on a quest: understand what is user engagement and how to measure it, with a focus on web applications with thousands to millions of users. To this end, I am looking at three measurement approaches: self-reporting (e.g., questionnaires); observational methods (e.g., facial expression analysis, mouse tracking); and of course web analytics (dwell time, page views, absence time).

Observational methods include measurement from psychophysiology, a branch of physiology that studies the relationship between physiological processes and thoughts, emotions, and behaviours. Indeed, the body responds to what we do and feel: when we exercise, we sweat; when we get embarrassed, our cheeks get red and warm.

Common measurements include:

  • Event-related potentials (ERPs) – derived from the electroencephalogram (EEG), a recording of electrical brain activity measured at the surface of the scalp.
  • Functional magnetic resonance imaging (fMRI) – this technique involves imaging blood oxygenation using an MRI machine.
  • Cardiovascular measures – heart rate (HR); beats per minute (BPM); heart rate variability (HRV).
  • Respiratory sensors – monitor oxygen intake and carbon dioxide output.
  • Electromyographic (EMG) sensors – measure electrical activity in muscles.
  • Pupillometry – measures variations in the diameter of the pupillary aperture of the eye in response to psychophysical and/or psychological stimuli.
  • Galvanic skin response (GSR) – measures perspiration/sweat gland activity, also called skin conductance level (SCL).
  • Temperature sensors – measure changes in blood flow and body temperature.

I learned how these measures are used, why, and some of the outcomes. But I started to ask myself: yes, these measures can help in understanding engagement (and other related phenomena) for extreme cases, for example:

  • a patient with a psychiatric disorder (such as depersonalisation disorder),
  • strong emotion caused by an intense experience (a play where the audience is part of the stage, or a roller coaster ride), or
  • total immersion (while playing a computer game), which actually goes beyond engagement.

In my work, I am measuring user engagement for the “average” users and experiences: millions of users who visit a news site on a daily basis to consume the latest news. Can these measures tell me something?

Some recent work published in the journal Cyberpsychology, Behavior, and Social Networking explored many of the above measures to study the body responses of 30 healthy subjects during a 3-minute exposure to a slide show of natural panoramas (relaxation condition), their personal social network account (Facebook), and a mathematical task (stress condition). They found differences in the measures depending on the condition. Neither the subjects nor the experiences were “extreme”. However, the experiences were different enough. Can a news portal experiment with three comparably distinct conditions?

Psychophysiological measurement can help in understanding user engagement and other phenomena. But to be able to do so for average users or experiences, we are likely to need to conduct “large-ish scale” studies to obtain significant insights.

How large-ish? I do not know.

This is in itself an interesting and important question to ask, a question to keep in mind when exploring these types of measurement, as they are still expensive to conduct, cumbersome, and obtrusive. This is a fascinating area to dive into.

Image/photo credits: The Cognitive Neuroimaging Laboratory, and Image Editor and benarent (Creative Commons BY).

Hey Twitter crowd … What else is there?

Original and shorter post at

Twitter is a powerful tool for journalists at multiple stages of the news production process: to detect newsworthy events, interpret them, or verify their factual veracity. In 2011, a poll of 478 journalists from 15 countries found that 47% of them used Twitter as a source of information. Journalists and news editors also use Twitter to contextualize and enrich their articles by examining the responses to them, including comments and opinions as well as pointers to other related news. This is possible because some users on Twitter devote a substantial amount of time and effort to news curation: carefully selecting and filtering news stories highly relevant to specific audiences.

We developed an automatic method that groups together all the users who tweet a particular news item, and later detects new content posted by them that is related to the original news item. We call each group a transient news crowd. The beauty of this, in addition to being fully automatic, is that there is no need to pre-define topics, and the crowd becomes available immediately, allowing journalists to cover news beats while incorporating the shifts of interest of their audiences.

Figure 1. Detection of follow-up stories related to a published article using the crowd of users that tweeted the article.

Transient news crowds
In our experiments, we define the crowd of a news article as the set of users that tweeted the article within the first 6 hours after it was published. We followed the users in each crowd for one week, recording every public tweet they posted during this period. We used Twitter data around news stories published by two prominent international news portals: BBC News and Al Jazeera English.
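As an illustration only (the data structures and names are my own, not from the papers), the crowd construction just described can be sketched in a few lines of Python, assuming tweets are available as (user, url, timestamp) records:

```python
from datetime import datetime, timedelta

CROWD_WINDOW = timedelta(hours=6)   # window for crowd membership
FOLLOW_PERIOD = timedelta(days=7)   # how long we follow the crowd

def build_crowd(article_url, published_at, tweets):
    """Users who tweeted the article within 6 hours of publication.

    `tweets` is an iterable of (user, url, timestamp) tuples.
    """
    deadline = published_at + CROWD_WINDOW
    return {user for user, url, ts in tweets
            if url == article_url and published_at <= ts <= deadline}

def crowd_tweets(crowd, published_at, tweets):
    """All public tweets posted by crowd members during the follow-up week."""
    end = published_at + FOLLOW_PERIOD
    return [(user, url, ts) for user, url, ts in tweets
            if user in crowd and published_at <= ts <= end]
```

In practice this would run over a streaming Twitter feed rather than an in-memory list, but the windowing logic is the same.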

What did we find?

  • After they tweet a news article, people’s subsequent tweets are correlated to that article during a brief period of time.
  • The correlation is weak but significant, in terms of reflecting the similarity between the articles that originated the crowd.
  • While the majority of crowds simply disperse over time, parts of some crowds come together again around new newsworthy events.

What can we do with the crowd?
Given a news crowd and a time slice, we want to find the articles posted in that time slice that are related to the article that created the crowd. To accomplish this, we used a machine learning approach, trained on data annotated using crowdsourcing. We experimented with three types of features:

  • frequency-based: how often is an article posted by the crowd compared to other articles?
  • text-based: how similar are the two articles, considering the tweets that posted them?
  • user-based: is the crowd focussed on the topic of the article? Does it contain influential members?
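To make the three feature families concrete, here is a hypothetical Python sketch; the function and field names are my own, and the real system uses richer features than these three toy examples:

```python
from collections import Counter
from math import sqrt

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bags of words."""
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def candidate_features(candidate_url, crowd_posts, all_posts, seed_text, cand_text):
    """One toy feature per family, for a single candidate article.

    `crowd_posts` / `all_posts` are lists of (user, url, tweet_text).
    """
    crowd_count = sum(1 for _, url, _ in crowd_posts if url == candidate_url)
    global_count = sum(1 for _, url, _ in all_posts if url == candidate_url)
    # frequency-based: how strongly this crowd favours the candidate article
    freq = crowd_count / global_count if global_count else 0.0
    # text-based: similarity of the candidate to the seed article
    text_sim = cosine(Counter(seed_text.split()), Counter(cand_text.split()))
    # user-based: fraction of crowd members who posted the candidate
    users = {u for u, url, _ in crowd_posts if url == candidate_url}
    crowd_users = {u for u, _, _ in crowd_posts}
    user_frac = len(users) / len(crowd_users) if crowd_users else 0.0
    return [freq, text_sim, user_frac]
```

Such a feature vector would then be fed to the trained classifier to decide whether the candidate is a follow-up story.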

We find that the features largely complement each other. Some features are always valuable, while others contribute only in some cases. The most important features include the similarity to the original story, as well as measures of how uniquely the candidate article and its contributing users are associated with the specific story’s crowd.

Crowd summarisation
We illustrate the outcome of our automatic method with the article Central African rebels advance on capital, posted on Al Jazeera on 28 December, 2012.

Figure 2. Word clouds generated for the crowd on the article “Central African rebels advance on capital”, by considering the terms appearing in stories filtered by our system (top) and on the top stories by frequency (bottom).

Without using our method (in the figure, bottom), we obtain frequently-posted articles which are weakly related or not related at all to the original news article. Using our method (in the figure, top), we observe several follow-up articles to the original one. Four days after the news article was published, several members of the crowd tweeted an article about the fact that the rebels were considering a coalition offer. Seven days after the news article was published, crowd members posted that rebels had stopped advancing towards Bangui, the capital of the Central African Republic.

News crowds allow journalists to automatically track the development of stories. For more details you can check our papers:

  • Janette Lehmann, Carlos Castillo, Mounia Lalmas and Ethan Zuckerman: Transient News Crowds in Social Media. Seventh International AAAI Conference on Weblogs and Social Media (ICWSM), 8-10 July 2013, Cambridge, Massachusetts.
  • Janette Lehmann, Carlos Castillo, Mounia Lalmas and Ethan Zuckerman: Finding News Curators in Twitter. WWW Workshop on Social News On the Web (SNOW), Rio de Janeiro, Brazil.

Janette Lehmann, Universitat Pompeu Fabra
Carlos Castillo, Qatar Computing Research Institute
Mounia Lalmas, Yahoo! Labs Barcelona
Ethan Zuckerman, MIT Center for Civic Media

Today I am giving a keynote at the 18th International Conference on Application of Natural Language to Information Systems (NLDB2013), which is held at MediaCityUK, Salford.

I have now started to think about what questions to ask when evaluating user engagement. In the talk, I discuss these questions through five studies we carried out. Also included are the questions asked when

  • evaluating serendipitous experience in the context of entity-driven search using social media such as Wikipedia and Yahoo! Answers.
  • evaluating the news reading experience when links to related articles are automatically generated using “light weight” understanding techniques.

The slides are available on Slideshare.

Relevant published papers include:

I will write about these two works in later posts.

What can absence time tell about user engagement?

Two widely employed engagement metrics are click-through rate and dwell time. These are particularly used for services where user engagement is about clicking (for example, in search, where users presumably click on relevant results) and/or about spending time on a site (for example, consuming content on a news portal).

In search, both have been used as indicators of relevance, and have been exploited to infer user satisfaction with search results and to improve ranking functions. However, how to properly interpret the relationship between these metrics, retrieval quality and long-term user engagement with the search application is not straightforward. Also, relying solely on clicks and time spent can lead to contradictory if not erroneous conclusions. Indeed, with the current trend of displaying rich information on web pages, for instance the phone number of a restaurant or weather data in search results, users do not need to click to access the information, and the time spent on a website is shorter.

Measure: Absence time
The absence time measures the time it takes a user to return to a site to accomplish a new task. Taking a news site as an example, a good experience with quality articles might motivate the user to come back to that news site on a regular basis. On the other hand, if the user is disappointed, for example because the articles were not interesting or the site was confusing, he or she may return less often and even switch to an alternative news provider. Another example is a visit to a community question-answering website. If a user’s questions are answered well and promptly, the odds are that he or she will be enticed to raise new questions and return to the site soon.

Our assumption is that if users find a site interesting, engaging or useful, they will return to it sooner.

This assumption has the advantage of being simple, intuitive and applicable to a large number of settings.
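Computing absence time from an interaction log is straightforward. A minimal sketch (the representation of visits is my own assumption), given each user's visit-start timestamps:

```python
def absence_times(visit_times):
    """Absence times, in hours, between consecutive visits of one user.

    `visit_times` is a list of visit-start datetimes, in any order.
    """
    ts = sorted(visit_times)
    return [(b - a).total_seconds() / 3600.0 for a, b in zip(ts, ts[1:])]
```

Each gap between two consecutive visits contributes one observation of the absence time for that user.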

Case study: Yahoo! Answers Japan
We used a popular community question-answering website hosted by Yahoo! Japan, where users can ask questions about any topic of interest. Other users may respond by writing an answer. These answers are recorded and can be searched by any user through a standard search interface. We studied the actions of approximately one million users during two weeks. A user action happens every time a user interacts with Yahoo! Answers: every time he or she issues a query or clicks on a link, be it an answer, an ad or a navigation button. We compared the behaviour of users exposed to six functions used to rank past answers, both in terms of traditional metrics and of absence time.

Methodology: Survival analysis
We use survival analysis to study absence time. Survival analysis has been used in applications concerned with the death of biological organisms, each receiving different treatments. An example is throat cancer treatment, where patients are administered one of several drugs and the practitioner is interested in how effective the different treatments are. The analogy with our analysis of absence time is unfortunate but nevertheless useful. We treat a user’s exposure to one of the ranking functions as a “treatment” and his or her survival time as the absence time. In other words, a Yahoo! Answers user dies each time he or she visits the site … but hopefully resuscitates instantly as soon as his or her visit ends.

Survival analysis makes use of a hazard rate, which reflects the probability that a user dies at a given time. It can be very loosely understood as the speed of death of a population of patients at that time. Returning to our example, if the hazard rate of throat cancer patients administered, say, drug A is higher than the hazard rate of patients under drug B, then drug B patients have a higher probability of surviving until that time. A higher hazard rate implies a lower survival rate.
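As an illustration of the underlying machinery (not necessarily the exact estimator used in the study), the standard Kaplan-Meier survival curve can be computed as follows; a user who has not yet returned by the end of the observation window is treated as censored:

```python
def kaplan_meier(durations, observed):
    """Kaplan-Meier estimate of a survival curve.

    durations: time until the user returned (or until observation ended)
    observed:  1 if the return (the "event") was observed, 0 if censored
    Returns a list of (time, survival probability) pairs.
    """
    event_times = sorted({t for t, e in zip(durations, observed) if e})
    curve, s = [], 1.0
    for t in event_times:
        at_risk = sum(1 for d in durations if d >= t)          # still "alive" at t
        events = sum(1 for d, e in zip(durations, observed)    # deaths exactly at t
                     if d == t and e)
        s *= 1.0 - events / at_risk
        curve.append((t, s))
    return curve
```

Comparing the curves (or the corresponding hazard rates) of users under different ranking functions is what lets us say one “treatment” keeps users away longer than another.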

We use hazard rates to compare the different ranking functions for Yahoo! Answers: a higher hazard rate translates into a shorter absence time and a prompter return to Yahoo! Answers, which is a sign of higher engagement. What did we find?

A better ranking does not imply more engaged users
Ranking algorithms are compared with a number of measures; a widely used one is DCG (discounted cumulative gain), which rewards ranking algorithms that retrieve relevant results at high ranks. The higher the DCG, the better the ranking algorithm. We saw that, for the six ranking functions we compared, a higher DCG did not always translate into a higher hazard rate, or in other words, into users returning to Yahoo! Answers sooner.
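For reference, DCG in its standard formulation (which may differ in detail from the variant used in the study) is a one-liner, with relevance grades listed in rank order:

```python
from math import log2

def dcg(relevances):
    """Discounted cumulative gain of a ranked list of relevance grades.

    The result at rank i (1-based) contributes rel_i / log2(i + 1),
    so relevant results placed higher in the list count for more.
    """
    return sum(rel / log2(i + 2) for i, rel in enumerate(relevances))
```

For example, `dcg([3, 2, 1])` exceeds `dcg([1, 2, 3])` because the most relevant result sits at the top of the first ranking.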

Returning relevant results is important, but is not the only criterion to keep users engaged with the search application.

More clicks is not always good, but no click is bad
A common assumption is that a higher number of clicks reflects higher user satisfaction with the search results. We observed that up to five clicks, each new click is associated with a higher hazard rate, but the increases from the third click onwards are small: a fourth or fifth click has a very similar hazard rate. From the sixth click, the hazard rate decreases slowly.

This suggests that on average, clicks after the fifth one reflect a poorer user experience; users cannot find the information they are looking for.

We also observed that the hazard rate with five clicks or more is always higher than with no click at all; when users search on Yahoo! Answers, no click means a bad user experience.

A click at rank 3 is better than a click at rank 1
The hazard rate is larger for clicks at ranks 2, 3 and 4 than for a click at rank 1, with the maximum at rank 3. For lower ranks, the trend is toward decreasing hazard. Only a click at rank 10 was found to be clearly less valuable than a click at rank 1. It seems that users unhappy with results at earlier ranks simply click on the last displayed result, for no apparent reason apart from it being the last one on the search result page.

Clicking lower in the ranking suggests a more careful choice from the user, while clicking at the bottom is a sign that the overall ranking is of low quality.

Clicking fast on a result is a good sign
We found that the shorter the time between the search results of a query being displayed and the first click, the higher the hazard rate.

Users who find their answers quickly return sooner to the search application.

More views is worse than more queries
When users are returned search results, they may click on a result, go back to the search result page, and then click on another result. Each display of search results generates a view. At any time, the user may submit a new query. Both returning to the search result page several times and a higher number of query reformulations are signs that the user is not satisfied with the current search results. Which one is worse? We saw that having more views than queries was associated, on average, with a lower hazard rate, meaning a longer absence time.

This suggests that returning to the same search result page is a worse user experience than reformulating the query.

Without the absence time, it would have been harder to observe this, unless we explicitly asked users to tell us what was going on.

A small warning
A user might decide to return sooner or later to a website for reasons unrelated to previous visits (being on holiday, for example). It is important to have a large sample of interaction data to detect coherent signals and to take systematic effects into account.

Take away message

Absence time as a measure of user engagement is easy to interpret and less ambiguous than many commonly employed metrics. Use it and get new insights from it.

This work was done in collaboration with Georges Dupret. More details about the study can be found in Absence time and user engagement: Evaluating Ranking Functions, which was published at the 6th ACM International Conference on Web Search and Data Mining in Rome, 2013.

Photo credits: tanfelisa and kaniths (Creative Commons BY).

We need a taxonomy of web user engagement

There are lots and lots of metrics that can be used to assess how users engage with a website. Ones widely used by the web-analytics community are click-through rates, number of page views, time spent on a website, how often users return to a site, and number of users.


Although these metrics cannot explicitly explain why users engage with a site, they can act as proxies for online user engagement: two million users accessing a website daily is a strong indication of high engagement with that site.

Metrics, metrics and metrics

There are three main types of web-analytics metrics:

  • Popularity metrics measure how much a website is used (for example, by counting the total number of users on the site in a week). The higher the number, the more popular the website.
  • Activity metrics measure how a website is used when visited, for example, the average number of clicks per visit across all users.
  • Loyalty metrics are concerned with how often users return to a website. An example is the return rate, calculated as the average number of times users visited a website within a month.

Loyalty and popularity metrics can be calculated on a daily, weekly or monthly basis. Activity metrics are calculated at visit level.
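As a hedged sketch of how such metrics could be computed from a visit log (the record layout below is my own invention, not the one used in the study):

```python
from collections import defaultdict

def engagement_metrics(visits):
    """Monthly popularity, activity and loyalty metrics for one site.

    `visits` is a list of dicts: {"user": str, "day": int,
                                  "clicks": int, "dwell": float (minutes)}
    """
    users = {v["user"] for v in visits}
    # popularity: how much the site is used overall
    popularity = {"users": len(users),
                  "visits": len(visits),
                  "clicks": sum(v["clicks"] for v in visits)}
    # activity: what an average visit looks like
    n = len(visits)
    activity = {"clicks_per_visit": sum(v["clicks"] for v in visits) / n,
                "dwell_per_visit": sum(v["dwell"] for v in visits) / n}
    # loyalty: how often individual users come back
    days, counts = defaultdict(set), defaultdict(int)
    for v in visits:
        days[v["user"]].add(v["day"])
        counts[v["user"]] += 1
    loyalty = {"days_per_user": sum(len(d) for d in days.values()) / len(users),
               "visits_per_user": sum(counts.values()) / len(users)}
    return popularity, activity, loyalty
```

Note how popularity aggregates over the whole site, activity normalises per visit, and loyalty normalises per user, matching the three families above.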

So one would think that a highly engaging website is one with a high number of visits (very popular), where users spend lots of time and click often (lots of activity), and return frequently (high loyalty). But not all websites, whether popular or not, have both active and loyal users.

This does not mean that user engagement on such websites is lower; it is simply different.

What did we do?

We collected one month of browsing data from an anonymized sample of approximately 2M users. For 80 websites, encompassing a diverse set of services such as news, weather, movies and mail, we calculated the average values of the following eight metrics:

  • Popularity metrics: number of distinct users, number of visits, and number of clicks (also called page views) for that month.
  • Activity metrics: average number of page views per visit and average time per visit (also called dwell time).
  • Loyalty metrics: number of days a user visited the site, number of times a user visited the site, and average time a user spent on the site, for that month.

Websites differ widely in terms of their engagement

Some websites are very popular (for example, news portals) whereas others are visited by small groups of users (many specific-interest websites are like this). Visit activity also depends on the website. For instance, search sites tend to have a much shorter dwell time than sites related to entertainment (where people play games). Loyalty differed per website as well. Media (news, magazines) and communication (messenger, mail) sites have many users returning to them much more regularly than sites containing information of temporary interest (an e-commerce site selling cars, say). Loyalty is also influenced by the frequency with which new content is published. Indeed, some sites produce new content only once per week.

High popularity did not entail high activity. Many sites have many users spending little time on them. A good example is a search site, where users come, submit a query, get the result and, if satisfied, leave the site.

This results in a low dwell time even though user expectations were entirely met.

The same holds for a Q&A site, or a weather site. What matters for such sites is their popularity.

Any patterns? Yes … 

To identify engagement patterns, we grouped the 80 sites by applying clustering approaches to the eight engagement metrics. For each group, we then extracted which metrics, and which values of them (whether high or low), were specific to that group. This process generated five groups with clear engagement patterns, and a sixth group with none:

  • Sites whose main characteristic was their high popularity (for example, as measured by a high number of users). Examples of sites following this pattern include media sites providing daily news, and search sites. Users interact with these sites in various ways; what they have in common is that they are used by many users.
  • Sites with low popularity, for instance having a low number of visits. Many interest-specific sites followed this pattern. Those sites center around niche topics or services, which do not attract a large number of users.
  • Sites with a high number of clicks per visit. This pattern was followed by e-commerce and configuration (accessed by users to update their profiles for example) sites, where the main activity is to click.
  • Sites with high dwell time, low clicks per visit and low loyalty. This pattern was followed by domain-specific media sites of a periodic nature (new content published on a weekly basis), which are therefore not accessed often. When accessed, however, users spend more time consuming their content. The design of such sites (compared to mainstream media sites) leads to this type of engagement: new content is typically published on the homepage, so users are not enticed to reach additional content (if any).
  • Sites with high loyalty, small dwell time and few clicks. This pattern was followed by navigational sites (the front page of an Internet company), whose role is to direct users to interesting content or services on other sites (of that same company); what matters is that users come to them regularly.
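The grouping step described above can be sketched as standardising the eight metrics and clustering the sites. Below is a minimal, self-contained version (the study's exact clustering approach and parameters may differ):

```python
import random

def zscore(rows):
    """Standardise each metric (column) to zero mean and unit variance."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [(sum((x - m) ** 2 for x in c) / len(c)) ** 0.5 or 1.0
            for c, m in zip(cols, means)]
    return [[(x - m) / s for x, m, s in zip(r, means, stds)] for r in rows]

def kmeans(rows, k, iters=100, seed=0):
    """Plain k-means; returns one cluster label per row (site)."""
    rng = random.Random(seed)
    centres = rng.sample(rows, k)
    for _ in range(iters):
        # assign each site to its nearest centre (squared Euclidean distance)
        labels = [min(range(k),
                      key=lambda j: sum((a - b) ** 2
                                        for a, b in zip(r, centres[j])))
                  for r in rows]
        # move each centre to the mean of its members
        for j in range(k):
            members = [r for r, l in zip(rows, labels) if l == j]
            if members:
                centres[j] = [sum(c) / len(c) for c in zip(*members)]
    return labels
```

With 80 sites and 8 metrics, one would call `kmeans(zscore(metric_rows), k)` for a few values of k and then inspect which metrics distinguish each resulting group.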

This simple study (80 sites and 8 metrics) identified several patterns of user engagement.

However, sites of the same type do not necessarily follow the same engagement pattern.

For instance, not all mainstream media sites followed the first pattern (high popularity). It is likely that, among other factors, the structure of the site has an effect.

… So what now?

We must study many more sites and include many more engagement metrics. This is the only way to build, if we want to, and we should, a taxonomy of web user engagement. With a taxonomy, we will know the best metrics to measure engagement on a given site.

Counting clicks may be totally useless for some sites. But if not, and the number of clicks is, for instance, way too low, knowing which engagement pattern a site follows helps in making the appropriate changes to the site.

This work was done in collaboration with Janette Lehmann, Elad Yom-Tov and Georges Dupret. More details about the study can be found in Models of User Engagement, a paper presented at the 20th conference on User Modeling, Adaptation, and Personalization (UMAP), 2012.

Photo credits: Denis Vrublevski and matt hutchinson (Creative Commons BY).