Engineering features of ad quality

This is the second blog post about our WWW 2016 paper on the pre-click quality of native advertisements. This work is in collaboration with Ke (Adam) Zhou, Miriam Redi and Andy Haines. A big thank you to Miriam Redi for providing examples of visual features.

In a previous post, I reported on a study that led to important insights into how users perceive the quality of native ads. In online services, native advertising is a very popular form of online advertising where the ads served reproduce the look-and-feel of the service in which they appear. Due to the low variability in terms of ad formats in native advertising, the content and the presentation of the ad are important factors contributing to the quality of the ad. The most important factors are, in order of importance:

Aesthetic appeal > Product, Brand, Trustworthiness > Clarity > Layout

Based on this study, we designed a large set of features to characterize these factors. We derived the features from the ad copy (text and image), and the advertiser properties. The text of an ad is made of a title and a description. Most ads have an image. An overview of the features together with their mapping to the reasons is shown below.

Reasons            Features
Brand              Brand (domain pagerank, search term popularity)
Product/Service    Content (topic category, adult detector, image objects)
Trustworthiness    Psychology (sentiment, psychological incentives)
                   Content Coherence (similarity between title and description)
                   Language Style (formality, punctuation, superlative)
                   Language Usage (spam, hate speech, click-bait)
Clarity            Readability (RIX index, number of complex words)
Layout             Readability (number of sentences, words)
                   Image Composition (presence of objects, symmetry)
Aesthetic appeal   Colors (H, S, V, Contrast, Pleasure)
                   Textures (GLCM properties)
                   Photographic Quality (JPEG quality, sharpness)

Clarity

The clarity of the ad reflects the ease with which the ad text (the title or the description) can be understood by a reader. We measure clarity with several readability metrics (Flesch’s reading ease test, Flesch-Kincaid grade level, Gunning fog index, Coleman-Liau index, Laesbarheds index and RIX index). These metrics are defined using low-level text statistics, such as the number of words, the percentage of complex words, the number of sentences, the number of acronyms, the number of capitalized words and the number of syllables per word. We also retain these low-level statistics as features.
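
As a minimal sketch of how two of these metrics can be computed from low-level statistics, the snippet below implements the RIX index (long words per sentence) and the Laesbarheds index, commonly abbreviated LIX (average sentence length plus the percentage of long words). The tokenizer, the function names and the sample ad title are assumptions made for illustration, not the pipeline used in the paper.

```python
import re

def text_stats(text):
    """Return (sentences, words, long words) for a piece of ad text."""
    # Crude sentence splitter: break on ., ! and ? and drop empty fragments.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Crude tokenizer: runs of ASCII letters only.
    words = re.findall(r"[A-Za-z]+", text)
    # LIX and RIX treat a word with more than six letters as "long".
    long_words = [w for w in words if len(w) > 6]
    return len(sentences), len(words), len(long_words)

def rix(text):
    """RIX index: number of long words divided by number of sentences."""
    n_sent, _, n_long = text_stats(text)
    return n_long / n_sent if n_sent else 0.0

def lix(text):
    """LIX index: words per sentence plus percentage of long words."""
    n_sent, n_words, n_long = text_stats(text)
    if n_sent == 0 or n_words == 0:
        return 0.0
    return n_words / n_sent + 100.0 * n_long / n_words

# Hypothetical ad title, used only to exercise the functions.
ad_title = "Discover amazing holiday deals. Book now!"
print(rix(ad_title))  # 1.5
print(lix(ad_title))  # 53.0
```

Higher values of either index indicate text that is harder to read.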

Trustworthiness

Trustworthiness is the extent to which users perceive the ad as reliable. We represent this factor by analyzing different psychological reactions that users might experience when reading the ad text.

  • Sentiment Incentives. Sentiment analysis tools automatically detect the overall contextual polarity of a text. To determine the polarity (positive, negative) of the ad sentiment, we analyze the ad title and description with SentiStrength, an open source sentiment analysis tool. We report two values, the probabilities (on a 5-scale grade) of the text sentiment being positive and negative, respectively.
  • Psychological Incentives. The words used in the ad copy could have different psychological effects on users. To capture these, we resort to the LIWC 2007 dictionary, which associates psychological attributes to common words. We look at words categorized as social (e.g. talk, daughter, friend), affective (e.g. happy, worried, love, nasty), cognitive (e.g. think, because, should), perceptual (e.g. observe, listen, feel), biological (e.g. eat, flu, dish), personal concerns (words related to work, leisure, money) and finally relativity (words related to motion, space and time). For both the ad title and the description, we retain the frequency of the words that the LIWC dictionary associates with each of these 7 categories.
  • Content Coherence. The consistency between ad title and ad description may also affect the ad trustworthiness. We capture this by calculating the similarity between the words in the ad title and ad description.
  • Language Style. We analyze the degree of formality of the language in the ad, using a linguistic formality measure that weights different types of words, with nouns, adjectives, articles and prepositions as positive elements, and adverbs, verbs and interjections as negative ones. We also include low-level features, such as the frequency of punctuation, numbers, “5W1H” words (What, Where, Why, When, Who, How), and superlative adjectives or adverbs.
  • Language Usage. We parse the text using the Yahoo content analysis platform (CAP). From CAP, we get two scores. The spam score reflects the likelihood that a text is spam. The hate speech score captures the extent to which any speech may suggest violence, intimidation or prejudicial action against/by a protected individual or group. We also extract a feature telling us whether the ad title is click-bait, and we retain the frequency counts of words relating to slang and profanity.
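
Two of the text features above are simple enough to sketch in a few lines. The post does not say which similarity measure was used for content coherence, so the snippet below assumes Jaccard overlap between the word sets of the title and the description. It also counts a few of the low-level style features (punctuation, numbers, 5W1H words, and whether the title starts with a number). All function names and the sample ad are hypothetical.

```python
import re
import string

FIVE_W_ONE_H = {"what", "where", "why", "when", "who", "how"}

def tokens(text):
    """Lowercased alphanumeric tokens (a deliberately simple tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def jaccard_coherence(title, description):
    """Word overlap between title and description, from 0 (none) to 1 (same)."""
    a, b = set(tokens(title)), set(tokens(description))
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def style_counts(text):
    """A few low-level language style counts for one piece of ad text."""
    toks = tokens(text)
    return {
        "punctuation": sum(ch in string.punctuation for ch in text),
        "numbers": sum(t.isdigit() for t in toks),
        "five_w_one_h": sum(t in FIVE_W_ONE_H for t in toks),
        "starts_with_number": bool(re.match(r"\s*\d", text)),
    }

# Hypothetical ad copy, for illustration only.
title = "10 reasons why you should travel now!"
description = "Find out why now is the best time to travel."
print(round(jaccard_coherence(title, description), 3))  # 0.214
print(style_counts(title))
```

A real pipeline would likely remove stop words or use weighted term vectors before comparing the two texts; the set-based version here only shows the idea.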

Product/Service

Although quality is independent of relevance, some categories of ads might be considered lower quality (offensive) than others, and some features may be more important for certain types of product/service.

  • Text. To capture the product or service provided by the ad, we use Yahoo text-based classifier (YCT) that computes, given a text, a set of category scores (e.g. sports, entertainment) according to a topic taxonomy (only top-level categories). In addition, we calculate the adult score, as extracted from CAP, that suggests whether the product advertised is related to adult-related services such as dating websites.
  • Image. To understand the content of the ad from a visual perspective, we tag the ad image with image classifiers, which automatically recognize the objects depicted in a picture (e.g. a person, a flower). For each of the detectable objects, the classifiers output a confidence score corresponding to the probability that the object is represented in the image. Since tag scores are very sparse (an image shows few objects), we group semantically similar tags into topically-coherent tag clusters (e.g. dog, cat will fall in the animal cluster). Examples of clusters include “plants”, “animals”. We also run an adult image detector, and retain the output confidence score as an indicator of the adultness of the ad creative.
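
The grouping of sparse tag scores into clusters can be sketched as follows. The tag-to-cluster mapping is a hypothetical stand-in for the real taxonomy, and taking the maximum confidence within a cluster is an assumption: the post does not state how the scores were aggregated.

```python
from collections import defaultdict

# Hypothetical mapping from detectable objects to topical clusters.
TAG_TO_CLUSTER = {
    "dog": "animals",
    "cat": "animals",
    "flower": "plants",
    "tree": "plants",
    "person": "people",
}

def cluster_scores(tag_scores, tag_to_cluster=TAG_TO_CLUSTER):
    """Collapse per-tag confidences into per-cluster scores.

    Assumption: a cluster's score is the highest confidence among its tags.
    Tags without a known cluster are ignored.
    """
    clusters = defaultdict(float)
    for tag, score in tag_scores.items():
        cluster = tag_to_cluster.get(tag)
        if cluster is not None:
            clusters[cluster] = max(clusters[cluster], score)
    return dict(clusters)

# Toy classifier output for one ad image.
print(cluster_scores({"dog": 0.8, "cat": 0.3, "flower": 0.1, "car": 0.9}))
# {'animals': 0.8, 'plants': 0.1}
```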

We also extract deep-learning based features from the ad images. At this stage, it is not easy to interpret the visual attributes they are linked to (in terms of recommendations), thus we do not describe them. In the graph shown later, the top-50 discriminative deep-learning features are nonetheless shown for completeness (they are referred to as CNNObjectxxx).

Layout

  • Text. Since the ad format of the native ads served on a given platform is fixed, we capture the textual layout of the ad creative by looking at the length of the ad text (e.g. number of sentences or words).
  • Image. To quantify the composition of the ad image, we analyze the spatial layout in the scene using compositional visual features inspired by computational aesthetics research, a branch of computer vision that studies ways to automatically predict the beauty of images and videos. We compute a Symmetry descriptor based on the gradient difference between the left half of the image and its flipped right half. We then analyze whether the image follows the photographic Rule of Thirds, according to which important compositional elements of the picture should lie on four ideal lines (two horizontal and two vertical) that divide it into nine equal parts, using saliency distribution counts to detect the Object Presence. Finally, we look at the Depth of Field, which measures the range of distances from the observer that appear acceptably sharp in a scene, using wavelet coefficients. We also include an image text detector to capture whether the image contains text.
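
To make the Symmetry descriptor more concrete, here is a minimal sketch under simplifying assumptions: the image is a grayscale grid given as a list of rows, the gradient is the absolute difference between horizontally adjacent pixels, and the score is the mean absolute difference between the gradients of the left half and those of the mirrored right half. The exact descriptor used in the paper may differ; the function names are hypothetical.

```python
def horizontal_gradient(gray):
    """Absolute difference between horizontally adjacent pixels."""
    return [
        [abs(row[x + 1] - row[x]) for x in range(len(row) - 1)]
        for row in gray
    ]

def symmetry_score(gray):
    """Mean gradient difference between left half and mirrored right half.

    0.0 means the gradients are perfectly left-right symmetric;
    larger values mean a less symmetric image.
    """
    grad = horizontal_gradient(gray)
    half = len(grad[0]) // 2
    total, count = 0.0, 0
    for row in grad:
        left = row[:half]
        right_flipped = row[::-1][:half]  # mirror the row, keep its first half
        for l, r in zip(left, right_flipped):
            total += abs(l - r)
            count += 1
    return total / count if count else 0.0

# Toy images: the first is mirror-symmetric, the second is not.
print(symmetry_score([[1, 2, 3, 2, 1], [4, 5, 6, 5, 4]]))  # 0.0
print(symmetry_score([[0, 0, 0, 5, 9]]))                   # 4.5
```

With a real image one would typically use an array library and a proper gradient filter; plain lists are used here only to keep the sketch dependency-free.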

Aesthetic Appeal

To explore the contribution of visual aesthetics for ad quality, we also resort to computational aesthetics. We extract a total of 43 compositional features from the ad images.

  • Color. Color patterns are important cues to understand the aesthetic value of a picture. We first compute a luminance-based Contrast metric, which reflects the distinguishability of the image colors. We then extract the average Hue, Saturation, Brightness (H, S, V), by averaging HSV channels of the whole image and HSV values of the inner image quadrant. We then linearly combine average Saturation (S̄) and Brightness (V̄) values, and obtain three indicators of emotional responses, Pleasure, Arousal and Dominance. In addition, we quantize the HSV values into 12 Hue bins, 5 Saturation bins, and 3 Brightness bins and collect the pixel occurrences in the HSV Itten Color Histograms. Finally, we compute Itten Color Contrasts as the standard deviation of H, S and V Itten Color Histograms.

  • Texture. To describe the overall complexity and homogeneity of the image texture, we extract Haralick’s features from the Gray-Level Co-occurrence Matrices, namely Entropy, Energy, Homogeneity and Contrast.
  • Photographic Quality. These features describe the image quality and integrity. High-quality photographs are images where the degradation due to image post-processing or registration is not highly perceivable. To determine the perceived image degradation, we can use a set of simple image metrics originally designed for computational aesthetics, independent of the composition, the content, or its artistic value. These are:

Contrast Balance: We compute the contrast balance by taking the distance between the original image and its contrast-equalized version.

Exposure Balance: To capture over/under exposure, we compute the luminance histogram skewness.

JPEG Quality: When too strong, JPEG compression can cause disturbing blockiness effects. We compute here the objective quality measure for JPEG images.

JPEG Blockiness: This detects the amount of ‘blockiness’ based on the difference between the image and its compressed version at low quality factor.

Sharpness: We detect the image sharpness by aggregating the edge strength after applying horizontal and vertical Sobel masks (the Tenengrad method).

Foreground Sharpness: We compute the Sharpness metric on salient image zones only.
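
Two of the photographic quality metrics above can be sketched directly from their descriptions. The snippet assumes a grayscale image given as a list of rows of luminance values. Exposure balance is computed as the skewness of the luminance distribution, and sharpness as the mean squared Sobel gradient magnitude over interior pixels (one common form of the Tenengrad measure). How the paper aggregates edge strength is not stated, so the mean used here is an assumption, and the function names are hypothetical.

```python
def luminance_skewness(gray):
    """Skewness of the luminance values.

    Positive when the image is mostly dark with a few bright pixels,
    negative in the opposite case.
    """
    pixels = [p for row in gray for p in row]
    n = len(pixels)
    mean = sum(pixels) / n
    var = sum((p - mean) ** 2 for p in pixels) / n
    if var == 0:
        return 0.0
    third_moment = sum((p - mean) ** 3 for p in pixels) / n
    return third_moment / var ** 1.5

SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]

def tenengrad(gray):
    """Mean squared Sobel gradient magnitude over interior pixels."""
    height, width = len(gray), len(gray[0])
    total, count = 0.0, 0
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = gy = 0.0
            for dy in range(3):
                for dx in range(3):
                    p = gray[y + dy - 1][x + dx - 1]
                    gx += SOBEL_X[dy][dx] * p
                    gy += SOBEL_Y[dy][dx] * p
            total += gx * gx + gy * gy
            count += 1
    return total / count if count else 0.0

# Toy 4x4 images: a hard vertical edge versus a gradual ramp.
sharp = [[0, 0, 255, 255] for _ in range(4)]
soft = [[0, 85, 170, 255] for _ in range(4)]
print(tenengrad(sharp) > tenengrad(soft))        # True
print(luminance_skewness([[10, 10, 10, 200]]) > 0)  # True
```

Foreground Sharpness would apply the same measure to salient regions only, which requires a saliency detector that is outside the scope of this sketch.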

Brand

We hypothesize that the intrinsic properties of the advertiser (such as the brand) have an effect on the user perception of ad quality. We use two features: domain pagerank and search volume. The domain pagerank is the pagerank score of the advertiser domain for a given ad landing page. An ad site with a high pagerank is one that is linked to by many sites, hence reflecting a known brand. The search volume reflects the raw search volume of the advertiser within Yahoo search logs. This represents the overall popularity of the advertiser and its product/service.

Feature importance

We focus on how the features listed above characterize the quality of an ad. To monitor ad quality, we exploit the information provided by the Yahoo ad feedback tool, namely ad offensiveness. Many Internet companies have put in place ad feedback mechanisms, which give users the possibility to provide negative feedback on the ads served. With the Yahoo ad feedback tool, a user can choose to hide an ad, and further select one of the following options as the reason for doing so: (a) It is offensive to me; (b) I keep seeing this; (c) It is not relevant to me; (d) Something else.

Among the many reasons why users may want to hide ads, marking one ad as offensive seems the most explicit indication of the quality of the ad.

We analyze the extent to which each feature individually correlates with offensive feedback rate (OFR). This is the number of times an ad has been marked as offensive divided by the number of times the ad was impressed. We report the correlation between each feature and OFR in the figure below (top-correlated features only). A value of 1 means perfectly correlated, -1 means perfectly inversely correlated, and 0 means no correlation at all.

Correlation between ad copy features and offensive feedback rate
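
The OFR definition and the per-feature correlation can be sketched as below. The post does not name the correlation coefficient, so Pearson correlation is assumed here, and the numbers are toy values invented purely to exercise the code, not data from the study.

```python
import math

def offensive_feedback_rate(offensive_marks, impressions):
    """OFR: times marked as offensive divided by times impressed."""
    return offensive_marks / impressions if impressions else 0.0

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        return 0.0  # a constant feature carries no correlation signal
    return cov / (spread_x * spread_y)

# Toy data: (offensive marks, impressions) and one feature value per ad.
feedback = [(1, 1000), (3, 1000), (6, 1000), (10, 1000)]
negative_sentiment = [0.1, 0.4, 0.7, 0.9]

ofr = [offensive_feedback_rate(marks, shown) for marks, shown in feedback]
print(pearson(negative_sentiment, ofr))  # close to 1 for this toy data
```

Running the same computation for every feature and sorting by absolute value gives a ranking of the kind shown in the figure.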

Some features, such as “Negative sentiment (Title)”, are moderately positively correlated with OFR, whereas others, such as “Image JPEG Quality”, are moderately negatively correlated. This means that ad titles with highly negative sentiment are more likely to be marked as offensive. Examples of such ads include those with the words “hate” or “ugly” in them. Ad images with low JPEG quality are also more likely to be marked as offensive.

Overall we can see that features, such as

  • visual features (e.g. JPEG compression artifacts, reflecting that high quality images are important),
  • text features (e.g. whether the title contains negative sentiments, suggesting that although negative sentiments may attract clicks, they also offend users), and
  • advertiser features (e.g. brand as measured with ad domain pagerank, indicating that unknown brands are more likely to be marked as offensive by users)

correlate with OFR. Interestingly, an ad title starting with a number is likely to belong to an offensive ad. Through manual inspection, we found that many offensive ads’ titles indeed tend to start with numbers, for example “10 most hated…”. Ad copy with lower image JPEG quality is often marked as offensive. Ad copy with less formal language and negative sentiment in the title is also often marked as offensive by users.

We used these features to build a pre-click quality prediction model, which aims to predict which ads will be marked as offensive by many users. Our model reaches an AUC of 0.77. We also deployed a model based on a subset of the features listed above on Yahoo news streams, which reduced the ad offensive feedback rate by 17.6% on mobile and 8.7% on desktop.
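
For readers less familiar with AUC, it can be read as the probability that a randomly chosen offensive ad receives a higher model score than a randomly chosen non-offensive ad. The sketch below computes it by direct pairwise comparison on toy labels and scores; these values are invented for illustration and are unrelated to the 0.77 reported above.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: 1 for ads marked offensive, 0 otherwise.
    scores: model scores, higher meaning more likely offensive.
    Ties between a positive and a negative count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(
        (p > n) + 0.5 * (p == n) for p in positives for n in negatives
    )
    return wins / (len(positives) * len(negatives))

# Toy example: 2 offensive ads, 3 non-offensive ads.
toy_labels = [1, 1, 0, 0, 0]
toy_scores = [0.9, 0.4, 0.5, 0.3, 0.1]
print(round(auc(toy_labels, toy_scores), 3))  # 0.833
```

The pairwise form is quadratic in the number of ads, so it only suits small samples; a rank-based computation gives the same value more efficiently.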

What makes an ad preferred by users?

This is the first blog post on a paper that will be presented at WWW 2016 [1], on our work on advertising quality. The focus of this work was the pre-click quality of native advertisements. This work is in collaboration with Ke (Adam) Zhou, Miriam Redi and Andy Haines.

In online services, native advertising has become a very popular form of online advertising, where the ads served reproduce the look-and-feel of the platform in which they appear. Examples of native ads include suggested posts on Facebook, promoted tweets on Twitter, or sponsored contents on Yahoo news stream. On the right, we show an example of a native ad (the second item with the “dollar” sign) in a news stream on a mobile device.

Promoting relevant and quality ads to users is crucial to maximize long-term user engagement with the platform. In particular, low-quality advertising has been shown to have a detrimental effect on long-term user engagement. Low-quality advertising can have even more severe consequences in the context of native advertising, since native advertisement forms an integrated part of the user experience of the product. For example, poor post-click quality (quantified by short dwell time on the ad landing page) in native ads can result in weaker long-term engagement (e.g. fewer clicks).

Here we focus on the pre-click experience, which is concerned with the user experience induced by the ad creative before the user decides (or not) to click.

The ad creative is the ad impression shown within the stream, and includes text, visuals, and layout. Due to the low variability in terms of ad formats in native advertising, the content and the presentation of the ad creative are extremely important to determine the quality of the ad.

Our first step was to understand ad quality from a user perspective, and infer the underlying criteria that users assess when choosing between ads.

To this end, we designed a crowd-sourcing study to spot what drives users’ quality preferences in the native advertising domain.

We extracted a sample of ads impressed on Yahoo mobile news stream. To ensure diversity and the representativeness of our data in terms of subjects and quality ranges, we uniformly sampled a subset of those ads from (1) different click-through rate quantiles; and (2) five different popular topical categories: “travel”, “automotive”, “personal finance”, “education” and “beauty and personal care”.

We used Amazon Mechanical Turk to conduct our study. We showed users pairs of native ads, and asked them to indicate which ad they prefer, and the underlying reasons for their choice. To eliminate the effect of ad relevance, we presented the users with topically-coherent ads (e.g., ads from the same subject category, such as “beauty”), assuming that, for example, when users are comparing two beauty ads, the preference depends mostly on the ad quality.

Once users chose their preferred ad, we asked them the reasons why they chose the selected ad. To define such options, we resorted to existing user experience/perception research literature. We were inspired by the UES (User Engagement Scale) framework, an evaluation scale for user experience capturing a wide range of hedonic and cognitive aspects of perception, such as aesthetic appeal, novelty, involvement, focused attention, perceived usability, and endurability. Moreover, previous studies in the context of native advertising investigated user perceptions of native ads with dimensions such as “annoying”, “design”, “trust” and “familiar brand”. Similarly, researchers have studied the amount of ad “annoyingness” in the context of display advertising, showing that users tend to relate ad annoyance with factors such as advertiser reputation, ad aesthetic composition and ad logic. Based on these, we provided users with the following options as underlying reasons of their choice:

  • the brand displayed
  • the product/service offered
  • the trustworthiness
  • the clarity of the description
  • the layout
  • the aesthetic appeal

Users were asked to rate each on a five-grade scale, from 1 (strongly disagree) to 5 (strongly agree), or NA (not available).

We report in the following table the percentage of judgements that, for each factor, were assigned grades 4 or 5 (the user highly agrees that this factor affects his or her ad preference choice).

Underlying reasons of users’ preference of ad pairs based on ad creative.
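
The percentage reported for each factor is a simple share of high grades. A minimal sketch follows, assuming NA answers are represented as None and excluded from the denominator; the post does not say how NA was handled, and the ratings are toy values.

```python
def share_agreeing(ratings, threshold=4):
    """Percentage of judgements graded at or above `threshold`.

    Assumption: NA answers (None) are left out of the denominator.
    """
    graded = [r for r in ratings if r is not None]
    if not graded:
        return 0.0
    return 100.0 * sum(r >= threshold for r in graded) / len(graded)

# Toy ratings for one factor: two of the four graded answers are 4 or 5.
print(share_agreeing([5, 4, 3, None, 2]))  # 50.0
```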

The most important factors are, in order of importance:

Aesthetic appeal > Product, Brand, Trustworthiness > Clarity > Layout

where “>” represents a significant increase (in ad preferences). Further tests showed that, apart from the brand factor, there were no significant differences across ad categories. This suggests that the factors affecting user preferences generalize across ad categories.

However, for different ad categories, compared to the general pattern, we observe a few small differences. Aesthetic appeal is more important for Automotive, Beauty and Education than for Personal Finance and Travel. For the Travel category, where most ad images were indeed beautiful, aesthetics did not affect preferences much compared to other categories. For the Beauty and Education categories, the product advertised was the most important factor (other than aesthetic appeal) affecting user choices; for Automotive, the brand was crucial. For the Personal Finance category, the clarity of the description had a big impact on the user perception of the quality of the ad.

This study provided us with important insights into how users perceive the quality of native ads. In a future blog post, I will discuss how we map these insights to engineered features, which we then use to predict the pre-click experience.

Promoting Positive Post-Click Experience for Native Advertising

Since September 2013, I have been working on user engagement in the context of native advertising. This blog post describes our first paper on this work, published at the Industry Track of ACM Knowledge Discovery & Data Mining (KDD) conference in 2015 [1]. This is work in collaboration with Janette Lehmann, Guy Shaked, Fabrizio Silvestri and Gabriele Tolomei.

Feed-based layouts, or streams, are becoming an increasingly common layout in many applications, and a predominant interface in mobile applications. In-stream advertising has emerged as a popular form of online advertising because it offers a user experience that fits nicely with that of the stream, and is often referred to as native advertising. In-stream or native ads have an appearance similar to that of the items in the stream, but are clearly marked with a “Sponsored” label or a currency symbol (e.g. “$”) to indicate that they are in fact adverts.

A user decides if he or she is interested in the ad content by looking at its creative. If the user clicks on the creative, he or she is redirected to the ad landing page, which is either a web page specifically created for that ad, or the advertiser homepage. The way a user experiences the landing page, the ad post-click experience, is particularly important in the context of native ads because the creatives have mostly the same look and feel, and what differs mostly is their landing pages. The quality of the landing page will affect the ad post-click experience.

A positive experience increases the probability of users “converting” (e.g., purchasing an item, registering to a mailing list, or simply spending time on the site building an affinity with the brand). A positive post-click experience does not necessarily mean a conversion, as there may be many reasons why conversion does not happen, independent of the quality of the ad landing page. A more appropriate proxy of the post-click experience is the time a user spends on the ad site before returning to the publisher site:

“the longer the time, the more likely the experience was positive”

The two most common measures used to quantify time spent on a site are dwell time and bounce rate. Dwell time is the time between a user clicking on an ad creative and returning to the stream; bounce rate is the percentage of “short clicks” (clicks with dwell time less than a given threshold). On a random sample of native ads served on a mobile stream, we showed that these measures were indeed good proxies of post-click experience.
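
The two measures can be computed from click logs as sketched below. The log format (pairs of click and return timestamps in seconds) and the 5-second threshold are assumptions chosen for illustration; the post does not give the threshold that was used.

```python
def dwell_times(clicks):
    """Dwell time per click: seconds between the click and the return to the stream.

    clicks: list of (click_timestamp, return_timestamp) pairs, in seconds.
    """
    return [returned - clicked for clicked, returned in clicks]

def bounce_rate(dwells, threshold_seconds):
    """Percentage of clicks whose dwell time is below the threshold."""
    if not dwells:
        return 0.0
    short_clicks = sum(d < threshold_seconds for d in dwells)
    return 100.0 * short_clicks / len(dwells)

# Toy click log for one ad.
toy_clicks = [(0, 3), (10, 70), (100, 104), (200, 260)]
toy_dwells = dwell_times(toy_clicks)
print(toy_dwells)                                    # [3, 60, 4, 60]
print(bounce_rate(toy_dwells, threshold_seconds=5))  # 50.0
```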

We also saw that users clicking on ads promoting a positive post-click experience, i.e. a low bounce rate, were more likely to click on ads in the future, and their long-term engagement was positively affected.

Focusing on mobile, we found that a positive ad post-click experience was not just about serving ads with mobile-optimised landing pages; other aspects of a landing page affect the post-click experience. We therefore put forward a learning approach that analyses ad landing pages, and showed how these can predict dwell time and bounce rate. We experimented with three types of landing page features, related to the actual content and organization of the ad landing page, the similarity between the creative and the landing page, and ad past performance. The latter type was best at predicting dwell time and bounce rate, but content and organization features performed well, and have the advantage of being applicable to all ads, not only to those that have been served.

Finally, we deployed our prediction model for ad quality based on dwell time on Yahoo Gemini, a unified ad marketplace for mobile search and native advertising, and validated its performance on the mobile news stream app running on iOS. Analyzing one month of data through A/B testing, we found that returning high-quality ads, as measured in terms of the ad post-click experience, not only increased click-through rates by 18% but also had a positive effect on users: an increase in dwell time (+30%) and a decrease in bounce rate (-6.7%).

This work has progressed in two ways. We have improved the prediction model using survival random forests and considered new landing page features, such as text readability and the page structure [2]. We are also working with advertisers to help improve the quality of their landing pages. More about this in the near future.