Airbnb Engineering Blog Notes

Features for Airbnb

Role: Data Scientist, Product Analytics

Goal: Improve Airbnb’s Similar Listings

Underlying Real Life ML Model: Recommendation Systems

  • Related: YouTube recommended videos, Amazon recommended similar products

Current Features:

  • Price

Additional Features:

  • Style Tags (categorical; unordered): hip, modern, old,
    • Match with user demographics: Female, Age 25, Hometown Bay Area
  • Overpricedness (binary)
    • Residual of the price-prediction algorithm = how overpriced a listing is.
  • Requires security deposit (binary)
  • Amount of security deposit (continuous)
  • Number of house rules (categorical; ordered int)
  • Restrictive check-in and check-out times (continuous or binary)
  • Mood of home (categorical; unordered): oceanfront view, romantic, on the beach,
    • Match with Event: Honeymoon, Wedding, Birthday, Anniversary, Special Dates
  • Amenity Weights based on Location/Season: AC in the summer, Mosquito Nets in humid places
  • Neighborhood
    • Rank by Safety
    • If a point of interest is specified: rank by distance (convenience) to that point of interest.
    • If no point of interest is specified: rank by distance to nearby attractions (defined as attractions with > 50,000 reviews on Google Maps?) (continuous)
  • Host friendliness
  • Annoying vs. Chill Host (like Uber drivers that talk a lot vs. stay silent) (binary)
  • Human interaction required? Forced interaction with Host or Full Privacy (binary)
  • House Cleanliness
  • Guarantee-ability of quality, service, and standards of both 1) the host and 2) the house
    • “Airbnb Guarantee”: an independent Airbnb verification program that shows you a badge / “Airbnb trustedness score”. Offer insurance if things are not as advertised.
    • Comparable to a hotel’s stamp of approval; helps get over the “don’t want to book it if it has 0 reviews” fear.
  • Last-minute booking (binary)
    • Same idea as the free-cancellation toggle.
  • N-day cancellation policy of hosts
    • Same idea but in reverse: the host can cancel up to, say, 14 days before the trip; after that, the guest gets a full refund and the host takes a penalty.
  • Suitable for responsible fun? (i.e., without disturbing the host)
    • Come home late without worrying about disturbing owner?
    • Leave food out without worrying?
  • Comp similar listing price vs. Hotel price
    • 3 star place vs 3 star hotel

Similar Listings:

  • Same Filters
  • Distance from Original (Unavailable) Listing

How would you improve Airbnb:

  • Saved searches. Literally what I do for Booking.com, but they don’t have it.
  • Highlight Helpful Reviews
    • Mini ML classifier (1: helpful, 0: not helpful)
  • Keyword matching for Location/Neighborhood, not Host
  • More picture requirements. VR Walkthrough.
  • Offer Airbnb-funded discounts on places with 0 reviews to 1) encourage guests to stay and 2) get them to leave a review.
    • Amazon should do this as well with lesser-known products (n < 100 reviews).
  • User submitted pictures
  • Exponentially weight review averages.
    • For any review sites (Amazon, Yelp, Google Maps)
    • This prevents a listing with 100,000 reviews and 4.9 stars from cutting corners: a simple average barely moves when recent (bad) reviews come in, while an exponentially weighted average assigns them a higher weight (sketch below).
  • What does Instant Booking even mean? What are the terms associated with it? I want to confirm I’m using the correct info as opposed to 1 click pay.
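
A minimal sketch of the exponential-weighting idea above (the half-life, review schema, and numbers are made-up assumptions for illustration, not anything Airbnb publishes):

  from datetime import datetime, timezone

  def weighted_rating(reviews, half_life_days=180.0):
      # reviews: list of (stars, review_datetime) pairs; recent reviews weigh more.
      now = datetime.now(timezone.utc)
      num = den = 0.0
      for stars, ts in reviews:
          age_days = (now - ts).days
          weight = 0.5 ** (age_days / half_life_days)  # exponential decay with age
          num += weight * stars
          den += weight
      return num / den if den else None

  # A listing with a long 5-star history but a recent 2-star review:
  reviews = [(5.0, datetime(2014, 1, 1, tzinfo=timezone.utc)),
             (5.0, datetime(2014, 6, 1, tzinfo=timezone.utc)),
             (2.0, datetime(2016, 5, 1, tzinfo=timezone.utc))]
  print(weighted_rating(reviews))  # well below the simple average of 4.0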

Double Edged Sword:

  • Anonymous reviewers
    • ++True negative feedback is allowed
    • –Bad guests hiding behind a keyboard to leave unreasonably scathing reviews.
    • –Increase in fake reviews/spam?

At Airbnb, Data Science Belongs Everywhere

Empathy:

  • Required for success as a DS.
  • Data = the voice of our customers. Each data point is a decision made by a customer.
  • Therefore, empathy gives you a unique ability to reconstruct what the customer might have done to produce a given data point. A good data scientist is able to get into the minds of the people who use the product and understand their needs.
  • Up to the data scientist to present it well to Decision Makers.
    • When decision-makers don’t understand the ramifications of an insight, they don’t act on it. When they don’t act on it, the value of the insight is lost.

Embedded + Centralized DS Team

  • Embedded within other teams (to help lead decision making) + Centralized (within DS team, to learn from other DS on the same team)

Airbnb Pipeline

  1. EDA of problems. Aimed at sizing opportunities, and generating hypotheses that lead to actionable insights.
  2. Predictive Analytics: make a decision about what path to follow (where we expect to have the largest impact)
  3. A/B Testing: controlled experiments, run as operational market-based tests or in traditional online environments.
  4. Measure results of the experiment, identifying the causal impact of our efforts. Success = Roll out to entire customer base. Fail = Learn why it wasn’t successful and repeat the process.

Airbnb =  a two-sided marketplace with network effects, strong seasonality, infrequent transactions, and long time horizons.

Location Relevance at Airbnb

Iterating on Location Relevance.

Upon search of a city like SF:

Level 1. Return the highest-RANKed listings within a radius of the city center.

  • SELECT airbnb_name, airbnb_address,
  • ROW_NUMBER() OVER (PARTITION BY city ORDER BY reviews DESC) as airbnb_rank_by_city

Level 2. Return the highest-RANKed listings using an exponential demotion function of the distance between the search center and the listing location, applied on top of the listing’s quality score.

  • ORDER BY abs(listing_location - center_of_city)

Level 3. Return the highest-RANKed listings using a sigmoid demotion curve based on the distance between the center of the searched neighborhood and the listing location (demotion-curve sketch below).

  • ROW_NUMBER() OVER (PARTITION BY city, neighborhood
  • ORDER BY abs(listing_location - center_of_neighborhood) ASC, reviews DESC) as airbnb_rank_by_city
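
A quick sketch of what the Level 2 (exponential) and Level 3 (sigmoid) demotion curves could look like when applied on top of a quality score; the functional forms and constants are illustrative assumptions, not Airbnb's actual parameters:

  import math

  def exp_demotion(quality, dist_km, scale_km=5.0):
      # Level 2: score decays exponentially with distance from the search center.
      return quality * math.exp(-dist_km / scale_km)

  def sigmoid_demotion(quality, dist_km, cutoff_km=3.0, steepness=1.5):
      # Level 3: score stays roughly flat inside the neighborhood, then falls off.
      return quality / (1.0 + math.exp(steepness * (dist_km - cutoff_km)))

  for d in (0, 1, 3, 5, 10):
      print(d, round(exp_demotion(1.0, d), 3), round(sigmoid_demotion(1.0, d), 3))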

Level 4. Conditional probability of booking in a location, given where the person searched.

  • ROW_NUMBER() OVER (PARTITION BY city, neighborhood
  • ORDER BY book_rate_given_search DESC, reviews DESC) as airbnb_rank_by_city

Level 5. Normalize by # of listings by city.

  • ROW_NUMBER() OVER (PARTITION BY city, neighborhood
  • ORDER BY book_rate_given_search/num_listings_in_city DESC, reviews DESC) as airbnb_rank_by_city

Level 6. Add a conditional probability encoding the relationship between the city people booked in and the cities they searched to get there.

  • Add a vector of related cities, e.g. [Santa Cruz, Aptos, Capitola], with high similarity scores (sketch below).
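
A toy sketch of Levels 4 to 6: estimating P(booked city | searched city) from historical sessions, normalizing by inventory, and reading off a related-cities vector. The data and variable names are invented for illustration:

  from collections import Counter, defaultdict

  # Toy (searched_city, booked_city) pairs from historical sessions.
  history = [("Santa Cruz", "Santa Cruz"), ("Santa Cruz", "Aptos"),
             ("Santa Cruz", "Capitola"), ("Santa Cruz", "Santa Cruz"),
             ("San Francisco", "San Francisco")]
  num_listings = {"Santa Cruz": 800, "Aptos": 150, "Capitola": 120,
                  "San Francisco": 5000}

  searches = Counter(s for s, _ in history)
  pairs = Counter(history)

  def book_rate_given_search(search_city, booked_city):
      # Level 4: conditional probability of booking in a city given the search.
      return pairs[(search_city, booked_city)] / searches[search_city]

  def normalized_rate(search_city, booked_city):
      # Level 5: divide by inventory so big cities don't dominate the ranking.
      return book_rate_given_search(search_city, booked_city) / num_listings[booked_city]

  # Level 6: vector of cities related to a search, ranked by the normalized rate.
  related = defaultdict(list)
  for search_city, booked_city in pairs:
      related[search_city].append((booked_city, normalized_rate(search_city, booked_city)))
  print(sorted(related["Santa Cruz"], key=lambda x: -x[1]))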

Architecting a Machine Learning System for Risk

Underlying Real Life ML Model: Anomaly Detection

When a critical event (reservation is created) occurs in our system:

  1. Query the fraud prediction service for this event (1 = fraud, 0 = not fraud).
  2. This service calculates all the features for the “reservation creation” model ->
    1. E.g. time spent on the first page, the set-dates page, and the credit-card payment page
    2. Number of other listings looked at
    3. Known IP address?
  3. Sends these features to our Openscoring service -> 
  4. Openscoring returns a score -> 
  5. Decision is made based on a set threshold (e.g. score ≥ 75%) ->
  6. Fraud prediction service can then use this information to take action (allow or hold); a minimal sketch of this flow is below.
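
A minimal sketch of the decision step in this flow. The feature names, the scoring stub, and the 75% threshold are placeholders standing in for the real Openscoring call, not Airbnb's actual service API:

  KNOWN_IPS = {"203.0.113.7"}             # toy set of previously seen IPs

  def score_reservation(features):
      # Placeholder for the call to the Openscoring service; returns P(fraud).
      return 0.1

  FRAUD_THRESHOLD = 0.75                  # e.g. hold the reservation above 75% risk

  def handle_reservation_created(event):
      # Steps 1-2: calculate the features for the "reservation creation" model.
      features = {
          "secs_on_payment_page": event["secs_on_payment_page"],
          "n_listings_viewed": event["n_listings_viewed"],
          "known_ip": int(event["ip"] in KNOWN_IPS),
      }
      # Steps 3-4: send the features off for scoring and get a score back.
      score = score_reservation(features)
      # Steps 5-6: compare against the threshold and take action.
      return "hold" if score >= FRAUD_THRESHOLD else "allow"

  print(handle_reservation_created({"secs_on_payment_page": 12,
                                    "n_listings_viewed": 1, "ip": "198.51.100.9"}))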

Pipeline:

  1. Derive features from airbnb.com. Store them into a json.
  2. Transform features via binning, impute NAs, remove unimportant features (Lasso?)
  3. Train; score with cross-validation using precision-recall and ROC curves

Feature transformation, model building, model validation, deployment and testing are all carried out in a single script.
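
A hedged sketch of what such a single script could look like with scikit-learn (imputing NAs, binning, cross-validated precision/recall). This illustrates the pipeline shape only; it is not the actual Airbnb implementation, which serves models through Openscoring:

  import numpy as np
  from sklearn.ensemble import RandomForestClassifier
  from sklearn.impute import SimpleImputer
  from sklearn.model_selection import cross_val_score
  from sklearn.pipeline import make_pipeline
  from sklearn.preprocessing import KBinsDiscretizer

  # Toy feature matrix (rows = reservation-creation events) and fraud labels.
  rng = np.random.default_rng(0)
  X = rng.normal(size=(500, 4))
  X[rng.random(X.shape) < 0.05] = np.nan           # sprinkle in missing values
  y = (rng.random(500) < 0.05).astype(int)         # ~5% fraud labels

  pipeline = make_pipeline(
      SimpleImputer(strategy="median"),              # impute NAs
      KBinsDiscretizer(n_bins=5, encode="ordinal"),  # bin continuous features
      RandomForestClassifier(n_estimators=200, random_state=0),
  )

  # Cross-validated scoring; average precision summarizes the precision-recall curve.
  print(cross_val_score(pipeline, X, y, cv=5, scoring="average_precision").mean())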

Takeaway: Ground truth > features > algorithm choice

  • Bad precision/recall = the ground truth wasn’t accurate. (Not a “need more data” problem, not a random-forest problem.)
  • If your ground truth is inaccurate, you’ve already set an upper limit to how good your precision and recall can be. If your ground truth is grossly inaccurate, that upper limit is pretty low.
  • E.g. a model is trained on past fraud but cannot generalize to new types of fraud (the ground truth has moved).
  • Throw it all in HDFS, whether you need it now or not. In the future, you can always use this data to backfill new data stores if you find it useful. This can be invaluable in responding to a new attack vector.

Aerosolve: Machine learning for humans

Underlying Real Life ML Model: Dynamic Pricing Prediction (Regression)

Features: seasonality, unique features of a listing

Philosophy: Humans should partner with a machine in a symbiotic way that exceeds the capabilities of humans or machines alone.

Scaling Knowledge at Airbnb

Experiments at Airbnb

Underlying Real Life Concepts: Hypothesis Testing, Causal Inference, Experiment Design

Experiment = Simple way to make Causal Inference

Experiment Design Pitfalls

  1. Stopping an experiment too soon
  2. Bias on a marketplace level
  3. Failing to understand results in their full context
  4. Assuming the system works the way you think

Controlled experiments isolate the impact of the product change while controlling for the aforementioned external factors. The outside world often has a much larger effect on metrics than product changes do.

Things to control for during A/B Testing:

  • Day of week
  • Time of year (seasonality)
  • Weather
  • Mood
  • Did the user find Airbnb through an online ad or organically?
  • Users can browse without logging in or signing up, and people switch devices, making it harder to tie a user to their actions.
  • Bookings take a few days to confirm (host has to approve)

Factors outside of our control:

  • Successful bookings are dependent on available inventory and responsiveness of hosts

Define the Metric: Define what Conversion Rate is:

  • Main Metric = Between searching and booking = n_step4 / n_step1

Funnel

  1. A visitor has to make a search
  2. Searcher needs to contact a host about a listing. 
  3. Host has to accept an inquiry
  4. Guest has to actually book the place.
  5. “One-click booking”: a guest can instantly book some listings without contacting the host first, i.e., a booking request that goes straight to a booking. We look at the process of going through these four stages, as well as the overall conversion rate from search to booking.

P < 0.05 = significant.

Full A/B Testing Notebook

  • Don’t stop it after 7 days just because it looks significant; keep it running to see long-term effects (post-novelty effects).
  • Pattern of hitting “significance” early and then converging back to insignificant is actually quite common.
    • Early converters carry more weight at the beginning of the experiment (small n, since people take a long time to book).
    • Since the statistical test is a function of the sample size and the effect size, a large early effect size arising from natural variation can push the p-value below 0.05 early on. But the most important reason is that you are performing a statistical test every time you compute a p-value, and the more often you do it, the more likely you are to find an effect.
  • Even though insignificant, we found that certain users like the ability to search for high-end places and decided to accommodate them, given there was no dip in the metrics.

How long should experiments run for / How long to Run the A/B Test?

  1. Determine 3 variables with your team.
    1. Set the power (0.8) (probability of correctly rejecting a false null)
    2. Set the significance level (0.05, i.e., a 95% confidence level)
    3. Set the MDE (determined by management: what is the minimum “change” you want to detect? A 5% increase in conversion? A 1% increase in revenue?)
  2. Use the 3 variables above to calculate minimum sample size required.
    1. Required sample size per group ≈ 16 · SD² / MDE², where SD is the sample standard deviation of the metric (see the sketch after this list).
    2. The denominator should really be the true effect size (treatment mean minus control mean), but we don’t know it in advance, so we plug in the MDE for now.
  3. # days to run the experiment = required sample size / # of users entering the treatment group per day
    1. Do the same for the control group (required sample size / # of users entering control per day) and use the larger of the two.
  4. MAX(# of days to run experiment, 14 days)
    1. To capture weekly patterns.
    2. You don’t want to run an experiment too short: it will be underpowered and you risk a false negative (Type II error).
    3. You don’t want to run it too long either: you waste time and traffic, and repeated peeking at the results raises the risk of a false positive (Type I error).
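
A small sketch of this calculation, using the 16 · SD² / MDE² rule of thumb and the two-week floor; the traffic numbers are invented:

  import math

  def required_sample_size(sd, mde):
      # Rule of thumb for ~80% power at a 5% significance level: n per group.
      return math.ceil(16 * sd ** 2 / mde ** 2)

  def experiment_days(sd, mde, users_per_group_per_day, min_days=14):
      n = required_sample_size(sd, mde)
      return max(math.ceil(n / users_per_group_per_day), min_days)

  # Example: binomial metric, baseline conversion p = 0.10, MDE = 1 percentage point.
  p, mde = 0.10, 0.01
  sd = math.sqrt(p * (1 - p))                       # SD^2 = p(1 - p)
  print(required_sample_size(sd, mde))              # 14,400 users per group
  print(experiment_days(sd, mde, users_per_group_per_day=2000))  # raised to the 14-day minimum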

We often don’t have a good idea of the size, or even the direction, of the treatment effect.

Monitor metric over time, even if experiment is over.

Be skeptical of early significance. Use a stricter p-value threshold early on; as more data comes in, you can raise the threshold, since the probability of finding a false positive is much lower later in the experiment.

Different thresholds of required p-value for statistical significance vs. length of experiment

Don’t just look at the p-value at the aggregate level; GROUP BY segments and see if there are underlying effects:

However, now that you have multiple tests, you need a much stricter p-value threshold (0.05 / n total tests = per-test significance level, i.e., a Bonferroni correction).
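
For example, slicing the results into 10 market segments means each segment’s test needs p < 0.05 / 10 = 0.005 to count as significant.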

Do A/A tests to detect whether everything underlying is working correctly. Split users into treatment and control groups, with both groups seeing the exact same thing. If the results come back as significant, something in the underlying system is wrong.

Experiment Reporting Framework

Yaml file to Describe the experiment:

search_per_page:
  human_readable: Search results per page
  subject: visitor
  treatments:
    12_per_page:
      human_readable: 12 per page
    18_per_page:
      human_readable: 18 per page
    24_per_page:
      human_readable: 24 per page
  control: 18_per_page

deliver_experiment(
  "search_per_page",
  :12_per_page => lambda { <%= render "search_results", :results => 12 %> },
  :18_per_page => lambda { <%= render "search_results", :results => 18 %> },
  :24_per_page => lambda { <%= render "search_results", :results => 24 %> },
  :unknown     => lambda { <%= render "search_results", :results => 18 %> }
)

  • Use lambdas instead of if statements to signal that you’re running this as an experiment, so that other teams won’t go in and change your code.
  • Have an “unknown” case as a fail safe so that users will still see a regular page if something goes wrong.
  • Retain the results after the experiment is over.

How Not To Run an A/B Test

Base case: your conversion rate is 50%.

A/B Test: Test to see if a new logo gives you a conversion rate of more than 50% (or less). 

You stop the experiment as soon as there is 5% significance, or you call off the experiment after 150 observations. Now suppose your new logo actually does nothing. What percent of the time will your experiment wrongly find a significant result? 26.1%.
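
A rough simulation of that peeking behavior. This is a simplified one-sample version (each observed rate is tested against the known 50% baseline rather than against a control group), so the exact number differs from the 26.1% above, but it shows the same inflation:

  import random
  from statistics import NormalDist

  def peeking_false_positive_rate(n_max=150, p=0.5, trials=2000, alpha=0.05):
      z_crit = NormalDist().inv_cdf(1 - alpha / 2)
      false_positives = 0
      for _ in range(trials):
          conversions = 0
          for n in range(1, n_max + 1):
              conversions += random.random() < p       # the new logo truly does nothing
              se = (p * (1 - p) / n) ** 0.5
              z = (conversions / n - p) / se
              if n >= 10 and abs(z) > z_crit:          # peek after every observation...
                  false_positives += 1                 # ...and stop at "significance"
                  break
      return false_positives / trials

  print(peeking_false_positive_rate())  # far above the nominal 5%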

Decide on a sample size in advance and wait until the experiment is over to evaluate significance.

  • This is where the 16 · SD² / MDE² formula comes from.
  • SD² = p(1 - p) for a binomial proportion.
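  • Example: baseline conversion p = 0.5 → SD² = 0.25; for an MDE of 5 percentage points, n = 16 · 0.25 / 0.05² = 1,600 users per group.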