Our import pipeline is triggered by a timer and runs every 30 minutes.

Here is a high-level overview of each stage of the import pipeline:

  1. The next client to update is selected from the site; this acts like an infinite circular buffer.
  2. Gather source URLs from the client's website.
  3. Download and save the HTML from each source URL and add it to a processing queue.
  4. Read the source HTML from the processing queue, scrape the listing information (basic fields, description, images, and features), and add it to the suggestion queue.
  5. Read each listing from the suggestion queue and create data suggestions by parsing the listing's description, then add to the classification queue.
  6. Read each listing from the classification queue to decide if it is new, exists and requires updating, or should be removed from the site. The listing is then added to the appropriate import queue (new, update, remove).
  7. Read listings from each of the three import queues and process accordingly: Add new listings, update current listings, remove existing listings.
  8. Update the client's 'last imported' timestamp via our API (Application Programming Interface).

Now that we've seen the high-level overview of our pipeline, let's dive a bit deeper into what each stage does.

Stage 1: Select the next client from our infinite 'circular-buffer'

Our main website exposes what is known in the technical world as an API, which stands for Application Programming Interface. This interface allows us to request information from the main site from anywhere else. So in stage 1, we ask our API for the next client to be updated, and it sends back that client's information - in this case, the client with the oldest 'last-updated' timestamp.
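
As an illustration only, here is a minimal sketch in Python of what that request might look like; the endpoint path and response fields are assumptions for the example, not the real API:

import requests

API_BASE = "https://example.com/api"  # placeholder, not the real API address

def get_next_client():
    """Ask the API for the client with the oldest 'last-updated' timestamp."""
    response = requests.get(f"{API_BASE}/clients/next")  # hypothetical endpoint
    response.raise_for_status()
    # e.g. {"id": 42, "name": "Example Estates", "lastUpdated": "2019-01-01T10:00:00Z"}
    return response.json()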

Stages 2 & 3: Gather source URLs from the client's website and save the HTML for processing

This stage sends a crawler/bot to the target client's website to find the pages that represent active listings - it tries to exclude listings which have been marked as 'sold' or are incomplete. The whole process stops if this step fails.
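
To give a feel for what that crawler does, here is a rough sketch in Python; the URL pattern and 'sold' heuristic are assumptions for illustration - the real rules are specific to each client's site:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def gather_listing_urls(start_url):
    """Collect links to active listing pages, skipping anything marked as sold."""
    html = requests.get(start_url).text
    soup = BeautifulSoup(html, "html.parser")

    urls = []
    for link in soup.select("a[href]"):
        href = urljoin(start_url, link["href"])
        text = link.get_text(strip=True).lower()
        # Hypothetical heuristics: keep listing pages, drop ones flagged as sold.
        if "/listing/" in href and "sold" not in text:
            urls.append(href)
    return urls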

We then download each listing's source HTML, compare it with the last source we saved for that page (provided we have seen it before), and perform one of two actions:

  1. Ignore the current page if there are no changes between the current version and the previous version in order to avoid unnecessary work, or,
  2. Add the current page to the processing queue.

In practice, it turns out most listings are not amended very often, so this step heavily reduces the workload for the pipeline.
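
One straightforward way to implement that comparison is to hash each download and check it against the hash we stored last time; the sketch below assumes that approach and uses an in-memory store, whereas in practice the hashes would be persisted:

import hashlib

last_seen = {}  # hypothetical store of the last-seen hash per URL

def has_changed(url, html):
    """Return True if the page content differs from the version we saw last time."""
    digest = hashlib.sha256(html.encode("utf-8")).hexdigest()
    if last_seen.get(url) == digest:
        return False   # unchanged: ignore the page
    last_seen[url] = digest
    return True        # new or changed: add the page to the processing queue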

This bit of processing helps us avoid thousands of redundant operations per day. Some of our larger clients have in excess of four hundred listings, yet in the worst case they amend or add only ten listings to their site per day. We record which pages downloaded successfully and which didn't in order to fine-tune our crawler.

Stage 4: Scrape the HTML source file for the listing information we need

After the source HTML is read from the processing queue, our second crawler extracts the listing's information from the elements it is embedded in on the page. Below is an example of some contrived HTML we might expect to find:

<html>
    <body>
        <h1>2 bed apartment for sale in Las Americas</h1>
        <span>€150,000 - ABC123</span>
        <p>Lovely, fully furnished apartment on complex with pool.</p>
        <ul>
            <li>Sea views</li>
            <li>Fully furnished</li>
        </ul>
        <img src="/img/front-of-house.jpg" />
        <img src="/img/lounge.jpg" />
    </body>
</html>

Our crawler is largely general-purpose, but some parts are intrinsic to each client's website. We attempt to gather the following information:

  • Basic fields such as reference number, price, currency, bedrooms, bathrooms, etc.
  • Features ('Sea views' and 'Fully furnished' from our example HTML above)
  • The description (the text between the <p> tags)
  • The images (the .jpg files in the example above)
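
As a rough illustration of this stage, here is how the contrived HTML above could be scraped using Python and BeautifulSoup; the selectors are tied to that example markup, not to any real client site:

from bs4 import BeautifulSoup

def scrape_listing(html):
    """Pull the basic fields, description, features, and images out of the page."""
    soup = BeautifulSoup(html, "html.parser")

    title = soup.h1.get_text(strip=True)        # "2 bed apartment for sale in Las Americas"
    price_ref = soup.span.get_text(strip=True)  # "€150,000 - ABC123"
    price, ref = [part.strip() for part in price_ref.split("-", 1)]

    return {
        "title": title,
        "price": price,
        "ref": ref,
        "desc": soup.p.get_text(strip=True),
        "features": [li.get_text(strip=True) for li in soup.select("ul li")],
        "images": [img["src"] for img in soup.find_all("img")],
    }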

The information is then serialised into a lightweight format (JSON) and added to the suggestion queue. The final result will end up looking something like this:

{
    "type": "apartment",
    "location": "Las Americas",
    "status": "for-sale",
    "beds": 2,
    "price": 150000,
    "ref": "ABC123",
    "desc": "Lovely, fully furnished apartment on complex with pool.",
    "features": ["Sea views", "Fully furnished"],
    "images": ["/img/front-of-house.jpg", "/img/lounge.jpg"],
    "suggestions": ...
}

Stage 5: Generate suggestions based on the data we retrieved

Any suggestion this stage produces can safely be ignored; it exists only to help us build the most accurate database we possibly can.

Not every website includes all of the fields we require; for example, many agents only mention the complex in the listing description rather than anywhere else on the page.

This stage looks for missing information in the listing (0 bedrooms, 0 bathrooms, no complex specified, etc.) and then tries to fill in the gaps. To do this we:

  • Load the list of valid locations and complexes from The TPG website's API.
  • Extract the missing information from the listing description.

The resulting suggestions are attached to the listing, for example:

suggestions: [
    { "field": "baths", "suggestedValue": 1 },
    { "field": "pool", "suggestedValue": true },
    { "field": "complex", "suggestedValue": "Amarilla Golf" }
]
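
A simplified sketch of how suggestions like these might be generated from the description; the keyword rules here are assumptions for illustration, and the list of known complexes would come from the API call above:

def generate_suggestions(listing, known_complexes):
    """Suggest values for fields the scraper could not fill in from the page itself."""
    desc = listing["desc"].lower()
    suggestions = []

    # Hypothetical keyword rule for a missing field.
    if "pool" in desc:
        suggestions.append({"field": "pool", "suggestedValue": True})

    # Suggest a complex if one from the known list appears in the description.
    if not listing.get("complex"):
        for complex_name in known_complexes:
            if complex_name.lower() in desc:
                suggestions.append({"field": "complex", "suggestedValue": complex_name})
                break

    return suggestions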

Stage 6: Classify each listing into new, requires an update, or remove

By this stage in the pipeline, our listing is very close to making it to The TPG site and being displayed live to the world. Each listing read from the classification queue is compared against what we already have: it is either brand new, it already exists and requires updating, or it has disappeared from the client's site and should be removed. It is then added to the appropriate import queue (new, update, remove).
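
In broad strokes, that decision comes down to comparing the scraped reference against the listings already live on the site; a minimal sketch, assuming a hypothetical lookup of existing listings keyed by reference:

def classify(listing, existing_listings):
    """Decide whether a listing is new or needs an update.

    existing_listings is a hypothetical dict mapping reference -> live listing data.
    """
    current = existing_listings.get(listing["ref"])
    if current is None:
        return "new"        # goes to the 'new' import queue
    if current != listing:
        return "update"     # goes to the 'update' import queue
    return "unchanged"      # nothing to do

def find_removals(scraped_refs, existing_listings):
    """Live listings that no longer appear on the client's site should be removed."""
    return [ref for ref in existing_listings if ref not in scraped_refs]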

Stage 7: Process new listings, updates, and removals

We do not let listings onto our site if they cannot be validated for correctness; in that case, an exception is logged and an email is sent out with the details for us to fix.
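
As a sketch of the kind of validation gate this implies (the field names and rules below are illustrative assumptions, not our real checks):

def validate(listing):
    """Return a list of problems; an empty list means the listing can be imported."""
    problems = []
    if not listing.get("ref"):
        problems.append("missing reference")
    if not isinstance(listing.get("price"), (int, float)) or listing["price"] <= 0:
        problems.append("missing or invalid price")
    if not listing.get("images"):
        problems.append("no images")
    return problems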

Stage 8: Update metadata, handle data exceptions, and clean up

Finally, we update the client's 'last imported' timestamp via our API. As data moves through the pipeline, each stage also cleans up after itself, even in the event of failure.
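
A minimal sketch of that clean-up pattern, using try/finally so the work item is always released even when a stage throws; the work_item.cleanup() call is a hypothetical stand-in for deleting temporary files or completing the queue message:

import logging

logger = logging.getLogger("import-pipeline")

def run_stage(stage_name, work_item, handler):
    """Run one pipeline stage and always clean up, even if the handler fails."""
    try:
        handler(work_item)
    except Exception:
        # Log the failure and carry on with the next item.
        logger.exception("Stage %s failed for %s", stage_name, work_item)
    finally:
        work_item.cleanup()  # hypothetical: delete temp files, release the queue item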
