Web Scraper That Can Bypass Distil Protection For theknot.com

Budget: $194 per month

Posted: 6 years ago

Opened

Description: I need an experienced web scraping specialist who is well versed in methods or scraping architectures for getting past the Distil anti-bot protection on theknot.com.

The goal is to scrape basic data (listed below) for all the wedding venues in the United States listed on theknot.com.

Starting with this page:

https://www.theknot.com/marketplace/wedding-reception-venues?redirectToCity=false

From there, go through every state listed at the bottom of that page, and then every city listed at the bottom of each state's page. For each city, cycle through all the result pages and capture the URL of every venue listed.
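
The traversal order described above (states, then cities, then result pages, then venue URLs) can be sketched roughly as below. This is only an outline of the loop structure, not a working scraper: the four callables (`list_states`, `list_cities`, `list_result_pages`, `list_venue_urls`) are hypothetical placeholders for whatever page-fetching and parsing code ends up being used, and nothing here touches the network.

```python
from typing import Callable, Iterable, List

# A "lister" takes one page URL and returns the URLs found on it.
Lister = Callable[[str], Iterable[str]]


def collect_venue_urls(
    start_url: str,
    list_states: Lister,        # hypothetical: state links at the bottom of the start page
    list_cities: Lister,        # hypothetical: city links at the bottom of a state page
    list_result_pages: Lister,  # hypothetical: all paginated result pages for a city
    list_venue_urls: Lister,    # hypothetical: venue links on one result page
) -> List[str]:
    """Walk state -> city -> result page and gather every venue URL seen.

    Duplicates are kept on purpose; deduplication is a separate step.
    """
    collected: List[str] = []
    for state_url in list_states(start_url):
        for city_url in list_cities(state_url):
            for page_url in list_result_pages(city_url):
                collected.extend(list_venue_urls(page_url))
    return collected
```

Passing the page-specific logic in as functions keeps the loop easy to test against small fake data before any real pages are involved.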

Once all the URLs are captured, deduplicate them, since there will be a lot of crossover between cities.
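
A minimal sketch of the deduplication step, assuming the same venue can show up with small URL differences (trailing slash, host casing, tracking query strings). That assumption, and the rule that the query string and fragment can be dropped safely, are guesses about the site rather than confirmed behavior; the example paths in the comment are made up.

```python
from typing import Iterable, List
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Reduce a URL to scheme + lowercase host + path without a trailing slash.

    Assumption: query strings and fragments do not identify a venue.
    """
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, "", ""))


def dedupe_urls(urls: Iterable[str]) -> List[str]:
    """Return normalized URLs with duplicates removed, keeping first-seen order."""
    seen = set()
    unique: List[str] = []
    for url in urls:
        key = normalize_url(url)
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


# Example with made-up venue paths:
# dedupe_urls([
#     "https://www.theknot.com/marketplace/venue-a/",
#     "https://WWW.theknot.com/marketplace/venue-a?x=1",
# ])
```

Keeping first-seen order makes the final list stable between runs, which helps when a long crawl has to be resumed.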

(I would just use a sitemap to find all the URLs instead of scraping, but it appears this site either doesn't have a marketplace sitemap or hides it very well.)

Once the final list of wedding venues is complete and deduplicated, go to each URL and scrape the following into a CSV:

• Domain (website) of the venue
• Address of venue
• Facebook URL
• Instagram URL
• Twitter URL
• Pinterest URL
• Guest Capacity
• Settings (a field under amenities)
• Phone Number
• Array of URLs of the images used in the slideshow
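
As a rough illustration of the CSV output, the fields listed above could be written one row per venue as sketched below. The column names are my own suggestion, not a required schema, and storing the slideshow array as a JSON string in a single cell is just one possible way to fit a list into a CSV.

```python
import csv
import io
import json
from typing import Dict, Iterable, TextIO

# Hypothetical column names, one per requested field.
FIELDS = [
    "domain", "address", "facebook_url", "instagram_url", "twitter_url",
    "pinterest_url", "guest_capacity", "settings", "phone_number",
    "slideshow_urls",
]


def write_venues_csv(venues: Iterable[Dict], out: TextIO) -> None:
    """Write one CSV row per venue; missing fields are left blank."""
    writer = csv.DictWriter(out, fieldnames=FIELDS, extrasaction="ignore")
    writer.writeheader()
    for venue in venues:
        row = {name: venue.get(name, "") for name in FIELDS}
        # Encode the list of slideshow URLs as JSON so it fits in one cell.
        row["slideshow_urls"] = json.dumps(venue.get("slideshow_urls", []))
        writer.writerow(row)


# Usage with an in-memory buffer (a real file should be opened with newline=""):
buffer = io.StringIO()
write_venues_csv(
    [{"domain": "example.com", "guest_capacity": "200",
      "slideshow_urls": ["https://example.com/1.jpg"]}],
    buffer,
)
```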

Skills:
architectural design, facebook, instagram, marketplace, pinterest, sitemap, software development, twitter, web, web scraping

Source: peopleperhour.com