Content Extractor/Data Scraper

129 votes

Using regex patterns, XPath or CSS/jQuery selectors, you will be able to configure Sitebulb to scrape text or HTML from each URL it audits. This data will then be presented in its own section of the audit report as a URL list with the extracted data, and can be exported to Excel for further analysis.
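
To illustrate the general idea (this is not Sitebulb's implementation), here is a minimal Python sketch of regex-based extraction that produces one row per URL. The `PAGES` data, the `RULES` names and the `extract` / `to_csv` helpers are all hypothetical, and a real extractor would more likely use XPath or CSS selectors through an HTML parser, since regex is fragile on messy markup.

```python
import csv
import io
import re

# Hypothetical crawl results: URL -> raw HTML (stand-in for pages a crawler fetched).
PAGES = {
    "https://example.com/a": "<html><h1>Blue Widget</h1><span class='comments'>12 comments</span></html>",
    "https://example.com/b": "<html><h1>Red Widget</h1></html>",
}

# Hypothetical extraction rules: column name -> regex with exactly one capture group.
RULES = {
    "h1": re.compile(r"<h1[^>]*>(.*?)</h1>", re.IGNORECASE | re.DOTALL),
    "comment_count": re.compile(r"(\d+)\s+comments", re.IGNORECASE),
}


def extract(pages, rules):
    """Apply every rule to every page; return one dict per URL."""
    rows = []
    for url, html in pages.items():
        row = {"url": url}
        for name, pattern in rules.items():
            match = pattern.search(html)
            # An empty string marks "no match" so every row has the same columns.
            row[name] = match.group(1).strip() if match else ""
        rows.append(row)
    return rows


def to_csv(rows, fieldnames):
    """Serialise the rows as CSV text, which a spreadsheet application can open."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


rows = extract(PAGES, RULES)
print(to_csv(rows, ["url", "h1", "comment_count"]))
```

Each rule becomes a column, so the result has the "URL list + data" shape described above, and pages where a rule finds nothing simply get an empty cell.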

Done Suggested by: Moderator (03 Oct, '17) • Upvoted: 17 Sep, '20 • Comments: 10

03 Oct, '17
Tom W
Highly interesting feature (if it's accurate). N.B. scraping text cleanly is not trivial. (I believe Diffbot is the market leader and charges about 0.1c/page)
03 Oct, '17
Zdenek Dvorak [linki.cz]
I really miss that from Screaming Frog. I can tell what kinds of URLs an e-commerce site has (scrape unique text or content grouping), whether content is popular (scrape number of comments), assign authors, and much more.
06 Oct, '17
Michael Field
import.io is a pretty cool service for this. Whilst I don't anticipate the point-and-click features, some of its elements could be pulled through on a crawl.
03 Nov, '17
Simon Cox
Good idea but I believe this distracts from the original purpose of Sitebulb. There are a lot more things time could be well-spent on before tackling this.
29 Nov, '17
Tim Wolfe
This should be a Must-Do feature. I use custom extractions in DeepCrawl & Screaming Frog for everything from schema markup to list counts and page copy.
05 May, '18
Walid
For me this shouldn't be optional, it's essential. Crawling a website helps to get data, and then insights. To get valuable insights, the data must be segmented. Take page types on e-commerce websites: we have product pages, categories, informational, manufacturer, institutional pages and so on. A custom extractor helps flag all URLs so I can understand where they fit. Once that's done, the data is much more insightful, and it can then help me with log analysis, keyword analysis, technical website analysis...
25 May, '18
gareth
Having to decide on one crawler, SF or SB: this is a pretty hard feature not to have. Heart wants SB, brain is making me take SF this budget cycle.
04 Jul, '18
Laurie Turnbull Merged
Identify pages based on a footprint (e.g. video embed code)
09 Jul, '18
Admin
"Custom extraction based on Footprint" (suggested by Laurie Turnbull on 2018-07-04), including upvotes (1) and comments (0), was merged into this suggestion.
31 Oct, '18
Eric
This could fit under the "crawl map advanced features" request, but...
Something that could help paint a picture of keyword density. I was thinking that something as rudimentary as a word cloud could be extremely useful as a starting point in many use cases.