Websites across the internet embed structured content into their unstructured markup for a variety of reasons, such as:

  • To improve the quality and visibility of search results in search engines, which is essential to SEO
  • To take control of how shared content looks on social media, which is essential to building a brand online
  • To participate in the semantic web, which is becoming increasingly relevant as technologies like Google Assistant gain traction

At the highest level, all of these goals serve to increase machine understanding of unstructured content. Naturally, this is of great interest for web scraping!

What is Structured Data?

In the context of web standards, structured data is any data which is highly organized and follows a predefined format or schema that can be read by a machine. Unstructured data, on the other hand, follows no such predefined format or schema, and therefore is very difficult for a machine to organize, parse, and interpret. In general, text, audio, video, and markup are all unstructured data, whereas JSON or XML with an associated schema is structured data.
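To make the distinction concrete, here is a minimal sketch in Python. The movie record, its field names, and the sentence are all invented for illustration. The same fact is read once from a structured JSON string and once from free text.

```python
import json
import re

# Invented sample data: the same fact, encoded two ways.
structured = '{"title": "Example Movie", "year": 1999, "rating": 8.1}'
unstructured = "Example Movie came out in 1999 and reviewers gave it 8.1 out of 10."

# Structured: the format is known up front, so reading a field is a lookup.
record = json.loads(structured)
rating = record["rating"]

# Unstructured: we have to guess at the phrasing, and the guess breaks as
# soon as an author writes "8.1/10" or "four stars" instead.
match = re.search(r"(\d+(?:\.\d+)?) out of 10", unstructured)
guessed_rating = float(match.group(1)) if match else None

print(rating, guessed_rating)
```

Both approaches recover the rating here, but only the first keeps working when the wording changes.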

Note that whether data is structured has no bearing on its interest or worth -- only on its encoding. Shakespeare's sonnets may be unstructured data, but they're still Shakespeare!

Why is Structured Data Useful?

Representing information in a well-documented, well-understood -- dare I say, standardized -- format allows one party to encode data so that other parties can interpret it precisely and unambiguously. The value may not be obvious for something like a reserved seat at a concert, but hopefully it is clearer for other types of things, such as movies, songs, news articles, or books.

Imagine if every chocolate chip cookie recipe were available on the web as structured data. One could find the ideal chocolate chip cookie recipe algorithmically by simply scraping the recipes' webpages and analyzing their ingredients versus their ratings!
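As a rough sketch of the cookie idea, assume each recipe page embeds a schema.org Recipe object as JSON-LD in a script tag. That is one common convention, not something every site follows. The sample page, the `JsonLdExtractor` class, and the `extract_recipes` helper below are hypothetical names and data made up for this example, using only the Python standard library.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collects parsed contents of <script type="application/ld+json"> tags."""

    def __init__(self):
        super().__init__()
        self._in_json_ld = False
        self._buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_json_ld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_json_ld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_ld:
            self._in_json_ld = False
            try:
                self.items.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # skip malformed blocks rather than fail the whole page


def extract_recipes(html):
    """Return every top-level JSON-LD object on the page typed as a Recipe."""
    parser = JsonLdExtractor()
    parser.feed(html)
    return [
        item for item in parser.items
        if isinstance(item, dict) and item.get("@type") == "Recipe"
    ]


# Invented sample page: structured data sitting inside unstructured markup.
SAMPLE_PAGE = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Recipe", "name": "Example Cookies",
 "recipeIngredient": ["2 cups flour", "1 cup butter", "2 cups chocolate chips"],
 "aggregateRating": {"@type": "AggregateRating", "ratingValue": "4.7"}}
</script>
</head><body><p>A long story about cookies...</p></body></html>
"""

recipes = extract_recipes(SAMPLE_PAGE)
for recipe in recipes:
    rating = float(recipe["aggregateRating"]["ratingValue"])
    print(recipe["name"], rating, recipe["recipeIngredient"])
```

Real pages are messier: JSON-LD may be wrapped in a list or an `@graph` object, and ratings may be missing entirely, so a production scraper would need more defensive handling than this sketch shows.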

More broadly, making data machine-readable makes it much easier to extract, transform, and load.

A Series on Scraping Standards

Now that it's clear what structured data is, we can start to explore it in more depth. This blog series will cover the various web standards that underpin the modern semantic web, how they relate to the task of web scraping, and how Arachnio puts them to work to make web scraping simpler.

We'll see you in the first article, the semantic schema!