Websites across the internet embed structured content into their unstructured markup for a variety of reasons, such as:
At the highest level, all of these goals serve to increase machine understanding of unstructured content. Naturally, this is of great interest for web scraping!
In the context of web standards, structured data is any data that is highly organized and follows a predefined format or schema that a machine can read. Unstructured data, on the other hand, follows no such predefined format or schema, and is therefore very difficult for a machine to organize, parse, and interpret. In general, text, audio, video, and markup are all unstructured data, whereas JSON or XML with an associated schema is structured data.
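To make the distinction concrete, here is a minimal sketch that reads the same fact from a structured record and from a free-text sentence. The field names (`name`, `bakeMinutes`) and the sample sentence are invented for illustration and do not come from any particular standard; the point is only that the structured record can be read by key, while the prose needs guesswork.

```python
import json
import re

# Hypothetical structured record: the field names are illustrative assumptions.
structured = '{"name": "Chocolate Chip Cookies", "bakeMinutes": 10}'

# The same fact, written as unstructured prose.
unstructured = "Bake for about ten minutes, or until the edges are golden."

# Structured: the value is addressed directly by its key.
record = json.loads(structured)
bake_minutes = record["bakeMinutes"]

# Unstructured: a naive pattern looks for digits followed by "minutes".
# It finds nothing here, because the number is spelled out as "ten".
match = re.search(r"(\d+)\s+minutes", unstructured)
guessed_minutes = int(match.group(1)) if match else None

print(bake_minutes)
print(guessed_minutes)
```

A more elaborate pattern could handle spelled-out numbers, but every new phrasing would need another special case, which is exactly the burden a predefined schema removes.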
Note that whether data is structured has no bearing whatsoever on its interest or worth; it describes only how the data is encoded. Shakespeare's sonnets may be unstructured data, but they're still Shakespeare!
Representing information in a well-documented, well-understood -- dare I say, standardized -- format allows one party to encode data so that third parties can understand it specifically and unambiguously. While this may not be obviously or intuitively valuable for a reserved seat at a concert, hopefully it is clear why it is valuable for other types of things, such as movies, songs, news articles, or books.
Imagine if every chocolate chip cookie recipe were available on the web as structured data. One could find the ideal chocolate chip cookie recipe algorithmically by simply scraping the recipes' webpages and analyzing their ingredients versus their ratings!
More broadly, making data machine-readable makes it much easier to extract, transform, and load.
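The cookie idea above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a definitive implementation: it assumes each page embeds its recipe as schema.org JSON-LD inside a `<script type="application/ld+json">` tag (one common convention for embedding structured data in markup), and it uses two hand-written pages in place of real scraped HTML. The names `JsonLdExtractor`, `extract_recipes`, and `rating` are hypothetical helpers defined here.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collects the parsed contents of <script type="application/ld+json"> tags."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            self.items.append(json.loads("".join(self._buffer)))


def extract_recipes(html):
    """Return every embedded JSON-LD object whose @type is Recipe."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return [item for item in parser.items
            if isinstance(item, dict) and item.get("@type") == "Recipe"]


def rating(recipe):
    """Read the aggregate rating as a number (assumes the field is present)."""
    return float(recipe["aggregateRating"]["ratingValue"])


# Hypothetical pages standing in for scraped HTML.
PAGE_A = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Recipe",
 "name": "Crispy Chocolate Chip Cookies",
 "recipeIngredient": ["2 cups flour", "1 cup chocolate chips"],
 "aggregateRating": {"@type": "AggregateRating", "ratingValue": "4.2"}}
</script>
</head><body><h1>Crispy Chocolate Chip Cookies</h1></body></html>
"""

PAGE_B = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Recipe",
 "name": "Chewy Chocolate Chip Cookies",
 "recipeIngredient": ["2 cups flour", "2 cups chocolate chips"],
 "aggregateRating": {"@type": "AggregateRating", "ratingValue": "4.8"}}
</script>
</head><body><h1>Chewy Chocolate Chip Cookies</h1></body></html>
"""

recipes = extract_recipes(PAGE_A) + extract_recipes(PAGE_B)
best = max(recipes, key=rating)
print(best["name"], best["recipeIngredient"])
```

Because each field has a known name and meaning, the extract step is a dictionary lookup rather than a guess about page layout. Real pages are messier: a page may hold several JSON-LD blocks, a list of objects, or no rating at all, so a production scraper would need to handle those cases.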
Now that it's clear what structured data is, we can start to explore it in more depth. This blog series will cover the various web standards that underpin the modern semantic web, how they relate to the task of web scraping, and how Arachnio puts them to work to simplify web scraping.
We'll see you in the first article, the semantic schema!