Understanding how news, blog, and social media behavior is changing in real time is critical to understanding and reaching savvy digital audiences. However, even basic web crawling and data extraction operations like link unwinding and text extraction are complex and hard to get right. The arachn.io API puts this data directly in your hands.
Parse URLs and hostnames into their component parts to detect outlinks, deduplicate shares, and more.
Convert short links, like go.nasa.gov/3QIXfBy, into their fully-unwound, canonical representations.
Extract article body content and structured metadata from public webpages, not just full-page HTML.
The arachn.io API uses proprietary implementations of core web standards like JSON-LD, the OpenGraph Protocol, and Schema.org Structured Microdata combined with cutting-edge artificial intelligence techniques to create structured data from web addresses and public webpages.
Of course! Anyone can use the Free Forever Plan to try out the API. It includes all of the API's core endpoints, including unwind and extract.
URL and hostname parsing is an important part of undirected web crawling and content analysis.
It's easy for websites to link back to themselves or to other websites, but harder to get other websites to link back to them.
The arachn.io API allows code to distinguish between internal and external link types and detect valuable outlinks.
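To make the internal versus external distinction concrete, here is a minimal sketch using only Python's standard library. It is not the arachn.io implementation: the function names are hypothetical, and comparing hostnames after stripping a leading "www." is a crude assumption. A production version would more likely compare registered domains using the Public Suffix List.

```python
from urllib.parse import urljoin, urlsplit


def normalized_host(url: str) -> str:
    """Lowercased hostname with a leading "www." removed (a crude heuristic)."""
    host = (urlsplit(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def classify_links(page_url: str, hrefs: list[str]) -> dict[str, list[str]]:
    """Split the hrefs found on page_url into internal links and outlinks."""
    page_host = normalized_host(page_url)
    result: dict[str, list[str]] = {"internal": [], "external": []}
    for href in hrefs:
        absolute = urljoin(page_url, href)  # resolve relative links
        if urlsplit(absolute).scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, and similar schemes
        kind = "internal" if normalized_host(absolute) == page_host else "external"
        result[kind].append(absolute)
    return result
```

Under this heuristic, a link from www.example.com to example.com counts as internal, while a link to go.nasa.gov counts as an outlink.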
Many of the most valuable links online today, particularly those embedded in social media, use so-called "link shorteners" like bit.ly and t.co that hide the real target of a link.
The arachn.io API unwinds links to reveal the link's actual target, and the target's canonical form if possible.
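The core of unwinding can be sketched as following a chain of redirects until it ends. The example below is an illustration under stated assumptions, not the arachn.io implementation: `unwind` and `next_hop` are hypothetical names, and `next_hop` stands in for whatever issues the HTTP request and returns the redirect target (or `None` when there is none). The redirect table is made up so the sketch runs without network access, and canonicalization of the final URL is out of scope.

```python
from typing import Callable, Optional
from urllib.parse import urljoin


def unwind(url: str, next_hop: Callable[[str], Optional[str]], max_hops: int = 10) -> str:
    """Follow redirects from url until none remain, a loop appears, or max_hops is hit."""
    seen = {url}
    for _ in range(max_hops):
        target = next_hop(url)
        if target is None:
            break  # no further redirect: url is the final target
        target = urljoin(url, target)  # redirect targets may be relative
        if target in seen:
            break  # redirect loop: stop at the last URL reached
        seen.add(target)
        url = target
    return url


# Made-up redirect table standing in for real HTTP responses.
REDIRECTS = {
    "https://short.example/abc": "https://example.com/r?id=1",
    "https://example.com/r?id=1": "/articles/launch",
}

final_url = unwind("https://short.example/abc", REDIRECTS.get)
```

The hop limit and the loop check matter in practice, because a misconfigured shortener can redirect indefinitely.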
Most of the HTML on a webpage is boilerplate rather than content. For example, the navigation bar is typically exactly the same on every page of a website!
On most webpages, especially articles, the page's "body" is the important content, but there is no widely adopted standard for marking where that body begins and ends.
The arachn.io API uses proprietary algorithms and cutting-edge artificial intelligence to find, extract, and structure this valuable content.
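The proprietary body-extraction algorithms are not described here, but the structured-metadata side can be illustrated. The sketch below assumes only Python's standard library and reads two of the standards named earlier, OpenGraph `<meta>` tags and JSON-LD `<script>` blocks. The class and function names are hypothetical, and this is not how the arachn.io API is implemented.

```python
import json
from html.parser import HTMLParser


class MetadataParser(HTMLParser):
    """Collects OpenGraph <meta> properties and JSON-LD <script> payloads."""

    def __init__(self) -> None:
        super().__init__()
        self.open_graph: dict[str, str] = {}
        self.json_ld: list = []
        self._in_json_ld = False
        self._buffer: list[str] = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta" and (attr.get("property") or "").startswith("og:"):
            self.open_graph[attr["property"]] = attr.get("content") or ""
        elif tag == "script" and attr.get("type") == "application/ld+json":
            self._in_json_ld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_json_ld:
            self._buffer.append(data)  # script text may arrive in chunks

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_ld:
            self._in_json_ld = False
            try:
                self.json_ld.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # ignore malformed JSON-LD blocks


def extract_metadata(html: str) -> dict:
    """Return the OpenGraph properties and JSON-LD objects found in html."""
    parser = MetadataParser()
    parser.feed(html)
    parser.close()
    return {"open_graph": parser.open_graph, "json_ld": parser.json_ld}
```

Navigation bars and other repeated markup are ignored by this parser because it only looks at the two metadata sources. Finding the article body itself is a separate and harder problem.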