Most blogs and news websites provide some sort of syndication feed, usually RSS or Atom. This lets users add the feed URL to their favourite aggregator and stay updated with new posts as they arrive. This post shows how to find the URL to subscribe to for a given site, and how to get multiple URLs if the site provides multiple formats.
A .NET library
One easy way to do this in .NET is to use the Argotic Syndication Framework library. The code will look like:
Here is what you get when you run the code against wordpress.com:
A few notes on this approach:
- You may have noticed the `Where` check in the code. The library seems to capture any `related` link in the HTML, not just syndication links, which is why we need to filter them explicitly.
- Quite often, when a main site has different branches, you get more than one feed link. For CNN, for example, you get different feeds for certain site languages; for wordpress.com you get one for the site itself and another for members of the service. Arguably, this is not always what you want when you add the site to a reader kind of application.
- As expected, this code is quite slow in debug mode: it takes about 1.5 seconds to run on its own! In release mode (build configuration set to “Release” and so is web.config)
In its simplest form, syndication discovery is a matter of finding a `link` tag with a proper `rel` attribute (typically set to `alternate`) and a `type` attribute holding the feed's MIME type (for example `application/rss+xml` or `application/atom+xml`). In real life, however, at least historically, there used to be many variations in the way discovery was implemented (read the next section for more).
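To make the simplest case concrete, here is a minimal sketch in Python using only the standard library. It is not the Argotic code from this post, and it only covers the plain `link` tag convention, none of the historical variations. The names `FeedLinkParser`, `discover_feeds` and the sample markup are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# MIME types treated as syndication feeds in this sketch (an assumption;
# real pages may advertise other types).
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class FeedLinkParser(HTMLParser):
    """Collects <link rel="alternate" type="<feed MIME type>"> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        # Attribute names arrive lowercased; values may be None.
        attr = {name: (value or "") for name, value in attrs}
        rels = attr.get("rel", "").lower().split()
        mime = attr.get("type", "").strip().lower()
        href = attr.get("href", "").strip()
        # This is the explicit filter: other <link> tags (stylesheets,
        # icons, ...) are skipped.
        if "alternate" in rels and mime in FEED_TYPES and href:
            self.feeds.append({
                "title": attr.get("title", ""),
                "type": mime,
                # Resolve relative hrefs against the page URL.
                "url": urljoin(self.base_url, href),
            })


def discover_feeds(html, base_url):
    """Return every feed advertised in the given HTML, in document order."""
    parser = FeedLinkParser(base_url)
    parser.feed(html)
    return parser.feeds


# Hypothetical page markup, for illustration only.
SAMPLE = """<html><head>
<link rel="stylesheet" type="text/css" href="/style.css">
<link rel="alternate" type="application/rss+xml" title="Posts" href="/feed/">
<link rel="alternate" type="application/atom+xml" title="Posts (Atom)"
      href="https://example.com/atom.xml" />
</head><body></body></html>"""

feeds = discover_feeds(SAMPLE, "https://example.com/blog/")
for feed in feeds:
    print(feed["type"], feed["url"])
```

A site that offers both RSS and Atom shows up as two entries, so the caller still has to decide which one to subscribe to.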
One of the services that managed to get the right URLs for different edge cases was Google Reader. Apart from Google Reader itself, whose closing was part of the reason I wrote this post, Google lets you use their systems to get the right syndication feed URL of a given page by simply calling a public JSONP API.
This is the structure of the `data` parameter returned by the previous call:
To learn more about this specific API, check:
To learn what’s special about JSONP requests and why jQuery needs to treat them differently, read the $.getJSON() documentation.
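For readers who have not met JSONP before: the server wraps the JSON payload in a call to a function whose name the client chose, so a browser can load it through a `script` tag. The sketch below, in Python, shows what that wrapping looks like from a non-browser client and how to peel it off. The callback name and the payload are hypothetical and do not reproduce the actual response of the Google API.

```python
import json
import re

# callbackName( <json> ) with an optional trailing semicolon.
_JSONP_RE = re.compile(r"^\s*([A-Za-z_$][\w$.]*)\s*\((.*)\)\s*;?\s*$", re.DOTALL)


def unwrap_jsonp(body):
    """Split a JSONP response body into (callback name, decoded payload)."""
    match = _JSONP_RE.match(body)
    if match is None:
        raise ValueError("not a JSONP response")
    callback, payload = match.groups()
    return callback, json.loads(payload)


# Hypothetical response body, for illustration only.
callback, data = unwrap_jsonp('handleFeed({"url": "https://example.com/feed/"});')
print(callback, data["url"])
```

In the browser, the equivalent of `unwrap_jsonp` is simply the script executing and calling your callback, which is why the request cannot be a regular XMLHttpRequest.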
Background Of The Problem
Even though many people now rely on links shared on social media sites (by their peers, or by the creators of the feed), more and more product and subscription websites keep adding a syndication feed, especially since it’s easy to automate social media posts from the feed afterwards.
As Google Reader will be retired in July, I thought for a minute about what it’d take to put some web-based reader together. This was before I learned about the awesome existing alternatives like Feedly and so many others.
Then I remembered an application I was working on in 2007. One feature we needed, which a colleague worked on, was getting RSS posts from the personal blogs of the site members. I remember seeing him write all sorts of crazy Regular Expression matches for so many formats to get the URL. It turns out that, at least at that time, different blog providers used different ways to advertise the feed URL in the blog homepage markup. There were so many cases that it took my colleague several days to cover a large set of test cases from the different providers we knew our users were using.
I wanted to see whether this problem was still a thing in 2013 and what options we had. Hence this post, you know, just for fun :).
Hope some of you were interested in this too!