Remember, I launched Reddit Media: intelligent fun online last week (read how it was made)?
I have been getting emails suggesting that it would be a wise idea to launch a Digg media website. Yeah! Why not?
Since Digg already has a video section, there is not much point in duplicating it. The new site could just be digg for pictures.
I don’t want to use the word ‘digg’ in the domain name because people warned me that the trademark owner could take the domain away from me. I’ll just go with a single letter g, as in “dig”, and shorten “pictures” to “picz”. So the domain I bought is digpicz.com.
Update: The site has been launched, visit digpicz.com: digg’s missing picture section. Time taken to launch the site: ~7 hours.
I released the full source code of the reddit media website (reddit media website generator suite (.zip)). It can now be reused almost entirely, with minor modifications, to suit the digg for pictures website.
Only the following modifications need to be made:
That’s it! A few hours of work and we have a digg for pictures website running!
Let’s create the data miner first. As I mentioned, it’s called digg_extractor.pl, and it is a Perl script which uses the Digg public API.
First, we need to get familiar with the Digg API. Skimming over the Basic API Concepts page, we find just a few important points:
Next, to make our data miner get the stories, let’s look at the Summary of API Features. It mentions the List Stories endpoint, which “Fetches a list of stories from Digg.” This is exactly what we want!
We are interested only in stories that made it to the front page. The API documentation tells us to issue a GET /stories/popular request to http://services.digg.com.
I typed the following address into my web browser and got a nice XML response with the 10 latest stories:
http://services.digg.com/stories/popular?appkey=http%3A%2F%2Fdigpicz.com
The documentation also lists count and offset arguments, which control the number of stories to retrieve and the offset into the complete story list.
So the general algorithm is clear: start at offset=0, loop until we have gone through all the stories, parse each bucket of stories, and extract the stories with pics in them.
We want to use the simplest possible Perl library to parse XML. There is a great one on CPAN which is perfect for this job: XML::Simple. It provides an XMLin function which, given an XML string, returns a reference to a parsed hash data structure. Easy as 3.141592!
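To make the request format concrete, here is a minimal sketch of how such a URL could be put together in Perl. The endpoint, the appkey value, and the count and offset arguments come from the text above; the helper names uri_encode and build_url are made up for this illustration and are not taken from digg_extractor.pl.

```perl
use strict;
use warnings;

# Endpoint and application key as used in the browser request above.
my $API_URL = 'http://services.digg.com/stories/popular';
my $APP_KEY = 'http://digpicz.com';

# Percent-encode everything except RFC 3986 unreserved characters.
# (Hypothetical helper; URI::Escape from CPAN would do the same job.)
sub uri_encode {
    my ($str) = @_;
    $str =~ s/([^A-Za-z0-9\-._~])/sprintf('%%%02X', ord($1))/ge;
    return $str;
}

# Build the URL for one bucket of stories.
# build_url is an illustrative name, not part of any Digg library.
sub build_url {
    my ($count, $offset) = @_;
    return sprintf('%s?appkey=%s&count=%d&offset=%d',
        $API_URL, uri_encode($APP_KEY), $count, $offset);
}

print build_url(15, 0), "\n";
```

The appkey part comes out exactly as in the address above (http%3A%2F%2Fdigpicz.com), with count and offset appended as ordinary query arguments.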
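The loop described above can be sketched roughly as follows. This is not the code from digg_extractor.pl: the fetch step is replaced by a stand-in that serves stories from an in-memory list so the sketch runs without network access, and the picture test (a title that mentions pic, pics, picture or pictures) is only my assumption about what "stories with pics" could mean.

```perl
use strict;
use warnings;

# Walk the story list bucket by bucket and keep only picture stories.
# $fetch returns an array ref of stories for (count, offset);
# $is_picture decides whether a single story counts as a picture story.
sub collect_pictures {
    my ($fetch, $is_picture, $count) = @_;
    my @found;
    my $offset = 0;
    while (1) {
        my $bucket = $fetch->($count, $offset);
        last unless @$bucket;                 # empty bucket: no more stories
        push @found, grep { $is_picture->($_) } @$bucket;
        $offset += $count;                    # advance to the next bucket
    }
    return @found;
}

# Stand-in data source (made-up titles) instead of a real API call.
my @all = map { +{ title => $_ } } (
    'Funny cats [PICS]', 'Kernel released', 'Sunset pictures',
    'Tax news', 'Old cars [pic]',
);
my $fetch = sub {
    my ($count, $offset) = @_;
    return [] if $offset >= @all;
    my $last = $offset + $count - 1;
    $last = $#all if $last > $#all;
    return [ @all[$offset .. $last] ];
};

# Assumed heuristic: the title mentions pic/pics/picture/pictures.
my $is_picture = sub { $_[0]{title} =~ /\bpic(?:ture)?s?\b/i };

print "$_->{title}\n" for collect_pictures($fetch, $is_picture, 2);
```

Passing the fetch step in as a code reference keeps the paging logic separate from the HTTP and XML details, so the loop can be tried out offline.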
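Here is a small sketch of XMLin at work. The XML fragment is hypothetical: its element and attribute names are my guesses at a plausible response shape, not copied from the Digg documentation, so treat them as placeholders. The ForceArray and KeyAttr options are real XML::Simple options.

```perl
use strict;
use warnings;
use XML::Simple;

# Hypothetical response fragment; names are assumptions for illustration.
my $xml = <<'XML';
<stories count="2" offset="0">
  <story link="http://example.com/a" href="http://digg.com/x/a">
    <title>First story [PICS]</title>
    <description>Some pictures.</description>
  </story>
  <story link="http://example.com/b" href="http://digg.com/x/b">
    <title>Second story</title>
    <description>Plain news.</description>
  </story>
</stories>
XML

# ForceArray keeps 'story' an array even when only one story comes back;
# KeyAttr => [] turns off folding of arrays into hashes keyed by id/name/key.
my $data = XMLin($xml, ForceArray => ['story'], KeyAttr => []);

for my $story (@{ $data->{story} }) {
    print "$story->{title} -> $story->{link}\n";
}
```

Without ForceArray, a response containing a single story would come back as a hash instead of a one-element array, which is an easy way to break a loop like the one above.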
This script prints out picture stories which made it to the front page in human readable format. Each story is printed as a paragraph:
title: story title
type: story type
desc: story description
url: story url
digg_url: url to original story on digg
category: digg category of the story
short_category: short digg category name
user: name of the user who posted the story
user_pic: url to user pic
date: date story appeared on digg YYYY-MM-DD HH:MM:SS
<new line>
The script has one constant, ITEMS_PER_REQUEST, which defines how many stories (items) to get per API request. Currently it’s set to 15, which is the number of stories on one Digg page.
The script takes an optional argument which specifies how many requests to make. On each request, story offset is advanced by ITEMS_PER_REQUEST. Specifying no argument goes through all the stories which appeared on Digg.
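As a rough sketch of how the optional argument and ITEMS_PER_REQUEST could fit together (this is my illustration, not the actual code of the script): parse_limit is a made-up helper, and the demo stops after three requests when no limit is given, whereas a real run would stop once a request returns no more stories.

```perl
use strict;
use warnings;

use constant ITEMS_PER_REQUEST => 15;

# Interpret the optional argument: a non-negative integer limits the number
# of requests; no argument means "no limit" (represented here as undef).
# parse_limit is a hypothetical helper name.
sub parse_limit {
    my @args = @_;
    return unless @args;
    die "usage: $0 [number of requests]\n" unless $args[0] =~ /^\d+$/;
    return $args[0];
}

my $limit = parse_limit(@ARGV);
my $n = 0;
while (!defined $limit || $n < $limit) {
    printf "request %d: offset=%d\n", $n + 1, $n * ITEMS_PER_REQUEST;
    $n++;
    last if $n == 3;   # demo guard only; a real loop ends on an empty response
}
```

So the first request uses offset 0, the second offset 15, the third offset 30, and so on.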
For example, to print out the picture posts which are currently on the front page of Digg, we could use the command:
./digg_extractor.pl 1
Here is a sample of real output of this command:
$ ./digg_extractor.pl 1
title: 13 Dumbest Drivers in the World [PICS]
type: pictures
desc: Think of this like an even funnier Darwin awards, but for dumbass driving (and with images).
url: http://wtfzup.com/2007/09/02/unlucky-13-dumbest-drivers-in-the-world/
digg_url: http://digg.com/offbeat_news/13_Dumbest_Drivers_in_the_World_PICS
category: Offbeat News
short_category: offbeat_news
user: suxmonkey
user_pic: http://digg.com/userimages/s/u/x/suxmonkey/large6009.jpg
date: 2007-09-02 14:00:06
This output is then fed into the db_inserter.pl script, which inserts the data into an SQLite database.
Then page_gen.pl is run, which generates the static HTML contents.
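To show how easy this paragraph format is to consume, here is a minimal sketch of a parser for it. It is not taken from db_inserter.pl: parse_records is a made-up name, the database step is left out, and the second record in the sample is invented for the demo.

```perl
use strict;
use warnings;

# Parse extractor output: one "key: value" pair per line,
# records separated by a blank line. Returns a list of hash refs.
sub parse_records {
    my ($text) = @_;
    my @records;
    for my $para (split /\n\s*\n/, $text) {
        my %rec;
        for my $line (split /\n/, $para) {
            # Split on the first colon only, so URLs and times stay intact.
            my ($key, $value) = $line =~ /^(\w+):\s*(.*)$/ or next;
            $rec{$key} = $value;
        }
        push @records, \%rec if %rec;
    }
    return @records;
}

my $sample = <<'END';
title: 13 Dumbest Drivers in the World [PICS]
type: pictures
url: http://wtfzup.com/2007/09/02/unlucky-13-dumbest-drivers-in-the-world/
date: 2007-09-02 14:00:06

title: A second, made-up story
type: pictures
url: http://example.com/second
date: 2007-09-02 15:00:00
END

for my $rec (parse_records($sample)) {
    print "$rec->{date}  $rec->{title}\n";
}
```

Each record ends up as a plain hash, which maps naturally onto one row of a database table.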
Please refer to the original post of the reddit media website generator to find more details.
Summing it up, only one new script had to be written, and a few minor changes had to be made to the existing scripts, to generate the new website.
Here is this new script digg_extractor.pl:
digg extractor (perl script, digg picture website generator)
Click http://digpicz.com to visit the site!
Here are all the scripts packed together in a single .zip, with basic documentation:
Download link: digg picture website generator suite (.zip)
For newcomers, digg is a democratic social news website where users decide its contents.
From their FAQ:
What is Digg?
Digg is a place for people to discover and share content from anywhere on the web. From the biggest online destinations to the most obscure blog, Digg surfaces the best stuff as voted on by our users. You won't find editors at Digg - we're here to provide a place where people can collectively determine the value of content and we're changing the way people consume information online.
How do we do this? Everything on Digg - from news to videos to images to Podcasts - is submitted by our community (that would be you). Once something is submitted, other people see it and Digg what they like best. If your submission rocks and receives enough Diggs, it is promoted to the front page for the millions of our visitors to see.