This is part II of II of the article on how picurls.com was created. In part one we made a universal plugin-based website scraper in Perl to get posts from social news and social bookmarking websites.
In this part I will describe the database design and how the user interface was done in PHP with the Smarty templating and caching engine.
Database design
I chose the SQLite database for this project because of its simplicity - the whole database is a single file - and because of its excellent performance and small memory footprint. The only thing I am concerned about is concurrency.
The SQLite FAQ says:
SQLite allows multiple processes to have the database file open at once, and for multiple processes to read the database at once. When any process wants to write, it must lock the entire database file for the duration of its update. But that normally only takes a few milliseconds. Other processes just wait on the writer to finish then continue about their business. However, client/server database engines (such as PostgreSQL, MySQL, or Oracle) usually support a higher level of concurrency and allow multiple processes to be writing to the same database at the same time. This is possible in a client/server database because there is always a single well-controlled server process available to coordinate access. If your application has a need for a lot of concurrency, then you should consider using a client/server database. But experience suggests that most applications need much less concurrency than their designers imagine.
If picurls gets really popular, people might start getting database errors when posting comments or updating their profiles. I certainly do not want to create such a negative user experience.
The database uses only the simplest SQL constructs and thus can easily be moved to a client/server database engine if something bad happens.
Here is how the database schema of the first version of picurls looks:
CREATE TABLE items (
    id         INTEGER PRIMARY KEY AUTOINCREMENT,
    title      STRING NOT NULL,
    sane_title STRING NOT NULL,
    url        STRING NOT NULL,
    thumb      STRING NOT NULL,
    site_id    INTEGER NOT NULL,
    date_added DATE NOT NULL,
    visible    BOOL NOT NULL DEFAULT 1
);

CREATE TABLE tmp_items (
    id         INTEGER PRIMARY KEY AUTOINCREMENT,
    title      STRING NOT NULL,
    url        STRING NOT NULL,
    date_added DATE NOT NULL,
    site_id    INTEGER NOT NULL,
    tries      INTEGER NOT NULL DEFAULT 0
);

CREATE TABLE comments (
    id             INTEGER PRIMARY KEY AUTOINCREMENT,
    comment        STRING NOT NULL,
    item_id        INTEGER NOT NULL,
    user_id        STRING NOT NULL,
    anonymous_name STRING,
    ip_address     STRING NOT NULL,
    date_added     DATE NOT NULL
);

CREATE TABLE sites (
    id        INTEGER PRIMARY KEY,
    name      STRING NOT NULL UNIQUE,
    sane_name STRING NOT NULL UNIQUE,
    url       STRING NOT NULL UNIQUE,
    visible   BOOL NOT NULL DEFAULT 1,
    priority  INTEGER NOT NULL
);

CREATE TABLE users (
    id          INTEGER PRIMARY KEY AUTOINCREMENT,
    username    STRING NOT NULL UNIQUE,
    password    STRING NOT NULL,
    data        STRING,
    ip_address  STRING NOT NULL,
    date_regged DATE NOT NULL,
    date_access DATE NOT NULL,
    can_login   BOOL NOT NULL DEFAULT 1
);

CREATE INDEX IDX_sites_sane_name      ON sites(sane_name);
CREATE INDEX IDX_sites_priority       ON sites(priority);
CREATE INDEX IDX_items_site_id        ON items(site_id);
CREATE INDEX IDX_items_date_added     ON items(date_added);
CREATE INDEX IDX_items_sane_title     ON items(sane_title);
CREATE INDEX IDX_comments_item_id     ON comments(item_id);
CREATE INDEX IDX_comments_user_id     ON comments(user_id);
CREATE INDEX IDX_comments_date_added  ON comments(date_added);
CREATE INDEX IDX_comments_item_user_ip ON comments(item_id, user_id, ip_address);
CREATE INDEX IDX_users_username       ON users(username);

INSERT INTO sites (name, sane_name, url, priority) VALUES('Digg', 'digg', 'http://www.digg.com', 1);
INSERT INTO sites (name, sane_name, url, priority) VALUES('Reddit', 'reddit', 'http://reddit.com', 2);
INSERT INTO sites (name, sane_name, url, priority) VALUES('del.icio.us', 'delicious', 'http://del.icio.us', 3);
INSERT INTO sites (name, sane_name, url, priority) VALUES('StumbleUpon', 'stumbleupon', 'http://www.stumbleupon.com', 4);
INSERT INTO sites (name, sane_name, url, priority) VALUES('Flickr', 'flickr', 'http://www.flickr.com', 5);
INSERT INTO sites (name, sane_name, url, priority) VALUES('Simpy', 'simpy', 'http://www.simpy.com', 6);
INSERT INTO sites (name, sane_name, url, priority) VALUES('Furl', 'furl', 'http://www.furl.net', 7);
INSERT INTO sites (name, sane_name, url, priority) VALUES('Boing Boing', 'boingboing', 'http://www.boingboing.net', 8);
INSERT INTO sites (name, sane_name, url, priority) VALUES('Wired', 'wired', 'http://www.wired.com', 9);

INSERT INTO users (id, username, password, ip_address, date_regged, date_access, can_login)
VALUES (0, 'anonymous', 'x', '0.0.0.0', '1970-01-01 00:00:00', '1970-01-01 00:00:00', 0);
As I mentioned in part one I want the whole project to be reusable in the future (I already have an idea what I will fork from this project).
You might have noticed that the database schema contains almost no fields specific to picurls (the exception being the 'thumb' field in the items table).
Here is a very brief description of the tables:
- items - contains links to pictures to be displayed on the front page of picurls.
- tmp_items - contains links to possible pictures which the scraper (see part one of this article) found on social bookmarking/social news sites.
- comments - contains user comments.
- sites - contains information about sites picurls is collecting pictures from.
- users - contains registered user information.
If SQLite becomes unsuitable for picurls at some point, I can just dump the database and use almost the same database schema (maybe changing a few field types) in MySQL or PostgreSQL.
User Interface Design
I chose the PHP programming language as the server-side language for the user interface of picurls. One of the reasons is that it is one of the most popular programming languages, and as I am releasing the full source code of picurls, I expect someone to help me with adding features or just spotting bugs :)
The logic behind handling user interface requests works similarly to the web.py framework. First we define the URL structure and specify which scripts will handle which request URLs.
Here is an example of picurls' URL structure:
$pages = Array(
    '#^/(?:index\.(?:html|php))?$#'   => 'page-index.php',       # main page handler
    '#^/site/(\w+)(?:-(\d+))?.html#'  => 'page-site.php',        # site handler (digg, reddit etc)
    '#^/item/([a-z0-9-]+).html#'      => 'page-item.php',        # single item handler
    '#^/login.html#'                  => 'page-login.php',       # login page handler
    '#^/register.html#'               => 'page-register.php',    # registration page handler
    '#^/logout.html#'                 => 'page-logout.php',      # logout page handler
    '#^/my-comments(?:-(\d+))?.html#' => 'page-my-comments.php', # my comments page handler
    '#^/my-profile.html#'             => 'page-my-profile.php'   # my profile page handler
);
For example, a request to 'http://picurls.com/site/digg-3.html' would get handled by the 'page-site.php' script. The value '3' (the page number) would get saved, so that page-site.php knows which page number was requested.
Each request to the server gets handled by the default webserver index file - index.php. To make it work this way, I set up mod_rewrite to rewrite all URLs to index.php:
<IfModule mod_rewrite.c>
    RewriteEngine On
    RewriteBase /
    RewriteCond %{REQUEST_FILENAME} !-f
    RewriteCond %{REQUEST_FILENAME} !-d
    RewriteRule . /index.php [L]
</IfModule>
It is very important to have human-readable URLs, such as 'http://picurls.com/item/animals-really-can-get-along-pictures.html', and not something like 'http://picurls.com/item.php?id=37'. This way the website will rank better on search engines and people will find what they are looking for more easily.
I noticed that people coming from search engines to digpicz and redditmedia have most often searched for a set of keywords they remembered from the picture.
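The slug part of these URLs comes from the 'sane_title' column in the items table. As a rough illustration - this helper is hypothetical, not taken from the picurls source - a title can be turned into such a slug like this:

<?php
# Hypothetical helper (not from the picurls source): turn an item title into
# a slug usable as 'sane_title' and in the /item/<slug>.html URLs.
function make_sane_title($title) {
    $slug = strtolower($title);                        # lowercase everything
    $slug = preg_replace('/[^a-z0-9]+/', '-', $slug);  # collapse non-alphanumerics into dashes
    return trim($slug, '-');                           # strip leading/trailing dashes
}

# "Animals Really Can Get Along (Pictures)" becomes
# "animals-really-can-get-along-pictures"
?>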
Here is the index.php script that handles the URL requests and dispatches them to appropriate scripts:
See catonmat.net/ftp/picurls-index.phps
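The linked file is the full script; the core idea, though, is just a loop that matches the request URI against the $pages regexes and includes the first handler that matches. Here is a minimal sketch of that idea (the $url_args variable and the 404 handler name are illustrative, not necessarily what the real index.php uses):

<?php
# Sketch of the dispatch loop only - see the linked index.php for the real code.
require_once 'config.php';                 # assumed to define $pages and site settings

$request = $_SERVER['REQUEST_URI'];
foreach ($pages as $regex => $handler) {
    if (preg_match($regex, $request, $matches)) {
        $url_args = array_slice($matches, 1);  # captured groups: site name, page number, item slug, ...
        include $handler;                      # hand the request over to the page-*.php script
        exit;
    }
}

header("HTTP/1.0 404 Not Found");          # nothing matched
include 'page-404.php';                    # hypothetical 404 page
?>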
And here is the page-site.php script which handles requests for pictures from a particular site (such as digg or stumbleupon):
See catonmat.net/ftp/picurls-page-site.phps
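Again, the linked script is the real implementation. Roughly, such a handler looks up the site by its sane name, fetches one page worth of visible items, and hands them to a Smarty template. A minimal sketch, assuming PDO with the SQLite driver and an illustrative template name:

<?php
# Rough sketch of a site page handler - not the actual page-site.php.
# $url_args and $smarty are assumed to be set up by index.php.
$site_name = $url_args[0];                               # e.g. 'digg', from the URL regex
$page      = isset($url_args[1]) ? (int)$url_args[1] : 1;
$per_page  = 10;                                         # assumed page size
$offset    = ($page - 1) * $per_page;

$db = new PDO('sqlite:/path/to/db/picurls.db');          # real path comes from config.php

$stmt = $db->prepare(
    "SELECT items.* FROM items
     JOIN sites ON sites.id = items.site_id
     WHERE sites.sane_name = ? AND items.visible = 1
     ORDER BY items.date_added DESC
     LIMIT $per_page OFFSET $offset"
);
$stmt->execute(array($site_name));

$smarty->assign('items', $stmt->fetchAll(PDO::FETCH_ASSOC));
$smarty->assign('page',  $page);
$smarty->display('site.tpl');                            # template name is an assumption
?>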
The actual contents of the pages get served by the Smarty templating and caching engine. It is good practice to separate application logic and content from presentation. If you have not worked this way before, read the "About Smarty" page for an example of why it is a good practice.
Since I went with a dynamic (PHP) solution for serving content, expect the website to become quite popular, and am on a server with limited resources, I needed a good and fast way to display the website's pages. Smarty has exactly what I was looking for - support for caching.
The first version of picurls.com does not invalidate its cache dynamically (based on content changes); instead, it simply caches each page for a fixed amount of time, after which the next request regenerates the page.
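In Smarty terms that just means turning caching on and giving every page a fixed lifetime. A minimal sketch (the 10-minute lifetime, the cache id scheme and the fetch_items_from_db helper are example values and a hypothetical function, not necessarily what picurls uses):

<?php
require 'Smarty.class.php';

$smarty = new Smarty;
$smarty->caching        = 1;                 # enable Smarty's page caching
$smarty->cache_lifetime = 600;               # cache each page for 10 minutes (example value)

# $site_name and $page as in the site handler sketch above
$cache_id = "site|$site_name|$page";         # separate cache entry per site and page number
if (!$smarty->is_cached('site.tpl', $cache_id)) {
    # only hit the database when there is no fresh cached copy
    $smarty->assign('items', fetch_items_from_db($site_name, $page));  # hypothetical helper
}
$smarty->display('site.tpl', $cache_id);
?>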
That's about it. :)
Please ask in the comments if you want a more detailed explanation of some software components!
Download Picurls Website Source Code
All the scripts in a single .zip:
Download link: catonmat.net/ftp/picurls-dot-com-website-source-code.zip
The source code is released under GNU General Public License.
If you use the source and create your own picurls-like website you must link back to my (this) blog https://catonmat.net and you must link back to http://picurls.com!
The .zip archive contains several subdirectories:
- cache - directory where Smarty keeps cached pages.
- db - directory where the SQLite database is kept. I included a sample database with 90 pictures - 10 pictures from each of the sites picurls.com takes contents from (digg, reddit, delicious, flickr, stumbleupon, simpy, furl, wired and boing boing).
- locks - directory where scripts hold their lock files to ensure only a single copy of the scraper/thumbnail generator scripts is running at any given time.
- scraper - website data-miner/scraper program (see part one of this article for more information).
- scripts - scripts which call the scraper, insert data into the database, and generate thumbnails.
- templates - HTML templates for Smarty.
- templates_c - directory where Smarty keeps compiled templates.
- www - main website directory, containing all the PHP scripts, CSS, images and thumbnails (around 90 thumbnails, one for each item in the sample database).
To get your own picurls website running, you will have to configure the config.php file in the www directory, where you need to specify the full path to the SQLite database (in the db directory). That only gets you the user interface, though - no new items will ever be retrieved from the websites because you still have to set up the scraper yourself. Full instructions would take another article, but if you want to try, look at the cronjob.sh shell script in the scripts directory. Running this script periodically will scrape the websites for new posts which look like image posts and try to insert them into the database (you will have to change some constants in the picurls_db_inserter.pl script). After this script has run, the pic_mover.pl script has to be run as well (it also needs its constants changed).
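For example, running cronjob.sh once an hour could look like this in a crontab (the installation path is just a placeholder):

# run the picurls scraper script every hour (path is a placeholder)
0 * * * * /path/to/picurls/scripts/cronjob.sh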
Enjoy!