This is part II of II of the article on how picurls.com was created. In part one we made a universal plugin-based website scraper in Perl to get posts from social news and social bookmarking websites.
In this part I will describe the database design and how the user interface was done in PHP with the Smarty templating and caching engine.
Database design
I chose the SQLite database for this project because of its simplicity - the whole database is a single file - and because of its excellent performance and small memory footprint. The only thing I am concerned about is concurrency.
The SQLite FAQ says:
SQLite allows multiple processes to have the database file open at once, and for multiple processes to read the database at once. When any process wants to write, it must lock the entire database file for the duration of its update. But that normally only takes a few milliseconds. Other processes just wait on the writer to finish then continue about their business. However, client/server database engines (such as PostgreSQL, MySQL, or Oracle) usually support a higher level of concurrency and allow multiple processes to be writing to the same database at the same time. This is possible in a client/server database because there is always a single well-controlled server process available to coordinate access. If your application has a need for a lot of concurrency, then you should consider using a client/server database. But experience suggests that most applications need much less concurrency than their designers imagine.
If picurls gets really popular, people might start getting database errors when posting comments or updating their profiles. I certainly do not want to create such a negative user experience.
The database uses only the simplest SQL constructs and thus can easily be moved to a client/server database engine if something bad happens.
Here is how the database schema of the first version of picurls looks:
CREATE TABLE items (
    id         INTEGER PRIMARY KEY AUTOINCREMENT,
    title      STRING NOT NULL,
    sane_title STRING NOT NULL,
    url        STRING NOT NULL,
    thumb      STRING NOT NULL,
    site_id    INTEGER NOT NULL,
    date_added DATE NOT NULL,
    visible    BOOL NOT NULL DEFAULT 1
);

CREATE TABLE tmp_items (
    id         INTEGER PRIMARY KEY AUTOINCREMENT,
    title      STRING NOT NULL,
    url        STRING NOT NULL,
    date_added DATE NOT NULL,
    site_id    INTEGER NOT NULL,
    tries      INTEGER NOT NULL DEFAULT 0
);

CREATE TABLE comments (
    id             INTEGER PRIMARY KEY AUTOINCREMENT,
    comment        STRING NOT NULL,
    item_id        INTEGER NOT NULL,
    user_id        STRING NOT NULL,
    anonymous_name STRING,
    ip_address     STRING NOT NULL,
    date_added     DATE NOT NULL
);

CREATE TABLE sites (
    id        INTEGER PRIMARY KEY,
    name      STRING NOT NULL UNIQUE,
    sane_name STRING NOT NULL UNIQUE,
    url       STRING NOT NULL UNIQUE,
    visible   BOOL NOT NULL DEFAULT 1,
    priority  INTEGER NOT NULL
);

CREATE TABLE users (
    id          INTEGER PRIMARY KEY AUTOINCREMENT,
    username    STRING NOT NULL UNIQUE,
    password    STRING NOT NULL,
    data        STRING,
    ip_address  STRING NOT NULL,
    date_regged DATE NOT NULL,
    date_access DATE NOT NULL,
    can_login   BOOL NOT NULL DEFAULT 1
);

CREATE INDEX IDX_sites_sane_name      ON sites(sane_name);
CREATE INDEX IDX_sites_priority       ON sites(priority);
CREATE INDEX IDX_items_site_id        ON items(site_id);
CREATE INDEX IDX_items_date_added     ON items(date_added);
CREATE INDEX IDX_items_sane_title     ON items(sane_title);
CREATE INDEX IDX_comments_item_id     ON comments(item_id);
CREATE INDEX IDX_comments_user_id     ON comments(user_id);
CREATE INDEX IDX_comments_date_added  ON comments(date_added);
CREATE INDEX IDX_comments_item_user_ip ON comments(item_id, user_id, ip_address);
CREATE INDEX IDX_users_username       ON users(username);

INSERT INTO sites (name, sane_name, url, priority) VALUES('Digg', 'digg', 'http://www.digg.com', 1);
INSERT INTO sites (name, sane_name, url, priority) VALUES('Reddit', 'reddit', 'http://reddit.com', 2);
INSERT INTO sites (name, sane_name, url, priority) VALUES('del.icio.us', 'delicious', 'http://del.icio.us', 3);
INSERT INTO sites (name, sane_name, url, priority) VALUES('StumbleUpon', 'stumbleupon', 'http://www.stumbleupon.com', 4);
INSERT INTO sites (name, sane_name, url, priority) VALUES('Flickr', 'flickr', 'http://www.flickr.com', 5);
INSERT INTO sites (name, sane_name, url, priority) VALUES('Simpy', 'simpy', 'http://www.simpy.com', 6);
INSERT INTO sites (name, sane_name, url, priority) VALUES('Furl', 'furl', 'http://www.furl.net', 7);
INSERT INTO sites (name, sane_name, url, priority) VALUES('Boing Boing', 'boingboing', 'http://www.boingboing.net', 8);
INSERT INTO sites (name, sane_name, url, priority) VALUES('Wired', 'wired', 'http://www.wired.com', 9);

INSERT INTO users (id, username, password, ip_address, date_regged, date_access, can_login)
VALUES (0, 'anonymous', 'x', '0.0.0.0', '1970-01-01 00:00:00', '1970-01-01 00:00:00', 0);
As I mentioned in part one I want the whole project to be reusable in the future (I already have an idea what I will fork from this project).
You might have noticed that the database schema contains almost no fields specific to picurls (the exception being the 'thumb' field in the items table).
Here is a very brief description of the tables:
- items - contains links to pictures to be displayed on the front page of picurls.
- tmp_items - contains links to possible pictures which the scraper (see part one of this article) found on social bookmarking/social news sites.
- comments - contains user comments.
- sites - contains information about sites picurls is collecting pictures from.
- users - contains registered user information.
If SQLite becomes unsuitable for picurls at some point, I can just dump the database and use almost the same database schema (maybe changing a few field types) in MySQL or PostgreSQL.
User Interface Design
I chose the PHP programming language as the server-side language for the user interface of picurls. One of the reasons is that it is one of the most popular programming languages, and as I am releasing the full source code of picurls, I expect someone to help me with adding features or just spotting bugs :)
The logic behind handling user interface requests works similarly to the web.py framework. First we define the URL structure and specify which scripts will handle which request URLs.
Here is an example of picurls' URL structure:
$pages = Array(
    '#^/(?:index\.(?:html|php))?$#'   => 'page-index.php',       # main page handler
    '#^/site/(\w+)(?:-(\d+))?.html#'  => 'page-site.php',        # site handler (digg, reddit etc)
    '#^/item/([a-z0-9-]+).html#'      => 'page-item.php',        # single item handler
    '#^/login.html#'                  => 'page-login.php',       # login page handler
    '#^/register.html#'               => 'page-register.php',    # registration page handler
    '#^/logout.html#'                 => 'page-logout.php',      # logout page handler
    '#^/my-comments(?:-(\d+))?.html#' => 'page-my-comments.php', # my comments page handler
    '#^/my-profile.html#'             => 'page-my-profile.php'   # my profile page handler
);
For example, a request to 'http://picurls.com/site/digg-3.html' would get handled by the 'page-site.php' script. The value '3' (the page number) would get saved, so that page-site.php knows which page number was requested.
Each request to the server gets handled by the default webserver index file - index.php. To make it work this way, I set up mod_rewrite to rewrite all URLs to index.php:
<IfModule mod_rewrite.c>
    RewriteEngine On
    RewriteBase /
    RewriteCond %{REQUEST_FILENAME} !-f
    RewriteCond %{REQUEST_FILENAME} !-d
    RewriteRule . /index.php [L]
</IfModule>
It is very important to have human-readable URLs, such as 'http://picurls.com/item/animals-really-can-get-along-pictures.html', and not something like 'http://picurls.com/item.php?id=37'. This way the website will rank better on search engines and people will find what they are looking for more easily.
I noticed that people coming from search engines to digpicz and redditmedia have most often searched for a set of keywords they remembered from the picture.
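The slug part of these URLs comes from the 'sane_title' column in the items table. As a rough illustration - this helper is hypothetical, not taken from the picurls source - a title can be turned into such a slug like this:

<?php
# Hypothetical helper (not from the picurls source): turn an item title into
# a slug usable as 'sane_title' and in the /item/<slug>.html URLs.
function make_sane_title($title) {
    $slug = strtolower($title);                        # lowercase everything
    $slug = preg_replace('/[^a-z0-9]+/', '-', $slug);  # collapse non-alphanumerics into dashes
    return trim($slug, '-');                           # strip leading/trailing dashes
}

# "Animals Really Can Get Along (Pictures)" becomes
# "animals-really-can-get-along-pictures"
?>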
Here is the index.php script that handles the URL requests and dispatches them to appropriate scripts:
See catonmat.net/ftp/picurls-index.phps
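The linked file is the full script; the core idea, though, is just a loop that matches the request URI against the $pages regexes and includes the first handler that matches. Here is a minimal sketch of that idea (the $url_args variable and the 404 handler name are illustrative, not necessarily what the real index.php uses):

<?php
# Sketch of the dispatch loop only - see the linked index.php for the real code.
require_once 'config.php';                 # assumed to define $pages and site settings

$request = $_SERVER['REQUEST_URI'];
foreach ($pages as $regex => $handler) {
    if (preg_match($regex, $request, $matches)) {
        $url_args = array_slice($matches, 1);  # captured groups: site name, page number, item slug, ...
        include $handler;                      # hand the request over to the page-*.php script
        exit;
    }
}

header("HTTP/1.0 404 Not Found");          # nothing matched
include 'page-404.php';                    # hypothetical 404 page
?>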
And here is the page-site.php script which handles requests for pictures from a particular site (such as digg or stumbleupon):
See catonmat.net/ftp/picurls-page-site.phps
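Again, the linked script is the real implementation. Roughly, such a handler looks up the site by its sane name, fetches one page worth of visible items, and hands them to a Smarty template. A minimal sketch, assuming PDO with the SQLite driver and an illustrative template name:

<?php
# Rough sketch of a site page handler - not the actual page-site.php.
# $url_args and $smarty are assumed to be set up by index.php.
$site_name = $url_args[0];                               # e.g. 'digg', from the URL regex
$page      = isset($url_args[1]) ? (int)$url_args[1] : 1;
$per_page  = 10;                                         # assumed page size
$offset    = ($page - 1) * $per_page;

$db = new PDO('sqlite:/path/to/db/picurls.db');          # real path comes from config.php

$stmt = $db->prepare(
    "SELECT items.* FROM items
     JOIN sites ON sites.id = items.site_id
     WHERE sites.sane_name = ? AND items.visible = 1
     ORDER BY items.date_added DESC
     LIMIT $per_page OFFSET $offset"
);
$stmt->execute(array($site_name));

$smarty->assign('items', $stmt->fetchAll(PDO::FETCH_ASSOC));
$smarty->assign('page',  $page);
$smarty->display('site.tpl');                            # template name is an assumption
?>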
The actual contents of the pages get served by the Smarty templating and caching engine. It is good practice to separate application logic and content from presentation. If you have not worked this way before, read the "About Smarty" page for an example of why it is a good practice.
Since I went with a dynamic (PHP) solution for serving content, expect the website to become quite popular, and am on a server with limited resources, I needed a good and fast way to display the website's pages. Smarty has exactly what I was looking for - support for caching.
The first version of picurls.com does not invalidate its cache dynamically (based on content changes); instead, it simply caches each page for a fixed amount of time, after which the next request regenerates the page.
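In Smarty terms that just means turning caching on and giving every page a fixed lifetime. A minimal sketch (the 10-minute lifetime, the cache id scheme and the fetch_items_from_db helper are example values and a hypothetical function, not necessarily what picurls uses):

<?php
require 'Smarty.class.php';

$smarty = new Smarty;
$smarty->caching        = 1;                 # enable Smarty's page caching
$smarty->cache_lifetime = 600;               # cache each page for 10 minutes (example value)

# $site_name and $page as in the site handler sketch above
$cache_id = "site|$site_name|$page";         # separate cache entry per site and page number
if (!$smarty->is_cached('site.tpl', $cache_id)) {
    # only hit the database when there is no fresh cached copy
    $smarty->assign('items', fetch_items_from_db($site_name, $page));  # hypothetical helper
}
$smarty->display('site.tpl', $cache_id);
?>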
That's about it. :)
Please ask in the comments if you want a more detailed explanation of some software components!
Download Picurls Website Source Code
All the scripts in a single .zip:
Download link: catonmat.net/ftp/picurls-dot-com-website-source-code.zip
The source code is released under GNU General Public License.
If you use the source and create your own picurls-like website you must link back to my (this) blog https://catonmat.net and you must link back to http://picurls.com!
The .zip archive contains several subdirectories:
- cache - directory where Smarty keeps cached pages.
- db - directory where the SQLite database is kept. I included a sample database with 90 pictures - 10 pictures from each of the sites picurls.com takes contents from (digg, reddit, delicious, flickr, stumbleupon, simpy, furl, wired and boing boing).
- locks - directory where scripts hold their lock files to ensure only a single copy of the scraper/thumbnail generator scripts is running at any given time.
- scraper - website data-miner/scraper program (see part one of this article for more information).
- scripts - scripts which call the scraper, insert data into the database, and generate thumbnails.
- templates - HTML templates for Smarty.
- templates_c - directory where Smarty keeps compiled templates.
- www - main website directory, containing all the PHP scripts, CSS, images and thumbnails (around 90 thumbnails, one for each item in the sample database).
To get your own picurls website running, you will have to configure the config.php file in the www directory, where you need to specify the full path to the SQLite database (in the db directory). That only gets you the user interface, though - no new items will ever be retrieved from the websites because you still have to set up the scraper yourself. Full instructions would take another article, but if you want to try, look at the cronjob.sh shell script in the scripts directory. Running this script periodically will scrape the websites for new posts which look like image posts and try to insert them into the database (you will have to change some constants in the picurls_db_inserter.pl script). After this script has run, the pic_mover.pl script has to be run as well (it also needs its constants changed).
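For example, running cronjob.sh once an hour could look like this in a crontab (the installation path is just a placeholder):

# run the picurls scraper script every hour (path is a placeholder)
0 * * * * /path/to/picurls/scripts/cronjob.sh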
Enjoy!