Delayed Job (DJ)

Posted by tobi — 04:04 PM Feb 17

I finally got a GitHub invitation and used the opportunity to release another Shopify extraction.

Delayed::Job, or DJ, is an asynchronous priority queue which relies only on a simple database table. It doesn’t require you to run a dedicated server like many other systems do.

We use it for a lot of longer-running tasks in Shopify, such as sending newsletters, uploading files to S3, downloading images from URLs, indexing products in Solr and so on.

There are two ways to add jobs to the queue:

Jobs are simple Ruby objects with a method called perform. Any object which responds to perform can be stuffed into the jobs table. Job objects are serialized to YAML so that they can later be resurrected by the job runner.


  class NewsletterJob < Struct.new(:text, :emails)
    def perform
      emails.each { |e| NewsletterMailer.deliver_text_to_email(text, e) }
    end    
  end  

  Delayed::Job.enqueue NewsletterJob.new('lorem ipsum...', Customers.find(:all).collect(&:email))
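
To make the serialize-and-resurrect idea concrete, here is a minimal sketch of a queue that stores jobs as YAML and runs the highest priority first. ToyQueue and GreetingJob are hypothetical names invented for this illustration, an in-memory array stands in for the jobs table, and none of this is Delayed::Job's actual implementation.

```ruby
require 'yaml'

# Toy stand-in for the jobs table: each row holds a priority and a YAML payload.
class ToyQueue
  Row = Struct.new(:priority, :handler)

  def initialize
    @rows = []
  end

  # Serialize any object that responds to #perform and store it as a row.
  def enqueue(job, priority = 0)
    raise ArgumentError, 'job must respond to #perform' unless job.respond_to?(:perform)
    @rows << Row.new(priority, YAML.dump(job))
  end

  # Take the highest-priority row, resurrect the object from YAML and run it.
  # Returns the result of #perform, or nil when the queue is empty.
  def work_one
    row = @rows.max_by(&:priority)
    return nil if row.nil?
    @rows.delete_at(@rows.index(row))
    load_yaml(row.handler).perform
  end

  private

  def load_yaml(text)
    # Newer Psych versions refuse custom classes in YAML.load, so prefer
    # unsafe_load where it exists. Only do this with payloads you wrote yourself.
    YAML.respond_to?(:unsafe_load) ? YAML.unsafe_load(text) : YAML.load(text)
  end
end

# Hypothetical job: any object with a #perform method would do.
GreetingJob = Struct.new(:name) do
  def perform
    "hello #{name}"
  end
end

queue = ToyQueue.new
queue.enqueue(GreetingJob.new('world'))
queue.work_one
```

The point of the sketch is only that the worker needs nothing but the stored text and the class definition to run the job later.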

There is also a second way to get jobs in the queue: send_later.


  BatchImporter.new(Shop.find(1)).send_later(:import_massive_csv, massive_csv)                                                    

This will simply create a Delayed::PerformableMethod job in the jobs table which serializes all the parameters you pass to it. There are some special smarts for ActiveRecord objects, which are stored as their text representation and loaded fresh from the database when the job is actually run later.
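
The general shape of such a "remember the call, replay it later" job can be sketched in a few lines. ToyPerformableMethod, ToySendLater, send_later_sketch and ToyImporter are hypothetical names for this illustration. The sketch returns the job instead of writing it to a table and leaves out the ActiveRecord handling described above, so treat it as an approximation of the idea rather than the plugin's code.

```ruby
# Hypothetical stand-in for a performable-method job: it remembers a receiver,
# a method name and the arguments, and replays the call in #perform.
ToyPerformableMethod = Struct.new(:object, :method_name, :args) do
  def perform
    object.send(method_name, *args)
  end
end

module ToySendLater
  # Builds the job and returns it. A real queue would store it instead.
  def send_later_sketch(method_name, *args)
    ToyPerformableMethod.new(self, method_name, args)
  end
end

# Hypothetical class with a slow method we would rather run in the background.
class ToyImporter
  include ToySendLater

  def import(rows)
    rows.size
  end
end

job = ToyImporter.new.send_later_sketch(:import, %w[a b c])
job.perform # runs ToyImporter#import with the captured arguments
```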

The plugin can be found on GitHub.

6 comments (closed) Filed under: Code Rails

ActiveMerchant PDF

Posted by tobi — 06:07 PM Jan 29

If you are working on a ruby application that requires dealing with credit cards, you are probably using ActiveMerchant. If not, you probably didn’t know about ActiveMerchant.

ActiveMerchant is an extraction from Shopify. It’s a simple-to-use library which translates one common interface into the wire language of 30-40 different payment processors around the globe, with more added at a rapid pace. As long as your application can talk to ActiveMerchant you can switch payment providers with a single line of code.

Treat yourself to Cody Fauser’s excellent ActiveMerchant PeepCode PDF, which is an in-depth discussion of the library and covers topics such as order pipelines, order state management and the appropriate unit testing which a financial application requires.

Cody is the main programmer of ActiveMerchant, which I originally started. Cody took the library further than anything I envisioned and it’s now one of the most competent libraries for Ruby.

0 comments (closed) Filed under: Code Rails

Futuretalk: CouchDB

Posted by tobi — 11:42 AM Sep 02

I have to confess: I really don’t like relational databases. I can’t wait for the day we can ditch them.

Think about it for a second: Databases store data to disk. That’s all that 90% of us use them for. They are essentially elaborate hash tables backed by a disk drive. Why are they more lines of code than some operating systems?

Despite that, unless you have a really well thought out setup, a disk failure is still a major disaster. Even if you have backups, even if you have replication, there will be downtime and manual labor while a new master server is established. Databases never put your 10-20 commodity server boxes with all their spare disk space to use. They always sit on these really expensive ivory tower IBM boxes outside of your cheap cluster.

Despite millions of man-years of research, databases are actually pretty dumb. You have to tell them about every nuance of your schema, you have to tell them about indexes and so on. If you forget an index they are perfectly happy to run sequentially through all the data you ever inserted into them, many times a second.

Replication is generally a nightmare and every machine involved in the replication needs to have enough disk space to store the entire content of the database.

There are several interesting projects which try to reinvent the database as we know it. Yesterday I found out about a particularly interesting one: CouchDB, a contender for “The next generation web storage”, as their website proclaims. The project started out using C++ but eventually changed to Erlang, which is a perfect choice for highly parallel server software.

CouchDB has no tables, it just has a flat global namespace for documents. A document is a simple JSON record.


POST /shopify/
{
 "value":
 {
   "title":"Arbor Draft",
   "type":"Product",
   "price":299.00,
   "tags":["snowboarding", "freestyle", "wintersport"],
   "description":"...."
 }
}

Instead of defining the schema we simply add arbitrary records. There are no tables.

So how do we retrieve all the records again? CouchDB uses the concept of views, which are essentially JavaScript functions. It uses map/reduce to find matching records in its global namespace so that at query time the results are available instantaneously. This is a huge performance boost for web applications, which generally have many more queries than updates or inserts.

Let’s install some useful views under /shopify/all:


PUT /shopify/all
{
  "_view_documents": "function(doc) { return doc; }",
  "_view_products": "function(doc) { if(doc.type == 'Product') { return doc; } }"
}

GET http://couchserver/shopify/all:products
returns:
  {
    "_id":"all:products",
    "rows":
    [
      {
        "_id":"64ACF01B05F53ACFEC48C062A5D01D89",
        "_rev":"62D22746",
        "title":"Arbor Draft",
        "type":"Product",
        "price":299.00,
        "tags":["snowboarding", "freestyle", "wintersport"],
        "description":"...."
      }
    ]
  }
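
The idea that views are computed when documents are written, so that reads are a plain lookup, can be sketched in Ruby. ToyDocStore is a hypothetical in-memory class made up for this illustration and says nothing about how CouchDB implements views internally.

```ruby
# Toy document store: views are blocks evaluated when a document is written,
# so reading a view is just a lookup of precomputed rows.
class ToyDocStore
  def initialize
    @docs = {}
    @views = {}
    @view_rows = Hash.new { |hash, name| hash[name] = {} }
  end

  # Register a view and index the documents that already exist.
  def add_view(name, &map)
    @views[name] = map
    @docs.each { |id, doc| index(name, id, doc) }
  end

  # Store a document and update every view at write time.
  def put(id, doc)
    @docs[id] = doc
    @views.each_key { |name| index(name, id, doc) }
  end

  # Query time: no scanning, just return the precomputed rows.
  def view(name)
    @view_rows[name].values
  end

  private

  # A view block returns the row to keep, or nil to leave the document out.
  def index(name, id, doc)
    row = @views[name].call(doc)
    if row.nil?
      @view_rows[name].delete(id)
    else
      @view_rows[name][id] = row
    end
  end
end

store = ToyDocStore.new
store.add_view(:products) { |doc| doc if doc['type'] == 'Product' }
store.put('1', { 'type' => 'Product', 'title' => 'Arbor Draft' })
store.put('2', { 'type' => 'Page', 'title' => 'About us' })
store.view(:products)
```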

There are a lot more cool things in CouchDB. Notice that the returned document has a _rev? Older revisions of documents are only deleted if you say so. If you are working on a wiki you just got your historical data for free. Unfortunately CouchDB is still in alpha but I think the fundamentals are sound. It’s a lot more aligned with the way a modern web application works and needs its data represented. Its replication system is already much more powerful than that of other database systems and is in fact very similar to the way Google works with its Bigtable and map/reduce infrastructure.

For further information head to the project’s wiki.

31 comments (closed) Filed under: Code

The Secret to Memcached

Posted by tobi — 11:50 AM May 22

Memcached has long been the answer to most questions containing the word scale. There are some spectacular memcached installations out there. Facebook is said to run a 200-server cluster with 3TB of memory solely for servicing memcached; Shopify, Twitter, Digg, Slashdot and just about every other public-facing application depends on it. Facebook’s installation is said to deliver a 99% cache hit rate while servicing tens of thousands of requests a second.

There are many ways to use this elaborate hash table, and many of them are more trouble than they are worth. In our experience the key to using memcached effectively is to ask it for the exact thing you want, but I’m getting ahead of myself.

A common pattern for using memcached is the following:


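class Product < ActiveRecord::Base

  # Cache stands for whatever memcached wrapper the application uses.
  # Try the cache first and fall back to the database on a miss.
  def self.load(id)
    key = "#{table_name}/#{id}"
    Cache.get(key, self) || Cache.set(key, find(id))
  end

  def after_save; Cache.expire(key); end
  def after_destroy; Cache.expire(key); end

  def key
    "#{self.class.table_name}/#{id}"
  end
end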

The issue is that this model only caches on a per-object basis, but the real database load usually comes from loading collections. Storing a collection in memcached is harder because you have to start tracking the objects in the collection somewhere so that you can efficiently expire the collection once one of its items is changed. And that way, he knew, lay madness.

In Shopify’s case, what we really need is to cache all the data required to render a given public URL. Two requests to the same URL should always yield a cache hit, given all input parameters being equal. In code this could look something like this:


cache params.values.sort.to_s do
  ... load all data ...
end

Of course you have to keep track of all the keys you store in memcached now. A database table will do nicely here.


class CacheKey < ActiveRecord::Base
  def after_destroy; Cache.expire(key); end
end

cache key = params.values.sort.to_s do
  ... load all data ...
  CacheKey.create :key => key
end

CacheKey.destroy_all # Sweep cache

So far so good.

This has been the traditional approach and has worked somewhat. I’m here to offer a better solution though:

Ask for the thing you need, be specific: The complexity of the above solution comes from the simple fact that we formulated our question to memcached too vaguely. Ask yourself what you really require from memcached and then ask it for exactly that. Consider this: When a product is updated all current URLs should be invalidated because they are outdated. Shopify allows the designers to reference a product from any page in the system so we have to run a full sweep. Without informing memcached that its caches are stale it will continue to deliver this stale data and customers will continue to see the old version of the product. A clear misunderstanding between Shopify and memcached.

The solution is simple: At the beginning of each request we load a shop object which we pick depending on the incoming host name. We use the fact that we always load this shop model anyway and add versioning to it. This version column is incremented every time we want to sweep all caches.

Now we add the version number to the cache keys:


cache shop.version.to_s + params.values.sort.to_s do
  ... load all data ...
end

This means that we will never get an outdated version from the caches because we ask them for a very specific thing. After the version number is increased in the database all incoming requests will miss the caches but will be re-cached quickly.

Memcached will automatically get rid of the stale keys once space is needed. Least recently used keys are discarded first, so there is no need for manual cleanup.
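
The effect of putting the version into the key can be shown with a small, self-contained sketch. ToyCache, ToyShop and versioned_key are hypothetical names for this illustration, and a plain Hash stands in for memcached (so nothing is ever evicted here).

```ruby
# In-memory stand-in for memcached, for illustration only.
class ToyCache
  attr_reader :misses

  def initialize
    @store = {}
    @misses = 0
  end

  # Return the cached value, or run the block, store its result and return it.
  def fetch(key)
    return @store[key] if @store.key?(key)
    @misses += 1
    @store[key] = yield
  end
end

# Hypothetical shop record: only the version counter matters here.
ToyShop = Struct.new(:version)

# Build a key from the shop version plus every differentiating input.
def versioned_key(shop, params)
  "v#{shop.version}/" + params.sort.map { |name, value| "#{name}=#{value}" }.join('&')
end

cache  = ToyCache.new
shop   = ToyShop.new(55)
params = { 'page' => 'index', 'cart' => 'draft-151' }

cache.fetch(versioned_key(shop, params)) { 'old html' } # miss, rendered once
cache.fetch(versioned_key(shop, params)) { 'unused' }   # hit, block not run
shop.version += 1                                       # "sweep" everything
cache.fetch(versioned_key(shop, params)) { 'new html' } # miss under the new key
```

Nothing is ever expired explicitly in this sketch: the old entry simply stops being asked for once the version changes.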

In Shopify we use this technique as a way to do page caching. We keep the rendered HTML, HTTP return status code and Content-Type in memcached and use all the differentiating input variables, such as the content of the shopping cart, as keys. We keep the HTML because this saves our server cluster valuable bandwidth by avoiding loading and compiling the Liquid templates from the NFS server. Requests for cached documents are now served in under 10ms.

To summarize, Shopify asks memcached politely to “Hand over version 55 of the index HTML for www.snowdevil.com the way it would look with one Draft 151cm snowboard in the cart”. This is a very specific question for which there is only one valid answer: the exact data we want. Stale data can never be returned because everything which would make it stale will increase the version number.

Quick remark: When you use memcached in Ruby make absolutely sure that you use memcache-client, as it’s the fastest and most used Ruby implementation of the protocol.

32 comments (closed) Filed under: Code Rails

Dealing with Gravatar

Posted by tobi — 03:56 PM Dec 15

We use the Gravatar service on many occasions across the web applications of jaded Pixel. The service recently became ridiculously slow, however, so we needed to find a solution.

Daniel explains how to decouple the gravatar loading from the actual page loading so that the webpages remain snappy.

Taming the Gravatar

2 comments (closed) Filed under: Code

Futuretech - Starfish

Posted by tobi — 10:41 PM Aug 18

Lucas Carlson talks about his exciting new distributed application approach dubbed Starfish, which is essentially a 20%-of-the-work (or less), 80%-of-the-effect implementation of Google’s phenomenally clever MapReduce technology.

A distributed log file parser can look as simple as this:


    server do |map_reduce|
      map_reduce.type = File
      map_reduce.input = "/tmp/big_log_file" 
      map_reduce.queue_size = 1000 # how many lines of the file to buffer at a time
      map_reduce.lines_per_client = 100 # how many lines each client will process at a time
      map_reduce.rescan_when_complete = true
    end

    client do |line|
      if line =~ /some_regex/
        logger.info(line)
      end
    end

Save it as statistics.rb and run:


# starfish statistics.rb

This leads the server to read in big_log_file in chunks of 1000 lines and trickle them out to any number of clients, which in turn parse the data by regexp and report their findings back to the server. The server can then act upon this newfound wisdom, perhaps by updating your client’s statistics or by issuing warnings to abusive customers. Any sizable data mining task should be accomplishable with this strategy.
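
The chunk-and-distribute pattern itself can be sketched on a single machine with threads standing in for the clients. This does not use Starfish. scan_in_chunks and its parameters are hypothetical names for this illustration, and a real setup would spread the work across processes or machines.

```ruby
# Split the input into chunks, hand the chunks to a few workers, and collect
# every line that matches the pattern. Threads stand in for remote clients.
def scan_in_chunks(lines, pattern, chunk_size: 100, workers: 4)
  work = Queue.new
  lines.each_slice(chunk_size) { |chunk| work << chunk }
  workers.times { work << :done } # one stop marker per worker

  found = Queue.new
  threads = Array.new(workers) do
    Thread.new do
      # Each worker keeps pulling chunks until it sees a stop marker.
      while (chunk = work.pop) != :done
        chunk.each { |line| found << line if line =~ pattern }
      end
    end
  end
  threads.each(&:join)

  # Drain the results. Their order depends on thread scheduling.
  Array.new(found.size) { found.pop }
end

log = (1..250).map { |i| (i % 50).zero? ? "ERROR line #{i}" : "ok line #{i}" }
scan_in_chunks(log, /ERROR/).sort
```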

The library only works with ActiveRecord data sets at this point but array and file, as demonstrated above, are in the pipeline.

Update: Starfish 1.1 is now available with the file support mentioned above.

0 comments (closed) Filed under: Code

Shopify Party

Posted by tobi — 07:48 PM Jul 10

Don’t forget that tonight is the belated Shopify Launch Party.

The venue is fantastic and Fiona did a great job organizing everything. We even got our very own schwag!

4 comments (closed) Filed under: Code

Easy migration between Databases

Posted by tobi — 06:01 AM May 29

Recently we did a major switch in Database architectures here at jaded Pixel and needed a simple way to move from one architecture to another.

The first thing I tried was to get the dump utility of the database we were using to produce sensible and portable SQL inserts. This effort fell flat on its face because of subtle differences in the databases’ string escaping and handling of booleans.

So after a quick inquiry in the Rails core channel, Rick Olson recommended dumping the data to YAML and reloading it on the other side. Simple enough: the total amount of data to transfer was well below 100MB, so this seemed like a sensible approach.

Get the backup.rake and add it to your lib/tasks/ directory.

Here is the basic process:

  1. Connect to your server and use RAILS_ENV=production rake db:backup:write to get a YAML representation of all your data:
    
      rake db:backup:write
      (in /Users/tobi/Code/Ruby/shopify)
      Writing addresses...
      Writing articles...
      Writing blogs...
      Writing carts...
      Writing collections...
      ...
      
  2. Update your database.yml to point to your new database.
  3. Run RAILS_ENV=production rake db:backup:read to fill your new database with all written data.

Careful: db:backup:read will delete all data in the target database. Use only with extreme caution.
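
The reason YAML sidesteps the escaping and boolean problems can be seen in a small round trip. This is a sketch of the general idea only and not the contents of backup.rake. write_backup and read_backup are hypothetical helpers, and plain hashes stand in for database rows.

```ruby
require 'yaml'
require 'tmpdir'

# Write one YAML file per table. Each table is an array of row hashes.
def write_backup(tables, dir)
  tables.each do |name, rows|
    File.write(File.join(dir, "#{name}.yml"), YAML.dump(rows))
  end
end

# Read every YAML file in the directory back into a { table => rows } hash.
def read_backup(dir)
  Dir.glob(File.join(dir, '*.yml')).sort.each_with_object({}) do |path, tables|
    tables[File.basename(path, '.yml')] = YAML.load(File.read(path))
  end
end

# Awkward quoting and booleans are exactly what tripped up the SQL dumps.
source = {
  'articles' => [{ 'id' => 1, 'title' => %q(Tom's "new" board), 'published' => true }],
  'blogs'    => [{ 'id' => 7, 'title' => 'News', 'published' => false }]
}

Dir.mktmpdir do |dir|
  write_backup(source, dir)
  restored = read_backup(dir)
  restored == source # quotes and booleans survive without any SQL escaping
end
```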

16 comments (closed) Filed under: Code