SmarterCSV: Importing and Parallel Processing CSV files in Chunks as Arrays of Hashes in Ruby

July 12, 2012 by Tilo Sloboda

Ruby's CSV library has a fairly old API, and its processing of CSV files, returning Arrays of Arrays, feels 'very close to the metal'. The output is not easy to use, especially if you want to create database records from it. Another shortcoming is that Ruby's CSV library has no good support for huge CSV files: there is no support for 'chunking' and/or parallel processing of the CSV content (e.g. with Resque or Sidekiq).
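
For comparison, this is roughly what working with the standard library looks like (a minimal sketch; the file name and columns are just placeholders):

    require 'csv'

    rows = CSV.read('/tmp/some.csv')      # returns an Array of Arrays, header row included
    header = rows.shift                   # you have to peel off the header row yourself
    rows.each do |row|
      attributes = Hash[header.zip(row)]  # mapping values to column names is left to you
      # ... only now could you do something like MyModel.create(attributes)
    end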

As the existing CSV libraries didn't fit my needs, I wrote my own CSV processing, specifically for use with Ruby on Rails ORMs like Mongoid, MongoMapper, or ActiveRecord. With those ORMs you can easily pass a hash of attribute/value pairs to the create() method. The lower-level Mongo driver and Moped also accept arrays of such hashes, which lets you create a large number of records quickly with just one call.
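
For example (a minimal sketch; 'User' stands in for any Mongoid or ActiveRecord model):

    # creating a single record from a hash of attribute/value pairs:
    User.create( {:first_name => 'John', :last_name => 'Doe'} )

    # the low-level Mongo driver accepts an array of such hashes,
    # creating many records with a single call:
    User.collection.insert( [ {:first_name => 'John', :last_name => 'Doe'},
                              {:first_name => 'Jane', :last_name => 'Roe'} ] )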

My requirements were simple: get each row back as a hash of attribute/value pairs, be able to rename or drop columns, and be able to process the file in chunks for parallel processing.

To achieve this I created the Ruby Gem smarter_csv, which provides a method for smarter processing of CSV files:

Example to populate a MySQL or MongoDB Database with SmarterCSV


    require "smarter_csv"
    filename = '/tmp/some.csv'
    n = SmarterCSV.process(filename, {:key_mapping => {:unwanted_row => nil, :old_row_name => :new_name}}) do |array|
          # we're passing a block in to process each resulting hash / row (block takes array of hashes):
          MyModel.create( array.first )
    end

     => returns number of chunks we processed

Example to populate MongoDB Database in Chunks with SmarterCSV


    require "smarter_csv"
    filename = '/tmp/some.csv'
    n = SmarterCSV.process(filename, {:chunk_size => 100, :key_mapping => {:unwanted_row => nil, :old_row_name => :new_name}}) do |chunk|
          # we're passing a block in to process each resulting hash / row (block takes array of hashes):
          MyModel.collection.insert( chunk )   # using low-level Mongo driver to create up to 100 entries at a time
    end

     => returns number of chunks we processed

Example using Chunking and Resque


    require "smarter_csv"
    filename = '/tmp/strange_db_dump' # a file with CTRL-A as col_separator, and with CTRL-B\n as record_separator (hello iTunes)
    n = SmarterCSV.process(filename, {:col_sep => "\cA", :row_sep => "\cB\n", :comment_regexp => /^#/,
                                    :chunk_size => 5, :key_mapping => {:export_date => nil, :name => :genre}}) do |chunk|
          # we're passing a block in to process each resulting chunk (array of hashes):
          Resque.enqueue( MyResqueWorkerClass, chunk )  # pass chunks of CSV-data to Resque workers for parallel processing
    end

     => returns number of chunks we processed
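
On the receiving end, Resque hands each chunk to the worker's perform method. The worker below is only a sketch of what MyResqueWorkerClass might look like; its internals are an assumption, not part of the gem:

    class MyResqueWorkerClass
      @queue = :csv_import   # the Resque queue this worker listens on

      # Resque calls self.perform with the arguments given to Resque.enqueue;
      # note that Resque serializes arguments as JSON, so the hash keys
      # arrive as strings rather than symbols
      def self.perform(chunk)
        chunk.each do |attributes|
          MyModel.create( attributes )
        end
      end
    end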

Installation

SmarterCSV is now available as the Ruby Gem smarter_csv, and the source code is available on GitHub.

To install, run gem install smarter_csv (or add the gem to your Gemfile), then require 'smarter_csv' and call SmarterCSV.process().
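
For example (a minimal sketch following the examples above; the file path is a placeholder):

    # Gemfile
    gem 'smarter_csv'

    # in your code
    require 'smarter_csv'
    n = SmarterCSV.process('/tmp/some.csv') do |array|
      MyModel.create( array.first )   # without :chunk_size, each yielded array holds a single row as a hash
    end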

Original Gist

Below is the original Gist that is at the core of the 'smarter_csv' Gem, along with one of the Stackoverflow questions that prompted it.

I hope you'll find this useful :)

enjoy!

Related Stackoverflow Question

http://stackoverflow.com/questions/7788618/update-mongodb-with-array-from-csv-join-table/7788746#7788746