Shortening Field Names in MongoDB

Shortening Field Names in MongoDB

MongoDB and Field Names

10gen's MongoDB is the most promising noSQL data-store I've seen so far.

Looking at several customer testimonials and reports about MongoDB user experiences, and this MongoDB doc, there is one reappearing comment showing up:

For readability one would like to use verbose field names, but this can bloat your data collections quite a bit.. The problem is that if there is a collection with lots of documents, the field names in those documents will be mentioned over and over again throughout the collection.. If you use embedded documents, the size of the outer document will grow quite significantly, just by refering to those field names over and over again.

Whenever this is problem is mentioned, the suggested solution is to use short field names, e.g. 'vid' instead of 'venue_id', 'fn' instead of 'first_name' , etc.

This seems to be a good ad-hoc stop-gap measure, but this means that it's up to the developer to hard-code a mapping into their code -- sure, this will work in principle, but it is basically obfuscication (yikes!) and makes the code much harder to read -- this could lead to some serious bugs down the road due to typos, and it also means that the resulting data in MongoDB is much harder to read, which possibly will make debugging more painful.

Field Names, Tokens, Lexemes..

Under the covers MongoDB is using BSON -- a binary representation of JSON. Hmmm... binary...

Looking at the problem above, I was reminded of a Compiler Construction class I took a long time ago -- the lexical analyzer of a compiler uses a "symbol table" to keep track of all the tokens (strings) in a program, and to make sure to store them only once, and keep a very short reference id to refer to those tokens if they appear anywhere else throughout the program. Compilers do pretty much the same thing as we just asked the individual developer to do - but they do it transparently, consistently and behind the scenes.

I think it would be great if something like this would be embedded into the guts of MongoDB!

Sure, I could write some code which does this in my program, but the field names are typically symbols - this could get messy. Why not embed this into the MongoDB driver?

e.g.:
   Instead of:
        'lenghty_field_name_which_you_dont_want_to_repeat_often' => 'value'

   MongoDB would store:
        '3' => 'value'     # use the ID of the field name instead

   and a lookup hash persisted somewhere safe:
                     'value' = id2field['3']
                     '3' = field2id['value']

If MongoDB had an internal collection _FieldMapper which is not accessible to the users, but only to the driver, and if the drivers would be extended to do a little mapping magic, then the BSON representation could store a very short representation for each field name in each collection, and transparently map those field names back and forth -- so the user never sees the short hand representations and can use meaningful verbose field names, but the underlying data in MongoDB would use short IDs to refer to those field names.

Whenever a field name is encountered, the MongoDB driver would check the FieldMapper's symbol table if it is already known for the given collection, if yes, it would return/use the short-hand id of that field name, otherwise it would store it in the internal collection, and return/use it's short-hand id, which would then be used when storing the BSON document.

What needs to be stored/persisted are two hashes: "field2id" and "id2field" , and the value "next_available_id" These mappings should be cached in memory, and the storage should be very reliable (otherwise we could corrupt our data).

Base N

What could we use as short-hand representations for the field names? How about base N numbers..? ideally base 256, but for demonstration purposes let's use base62. Those numbers could be prefixed by a special character, or could be used just by themselves.

   e.g. let's assume that the "digits" base62 are: '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz'

   decimal 61 will be encoded as 'z' , decimal   62 (= 62**1) will be encoded  '10' , 
   decimal 6343 will be encoded 'zz' , decimal 3844 (= 62**2) will be encoded '100' , ...
I doubt that you'll have more than several thousand field names in your average collection -- which means that we can use arbitrary long field names at no expense, and can store them in the ultimately short form: a variable-length baseN number (a few bytes) which represent the index into the "symbol table" stored in the field mapper for the given collection.

As the underlying data is stored in binary form (BSON), we'd probably want to use base 256 numbers to encode the IDs -- this way we could encode 65535 field names in just two bytes...

Feasible?

This idea is still a bit rough around the edges, but in principle it should work.

Problem areas:

Tilo Sloboda, 2011-Jan-30

 

 

 

 

Appendix

I'm using MongoDB in Ruby with MongoID, and created two classes BaseN and FieldMapper to play around with this idea.

Usage would be like this:

	    > require 'base_n'
	    > require 'field_mapper'

	    > FieldMapper.field_to_sym('this')
	     => "0" 
	    > FieldMapper.all
	     => {"0"=>"this"} 
	    > FieldMapper.field_to_sym('this')
	     => "0"
	    > FieldMapper.all
	     => {"0"=>"this"} 

	    > FieldMapper.field_to_sym('works')
	    => "1" 
	    > FieldMapper.field_to_sym('nicely')
	     => "2" 
	    > FieldMapper.field_to_sym('for')
	     => "3" 
	    > FieldMapper.field_to_sym('disgustingly_long_field_names')
	     => "4" 
	    > FieldMapper.field_to_sym('disgustingly_long_field_names')
	     => "4" 
	    > FieldMapper.sym_to_field('4')
	     => "disgustingly_long_field_names" 

	    > FieldMapper.all
	     => {"0"=>"this", "1"=>"works", "2"=>"nicely", "3"=>"for", "4"=>"disgustingly_long_field_names"} 

	    > n = BaseN.new
	     => "0" 
	    > n = BaseN.new('z')
	     => "z"       # one base62 digit can encode 62 strings..
	    > n.incN!
	     => "10"      # = (1 * 62**1) + (0 * 62**0) = 62
	    > n = BaseN.new('zz')
	    => "zz"       # two base62 digits can encode 3844 strings..
	    > n.to_sym
	     => :zz       # we could use n's symbol representation as well

	    > n.incN!
	    => "100"      # (1 * 62**2) +  (0 * 62**1) + (0 * 62**0) = 3844
	    > n = BaseN.new('zzz')
	     => "zzz"     # three base62 digits can encode 238328 strings..
	    > n.incN!
	     => "1000"    # 62**3 = 238328
	    
	    > n.to_sym
	     => :1000     # we could use n's symbol representation as well

Class BaseN

base_n.rb

# (C) Copyright 2011 by Tilo Sloboda 
#
# License:
#         Freely available under the terms of the OpenSource "Artistic License"
# ==============================================================================
#
# BaseN is a class which uses strings to represent BaseN numbers -- in particular base62 numbers
# e.g. decimal 61 will be encoded as 'z' , decimal 62 will be encoded '10' ,
#      decimal 6343 will be encoded 'zz' , decimal 3844 will be encoded '100'
#
# We don't care much about arithmetic baseN here, but rather just about incrementing a baseN number,
# because we want to use those numbers as a very short way to enumerate field names in MongoDB
#
# For the purpose of enumerating MongoDB field names, it would be even better if we could use base256 numbers,
# e.g. variable-length byte strings, but those wouldn't be as easy to print for debugging
#
# Because we are extending class String, we can easily call 'to_sym' to create a symbol out of a base62 number
#
class BaseN < String
  # class-variables, so we store this only once..
  @@lexicon = '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz' # 62 symbols / base62 digits(!)
  @@successorH = {}  # hash with successors of each digit; will be initialized upon first call of new
  @@base = @@lexicon.size

  def initialize(value = '0')
    initialize_successorH
    self.replace value
    super
  end

  def self.base
    @@base
  end
  def base
    @@base
  end

  def incN!  # increments string base 62, by in-place editing self
    done = false ; x = -1
    while ! done do
      if self[x].nil?
        self.insert(0, '1') # create a carry '1' in the first position
        done = true
        next
      end

      s = @@successorH[ self[x] ]
      self[x] = s
      if s == '0'
        x -= 1
      else
        done = true
      end
    end
    self
  end

  private
  def initialize_successorH
    symbolA = @@lexicon.split(//)
    if @@successorH.size == 0
      i = 0
      while i < @@base do
        @@successorH[ symbolA[i] ] = symbolA[ (i+1) % @@base ]
        i += 1
      end
    end
  end
end

 

 

Class FieldMapper

field_mapper.rb
# (C) Copyright 2011 by Tilo Sloboda 
#
# License:
#         Freely available under the terms of the OpenSource "Artistic License"
# ==============================================================================
#
# the FieldMapper class defines a bi-directional lookup for arbitrary strings (field names) to base62-numbers 
# (stored in a string) and vice versa. Think of it as a "string shortener". 
# Similar mechanisms are used in Compilers in the symbol table of a lexical analyzer
#
# This class can come in handy to shorten field-names e.g. in MongoDB collections

class FieldMapper
  @@next_symbol = BaseN.new # defaults to '0'
  @@field2symH = {}
  @@sym2fieldH = {}

# we could persist this in MondoDB or somewhere else..
#  include Mongoid::Document
#  field :klass
#  field :field2sym, :type => Hash
#  field :sym2field, :type => Hash

  def self.field_to_sym(name)   # lookup or add field name
    @@field2symH[ name ] ||= self.add_field(name)
  end

  def self.sym_to_field(symbol) # lookup field name by symbol
    @@sym2fieldH[ symbol ]
  end

  def self.all # dump stored symbols / field names as Hash
    @@sym2fieldH
  end

  def self.print_symbols  # pretty print all symbols / field names
    @@sym2fieldH.each do |k,v|
      printf "%5s : %s\n", k, v
    end; nil
  end

  private
  def self.add_field(name)
    symbol = @@field2symH[ name ] = @@next_symbol.clone
    @@sym2fieldH[ symbol ] = name
    @@next_symbol.incN!
#    self.save!   # we'd need an instance of FieldName here to save properly..
    symbol
  end
end