10gen's MongoDB is the most promising noSQL data-store I've seen so far.
Looking at several customer testimonials and reports about MongoDB user experiences, and this MongoDB doc, there is one reappearing comment showing up:
Whenever this is problem is mentioned, the suggested solution is to use short field names, e.g. 'vid' instead of 'venue_id', 'fn' instead of 'first_name' , etc.
This seems to be a good ad-hoc stop-gap measure, but this means that it's up to the developer to hard-code a mapping into their code -- sure, this will work in principle, but it is basically obfuscication (yikes!) and makes the code much harder to read -- this could lead to some serious bugs down the road due to typos, and it also means that the resulting data in MongoDB is much harder to read, which possibly will make debugging more painful.
Under the covers MongoDB is using BSON -- a binary representation of JSON. Hmmm... binary...
Looking at the problem above, I was reminded of a Compiler Construction class I took a long time ago -- the lexical analyzer of a compiler uses a "symbol table" to keep track of all the tokens (strings) in a program, and to make sure to store them only once, and keep a very short reference id to refer to those tokens if they appear anywhere else throughout the program. Compilers do pretty much the same thing as we just asked the individual developer to do - but they do it transparently, consistently and behind the scenes.
I think it would be great if something like this would be embedded into the guts of MongoDB!
Sure, I could write some code which does this in my program, but the field names are typically symbols - this could get messy. Why not embed this into the MongoDB driver?
e.g.:
Instead of:
'lenghty_field_name_which_you_dont_want_to_repeat_often' => 'value'
MongoDB would store:
'3' => 'value' # use the ID of the field name instead
and a lookup hash persisted somewhere safe:
'value' = id2field['3']
'3' = field2id['value']
If MongoDB had an internal collection _FieldMapper which is not accessible to the users, but only to the driver, and if the drivers would be extended to do a little mapping magic, then the BSON representation could store a very short representation for each field name in each collection, and transparently map those field names back and forth -- so the user never sees the short hand representations and can use meaningful verbose field names, but the underlying data in MongoDB would use short IDs to refer to those field names.
Whenever a field name is encountered, the MongoDB driver would check the FieldMapper's symbol table if it is already known for the given collection, if yes, it would return/use the short-hand id of that field name, otherwise it would store it in the internal collection, and return/use it's short-hand id, which would then be used when storing the BSON document.
What needs to be stored/persisted are two hashes: "field2id" and "id2field" , and the value "next_available_id" These mappings should be cached in memory, and the storage should be very reliable (otherwise we could corrupt our data).
What could we use as short-hand representations for the field names? How about base N numbers..? ideally base 256, but for demonstration purposes let's use base62. Those numbers could be prefixed by a special character, or could be used just by themselves.
e.g. let's assume that the "digits" base62 are: '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz' decimal 61 will be encoded as 'z' , decimal 62 (= 62**1) will be encoded '10' , decimal 6343 will be encoded 'zz' , decimal 3844 (= 62**2) will be encoded '100' , ...I doubt that you'll have more than several thousand field names in your average collection -- which means that we can use arbitrary long field names at no expense, and can store them in the ultimately short form: a variable-length baseN number (a few bytes) which represent the index into the "symbol table" stored in the field mapper for the given collection.
As the underlying data is stored in binary form (BSON), we'd probably want to use base 256 numbers to encode the IDs -- this way we could encode 65535 field names in just two bytes...
This idea is still a bit rough around the edges, but in principle it should work.
Problem areas:
I'm using MongoDB in Ruby with MongoID, and created two classes BaseN and FieldMapper to play around with this idea.
Usage would be like this:
> require 'base_n'
> require 'field_mapper'
> FieldMapper.field_to_sym('this')
=> "0"
> FieldMapper.all
=> {"0"=>"this"}
> FieldMapper.field_to_sym('this')
=> "0"
> FieldMapper.all
=> {"0"=>"this"}
> FieldMapper.field_to_sym('works')
=> "1"
> FieldMapper.field_to_sym('nicely')
=> "2"
> FieldMapper.field_to_sym('for')
=> "3"
> FieldMapper.field_to_sym('disgustingly_long_field_names')
=> "4"
> FieldMapper.field_to_sym('disgustingly_long_field_names')
=> "4"
> FieldMapper.sym_to_field('4')
=> "disgustingly_long_field_names"
> FieldMapper.all
=> {"0"=>"this", "1"=>"works", "2"=>"nicely", "3"=>"for", "4"=>"disgustingly_long_field_names"}
> n = BaseN.new
=> "0"
> n = BaseN.new('z')
=> "z" # one base62 digit can encode 62 strings..
> n.incN!
=> "10" # = (1 * 62**1) + (0 * 62**0) = 62
> n = BaseN.new('zz')
=> "zz" # two base62 digits can encode 3844 strings..
> n.to_sym
=> :zz # we could use n's symbol representation as well
> n.incN!
=> "100" # (1 * 62**2) + (0 * 62**1) + (0 * 62**0) = 3844
> n = BaseN.new('zzz')
=> "zzz" # three base62 digits can encode 238328 strings..
> n.incN!
=> "1000" # 62**3 = 238328
> n.to_sym
=> :1000 # we could use n's symbol representation as well
# (C) Copyright 2011 by Tilo Sloboda
#
# License:
# Freely available under the terms of the OpenSource "Artistic License"
# ==============================================================================
#
# BaseN is a class which uses strings to represent BaseN numbers -- in particular base62 numbers
# e.g. decimal 61 will be encoded as 'z' , decimal 62 will be encoded '10' ,
# decimal 6343 will be encoded 'zz' , decimal 3844 will be encoded '100'
#
# We don't care much about arithmetic baseN here, but rather just about incrementing a baseN number,
# because we want to use those numbers as a very short way to enumerate field names in MongoDB
#
# For the purpose of enumerating MongoDB field names, it would be even better if we could use base256 numbers,
# e.g. variable-length byte strings, but those wouldn't be as easy to print for debugging
#
# Because we are extending class String, we can easily call 'to_sym' to create a symbol out of a base62 number
#
class BaseN < String
# class-variables, so we store this only once..
@@lexicon = '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz' # 62 symbols / base62 digits(!)
@@successorH = {} # hash with successors of each digit; will be initialized upon first call of new
@@base = @@lexicon.size
def initialize(value = '0')
initialize_successorH
self.replace value
super
end
def self.base
@@base
end
def base
@@base
end
def incN! # increments string base 62, by in-place editing self
done = false ; x = -1
while ! done do
if self[x].nil?
self.insert(0, '1') # create a carry '1' in the first position
done = true
next
end
s = @@successorH[ self[x] ]
self[x] = s
if s == '0'
x -= 1
else
done = true
end
end
self
end
private
def initialize_successorH
symbolA = @@lexicon.split(//)
if @@successorH.size == 0
i = 0
while i < @@base do
@@successorH[ symbolA[i] ] = symbolA[ (i+1) % @@base ]
i += 1
end
end
end
end
# (C) Copyright 2011 by Tilo Sloboda
#
# License:
# Freely available under the terms of the OpenSource "Artistic License"
# ==============================================================================
#
# the FieldMapper class defines a bi-directional lookup for arbitrary strings (field names) to base62-numbers
# (stored in a string) and vice versa. Think of it as a "string shortener".
# Similar mechanisms are used in Compilers in the symbol table of a lexical analyzer
#
# This class can come in handy to shorten field-names e.g. in MongoDB collections
class FieldMapper
@@next_symbol = BaseN.new # defaults to '0'
@@field2symH = {}
@@sym2fieldH = {}
# we could persist this in MondoDB or somewhere else..
# include Mongoid::Document
# field :klass
# field :field2sym, :type => Hash
# field :sym2field, :type => Hash
def self.field_to_sym(name) # lookup or add field name
@@field2symH[ name ] ||= self.add_field(name)
end
def self.sym_to_field(symbol) # lookup field name by symbol
@@sym2fieldH[ symbol ]
end
def self.all # dump stored symbols / field names as Hash
@@sym2fieldH
end
def self.print_symbols # pretty print all symbols / field names
@@sym2fieldH.each do |k,v|
printf "%5s : %s\n", k, v
end; nil
end
private
def self.add_field(name)
symbol = @@field2symH[ name ] = @@next_symbol.clone
@@sym2fieldH[ symbol ] = name
@@next_symbol.incN!
# self.save! # we'd need an instance of FieldName here to save properly..
symbol
end
end