SimpleDB Primer with Python and Boto


SimpleDB is Amazon's distributed, highly-available, super-awesome database system which runs on Amazon's cloud. It does not function or behave like a database in a traditional sense - it is more akin to a spreadsheet (Excel or OpenOffice Calc). It is, in fact, Schema-less. It is without Schema. You simply have Items which have key/value pairs. Items are grouped together into Domains, and an Item can have many Attributes (keys) and each Attribute can hold multiple Values. I know it's sort of confusing, so let's have an example. Consider the following data set:


We have 6 Items in a Domain called test–domain, and each Item has a name that is unique (this is the only requirement for Items in SimpleDB). We have defined four Attributes which will hold the data for these Items, however none of these Attributes are required. In fact, you cannot specify a required Attribute in SimpleDB (if you need this functionality it must be imposed on the application side).

Keep in mind, we have no Schema. This can be a difficult notion for database-savvy people to come to terms with, but it is a key part of what makes SimpleDB so great and so flexible.

Let's get on with the example. The following code creates a connection to SimpleDB and selects the test–domain Domain and inserts a few Items into it. Before you can do anything with SimpleDB, you need to have an AWS account and opt-in for access to SimpleDB. This takes only a moment, but I'm not waiting. Moving on.

# Make the connection
import boto
conn = boto.connect_sdb(aws_access_key_id,aws_secret_access_key)
domain = conn.get_domain('test-domain')
if not domain:
 domain = conn.create_domain('test-domain')
# Define our data set
data = {}
data['some-item']   = {'color':'blue','price':10,'size':'small'}
data['some-other-item'] = {'color':'red','price':15,'size':'medium'}
data['bicolored']    = {'color':['blue','green'],'price':15,'size':'small'}
data['tricolored']   = {'color':['red','blue','green'],'price':20,'size':'medium'}
data['nocolor']   = {'price':10,'size':'small'}
data['another-color'] = {'color':'purple','price':5,'size':'tiny',
  'comment':'This one is really small'}
# Insert the items
for name,d in data.items():
 item = domain.new_item(name)
 for k,v in d.items():
  item[k] = v
 item.save()

Now that we have some data to play with, let's run some queries on it. In Boto, we are given interfaces to the two methods for getting data out of SimpleDB: Select and Query. As of the release of the 2009-04-15 SimpleDB API, Query has been deprecated in favor of Select. We will be using Select for the examples in this tutorial. Right on - let's see a query.

for item in domain.select("SELECT * FROM `test-domain`"):
 print ">>",item.name,item
 
>> bicolored {u'color': [u'green', u'blue'], u'price': u'15', u'size': u'small'}
>> some-other-item {u'color': u'red', u'price': u'15', u'size': u'medium'}
>> nocolor {u'price': u'10', u'size': u'small'}
>> some-item {u'color': u'blue', u'price': u'10', u'size': u'small'}
>> another-color {u'color': u'purple', u'price': u'5', u'size': u'tiny', 
  u'comment':u'This one is really small'}
>> tricolored {u'color': [u'green', u'blue', u'red'], u'price': u'20', 
  u'size': u'medium'}

Notice the syntax of Select is SQL-like. It is not SQL per se, but it has been designed to be very similar. There are no JOINs, UNIONs, or RDBMS things like this. Since SimpleDB is not relational, there is no need to support relational type queries like JOINs. Although relational database behavior can be emulated with SimpleDB, it is not recommended. I believe Amazon has intentionally left this and other features to keep SimpleDB nice and fast and, well, simple.

Let's get down to some serious querying, shall we.

Predicate Statements

In SimpleDB, most of the work is done in the predicate statements. SimpleDB supports most common comparison operators (equals, less than, etc) as well as a few special ones. A few examples will demonstrate how this works better than words.

Single predicate
SELECT * FROM `test-domain` WHERE `size` = 'small'
 
>> bicolored {u'color': [u'green', u'blue'], u'price': u'15', u'size': u'small'}
>> nocolor {u'price': u'10', u'size': u'small'}
>> some-item {u'color': u'blue', u'price': u'10', u'size': u'small'}
Multiple predicates on one Attribute - this is like a union.
SELECT * FROM `test-domain` WHERE `size` = 'small' OR `size`='tiny'
 
>> bicolored {u'color': [u'green', u'blue'], u'price': u'15', u'size': u'small'}
>> nocolor {u'price': u'10', u'size': u'small'}
>> some-item {u'color': u'blue', u'price': u'10', u'size': u'small'}
>> another-color {u'color': u'purple', u'price': u'5', u'size': u'tiny', 
  u'comment':u'This one is really small'}
Multiple predicates on two Attributes - this is like an intersection
SELECT * FROM `test-domain` WHERE `size` = 'small' AND `color`='blue'
 
>> bicolored {u'color': [u'green', u'blue'], u'price': u'15', u'size': u'small'}
>> some-item {u'color': u'blue', u'price': u'10', u'size': u'small'}
Single multi-valued predicate
SELECT * FROM `test-domain` WHERE `color` IN ('red','blue')
 
>> bicolored {u'color': [u'green', u'blue'], u'price': u'15', u'size': u'small'}
>> some-item {u'color': u'blue', u'price': u'10', u'size': u'small'}
>> tricolored {u'color': [u'green', u'blue', u'red'], u'price': u'20', 
  u'size': u'medium'}
>> some-other-item {u'color': u'red', u'price': u'15', u'size': u'medium'}
Single predicate on a multi-valued Attribute using every() 
SELECT * FROM `test-domain` WHERE every(`color`) = 'red'
 
>> some-other-item {u'color': u'red', u'price': u'15', u'size': u'medium'}
Single multi-valued predicate on a multi-valued Attribute using every()
SELECT * FROM `test-domain` WHERE every(`color`) IN ('red','blue')"
 
>> some-item {u'color': u'blue', u'price': u'10', u'size': u'small'}
>> some-other-item {u'color': u'red', u'price': u'15', u'size': u'medium'}

Sorting

Sorting can be a little strange. Let's look at our data set sorted by price in descending order.
SELECT * FROM `test-domain` WHERE `price` != '' ORDER BY `price` DESC
 
>> another-color {u'color': u'purple', u'price': u'5', u'size': u'tiny', 
  u'comment':u'This one is really small'}
>> tricolored {u'color': [u'green', u'blue', u'red'], u'price': u'20', 
  u'size': u'medium'}
>> some-other-item {u'color': u'red', u'price': u'15', u'size': u'medium'}
>> bicolored {u'color': [u'green', u'blue'], u'price': u'15', u'size': u'small'}
>> some-item {u'color': u'blue', u'price': u'10', u'size': u'small'}
>> nocolor {u'price': u'10', u'size': u'small'}

Wait, wait. That's not right. Last time I checked, five was certainly less than 20. This is unexpected, but completely correct. SimpleDB does all sorting lexicographically becaue (pay attention): everything in SimpleDB is a string. No numbers, no timestamps, none of that mess - only strings (and Unicode strings at that). It's hard to get used to, but not having data types is truly wonderful. I won't cover all of the details about numeric data in SimpleDB, but I will say that you need to zero-pad the strings if you want to sort them numerically. In other words, '5' should be '05' or '005' depending on how many digits you need.

Aside

A super easy way to zero-pad your strings in python is to use string formating. The only catch is you need to know roughly how large the numbers will get. For example, if you're doing dollar amounts for a store that sells ballons, 4 or 5 digits should be sufficient.

>>> x = 10
>>> "d" % x #zero-pad a 5 digit integer
'00010'
>>> ".2f" % x #zero-pad an 8 digit float with precision 2
'00010.00'

There's another easy mistake when sorting in SimpleDB. Whichever Attribute you are sorting by must be included in at least one predicate statement. This ensures the returned set of Items has the Attribute required for sorting. If an Item does not have the required Attribute (price in our example) then it is not included in the returned set. There are no explicit NULLs in SimpleDB - if an Item does not have a certain Attribute, that Attribute IS NULL.

Counting

SELECT count(*) FROM `test-domain` WHERE `color` IN ('blue','red')
 
>> Domain {u'Count': u'5'}
 

Limiting

Limiting your queries is a little strange. Basically, if you add a LIMIT clause to your query, that just tells SimpleDB how many results per page you want. This discussion really goes beyond the scope of this tutorial, but I'll just say you need to limit results on the application side. This is the only real problem with SimpleDB I've found so far. Watch this thread on the boto-user's Google Group page for a more detailed discussion and a possible workaround.

Highlights

  • SimpleDB uses a SQL-like Select language
  • Domains have Items, Items have Attributes, Attributes have Values (one or many)
  • Everything is a string, so remember to zero-pad your numbers
  • To be safe with your syntax, use backticks (``) to escape Domain names and Attribute names and use single-quotes ('') for values
  • If you sort, the sort Attribute must appear in at least one predicate statement
  • Implicit NULLs, no explicit NULL values
  • Limiting is kinda borked

Resources

Amazon SimpleDB Documentation
Boto - Python module for Amazon Web Services
david on 06/12/2009 | 0 Comments

New Comment


Commenting is not available in this weblog entry.

Categories

Calendar

July 2010
S
M
T
W
T
F
S
27
28
29
30
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
Copyright ©2009. Loot whatever you like as long you're not a Rogue.