Simple Database Challenge

9/13/2015

the challenge

Earlier last week, I was asked to implement an in-memory database similar to Redis (name-value pairs). This post is about code style, algorithmic performance, and thought process. For simplicity's sake, my program to the challenge will receive commands via stdin, and should write appropriate responses to stdout.

For a full description of the challenge and problem description, simply google "simple database challenge" or go to http://www.thumbtack.com/challenges/simple-database.

thought process & data structure

I'll be using HashMap to meet the performance requirement of O(log N) or better.

To support transaction commands - BEGIN, ROLLBACK, COMMIT - I chose to use a linked list to hold incremental state changes over time within each transaction scope. When a new transaction scope starts (BEGIN command), a block is created and added to the end of the linked list. Each block will only hold incremental state changes pertaining to the transaction scope the command is issued in.

The changes within a transaction scope is always held in its own block, making ROLLBACK simple and straightforward. On a rollback, the last block can simply be abandoned and eliminated from the end of the linked list.

The 'state changes' recorded in each block are simply in the form of name-value pairs. Only name-value that changes are recorded in the latest block of the linked list. When a name is set to a value within a transaction scope, the new name-value pair is noted in the latest block of the linked list. When the same name is set to another value in that transaction scope, the name-value in the latest block of the linked list is then updated accordingly. The name-value setting approach applies to both SET and UNSET commands. UNSET is treated as if a SET command setting a name to null value.

To retrieve a value of a given name for GET command, I traverse backward through the blocks in the linked list. Starting from the last block, if the name is found in the block, the value is the answer. If the name is not present in the block, continue to traverse backward to the previous block for the value of the name. Repeat the traversal until a value is found for the given name, or until depleting the block traversal.

To support NUMEQUALTO command efficiently, I maintain a value-counter pairs for each block similar to the way I trace the 'state change' of the name-value pairs. The value-counter pairs is a bit more complicated to maintain because when a name is set to a different value in a transaction block, it may actually involves changing multiple value-counter pairs in that transaction block. For example, when we 'SET a 10' followed by 'SET a 20', the 'SET a 20' actually involves changing value-counter pairs for both value 10 and 20. The NUMEQUALTO counter returned do not need to be tracked for names with null value, since 'NUMEQUALTO null' is not a valid syntax.

To retrieve a counter of a given value for NUMEQUALTO command, the same approach works. That is to traverse backward through the blocks in the linked list. Start from the last block, if the value is found in the block, the counter is the answer. If the value is not present in the block, continue to traverse backward to the previous block for the counter of the value. Repeat the traversal until a counter is found for the given value, or until depleting the block traversal.

When maintaining the name-value pairs and value-counter pairs, I watched out for the edge conditions of (0, null, etc.) to avoid subtle bugs from happening.

Lastly, the COMMIT command will close all open transaction blocks simply by flattening all blocks in the linked list and condense them all into one single block.

source code

You'll find two classes - Database and TransactionBlock - in the source code.

Database class - A DB contains a sequence (linked list) of uncommitted Transaction Blocks. Each block is a TransactionBlock object. These blocks (TransactionBlock objects) are formed in lined list. Extra block is attached to the end of the linked list when a transaction command BEGIN is issued. The last block of the linked list is abandoned when a transaction command ROLLBACK is issued.

TransactionBlock class - A transaction block conceptually includes (points to) all past uncommitted transactions accessible through traversing to the previous transaction block.

execute the solution written in java

To download my solution implemented in Java 1.7:

download simpledb.zip attached to end of this post
expand simpledb.zip into SimpleDB folder

simpledb.zip
File Size:	9 kb
File Type:	zip

Download File

To run the solution in interactive mode:

cd into SimpleDB folder
cd into bin folder under SimpleDB folder
java SimpleDB/Database

To run the only test case come with the original problem:

cd into SimpleDB folder
cd into bin folder under SimpleDB folder
java SimpleDB/Database < test_input/adv1.txt

discussion

One question I have is toward the UNSET command. The challenge is not clear about whether the UNSET command means the variable is never set, or never set in the context of the existing transaction scope. In other words, whether the UNSET can be rollback or not is not clear to me. To avoid confusion, it is best to apply test cases through which the spec may be well defined.

Do you have a complete different thought process? Are you aware of things to improve looking over my code style? Please leave your reply below and I'll get back to you. Thanks for reading!

7 Comments

Sachin

9/5/2016 07:15:36 pm

Runtime limitation given for this is O(Log(N)) where N is total number of variables stored in database. But for get operation, your description indicates, you'll travel till first block. so Complexity would be O(T) where T is number of transaction and T >> N

Hsufeng Ko link

9/5/2016 08:38:56 pm

You're right, Sachin.

To fix and meed the performance requirement for GET operation, simply make the name_value HashMap maintained by each transaction scope not only contains the updates for those name-value pairs happens in the current transaction scope, but also carry over the name_value HashMap of the previous transaction scope.

By trading memory for speed, as a result, a GET lookup no longer need to traverse back to the previous block, and the GET lookup will be O(Log(N)) where N is the total number of variables stored in the database.

Sachin Doiphode

9/6/2016 07:42:14 am

Yes, I did the same.
But yes need to trade memory also one thing. I'm copying previous transaction blocks map's to new block. doesn't copying map at each begin trasaction increase its complexity to O(N) ? as we I'm copying the variables.

Hsufeng Ko link

9/6/2016 03:46:06 pm

How about finding a way to speed up HashMap copy? For example,

1. Fast copy hashmap by way of serialization. (http://stackoverflow.com/questions/998376/clone-utility-for-hashmap-in-java). Or,

2. Apply some sort of copy-on-write HashMap. (http://concurrently.blogspot.com/2007/07/implementing-copy-on-write-hashmap-in.html)

2/11/2018 05:03:45 pm

Hi,
Your current code shows that the name_value pair in EACH block contains all the variables that reside in the database. because you copy the table from the previous block and so it accumulates all the keys that are inserted in the course of time. So, when does GET require to traverse back? I mean, if the key is not part of the name_value in the current block so FOR SURE it is not available in the previous blocks. Am I missing sth?

Tony

9/29/2016 06:08:42 pm

In unset, you are setting to null. But what if null is a valid string value ?

Hsufeng Ko

11/9/2016 04:52:55 pm

A command like "UNSET a" would unset variable 'a'. To implement this, the code simply nullify variable 'a'. null is a special literal in Java to signify uninitialized state, not to be confused with a string with value "null".