Tuesday, July 12, 2011

fusefs-cloudstorage: accessing cloud storage using libcloud and fusefs

A few weeks after cloud storage support appeared in libcloud, I started playing with it by creating a fusefs filesystem based on it. The first version is finally almost ready, so it's time to write a few words about it.

For example, suppose you have a Cloud Files account and want to mount it as if it were a local filesystem. You do it this way:

novel@ritual:~/code/fusefs-cloudstorage/test %> cloudstorage.py -o driver=CLOUDFILES_US -o access_id=foo -o secret=acbd18db4cc2f85cedef654fccc4a4d8 ./test

where foo is your access id and the hash is your secret key. The filesystem will be mounted on the test subdirectory of the current directory.

The directories at the top level of this filesystem correspond to the containers in your account and look like this:

novel@ritual:~/code/fusefs-cloudstorage/test %> ls -1
backups
cloudservers
event_logs
mir_package
novel@ritual:~/code/fusefs-cloudstorage/test %>

Listing each of these directories will (as you might have guessed already) show the objects that belong to the corresponding container. For example:

novel@ritual:~/code/fusefs-cloudstorage/test %> ls -1 backups
1345b96a-ae25-11e0-94a9-40401629a6e1.sql.gz
135dbf74-ae25-11e0-94a9-40401629a6e1.sql.gz
27ed5e4e-7faf-11e0-bbcc-40401629a6e1.sql.gz
4fbb5e06-ae14-11e0-aace-40401629a6e1.sql.gz
4fd6274a-ae14-11e0-aace-40401629a6e1.sql.gz
645e7cde-7f9e-11e0-8a43-40401629a6e1.sql.gz
b181ef6c-ae1c-11e0-a6b9-40401629a6e1.sql.gz
b19dcbba-ae1c-11e0-a6b9-40401629a6e1.sql.gz
c626402a-7fa6-11e0-bb13-40401629a6e1.sql.gz
f460c934-9b2f-11e0-89ce-40401629a6e1.sql.gz
novel@ritual:~/code/fusefs-cloudstorage/test %>
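The layout above boils down to a simple mapping from a path to a container and an object name. Here is a minimal sketch of that mapping; the function name split_path is made up for illustration and is not necessarily how cloudstorage.py does it:

```python
from typing import Optional, Tuple


def split_path(path: str) -> Tuple[Optional[str], Optional[str]]:
    """Map a filesystem path to a (container, object_name) pair.

    Hypothetical helper, not taken from cloudstorage.py.
      "/"                  -> (None, None)              the list of containers
      "/backups"           -> ("backups", None)         a container
      "/backups/a.sql.gz"  -> ("backups", "a.sql.gz")   an object
    """
    # Drop empty components so leading, trailing and doubled slashes are ignored
    parts = [p for p in path.split("/") if p]
    if not parts:
        return None, None
    if len(parts) == 1:
        return parts[0], None
    # Assumption: anything below the container level is part of the object name
    return parts[0], "/".join(parts[1:])


if __name__ == "__main__":
    for p in ("/", "/backups", "/backups/a.sql.gz"):
        print(p, "->", split_path(p))
```

Every filesystem callback (getattr, readdir, open and so on) would start with a lookup like this to decide whether it is dealing with the root, a container or an object.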

You can create containers with the mkdir command, copy files in with cp, remove them with rm and so on. For more examples, check the shell script that tests the basic features: https://github.com/novel/fusefs-cloudstorage/blob/master/test.sh.

However, it turns out that implementing such a filesystem is neither easy nor quick.

Why is it hard to make this kind of filesystem fast?


My implementation is pretty dumb -- it does things in a straightforward way, which doesn't work well here because filesystem usage patterns don't match cloud storage usage patterns.

A couple of examples: cloud storage doesn't support reads or writes at an offset, so if you want to append a line to a text file, you have to fetch the whole object, remove the old one and create a new one with the new content. This gets especially slow when modifying large files.
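To make the cost of the append case concrete, here is a small sketch. InMemoryStore is a made-up stand-in for a container that only supports whole-object download, upload and delete; it is not libcloud's API, and the byte counter exists only to show how much data moves over the wire:

```python
class InMemoryStore:
    """Hypothetical stand-in for a cloud container (whole-object operations only)."""

    def __init__(self) -> None:
        self._objects = {}
        self.bytes_transferred = 0  # counts both downloaded and uploaded bytes

    def download(self, name: str) -> bytes:
        data = self._objects[name]
        self.bytes_transferred += len(data)
        return data

    def upload(self, name: str, data: bytes) -> None:
        self._objects[name] = data
        self.bytes_transferred += len(data)

    def delete(self, name: str) -> None:
        del self._objects[name]


def append(store: InMemoryStore, name: str, extra: bytes) -> None:
    """Append by rewriting the whole object, since there is no write-at-offset."""
    old = store.download(name)        # fetch the entire existing object
    store.delete(name)                # remove the old version
    store.upload(name, old + extra)   # upload the entire new version


if __name__ == "__main__":
    store = InMemoryStore()
    store.upload("log.txt", b"x" * 1000)
    store.bytes_transferred = 0       # ignore the initial upload
    append(store, "log.txt", b"line\n")
    print(store.bytes_transferred)    # 1000 down + 1005 up = 2005
```

Appending 5 bytes to a 1000-byte object moves 2005 bytes, so the cost grows with the size of the file rather than with the size of the change.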

Another thing is that VFS design generally seems to assume that it's cheap to obtain meta-information about files, but with cloud storage every simple operation, like checking whether a file exists, involves issuing an API call, which is slow.

It looks like we could speed things up dramatically by introducing caching for both metadata and file contents. But the hard part about caching is that we don't have exclusive access to the API. That is, we can cache things in our filesystem driver, but at the same time some user might open the web dashboard and remove or upload files, and we will never know about it unless we re-request all the data.
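One common compromise for this situation is a cache with a time-to-live: entries are trusted for a few seconds and then fetched again, which bounds how stale the view can get without hitting the API on every call. The sketch below is only an illustration of that idea, not something fusefs-cloudstorage implements; the TTLCache class, the fetch callback and the 30 second default are all assumptions:

```python
import time
from typing import Any, Callable, Dict, Hashable, Tuple


class TTLCache:
    """Cache the results of fetch(key) for at most ttl seconds (hypothetical helper)."""

    def __init__(
        self,
        fetch: Callable[[Hashable], Any],
        ttl: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch      # the slow call, e.g. an API request for metadata
        self._ttl = ttl
        self._clock = clock      # injectable so the cache can be tested without sleeping
        self._entries: Dict[Hashable, Tuple[float, Any]] = {}

    def get(self, key: Hashable) -> Any:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]      # still fresh: no API call
        value = self._fetch(key) # missing or expired: ask the API again
        self._entries[key] = (now, value)
        return value

    def invalidate(self, key: Hashable) -> None:
        """Forget a key, e.g. after our own write or delete of that object."""
        self._entries.pop(key, None)


if __name__ == "__main__":
    def slow_stat(path):
        print("API call for", path)
        return {"size": 42}

    cache = TTLCache(slow_stat, ttl=30.0)
    cache.get("/backups/a.sql.gz")  # triggers the API call
    cache.get("/backups/a.sql.gz")  # served from the cache
```

Changes made through the web dashboard would still go unnoticed until the entry expires, so the TTL is a trade-off between speed and freshness rather than a fix for the underlying problem.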

At this point I consider this project more of a "just for fun" thing than something really useful, but I will probably implement caching to make it reasonably fast if I decide it's worth it.
