tck.io

Enabling browser uploads to S3

One of the coolest parts about Amazon's S3 service is that it can handle user uploads from the browser, doing checks on the data to help curb abuse, even though it's a "static" service.

However, configuring this service is a total pain — there is a lot of confusing documentation from Amazon, a billion questions on Stack Overflow and a few blog posts.

There are some security concerns to be worried about, but Amazon does a pretty good job of giving you the ability to control the source, size, content-type, etc. of the upload.

The first step is to configure your bucket policy to handle uploads from unknown sources. We are going to restrict it to a specified referer. (We know that can be spoofed, but like, let's make them work a little harder, right?) Awesomely, the policy is JSON, so we can actually just make an object literal and then stringify it:

    var bucketPolicy = {
      Version: '2012-10-17',
      Id: 'upload-from-s3-hosted-site',
      Statement: [{
        Sid: 'allow-put-object',
        Effect: 'Allow',
        Principal: '*',
        Action: ['s3:PutObject', 's3:PutObjectAcl'],
        Resource: 'arn:aws:s3:::<bucketname>/*',
        Condition: {
          StringLike: {
            'aws:Referer': [
              '<awshost url>/*'
            ]
          }
        }
      }]
    };
    

We then add this policy to our bucket, either via the UI or the AWS SDK. First, though, we need to wrap the stringified policy in a params object.

    var AWS = require('aws-sdk');

    var s3 = new AWS.S3();
    var params = {
      Bucket: '<bucketname>',
      Policy: JSON.stringify(bucketPolicy)
    };
    s3.putBucketPolicy(params, function (err) {
      if (err) console.log('ooops', err);
    });
    
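If you want to sanity-check that the policy actually landed, the SDK can also read it back with getBucketPolicy (the Policy field in the response is a JSON string):

    // Optional: read the policy back to confirm it was applied
    s3.getBucketPolicy({ Bucket: '<bucketname>' }, function (err, data) {
      if (err) return console.log('could not read policy', err);
      console.log(JSON.parse(data.Policy));
    });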

So far, so good.

The next step is to generate an upload policy that gets sent with every request; it tells S3 how to decide whether a given upload should be accepted.

The policy looks similar to the one above, but this time we're spelling out what fields are allowed to be sent, the maximum filesize it should accept, and what acl it should apply to the file. Once again, this is JSON, so we'll make an object and stringify it.

    var policy = {
      expiration: '2015-11-18T12:00:00.00Z',
      conditions: [
        { bucket: '<bucketname>' },
        { acl: 'public-read' },
        ['content-length-range', 0, 2 * 1024 * 1024],
        ['starts-with', '$Content-Type', ''],
        ['starts-with', '$key', '']
      ]
    };
    

This says the policy is valid until Nov 18, 2015, noon UTC, that the filesize must be between 0 and 2MB, that the file will be readable by the public, and that the request will include both a Content-Type field and a key field. (key is what the filename will be on S3.)

We need this in base64 for both the request and for signing it, so we'll stringify it, throw it into a buffer, and output the base64-encoded version. (In the browser you can just use window.btoa.)

    var base64policy = new Buffer(JSON.stringify(policy)).toString('base64');
    
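For reference, the browser equivalent is a one-liner (btoa takes a string, so the policy still has to be stringified first):

    // In the browser: btoa() base64-encodes a string
    var base64policy = window.btoa(JSON.stringify(policy));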

But! How do we know someone hasn't just changed the policy in their browser and sent a malicious version which lets them do much more? We have to sign it with our secret key and send the signature as well. Amazon then takes your policy, encodes it with your key, and compares its computed signature with your provided signature. If they don't match you get an HTTP 403 FORBIDDEN error.

This, though, is where it all goes to shit. There are two different types of signature: Version 2 and Version 4.

Version 2 is much older (and thankfully easier) to work with. Version 4, however, is probably more secure. After spending several hours on it, though, I couldn't get V4 working correctly, so I'll explain V2.

With V2 we have to create an HMAC-SHA1 digest and then convert that to base64. The key is our secret access key and the content is the base64-encoded version of the policy.

    var crypto = require('crypto');
    var signature = crypto.createHmac('sha1', '<secret key>')
      .update(base64policy)
      .digest('base64');
    

So now we have all the pieces to our puzzle, except the form.

We can build that programmatically thanks to window.FormData in the browser, and I'll send it with jQuery since, like, man, window.XMLHttpRequest is obnoxious.

    var form = new FormData();
    form.append('AWSAccessKeyId', '<access key>');
    form.append('policy', base64policy);
    form.append('signature', signature);
    form.append('Content-Type', '<mime-type of file>');
    form.append('key', '<filename>');
    form.append('acl', 'public-read');
    form.append('file', <filedata>); // S3 ignores fields after the file, so it goes last
    $.ajax({
      type: 'POST',
      url: '<aws host>/<aws bucket>',
      data: form,
      processData: false, // hand the FormData to XHR untouched
      contentType: false, // let the browser set the multipart boundary
      success: function (resp) {
        console.log(resp);
      }
    });
    
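One caveat worth spelling out: the secret key can't ship to the browser, so in practice the base64 policy and signature get generated server-side and handed to the client. Here's a minimal sketch of what that might look like with Express (Express, the /s3-credentials route, and the environment variables are my assumptions, not part of the setup above):

    var express = require('express');
    var crypto = require('crypto');

    var app = express();

    // Hypothetical endpoint: hands the browser everything it needs except the secret key
    app.get('/s3-credentials', function (req, res) {
      var policy = {
        expiration: new Date(Date.now() + 10 * 60 * 1000).toISOString(), // valid for ten minutes
        conditions: [
          { bucket: '<bucketname>' },
          { acl: 'public-read' },
          ['content-length-range', 0, 2 * 1024 * 1024],
          ['starts-with', '$Content-Type', ''],
          ['starts-with', '$key', '']
        ]
      };
      var base64policy = new Buffer(JSON.stringify(policy)).toString('base64');
      var signature = crypto.createHmac('sha1', process.env.AWS_SECRET_KEY)
        .update(base64policy)
        .digest('base64');
      res.json({
        accessKey: process.env.AWS_ACCESS_KEY,
        policy: base64policy,
        signature: signature
      });
    });

    app.listen(3000);

The browser then fills in AWSAccessKeyId, policy, and signature from that response instead of having them baked into the page.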

And there you have it!

Stop! Hammertime!

The app I work on these days is a collaborative editing program for writing television scripts, originally based on etherpad-lite.

Because it's all about "real-time" communications, it uses socket.io for coordinating user changes between multiple editors of the same script, amongst other data. This is all well and good until one day you're trying to do some load testing because in a fit of pique, you got very angry at the database code and replaced it wholesale.

(No, seriously. It uses MySQL by default, but as a key/value store, and does some crazy buffering/caching on top of it that occasionally has memory-leak and performance issues.)

So you look for a good websocket load tester, and you can't really find one that suits your needs since your websockets need authentication. What do you do?

You write one.

But you actually write a generic harness for testing anything with websockets, and have the user provide a generator file which lets them control the messages emitted, what happens when a message is received, and whether the user needs to authenticate (and how to tell socket.io that you're authorized, etc.).

You can get it from npm:

npm i -g hammer-time

And check it out on github: https://github.com/scriptollc/hammer-time.

Using Prism with Ghost

This morning I dove into Ghost for the first time, changing my blog over from an old solution which used Python and S3.

But my syntax highlighting wasn't working anymore (and frankly, it was time to switch to a new highlighter anyway...)

Not wanting to dive into Ghost internals (especially how it converts Markdown to HTML), I hacked together a quick jQuery solution to tell Prism what language a given code block is in.

First, before the block of code, I insert a <span> tag (since, like, remember HTML is valid Markdown!) that looks like:

<span class="lang language-javascript"></span>

Then at the bottom of the default.hbs file, I have this little snippet. It's dumb, but it works.

    $('pre').each(function () {
      var $pre = $(this);
      // the language hint <span> lives in the element just before the <pre>
      var span = $pre.prev().find('span.lang')[0];
      var lang = '';
      if (span) {
        // pull the language-xxxx class off the hint span
        lang = Array.prototype.slice.call(span.classList).filter(function (x) {
          return /language/.test(x);
        })[0];
      }
      $pre.addClass(lang);
    });
    Prism.highlightAll();
    

Dependency Management in Go

Recently I've been playing around in Go, writing some small programs to parse data and shuffle things between data storage backends as part of our aim to minimize our dependencies internally.

After experiencing some odd locking issues with LevelUP, I decided to try writing the migration tool in Go as a lark.

My initial impressions of Go are pretty good. It feels easy to write without the cognitive overhead of C or C++ (if I never have to write malloc or free again I will probably die happy), but its dependency management is downright horrible.

On the surface it seems really easy -- you want to use the levigo package to connect to LevelDB? Type go get github.com/jmhodges/levigo and you can then import it into your project. No downloading anything, no dealing with byzantine build processes or autoconf, make, cmake, etc.

Sounds good right?

Except that it always grabs the default branch in git when you do that. So you're always getting whatever version has been checked into that branch. This doesn't pose an immediate problem, but let's fast-forward two years.

You've just been hired at a company and you need to update a simple Go program. You check out the code from your company's repo and do go build. Only the levigo package is no longer available on GitHub. Or it has an entirely new API and now returns type levigo.LevelDB from opening a database. Now you've got a ton of build errors.

The official FAQ maintains that:

If you're using an externally supplied package and worry that it might change in unexpected ways, the simplest solution is to copy it to your local repository. (This is the approach Google takes internally.) Store the copy under a new import path that identifies it as a local copy. For example, you might copy "original.com/pkg" to "you.com/external/original.com/pkg".

Ugh.

There are a number of problems with this philosophy, mainly a lack of upstream updates: once I fork a library and make a copy for my own use, I am no longer able to take advantage of backwards-compatible bug fixes and security patches automatically. I have to watch the original repo, pull from upstream into my copy, and then republish my local copy. Imagine the maintenance nightmare when you're building a large app that might depend on tens of external libraries.

The oddest part is that they also state:

"Go get" does not have any explicit concept of package versions. Versioning is a source of significant complexity, especially in large code bases, and we are unaware of any approach that works well at scale in a large enough variety of situations to be appropriate to force on all Go users.

Funnily enough, simply letting me refer to a commit hash, tag, or other pointer to a specific point in the repository's history would probably suffice for the majority of use-cases out there. Not specifying one would give you the current behavior, while specifying one would give you at least some assurance that your package will continue to build well into the future.

(Yes, I am aware that people can do dumb things with Mercurial or git repos that cause tags to go away or remove commits from the logs, etc., but those require explicit actions beyond just publishing new code to your repo.)

JSXHint 0.3.0 released

A quick note: v0.3.0 of JSXHint is now published and available via NPM.

npm install -g jsxhint
    

Why the big deal? Well, there are finally tests! And Travis integration to show that the tests pass.