
Over the last couple of months I've been dabbling with a home-spun API for the Madison Metro transit system. It is based on the work that I did building SMSMyBus, the mobile app that lets you find bus arrival estimates in real time using Twilio's SMS API. When I built that app, I simply scraped data from the Metro's website. 

But when I was finished, it immediately became clear that there were lots of other applications that needed to be built using this (and other) transit data. A good example is the status monitor on display at the Mother Fool's coffee house. So I set out to build an API that would enable anyone to build new transit apps without having to implement the ugly screen scraping techniques. After documenting a draft of the API, I set off trying to implement the server.

Just as I did with SMSMyBus, I built the API server on Google App Engine. After grinding through the pains of scraping poorly formed data on the Metro site, I immediately ran into performance problems and started blowing through my quota. And I did this without even turning on all of the routes in the system. I was forced to actually study the GAE APIs in more detail.

This post is intended to share the experience of tuning the performance, how I measured bottlenecks, and what I did to fix them.

Problem definition

The heart of the API is the continual consumption of a fire hose of data at Madison Metro. This is accomplished using a list of cron jobs that scrape location information for every route in the city.

The prefetch handler parses a text file for an individual route that is read from a URL. Each entry is categorized by stop ID or vehicle ID and the arrival estimate and status models are created accordingly. In the case of the longer routes, this could be as many as 480 status entries.

Step one : Identifying the bottlenecks with the Quota API

The first version of the prefetch routine focused exclusively on just getting my hands on the useful information. I didn't pay any attention to the use of datastore calls. I just wanted to get the regular expressions right, create new model entities and shove them in the datastore.

This approach worked just fine on the local dev server, but I was consistently hitting DeadlineExceeded exceptions in the production environment. I needed to identify which elements of the prefetcher were expensive. Enter the Quota API...

    import logging

    from google.appengine.api import quota

    start = quota.get_request_cpu_usage()
    # do all the magic
    end = quota.get_request_cpu_usage()
    # get_request_cpu_usage() reports the megacycles used so far by this request
    logging.warning("The magic took %s megacycles" % (end - start))

Assuming GAE was behaving correctly, I started with my own code, the parsing of the fire hose. That did not prove fruitful, so I started to add up all the time I was spending in API calls. Although individual calls were small, they quickly added up over hundreds of calls. It became obvious that this type of serial access to the datastore was a contributing factor.
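
The start/end probe above can be wrapped in a small helper so it is easy to move around while hunting for the expensive part. This is only a sketch: `measure`, `read_usage` and `FakeMeter` are hypothetical names, not part of the prefetcher or the SDK. On App Engine you would pass `quota.get_request_cpu_usage` as `read_usage`; the stand-in meter is there so the snippet runs anywhere.

```python
import contextlib
import logging


@contextlib.contextmanager
def measure(label, read_usage, results=None):
    """Log how much the reading from read_usage() grew inside the block.

    read_usage is any zero-argument callable returning a running total;
    on App Engine that would be quota.get_request_cpu_usage.
    """
    start = read_usage()
    try:
        yield
    finally:
        used = read_usage() - start
        logging.warning("%s took %s megacycles", label, used)
        if results is not None:
            results[label] = used


# Stand-in meter (an assumption for local runs): every reading adds 5.
class FakeMeter(object):
    def __init__(self):
        self.total = 0

    def __call__(self):
        self.total += 5
        return self.total


totals = {}
with measure("parse", FakeMeter(), totals):
    pass  # do all the magic
```

Collecting the readings in a dict makes it simple to compare parsing against datastore time at the end of a request.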

Step two : Batch puts

Previously, the apps I had been building did not require bulk datastore updates, so I glossed over that section of the API. It's easy to overlook, but the db.put() function accepts a list as an argument. So rather than storing each new entity individually with status.put(), I collected a list of model instances in the main loop and followed it with a single db.put(statusList).

This had a dramatic impact on the overall performance. The time it took to loop through the route file went from 12,000 megacycles to 2,000 megacycles (!).
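
As a rough illustration of the batching idea, here is a sketch that builds every model instance first and stores them with one call. `store_statuses`, `make_status` and `FakeStore` are hypothetical names; on App Engine the `put` argument would be `db.put`, and the stand-in store exists only to count round trips when running locally.

```python
def store_statuses(entries, make_status, put):
    """Build one model instance per entry, then store them in a single call.

    put is any callable that accepts a list; on App Engine that is db.put.
    """
    status_list = [make_status(e) for e in entries]
    if status_list:
        put(status_list)
    return len(status_list)


# Stand-in for the datastore (an assumption): it only counts round trips.
class FakeStore(object):
    def __init__(self):
        self.calls = 0
        self.saved = []

    def put(self, models):
        self.calls += 1
        self.saved.extend(models)


store = FakeStore()
count = store_statuses(range(480), lambda e: {"stop": e}, store.put)
```

With 480 status entries, the store sees one call instead of 480.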

Step three : Memcache tricks

Even after the improvement, I still noticed that I was spending just as much time parsing the file as I was storing results, and that didn't seem possible. After more quota probing, I discovered that I was fetching the same StopLocation entities repeatedly when I needed to find details about a particular stop.

However, when I started memcaching these entities I was disappointed by the overall performance improvement. And this led to perhaps the most revealing aspect of this entire process. The following memcache pattern is slow... 

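    from google.appengine.api import memcache

    # loop over hundreds of entries in the fire hose
    for e in firehose:
        stopID = getStop(e)
        # one memcache round trip per entry: this is the slow part
        stopEntity = memcache.get(stopID)
        if stopEntity is None:
            stopEntity = getItFromDatastore(stopID)
            memcache.set(stopID, stopEntity)
        # parse firehose entry and do other magic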


In fact, very slow when you aggregate it over hundreds of accesses. Even if you get a 100% hit rate in the memcache.

The good news is that it led me down another useful API path for the memcache. Just like the batch puts for model entities, you can set and get lists of objects. Furthermore, you can get and set multiple key values with one call - set_multi() and get_multi(). 

This again led to dramatic performance improvements. The time it took to loop through the route file went from 2,000 megacycles (after the batch put optimization) to 600 megacycles. 
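
A minimal sketch of the batched version, assuming the stop IDs can be collected before the main loop runs. `load_stops`, `load_from_datastore` and `DictCache` are hypothetical names; on App Engine the `cache` argument would be the `memcache` module itself, whose `get_multi()` returns a dict of the keys it found and whose `set_multi()` takes a mapping.

```python
def load_stops(stop_ids, cache, load_from_datastore):
    """Return {stop_id: entity} using one cache read and at most one cache write."""
    wanted = list(set(stop_ids))
    found = cache.get_multi(wanted)      # one round trip for every key
    missing = [s for s in wanted if s not in found]
    if missing:
        fresh = dict((s, load_from_datastore(s)) for s in missing)
        cache.set_multi(fresh)           # one round trip to fill the gaps
        found.update(fresh)
    return found


# In-memory stand-in for memcache (an assumption) that counts round trips.
class DictCache(object):
    def __init__(self):
        self.data = {}
        self.round_trips = 0

    def get_multi(self, keys):
        self.round_trips += 1
        return dict((k, self.data[k]) for k in keys if k in self.data)

    def set_multi(self, mapping):
        self.round_trips += 1
        self.data.update(mapping)


cache = DictCache()
firehose_stops = ["1100", "1101", "1100", "1102"]
stops = load_stops(firehose_stops, cache, lambda s: "stop-" + s)
```

The main loop can then look each stop up in the returned dict, which costs nothing compared to a cache call per entry.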

Step four : Install Appstats

I still wasn't happy with my performance so I went fishing for more resources and found Appstats. Truly a great resource, but it's something that should be used at the beginning of the process and not the end. :) It gives you a quick overview of how efficient you are (or aren't) being with the datastore. In my case, I had already optimized as best I could, but this tool is now in the front pocket for all future projects.

 Lessons Learned

  1. Become familiar with the quota package. It's the single best way to get granular measurements about where your app is spending its time. I'd link to the API, but I can't!?! It seems to be the only GAE API that isn't documented. The best resource I've seen is the monitoring section found on the platform quota page.
  2. Install Appstats event recorders in every GAE application you create. It's the quickest way to learn - at a high level - how effective your memcaching strategy is.
  3. Familiarize yourself with the entire suite of functions in the APIs you are using. Even if you don't know the details of every call, it will help your engineering when you can recall every tool in your toolbox.
  4. Memcache is your friend. Your best friend actually. This might be stating the obvious, but stated nonetheless.
  5. Memcache access is still slow when you aggregate lots of accesses. Use batch processing - even for the memcache.
  6. The taskqueue is the best way to cheat the thirty second handler limit. I didn't talk a lot about this specifically, but if you chunk up problems into smaller ones, the task queue can help you overcome the inherent time restrictions within GAE.
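
To make the last lesson concrete, here is a sketch of chunking a long list of routes into task-sized pieces. `chunks`, `enqueue_in_chunks` and the `enqueue` callable are hypothetical; on App Engine `enqueue` would wrap something like creating a `Task` and adding it to a queue. The batch size of 10 is an arbitrary assumption.

```python
def chunks(items, size):
    """Yield successive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def enqueue_in_chunks(routes, size, enqueue):
    """Hand each slice to enqueue() and return how many tasks were created.

    enqueue is a hypothetical callable; on App Engine it would add a task
    whose handler processes just that slice within the request deadline.
    """
    count = 0
    for batch in chunks(routes, size):
        enqueue(batch)
        count += 1
    return count


queued = []
task_count = enqueue_in_chunks(list(range(1, 83)), 10, queued.append)
```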

The part I left out... even the best effort can't overcome a bad idea

Even after all of this analysis, tweaking and optimization, I wasn't quite at my goal. I'd optimized the primary worker loop, cached all the data that wasn't changing, and optimized the expensive operation of storing new data. Things were better, but they were also more spread out. Instead of lots of small costly operations, I had a handful of still costly operations. And operations that were now outside my control.

Running this algorithm for 80+ routes still wasn't feasible without considerable hiking of my billable quota. I've resigned myself to believing that although App Engine is a terrific tool for building web apps, it was never intended to build apps like this. I remain skeptical that they've resolved their datastore performance problems as they've stated they have.

It's also worth pointing out that there is a huge knowledge void when it comes to the Quota API. It is not documented well - both in terms of its use as well as an understanding of how to interpret the results.

I'm going back to the drawing board to determine if there is a better way to access and cache this transit data. Who knows, maybe Madison Metro can save all of us the trouble and just build an API for the existing data! Wouldn't that be groovy?

Filed under  //   appengine   programming   projects   software  

Comments [9]

Over the last few days, I've been camped out at a local pool for Madison's annual All City Swimming Championships. Twelve teams and 1,703 swimmers. A fun but exhausting couple of days. Managing your kids is a two-part challenge. The first part is actually getting them to the block on time for their race. The second part is trying to sort through the 100+ swimmers in the event to determine how well they did.

So I set out to solve the latter problem. I created an SMS notification system that sent text messages to parents letting them know what their swimmer's time was and more importantly, overall rank (note there are as many as 20+ heats for some events so you have no idea if they've qualified for the finals by simply watching). If a swimmer had a rank in the top sixteen, they'd have the opportunity to swim in the finals on Saturday.

As soon as the scorer's table had scored an event, they posted the results to the web, which was the trigger for the app to notify parents. It turned out to be hugely popular. So much so that I think there's a great opportunity to monetize it for next year's event. Parents at the meet and at work loved the little notes about their swimmers. Here's what one of my notifications looked like...

Tracy, Anna finished event 21 in 40.26, ranked 142
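
The ranking step behind a message like that can be sketched in a few lines. This is an illustration rather than the app's code: `rank_results` and `notification` are hypothetical names, times are assumed to be plain seconds, the sample swimmers other than Anna are made up, and ties are not handled.

```python
def rank_results(results):
    """Sort (swimmer, seconds) pairs by time; return {swimmer: rank}, 1 = fastest."""
    ordered = sorted(results, key=lambda r: r[1])
    return dict((name, i + 1) for i, (name, _) in enumerate(ordered))


def notification(parent, swimmer, event, seconds, rank):
    """Format the text message sent to a parent."""
    return "%s, %s finished event %d in %.2f, ranked %d" % (
        parent, swimmer, event, seconds, rank)


# Results gathered across every heat of one event (sample data).
event_results = [("Anna", 40.26), ("Beth", 38.90), ("Cara", 41.75)]
ranks = rank_results(event_results)
message = notification("Tracy", "Anna", 21, 40.26, ranks["Anna"])
```

Because the rank is computed over all heats at once, a parent can tell from the number alone whether the swimmer landed in the top sixteen.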

As fun and useful as the app was, the most geeky and interesting element was actually the debugging part. The problem I faced was that I had no good way to test the app ahead of the meet. I had a pretty good idea how the meet scorers would format the data posted to the web, but there was no certainty. I also didn't have a lot of confidence that the app could actually follow the flow of the meet since the events don't go in order during the preliminary heats.

Since I didn't have a laptop or access to the codebase, there wouldn't be a way to triage issues during the meet using traditional debug tools. The solution was to create SMS hooks that let me tweak the app as the meet unfolded. I combined this with the implementation of multiple regular expressions when parsing the results. I built the app on App Engine so I created multiple memcache'd variables that could be controlled via text messages from my phone. 

I coded in the following hooks...
  • The ability to set/reset the swim event being polled so I can re-run (or skip) events if there was a mistake
  • The ability to add new swimmers and phone numbers when parents wanted to be added to the app
  • The ability to disable the entire app. I was paranoid that I'd made a hideous mistake that continuously sent text messages out to users and I wanted a way to shunt the entire app.
  • The ability to query the app to determine the current event being monitored
  • The ability to modify the URL base variable used to find the results on the web
In addition to these hooks being controlled with inbound SMS, I also had a few app events that triggered outbound SMS to my phone so I knew it was behaving correctly. 
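
A sketch of how inbound control messages like these could be dispatched. The command words and their arguments are assumptions (only 'disable' is mentioned in the post), `handle_command` is a hypothetical name, and a plain dict stands in for the memcache'd variables.

```python
def handle_command(body, state):
    """Apply one inbound SMS command to `state` and return the reply text.

    state is a plain dict here; in the app it was memcache'd variables.
    """
    parts = body.strip().split()
    if not parts:
        return "empty command"
    command, args = parts[0].lower(), parts[1:]

    if command == "event" and len(args) == 1 and args[0].isdigit():
        state["event"] = int(args[0])
        return "polling event %d" % state["event"]
    if command == "add" and len(args) == 2:
        swimmer, phone = args
        state.setdefault("swimmers", {})[swimmer] = phone
        return "added %s" % swimmer
    if command in ("disable", "enable"):
        state["enabled"] = (command == "enable")
        return "app %sd" % command
    if command == "status":
        return "monitoring event %s" % state.get("event", "none")
    if command == "url" and len(args) == 1:
        state["url_base"] = args[0]
        return "url base updated"
    return "unknown command"


state = {}
reply = handle_command("event 21", state)
```

Keying swimmers by name means a repeated 'add' message overwrites the entry instead of creating a second one.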

The hooks turned out to be invaluable on the first day. Both for keeping the app on the right event as well as adding parents to the app. I didn't actually have to use the emergency kill switch although it was nice to text 'disable' at the end of each day to make sure something didn't happen overnight.

The only downside to these hooks was actually a bug in the Google Voice app on my Droid. I was using a Google Voice number for the inbound events and one out of three texts I sent actually resulted in duplicate messages. The downside was when I added a new user, it added them twice (resulting in duplicate notifications when that swimmer swam!) and when I reset the event number, it reset it twice. 

Here's a look at the main handler for the inbound Twilio messages...

I forgot to take a picture in front of the results board at the meet. There's only one results board (for 1,700 swimmers), and each event is printed with a 10 point font. Most parents received their notification messages thirty minutes before the results board was updated!

Filed under  //   appengine   programming   projects   software   twilio  

Comments [4]


I gave a tech talk tonight on my experience with Google App Engine for the web608 group. I hope these talks continue - it's great for this city to have more events like these.

Link to Google presentation... http://docs.google.com/present/view?id=dhffp9s2_24xt7nmtgz&interval=5

Filed under  //   appengine   google   programming   projects   software  

Comments [0]

Two weeks ago when Twilio announced their developer contest for their new SMS API, I decided to build a mobile application that let me query the Madison Metro bus system to determine when my bus would arrive.

Although the entry did not win the contest, it was named mashup of the day last week at Programmable Web! It's called SMS My Bus, and if you live in Madison and ride the bus I encourage you to take advantage of it! You can find details about it here...

http://www.smsmybus.com

The basic architecture of the application is straightforward. SMS messages are sent to my Twilio phone number and Twilio routes them to my server via HTTP POST requests. I do a schedule look-up based on the user's input and return the results.

The tricky parts stem primarily from the fact that:

1. The Madison Metro doesn't actually provide web services for this data. The consequence for an app like mine is that I need to do a bit of screen scraping to find the data I'm after.

2. I chose to deploy this app using Google App Engine, and URL scraping can become a show stopper since GAE limits the resources available to every request. GAE will only let a single request run for about 30 seconds.

Needless to say, I would love it if my fine city of Madison would join the Gov 2.0 movement and open up more of its rich data via standard web services.

In the meantime... I needed a solution that would allow me to gather disparate data across many, many URLs. As an example, the busiest stop in the Metro system has 34 buses passing through it. I may need to grab 34 different web pages to begin to piece together the schedule as it relates to the caller at that stop.

I took advantage of GAE's Task Queue API and memcache counters to tackle this problem by farming out autonomous jobs that find the next available bus per route per stop, and at the end aggregating the results.

Admittedly, this is not advanced Computer Science. But I hope other App Engine developers can find some use in the pattern.

1. Define my task queues

I used two different task queues to manage the process. One, called aggregation, that queried individual routes at a stop. And another, called aggregationSMS, that pieced the results together for the return SMS message.

 
    queue:
    - name: aggregation
      rate: 20/s
      bucket_size: 1

    - name: aggregationSMS
      rate: 10/s
      bucket_size: 1

2. Spawn tasks

When an SMS request arrives, the request handler parses the input to determine the request parameters. If the request does not include a specific bus route, I'll query my route table for every route that passes through the respective stop. This table contains URLs for the real time arrival estimates.

I loop over the result set and create new tasks for the aggregation task queue.

 
    from google.appengine.api import memcache
    # this module lived at google.appengine.api.labs.taskqueue in older SDKs
    from google.appengine.api.taskqueue import Task
    from google.appengine.ext import db

    q = db.GqlQuery("SELECT * FROM RouteListing WHERE stopID = :1", stopID)
    routeQuery = q.fetch(100)
    if len(routeQuery) > 0:
        # create a counter for a universally unique caller ID
        memcache.add(sid, 0)

        # loop over every route at this stop
        for r in routeQuery:
            # the unique counter for this caller's request
            counter = memcache.incr(sid)

            # spawn a task for this stop/route tuple
            task = Task(url='/aggregationtask',
                        params={'sid': sid,
                                'stop': stopID,
                                'route': r.route,
                                'direction': r.direction,
                                'url': r.scheduleURL,
                                'caller': caller
                                })
            task.add('aggregation')
    else:
        # do some error handling
        pass
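
One thing to watch with a counter that is incremented as tasks are spawned: if an early task finishes and decrements before the loop has queued the rest, the counter can touch zero too soon. A hedged variation is to set the counter to the total before adding any task. The sketch below uses a hypothetical in-memory `Counter` so it runs anywhere; on App Engine the equivalents would be `memcache.set(sid, len(routes))` and `memcache.decr(sid)`. `fan_out` and `task_done` are illustrative names.

```python
class Counter(object):
    """In-memory stand-in for a memcache counter (an assumption)."""

    def __init__(self):
        self.values = {}

    def set(self, key, value):
        self.values[key] = value

    def decr(self, key):
        self.values[key] -= 1
        return self.values[key]


def fan_out(sid, routes, counter, spawn):
    """Record how many tasks are outstanding, then spawn one per route."""
    counter.set(sid, len(routes))
    for route in routes:
        spawn(sid, route)


def task_done(sid, counter):
    """Called at the end of each task; True means this was the last one."""
    return counter.decr(sid) == 0


counter = Counter()
finished = []


def run_immediately(sid, route):
    # Worst case: every task completes before the next one is spawned.
    finished.append(task_done(sid, counter))


fan_out("caller-1", ["2", "6", "14"], counter, run_immediately)
```

Even in the worst case simulated here, only the final task sees the counter reach zero.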

3. Define the task handlers

There are two task handlers. One to tackle the smallest job of determining the schedule for an individual bus at a stop. And one to piece all of these results together when the system is ready to reply to the caller.

The task handler, AggregationHandler, does the specific work to scrape the scheduling information from the bus system's site. The handler does three things.

  • Scrape the web page to find the next stop time.
  • Store the results in the datastore
  • Decrement the memcache counter for this transaction

Many of the implementation details have been stripped out of the following code snippet...

 
 
    from google.appengine.api import memcache, urlfetch
    from google.appengine.ext import webapp

    class AggregationHandler(webapp.RequestHandler):

        def post(self):
            # extract the parameters for this task
            sid = self.request.get('sid')
            stopID = self.request.get('stop')
            routeID = self.request.get('route')
            directionID = self.request.get('direction')
            scheduleURL = self.request.get('url')
            caller = self.request.get('caller')

            # 1. fetch the real time data
            result = urlfetch.fetch(scheduleURL)

            # scrape the page; getNextTime() stands in for the scraping
            # details that were stripped out of this snippet
            textBody = getNextTime(result.content)

            # 2. store these results in the datastore
            stop = BusStopAggregation()
            stop.stopID = stopID
            stop.routeID = routeID
            stop.sid = sid  # the sid identifies the caller's transaction
            stop.text = textBody
            stop.put()

            # 3. decrement the counter
            counter = memcache.decr(sid)

            # if we've completed the scraping, create a task to
            # piece the results together.
            if counter == 0:
                task = Task(url='/aggregationSMStask',
                            params={'sid': sid, 'caller': caller})
                task.add('aggregationSMS')

                # delete the counter for this transaction
                memcache.delete(sid)
 

The task handler, AggregationSMSHandler, does the job of piecing the results together. It relies on the unique SID for a caller's transaction to query the datastore and find the scheduling details.

 
 
    class AggregationSMSHandler(webapp.RequestHandler):

        def post(self):
            # extract the task's inputs
            sid = self.request.get('sid')
            phone = self.request.get('caller')

            # sort the results by time to find soonest upcoming stops
            q = db.GqlQuery("SELECT * FROM BusStopAggregation WHERE sid = :1 ORDER BY time", sid)

            # we'll only send the next four stops in the reply message
            routeQuery = q.fetch(4)
            if routeQuery:
                textBody = "Stop: %s\n" % routeQuery[0].stopID
                for r in routeQuery:
                    textBody += "Route %s: %s\n" % (r.routeID, r.text)
            else:
                # no results were stored for this caller's transaction
                textBody = "Doesn't look good... Your bus isn't running right now!"

            # send off the result via the twilio API
            outboundSMS(phone, textBody)
 

Results

This pattern allowed me to almost completely mitigate the DeadlineExceededError exceptions on GAE. I've yet to see a timeout problem inside the app. It's always possible that a single task could take too long, but if a task fails, it will re-queue itself and run again until it succeeds.

It's worth pointing out that another use of the Task Queue that I relied on but didn't show in the code snippets was for other repeated, remote tasks. For example, when I interface with the Twilio API, I do that by spawning a task to do the work. Likewise, I log Twilio events in the datastore on their own task queue as well.

Filed under  //   appengine   google   programming   projects   software   twilio  

Comments [9]

I was smitten when my daughter told me she had an idea for a web application. We were at the book store and were talking about how expensive it can be to buy books. She thought it would be a great idea to create a website where her friends could list the books they each had and then set up book swaps. And poof! It was born...

Here are the mockups she drew today.

     
Click here to download:
Kids_building_software_emmatra.zip (7083 KB)

Filed under  //   kids   software  

Comments [5]

I've been building a number of different applications with the Posterous API over the last few months. I've written about most of these experiences here before.

I've used the reading API primarily within Sharendipity, but have also used the posting API with the Ringerous application that lets you post to your Posterous by phone.

The API is very easy to use and works as advertised. And while the simplicity still offers plenty of access to the Posterous platform, I think there are some really wonderful opportunities if the API continues to evolve and adds functionality.

Posterous is proving to be the Uber-Twitter platform, and one of the ways to accelerate the growth and diverse uses is through advanced applications that interface with Posterous content in new ways.

Based on my experience with the API, here's what I'd like to see be improved:

OAuth support for user authentication

This is by far the most important feature and is really a core requirement for any application using the posting API.

One of the big problems when I built Ringerous was that it required every user of the service to give me their password. I have to post on their behalf on the backend so there is no way to prompt for a password while the user is using the app. Giving up a password is a tall order.

OAuth makes this problem go away.

A search API

Content is king. But only if you can find it. Posterous is a constant stream of great new and timely posts. For applications that are not revolving around a single user, a search API is needed.

A public timeline feed

I'd like to see Posterous add an "explore" call that is equivalent to http://posterous.com/explore/ which is available via the web today.

This page is a fun way to explore the diverse body of content being shared every minute by the Posterous community.

A user subscription feed

Similar to the explore feed, it would be nice to provide an API call for the user subscription feed so it would be possible to present a user's Posterous network of blogs.

Enable granular control for autopost

Currently, the autopost feature is either on or off when posting through the API. Just like email, there are use cases where the user may want more control over the services being updated. It would be nice to provide this feature in the posting API.

Hook up Posterous notification emails

This is likely an easy one. For whatever reason, the email notification system for subscribers is not hooked up for posts that come via the API. This significantly limits the communication benefits of Posterous for group sites.

Provide access to user profile data

The API is missing a user profile call. Providing access to details about the user such as description, thumbnail, and favorites adds a personal touch, a sense of community, and a method for exploring when third-party applications need to provide content navigation tools.

The current API is a great start, but it is clearly geared toward mechanical tasks. It's no mystery that so many of the existing API implementations are utility tools for porting blog content from other vendors.

But there are great opportunities for Posterous and its developer community to add new functionality and experiences to the platform.

 

       

Filed under  //   posterous   software  

Comments [1]

To celebrate being named one of the Best New Mashups over at Programmable Web this week, I'm going to list the top five use cases for Ringerous.

Family blogging


This was the original intention of the service. My extended family uses Posterous to share news, photos, and video with one another all over the country. I wanted to get the youngest and oldest in the family involved as well without the need to email.

Now my kids are calling in to the blog from their sporting events and from the backyard to announce their latest and greatest personal achievements. All in their own voices.

Mobile blogging (for the rest of us)


There are a couple billion people in the world with mobile phones and only a fraction of them are smart phones. Ringerous has proven to be a great mobile blogging tool for the rest of us.

Podcasting in the classroom


In classrooms, Posterous can be a great resource for collaboration projects. When those projects involve story telling, interviews, or reporting, Ringerous is a good medium for recording and sharing it.

Combine this with the drop-dead simple podcasting you can do with iTunes, and you get a great distribution model for the students and teachers as well.

Bring emotion to your posts


Blogging is a great way to communicate and share stories, but text and pictures often don't tell the whole story. Sometimes, there's no better way to capture the excitement (or despair) of a moment than hearing the voice of friends and loved ones.

Public voicemail


I'm waiting for someone to create topical, public voicemail boxes with Ringerous. Perhaps an inbox for Santa so you can tell him what you want for Christmas!? :)

How are you using Ringerous?

http://www.ringerous.com

Filed under  //   posterous   projects   ringerous   software   twilio  

Comments [2]

Sharendipity + YouTube Data API = Creative Goodness

Sharendipity is an awesome way to tap into your favorite web services. I tapped the YouTube API to create this fun little TV showing the Muppets Studio videos.


Want your own TV? Go create your own and set the channel and skin and then embed it on your site. Interested in something else? Let me know - I'm always looking for fun projects to work on.

Filed under  //   google   programming   sharendipity   software  

Comments [0]

I discovered Google App Engine by accident several months ago when I first looked into building a robot for Google Wave. It was very much a bookmark-and-move-on kind of an introduction.

I eventually did get back to the bookmark and explored GAE some more and have become a huge fan. For starters, it is very much in the spirit of our mission at Sharendipity - providing tools that make it easier for everyone to create custom web applications.

App Engine still requires app creators to know how to program, but it provides an awesome infrastructure for deploying and scaling applications on the web. Without spending a penny, developers get all sorts of goodies including...

  • A data store for easy database creation
  • Built-in user management using standard Google accounts
  • Built-in logging
  • Cron jobs to manage scheduled tasks
  • Task queues to schedule and manage autonomous jobs
  • An application dashboard for analytics and viewing of application data

With a (free) daily quota of 1.3M requests per application, App Engine is a great way to start a new product. As your product grows, you can move into billable services to increase your quotas.

My Experiment

I needed to find an application to build that met the following criteria...

  • Limited amount of new programming since my time is overbooked already.
  • Enough complexity that I could explore App Engine features beyond the "Hello World" tutorial.

So I decided to port an existing service that I had built in grad school - an email distribution list for the Astronomy Picture of the Day (APOD). Previously, this was being hosted using my alumni account at the University of Wisconsin, Madison.

The APOD email service proved to work great because it fit both criteria. There was very little new programming to do since I'd already built it once. And it let me explore several elements of programming within App Engine including...

  • The use of webapp - their web application framework for templating and handling requests.
  • The creation of taskqueue tasks to throttle outbound emails.
  • The use of the datastore to manage email subscribers.
  • The use of cron jobs to schedule the daily APOD emails.

The Hangups

The two challenges up front were learning Python plus the App Engine environment (including the APIs for the various services I needed). But the documentation for both is so thorough that it rarely held me up.

The quirks that actually caused friction were:

  1. The subtleties of the App Engine platform itself that are learned through trial and error.
  2. The non-deterministic nature of its performance.

This latter issue is the one thing that should give pause to the decision of building out a business on top of the platform. However, I tend to be optimistic about this and assume it will improve as it matures.

In the meantime, however, I found myself actually managing bad performance in App Engine without any optimization of my own code. The code is too simple to be slow! One of the overriding quotas for App Engine is the per minute CPU quota. You have somewhere less than 30 seconds to complete a request. And while you wouldn't want to take anything near that for a web request, it becomes a little constraining for non-web requests like cron jobs and taskqueue jobs.

All of the jobs in the APOD application are small and constrained. Parse HTML, send an email, or loop through a list of email addresses. Yet the time it takes to execute them varies wildly from day to day.

When the execution time exceeds the quota, you need to be prepared to manage the exception everywhere. When it happens in a taskqueue job, it can be particularly annoying since the task will re-queue itself - even if the meat of the job had already been completed.
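
One way to soften the re-queue problem is to record which jobs have finished so that a retried task can skip the work. This is a sketch under assumptions, not what the APOD app does: `run_once` is a hypothetical helper, the task ID format is made up, and the `done` set would need to live in durable storage such as the datastore in a real app.

```python
def run_once(task_id, done, work):
    """Run work() unless task_id is already recorded in `done`.

    done is a set here; in an app it would need to be durable storage,
    which is an assumption of this sketch.
    """
    if task_id in done:
        return False
    work()
    done.add(task_id)
    return True


sent = []
done = set()
first = run_once("apod-batch-1", done, lambda: sent.append("email"))
second = run_once("apod-batch-1", done, lambda: sent.append("email"))
```

A deadline that strikes between the work and the bookkeeping still causes a repeat, so this narrows the window rather than closing it.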

After I initially deployed the app, it felt a lot like I was patching holes for a boat that was already in the water. I added more instrumentation and caught more exceptions until I mitigated all of the problems.

The most glaring problem appears to be a problem in the use of the Mail package. Sending email will frequently lead to DeadlineExceededError exceptions. Remote calls in a throttled environment like this should always be asynchronous.

It appears that they've done just this with remote HTTP requests. However, one of the subtle problems I had was the intermittent failure of urlfetch() calls. I've seen as much as 20% of these calls failing with DownloadError exceptions. As a result, I've built in my own retry mechanism wherever urlfetch is used.
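
A retry wrapper of that kind might look roughly like the following. It is a generic sketch, not the code from the app: `fetch_with_retry` and the flaky stand-in are hypothetical, and the attempt count and delay are arbitrary. On App Engine you would pass `urlfetch.fetch` as `fetch` and `(urlfetch.DownloadError,)` as `retry_on`.

```python
import time


def fetch_with_retry(fetch, url, attempts=3, delay=0.5, retry_on=(Exception,)):
    """Call fetch(url), retrying on the given exceptions up to `attempts` times.

    attempts must be at least 1; the last error is re-raised if all fail.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return fetch(url)
        except retry_on as error:
            last_error = error
            if attempt < attempts - 1:
                time.sleep(delay)
    raise last_error


# Stand-in fetcher (an assumption) that fails twice and then succeeds.
calls = {"n": 0}


def flaky(url):
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("simulated download failure")
    return "ok: " + url


page = fetch_with_retry(flaky, "http://example.com/", delay=0, retry_on=(IOError,))
```

Keeping the attempt count small matters here, since every retry eats into the same request deadline.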

What's Missing

App Engine is awesome in its overall breadth and ease of use. But if I had to come up with a wish list, it would be the following...

  • An asynchronous Mail package
  • Better SDK tools for testing and simulating cron jobs and Mail actions.
  • As high as some of the quotas are, the email rate quota is too low (only 8 emails/minute). There is likely a very real concern about spam bots, but perhaps there could be an authorization process so legitimate applications could get higher quotas.
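
Working within a rate like that mostly comes down to spreading the sends over time. As a sketch, the helper below computes a delay for each email so that no more than eight go out in any one minute; `email_countdowns` is a hypothetical name, and on App Engine each value could be passed as the `countdown` argument when creating a task.

```python
def email_countdowns(count, per_minute=8):
    """Return a delay in seconds for each of `count` emails.

    Emails are grouped into batches of `per_minute`, one batch per minute.
    """
    return [(i // per_minute) * 60 for i in range(count)]


delays = email_countdowns(20)
```

For 20 subscribers this yields eight immediate sends, eight after a minute and four after two minutes.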

App Engine is a great way to quickly explore new web application ideas. With an easy to use SDK, push-button deployment, and a wide array of built-in services, there has never been a better time to be a programmer.

Interested in Astronomy? Sign yourself up to receive the APOD picture each day - http://apodemail.appspot.com!

Filed under  //   appengine   programming   projects   software  

Comments [2]

Apparently Evan Weaver did. That's one heck of a way to start a conversation if you're looking for a job.

Filed under  //   observations   software  

Comments [0]