Don't Design A Query String You Will One Day Regret

Published on February 18, 2015

When writing the Web API book, we decided that there was no way we would ever finish if we tried to address every conceivable issue.  So we decided to setup a Google Group where readers of the book could ask for clarifications and ask related questions.  One question I received a while ago has been sitting on my to-do list for way too long.  The question from Reid Peryam is about query resources.  This is my answer.

Denial2

The Claim

Reid quotes this paragraph from the book:

To work around the inability to easily expose new resources to clients, people often attempt to build sophisticated query capabilities into their API. The problem with this approach, beyond the coupling on the query syntax, is that just a few query parameters can open up a huge number of potential resources, some of which may be expensive to generate. Numerous API providers are starting to discover the challenging economics of exposing arbitrary query capabilities to third parties. A much more manageable approach is to enable a few highly optimized resources that address the majority of use cases. It is critical, however, that new resources can be added quickly to the API to address new requirements.

and rightfully calls me out on failing to provide examples of,

a few highly optimized resources that address the majority of use cases

The Problem

Before I try and describe the solution, let me first clarify exactly what pattern I am claiming is the source of concern.  Here is an example URI template,

http://api.example.org/orders{?fields,sort,filter,limit}

and a resolved URL might look like

http://api.example.org/orders?fields=OrderNo,Customer,OrderDate&sort=OrderDate&filter=OrderDate.gt.2012-01-01&limit=50

SegwayOn the surface this looks like an amazing idea.  A client developer can choose exactly what fields they want to have returned to minimize the bytes on the wire. They can use arbitrary filter criteria to limit the results returned.  This single generic query string can allow a client to generate a representation which contains pretty much any subset of orders data that they want.

This is a very quick way of exposing data without having to think very hard about how the data might be used.  In fact this type of functionality can be built by framework developers and delivered for free to application developers.

Why Is It A Problem?

I believe there are some problems with this approach.  The first problem is, by requiring clients to provide the field list, sort order and filter criteria you are requiring a client to have a significant amount of knowledge about the data model of the server.  Now, your client may already have this knowledge for other reasons and therefore it may not place any additional burden on the client.  However, if you ever choose to remove that client/server coupling you will find it much harder.

Unpredictable Workload

The next problems are performance related.  The first is related to how the data for the query is actually going to be retrieved.  Most likely the data will be in some kind of database.  If the sort order that is chosen matches that of a database index, the results will probably come back pretty quickly.  If however it doesn't, then it could be painfully slow.  A user of the API might be understanding if they are returning a large result set with thousands of rows of data, but what if they are querying a massive dataset but are limiting the query to only return 10 rows.  The server still has to sort the entire set of data. The API user is going to wonder why the request is so slow for such a small resultset.

Wand

Indexes are the magical things that make databases actually perform well. They also often have the ability to include extra columns of data in them to prevent queries from actually needing to go and read the actual data pages.  If in the query field list, all of the fields are including in an index, it is going to be really quick.  If one field is not in the index, then performance will degrade significantly.  These are performance details that are critical once a system begins to be loaded with a large volume of data and have a significant number of users.  It is not a problem that is easily seen during the sprint to go live whilst burning through the seed round of funding.

Diluting The Cache

The other performance challenge introduced by the "uber" query string is the fact that now, instead of there just being a few pre-chosen, performance optimized, use-case verified set of representations that can be cached, we now have to deal will potentially thousands of variants.  The combination of fields, sort orders and filter criteria make for a huge number of potential data subsets.  Caching those would not only bloat the cache but make the cache hit ratio very low.

Some Things You Can't Take Back

REST and hypermedia APIs are great in that they enable you to make many changes that don't break clients.  You can do stuff that turns out to be wrong and then fix it later, when you have the wisdom of hindsight.  However, to make the "uber" query string work, the client needs to take on a fair amount of responsibility and is given a huge amount of flexibility.  You can't just take that away without breaking things.  You end up being stuck with it.

The end result is you have clients who are getting inconsistent performance behaviour, they are are executing requests that are difficult to performance optimize on the server, and you can't fix the problem without a major breaking change to the interface.

Meerkat

Give Them Only What They Need

One approach for avoiding this outcome, is to raise the level of abstraction for your API to that of your application domain.  Instead of giving your client developers the ability to effectively write queries against your data store, write the queries for them and give them a name,

http://api.example.org/orders/open{?since,customer,region}
http://api.example.org/orders/late{?dayslate,customer,highvalue}
http://api.example.org/orders/closed{?customer,closeddaterange,closedtodate}
http://api.example.org/orders/byproduct{?productid,customer,orderdaterange}
http://api.example.org/orders/bypo{?purchaseorder}
http://api.example.org/orders/recent{?customer}

Without any specific knowledge of the types of "orders" that this API is dealing with, but with a fair amount of experience working with order management type of systems, I am going to be bold and say that this set of resources addresses 90% of types of queries that are needed on an orders API.  I can optimize my database to return these specific queries efficiently and hopefully with this reduced number of query variants I can get better cache utilization.

Responding To Feedback

It is highly likely that soon enough a customer is going to want to do something that the API doesn't support.  That's OK, because we can always add new resources to our API.  If we believe it is a valid use-case and we can deliver the results without degrading system performance, then adding the new capability should be a no-brainer.  The ability to quickly add new resources to an API is critical requirement in enabling this approach of starting with a limited API and adding new features only when required.

YesWeCan

The Original Question

In Reid's question he lists a set of URLs pulled from the documentation of his API and I have taken the liberty to relist them here as URL Templates.  Hopefully not too much was lost in the translation.

/api/shipments/{id}
/api/shipments{?ShipDate}
/api/shipments{?ShipDateStart,ShipDateFinish}
/api/shipments{?EnteredDate}
/api/shipments{?EnteredDateStart,EnteredDateFinish}
/api/shipments{?Failed}
/api/shipments{?WasFailed}
/api/shipments{?WasBlind}
/api/shipments{?Phase}
/api/shipments{?CustomerId}
/api/shipments{?CustomerIds}

As you can see, this set of URLs has not attempted to provide unbounded query capabilities.  Reid's team has used their knowledge of the domain to identify which queries are likely to be required by a consumer of the API.  The team should be able to optimize the database to be able to provide good performance characteristics for these specific queries.

One interesting difference in this set of URLs, as compared to my example, is the fact that the different subsets of query parameters are all pointing to the same path.  I have a tendency to add an extra path segment as a descriptor that makes each path only have one set of query parameters.  This is pure preference from a URL design perspective.  However, it may have an impact on the way routing to controllers works in your web api framework.

The Good/Bad News

Reid describes his set of URLs as enabling "a ton of filtering " and questions how he can "enable a few highly optimized resources that address the majority of use cases".  The good and bad news, is that's what has already been done.  It may look like a lot of filtering options, but as compared to what would have been enabled in the API with a unconstrained filter query string, the result is a just a few resources.

Reid - Sorry it took so long to get you an answer, I hope it was worth the wait.

Tortoise

Image Credits:
Denial
https://flic.kr/p/74PwUj
Segway https://flic.kr/p/hupTPq
Wand https://flic.kr/p/9uVH7P
Meerkat https://flic.kr/p/2wwVCY
YesWeCan https://flic.kr/p/5zzqtM
Tortoise https://flic.kr/p/nE44yh