Build it and they will come? Performant Search Part 2: The Technology Sauce For Better SpaghettiPosted: November 8, 2012
Our job was to find a long term scalable solution to the problem of finding your activities that match your key word search. This post pertains to the technology involved. Read about product features and new capabilities here.
Turns out, search in RescueTime is a surprisingly complicated problem, for the simple fact that your typical search prioritizes ranked relevance results– it’s ok for the engine to stream results to you, it’s ok for the search to be incomplete, as long as the ranking is bubbling up the best match. It’s ok for it to be probabilistic or best-guess in nature, sometimes. Generally speaking, you are looking for something small (some words) in something large (a web page).
Our challenge, is that while the user experience semantically matches “search”, what you really need is a complete result set, combining all relevant reporting filters, of activities that match your requested search expression. It should produce repeatable results assuming new data hasn’t arrived. It should be real-time updatable as your new data comes in (~ every 3 minutes). It should be ok if every record or zero records match, there can be no cap on number of matches. All this, for about 100-400 logs of time data per user per day for many tens of thousands of users. The longer a user is with us, the huger the list of activities to match against, just for that user. The list of unique activities of our users is well over 1 billion. We should be able to completely rebuild indexes of activities at will, in near real time, to provide application improvements. Yet, cost needs to scale at worst linearly as a fraction of the rest of our products’ demands, and speed needs to remain constant or better.
All of these characteristics eventually killed our previous attempts to use industry-standard search models based first on Lucene/solr, and secondly on Sphinx. Both were able to hold up for a period of time, but were fundamentally flawed solutions in the assumptions they make expecting a relatively static, ranked-match document-based search model. To shoehorn them into our needs required operational spaghetti and some pretty ugly reporting query structures.
Our old search platform may have looked like this and required similar engineer attention, but it didn’t sound as good.
Enter MySQL fulltext Boolean search. First there is the critical advantage of being *inside* and native to our database platform in a way that even Sphinx with it’s plugin can’t be. This allows for more integrated and simpler reporting queries– no longer is the step of matching your search expression required to be isolated from the reporting query that depends on it (Sphinx could have done this, sort of, with the plugin, but not really the same). Second, in Boolean search mode, MySQL provides an unlimited result set (no cap on results). Additionally, there is less duplication of supporting data, since it can operate entirely in the same instance as the source data– this value is not to be underestimated, for all the inherent caching this leverages. Operationally, it is far easier to dynamically and programmatically add, destroy, and rebuild indexes– since they behave like tables with normal indexes to the operator.
But for performance, the most critical options it offered was a way to fluidly and transparently provide per-user-account search indexes, which lets our performance remain consistent despite constant multi-dimensional growth (new users + accruing existing users’ time data). This isolation-of-index solution would have been theoretically possible but horribly unwieldy and in need of huge operational supporting code in the other options. Secondly, it provides a clear way to constrain the size of keyword indexes, in other words, we know from your requested report you could only possibly care about activities that were in the particular time range you requested, and this can be of value both in index partitioning options and the submitted report query itself, especially in the amount of memory that must be held to build final results. A huge benefit of this known-maximum-scope for the searchable data means that at any time we can intelligently but programmatically throw away or update whatever dynamic table plus index we had that intersects the requested scope rather than the entire source tables, and rebuild it in real time, for a minor speed penalty (< 0.01 sec vs .1 to 3 sec for your search). Any subsequent search request that remains a subset of that most recently persisted scope can just reuse the current index, with the < 0.01 sec speed. We can play with permitted scope expansion to tune for speed. Furthermore, any sharding of accounts across new instances allows the search data to naturally follow or be rebuilt inline with the same scaling decision that drove the sharding to begin with– no separate stack to worry about following the shard logic.
Check out some example code for sneaking a search result into a series of joins rather than hanging out in the WHERE clause. Here it can be treated just like any other join, and like any other filter on your pivot reporting.
-- check dynamically_maintained_text_and_things_with_controllable_scope -- if exists and is up to date continue, -- else, create if needed and -- push missing scope by intersecting request scope with existing -- index is maintained by table definition SELECT * from things INNER JOIN other_things on things.id = other_things.thing_id -- begin search acting like join INNER JOIN ( SELECT * FROM ( SELECT things_id FROM dynamically_maintained_text_and_things_with_controllable_scope `things_search_data` WHERE MATCH (things_search_data.a_text_field, things_search_data.another_text_field) AGAINST ('search words here -there' IN BOOLEAN MODE) ) d_things_result_table ) things_result_table ON things_result_table.thing_id = other_things.thing_id -- end search acting like join WHERE other_things.limiting_characteristic = a_factor
We’re using the table alias for the search data there because that would allow you to chain multiple searches, one after the other (search inside a search), against the some source data by adjusting its table alias.
Engineers looking to provide high performance, easily maintainable search capabilities should carefully consider MySQL fulltext search options if they are on a late version of the product. This is especially true if you are trying to manage a search index of billions of tiny items based on database text columns that mutate very quickly, and benefit from programmatically created and maintained indexes. For example, it occurs to me that Twitter could use this model in very much the same way we do to provide some powerful realtime search indexes for targeted subsets of accounts or hashtags.
In our case, we were able to reduce our search related platform costs by about 60-70%, significantly reduce our operational pain, even while delivering a solution that provided vastly improved performance and eliminating that “well, it works now but we’ll just re-architect next year” problem.
Our spaghetti search code and virtual spaghetti cabling has now been re-factored into something tastier, ready for your consumption.