
“It All Seemed So Easy”

9 Jul

“The IRS will help us.” 

“Officer, the light was yellow.”

“Honey, I’m a bit pregnant.”

What trouble 5 little words can hint at.  Everything is cruising along, and then 5 little words come back at you from an unexpected angle, like some hawk swooping in and going for the eyes.

Here are 5 to think about (6 if you despise hyphens):  “We used a ‘pre-positioning’ strategy.”

In early June, feeling rather pleased, I wrote about ‘retiring a Zombie App’.  (We did too.  It’s dead, and staying that way.)  And as part of that blogpost, I included this description of part of the larger approach:

“We used a ‘pre-positioning’ strategy. If you know you will need something later and can harmlessly incorporate it into the Production platform now, do it. In this case… mappings were loaded to Production and then started flowing out with the weekend DB refreshes to the cloud test environments, preventing the need for weekly reloading….”

The italicized sentence was that way in the original, but frankly it should have read “and if you really, really, KNOW you can harmlessly incorporate it into the Production platform”.  Knowing means having certitude, and in our business certitude comes from testing, not leaps of faith.

As part of another project, we recently employed the same philosophy of pre-positioning.  We even did it in the same arena, Transaction Translation mappings.  In this case, we pre-loaded transaction transformation instructions into Production that we knew we would need later.

The catch was I didn’t ‘know’ they would start interacting with other existing mappings.  The effect was that about 41,000 dividend transactions were ‘anonymized’, retaining their proper value and effect, but losing the identity of the security making the payment.  Once discovered, the issue was quickly diagnosed, but it took several days to restore the affected data to complete accuracy, several days of developer time that could have been used elsewhere.

While this incident had no adverse client-facing effects, in time it could have.  Our checks and balances are quite extensive, but didn’t anticipate an error of this nature.  Instead, it was caught by an attentive analyst.
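A check of this kind can be automated rather than left to an attentive analyst.  As a minimal sketch (the record fields here are hypothetical, not the actual schema), a nightly job could flag any transaction that retains its value but has lost the identity of its security:

```python
# Sketch of an automated integrity check: flag transactions that carry a
# dollar amount but no security identifier. Field names are hypothetical.

def find_anonymized(transactions):
    """Return transactions whose security identity is missing or blank."""
    return [t for t in transactions
            if t.get("amount") and not t.get("security_id")]

# Example: one healthy dividend, one 'anonymized' one.
txns = [
    {"id": 1, "type": "dividend", "amount": 125.00, "security_id": "XYZ"},
    {"id": 2, "type": "dividend", "amount": 87.50, "security_id": ""},
]
suspect = find_anonymized(txns)
```

A report of the `suspect` list, run daily, would have surfaced the 41,000 anonymized dividends on day one instead of whenever an analyst happened to look.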

This incident had two main roots.  First, I didn’t sufficiently understand the inner workings of one aspect of our transaction capture application to see that the new, broad-based Transaction Translation instructions might affect all transactions, not just the ones we were targeting in the future.

Second, and far more importantly, regardless of my or anyone else’s level of understanding of the inner workings, I should have tested for potential fallout, rather than relied on my personal conviction that there would be no adverse consequences.  Testing helps form a safety net for one’s gaps in knowledge, known or unknown.

Designing such a test can be difficult.  It’s easy to test for planned failures, but how does one test for a Rumsfeldian ‘unknown unknown’?   It’s axiomatic that one can’t do so with total certainty.  We can, however, play the odds in a way that favors catching the most common failures.  A modest amount of parallel processing (say, a week’s worth of data) would probably not catch a rare event such as a return of capital on a short position, but the mass of ordinary transactions, shunted through a test environment and compared with the same transactions in Production, would have shined a spotlight on this error long before it struck.
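The parallel run described above reduces to a simple diff: process the same week of transactions in both environments, match records up, and compare field by field.  A minimal sketch, with a record layout assumed for illustration rather than taken from the actual systems:

```python
# Sketch of a parallel-run comparison: the same transactions processed by
# Production and by a test environment, matched by id and diffed field by
# field. The record layout is assumed for illustration.

def compare_runs(prod_rows, test_rows,
                 fields=("quantity", "amount", "security_id")):
    """Return (id, field, prod_value, test_value) tuples for mismatches."""
    test_by_id = {r["id"]: r for r in test_rows}
    diffs = []
    for p in prod_rows:
        t = test_by_id.get(p["id"])
        if t is None:
            # Transaction vanished in the test environment entirely.
            diffs.append((p["id"], "<missing>", p, None))
            continue
        for f in fields:
            if p.get(f) != t.get(f):
                diffs.append((p["id"], f, p.get(f), t.get(f)))
    return diffs

prod = [{"id": 1, "quantity": 10, "amount": 100.0, "security_id": "ABC"}]
test = [{"id": 1, "quantity": 10, "amount": 100.0, "security_id": ""}]
diffs = compare_runs(prod, test)
```

An empty `diffs` list is the ‘operational non-event’ we want; anything else is a spotlight on exactly the kind of side effect the new mappings produced.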

As a fan of the methodical, I am also a strong believer in avoiding the same mistake twice – instead, find exciting new mistakes to make.  It’s how we learn.  (The Romans used to say ‘We progress by fault’.)  This one won’t be repeated, but others will crop up.  That’s the nature of the beast.  Testing and amelioration are essential ingredients to proper risk mitigation of even seemingly mundane functions.  I believe that ‘pre-positioning’ remains a beneficial and powerful strategy, but (to paraphrase Spidey’s Uncle Ben) ‘great power means great responsibility’.

Five little words.

~~ Joseph Konrad

Retiring a Zombie App

5 Jun

Herewith a short offering on how to kill zombies.  Even in an advanced civilization that has given us penicillin and “Here Comes Honey Boo-Boo”, zombie apps, neither growing nor dying, can still wander in and out of our resource matrixes with impunity, consuming our brains, our time, our memories (both human and mechanical), and of course our enterprise’s money.  Such was the case with Custodian Data Interface I.

Introduced, in tech-time, several geological epochs ago (okay, it was 2007), CDI 1 was a solid workhorse essential to getting Fortigent’s indigenous back-end suite of data acquisition and core database apps off the ground.  We used to have Advent DataExchange feeding into Advent Axys.  Now instead we would be in a world of CDI 1 feeding into ARI.  Raise the new flag.

Then, in 2009, CDI 2 was born.  Eventually all our Evare feeds were cut over to CDI 2, and its evolution has continued — with our first direct custodial feed, expanded IDC links, and now the beginnings of hedge fund data.

What made CDI 1 a zombie was our tardiness in cutting over one of our last and smaller data providers, DSCS.  When an app neither lets you down nor distinguishes itself with its growth potential, it can simply linger in the twilight.  That’s what CDI 1 did for nigh on 4 years, until last month, when at last the nails started being pounded into the coffin.  Here’s how:

First, we realized that permanently retiring CDI 1 was a down payment on a larger process of cleanup.  Our complement of applications has grown radically in the last 5 years.  Rationalization and streamlining are inevitably needed.  Here was some low-hanging fruit (after all, the replacement app was already working).  What was needed was willpower.  Guru Rao supplied that.

Second, knowing that this was a one-time commitment which would reap permanent benefits meant the resource allocation could be justified.  I was given primary responsibility, joined by Sarada Aragonda for testing and Bob Lewicki for file re-routing.  As to timeframe, we had no hard deadline, but we had a mandate: move slow or move fast, but don’t stop moving.  Get it done.

Third, we had to have a plan that would be quick, comprehensive, non-invasive of applications or existing operations, and as close to paralleling the expected production experience as possible.

1. In preparation, about 1,500 transaction translations and security translation mappings had to be ported from CDI 1 to CDI 2.  This required integrating data from 4 tables in CDI 1 to reside in just 2 tables in CDI 2.

2. We used a ‘pre-positioning’ strategy.  If you know you will need something later and can harmlessly incorporate it into the Production platform now, do it.  In this case, the mappings were loaded to Production and then started flowing out with the weekend DB refreshes to the cloud test environments, preventing the need for weekly reloading.  Jan Corsetty stepped in here to help.  Any necessary tweaks found during testing would be done to Prod at the same time, and those fixes would then propagate outwards in perpetuity.  This contributes to a cycle of across-the-board quality improvement and makes Production and cloud into near-exact mirror images of each other (with the primary exception of data obfuscation), improving the realism of the test experience.

3. Then, for each of 42 DSCS custodians, a test file was created whose purpose was to stress test every single mapping for its expected outcome in the new system.  Jenny Zhang from Operations ran point with any questions.  This efficient move acted as both ‘outcome testing’ and a form of user-acceptance testing, with me as proxy for the users.


4. Expected results were mapped out in advance.


5. For each custodian being tested, a dedicated test account was used. These were easily and efficiently created as ‘TBD’ accounts by the very files being used for testing. For instance, Morgan Stanley data downloads in files tagged with the letters “MSSD”, so MSSDTEST was the artificial test account.  All Morgan Stanley test transactions would go through this account.  Once processed, they could simply be viewed through ARI and compared with expected results.  Note that each transaction was given a unique numerical tag that served as its quantity, its dollar amount and its Comment field contents. This made sorting and identifying results quick and easy.
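The unique-tag trick in step 5 — one number serving simultaneously as quantity, dollar amount, and comment — can be sketched as follows.  The file format and field names here are invented for illustration; only the tagging idea comes from the steps above:

```python
# Sketch of the unique-tag test technique: each test transaction carries
# one unique number as its quantity, dollar amount, and comment, so any
# result in the new system traces straight back to the mapping that
# produced it. Field names and mapping codes are invented.

def make_test_transactions(mappings, account="MSSDTEST", start=1001):
    """Build one tagged test transaction per mapping under test."""
    txns = []
    for i, mapping in enumerate(mappings):
        tag = start + i
        txns.append({
            "account": account,
            "mapping": mapping,
            "quantity": tag,        # the tag doubles as quantity...
            "amount": float(tag),   # ...as dollar amount...
            "comment": str(tag),    # ...and as comment, for easy sorting.
        })
    return txns

txns = make_test_transactions(["DIV", "INT", "BUY"])
```

Because every number appears exactly once, sorting the processed output by quantity (or amount, or comment) lines it up one-to-one against the expected-results table with no detective work.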

With this method, we were able to advance from loading mappings on April 26, to file testing throughout the first week of May.  By its end, all anomalies had been resolved except two, localized to a single custodian, making cutover a reasonable decision.  Cutover began on May 14 with 40 custodians (2 more followed).  The data flow is now completely shunted to CDI 2, with the data capture component slated to be replaced within a few weeks.  Then the dirt goes on the coffin.

Few cutovers are flawless, and this one did experience minor hiccups, but after the go-live date all problems encountered were easily diagnosed and remediable.  The best sign of a well-done cutover experience is that it be an operational non-event.  With methodical planning and teamwork, the retirement of CDI 1 falls into that category.

Rest in peace, CDI 1.  And don’t come back.