Thursday, August 27, 2015

Helix: Vaporware or game-changer for cloud-based genomics


A while back, I wrote a 3-part due diligence [1, 2, 3] on the cloud-based genomics space, focusing on the competitive landscape around Seven Bridges Genomics.

Let’s discuss the potential impact of Helix…

It’s not supposed to make money from consumers

The most important thing about Helix is that it is playing the long game. It does not intend to make money from consumers. My best guess at the Helix strategy is loss-leading, bottom-up disruption with lock-in.


Basically, they will sequence your exome (or genome) at no cost to the consumer and take a cut of all app revenue (30% Apple standard?). While the vision is for community-developed content, they will need to seed the ecosystem with a few tools to both validate the developer API and generate interest: cue the first batch of high-profile ‘applications collaborations.’

I’m not too concerned about the infrastructure side, as Illumina should have already solved the hard cloud-based problems with BaseSpace and the hard laboratory integration/data management problems via HiSeq X deployments. I suspect this know-how will transfer.

Bottom-up disruption

At $500 in sequencing cost and a 30% app cut, consumers would need to average ~$1500 per person on “What colour will our kid’s eyes be?” and “When will I go bald?”…lulz…I think not.
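
A quick check of that arithmetic, as a minimal sketch. The $500 sequencing cost and the 30% cut are the guesses used above, not announced Helix figures, and the function name is made up for illustration. The exact break-even comes out a little higher than the round number above (about $1,667), which only strengthens the point.

```python
def break_even_app_spend(sequencing_cost: float, platform_cut: float) -> float:
    """App spend per consumer at which the platform's cut repays the sequencing cost."""
    if not 0 < platform_cut <= 1:
        raise ValueError("platform_cut must be a fraction in (0, 1]")
    return sequencing_cost / platform_cut


# Assumed figures from the post: $500 per exome, 30% cut of app revenue.
spend = break_even_app_spend(500, 0.30)
print(f"Break-even app spend per consumer: ${spend:,.0f}")
```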

The healthcare market is notoriously difficult to change; but, get consumers asking about Viagra, and the doctors may follow. That’s the idea behind Helix. Acclimatize the consumer to ‘sharing’ their genome for an insight, and hope it trickles into the clinic one “why can’t you GATTACA” question at a time.


Once consumers are locked in to the platform, the real money begins: approved genetic tests that will be sold directly to institutions. Unfortunately, without consumer adoption, Helix just wouldn’t have the clout and social license to operate needed to show hospitals out of the 1950s.

What other platform would be able to offer ‘free’ sequencing and have a database of (hopefully) millions of users to back it up? Not to mention, a large chunk of that ‘free’ sequencing cost will be ploughed right back into Illumina’s core business as instrument and consumable sales. None of the stand-alone cloud-based genomics shops can deliver that sort of synergy to the bottom-line.

The platform will then bifurcate between the non-approved ‘consumer’ apps (virtually free: check the price of any flashlight app) and the approved ‘institutional’ apps (gravy train).

Genius, if it works.

Competing vision to Oxford Nanopore’s Metrichor

Maybe I overestimate Illumina’s baby killing desire, but I also view Helix as a direct ‘vision’ challenge to Oxford Nanopore’s Metrichor.

Will everyone have a sequencer, like every lab has a PCR machine (Metrichor)?

Will you send off a sample for sequencing as a service, like every lab orders oligos (Helix)?

Will sequencing be ‘things’ focused (Metrichor) or human focused, moving up the value chain from novelty –> clinic (Helix)?

Only time will tell, but I like that both visions are beginning to be articulated.


The Helix announcement could completely upend the ecosystem. But, as always, beware the hype and remember that execution reigns.

Thursday, August 13, 2015

Even hackers have epics: Why we need Mel

Mel Kaye. To the uninitiated, a blank stare; within the sept, a folk hero. Mel’s [Free verse, Prose, Gist; Explained] is an epic tale of ‘real’ programming, about a level of heavy wizardry that only the very elite may ever approach.

As with all folk heroes, his tale has two sides.
On the one hand, there is the explicit demonstration of:
  • Supreme technical mastery,
  • Personal integrity, and
  • Code as self-expression.
On the other hand, there are elements which some may find subversive:
  • Intrinsic value of the hack,
  • Subversion of authority, and
  • Apathy to the ‘commercial’ value proposition.
Yet, only through this dualism does the story succeed in addressing the ethical questions developers face…

Is it just for programmers to subvert management?
Yes. Out of respect, our protagonist refused to report the cause of the bug.
Respect is the currency of the realm.
- j.ello

Is it just to rail against proprietary (or obfuscated) source?
Yes. Information wants to be free. Code needs to be free.
If programmers deserve to be rewarded for creating innovative programs, by the same token they deserve to be punished if they restrict the use of these programs.
- RMS, see also: GNU Manifesto

Mel encompasses the joys and sorrows of an entire discipline.

Mel does in one short story what technical guidelines and seminars can never achieve.

I say embrace the ethos. I say: What would Mel do?

Grok that.

R.I.P. Ed
23 September 1926 – 13 August 2014

Friday, July 3, 2015

Commentary: 50 Smartest Companies

I have to admit, I was a little surprised when the MIT Tech Review had 15 of its 50 companies in the biotech sector, specifically the genomics ecosystem, from the big bad of NGS instrumentation, Illumina, to relatively small analysis shops like DNAnexus.

Is 2015 really the year of genomics biotech?

The gains (infographic) in raw throughput are indeed very impressive, but I’m skeptical:

Analysis/interpretation >> Point-of-care/Portable >> “High-throughput”

2636 genomes! 100k genomes! 1M genomes!

Maybe China will throw its hat in the ring and spring for 1B genomes. Maybe folks will get serious about somatic variation and spring for 1k genomes from an individual.

IMHO, it seems like an over-hyped pissing contest of who can pay Illumina (sequencing), AWS (compute) and Oracle/IBM (data-centers) the most.

Hopefully, I’ll be proven wrong: taxpayers may then rejoice in money not flushed, and the Tech Review can be vindicated.

Thursday, June 25, 2015

Recap: PyDataUK 2015

This weekend, ~200 delegates trudged through typical London weather (rain) to the Bloomberg offices in London to attend PyDataUK 2015.
While it’s not your typical nerds-in-T-shirts meet-up, if you use Python to hack data, this conference is probably definitely for you.


Curiously, for a ‘data science’ conference, the attendance list (which I would have crawled LinkedIn with…heh) was not available. Based on my (biased) observations, the attendance was roughly as follows…
Type          Sub-type         Percentage (%)
Industry                       70
              Self-employed    20
              SME              40
              Sponsors         10
              Large            <1
Academia                       30
              Ugrad            <1
              Masters          <1
              PhD              15
              Postdoc          5
              Professor        10
Government                     <1
A few highlights…
* Self-employed contractors and consultants were very well represented.

Conference feel

A ‘your data’ conference, not a ‘big data’ conference

Hadoop has delivered value for <10% of the companies that have installed it
- Paraphrase, anon
This conference is data focused, i.e. focused on using the Python ecosystem to solve your data challenges. The focus is on practice, and practical tools, not theory.
Type                           Approx. size    Appropriate tools
Micro-data                     <1 GB           IPython
Small-data (memory-limited)    ~10 GB          Pandas
Medium-data (disk-limited)     <1 TB           Ad-hoc databases
Big-data                       TB - PB         Consider enterprise solutions, or grep
The fact is, ‘big data tools’ would be wildly inappropriate for the vast majority of attendees. The problem seems particularly acute in the life sciences. In his war story talk, Paul Agapow covered the herculean efforts required to re-purpose an ill-advised ‘big data’ solution to recover data from an ongoing clinical trial.
His message was very clear. The life sciences tend to have very detailed, very heterogeneous data in hundreds to thousands of rows (small/medium data). Let the data guide the solutions: you probably don’t need enterprise software, so just don’t waste your money.
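
The size-to-tool rule of thumb can be sketched as a tiny lookup. This is only an illustration: the cut-offs are my reading of the approximate sizes in the table (treating ‘~10Gb’ as an upper bound is an assumption), and `suggest_tooling` is a hypothetical helper, not something presented at the conference.

```python
GB = 10**9
TB = 10**12


def suggest_tooling(size_bytes: int) -> str:
    """Map a dataset size to the class of tool suggested in the table above."""
    if size_bytes < 0:
        raise ValueError("size_bytes must be non-negative")
    if size_bytes < 1 * GB:      # micro-data: fits comfortably in a session
        return "IPython"
    if size_bytes <= 10 * GB:    # small-data: memory-limited (assumed cut-off)
        return "pandas"
    if size_bytes < 1 * TB:      # medium-data: disk-limited
        return "ad-hoc database"
    return "enterprise solution, or grep"  # big-data


# A clinical-trial spreadsheet with a few thousand rows is firmly micro-data.
print(suggest_tooling(5 * 10**6))
```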

A Python is useful conference, not a “Python is deity” conference

All tools are shyte, but some tools (Python!) are useful.
- Paraphrase, anon
Speakers like Russel Winder, with his talk on the lack of computational efficiency in Python even when using libraries like numpy, set a memento mori undertone against some of the more blatant Python triumphalism.

An interpersonal conference, not a Cloister

The very high level of interpersonal interaction is yet another way in which the conference belies the nerds-in-T-shirts stereotype. This is very much a conference that one attends to seek guidance and solve problems.
While there are always the stragglers that don’t head down the pub, a good 2/3s of the conference went for fruitful discussion and drink on Saturday. Unsurprisingly, pub attendance was lower on Sunday, but still fruitful.

A place to get hired/take action, not heavy on theory

Folks were hiring like crazy, and it was very much a seller’s market.
If you’re a job seeker anywhere on the Python+data spectrum, I’d strongly recommend attending. Companies were recruiting along the entire spectrum, everywhere from AWS-ineering to user-focused commercial data analysis with IPython notebooks (or re-dash, see Arik’s talk for more details on this user-friendly database interaction framework).
In line with the action-oriented nature of the conference, the Pivigo Recruitment founders were there, doing resume/CV screens and offering advice, both to students and established professionals.
If you are a PhD/Postdoc looking to make the transition, I highly recommend taking a look at their Science to Data Science training program.
Continuum may also be prototyping a training programme of its own through its Client Facing Consultant position. Not entirely sure, but 6 months of training via a 3rd-party consultancy followed by an intentional poach (Continuum –> 3rd party) could be an interesting model.


I found the spread of talks fantastic. At least amongst the talks I attended…
Type              Percentage (%)
Tools             40
War story         30
Skills            20
Under the hood    10


Tools

Tools talks were the most common. They covered ‘non-brand name’ and upcoming tools with emerging communities.
Attend/watch if:
(i) You want to learn about specific tools that may be applicable to your problem.
(ii) You want to collaborate on extending / adopting new tools.

War story

These talks gave the horrifying and nitty-gritty details of a specific problem the speaker faced, and how they went about solving it (including gotchas and failures). The focus isn’t ‘wow, look at me’; but rather, this was some B.S., and I want no one to go through what I went through ever again.
  • Paul Agapow: Don’t use ‘big data’ tools when simpler solutions will do, particularly in the life sciences.
Attend/watch if:
(i) You want help with the problems you are immediately facing
(ii) You want exposure to problems you’ve never thought of.


Skills

These were high-level talks that focused more on skills and knowledge than specific tools.
  • Ian Ozsvald: Writing code for you is only the beginning; let’s see what it takes to push a Bloomberg model to production.
Attend/watch if:
(i) You want to learn what you need to know in a new area.
(ii) You want an overview of a topic you’ve never heard of.
(iii) You want to chat with the speaker about specific War Stories, after the talk.

Under the hood

These talks focused on low-level implementation details of numpy, pandas, Cython, Numba, etc., with a particular focus on performance and appropriateness. Personally, I found these talks the most useful. Where else can one gather such concentrated information from the mouth of the open-source contributors?
  • Russel Winder: If you want performance, use Python as a glue-language, and write your computationally intensive functions in a ‘real’ language.
  • Jeff Reback: In pandas, think about idioms and built-in vectorization to get the most out of your code (then write in a ‘real’ language if you still need to go faster).
  • James Powell: Why does writing good numpy feel so different from writing good Python? Because the styles have diverged, and will probably continue to do so.
Attend/watch if:
(i) You want a fire-hose of information about low-level topics.
(ii) You want to know how the ‘magic’ happens.
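
The common thread in the performance talks (push the hot loop out of the interpreter and into compiled code) can be illustrated with the standard library alone. This is a minimal sketch, not taken from any of the talks: in CPython the built-in `sum` runs its loop in C, so it stands in here for the vectorised numpy/pandas calls the speakers discussed. Timings depend on the machine, so none are quoted.

```python
import timeit


def python_loop_sum(values):
    """Sum values with an explicit, interpreted Python loop."""
    total = 0
    for v in values:
        total += v
    return total


values = list(range(200_000))

# Same answer either way; only where the loop runs differs.
assert python_loop_sum(values) == sum(values)

loop_time = timeit.timeit(lambda: python_loop_sum(values), number=5)
builtin_time = timeit.timeit(lambda: sum(values), number=5)
print(f"interpreted loop: {loop_time:.4f}s, built-in sum: {builtin_time:.4f}s")
```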


This is very much a conference focused on solutions. If you have a problem, don’t be shy! Ask around, and there will be people there that have faced similar problems, eager to help.
As for me, I look forward to attending next year!

Thursday, March 12, 2015

Book Review: How to lie with statistics

A data analyst's bible for communicating stats to non-experts. A recommended re-read as annual absolution for your statistical sins.

It is no surprise it's a classic: the book has aged remarkably well, the (humorous) anecdotes being as pertinent today as 60 years ago.

The premise is quite straight-forward. When presented with stats, keep in mind:
1) Tools of the trade,
2) Lies, and
3) Fallacies.
Then do a "sniff-test."

Tools of the trade include bias, sample size and significance tests.

Lies are (often graphical) ways of misleading the reader (intentionally for the data scientist; plausibly unintentionally for those with less of a background): changing the scale bars, 'cleverly' chosen percentages, dishonest before/after and my personal favorite, semi-attached figures (what the medical profession now calls 'surrogate end points').

Fallacies include the ever present correlation is of course causation, and 'proving' the null hypothesis.

If a breezy 124 pages is too much, cut straight to the end. At a 'lengthy' (by this book's standards) 15 pages, the 10th and final chapter enumerates a 5-step 'sniff test' that can stop a good many lies in their tracks:
1) Who says so?
2) How does he know?
3) What's missing?
4) Did somebody change the subject?
5) Does it make sense (particularly for extrapolations)?

If the pointy-haired boss ever read this book, it'd make the data analyst's job -- appease power by bending truth -- 456.7% more challenging!

PS. Speaking truth to power will get you fired 654.3% faster than appeasement. Exercise minimally bent truth with caution. You've been warned!

Sunday, January 18, 2015

The cross-functional team: Separation of concerns

Working on a cross-functional team is hard!

As specialists, be that specialization in software development, bioinformatics, or molecular biology, we are domain experts; yet, projects still fail to come to fruition on time, on budget and with the expected impact. This is as frustrating and demotivating to the non-technical manager as it is to the specialist team members.

Let’s look at a case study…

Kate (software developer) and Darnell (biologist) are hustled into a meeting room by Xue (non-technical manager). In good faith, Darnell (biologist) lays bare his frustrations with the existing software. Kate (developer) records these as a list of requirements. After rubber-stamp approval by Xue (non-technical manager) and two weeks of furious coding, the revised software is ready. Unfortunately for Darnell (biologist), the software is even worse than before.

There is another meeting, with more senior developers and biologists in attendance. Kate steps through the changes she made, and how the changes address the requirements gathered from Darnell. The biologists and developers don’t understand much of each other’s technical jargon, but do their best to provide input into Kate’s new requirements. The developers insist that the biologists use ABI instruments, because their output format is standardized and therefore easier to import. The biologists demand that the developers implement ‘big data’ capabilities. All Xue can think about is justifying the expense and deadline slippage to her annoyed higher-ups, along with the sickly feeling of having someone breathing down her neck.

Sound familiar?

The organizational problem is that our specialists, Kate and Darnell, are trained to deliver ‘locally’ optimal solutions within their area of expertise; unfortunately, real-world problems are usually ‘global’. The challenge to the cross-functional team is to approximate a reasonable ‘global’ solution with a set of ‘local’ solutions contributed by each specialist. Put another way… “software problems” are few; “problems benefiting from software” are many.

Separation of concerns to the rescue

Separation of concerns (SoC) is a precept of modern software engineering. Focusing on the ‘what’, and abstracting the ‘how’, enables collaborative development on large code-bases, easing maintenance, extension and debugging.

The concept is simple: disparate modules of code must communicate through a common interface. As long as the interface remains intact, the internal workings of each individual module may be modified independently. Importantly, each developer need only know the details of their own module, and the interfaces of the modules they interact with. The concerns (implementation level details) of each module, are thus separated (self-contained, and preferably free-standing).
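
In code, the idea looks roughly like the sketch below. Everything in it is hypothetical and chosen for illustration (the `SequenceSource` interface, the two implementations and the `gc_fraction` caller); the point is only that the caller depends on the interface, so a slow backend can be swapped for a cached one without the caller changing.

```python
from typing import Dict, Protocol


class SequenceSource(Protocol):
    """The shared interface: the 'what'. Callers rely on nothing else."""

    def fetch(self, sample_id: str) -> str: ...


class SlowRemoteSource:
    """Stand-in for a slow backend; counts how often it is hit."""

    def __init__(self, records: Dict[str, str]) -> None:
        self._records = records
        self.calls = 0

    def fetch(self, sample_id: str) -> str:
        self.calls += 1
        return self._records[sample_id]


class CachedSource:
    """Same interface, different 'how': remembers earlier answers locally."""

    def __init__(self, inner: SequenceSource) -> None:
        self._inner = inner
        self._cache: Dict[str, str] = {}

    def fetch(self, sample_id: str) -> str:
        if sample_id not in self._cache:
            self._cache[sample_id] = self._inner.fetch(sample_id)
        return self._cache[sample_id]


def gc_fraction(source: SequenceSource, sample_id: str) -> float:
    """Caller code: knows the interface, not the implementation behind it."""
    seq = source.fetch(sample_id)
    return sum(base in "GC" for base in seq) / len(seq)


remote = SlowRemoteSource({"s1": "GATTACA"})
cached = CachedSource(remote)
print(gc_fraction(remote, "s1"), gc_fraction(cached, "s1"))
```

Swapping `remote` for `cached` changes nothing for `gc_fraction`; that independence is what the interface buys.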

At first glance, this might not seem particularly relevant to Kate and Darnell, but managing separation of concerns should be a cross-functional team’s #1 tactical priority, second-only to sharing a common vision (#1 strategic priority). Separation of concerns forces our cross-functional team to focus on ‘what’, instead of ‘how’.

For the software engineer, separation of concerns means crafting sensible code modules, with a thoughtful API. For the cross-functional team member, it means understanding the high-level problem (what), coming to a common understanding (interface) with one’s peers, making that understanding explicit (‘human API’), and sharing a common language to discuss solutions.

To successfully implement, each team member requires:1

  • Mutual ownership of the ‘global’ solution.
  • Gradient awareness of team member capabilities.
  • Commitment to communication, which includes trust, honesty and good faith.

Re-examining the case study…

Kate (software developer) and Darnell (biologist) are hustled into a meeting room by Xue (non-technical manager). In good faith, Darnell (biologist) lays bare his frustrations with the existing software. Kate (developer) politely stops Darnell (biologist), and asks Darnell and Xue about the actual ‘problem’ they’re trying to solve (mutual ownership). Putting aside specific frustrations with the software, the three discuss each other’s overall process and ‘pain-points’, both biological and software (gradient awareness). The three state and adjust their understanding, and break to assess (communication):

Kate (developer): Problems that can be solved with software, pro/con for various options?
Darnell (biologist): ditto for molecular biology.
Xue (non-technical manager): Context. What are other teams doing?

After a few days of assessment, there is another meeting. Xue outlines the high-level overview discussed previously to make sure everyone is on the same page about the problem. Kate (developer) presents the pros/cons of a few software options. Darnell (biologist) follows suit for molecular biology (communication). The group discusses the various options (mutual ownership); in consultation, Xue chooses the set of options to be implemented, and leaves the implementation details to Kate and Darnell. As the week progresses, software and biology changes are applied. Kate and Darnell touch base to reassess their understanding if an interface becomes unclear, or additional dependencies arise (gradient awareness). Their ‘local’ solutions each contribute to solving the ‘global’ problem. Concerns have been separated such that developers aren’t telling biologists how to do their jobs, and vice-versa.

Applying separation of concerns is an art

There are no right answers. The central challenge lies with each specialist coming to a common understanding of their interfaces with peers, walking the tight-rope of openness (about what) and abstraction (about how). Thus, applying separation of concerns simultaneously requires teams to understand more of the overall problem (what), so as to define sensible interfaces between team members, yet less of each other’s implementation-level details (how).2

If the proper balance of openness and abstraction is not achieved, then interfaces are either drawn too broadly (you’ll be stepping on each other’s toes and exposed to unnecessary implementation-level details), or too narrowly (the ‘local’ solutions of each team-member won’t work together to address the ‘global’ problem).

Done well, separation of concerns enables specialists to work together in harmony: delivering a reasonable set of ‘local’ solutions to the ‘global’ problem at hand.
Done poorly, separation of concerns stifles innovation by imposing artificial barriers: ‘local’ solutions which are ‘globally’ ineffective (and sad panda for all parties).

Cross-functional teams of the world, give the principle of separation of concerns a try on your next project.

  1. As corollaries, “not my problem”, dismissal of your peers’ capabilities and inter-specialty rivalry are unacceptable.
  2. An added benefit is that less field-specific jargon tends to be used because implementation-level details (how) are abstracted into higher-level problems (what). For example, everyone can understand that a software application is slow (what), but the biologist (usually) couldn’t care less that it’s due to excessive network traffic (why), or that the developer resolved the problem by local caching (how).