
Thursday, June 25, 2015

Recap: PyDataUK 2015

This weekend, ~200 delegates trudged through typical London weather (rain) to the Bloomberg offices in London to attend PyDataUK 2015.
While it’s not your typical nerds-in-T-shirts meet-up, if you use Python to hack data, this conference is almost certainly for you.

Attendance

Curiously, for a ‘data science’ conference, the attendance list (which I would have crawled LinkedIn with…heh) was not available. Based on my (biased) observations, the attendance was roughly as follows…
Type          Sub-type        Percentage (%)
Industry                      70
              Self-employed   20
              SME             40
              Sponsors        10
              Large           <1
Academia                      30
              Ugrad           <1
              Masters         <1
              PhD             15
              Postdoc         5
              Professor       10
Government                    <1
A few highlights…
* Self-employed contractors and consultants were very well represented.

Conference feel

A ‘your data’ conference, not a ‘big data’ conference

Hadoop has delivered value for <10% of the companies that have installed it
- Paraphrase, anon
This conference is data-focused, i.e. focused on using the Python ecosystem to solve your data challenges. The focus is on practice and practical tools, not theory.
Type                          Approx. size   Appropriate tools
Micro-data                    <1 GB          IPython
Small-data (memory-limited)   ~10 GB         pandas
Medium-data (disk-limited)    <1 TB          Ad-hoc databases
Big-data                      TB - PB        Consider enterprise solutions, or grep
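For the ‘medium data’ row, here is a minimal sketch of the kind of ad-hoc database workflow this implies (the file, table and column names below are made up): stream a CSV that won’t fit in memory into a throwaway SQLite database with pandas, then query back only the slice you need.

import sqlite3
import pandas as pd

# Hypothetical disk-limited workflow: load a large CSV in chunks into an
# ad-hoc SQLite database, then pull back only the rows of interest
con = sqlite3.connect("events.db")
for chunk in pd.read_csv("events.csv", chunksize=1000000):
    chunk.to_sql("events", con, if_exists="append", index=False)

subset = pd.read_sql("SELECT * FROM events WHERE user_id = 42", con)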
The fact is, ‘big data tools’ would be wildly inappropriate for the vast majority of attendees. The problem seems particularly acute in the life sciences. In his war story talk, Paul Agapow covered the herculean efforts required to re-purpose an ill-advised ‘big data’ solution to recover data from an ongoing clinical trial.
His message was very clear: the life sciences tend to have very detailed, very heterogeneous data in hundreds to thousands of rows (small/medium data). Let the data guide the solution: you probably don’t need enterprise software, so don’t waste your money.

A “Python is useful” conference, not a “Python is deity” conference

All tools are shyte, but some tools (Python!) are useful.
- Paraphrase, anon
Speakers like Russel Winder, whose talk covered the lack of computational efficiency in Python (even with libraries like numpy), set a memento mori undertone against some of the more blatant Python triumphalism.

An interpersonal conference, not a Cloister

The very high level of interpersonal interaction is yet another way in which the conference departs from the nerds-in-T-shirts stereotype. This is very much a conference one attends to seek guidance and solve problems.
While there are always stragglers who don’t head down the pub, a good two-thirds of the conference went for fruitful discussion and a drink on Saturday. Unsurprisingly, pub attendance was lower on Sunday, but still fruitful.

A place to get hired/take action, not heavy on theory

Folks were hiring like crazy, and it was very much a seller’s market.
If you’re a job seeker anywhere on the Python+data spectrum, I’d strongly recommend attending. Companies were recruiting along the entire spectrum, from AWS-ineering to user-focused commercial data analysis with IPython notebooks (or re:dash; see Arik’s talk for more details on this user-friendly database interaction framework).
In line with the action-oriented nature of the conference, the Pivigo Recruitment founders were there, doing resume/CV screens and offering advice to both students and established professionals.
If you are a PhD/postdoc looking to make the transition, I highly recommend taking a look at their Science to Data Science training program.
Continuum may also be prototyping a training programme of their own through its Client Facing Consultant position. I’m not entirely sure, but 6 months of training via a 3rd-party consultancy followed by an intentional poach (Continuum -> 3rd party) could be an interesting model.

Talks

I found the spread of talks fantastic, at least amongst the talks I attended…
Type             Percentage (%)
Tools            40
War story        30
Skills           20
Under the hood   10

Tools

Tools talks were the most common. They covered ‘non-brand name’ and upcoming tools with emerging communities.
Attend/watch if:
(i) You want to learn about specific tools that may be applicable to your problem.
(ii) You want to collaborate on extending / adopting new tools.

War story

These talks gave the horrifying, nitty-gritty details of a specific problem the speaker faced, and how they went about solving it (including gotchas and failures). The focus isn’t ‘wow, look at me’; rather, ‘this was some B.S., and I don’t want anyone to go through what I went through ever again.’
  • Paul Agapow: Don’t use ‘big data’ tools when simpler solutions will do, particularly in the life sciences.
Attend/watch if:
(i) You want help with the problems you are immediately facing
(ii) You want exposure to problems you’ve never thought of.

Skills

These were high-level talks that focused more on skills and knowledge than specific tools.
  • Ian Ozsvald: Writing code for yourself is only the beginning; let’s see what it takes to push a Bloomberg model to production.
Attend/watch if:
(i) You want to learn what you need to know in a new area.
(ii) You want an overview of a topic you’ve never heard of.
(iii) You want to chat with the speaker about specific War Stories, after the talk.

Under the hood

These talks focused on the low-level implementation details of numpy, pandas, Cython, Numba, etc., with a particular focus on performance and appropriateness. Personally, I found these talks the most useful. Where else can one gather such concentrated information straight from the mouths of the open-source contributors?
  • Russel Winder: If you want performance, use Python as a glue-language, and write your computationally intensive functions in a ‘real’ language.
  • Jeff Reback: In pandas, think about idioms and built-in vectorization to get the most out of your code (then write in a ‘real’ language if you still need to go faster). A small illustration follows at the end of this section.
  • James Powell: Why does writing good numpy feel so different from writing good Python? Because the styles have diverged, and will probably continue to do so.
Attend/watch if:
(i) You want a fire-hose of information about low-level topics.
(ii) You want to know how the ‘magic’ happens.
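As an illustration of the vectorization point above (my own toy example on synthetic data, not taken from the talk): a row-by-row Python loop versus the idiomatic, vectorized equivalent.

import numpy as np
import pandas as pd

# One million rows of made-up numeric data
df = pd.DataFrame(np.random.rand(1000000, 2), columns=["a", "b"])

# Row-by-row Python loop: fights the library
slow = pd.Series([row.a + row.b for row in df.itertuples()])

# Idiomatic, vectorized pandas: same result, orders of magnitude faster
fast = df["a"] + df["b"]

assert np.allclose(slow, fast)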

Take-home

This is very much a conference focused on solutions. If you have a problem, don’t be shy! Ask around, and there will be people there who have faced similar problems, eager to help.
As for me, I look forward to attending next year!

Thursday, March 12, 2015

Book Review: How to lie with statistics


A data analyst's bible for communicating stats to non-experts. A recommended re-read as annual absolution for your statistical sins.

It's no surprise it's a classic: the book has aged remarkably well, the (humorous) anecdotes being as pertinent today as 60 years ago.

The premise is quite straight-forward. When presented with stats, keep in mind:
1) Tools of the trade,
2) Lies, and
3) Fallacies.
Then do a "sniff-test."

Tools of the trade include bias, sample size and significance tests.

Lies are (often graphical) ways of misleading the reader (intentionally for the data scientist; plausibly unintentionally for those with less of a background): changing the scale bars, 'cleverly' chosen percentages, dishonest before/after comparisons and, my personal favorite, semi-attached figures (what the medical profession now calls 'surrogate end points').

Fallacies include the ever-present 'correlation is, of course, causation', and 'proving' the null hypothesis.

If a breezy 124 pages is too much, cut straight to the end. At a 'lengthy' (by this book's standards) 15 pages, the 10th and final chapter enumerates a 5-step 'sniff test' that can stop a good many lies in their tracks:
1) Who says so?
2) How does he know?
3) What's missing?
4) Did somebody change the subject?
5) Does it make sense (particularly for extrapolations)?

If the pointy-haired boss ever read this book, it'd make the data analyst's job -- appeasing power by bending truth -- 456.7% more challenging!

PS. Speaking truth to power will get you fired 654.3% faster than appeasement. Exercise minimally bent truth with caution. You've been warned!

Tuesday, December 2, 2014

Building blast+ databases with taxonomy ID (taxid_map)

Building NCBI BLAST+ databases with linked taxonomy is far more difficult than it should be.
For example, in taxonomy-based tools such as Kraken, the mappings from
1) taxonomy id to sequence id (gi or accession) and
2) taxonomy id to a human-readable taxonomy tree
are built in and transparent to the user.
Unfortunately, with BLAST+ these steps must be completed manually and are split across two separate programs: makeblastdb for (1) and blastn/blastp/blastx for (2).

(1) Taxonomy id <–> sequence id

In BLAST+, a taxid_map file must be created and passed to makeblastdb:
makeblastdb -in <FASTA file> -dbtype nucl -parse_seqids -taxid_map taxid_map.txt 
where taxid_map.txt is a space- or tab-separated list of sequence ids (either gi or accession) and taxonomy ids.
For example, with gi:
taxid_map.txt
556927176 4570
556926995 4573
501594995 3914
Alternatively with accession:
taxid_map.txt
NC_022714.1 4570
NC_022666.1 4573
NC_021092.1 3914
There is no turn-key way to generate this taxid-to-sequence-id mapping for a moderately large set of sequences.
Fortunately, there is always a hack work-around. NCBI allows export of both the FASTA and GenBank files. The former are used as the default input for makeblastdb, and the latter contain both the sequence_id and taxid. Both can be obtained from NCBI by searching and exporting with Send to:
[Screenshot: exporting records from an NCBI search via the "Send to:" menu]
This simple Python code snippet will do the trick for small and moderately large datasets.
from Bio import SeqIO

genbank_file = "DNA.gb"

with open('taxid_map.txt', 'w') as f:
    for gb in SeqIO.parse(genbank_file, "gb"):
        try:
            # GI number of the record (the sequence id)
            gi = gb.annotations['gi']
            # taxid is stored as a "taxon:<id>" db_xref on the source feature
            taxid = gb.features[0].qualifiers['db_xref'][0].split(':')[1]
            f.write("{} {}\n".format(gi, taxid))
        except (KeyError, IndexError):
            # skip records missing a GI number or taxon cross-reference
            pass
For large datasets, the bandwidth cost of downloading the GenBank records from NCBI becomes prohibitive, and a local dictionary approach (e.g. one built from NCBI's taxonomy dump files) would probably be warranted.
Download both the FASTA and GenBank files, or alternatively extract the FASTA from the GenBank file, e.g. with Biopython.
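As a minimal sketch of that extraction step (the output file name DNA.fasta is just an example; the input matches the snippet above):

from Bio import SeqIO

# Write a FASTA copy of the downloaded GenBank records,
# so only one download from NCBI is needed
SeqIO.convert("DNA.gb", "genbank", "DNA.fasta", "fasta")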

(2) Taxonomy id <–> Taxonomy tree

Simply include this NCBI database in the same directory as your database for the look-up to work with blastn/blastp/blastx: ftp://ftp.ncbi.nlm.nih.gov/blast/db/taxdb.tar.gz
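If you prefer to script that step too, here is a small sketch using only the Python standard library (the URL is the one above; the archive is unpacked into the current directory):

import tarfile
import urllib.request

# Download the NCBI taxonomy database and unpack it next to the BLAST database
urllib.request.urlretrieve("ftp://ftp.ncbi.nlm.nih.gov/blast/db/taxdb.tar.gz", "taxdb.tar.gz")
with tarfile.open("taxdb.tar.gz") as tar:
    tar.extractall(".")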

Eating your cake

blastn -db <DATABASE> -query <QUERY> -outfmt "10 qseqid sseqid pident staxids sscinames scomnames sblastnames sskingdoms"
1,gi|312233363|ref|NC_014692.1|,86.26,310261,Sus scrofa taiwanensis,Sus scrofa taiwanensis,even-toed ungulates,Eukaryota
1,gi|223976078|ref|NC_012095.1|,86.26,9825,Sus scrofa domesticus,domestic pig,even-toed ungulates,Eukaryota
1,gi|5835862|ref|NC_000845.1|,86.26,9823,Sus scrofa,pig,even-toed ungulates,Eukaryota
Your comma-separated output (-outfmt 10), showing human-readable taxonomy info.

Thursday, May 15, 2014

Hoping for LinkedIn Vetted API access

Just submitted a request for Vetted API access [source] on LinkedIn to do a little research project on transition probabilities. Signing up as a developer [source] only gives you access to company search, your own profile, and basic information from 1st-degree connections.

I understand the reasons, both noble (e.g. protecting user privacy) and ignoble (e.g. enforcing a closed ecosystem, like they did with CRM, which is quite frankly evil [source]). I'll just say it's more hassle than I was expecting for a simple research project.

Basically, I want to identify: 
  • People like me (starting from me)
  • Where they came from.
  • Infer where they are likely to go.
  • The co-variables that are strongly correlated (positively or negatively) with particular transitions and joint transitions.
Here's hoping that LinkedIn will provide access for my little research project. There isn't much I can do with basic profile information from 1st-degree connections; I need the entire network (~200k people) and non-basic profile information, such as companies/universities attended, to really get going.
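To make "transition probabilities" concrete, here is a toy sketch of the kind of estimate I have in mind (all the data below is made up; the real input would come from profile work histories):

from collections import Counter, defaultdict

# Hypothetical (from_role, to_role) pairs harvested from work histories
moves = [("PhD", "Postdoc"), ("PhD", "Data scientist"),
         ("Postdoc", "Data scientist"), ("Postdoc", "Professor"),
         ("PhD", "Data scientist")]

# Empirical transition probabilities P(to | from)
counts = defaultdict(Counter)
for src, dst in moves:
    counts[src][dst] += 1

for src, dsts in counts.items():
    total = sum(dsts.values())
    for dst, n in dsts.items():
        print("P({} -> {}) = {:.2f}".format(src, dst, n / total))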

The previous post on Tessella [source], is a very basic example of what I want to do (on a much larger scale of course) if I get vetted.

Will be using LinkedIn Python API [source]. As an aside, the code itself is beautiful. 

Saturday, April 19, 2014

Data Dive: Personality testing

Notice: I've been informed that the data were indeed crawled from a combination of Adult & Youth surveys. There is no scientific value here, just a fun plug for the power of numpy, matplotlib and a few lines of bash/awk to take a hack at some data.

The VIA Institute on Character offers a fun and informative ~10 min personality test. I was curious about how my results compared to the norm, so I took a little data dive.

Here are some humorous observations from a subset of their rich data-set (N=41513)...
  1. People admit to lacking Humility, Self-Regulation and Spirituality.
  2. Honesty, Love and Kindness are people's top priorities.
  3. Spirituality and forgiveness go (very weakly) hand-in-hand.

Observation #1
People admit to lacking Humility, Self-Regulation and Spirituality.

A priori, I expected the character attributes to have flat distributions (my null hypothesis), a straight line at p=0.04 (1/24 attributes). This couldn't be further from the truth for some attributes. 

In the upper right-hand corner of some images, you'll see a (+) or (-); this corresponds to a big deviation from the flat null hypothesis.
(+) attributes are ranked higher than expected.
(-) attributes are ranked lower than expected. 
Others, without a marker, are more or less flat, e.g. Curiosity and Humor.


Observation #2
Honesty, Love and Kindness are people's top priorities.
Another slightly more flashy way to view the data is through a rank-abundance plot. The top quartile (ranks 1-6) is dominated by these three attributes:


Observation #3
Spirituality and forgiveness go (very weakly) hand-in-hand.
Correlation coefficient plots by quartile show some very weak (+/- 0.15-0.2) correlations for top-ranked (1st Q) and bottom-ranked (4th Q) traits, but virtually none for average ranks (2nd and 3rd Q); a rough sketch of the computation follows after the lists below.

Q1 (Top Ranked attributes):
+) Spirituality / Forgiveness
+) Humor / Prudence+Perspective
-) Perseverance / Social Intelligence
-) Creativity / Gratitude
-) Curiosity / Perspective
-) Perspective / Creativity+Humility 

Q4 (Bottom ranked attributes):
+) Perspective / Hope+Humor
-) Appreciation / Forgiveness
-) Creativity / Humility

Note that correlation does not imply causality, and these correlations are very weak. They are, however, above the background average correlation of -0.02. These are humorous insights, and in no way should they be taken seriously.
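For the curious, here is a rough numpy sketch of one way such per-quartile correlations can be computed (the real data format isn't shown in this post, so everything below is synthetic and purely illustrative):

import numpy as np

# Synthetic stand-in: ranks[i, j] = rank (1-24) that respondent i gave attribute j
rng = np.random.default_rng(0)
ranks = np.array([rng.permutation(24) + 1 for _ in range(1000)])

# Indicator of "attribute landed in the top quartile (ranks 1-6)"
in_q1 = (ranks <= 6).astype(float)

# Pairwise correlation between attributes' top-quartile membership
corr_q1 = np.corrcoef(in_q1, rowvar=False)  # 24 x 24 matrix
print(corr_q1.round(2))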

Your Homework
Take the test for yourself at viacharacter.org
If you're keen to explore, redacted datasets (since taken down at the request of the VIA Institute on Character) and analysis scripts are available at https://github.com/rmharrison/viacharacter-analysis



Thursday, May 17, 2012

Infographic: The Face of Life Sciences

What does a life scientist really look like? Nifty facial averaging software,[1] combined with available government statistics, answers the question. Cheers!
The Face of Life Sciences
[PDF] [PNG]