Courtesy of Doug Henschen of Information Week
A White House plan to step up research on big data
analytics sounds promising, but agencies could save big bucks
through consolidation, collaboration, and cost
sharing.
The Obama Administration last week unveiled a
"Big Data Research and Development Initiative" that will see at
least six government agencies making $200 million in additional
investments to "greatly improve the tools and techniques needed to
access, organize, and glean discoveries from huge volumes of
digital data."
The
big data initiative sounds good in theory, and I'm all for
promoting U.S. competitiveness in math and science. But after
sitting through nearly two hours of presentations on the feds' big
data initiative, I fear those investments will be spread too thinly
among too many agencies that aren't collaborating.
It's encouraging that the White House is at least aware of
all the agencies involved in data- and compute-intensive research.
The administration released a
fact sheet that listed at least 80 projects and
initiatives across a dozen federal agencies, including the
Department of Defense, Department of Homeland Security, Department
of Energy, Health and Human Services, and Food and Drug
Administration.
Who knew the government was funding so much data-driven
research? The White House issued this fact sheet as if to say,
"Look how much we're doing already!" But when you start reading
about all the separate initiatives and all of the high-performance
computing labs and research facilities already in place, it makes
your head spin. As a taxpayer, it pains me to see so many examples
of apparently duplicative research, staff, and infrastructure.
The big data initiative was prompted in part by a December 2010
report by the President's Council of Advisors on Science and
Technology (PCAST) on
"Designing a Digital Future," which found the U.S. is
investing too little in networking and IT research. Part of the
reason we're not spending "enough" is that we're spreading
investments among agencies conducting R&D for their respective
fields rather than funding networking and IT research that could
benefit everyone.
It was a good sign that last week's presentation kicked off with
the announcement of an initiative between the National Science
Foundation and the National Institutes of Health to fund 15 to 20
research projects to the tune of $25 million. The idea behind
this Big Data
Solicitation is to seed and provide direction for
initiatives that will speed data-driven scientific discoveries
related to health and disease. What's more, it's an invitation to
academia, non-governmental organizations, and the private sector to
participate. This is exactly the kind of collaborative effort I
think we need.
But after a promising start, the four speakers who
followed--from the U.S. Geological Survey, the Department of
Defense, the Defense Advanced Research Projects Agency, and the
Department of Energy--seemed more intent on talking about their
unique initiatives and less focused on how they could collaborate
with other agencies. Amid the din of acronyms and price-tag-unknown
projects, the same terms kept coming up: data volume, data variety,
modeling and algorithms, data visualization, making information
actionable, and so on.
It all reminded me of a conversation I had with
Don Burke a couple of years ago on the topic of the lack
of cooperation, collaboration, and consolidation among government
agencies involved in national security. "Every agency says, 'I have
unique needs.' Then their IT providers say, 'I will give you the
100% solution for that need, but you have to give us all this money
to create a unique solution,'" explained Burke, "doyen" of
Intellipedia, an intelligence-community-wide wiki started in 2006
by the Office of the Director of National Intelligence.
Intellipedia aims to help the intelligence community connect the
dots on threats by collapsing the walls between data silos. Reading
through all the big data projects and initiatives the government
already has on the table, I think there's an opportunity to do more
shared big-data research and create shared big-data platforms.
Yes, the U.S. Geological Survey, NASA, the Department of
Defense, and the National Institutes of Health are doing very
different types of data-driven research and analyses, but they're
all grappling with the use of unstructured data and large-scale
machine data, they're all pushing the envelope on data mining, and
they're all looking for better data visualization and reporting
techniques.
Johns Hopkins, for one, believes in big data collaboration
across disciplines. Dr. Peter Greene, Johns Hopkins' chief medical
information officer, tells me that the institution's oncology
researchers are collaborating with the university's Department of
Astronomy. The cancer researchers face the big data challenge of
studying the human genome, which consists of 3 billion base pairs
of DNA. Johns Hopkins' Department of Astronomy, meanwhile, has a
data center with rack upon rack of compute power applied to
large-scale computational astronomy calculations. Why build a
separate data center when one can handle both astronomy and
healthcare calculations?
The government's hugely important
data center consolidation plan didn't come up at all
during last week's announcements. So what about assessments of
compute-power requirements and staffing needs? Are our current labs
anywhere near maximum utilization? It strikes me that consolidating
high-performance computing centers and relying on cloud delivery of
services to multiple agencies could go a long way toward cutting
the big cost of big-data analysis.
If we're to avoid the problem identified in the original PCAST
report--spreading budgets too thinly across too many agencies
studying parochial requirements--these departments and agencies
must recognize that there's a huge opportunity for their research
dollars to go further. If they would only give up a bit of control
and a bit of their "unique" agendas and a bit of their precious
budgets, we could be creating big data research and systems for the
common good.