The Power of Data Investigation Is For Us All – Not Just Investigative Journalists

The Panama Papers affair shows just how important data is in finding hidden truths.


Ask editor Mar Cabra of the International Consortium of Investigative Journalists (ICIJ), the group behind The Panama Papers investigation, what she thinks her team does, and the answer is pretty much what Woodward and Bernstein would have said 40 years ago: “We use technology to tell great stories.”

In the heyday of The Washington Post’s takedown of a corrupt President, Woodstein relied on a phone, perhaps a fax, and a library of clippings and information sources.

Now, reporters depend on data: huge amounts of it, all of which has to be probed, sifted and worked with. That’s why Cabra and the rest of the global team of investigators have embraced data-based techniques as core to what they do, and as important as old-school tenacity and a nose for a story.

Bigger than Snowden and more relevant to business

We now call this data-driven journalism, and it has just pulled off its biggest coup: The Panama Papers. Not only does the exposure of the activities of clients of a Panamanian law firm qualify as the world’s largest financial scandal, it is also, at 2.6 terabytes and 11.5 million documents, far larger than anything Snowden or WikiLeaks managed. Let’s review how this happened.

When an anonymous source passed the ICIJ a huge amount of confidential internal information from Panamanian law firm Mossack Fonseca, Cabra knew a major international scoop was possible. The problem was that the data was too complex to be analysed by traditional means.

Cabra knew her team would need a sophisticated tool to analyse this data set, one that could process a large volume of highly connected data quickly, easily and efficiently.

It’s worth noting that such analysis had to be accessible to investigative journalists around the globe, regardless of their technical abilities (the vast majority were not technical). It also had to reveal patterns in a vast pool of unstructured information, much of it in scanned bank statements and so not easily searchable by conventional means.

Beyond traditional ways of working with data

Cabra had been exposed to complex data challenges before, and so knew graph databases were the best solution. “It’s a revolutionary discovery tool that’s transformed our investigative journalism process,” she confirms.

Why is this technology highly suitable? Graph databases excel at spotting relationships inside data, at scale. As Cabra says, “Just by expanding dots, my reporters found a lot of information that they had not found previously, finding lots of connections we’d missed when we looked at the documents individually.”
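To make "expanding dots" concrete, here is a minimal sketch of what such an expansion amounts to: start from one entity and follow its connections outwards, hop by hop. This is not the ICIJ's tooling. The entity names, the `EDGES` list and the `build_adjacency` and `expand` helpers are all invented for illustration.

```python
from collections import deque

# Hypothetical toy data: every name below is invented for illustration.
EDGES = [
    ("Person A", "Shell Co 1"),
    ("Shell Co 1", "Law Firm X"),
    ("Law Firm X", "Shell Co 2"),
    ("Shell Co 2", "Person B"),
    ("Person C", "Bank Y"),
]


def build_adjacency(edges):
    """Turn a list of (a, b) links into an undirected adjacency map."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


def expand(adjacency, start, max_hops):
    """Return {node: hops} for every node within max_hops of start."""
    distances = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distances[node] == max_hops:
            continue  # do not expand beyond the requested radius
        for neighbour in adjacency.get(node, ()):
            if neighbour not in distances:
                distances[neighbour] = distances[node] + 1
                queue.append(neighbour)
    return distances
```

Expanding "Person A" by one hop surfaces only its direct link, while a wider radius reaches "Person B" through the intermediaries. That is the kind of connection that is easy to miss when each document is read on its own.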

[Image: Edward Snowden. Caption: The Panama Papers were more detailed than Snowden's leak]

How can graph databases do that better than some of the more traditional ways of working with data, such as relational database management systems (RDBMS) like Oracle? Instead of using tables as relational systems do, graph databases store data as nodes and the relationships between them, a structure that is better suited to analysing interconnections.

Instead of breaking up data artificially the way a relational database does, graphs use a notational formalism that is more closely aligned with the way humans natively think about information.

Once that data model is coded in a scalable architecture, a graph database is effectively matchless at mining connections in huge and complex datasets.
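The difference can be sketched in a few lines. The example below is an illustration only, not a description of how Oracle, Neo4j or any other product works internally. It asks the same question twice ("who is two links away from this entity?"), once with a self-join over a table and once by following stored neighbour references. The rows and both function names are invented for the purpose.

```python
import sqlite3

# Hypothetical rows: invented names, stored as a plain two-column table.
ROWS = [
    ("Person A", "Shell Co 1"),
    ("Shell Co 1", "Shell Co 2"),
    ("Shell Co 2", "Person B"),
]


def two_hops_relational(rows, start):
    """Find nodes two links from start using a self-join on a table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE links (src TEXT, dst TEXT)")
    conn.executemany("INSERT INTO links VALUES (?, ?)", rows)
    query = """
        SELECT second.dst
        FROM links AS first
        JOIN links AS second ON first.dst = second.src
        WHERE first.src = ?
    """
    result = [row[0] for row in conn.execute(query, (start,))]
    conn.close()
    return result


def two_hops_graph(rows, start):
    """Find the same nodes by following each node's stored neighbours."""
    neighbours = {}
    for src, dst in rows:
        neighbours.setdefault(src, []).append(dst)
    return [
        far
        for near in neighbours.get(start, [])
        for far in neighbours.get(near, [])
    ]
```

Both functions return the same answer. The point of the comparison is that every extra hop adds another join to the relational query, whereas in the graph form it is one more step along links that are already part of the data.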

That matters if you are trying to spot hidden connections. As Cabra says, “Relationships tell you where the criminality lies, who works with whom, and so on… Understanding relationships at huge scale is what graph techniques are so great at.”

A way to have our own ‘Google’ moment

But if you think about it, it’s not just reporters, be they Woodstein or Cabra, who need to do that.

All business leaders trying to address large-scale connected data issues have their own ‘investigations’ to mount, involving building and manipulating large data structures.

That’s why, from start-ups trying to disrupt their markets to brands trying to work with data to provide a better service, spot market trends or deliver super-personalised recommendations to customers in real time, graph database technology is more and more the weapon of choice for the serious business data manipulator.

Intriguingly, Google, Facebook and LinkedIn have been using graph databases for years and have built their businesses on this technology. Google’s PageRank algorithm is really a large-scale, perpetual graph-like investigation of the links that knit together the World Wide Web; Facebook and LinkedIn ‘investigate’ our real-time networks and connections to build our ‘social graphs’ in just the same way.
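For readers curious about what a PageRank-style investigation of links looks like, here is a minimal textbook sketch using power iteration. It is not Google's production system. The `pagerank` function, the three-page example and the damping value of 0.85 (a commonly quoted default) are assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook power-iteration PageRank over {page: [pages it links to]}."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with the same small baseline score.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its score on, split evenly across its links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its score evenly.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical three-page web: "a" and "b" both link to "c"; "c" links to "a".
LINKS = {"a": ["c"], "b": ["c"], "c": ["a"]}
```

In this toy web, "c" ends up ranked highest because two pages point to it, and "b" lowest because nothing points to it. The score comes entirely from the link structure, not from anything on the pages themselves.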

[Image: Silicon Valley. Caption: Silicon Valley is essentially full of graphs]

As graph database technology has matured, such highly scalable connected data analysis is now available to us all. The analyst community is predicting huge take-up and interest, with Forrester Research claiming that by 2017, 25% of all enterprises will be using graph databases, while Gartner reports that graphs are the fastest-growing category in database management systems, predicting that 70% of leading companies will pilot a graph database project of significance by 2018.

It turns out that graph databases can do amazing things for all sorts of firms, way beyond the extraordinary use case that is The Panama Papers and what it shows when it comes to breaking unstructured data’s secrets.

In any context where large, complex datasets need to be mined, graphs are increasingly the tool of choice. And in the digital age, with IoT and the era of the petabyte just around the corner, large connected datasets are more of a factor. Think real-time online recommendations in retail, film, art, wine – even on dating sites.

Think, too, of fraud detection in banking, insurance and online, where you can be alerted to scams with a high degree of accuracy. Graph technology is also the engine behind many enterprise network management systems, alerting managers to vulnerabilities, and it is increasingly taken up by medical researchers for the investigation of diseases and cells, and even by government for security and welfare applications.
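As one illustration of the fraud case, the sketch below groups accounts that are linked through shared contact details, a pattern often associated with organised fraud rings. It is a simplified assumption about how such a check might look, not any bank's actual method. The account IDs, the details and the `linked_groups` helper are all invented.

```python
# Hypothetical accounts and the contact details each one registered with.
ACCOUNTS = {
    "acct-1": {"phone:555-0100", "addr:1 Example St"},
    "acct-2": {"phone:555-0100", "addr:9 Sample Rd"},
    "acct-3": {"addr:9 Sample Rd"},
    "acct-4": {"phone:555-0199"},
}


def linked_groups(accounts):
    """Group accounts connected through any chain of shared details."""
    # Index: which accounts registered each detail.
    by_detail = {}
    for account, details in accounts.items():
        for detail in details:
            by_detail.setdefault(detail, set()).add(account)

    seen = set()
    groups = []
    for account in accounts:
        if account in seen:
            continue
        # Walk outwards from this account through shared details.
        group = set()
        stack = [account]
        while stack:
            current = stack.pop()
            if current in group:
                continue
            group.add(current)
            for detail in accounts[current]:
                stack.extend(by_detail[detail] - group)
        seen |= group
        groups.append(group)
    return groups
```

Here "acct-1" and "acct-3" share nothing directly, yet they land in the same group because each shares a detail with "acct-2". A row-by-row check would not see that link; following the relationships does.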

What links all these hard problems is that they are ultimately about complicated relationship data. That’s what graph databases are great at, and in our super-connected Information Age, a lot more of us will soon be able to unlock the data that will change our businesses forever.

The author is co-founder and CEO of Neo Technology, the company behind the world’s leading graph database, Neo4j.
