
Automating watchdog reporting

COMMENTARY | July 22, 2009

Standardized smart data input and straightforward, computer-programmed analysis will make conflicts of interest leap out at reporters, writes J.H. Snider.



By J.H. Snider

Could watchdog journalism be automated? Am I joking to even propose such an innovation be taken seriously?
To be sure, complete automation isn’t possible. But with new semantic Web technologies, combined with appropriate government policies, great progress toward this goal could be made in the next few years.  
Contrast this optimistic vision with the current drumbeat of negative news about new information technology killing both newspapers and watchdog journalism. Only in retrospect may we come to recognize the 20th century as a journalistic dark age, with prohibitively expensive investigative reporting.
I want to focus mostly on the potential of the semantic Web to enhance research on government officials’ conflicts of interest, often described as “government ethics.” Journalists who want to investigate such conflicts are told to “follow the money.” The problem is the difficulty in doing so.
The semantic Web advantage
The semantic Web (also sometimes called Web 3.0, the Giant Global Graph, the Database-In-The-Sky, and the Web-Of-Data) is often contrasted with today’s document-centric Web. The key difference is that the semantic Web makes possible a much higher degree of automation in data integration and analysis. Today’s document-centric Web puts vast amounts of information at the public’s fingertips, but humans have to do the bulk of the cumbersome data integration and analysis.
Let’s illustrate how the semantic Web would change a typical follow-the-money investigation. Consider the following scenario: The Mayor of City X appoints a blue ribbon commission to make recommendations on how to reduce school construction costs. The mayor uses the adjective “blue ribbon” to convey that the commission members are “independent experts.” The watchdog journalist wants to find out the extent to which this characterization is accurate. A thorough job answering this question would involve checking into tens of thousands of potential financial relationships between commissioners and the school system, and commissioners and the mayor. It probably would take a long time.
Let’s assume that today’s state-of-the-art government transparency goals have already been realized: the city has collected the key conflict-of-interest information (e.g., commissioners’ gifts, campaign contributions, and government contracts) and posted it online in a structured format. The structured format allows for the download and direct search on data such as the names of campaign contributors.
Today’s watchdog journalist can do two basic types of search on this data: a relational database search and a Google-like document search.
The Relational Database Search. The journalist’s question is simple: “Is the City X commission on school construction costs biased?” However, answering it with conventional database queries turns out to be anything but simple. To answer the question thoroughly, the journalist would have to, at a minimum, search across all combinations of the following:
1) every commissioner and the close relatives of every commissioner, plus the mayor, the school board members, and their close relatives;
2) the financial entities linked to each of those persons, along with the names of each entity’s subsidiaries and holding companies;
3) the potential benefits provided to the mayor or school board members, including gifts and campaign contributions;
4) the potential benefits provided by the mayor and school board members, including job contracts, lobbying access, zoning changes, and building permits; and
5) citizen and professional auditor complaints that the published data are flawed.
And since quid pro quos may not happen simultaneously, these searches should ideally be repeated frequently (e.g., daily) for years into the future.
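The combinatorial burden of this kind of manual search can be sketched in a few lines of code. Everything here is invented for illustration: the record layout, the names, and the single "conflict" rule are hypothetical stand-ins for a real city disclosure dataset and schema.

```python
from itertools import product

# Hypothetical disclosure records, as if downloaded from a structured
# city data portal (all names and fields invented for this sketch).
payments = [
    {"payer": "Acme Builders", "payee": "Commissioner Jones", "kind": "consulting fee"},
    {"payer": "Acme Builders", "payee": "Mayor Lee", "kind": "campaign contribution"},
    {"payer": "City X Schools", "payee": "Acme Builders", "kind": "construction contract"},
]

commissioners = ["Commissioner Jones", "Commissioner Diaz", "Commissioner Wu"]
relatives = {"Commissioner Jones": ["R. Jones"], "Commissioner Diaz": [], "Commissioner Wu": []}
benefit_kinds = ["gift", "campaign contribution", "consulting fee", "construction contract"]

# Every (person, benefit kind) pair is, in effect, one manual database
# query -- before even considering subsidiaries, holding companies,
# or repeating the whole search daily for years.
persons = commissioners + [r for rs in relatives.values() for r in rs]
manual_queries = list(product(persons, benefit_kinds))
print(len(manual_queries), "queries for just one narrow slice of the problem")

# A brute-force scan for one direct conflict: a commissioner paid by an
# entity that is a party to a school construction contract.
contractors = {p["payer"] for p in payments if p["kind"] == "construction contract"}
contractors |= {p["payee"] for p in payments if p["kind"] == "construction contract"}
flagged = [p for p in payments
           if p["payee"] in commissioners and p["payer"] in contractors]
for p in flagged:
    print("possible conflict:", p["payee"], "received a", p["kind"], "from", p["payer"])
```

Even this toy version requires 16 queries for four people and four benefit categories; adding relatives of the mayor and school board, corporate subsidiaries, and daily re-runs multiplies the workload into the thousands of queries described above.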
Of course, even if all this information were readily available online (which it is not), today’s journalist cannot possibly afford the time to ask all these questions with any deadlines in mind, let alone over an extended period of time. The result is he or she will most likely just look for the most obvious, direct conflicts of interest, such as whether the commissioners are in the school construction business and have done work for the school system. Thus, a lot of potentially significant conflicts of interest do not get covered. 
The Google-like Document (“Dumb”) Search. The journalist asks the same simple question: “Is the City X commission on school construction costs biased?” But now that question is entered into a Google-like search box. Unless another watchdog journalist has already done the relational database type queries above and then published the results, this query is unlikely to generate any meaningful results. For practical purposes, the relevant data on the Web are inaccessible.
Now let’s do a semantic (“Smart”) search on the same question. The result we get is the same result we would have gotten by doing all those thousands of queries with the relational database—except that the whole search process has been automated. The search engine is able to automatically integrate all the relevant data scattered over the Web and logically infer all the subsidiary queries from the simple main query. 
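The kind of inference such a search engine performs can be illustrated with a toy forward-chaining engine over subject-predicate-object triples. The predicates, names, and two rules below are my own invented stand-ins for what a real semantic Web rule set would express in standard languages such as OWL.

```python
# Disclosed facts as triples. No single record states a conflict.
triples = {
    ("Commissioner Jones", "receivedPaymentFrom", "Acme Builders"),
    ("Acme Builders", "subsidiaryOf", "BigCo Holdings"),
    ("BigCo Holdings", "holdsContractWith", "City X Schools"),
}

def infer(facts):
    """Apply two simple rules repeatedly until no new facts appear."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        new = set()
        for (a, p1, b) in facts:
            for (c, p2, d) in facts:
                if b != c:
                    continue
                # Rule 1: payment from a subsidiary counts as payment
                # from its parent company.
                if p1 == "receivedPaymentFrom" and p2 == "subsidiaryOf":
                    new.add((a, "receivedPaymentFrom", d))
                # Rule 2: payment from a school contractor is a
                # potential conflict of interest.
                if p1 == "receivedPaymentFrom" and p2 == "holdsContractWith":
                    new.add((a, "hasPotentialConflictWith", d))
        if not new <= facts:
            facts |= new
            changed = True
    return facts

derived = infer(triples) - triples
for t in sorted(derived):
    print(t)
```

From three disclosed facts, the two rules derive that Commissioner Jones was indirectly paid by a school contractor, a conclusion that exists in no single record and that a Google-like document search could never surface.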
The query could be made even more powerful. Instead of asking whether just this particular commission has any significant conflicts of interest, the query could ask for a daily automated update (like Google’s News Alert service) on whether any of the city’s commissions, boards, and elected officials has generated a conflict of interest within the past 24 hours. I live in Anne Arundel County, Md., population 512,790. In addition to a county council and school board, it has 38 standing commissions and boards and many more ad hoc ones. With today’s technology, even a team of 100 reporters working full time could probably not answer such questions. The local daily newspaper, The Capital, has only two full-time reporters covering county government. Not only does The Capital rarely report on government ethics, but when it does, it tends to cover only the most obvious and direct conflicts of interest.
Semantic Web searches may seem like magic because a simple search can generate such a complex, well-structured query. But the logic behind such seemingly complicated searches is surprisingly simple. Just like the simple law of gravity can generate extraordinarily complex real world phenomena when objects are set in motion, a few simple logical rules modeling principals, agents, and their conflicts of interest can instantaneously weave together a complex but purposeful tapestry of queries. The special ingredient that transforms the simple into the complex is the data modeling language.
The Need for a “Bias Modeling Language”
The key missing technological piece to automate watchdog journalism is the development of a data model that describes the logic of government officials’ conflicts of interest. The semantic Web can only work its magic on data that have been designed and published based on a field-specific model that makes data integration and logical inference possible.  
I call this proposed model the Bias Modeling Language (BML). Countless other standardized data models are currently under development. Google, for example, on May 12, 2009 released specifications for modeling languages (called “Rich Snippets”) to describe product reviews and people. Similarly, the Federal government on May 21, 2009, launched Data.gov, which uses the Dublin Core modeling language to describe reports. But none, to my knowledge, models conflicts of interest. 
The key differentiating word in BML is “bias.” Close substitutes include objectivity, fiduciary, trustee, and conflict-of-interest. I chose bias because of its simplicity and popular use to describe the behavior of someone with a conflict of interest. 
BML would be based on the academic theory known as principal-agent theory. Wherever there is specialization of labor, there are principal-agent relationships. The principal delegates a task to a trusted agent because the agent can do the task more efficiently than the principal. But the agent may have conflicts of interest that bias its behavior in a way that harms the principal. One way for a principal to reduce such agent bias is to require the agent to disclose all conflicts of interest.   
The core of BML is the specification of a principal, agent, and any conflicts of interest the principal and agent may have. Each of these data categories can in turn be subdivided into many subcategories. For example, agents may have sub-agents (e.g., commissioners may be sub-agents of the mayor) and conflicts of interest can be broken down into many categories and sub-categories. 
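As one possible sketch of that core, the principal-agent-conflict categories might be modeled as follows. BML is a proposal, not a published specification, so every class and field name here is my own illustration of the structure the text describes.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Actor:
    """Any person or entity in the model (voter, official, company)."""
    name: str

@dataclass
class ConflictOfInterest:
    """A benefit flowing to an agent from outside the principal."""
    agent: "Agent"
    source: Actor            # who provides the benefit
    category: str            # e.g. "gift", "campaign contribution"
    description: str = ""

@dataclass
class Agent(Actor):
    principal: Optional[Actor] = None      # whom this agent serves
    sub_agents: List["Agent"] = field(default_factory=list)

# Commissioners as sub-agents of the mayor, who in turn serves the voters.
voters = Actor("Voters of City X")
mayor = Agent("Mayor Lee", principal=voters)
commissioner = Agent("Commissioner Jones", principal=mayor)
mayor.sub_agents.append(commissioner)

conflict = ConflictOfInterest(
    agent=commissioner,
    source=Actor("Acme Builders"),
    category="consulting fee",
)
print(conflict.agent.name, "<-", conflict.category, "from", conflict.source.name)
```

The nesting captures the sub-agent relationship from the text: a query about the voters' agents can automatically walk down the chain from mayor to commission without the journalist spelling out each level.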
In the case of the commission example above, the relevant principal is the voter and the agent is the commission. Specified at the highest level of generality (that is, at the top of the conflict of interest taxonomy), the conflict of interest is any potential payment to the commission members for their services other than their publicly reported compensation specifically for their commission work. Specified more narrowly (that is, near the bottom of the conflict of interest taxonomy), the conflict of interest could cover just payments from the school system. 
Once this core model is in place, there could be many extensions. For example, there could be links to other government databases that describe a financial entity’s holding company, subsidiaries, and major corporate investments. There could be links to government and commercial databases with valuations of real estate property (e.g. to automatically compare the price that an elected official paid for property versus the market price). There could be data items to describe data audits and other procedures to ensure data quality. And there could be data hooks for citizens to give feedback to government agencies and private watchdogs concerning missing or misleading data.   
Like the current Web, all this information would be published in a decentralized way (e.g., like the countless blogs on the Web) and aggregated in a centralized way (e.g., like Google’s search engine). But the journalist could customize the search engine to specify just the types of principals, agents, and conflicts of interest he wants to find—and how frequently he wants the search engine to conduct searches on his behalf. 
The impact of automating watchdog reporting may be greatest at the local level, where in the United States there are more than 40,000 political units. At the national and state levels, non-profits such as the Center for Responsive Politics and the National Institute on Money in State Politics already aggregate and process large amounts of conflict of interest information for journalists’ use (although still only a small subset of what is proposed here). But until now it has not been economical to provide even minimal versions of such services at the local level. This is largely because large political districts have much greater economies of scale than smaller ones; for example, thousands of reporters make annual use of the Center for Responsive Politics Web site, but at most a handful would likely use a comparable site for any single locality. BML, by standardizing government disclosure of conflict of interest information in a machine-readable format, could make it affordable for semantic Web information aggregators to cover even small towns.
Another advantage of BML is that it could automate and thus make it easy to do democratic audits of local government conflict of interest disclosure policies. This would apply pressure on local governments to improve those policies, thus enhancing their democratic accountability.
BML could also democratize watchdog journalism, allowing citizen journalists to do the type of investigative work previously reserved for well compensated and highly trained professional journalists. Just as buying salt in the Middle Ages was restricted to the nobility due to the high cost of its production and distribution, the expensive manual labor associated with investigating government ethics may come to be recognized as a historical artifact.
Lastly, it is important to note that BML has many important applications outside the realm of government ethics, including marketplace ethics and consumer protection. The need for a universal conflict-of-interest disclosure framework was touched on in my book, “Future Shop: How New Technologies Will Change What We Buy and How We Buy” (iUniverse 2008), co-authored with Terra Ziporyn, and is an important part of making the case for BML.
A close precedent for BML is eXtensible Business Reporting Language (XBRL), which is used to report financial information, including assets, income, and cash flow. XBRL provides a taxonomy of items on financial statements and then tags those items in a machine-readable way for posting on the Web. In the past, financial statements might be posted online, but the data had to be manually cut and pasted, often from different documents, into a database for analysis. With XBRL, the data in financial documents are tagged and aggregated, so it is a simple matter for an analyst to ask questions that might otherwise have been prohibitively expensive to ask: for example, how CEO pay relates to shareholder returns and to the proportion of outside members on the board of directors in companies with more than $1 billion in sales for 2007 and 2008 (a question about the relationship between corporate governance and CEO performance).
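The basic mechanics of tagging can be shown with a deliberately simplified, XBRL-like fragment. This is not a valid XBRL instance document; real XBRL uses namespaced taxonomy concepts (such as US GAAP elements), reporting contexts, and unit declarations. But it illustrates how tagged facts turn manual cut-and-paste into a one-line computation. The company and figures are invented.

```python
import xml.etree.ElementTree as ET

# A toy, XBRL-flavored filing: each financial fact is tagged with a
# taxonomy concept so software can extract comparable numbers from
# many filings without any manual transcription.
filing = """
<filing company="ExampleCo" year="2008">
  <fact concept="Assets" unit="USD">1200000000</fact>
  <fact concept="NetIncome" unit="USD">90000000</fact>
  <fact concept="CEOCompensation" unit="USD">4500000</fact>
</filing>
"""

root = ET.fromstring(filing)
facts = {f.get("concept"): int(f.text) for f in root.iter("fact")}

# Once the facts are machine readable, a ratio question like "CEO pay
# as a share of net income" is a single expression, repeatable across
# every filing an aggregator collects.
ratio = facts["CEOCompensation"] / facts["NetIncome"]
print(f'{root.get("company")}: CEO pay is {ratio:.1%} of net income')
```

Run across thousands of filings, the same few lines answer the governance questions above; the expensive step was always the transcription, not the arithmetic.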
Eleven countries outside the United States are in various stages of adopting XBRL for the standardized reporting of business financial information. Effective June 15, 2009, the U.S. Securities & Exchange Commission required large public companies to report their financial statements in XBRL. On January 24, 2009, President Obama announced the launch of Recovery.gov for recipients of the close to $1 trillion in Federal government stimulus money to report their expenditures. Beginning October 2009, recipients are expected to start reporting their data in a variation of XBRL. On May 15, 2009, the Ranking Minority Member of the U.S. House Committee on Oversight and Government Reform introduced legislation to mandate the use of XBRL for all Federal agency financial reporting.
One important way that BML and XBRL differ is that XBRL is only designed to provide information about government (and business) performance, not conflicts of interest. A second difference is that XBRL is not a full-fledged semantic Web modeling language—although it is evolving to become one. Like BML, XBRL tags data in a machine readable way and posts it to the Web. But its data items lack important semantic Web features, including unique Web identifiers (known as URIs) and robust logical integration, which enable webwide automated data search and analysis. 
To be realistic, it will probably take at least several decades to fully realize the benefits of BML. Laws, for example, will have to be rewritten in a new data modeling language that allows them to be automatically linked to such government resource allocations as budgets, licenses, and zoning changes. Where beneficiaries are already clearly specified, such as the tens of thousands of earmarks members of Congress now annually submit to appropriations committees, BML could, with minimal changes to the structure of current law, allow for instantaneous conflict of interest analysis. But in many other cases the translation from laws into budgetary expenditures and other government perks may be much harder, because, unlike Congressional earmarks, the beneficiaries are not explicitly specified. For example, laws that characterize beneficiaries in human-readable geographic terms could in the future also be specified precisely in machine-readable terms using Geographic Information System coordinates (Recovery.gov has already mandated such coordinates so the location of stimulus fund recipients can easily be identified on a map). Similarly, when a legislature changes a zoning law, the change could be logically linked to an executive branch permit office, so the beneficiaries of the new law could be automatically identified.
Despite the desirability of such enhancements as machine readable laws, BML doesn’t have to be implemented all at once. Data models can be useful even if they start off simple. For example, Google’s data model for product reviews only includes the following fields: writer of the review, date the review was written, the rating, and, for items with multiple user reviews, the number of reviews and average rating. This model is actually more complex, however, because the writer-of-the-review data item is linked to the data model for people, which, in turn, is likely to be linked to a planned data model for organizations.
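That minimal review model, with its link from review to reviewer, might be sketched as plain records like these. The field names approximate the published description of the model and are not quoted from Google's specification.

```python
# A person record from a hypothetical people model.
person = {"name": "A. Reviewer", "affiliation": "Example Blog"}

# The minimal review model: writer, date, rating -- with the writer
# field linking into the separate people model rather than repeating it.
review = {
    "reviewer": person,       # link, not a copy: models stay composable
    "date": "2009-05-12",
    "rating": 4,
}

# For items with multiple reviews, an aggregate record.
aggregate = {"review_count": 23, "average_rating": 4.2}

print(review["reviewer"]["name"], "rated this item", review["rating"])
```

The link is the important part: because the reviewer field points into a separate person model, the review model can stay tiny while still growing in expressive power as the linked models grow, which is exactly the incremental path suggested for BML.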
Providing conflict of interest data about only a few categories of data, such as gifts and campaign contributions to elected officials, would be enough for BML to provide a valuable service. Over the years, I would expect a BML 1.0, 2.0, 3.0, etc. as the range of data expands. BML could also eventually be integrated into a more general principal-agent modeling language.
The Implementation Politics
Designing BML will probably be relatively easy compared to getting governments to adopt it. Elected officials have little desire to make their conflict of interest information more readily available to the public, if only because those conflicts of interest provide a key motivation for special interests to contribute to their campaigns. Without those conflicts of interest, incumbents’ advantage over potential challengers could be much diminished.
Nevertheless, not all politicians are obsessed by the desire for reelection and every little electoral advantage they might secure for themselves by making conflict-of-interest information hard to access. In places with volunteer and non-professional legislative bodies, such as rural towns in Vermont or Minnesota or Oregon, it may be relatively easy to get elected representatives to adopt such technology. But it’s also true that in such places such technology may be least needed. 
The precedent of Recovery.gov may offer the greatest promise for government implementation of BML. The Obama administration mandated that if local governments want part of the almost $1 trillion in stimulus funds, they must account for the uses of those funds by posting them to the Web and on Recovery.gov in a standardized, easily searchable format such as XBRL (a final standard has not yet been publicly announced). What could do more to ensure that local governments use funds accountably than mandating that those who spend such funds don’t have hidden conflicts of interest? No company would allow purchasing managers to have significant hidden conflicts of interest. Neither should the federal government.
Note that the role of the government would be limited to publishing core BML data. Supplemental BML data, such as thesauri for government BML terms and BML metadata describing the quality of the government’s BML data, would be published by private entities. The search engines that aggregate and develop customized user interfaces for all that data would primarily be private entities, although the government would be expected to offer at least a plain vanilla data viewer, just as the SEC does for the XBRL data it collects. Note, too, that there is no guarantee that, even if the government makes relevant data available in a BML-friendly format, it will be accurate. Since garbage in will generate garbage out, this is a serious concern. However, by increasing the transparency of fraud and by providing well-structured hooks for citizens to report it, BML should make possible community policing in a way that hasn’t been possible with today’s government disclosure technology. Still, data quality is a crucial implementation issue that, if handled poorly, could render BML’s results worthless.
Most current proposals to strengthen watchdog journalism in the United States focus on improving journalism’s revenue model. The advent of BML promises to radically alter journalism’s cost model.  Even if the revenue going to journalism held steady or increased by a factor of ten, it would still be prohibitively expensive to answer many of the vital questions a watchdog journalist should ask. Thus, even if BML doesn’t solve all the problems currently plaguing watchdog journalism, it does solve vital problems that aren’t being addressed in the current debate over how to fix journalism’s revenue model. And if the decline in costs is great enough—as implied by the goal of automating watchdog journalism—this could more than compensate for the drop in revenue.  
Rather than ending a golden age of watchdog journalism, new information technology may be offering us the opportunity to start one. But like putting a man on the moon, automating watchdog journalism requires more than just technology. It requires elected leaders motivated by a bold vision and the will to overcome many great obstacles; in this case, primarily political obstacles. Such leaders may be in short supply, but they are the ones who leave the greatest mark on history.
--J.H. Snider, the president of iSolon.org, has written extensively about media and democratic reform.  For more information, see iSolon.org’s BML Semantic Web Project.

The NiemanWatchdog.org website is no longer being updated. Watchdog stories have a new home in Nieman Reports.