I'm not a programmer, but I'm looking for someone who could write some
code to quickly download documents from the Federal Reserve, extract
the text, and create ngrams, as well as measure the total amount of
communication. The documents are available at
http://federalreserve.gov/newsevents/default.htm. I'm working on a
project that would focus on the level of Fed communication before,
during and after the financial crisis. If you're interested, please
reply, and I can provide further details about what I'm looking for
and what I'd like to be able to do with the information.

I'm working on my master's thesis on the topic of transparency in
monetary policy. Specifically, I want to look at the information
asymmetries between the Federal Reserve, Congress and the financial
markets. Since Congress and financial markets have different
incentives to learn about monetary policy, the Federal Reserve has to
adjust its communication strategy to avoid creating confusion. I'm
working on a simple model to show how these asymmetries were
exacerbated not only by the financial crisis, but also by the Fed's
unconventional response to the crisis. I need the data in order to
test my model and to see how the Fed has changed its communication in
recent years. I plan to combine the Fed communication data with data
from capitolwords.org to compare the timing of Fed communication and
Congressional concern with topics like inflation.

Here are the details of what I'm looking for. Federal Reserve
communication (press releases, Congressional testimony, speeches) is
just one part of transparency, but it is the part I really want to
focus on. I need time series data, going back to at least 2004, on the
words that the Fed has released in these documents. I would like to be
able to look at total words by document type, as well as ngrams for
specific words and phrases.
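
As a rough illustration of what "total words" and "ngrams" would mean in practice, here is a minimal Python sketch. The tokenizing rule, the function names, and the sample sentence are all assumptions made up for illustration; the sample is not real Fed text.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+(?:'[a-z]+)?", text.lower())

def ngrams(tokens, n):
    """Return every run of n consecutive words, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def count_ngrams(text, max_n=3):
    """Return (total word count, Counter of all 1..max_n-grams)."""
    tokens = tokenize(text)
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(ngrams(tokens, n))
    return len(tokens), counts

# Invented sample sentence, used only to show the mechanics.
sample = ("The Committee will maintain the federal funds rate. "
          "The federal funds rate is unchanged.")
total_words, counts = count_ngrams(sample)
print(total_words, counts["federal funds rate"])
```

Running the same function over each document, and storing the result alongside the document's date and type, would give the raw material for the time series.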

It would take me a very long time to do this by hand: downloading each
file individually, extracting the text from the PDFs, feeding those
files into a word counter to get a spreadsheet, combining the
spreadsheets, and finally ending up with something that I can work
with in Stata.
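
The last two steps (combining the counts and getting them into Stata) could be sketched roughly as below. This assumes each document has already been downloaded and converted to text, and is stored as a record with hypothetical `date`, `doc_type`, and `text` fields. The records shown are invented placeholders. The output is a plain CSV file, which Stata can read with `import delimited`.

```python
import csv
from collections import defaultdict

def monthly_totals(documents):
    """Sum document and word counts per (month, document type)."""
    totals = defaultdict(lambda: [0, 0])  # [n_documents, total_words]
    for doc in documents:
        key = (doc["date"][:7], doc["doc_type"])  # "YYYY-MM-DD" -> "YYYY-MM"
        totals[key][0] += 1
        totals[key][1] += len(doc["text"].split())
    return [(month, doc_type, n_docs, n_words)
            for (month, doc_type), (n_docs, n_words) in sorted(totals.items())]

def write_csv(rows, path):
    """Write the monthly totals to a CSV file with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["month", "doc_type", "n_documents", "total_words"])
        writer.writerows(rows)

# Invented placeholder records, not real Fed documents.
documents = [
    {"date": "2008-09-16", "doc_type": "press_release", "text": "placeholder text one"},
    {"date": "2008-09-29", "doc_type": "press_release", "text": "more placeholder text here"},
    {"date": "2008-09-23", "doc_type": "testimony", "text": "short placeholder"},
]
rows = monthly_totals(documents)
print(rows)
```

Calling `write_csv(rows, "fed_word_counts.csv")` (the file name is arbitrary) would then produce one row per month and document type.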

If you could get me something along the lines of the capitolwords.org
setup, plus a way to look at total words and split the time series by
type of document, that would be fantastic.

The document categories I'd like to be able to sub-divide the data
into are:
Press releases
Speeches
Testimony
Semi-annual Monetary Policy Report

The website is fairly simply structured. Documents going back to 2008
are mostly PDFs; before that, they are just .htm pages.
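
For the older .htm pages, the text can be pulled out with Python's standard library alone. The sketch below is one possible approach, and the sample page is invented. Any real page would likely also need its navigation menus and footers filtered out. The PDFs would need a separate third-party PDF library, which is not shown here.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())

def html_to_text(html):
    """Return the visible text of an HTML page as one string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)

# Invented sample page, used only to show the mechanics.
sample_html = ("<html><head><style>p {color: red}</style></head><body>"
               "<h1>Press Release</h1>"
               "<p>The Board announced a new facility.</p></body></html>")
page_text = html_to_text(sample_html)
print(page_text)
```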

The ngrams will help highlight which specific Fed programs were
discussed when (TALF, TAF, PDCF, TSLF, etc.), as well as the specific
topics of the communications (housing bubble, Bear Stearns, liquidity
crunch, shadow banking, Lehman Brothers, etc.).
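
Tracking a fixed list of programs and topics over time could be sketched as follows. The term list comes from the examples above. The matching rule (whole words, ignoring case) and the `(date, text)` input format are assumptions, and the sample texts are invented placeholders rather than real Fed wording.

```python
import re
from collections import Counter, defaultdict

TERMS = ["TALF", "TAF", "PDCF", "TSLF", "housing bubble", "Bear Stearns",
         "liquidity crunch", "shadow banking", "Lehman Brothers"]

def term_pattern(term):
    """Whole-word, case-insensitive pattern; spaces match any whitespace."""
    words = [re.escape(w) for w in term.split()]
    return re.compile(r"\b" + r"\s+".join(words) + r"\b", re.IGNORECASE)

PATTERNS = {term: term_pattern(term) for term in TERMS}

def term_counts_by_month(documents):
    """Map "YYYY-MM" -> Counter of term mentions in that month."""
    series = defaultdict(Counter)
    for date, text in documents:
        month = date[:7]
        for term, pattern in PATTERNS.items():
            hits = len(pattern.findall(text))
            if hits:
                series[month][term] += hits
    return series

# Invented placeholder texts, used only to show the mechanics.
documents = [
    ("2008-03-16", "Placeholder sentence mentioning Bear Stearns and the PDCF."),
    ("2008-03-11", "Placeholder sentence about the TSLF; the TSLF appears twice."),
    ("2008-11-25", "Placeholder sentence introducing the TALF."),
]
series = term_counts_by_month(documents)
for month in sorted(series):
    print(month, dict(series[month]))
```

Because the patterns match whole words only, a mention of TALF is not also counted as a mention of TAF.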

If you have any questions, feel free to send me a message.