One of the more interesting applications of statistics is the analysis of stock quotation data. A solid analysis can give you an advantage when deciding which shares to buy. You can also search for patterns, such as which shares move together with others. However, the first step is collecting all the required data, and this is a cumbersome task!
I found lots of discussions in various forums, on StackOverflow and so on, but nobody provided a solution for downloading large amounts of data at once. As a consequence, I created my own solution (which you can download here) based on the Google Finance API. Please note that using this tool might violate the Google Finance rules stated in their disclaimer, because it downloads and saves the data.
Basics: Input and Output
The tool is a simple command line application. Its input consists of a CSV file where the first three columns are “Exchange” (containing the exchange’s abbreviation as stated in the Google Finance Disclaimer), the “Symbol” and the “Name” of the shares, trusts or indices. Further columns can be added and will appear in the output as well. You configure the input CSV file in “StockDataDownloader.exe.config” by changing the value for the key “StockNames”. In the corresponding subdirectory you will find input files for NYSE, NASDAQ and ETR which I compiled from the exchanges’ official websites. Running the program after download should result in output similar to the following.
If you didn’t adjust the setting for “OutputDir”, you will find a CSV file in the directory “Output”. If you want to pause downloading, just hit a key. Restarting the application will make the process continue with those stocks for which no data is present yet. This is of particular interest when Google finds out that you are a bot: the console will show dozens of errors, and it is best to stop the process and continue later (after about an hour).
Note that, as a consequence, an update of the data (e.g. some days later) can only be done if you delete the output file or change the output file name in StockDataDownloader.exe.config. Otherwise Stock Data Downloader won’t collect new quotations.
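To make the resume behaviour a bit more tangible, here is a minimal Python sketch of the idea (not the tool’s actual code): read the input list, look at which stocks already appear in the output file, and keep only the rest. The comma delimiter, the output columns “Date” and “Close”, the helper names and the sample prices are assumptions made up for illustration; only the three leading input columns come from the description above.

```python
import csv
import io

# Sample input in the documented layout: Exchange, Symbol, Name.
INPUT_CSV = """Exchange,Symbol,Name
NYSE,IBM,International Business Machines
NASDAQ,MSFT,Microsoft
ETR,SAP,SAP SE
"""

# Hypothetical output file: the "Date" and "Close" columns and the
# prices are invented for this example.
OUTPUT_CSV = """Exchange,Symbol,Name,Date,Close
NYSE,IBM,International Business Machines,2015-01-02,100.0
NYSE,IBM,International Business Machines,2015-01-05,101.0
"""


def read_rows(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def pending_stocks(input_rows, output_rows):
    """Return the input rows that have no data in the output yet."""
    done = {(row["Exchange"], row["Symbol"]) for row in output_rows}
    return [
        row for row in input_rows
        if (row["Exchange"], row["Symbol"]) not in done
    ]


input_rows = read_rows(INPUT_CSV)
output_rows = read_rows(OUTPUT_CSV)
remaining = pending_stocks(input_rows, output_rows)
print([row["Symbol"] for row in remaining])
```

This also shows why an update needs a fresh output file: a stock that already has rows in the output counts as done, no matter how old those rows are.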
In the configuration file StockDataDownloader.exe.config you have further options at hand:
- StockNames gives you the chance to set the input CSV file with format [path]\[filename]
- StartDate defines the oldest data which will be delivered by the Google Finance API. The default is “Jan+01%2C+2000”, which you can modify as long as you keep the format.
- ParallelConnections allows running parallel queries. This is not recommended, because Google will then detect that you’re a bot even faster.
- Delay defines the delay (in milliseconds) after querying the data for one share. You can set it to a “human-realistic” value in order to make Google think that you are actually a human being.
- OutputDir lets you set the directory of the tool’s output file.
- OutputFile names the actual file which will be created.
- IncludeLatestData can be set to “false” if you don’t want to receive all data available. This could be the case if you want to skip the data of the current day, because Google Finance API may not yield valid high and low price information before trading has stopped.
- DayLimitToIgnore is enormously important if you want to skip shares for which no data has been available for a certain number of days (and which therefore may no longer exist).
- LocalizeDecimalSeparator defines whether to convert Google’s decimal dot into a comma. This is not recommended, especially if you want to process your data using R – so leave it “false”.
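The StartDate format looks odd at first, but it is simply the URL-encoded form of “Jan 01, 2000”. The following Python sketch reads key/value settings from a config fragment and builds a StartDate value in that format. I am assuming here that the file follows the usual .NET appSettings layout with add elements carrying key and value attributes; the fragment, the Delay value of 2000 and the function names are illustrative, not copied from the tool.

```python
import xml.etree.ElementTree as ET
from datetime import date
from urllib.parse import quote_plus, unquote_plus

# Assumed layout of StockDataDownloader.exe.config (standard .NET
# appSettings); the Delay value is a made-up example.
SAMPLE_CONFIG = """
<configuration>
  <appSettings>
    <add key="StartDate" value="Jan+01%2C+2000" />
    <add key="Delay" value="2000" />
    <add key="LocalizeDecimalSeparator" value="false" />
  </appSettings>
</configuration>
"""

# English month abbreviations, spelled out to avoid locale surprises.
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def read_settings(config_text):
    """Collect all <add key="..." value="..."/> pairs into a dict."""
    root = ET.fromstring(config_text.strip())
    return {node.get("key"): node.get("value") for node in root.iter("add")}


def encode_start_date(day):
    """Format a date like 'Jan 01, 2000' and URL-encode it."""
    text = f"{MONTHS[day.month - 1]} {day.day:02d}, {day.year}"
    return quote_plus(text)


settings = read_settings(SAMPLE_CONFIG)
print(unquote_plus(settings["StartDate"]))   # human-readable form
print(encode_start_date(date(2010, 6, 15)))  # value for a later start date
```

So if you want data from mid-2010 onwards, you would put the second printed value into the StartDate setting.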
A word on data quality
I guess they have their reasons at Google Finance to list issues regarding data quality in their disclaimer (including missing and invalid data). For me these statements came too late; I found out about the data quality the hard way 🙂 Actually, for some symbols you won’t find any data although the share is still being traded. Furthermore, there are days, weeks and even whole past years missing for some of the shares which I investigated. My advice is to think about which data actually benefits your purposes and then clean the data before working with it. Also, try to cross-check each and every result you obtain, because an odd result could point to problems with data quality.
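One simple check before any analysis is to look for holes in the date series of each share. Here is a small Python sketch of that idea. The threshold of five calendar days is my own rule of thumb (an ordinary weekend is a gap of three days, a long weekend four), and the sample dates are invented.

```python
from datetime import date


def find_gaps(trading_days, max_gap_days=5):
    """Return (before, after) date pairs lying more than max_gap_days apart."""
    ordered = sorted(set(trading_days))
    return [
        (before, after)
        for before, after in zip(ordered, ordered[1:])
        if (after - before).days > max_gap_days
    ]


# Invented sample: almost four weeks are missing after January 6th.
days = [
    date(2015, 1, 2),
    date(2015, 1, 5),
    date(2015, 1, 6),
    date(2015, 2, 2),
    date(2015, 2, 3),
]

gaps = find_gaps(days)
for before, after in gaps:
    print(f"No data between {before} and {after}")
```

A share with many such gaps is a candidate for exclusion, or at least for a closer look, before you draw conclusions from it.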