In this project [1], we propose an industry standard for constructing an aggregate hedge fund database by merging multiple commercial databases. Our merging approach consists of two transparent main steps that can be easily replicated on a regular basis. We will update our database and research results each quarter, providing researchers, investors, journalists and policy makers with regular updates on stylized facts about hedge funds. We label this project LASER (aLternAtive inveStmEnt fund Research). The updates will be available on the Risk Management Laboratory webpage. Because the steps are easy to replicate, other researchers can follow them to construct their own data sets or even use the same aggregate data.

It is not a trivial task to merge several commercial hedge fund databases and to separate unique hedge funds from their share classes. The main reason is that commercial data vendors provide identifiers only for share classes, not for unique hedge funds. The problem is serious even for studies that use only one of the commercial databases, since each individual database contains a significant number of multiple share classes that cannot be eliminated simply by excluding different currency classes. Thus, it is important to remove duplicate share classes even in a single-database study.

We address these issues by combining five major hedge fund databases into an aggregate data set using a novel merging approach that is based on two main steps. First, we develop a matching algorithm for hedge fund share classes that aims to identify share classes that appear in more than one database. Second, we propose a formal statistical algorithm that groups sufficiently similar share classes within management companies, allowing us to obtain the longest possible time series for each unique hedge fund. In contrast to other studies, which often state only loosely that they ‘carefully’ remove duplicate hedge funds without describing their merging process in detail, we clearly report and justify the variables that are used in share class matching and disclose the details of the statistical algorithm that identifies the unique hedge funds. Because the merging procedures are not discussed in these papers, there is a high demand for a standardized merging approach that could be used by researchers, investment professionals and even regulators.

We develop a statistical procedure that is used to separate individual hedge funds from share classes. The goal of the procedure is to find out which of the share classes employ exactly the same underlying investment process. Formally, the proposed statistical procedure is closely related to cluster analysis. The procedure consists of three main steps. First, as in clustering, we define a distance function that fulfils four properties: (i) identity, (ii) non-negativity, (iii) symmetry, and (iv) the triangle inequality.
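
For reference, the four properties can be written formally as follows. This is the standard textbook statement of a distance function; the notation (d for the distance, and x, y, z for the return series of three share classes) is our assumption and is not taken from the underlying paper.

```latex
\begin{align*}
\text{(i) Identity:}            \quad & d(x, y) = 0 \iff x = y \\
\text{(ii) Non-negativity:}     \quad & d(x, y) \ge 0 \\
\text{(iii) Symmetry:}          \quad & d(x, y) = d(y, x) \\
\text{(iv) Triangle inequality:} \quad & d(x, z) \le d(x, y) + d(y, z)
\end{align*}
```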

These properties of the distance function ensure that the grouping algorithm is consistent, so that different share classes are assigned correctly to their respective groups. Second, we relate the distance between funds to their correlation. Since the correlation coefficient can take negative values, it is not itself a distance measure. Thus, we apply a simple transformation that ensures the distance is never negative. The proposed distance measure is closely related to the Euclidean distance, a very common distance measure. In fact, it is straightforward to show that the correlation coefficient is inversely related to the Euclidean distance between the standardized versions of the data: the higher the correlation, the smaller the distance.
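
The link between correlation and Euclidean distance can be checked numerically. The sketch below is only an illustration: the text does not state which transformation is used, so we assume the common choice d = sqrt(2 (1 - rho)), which maps a correlation in [-1, 1] to a distance in [0, 2]. The function names and the simulated return series are hypothetical.

```python
import numpy as np

def correlation_distance(x, y):
    """Map a Pearson correlation in [-1, 1] to a distance in [0, 2].

    Assumption: d = sqrt(2 * (1 - rho)); the source does not state
    which transformation it uses.
    """
    rho = np.corrcoef(x, y)[0, 1]
    # max() guards against tiny negative values from floating-point rounding.
    return float(np.sqrt(max(0.0, 2.0 * (1.0 - rho))))

def standardized_euclidean(x, y):
    """Euclidean distance between z-scored series, scaled by 1/sqrt(n)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    zx = (x - x.mean()) / x.std()   # population standard deviation (ddof=0)
    zy = (y - y.mean()) / y.std()
    return float(np.linalg.norm(zx - zy) / np.sqrt(len(x)))

rng = np.random.default_rng(0)
a = rng.normal(0.01, 0.03, size=120)           # hypothetical monthly returns
b = 0.8 * a + rng.normal(0.0, 0.01, size=120)  # a correlated series

d_corr = correlation_distance(a, b)
d_eucl = standardized_euclidean(a, b)
print(d_corr, d_eucl)  # the two values agree up to rounding error
```

Under this assumed transformation the two quantities coincide because the squared Euclidean distance between two standardized series of length n equals 2n(1 - rho).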

As a third step, within management companies, we form share class groups that are based on a correlation limit of 0.99. All share classes that have a pairwise correlation above 0.99 are assigned to the same group. The triangle inequality ensures that the groups are formed consistently. Since we impose a correlation limit, we can bypass the most difficult task of any clustering algorithm, namely determining the number of groups or clusters. In addition, we can easily test other correlation limits, such as 0.95. When merging commercial databases, we first append all databases, sort funds by management company name, and then run our statistical procedure. Our statistical procedure has two major advantages. First, we can automatically provide frequent updates of our comprehensive aggregate database. Second, it is easier to define merging criteria using return time series, which are reported much more consistently across databases than name information.
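
The grouping step can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used in the project: we assume that groups are formed as connected components (two share classes end up in the same group if they are linked by a chain of pairwise correlations above the limit), and the function name, the `min_obs` parameter and the example data are all hypothetical.

```python
import numpy as np
import pandas as pd

def group_share_classes(returns, company, limit=0.99, min_obs=12):
    """Assign a group label to each share class (column of `returns`).

    returns : DataFrame, one column of periodic returns per share class
    company : dict mapping share class id -> management company name
    limit   : correlation limit (0.99 in the text; 0.95 as an alternative)
    min_obs : hypothetical minimum number of overlapping observations
    """
    parent = {c: c for c in returns.columns}

    def find(c):
        # Follow parent links to the representative of c's group.
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    owner = pd.Series(company).reindex(returns.columns)
    # Only compare share classes of the same management company.
    for _, members in owner.groupby(owner):
        cols = list(members.index)
        corr = returns[cols].corr(min_periods=min_obs)
        for i, a in enumerate(cols):
            for b in cols[i + 1:]:
                rho = corr.loc[a, b]
                if pd.notna(rho) and rho > limit:
                    parent[find(a)] = find(b)   # merge the two groups

    return {c: find(c) for c in returns.columns}

rng = np.random.default_rng(1)
base = rng.normal(0.01, 0.03, size=60)
other = rng.normal(0.01, 0.03, size=60)
returns = pd.DataFrame({
    "A_usd": base,
    "A_eur": base + rng.normal(0.0, 0.0005, size=60),  # near-duplicate class
    "B_usd": other,                                    # unrelated strategy
    "C_usd": base,                                     # different company
})
company = {"A_usd": "Alpha", "A_eur": "Alpha",
           "B_usd": "Alpha", "C_usd": "Gamma"}
groups = group_share_classes(returns, company)
print(groups)
```

Because the limit fixes group membership directly, the number of groups is an output of the procedure rather than an input, and rerunning with `limit=0.95` requires no other change.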

We could also use other similarity measures, but the correlation captures important properties. For example, if two share classes follow the same investment process but one of them uses twice the leverage of the other, our correlation-based statistical procedure assigns them to the same group. The correlation coefficient between the two share classes is very high, whereas some other distance measures may not place them in the same group, since their scales or mean returns differ significantly from each other.
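
The leverage example can be illustrated with simulated data. The sketch below is a simplification: the return series is hypothetical, and the levered share class is modelled as exactly twice the base returns, ignoring financing costs and fees.

```python
import numpy as np

rng = np.random.default_rng(2)
base = rng.normal(0.01, 0.03, size=120)   # hypothetical monthly returns
levered = 2.0 * base                      # same strategy, twice the leverage

# Correlation is unaffected by rescaling one of the series.
rho = np.corrcoef(base, levered)[0, 1]

# Euclidean distance on the raw returns reflects the difference in scale.
raw_euclidean = np.linalg.norm(base - levered)

print(f"correlation: {rho:.4f}")
print(f"raw Euclidean distance: {raw_euclidean:.4f}")
```

The correlation equals one up to rounding, so a correlation limit would group the two share classes, while a distance computed on the raw returns is clearly above zero and could separate them.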


[1] Joenväärä, Juha and Kosowski, Robert and Tolonen, Pekka, Hedge Fund Performance: What Do We Know? (October 25, 2013). Available at SSRN