How to run multiple factor analysis in R

Writing still in progress

In economics and the social sciences in general, factor analysis is a very useful set of methods for analyzing data on a wide variety of topics. The best-known method is principal component analysis (PCA), which reduces a set of variables to a few “principal components” that are linear combinations of the dataset’s variables. The components are ordered according to the percentage of variance each one captures, and they are orthogonal (uncorrelated) to each other. PCA applies only to quantitative variables, but there are several options for including qualitative and ordinal variables: multiple correspondence analysis (MCA), which applies to qualitative variables only, and mixed methods, which apply to both qualitative and quantitative variables. Factor analysis is also a useful first step for cluster analysis: once the principal components are estimated, hierarchical clustering (HC) can be used to group the observations into clusters.

However, what happens when our variables belong to different groups? Running a PCA directly would ignore this grouping and thus lose a lot of information. To give an example from my own research, the literature on comparative capitalism often uses clustering through PCA and MCA to identify empirically the varieties of socio-economic models which exist across countries. One of the main ideas of comparative capitalism is that social, political and economic institutions take different forms and have different hierarchies and complementarities within countries, which leads to different forms of capitalism across countries that we can cluster. This is what has been done by, for instance, Amable (2003), who identified five types of capitalism by analyzing five institutional domains: the wage-labour nexus; the financial system; the education system; product-market competition; the welfare state & social protection system. Cahen-Fourot (2020) offers a similar typology, but with slightly different institutional domains (which he calls “institutional forms” because they are taken from Régulation Theory): the monetary regime; the wage-labour nexus; the form of competition; the form of the state; and the insertion into the international regime. What is interesting in Cahen-Fourot (2020) is that he includes an additional institutional form: the social relations to the environment.

To identify and cluster different types of capitalism, this literature uses a combination of PCA and MCA through the following steps:

1. Run a PCA for each institutional domain separately
2. Cluster the countries on this institutional domain using HC on the PCA and assign each country to its cluster
3. Run an MCA on the cluster classification

Although this approach is sound, it can be very time-consuming, especially when steps 1 and 2 above must be repeated many times (5 or 6 times in the example above). There is another principal component method which does this in a much simpler way: multiple factor analysis (MFA). This method runs a single principal component dimension reduction while taking into account the fact that the variables are grouped into categories (here, the institutional forms), which avoids running a separate analysis for each of them. In very simple terms, MFA can be considered as a weighted PCA in which each variable is divided by the square root of the first eigenvalue of the group it belongs to. By doing so, the group structure is preserved while the analysis runs over all the variables, and no single group can dominate the first dimension simply because it contains more variables.
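To make the "weighted PCA" idea concrete, here is a minimal base R sketch on simulated data. The toy matrix, the two groups and all object names are assumptions made for illustration only; the code mimics the weighting logic and is not meant to reproduce FactoMineR's output exactly (numerical details such as the variance denominator may differ).

```r
# Toy data (hypothetical): 20 observations, 5 variables in two groups
set.seed(1)
X <- matrix(rnorm(20 * 5), nrow = 20)
groups <- list(g1 = 1:3, g2 = 4:5)   # column indices of each group

# Standardize every variable (mean 0, unit variance)
X_std <- scale(X)

# First eigenvalue of the separate PCA run on each group
first_eig <- sapply(groups, function(cols) prcomp(X_std[, cols])$sdev[1]^2)

# Divide each variable by the square root of its group's first eigenvalue
X_weighted <- X_std
for (g in names(groups)) {
  X_weighted[, groups[[g]]] <- X_std[, groups[[g]]] / sqrt(first_eig[[g]])
}

# Global PCA on the weighted variables (already centred and weighted)
global_pca <- prcomp(X_weighted, center = FALSE, scale. = FALSE)

# After weighting, the first eigenvalue of every group equals 1
sapply(groups, function(cols) prcomp(X_weighted[, cols])$sdev[1]^2)
```

The last line illustrates the point of the weighting: each group now has the same maximum inertia along any single direction, so the groups are balanced in the global analysis.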

This post will not go into further detail on MFA itself, but will focus on its practical implementation and visualization in R using the FactoMineR and Factoshiny packages. There are many resources online to learn the technical details of MFA, such as this series of YouTube tutorials by François Husson. This post is also very useful.

How to run MFA in R

To show how to run MFA in R, I will use the data compiled by Cahen-Fourot (2020) to analyze the diversity of capitalisms in 2015. The dataset contains 76 variables covering 37 countries.

library(readr)
library(dplyr)  # %>% and select()
library(tibble) # column_to_rownames()
library(gt)     # gt() tables

data <- read_csv("data_full_CF2020.csv") %>%
    column_to_rownames(var = "Country") %>% # countries must be row names, otherwise MFA() will treat them as a qualitative variable
    select(-c(MatfootGDP, CO2footGDP, EnerfootGDP)) # remove three variables which are present in the dataset but will not be used

head(data) %>% 
  gt()

SCP	GIN	DCB	GOV	PCT	CCR	ABC	ABP	BSS	LPS	CSR	LBR	ATX	BEN	FDI	TRF	TFS	TRD	Telec_struc	Elec_struc	Gas_struc	Post_struc	Rail_struc	Hstat	LernerIndex	StateControl	Taxes	HealthGDP	EducGDP	MilitaryGDP	Finalcons	HealthGOV	EducGOV	MilitaryGOV	Int_rate	Inf_rate	Fin_reg	LLR	FinOpen	Liq	PrimAgri	PrimIndus	SecManuf	SecConst	Ter_Serv	Ter_FIRE	Tradeglobfacto	Finglobfacto	Tradeglobjure	Finglobjure	Socglobfacto	Socglobjure	Polglobfacto	Polglobjure	IUCN	GreenP	EnvConflicts	ClassGHG	EnvReg	EnvTreaties	EnfEnvReg	EnerGDP	GHGGdp	MatGDP	MatEmb	EnerEmb	CO2Emb	Wshare	DistProf	Labprotec	CollBarg	Unions	Unemp	IncomeIneq	WorkingHours	Gender
2.85	2.17	1.03	5.63	0.84	1.08	0.8	0.8	1.15	4	0.55	1.47	1.43	2.75	0.77	0	0.00	0.00	1.60	0.0	0.00	3	3.0	0.31	0.17	1.94	22.1	6.1	5.32	1.96	18.1	16.65	14.08	5.22	2.26	1.51	1389	0.54	0.94	1874	0.02	0.10	0.07	0.09	0.50	0.22	239	785	839	678	868	800	908	911	1.28	0.213	0.7	5.5	5.7	24	5.7	0.100	408	0.70	-131.2	-9.1	1.9	58.8	15.0	1.9	50.6	15.0	6.1	0.3	34	13.0
3.00	2.78	1.03	2.25	0.37	1.78	1.8	0.2	4.04	2	0.05	0.20	0.00	2.47	0.63	0	0.35	1.40	1.36	0.0	1.50	2	0.0	0.71	0.55	1.67	26.8	7.8	5.45	0.71	19.9	15.10	10.70	1.37	-0.02	0.90	1654	2.65	1.00	2971	0.01	0.03	0.18	0.06	0.56	0.15	699	907	911	737	871	864	939	973	0.59	0.115	0.7	3.4	6.2	25	6.1	0.079	190	0.42	56.9	8.4	25.9	63.7	16.8	2.4	98.0	27.4	5.7	0.3	35	17.0
2.60	2.59	1.16	3.00	1.17	2.92	2.2	1.0	4.40	2	0.94	1.50	0.00	2.52	0.24	0	0.50	0.00	1.51	1.5	0.00	2	3.0	0.68	0.17	2.19	24.7	8.6	6.55	0.92	23.9	16.02	12.17	1.71	-0.02	0.56	1869	2.88	1.00	2081	0.01	0.03	0.13	0.06	0.63	0.15	932	936	880	818	826	899	926	990	1.16	0.846	0.4	3.6	5.5	26	5.2	0.104	231	0.29	53.7	-81.4	-1.4	66.8	15.9	3.0	96.0	54.2	8.5	0.3	35	4.7
2.64	1.77	2.29	4.00	2.53	2.17	3.6	2.0	3.43	6	1.57	2.07	0.00	3.44	0.61	4	0.80	3.55	1.93	0.0	6.00	3	6.0	0.74	0.01	2.51	12.8	3.8	6.24	1.36	19.7	7.66	16.25	3.57	14.25	9.03	1636	20.40	0.17	6135	0.05	0.07	0.14	0.06	0.53	0.15	191	501	479	285	453	766	955	676	0.10	0.017	0.5	14.4	4.7	24	3.7	0.125	439	1.47	-13.8	4.6	12.1	50.3	21.1	1.7	70.5	19.5	8.4	0.5	39	16.0
2.60	1.92	0.59	3.75	1.95	1.30	0.0	0.2	3.75	2	0.57	0.69	0.65	2.95	1.04	0	1.84	1.17	1.57	3.0	0.00	2	1.5	0.81	0.39	1.92	12.3	7.7	5.27	1.16	20.9	19.07	12.22	2.87	0.82	1.13	142	4.69	1.00	3762	0.02	0.10	0.11	0.07	0.52	0.19	365	841	859	748	888	842	933	937	0.62	0.253	0.7	6.8	5.0	19	5.0	0.151	400	0.58	13.7	3.8	3.5	63.4	16.0	1.5	30.6	28.6	6.9	0.3	35	18.6
2.47	1.79	0.96	3.75	2.39	1.50	1.2	0.8	1.37	6	1.50	0.47	0.00	3.11	0.34	0	0.10	1.26	1.30	0.0	0.75	1	4.5	0.81	0.40	2.10	17.5	4.9	4.87	1.91	13.2	19.58	19.59	7.62	4.94	4.35	1262	0.52	0.70	1929	0.04	0.14	0.11	0.08	0.43	0.05	382	769	772	643	581	819	853	913	0.17	0.039	2.8	8.5	4.7	22	4.7	0.136	433	2.77	-188.5	5.5	8.2	44.8	25.3	1.8	17.6	17.9	6.5	0.5	42	21.1

The first 25 variables concern the form of competition. In Régulation Theory, the form of competition (akin to product market competition in the varieties of capitalism framework) captures the nature and degree of competition in the socio-economic model. Institutions which constitute this form are, for instance, administrative burdens for corporations (ABC) and for sole proprietor firms (ABP), entry barriers, price controls (PCT), or the degree of competition (measured here by the Lerner index).

The next 9 variables (StateControl to MilitaryGOV) describe the form of the state.

The monetary regime, defined as all the institutions regulating money, credit and finance, is summarized by 6 variables (interbank 3-month interest rate, inflation rate…), while 14 variables describe the insertion into the international regime, such as the importance of agriculture (PrimAgri), extractive industries (PrimIndus), manufacturing (SecManuf) and the so-called globalization index from the KOF economic institute. The social relations to the environment are captured by the following 13 variables, such as environmental regulation stringency (EnvReg) and the GDP intensity in greenhouse gases (GHGGdp). Finally, the remaining 9 variables (Wshare to Gender) are for the wage-labour nexus, which comprises all institutions regulating employment relationships, wage setting, work organization and social protection.

To run MFA in R, one just needs to load the FactoMineR package and use its MFA() function. The only complication of this function is the group argument: it needs to be a vector of numbers which describes how the variables are grouped. It works as follows:

library(FactoMineR)

# create a vector with the names of the groups (i.e. the 6 institutional forms)
group_name <- c("Competition", 
                "State",
                "Money", 
                "International regime", 
                "Environment",
                "Wage-labour nexus") 

MFA_CF2020 <- MFA(data, 
                  group = c(25, 9, 6, 14, 13, 9), # number of variables in each group, in column order
                  type = rep("s", 6), 
                  ncp = 5, # number of principal components to keep 
                  name.group = group_name, graph = FALSE)

group = c(25, 9, 6, 14, 13, 9) tells the function that the first 25 variables form the first group, the following 9 the second group, the following 6 the third, etc. One can also name the groups with the name.group argument, to which we pass the group_name vector. type = rep("s", 6) gives the nature of the variables in each group. “s” is for quantitative variables that are scaled to unit variance, and we have to repeat this string 6 times for the six groups (hence the rep function).
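Since the group vector is easy to get wrong, it can help to check which variables end up in which group before running the analysis. The sketch below uses base R only; var_names is a placeholder I made up to stand in for colnames(data), so the snippet runs without the CSV file.

```r
group_sizes <- c(25, 9, 6, 14, 13, 9)
group_name <- c("Competition", "State", "Money",
                "International regime", "Environment", "Wage-labour nexus")

# Placeholder variable names; with the real data one would use colnames(data)
var_names <- paste0("var", seq_len(sum(group_sizes)))

# Repeat each group name as many times as the group has variables
group_of_var <- rep(group_name, times = group_sizes)

# The sizes must add up to the number of columns, otherwise MFA() will complain
stopifnot(length(group_of_var) == length(var_names))

# List the variables falling in each group, keeping the original group order
vars_by_group <- split(var_names, factor(group_of_var, levels = group_name))
lengths(vars_by_group)
```

Printing vars_by_group with the real column names makes it easy to spot a variable that has been assigned to the wrong institutional form.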

Apart from the group argument, we can appreciate how simple it is to implement MFA in R, whereas conducting separate analyses with six PCAs followed by an MCA would have required much more time and code. The only drawback of the MFA() function is that it requires the variables to be ordered by group in the dataframe.
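When the columns are not already ordered by group, one possible workaround is to keep a lookup vector mapping each variable to its group and reorder the dataframe from it. This is only a sketch on a made-up dataframe: toy, var_group and group_levels are hypothetical names, not objects from the dataset used in this post.

```r
# Toy dataframe whose columns are NOT ordered by group
toy <- data.frame(b1 = 1:3, a1 = 4:6, b2 = 7:9, a2 = 10:12)

# Hypothetical lookup: variable name -> group
var_group <- c(a1 = "A", a2 = "A", b1 = "B", b2 = "B")
group_levels <- c("A", "B")   # desired order of the groups

# Order the variables by the position of their group in group_levels
ordered_vars <- names(var_group)[order(match(var_group, group_levels))]
toy_ordered <- toy[, ordered_vars]

# Group sizes in the same order, ready to be passed to the group argument
group_sizes <- as.vector(table(factor(var_group, levels = group_levels)))

names(toy_ordered)
group_sizes
```

Deriving both the column order and the group sizes from the same lookup avoids the two getting out of sync.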

Exploring separate PCA

The MFA function returns a list with two main elements:

separate.analyses: the PCAs run on each group. We thus have 6 separate PCAs, one for each institutional form, and we can analyse, visualize and perform cluster analysis on each of them if we are interested.
global.pca: the global PCA with the variables weighted according to their group.
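To get a feel for this structure without the dataset used here, one can run MFA() on the wine data shipped with FactoMineR. The sketch below keeps only four groups of quantitative variables (columns 3 to 29); the short group names are my own choice. To my knowledge, the element holding the global weighted PCA is named global.pca in current FactoMineR versions.

```r
library(FactoMineR)

data(wine)                   # example dataset shipped with FactoMineR
wine_quanti <- wine[, 3:29]  # 27 quantitative variables in 4 groups

res_wine <- MFA(wine_quanti,
                group = c(5, 3, 10, 9),
                type = rep("s", 4),
                ncp = 5,
                name.group = c("olf", "vis", "olfag", "gust"),
                graph = FALSE)

names(res_wine)                        # all elements returned by MFA()
names(res_wine$separate.analyses)      # one analysis per group, named after the groups
class(res_wine$separate.analyses$olf)  # each one is a regular PCA object
head(res_wine$eig)                     # eigenvalues of the global analysis
```

Because each separate analysis is an ordinary PCA object, any function that accepts the output of PCA() can be applied to it directly.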

For instance, let’s have a look at the social relations to the environment. It is common practice in factor analysis to first look at the contribution of the dataframe’s variables to the principal components. There are two graphs which help to visualize this: a barplot of the variables’ contributions and the circular plot (correlation circle). It is very straightforward to produce such a barplot in R thanks to the factoextra package and its fviz_contrib function. Below, I show how to generate these graphs for the first 5 principal components, store them in a list and plot them together with the ggarrange function from the ggpubr package:

library(factoextra) # fviz_* plotting functions
library(purrr)      # map()
library(ggpubr)     # ggarrange()

barplots_contrib_env <- 
  1:5 |> # one plot per PC: each value is passed to fviz_contrib() through the axes = .x argument
    map(~fviz_contrib(MFA_CF2020$separate.analyses$Environment, 
                      choice = "var", 
                      axes = .x))

ggarrange(plotlist = barplots_contrib_env)

We now have an overview of how the first five PCs are structured. The first PC is highly correlated with variables such as the GDP intensity in GHG (GHGGdp) and in energy (EnerGDP), as well as environmental regulation stringency (EnvReg). The second PC is correlated with embodied energy in net imports relative to energy consumption (EnerEmb) and with the number of IUCN member organizations per million inhabitants (IUCN). The variable capturing environmental conflicts contributes more than 50% to the third PC.
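For readers wondering what these contributions are, here is a small base R sketch on simulated data (the toy dataframe is an assumption, not the environment variables). In a standardized PCA with equal variable weights, the contribution of a variable to a PC, in percent, is 100 times its squared loading, which is why the contributions to each PC sum to 100.

```r
set.seed(42)
toy <- data.frame(x1 = rnorm(30), x2 = rnorm(30), x3 = rnorm(30))
toy$x4 <- toy$x1 + rnorm(30, sd = 0.3)   # x4 is strongly related to x1

pca <- prcomp(toy, center = TRUE, scale. = TRUE)

# pca$rotation holds unit-length loading vectors, one column per PC,
# so the squared loadings of each PC sum to 1
contrib <- 100 * pca$rotation^2

round(contrib[, 1:2], 1)   # contributions (%) to the first two PCs
colSums(contrib)           # each column sums to 100
```

A common reading aid is to compare each contribution with the value expected if all variables contributed equally (100 divided by the number of variables).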

However, the barplots do not tell us whether the variables which contribute the most to the PCs are correlated negatively or positively with them. Circular plots are a good visualization for that. Here, the code is slightly more complex because we need to provide a list containing every pair of PCs to plot against each other (10 pairs for the first five PCs):

axes_list <- combn(1:5, 2, simplify = FALSE)

circplots_contrib_env <- 
  axes_list |> 
    map(~fviz_pca_var(MFA_CF2020$separate.analyses$Environment,
                      axes = .x)+
          labs(title = ""))

ggarrange(plotlist = circplots_contrib_env)
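As a quick check of what the axes list looks like, the following base R lines show the pairs generated by combn. Nothing here depends on the MFA results.

```r
# All pairs of the first five PCs, returned as a list of length-2 vectors
axes_list <- combn(1:5, 2, simplify = FALSE)

length(axes_list)   # 10 pairs, i.e. choose(5, 2)
axes_list[[1]]      # the first pair: PC1 against PC2
axes_list[[10]]     # the last pair: PC4 against PC5
```

Each element is passed in turn as the axes argument, so ggarrange receives one plot per pair of dimensions.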

The next step is to see how the countries are distributed along these PCs. To do so, we just use fviz_pca_ind instead of fviz_pca_var, this time restricting the plots to the first three PCs:

axes_list <- combn(1:3, 2, simplify = FALSE)

countries_plot_env <- 
  axes_list |> 
    map(~fviz_pca_ind(MFA_CF2020$separate.analyses$Environment,
                      axes = .x)+
          labs(title = ""))

ggarrange(plotlist = countries_plot_env)

One can also combine the countries and circular graphs into a single plot:

factoextra::fviz_pca_biplot(MFA_CF2020$separate.analyses$Environment)

From the graphs above, one can make the following observations regarding the position of countries on the first two dimensions. In the upper right quadrant, we have countries which are characterized by (relatively) a lot of climate activism (measured by IUCN and GreenP), environmental regulations and enforcement, but also by a significant degree of CO2 offshoring (EnerEmb and CO2Emb measure energy and CO2 embodied in net imports, i.e. the degree of offshoring of GHG emissions). In the upper left quadrant, we have countries such as China and India with high domestic GHG emissions (EnerGDP, GHGGdp…), high carbon inequalities, low offshoring and low climate activism. In the lower right quadrant, we have countries such as France and the UK which have relatively lower domestic GHG emissions but high offshoring and low activism.

Applying hierarchical clustering is also very easy with the HCPC() function. Note that, with its default settings, the function asks the user to cut the dendrogram interactively to choose the number of clusters, which does not work when rendering a Quarto document. I thus ran the code first in an R script to get the number of clusters, which is four, and pass it explicitly through nb.clust:

cluster_env = HCPC(MFA_CF2020$separate.analyses$Environment, nb.clust = 4, graph = FALSE)
fviz_cluster(cluster_env) +
  theme_minimal()
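An alternative to the interactive step, as far as I understand the HCPC() documentation, is to set nb.clust = -1 so that the tree is cut automatically at the level suggested by the function. The sketch below illustrates this on the decathlon data shipped with FactoMineR rather than on the environment PCA, so the resulting clusters have nothing to do with the ones discussed in this post.

```r
library(FactoMineR)

data(decathlon)   # example dataset shipped with FactoMineR
res_pca <- PCA(decathlon, quanti.sup = 11:12, quali.sup = 13, graph = FALSE)

# nb.clust = -1: cut the tree automatically at the suggested level
# (no interactive click, so it can run in a rendered document)
res_hcpc <- HCPC(res_pca, nb.clust = -1, graph = FALSE)

# Number of individuals in each cluster
table(res_hcpc$data.clust$clust)
```

The automatic cut is only a suggestion based on inertia gains, so it is still worth inspecting the dendrogram before settling on a number of clusters.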

The clustering confirms the observations we made above: cluster 1 groups domestic polluters with low offshoring (which makes sense since they export their domestic pollution to the rich countries of clusters 3 and 4) and lower activism. Cluster 3 gathers countries with high carbon inequalities, relatively higher carbon emissions (lower than cluster 1 but higher than cluster 4) and low activism. Cluster 4 is mostly composed of rich countries with a high degree of offshoring, high activism, lower inequality and lower GHG emissions.

Exploring global PCA

Let’s now have a look at the global weighted PCA, starting with the contribution of the variables to the first two dimensions. Since there are a lot of variables, it is useful to make an interactive plot with the ggplotly function from the plotly package so that we can zoom in on the graph:

library(plotly) # ggplotly()

plot_MFA1 = fviz_mfa_var(MFA_CF2020)

ggplotly(plot_MFA1)

We can see that the first principal component is highly positively correlated with the degree of labour protection and BSS (entry barriers in professional services, freight transport services and retail distribution), while the second is highly correlated with the variables measuring the degree of domestic pollution and negatively correlated with globalization indicators, collective bargaining, the wage share and climate activism (among others). The same kind of interactive plot can be drawn for the countries:

plot_MFA2 = fviz_mfa_ind(MFA_CF2020, geom = "text")

ggplotly(plot_MFA2)

References

Amable, Bruno. 2003. The Diversity of Modern Capitalism. OUP Oxford.

Cahen-Fourot, Louison. 2020. “Contemporary Capitalisms and Their Social Relation to the Environment.” Ecological Economics 172 (June): 106634. https://doi.org/10.1016/j.ecolecon.2020.106634.