Analysis of the Twitter Discourse on Sustainability Using Natural Language Processing

This publication aims to map the environmental sustainability discourse on Twitter. This is achieved through two commonly used methods of natural language processing: topic modelling, which is used to uncover hidden themes in a document collection, and sentiment analysis, which is used to detect the attitudes of the authors of a text towards a particular subject. The exploration of communication can provide an opportunity to find a solution to a multifaceted problem in order to protect our common future.


Introduction

Theoretical background overview
Before presenting the current situation on environmental sustainability in detail, I consider it important to relate the Internet-based revolution discussed earlier to the professional context, thus validating the relevance of the Internet sources used in the next chapter. At the level of everyday experience, we can observe that when we need information, we turn first and foremost to the Internet rather than going to a library or to a newsagent for a journal. Often we simply lack the time for anything else. How, then, can we expect organisations dealing with global problems to keep up with a situation that changes rapidly, even daily, by issuing new official publications on the state of play?
The primary aim of the Internet resources used in this chapter is, therefore, to show where more in-depth information and up-to-date data can be found nowadays (official websites are constantly updated between consecutive reports).
The second aim is to highlight that the scientific discourse on sustainability is not always available in the form of publications. Science also has to adapt to the information habits of an accelerated world, especially if it is to make a real difference. And this is an issue so central from the viewpoint of sustainability that it cannot be ignored.

Description of the current situation
A mass extinction event is defined as a large-scale and sudden loss of a species, trait, or characteristic of one of the billions of species around us. According to experts on the history of the Earth, humanity is on the verge of the sixth extinction event.
The Intergovernmental Panel on Climate Change (IPCC) reported in its Fifth Assessment, published in 2014, that the human impact on climate is clear and that, if it continues to increase, there will be severe and irreversible impacts (IPCC, 2020). Since then, special reports have shown that biodiversity is on an accelerating path to extinction as the planet's resilience declines, despite a series of agreements to avoid climate catastrophe. In their research summary book Big World, Small Planet, published in 2015, Johan Rockström and Matthias Klum describe the components that influence the Earth's ecosystems.
For each of these, a planetary limit has been set. If a limit is exceeded, the planet's life-support balance is upset and the very existence of our own species is threatened; if we stay within the limits, our existence can remain sustainable (Rockström & Klum, 2015). The elements are climate change, ocean acidification, stratospheric ozone depletion, nitrogen and phosphorus flows, fresh-water use, land-system change, biodiversity loss, atmospheric aerosol loading, and chemical pollution.

Opportunities for change
Fortunately, a sustainable future is not lost yet, as a group of researchers and scientists have outlined a number of possible ways to achieve sustainability goals.
Economist Kate Raworth presents seven new ways to fundamentally change economic thinking. Such a change is key to the future, not least because the actions of economic actors carry significant weight in both energy use and the choice of how energy is delivered (Raworth, 2017). A key question in examining the development of the human population is whether we will reach the plateau of sustainable living referred to in the demographic context, where we can simultaneously achieve profit, human well-being, and the good of the Earth. To achieve these goals, we would have to maintain the living standards of rich countries while reducing their ecological footprint, and raise the living standards of poor countries, thus accelerating their demographic transition, while maintaining their ecological footprint. To tackle these challenges, we must both question the indicators that express the well-being of a country (Dasgupta & McKenzie, 2020) and consider whether growth is indeed a fundamental element of economic and political success. The question is: can economic growth be green? The answer is to be found in a multivariate equation, which is currently being solved (Drews & Antal, 2016; Drews & Reese, 2018).
The IEA analysis published on 2 March 2021 shows that global carbon dioxide emissions in December 2020 were 2%, or 60 million tonnes, higher than in the same month of the previous year. If these trends continue, we can expect an average temperature increase of 2.5-2.8°C by 2100, well above the 1.5-2°C limit set in the Paris Agreement. Staying within that limit would require, first and foremost, a shift in energy sources towards clean energy. This could reduce further greenhouse gas emissions and slow down ocean acidification. While the issue of energy storage remains unresolved, a series of efforts to promote the use of solar, wind, hydro, and geothermal energy is a sure way forward (Attenborough & Hughes, 2020).
A priority in protecting our waters would be to further increase the extent of marine protected areas and to develop a sustainable fisheries strategy, which the UN and the World Trade Organisation are currently working on. The protection of forests and land should also aim at restoring the biodiversity of the area, i.e., at reforestation. The main means of achieving this are to maintain existing forests, reduce logging, and afforest agricultural land (Alexandratos & Bruinsma, 2020; FABLE, 2020). It would also be important to reduce soil-damaging operations, reform agricultural practices, and change nutritional habits.
EDUCATION OF ECONOMISTS AND MANAGERS • Volume 62, Issue 4, October-December 2021 Tímea Emese Tóth • Analysis of the Twitter Discourse on Sustainability Using Natural… • 101-118

Theoretical overview of the methodology

Text data generation and the amount of information
With the rapid development of technology and the emergence of the Internet and social media, ordinary users have become content producers themselves, with the opportunity to actively comment on, share, and react to content on the Internet. As a result, there are now billions of such data points, which tell us a lot about people's opinions, mindsets, and attitudes (Evans-Acaves, 2016). In numerical terms, thanks to the Big Data revolution, 4.5 billion people had used the Internet by the beginning of 2020, 3.8 billion of whom were members of a social media site, according to a report published by Digital Reports. One of the most popular platforms is Twitter, where an average of 6,000 tweets are posted every second, which corresponds to roughly 350,000 tweets per minute and about 500 million per day. Processing texts and extracting information from them is not too challenging for the average user, who deals with only a small portion of them. Comprehensively reviewing and processing such a large amount of data, however, is time-consuming and challenging with manual methods, whereas with algorithmic solutions it is no longer an obstacle.

Topic modelling
The method aims to identify the themes of a collection of documents. It assumes that there are words with strong semantic information and that documents on similar topics use similar groups of words. One of the best-known topic modelling techniques is Latent Dirichlet Allocation (LDA), first described by Blei et al. (2003). LDA performance can be measured by, among other things, log likelihood or log likelihood-based metrics such as perplexity, most often computed on the part of the corpus held out for testing, the test corpus (Blei et al., 2003). With the topic coherence indicator, the interpretability of the topics, that is, their internal coherence, can be examined by considering the semantic similarity of the terms that characterise the topics (Stevens et al., 2012). Among the coherence indicators, I use the c_v indicator, which is based on the co-occurrence of word pairs and can take a value between 0 and 1. Roughly speaking, the coherence of a model is considered poor between 0.0 and 0.3, low between 0.31 and 0.4, acceptable between 0.41 and 0.6, good between 0.61 and 0.8, and poor again above that. In practice, I perform the topic modelling using the Gensim package for Python.
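The rough interpretation bands for the c_v indicator can be written down as a small helper. This is only an illustrative sketch: the function name `coherence_band` is hypothetical, and treating each upper bound as inclusive (so that a value such as 0.305 falls into the next band) is an assumption, since the bands above leave small gaps between them.

```python
def coherence_band(c_v: float) -> str:
    """Map a c_v coherence score to the rough quality bands given in the text."""
    if not 0.0 <= c_v <= 1.0:
        raise ValueError("c_v is expected to lie between 0 and 1")
    if c_v <= 0.3:
        return "poor"
    if c_v <= 0.4:
        return "low"
    if c_v <= 0.6:
        return "acceptable"
    if c_v <= 0.8:
        return "good"
    # Scores above 0.8 are treated as poor again, as stated in the text.
    return "poor"


if __name__ == "__main__":
    for score in (0.25, 0.42, 0.7, 0.9):
        print(score, coherence_band(score))
```

A score of 0.42, for instance, falls into the "acceptable" band under these cut-offs.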

Sentiment analysis
The aim of sentiment analysis is to find out people's opinions, attitudes, and even feelings about a particular issue, based on texts alone. In practice, I carried out the sentiment analysis using the VADER tool. It is built on a special lexicon containing 9,000 lexical features, including words, emoticons, abbreviations, and slang expressing emotion. VADER does not calculate the polarity and intensity of words separately but aggregates the results in an indicator called a normalised 'valence-score'. For text fragments, it calculates the proportion of negative, neutral, and positive terms in the fragment, taking their intensities into account, and then normalises the result to a range from -1 to 1. VADER performs accurately compared with competing tools: in the evaluation of tweets it ranked first, ahead of both individual human annotators and other software, and in the other tests only the human annotators achieved higher accuracy (Hutto & Gilbert, 2015).
It is important to note that I do not treat the results of the sentiment analysis as an indication of the authors' attitudes towards the subject; rather, I interpret them as the typical tone (positive, negative, or neutral) of the texts on the subject.
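The normalisation step described above can be illustrated with a short sketch. It assumes the form x / sqrt(x² + α) with α = 15, which is the constant used in the reference VADER implementation, and the conventional cut-offs of ±0.05 for labelling a text as positive, negative, or neutral. The cut-offs actually applied in this analysis are not stated here, so they should be read as an assumption, and the function names are hypothetical.

```python
import math

ALPHA = 15.0  # assumed smoothing constant (value used in the reference VADER code)


def normalise(valence_sum: float, alpha: float = ALPHA) -> float:
    """Squash an unbounded sum of word valences into the range [-1, 1]."""
    score = valence_sum / math.sqrt(valence_sum * valence_sum + alpha)
    return max(-1.0, min(1.0, score))


def label(compound: float, threshold: float = 0.05) -> str:
    """Turn a normalised score into a coarse sentiment label (assumed cut-offs)."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


if __name__ == "__main__":
    for total in (-6.0, 0.0, 0.1, 4.0):
        compound = normalise(total)
        print(total, round(compound, 4), label(compound))
```

Because the denominator always exceeds the absolute value of the numerator, the score approaches but never reaches -1 or 1, however strongly worded the text is.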

Advantages and disadvantages of the method used
Obviously, as with all methods, there are advantages and disadvantages. The main advantage, already mentioned above, is that algorithmic processing of text allows us to analyse large amounts of data quickly and without direct intervention. The drawbacks, compared with classical surveys, include coverage errors and representativeness issues. Not everyone is present in every community space, and inclusion probabilities differ, so we cannot analyse everyone's opinion. Several reasons for the lack of representation of subjects in a given space can be listed, which can be explored either by examining classical sociodemographic variables or by examining the temporal dynamics of community platforms. Another important methodological issue is

Description of the research

Data collection
My analytical database consists of tweets posted on Earth Overshoot Day, scraped for every year since the creation of Twitter. The selection criterion was that a tweet contain at least one of the words 'sustainable' and 'sustainability'. The scraping was done using the Twint package for Python. Both in selecting the time period and in finding relevant search terms I took into account the results of previous research on a similar topic (Kirilenko & Stepchenkova, 2014; Cody et al., 2015; Nik-Bakht & El-Diraby, 2016; Dahal, Kumar, & Li, 2019; Ballestar, Cuerdo-Mir, & Freire-Rubio, 2020; Radi & Shokouhyar, 2020). Retweets are not included in the database, which contains all the relevant posts of the whole day, 61,083 in total. The division into phases is based on David Attenborough's book A Life on Our Planet and on the Paris Agreement. Attenborough identified an important phase from 1997-2011 and then another phase from 2011-2020, but in order to carry out the analysis with a more balanced number of cases at even intervals, I split the second phase into two parts at the adoption of the Paris Agreement in 2015.
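The selection criteria and the division into phases can be sketched as follows. This is an illustration rather than the scraping code itself: the tweet fields `text`, `year`, and `is_retweet` are hypothetical, the phase boundaries are taken from the headings of the three analysis phases below, and the example tweets are invented.

```python
import re
from typing import Optional

# Match either search term as a whole word, ignoring case (also matches hashtags).
KEYWORDS = re.compile(r"\b(sustainable|sustainability)\b", re.IGNORECASE)

# Phase boundaries as used in the analysis sections (inclusive years).
PHASES = {
    "2007-2010": (2007, 2010),
    "2011-2015": (2011, 2015),
    "2016-2020": (2016, 2020),
}


def is_relevant(tweet: dict) -> bool:
    """Keep original tweets (no retweets) that contain a search term."""
    if tweet.get("is_retweet", False):
        return False
    return bool(KEYWORDS.search(tweet["text"]))


def assign_phase(year: int) -> Optional[str]:
    """Return the analysis phase a year belongs to, or None if outside all phases."""
    for name, (start, end) in PHASES.items():
        if start <= year <= end:
            return name
    return None


if __name__ == "__main__":
    sample = [  # invented examples
        {"text": "Building a #sustainability plan", "year": 2009, "is_retweet": False},
        {"text": "Sustainable food for all", "year": 2014, "is_retweet": True},
        {"text": "Nice weather today", "year": 2018, "is_retweet": False},
    ]
    for tweet in filter(is_relevant, sample):
        print(assign_phase(tweet["year"]), tweet["text"])
```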

Steps of the analysis
In the first step, to make the data usable, I perform a data cleansing process. I remove variables that 'come with' the tweets during scraping but that I will not use later, as well as texts that are otherwise unusable, such as those consisting only of spaces. I then convert the remaining texts into a form that the algorithm can interpret (Németh, Katona, & Kmetty, 2020; Denny & Spirling, 2018). In the second step, I use the Gensim package to find the best-fitting and most meaningful topics for each phase, using both qualitative text processing and algorithmic methods. For the algorithmic method, I optimise the number of topics between 3 and 11, a range drawn from the literature. In the third step, I carry out the sentiment analysis of the subsamples. My hypothesis is that I will find predominantly neutral and positive texts, on the one hand because emphasising negatives in the sustainability discourse can be counterproductive, and on the other hand because social networking sites are characterised by a positive outlook. The results are presented phase by phase, using a combination of the methods.
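The first two steps can be sketched in a few lines. The concrete cleaning rules below (dropping URLs, user mentions, and non-letters, lowercasing, and removing a tiny stop-word list) are assumptions chosen for illustration, not the exact pipeline used here, and lemmatisation is left out. The coherence scores in the usage example are invented, and all function names are hypothetical.

```python
import re
from typing import Dict, Iterable, List

# Tiny illustrative stop-word list; a real analysis would use a fuller one.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "for", "on"}


def clean_tweet(text: str) -> List[str]:
    """Turn one raw tweet into a list of lowercase word tokens."""
    text = re.sub(r"https?://\S+", " ", text)   # drop URLs
    text = re.sub(r"@\w+", " ", text)           # drop user mentions
    text = re.sub(r"[^a-zA-Z\s]", " ", text)    # keep letters only (also strips '#')
    tokens = text.lower().split()
    return [t for t in tokens if t not in STOPWORDS and len(t) > 1]


def preprocess(tweets: Iterable[str]) -> List[List[str]]:
    """Clean all tweets and discard those left empty, e.g. whitespace-only texts."""
    cleaned = (clean_tweet(t) for t in tweets)
    return [tokens for tokens in cleaned if tokens]


def best_num_topics(coherence_by_k: Dict[int, float], low: int = 3, high: int = 11) -> int:
    """Pick the topic count with the highest coherence within the searched range."""
    candidates = {k: v for k, v in coherence_by_k.items() if low <= k <= high}
    return max(candidates, key=candidates.get)


if __name__ == "__main__":
    raw = ["Going #green for a sustainable future! https://t.co/x @user", "   "]
    print(preprocess(raw))
    print(best_num_topics({3: 0.35, 6: 0.45, 11: 0.40}))  # invented scores
```

In practice the scores passed to `best_num_topics` would come from fitting one model per candidate topic count, and the algorithmic choice would still be checked against a qualitative reading of the topics.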

The first analysis phase (2007-2010)
After completing the pre-processing, a total of 2,938 tweets were included in this phase. The most frequently used words, regardless of the topic, were 'green', 'energy', 'design', 'pioneer', 'live', 'food', 'lifestyle', 'new', 'business', 'make', which are mostly related to a general lifestyle in harmony with the environment. In this phase, 50.27% of the tweets were neutral, 42.95% positive, and 6.78% negative. For the tweets in the 2007-2010 range, a total of 6 distinct topics were found, with a good fit.
Topics three and six are quite close, which may be explained by the idea that well-chosen, quality products result in significant energy savings due to their longer life. For all the topics, tweets reflecting negative attitudes were the least represented. Only in the third and sixth topics, i.e., the two connected ones, were tweets with positive attitudes more common than neutral ones. The most positive topics are related to agriculture and food, and the most negative are related to sustainable homes. In general, the sentiment does not differ markedly between topics. The most common words that make up the topics tend to refer to the general lifestyle of individuals rather than to reflection on the global situation.
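The two descriptive statistics reported for each phase, the most frequent terms and the shares of the sentiment categories, can be computed as in the following minimal sketch. The function names are hypothetical and the example data are invented for illustration; they are not the tweets analysed here.

```python
from collections import Counter
from typing import Dict, Iterable, List


def top_terms(tokenised_docs: Iterable[List[str]], n: int = 10) -> List[str]:
    """Return the n most frequent tokens across all documents."""
    counts = Counter(tok for doc in tokenised_docs for tok in doc)
    return [term for term, _ in counts.most_common(n)]


def sentiment_shares(labels: Iterable[str]) -> Dict[str, float]:
    """Return the percentage share of each sentiment label, rounded to 2 decimals."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {lab: round(100 * c / total, 2) for lab, c in counts.items()}


if __name__ == "__main__":
    docs = [["green", "energy"], ["green", "food"], ["green", "energy", "design"]]
    print(top_terms(docs, n=2))
    print(sentiment_shares(["neutral", "neutral", "positive", "negative"]))
```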

The second analysis phase (2011-2015)
A total of 24,090 tweets were included in the analysis phase from 2011 to 2015, which is almost ten times the number of tweets included in the first phase. Using the same method as before, a total of 5 topics were created, again with acceptable topic coherence. The 10 most frequent words, regardless of the topic, were: 'development', 'AMP', 'food', 'make', 'business', 'CSR', 'green', 'need', 'environment', 'future'. In contrast to the previous phase, social responsibility and common sustainability efforts were given greater emphasis. Of the tweets, 48.45% were neutral, 44.54% positive, and 7% negative. In this case, topics 3, 4, and 5 are very close to each other, and their keywords include abbreviations of multi-word terms, such as CSR (corporate social responsibility) or RTTC (responsible travel and tourism collective). Collective action and thinking are also reflected in the texts, which at this stage are mainly focused on the improvement of buildings and other areas, responsible travel, and the discovery of local values. More emphasis is given to reports on the environment and on specific objectives. Environmental and sustainability values are also infiltrating the life of companies and institutions, and consumption trends are starting to shift towards more sustainable options. The period ends with the Paris Agreement, which further reinforces the importance of collective responsibility and global trends towards cooperation.
Taking sentiment into account, tweets with negative attitudes again make up the smallest proportion, but the overall proportion is smaller than in the previous section. In two out of five topics, positive tweets outnumber neutral ones, namely in the case of environmental solutions and the greening and enhancement of buildings. The positive and neutral categories are also more distinct. The most negative topics are related to development and the most positive ones to environmental solutions.

The third analysis phase (2016-2020)
In this section, I examined a total of 33,464 tweets. I created 7 topics with an acceptable topic coherence of 0.42. The ten most frequent terms in the corpus are: 'make', 'solution', 'AMP', 'need', 'go', 'learn', 'would', 'think', 'development', 'future'. These terms describe the phenomenon whereby environmentally conscious actors act with their own life path and future in mind and encourage others to do the same. This is also supported by the topics. The period covered is one in which people are beginning to realise that our lack of action on sustainability is putting our own future at risk. The proportions of sentiments across the whole period were 51.7% positive, 38.98% neutral and 9.32% negative. As before, some topics are close to each other. There are two such groups: topics 1 and 2 form one, and topics 4, 5, 6, and 7 form the other.
Let us now pay more attention to the one topic that stands apart, which contains 1,950 tweets. The importance of this topic lies in the fact that it summarises a hitherto unrepresented part of the environmental movement, namely ethical clothing and packaging and related issues. Its relevance is underlined by the renaissance of the second-hand clothing market and the thrift shop observed in the last few years. There are more and more posts explaining that wearing second-hand goods is not about a lack of style, but about the fact that it is possible to dress in an environmentally responsible and fashionable way. The choice of apparel nowadays also takes into account the place of production, the use of water, and organic ingredients.
Another major part of this topic is packaging. Consumers prefer environmentally friendly packaging (paper, glass, metal) or packaging-free alternatives to the plastic packaging they used to use, and not just in fashion. Importantly, this phase has seen the emergence of explicitly topical terms that have been identified by researchers as key concepts, such as 'climate change' or 'move the date', a key term in the movement to push back the date of Earth Overshoot Day. In addition, there are other topics related to food, environmental movements, economics, and mindsets. The most positive topics are those related to environmental communities and projects, and the most negative are those related to mindsets.

Conclusion
The aim of this paper was to map the environmental sustainability discourse on Twitter using two commonly used tools of natural language processing: topic modelling and sentiment analysis. The analysis dataset consisted of tweets that were posted on Earth Overshoot Day.
One of the methods used to carry out the research was topic modelling. In this paper, a model with appropriate topic coherence was fitted to each of the three sub-databases. Based on the results, it can be said that, where global issues are concerned, the tweets mostly focus on the green economy, energy, and sustainable agriculture options. The texts dealing with issues at the individual level are organised around the themes of environmental communities, responsible product purchase and use, and mindsets. The other method used was sentiment analysis. By examining the sentiment of tweets it can be concluded that over time the sentiment of tweets undergoes a significant polarisation. The proportion of positive sentiments decreases by almost 10% and the proportion of negative sentiments increases by 3%. This is partly due to the worsening situation and partly due to the erosion of the positive outlook on social media platforms. The most positive topics are about environmental communities, projects, and solutions, while the most negative ones are about development and mindsets. Unfortunately, the tweets under review do not deal at all, or only sporadically, with the situation of our salt and fresh waters, forests, air and chemical pollution, and biodiversity loss in general. These areas correspond to Sustainable Development Goals (SDGs) 6, 14, and 15. These are therefore the areas where education is still needed, in addition to informing people as widely as possible about the interrelationships between the sub-areas. Emphasising the general context is also important because improvements in some areas, particularly technological ones, may reduce the threat of climate change but exacerbate other problems. What we clearly need now is to find comprehensive solutions together to protect our common future. In the future it would certainly be worthwhile to carry out similar research on other platforms, either by defining search terms more specifically or by selecting a longer, continuous period.
It may also be interesting to repeat previous research while ensuring comparability.