Wikipedia:Wikipedia Signpost/2020-04-26/By the numbers

By the numbers

Open data and COVID-19: Wikipedia as an informational resource during the pandemic

Changwook Jung, Sun Geng, and Meeyoung Cha are from the Institute for Basic Science, South Korea & KAIST. Inho Hong is from the Center for Humans & Machines, Max Planck Institute for Human Development, Germany.
Diego Saez-Trumper is a researcher employed by the WMF. This paper represents work beyond his regular duties. This article was originally published on "Medium". The text on "Medium", but not the graphs, is licensed CC0.

From the very start of COVID-19, when it was known only as an outbreak of atypical pneumonia in China, people around the world have been finding and sharing information about the virus on Wikipedia, a frequently used online resource for medical information. While the content and quality of the information on Wikipedia are shaped by volunteer editors (over 34K contributing to COVID-19 related pages) and by policies about verifiability, the activity of these volunteers and readers itself generates a considerable amount of data. For example, we can explore how many Wikipedia articles have been created about COVID-19 related topics. Which sources are cited in those articles? How many people have reviewed such articles? Which are the most visited pages?

This post offers an overview of the COVID-19 related data generated in Wikipedia, highlighting the diversity of content that people read: from general information about the pandemic and regional responses, to the people who have been involved in the pandemic and misinformation about the virus. You can see some of this data in a new interactive resource from the Wikimedia Foundation, which will be updated regularly. All the data used in this article is public and can be scrutinized, accessed, and used by third parties, using the MediaWiki API and other online resources offered by the Wikimedia Foundation. Sample source code is available in this Jupyter notebook.

Seeking information during COVID-19: English Wikipedia

Coronavirus disease (COVID-19) is an infectious disease caused by a newly discovered coronavirus. The first case was observed in China in late 2019, quickly followed by outbreaks in nearby East Asian countries like South Korea. Within a few weeks, outbreaks could be seen throughout Europe and America, leading the World Health Organization (WHO) to declare it a pandemic. Countries like Iran, Italy, Spain, and the United States have each seen over 50,000 confirmed cases (Fig 1). As of April 14, fewer than 90 days since the lockdown in China, COVID-19 has infected over 1,880,000 people and has killed more than 117,000 worldwide.

Figure 1. Case statistics of COVID-19 in China, South Korea, Spain, and the US (right axis, log scale). These countries had outbreaks at different times. While the patient count was increasing at a slower rate in China and South Korea by early March, Spain and the US show a sharp rise. In gray, the number of page views on COVID-19 related English Wikipedia articles (left axis, linear scale). For the original version of Figure 1, click here.

What kinds of information did people most seek on COVID-19? How did their attention change over time, as the number of patients quickly rose globally? How quickly were pages tracking regional cases updated? These are critical questions that help us better cope with the current pandemic as well as any other that might come in the future. We analyzed the complex and diverse attention of the public during the COVID-19 pandemic from the browsing logs of English Wikipedia pages. This post features findings on English content; patterns from other languages such as Korean, Italian, and Spanish will be covered in our next post.

Methodology

Central to the content we observe on Wikipedia is Wikidata, a structured data repository that links all Wiki projects. Most Wikipedia articles link to a Wikidata item, and these items are in turn linked to one another. For example, there is a link between the COVID-19 Wikidata item and the Pandemic one. Therefore, we can identify content related to COVID-19 by following these connections. This can be done easily by clicking on the "What links here" button that exists on every Wikipedia and Wikidata page. By looking at the Wikipedia pages that link to those items, we can obtain a list of articles related to COVID-19 in each language. Constraining the results to English Wikipedia yields 878 pages.
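The "What links here" lookup can also be scripted. The following is a minimal sketch of one possible reading of the procedure, not the authors' actual code: it uses the MediaWiki API's `list=backlinks` query on Wikidata and the `wbgetentities` action to map the linked items to English Wikipedia titles. The helper names are hypothetical, the item ID is left as a parameter, and network access and result continuation are deliberately omitted so that only the request parameters and response parsing are shown.

```python
WIKIDATA_API = "https://www.wikidata.org/w/api.php"


def backlinks_params(item_id, limit=500):
    """Parameters for the 'What links here' list of one Wikidata item."""
    return {
        "action": "query",
        "list": "backlinks",
        "bltitle": item_id,   # e.g. the COVID-19 item ID (to be looked up)
        "blnamespace": 0,     # main namespace, where items live
        "bllimit": limit,
        "format": "json",
    }


def sitelink_params(item_ids, site="enwiki"):
    """Parameters to fetch the Wikipedia article linked to each item."""
    return {
        "action": "wbgetentities",
        "ids": "|".join(item_ids),  # the API limits how many IDs fit in one call
        "props": "sitelinks",
        "sitefilter": site,
        "format": "json",
    }


def linked_items(response):
    """Extract item IDs from a parsed backlinks response."""
    return [page["title"] for page in response["query"]["backlinks"]]


def article_titles(response, site="enwiki"):
    """Extract article titles for one wiki from a wbgetentities response."""
    titles = []
    for entity in response["entities"].values():
        link = entity.get("sitelinks", {}).get(site)
        if link:  # items without an article on this wiki are skipped
            titles.append(link["title"])
    return titles
```

Each parameter dictionary would be sent as a GET request to `WIKIDATA_API`; changing `site` to another wiki code (for example `kowiki`) would restrict the list to a different language edition.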

These Wiki pages cover a plethora of topics, and each page could be assigned to one of four categories identified through qualitative coding.

  • Virus: Wiki pages that directly cover topics on the virus itself (such as Coronavirus disease 2019), developments on tests and vaccines (e.g., COVID-19 vaccine), and symptoms (e.g., Severe acute respiratory syndrome coronavirus 2) belong to this category. Out of 878 total pages, 11 Wiki pages were classified in this category.
  • Region: Tracking pages dedicated to specific regions were quickly created as outbreaks spread globally (e.g., 2020 coronavirus pandemic in New York (state)). Our data contain 310 such Wiki pages.
  • People: Celebrities and public figures who are related to COVID-19 as spokespersons, doctors, or infected patients were grouped into the people category. This category has the largest number of Wiki pages: 516 in total.
  • Others: All other pages that linked to the COVID-19 Wikidata item were categorized as 'others.' These pages often contain information about a specific event (such as 2020 Hubei lockdowns), location (such as NHS Nightingale Hospitals), or socio-economic impact (such as 2020 stock market crash). A total of 41 Wiki pages belong to this category.
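As a quick consistency check, the four category counts listed above can be tallied against the 878-page total. The snippet below is only an illustrative calculation; the variable names are ours.

```python
# Page counts per category, as reported in the list above.
category_pages = {"Virus": 11, "Region": 310, "People": 516, "Others": 41}

total = sum(category_pages.values())  # should match the 878 pages found

# Percentage of all pages that falls into each category.
shares = {name: round(100 * n / total, 1) for name, n in category_pages.items()}

for name, pct in shares.items():
    print(f"{name}: {category_pages[name]} pages ({pct}%)")
print("Total:", total)
```

The counts add up to 878, with the people category accounting for well over half of all pages.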

Content Dynamics

One challenge in tracking people’s attention is that the data describing the event changes over time. Figure 2 shows a live example of the daily page-views of two Wiki pages: Pandemic and 2019–20 coronavirus pandemic.

At first, the Pandemic Wiki page was not a frequently visited one. On March 11, however, this page showed a sharp increase in the number of page-views when the WHO declared the disease a pandemic. Note that this page is not linked to the COVID-19 Wikidata item and hence is considered “not relevant” in our analysis. Nonetheless, many people might have arrived at this page either by searching for “pandemic” or by following hyperlinks that lead to it. There are other examples (particularly disease and outbreak-related pages like Influenza pandemic) that temporarily became popular due to COVID-19.

Second, the 2019–20 coronavirus pandemic Wiki page appears to have been created much earlier than the WHO’s announcement in March. Such a discrepancy can appear when the titles of Wiki pages change over time. In this particular case, the original title was 2019–20 coronavirus outbreak, and it was later changed to 2019–20 coronavirus pandemic. The dynamic nature of Wiki content reflects the time-evolving nature of events, and these dynamics should be taken into account when analyzing Wikipedia content.
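One practical consequence of a title change is that view counts may be split across the old and new titles. The sketch below assumes that daily views have already been collected separately for each title a page has carried, as dictionaries keyed by date; the function name and the numbers in the usage example are made up for illustration.

```python
from collections import Counter


def merge_title_series(*series):
    """Sum daily views across every title a single page has carried.

    Each argument is a dict mapping a date string (YYYYMMDD) to a view count.
    """
    merged = Counter()
    for daily in series:
        merged.update(daily)  # Counter.update adds counts for shared dates
    return dict(sorted(merged.items()))


# Illustrative numbers only, not real page-view data.
old_title = {"20200310": 120, "20200311": 80}
new_title = {"20200311": 40, "20200312": 150}
combined = merge_title_series(old_title, new_title)
```

Here the overlapping day is summed, so the combined series follows the page rather than any one of its titles.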


Figure 2. English Wikipedia Views on Two Pandemic Pages. “Pandemic” as a general term had high attention only on March 11–12, when the WHO declared the coronavirus outbreak a global pandemic. “2019–20 coronavirus pandemic” has been viewed steadily since January. Its view counts increased most steeply in the few days before the declaration. Since late March, the view counts have been decreasing as the growth rate slows. *The 2019–20 coronavirus pandemic page was titled 2019–20 coronavirus outbreak before the WHO’s announcement. For the original version of Figure 2, click here.

Most Viewed Pages

So which individual pages attracted the most attention? Using the public Wikimedia pageviews tools, we can compare the number of times that each of these pages was visited. Sorting content by the maximum number of daily views, we arrive at the five Wiki pages in Figure 3: 2019–20 coronavirus pandemic (the top red line), 2020 coronavirus pandemic in the United States (yellow line), Severe acute respiratory syndrome coronavirus 2 (green line), Tom Hanks (blue line), and Timeline of the 2019–20 coronavirus pandemic (purple line).
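The ranking by maximum daily views can be sketched as follows. This is an illustration rather than the authors' pipeline: it assumes the Wikimedia Pageviews REST API (the `per-article` endpoint) as the data source, and the helper names and User-Agent string are hypothetical.

```python
import json
from urllib.parse import quote
from urllib.request import Request, urlopen

BASE = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"


def pageviews_url(title, start, end, project="en.wikipedia",
                  access="all-access", agent="user"):
    """Build the daily per-article URL; start and end are YYYYMMDD strings."""
    article = quote(title.replace(" ", "_"), safe="")
    return f"{BASE}/{project}/{access}/{agent}/{article}/daily/{start}/{end}"


def fetch_json(url):
    """Fetch and parse one API response (requires network access)."""
    request = Request(url, headers={"User-Agent": "signpost-example/0.1"})
    with urlopen(request) as response:
        return json.load(response)


def daily_views(payload):
    """Map each day (YYYYMMDD) to its view count from a parsed response."""
    return {item["timestamp"][:8]: item["views"] for item in payload["items"]}


def rank_by_peak(series_by_title, top=5):
    """Return the titles with the highest single-day view count."""
    peaks = {t: max(s.values()) for t, s in series_by_title.items() if s}
    return sorted(peaks, key=peaks.get, reverse=True)[:top]
```

In use, one would call `daily_views(fetch_json(pageviews_url(title, "20200101", "20200414")))` for each of the pages and pass the resulting dictionary of series to `rank_by_peak`.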

Figure 3. Top 5 content Wiki pages. “2019–20 coronavirus pandemic” was viewed far more than the other pages. “2020 coronavirus pandemic in the United States” is getting more views as the virus spreads in the US. “Tom Hanks” had a sharp peak on March 12 because of his infection, but it soon lost attention. Most pages in the People category show a similar view pattern. “Severe acute respiratory syndrome coronavirus 2” had the most views at the beginning of the spread. For the original version of Figure 3, click here.

Pages like 2019–20 coronavirus pandemic, Severe acute respiratory syndrome coronavirus 2, and Timeline of the 2019–20 coronavirus pandemic have been among the most visited pages from mid-January and throughout the pandemic. The second of these describes SARS-CoV-2, the virus that causes COVID-19. On the other hand, the regional tracker page 2020 coronavirus pandemic in the United States was created in January, but only became popular in March as local outbreaks began in the US. The accumulated view counts on regional tracking pages often mimic the outbreak pattern in the corresponding country. We also turn our attention to celebrity Tom Hanks, who was infected with the virus and has now recovered. Wiki pages of individuals are connected to the "COVID-19" Wikidata item, but sometimes this linkage is removed by Wikipedia editors once the event passes. This adds to the complexity of Wikipedia data. Many other pages on people involved in some way with the disease show a similar spike in page-views during the pandemic.

Content by Topical Category

Next, we check how attention is divided across the four topical categories (virus, region, people, and others) by examining the aggregate view counts for each category. The “virus” category Wiki pages together receive the most views during the initial phase, until the end of February (Figure 4), and they continue to be popular throughout the pandemic. From March 1 onward, however, the “region” category Wiki pages receive more aggregate views. This shows how attention was split between these two content types: people continue to learn about the virus and, at the same time, track the regional progress of transmission. The “people” category content becomes popular from mid-March, even leading to larger aggregate views on some days in April. Wiki pages on celebrities and public figures are generally popular within Wikipedia. As the virus spreads, more public figures become associated with COVID-19, diversifying public interest. The structured nature of Wikidata even allows us to understand how and why these people are associated with the disease. By using a semantic query language, we can see that most of these people were linked by having COVID-19 as their "Medical Condition" or "Cause of Death".
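A query of this kind could be written in SPARQL and sent to the Wikidata Query Service. The sketch below is our own illustration, not the query the authors ran: it assumes that "Medical Condition" and "Cause of Death" correspond to Wikidata properties P1050 and P509, and the COVID-19 item ID in the usage example (Q84263196) is likewise an assumption that should be verified on Wikidata before use.

```python
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def people_linked_query(item_id):
    """Build a SPARQL query for humans linked to an item via P1050 or P509."""
    return f"""
SELECT ?person ?personLabel ?property WHERE {{
  VALUES ?property {{ wdt:P1050 wdt:P509 }}   # medical condition, cause of death
  ?person wdt:P31 wd:Q5 ;                     # instance of: human
          ?property wd:{item_id} .
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
}}
"""


def request_params(query):
    """GET parameters for the query service, asking for JSON results."""
    return {"query": query, "format": "json"}


# Assumed ID for the COVID-19 item; confirm on Wikidata before relying on it.
query = people_linked_query("Q84263196")
params = request_params(query)
```

Grouping the results by `?property` would show how many people are linked through each of the two relations.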

The “others” category pages, which describe the socio-economic impact and other related events, also attract more attention from mid-March. However, far less attention is paid to these pages than to the other three categories.
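The aggregation behind this comparison can be sketched as below, assuming that daily views per page and a page-to-category mapping are already available as dictionaries. The function name, the fallback to "Others" for unlabeled pages, and the sample values are our assumptions for illustration.

```python
from collections import defaultdict


def aggregate_by_category(views_by_title, category_of):
    """Sum daily views over all pages in each topical category.

    views_by_title: {title: {day: views}}
    category_of:    {title: category name}; unlabeled titles go to "Others".
    """
    totals = defaultdict(lambda: defaultdict(int))
    for title, daily in views_by_title.items():
        category = category_of.get(title, "Others")
        for day, views in daily.items():
            totals[category][day] += views
    return {category: dict(days) for category, days in totals.items()}


# Illustrative values only, not real page-view data.
sample_views = {
    "Page A": {"20200311": 100, "20200312": 50},
    "Page B": {"20200311": 30},
    "Page C": {"20200312": 7},
}
sample_categories = {"Page A": "Region", "Page B": "Region"}
category_totals = aggregate_by_category(sample_views, sample_categories)
```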

Figure 4. English Wikipedia Views by Topical Category. To view this figure click here. We divide COVID-19 related Wikipedia pages into 4 categories: Virus, Region, People, and Others. The “Virus” category was dominant at the beginning of the spread, but the “Region” category became dominant on March 1. The “Region” category has its highest peak on March 23. The “People” category has a sharp peak on March 12 because of Tom Hanks’s infection. Interestingly, the other categories all have a peak on that day as well. The “People” category fluctuated as news spread of famous people confirmed to be infected. Most categories except “People” show decreasing

Content Share by Category

Stacked charts and ratio charts are excellent ways to visualize how the aggregate view counts by topical category evolve over time. It is noteworthy that the total view count across all content was already substantial in late January, reaching over 500,000 views. This came before the WHO declared the outbreak a Public Health Emergency of International Concern (PHEIC) on January 30. Internet users actively sought information from credible sources like Wikipedia at a time when little information was available from official sources.
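The values plotted in a ratio chart are simply each category's share of the day's total views. A minimal sketch of that computation, assuming per-category daily totals are available as nested dictionaries (the function name and sample numbers are ours):

```python
def daily_shares(totals_by_category):
    """Convert per-category daily totals into per-day proportions.

    totals_by_category: {category: {day: views}}
    Returns {day: {category: share of that day's views}}.
    """
    days = sorted({day for series in totals_by_category.values() for day in series})
    shares = {}
    for day in days:
        day_total = sum(series.get(day, 0) for series in totals_by_category.values())
        shares[day] = {
            category: (series.get(day, 0) / day_total if day_total else 0.0)
            for category, series in totals_by_category.items()
        }
    return shares


# Illustrative values only, not real page-view data.
sample_totals = {
    "Virus": {"20200125": 300, "20200311": 200},
    "Region": {"20200125": 100, "20200311": 600},
}
sample_shares = daily_shares(sample_totals)
```

The absolute totals feed the stacked chart, while these proportions feed the ratio chart, so a category can lose share even while its raw views grow.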

The most significant peak occurs a few days before March 11, when the WHO declared that COVID-19 could be characterized as a pandemic. This declaration followed large-scale outbreaks in several countries in Europe and the Middle East, especially Italy, Iran, and France. Since then, aggregate attention has stayed at a similar level, reaching over 6,000,000 views a day.

Figure 5. Stacked Chart and Ratio Chart by Topical Category. (a) The total number of views on COVID-19-related pages increased exponentially from late February to early March. In March, the daily views varied in the range of 6–8 million. The “Virus” category was dominant at the beginning of the spread, but the “Region” category has been dominant since March. For the original version of Figure 5a, click here.
Figure 5(b). To view this figure click here. The proportion of the “Region” category has increased rapidly since February, as the virus spread domestically outside China. The “Virus” category had its peak in January and has gradually decreased since. The “People” category has the highest variance, which is also visible in its proportion. The peak related to Tom Hanks alone accounts for a significant proportion of views among all pages related to COVID-19.

The ratio chart identifies the day-to-day composition of public attention across diverse topics. The shift of attention from virus Wiki pages to regional tracker pages suggests that internet users are first interested in gaining knowledge about the disease, but their attention then shifts to more geographically constrained information (which likely has a more immediate impact on individuals). The speed at which these regional pages are updated is unprecedented; it is as quick as any local CDC report (which we will look at in future reports). Overall, the shift of attention from virus pages to regional and people pages indicates that Wikipedia has served multiple purposes during the pandemic: as a reliable source of scientific facts about the virus, as a source of regional breaking news, and as a source of less critical updates in the people category. Meeting such dynamic attention would not have been possible without the dedicated participation of over 34,000 editors who contributed to creating and updating these English Wikipedia articles.

Final remarks

The open data available on Wikipedia allows researchers, and the community in general, to cope with urgent information needs during crises. The huge demand for local and region-specific content highlights the importance of having a distributed community of editors who can generate such content. In our next post, we will show, by analyzing non-English Wikipedia pages, how readers are especially interested in what is happening in their local regions.

Reproducibility

All this analysis is based on public information. You can learn more about the results and methodology used in this analysis by visiting this page.