Wikipedia:Wikipedia Signpost/2019-03-31/In focus

In focus

The Wikipedia SourceWatch

A new project to find unreliable sources cited by Wikipedia

A few years back, while working on WikiProject Academic Journals' Journals Cited by Wikipedia (JCW) compilation, I realized we could harness the power of bots to identify a variety of unreliable sources which are cited by Wikipedia. I've dubbed the project The Wikipedia SourceWatch (or just The SourceWatch),[a] as it aims to identify and combat unreliable sourcing, similarly to Quackwatch, which aims to identify and combat medical quackery, and Retraction Watch, which reports retracted research in scientific journals.

For context, the JCW compilation takes the various |journal= parameters of {{cite xxx}} templates found in articles, and compiles them into various lists. For example, in the following citation

  • {{cite journal |last1=Yager |first1=K. |year=2006 |title=Wiki ware could harness the Internet for science |journal=Nature |volume=440 |issue=7082 |pages=278–278 |doi=10.1038/440278a}}

a bot would find |journal=Nature and then report it at WP:JCW/N7.[b][c] The compilation is organized in many ways (alphabetically, by citation count, and so on) and is typically updated a few days after the 1st and 20th of each month, when database dumps are generated. Those who want a bit of history and technical details can check the main JCW page or this talk I gave in Montreal for Wikimania 2017.
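To make the extraction step concrete, here is a minimal Python sketch of how |journal= values could be pulled out of {{cite xxx}} templates and tallied. It is not the bot's actual code: the regular expression and the function names (journal_values, count_journals) are assumptions made for illustration, and the sketch deliberately ignores nested templates and piped wikilinks inside parameters.

```python
import re
from collections import Counter

# Hypothetical pattern: a {{cite xxx}} template with no nested templates.
# Group 1 captures everything after the first pipe.
CITE_TEMPLATE = re.compile(r"\{\{\s*cite\s+[^{}|]*\|([^{}]*)\}\}", re.IGNORECASE)


def journal_values(wikitext):
    """Yield each non-empty |journal= value found in citation templates."""
    for match in CITE_TEMPLATE.finditer(wikitext):
        for param in match.group(1).split("|"):
            name, sep, value = param.partition("=")
            if sep and name.strip().lower() == "journal" and value.strip():
                yield value.strip()


def count_journals(wikitext):
    """Count how often each journal name appears in templated citations."""
    return Counter(journal_values(wikitext))


sample = (
    "{{cite journal |last1=Yager |first1=K. |year=2006 "
    "|title=Wiki ware could harness the Internet for science "
    "|journal=Nature |volume=440 |issue=7082 |pages=278–278 "
    "|doi=10.1038/440278a}} "
    # A non-templated citation: the pattern above never sees it.
    "Maddox, J. (1988). ''Nature''. '''334''' (6180): 287–290."
)

print(count_journals(sample))
```

Only the templated citation contributes to the count; the hand-formatted one is skipped, which matches the behaviour described in note [c].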

The Directory of Open Access Journals does not allow predatory journals to be listed on its directory. As such, several journals will lie about being included in DOAJ to appear more legitimate. Predatory journals will also lie about having impact factors or about being included in high-reputation databases like Scopus or Web of Science. The DOAJ advises "to ALWAYS check at https://doaj.org that a journal is indexed in DOAJ even if its web site carries the DOAJ logo or says that it is indexed [in DOAJ]". This is good advice, which applies equally to the other indexing services.

The idea of using the JCW compilation to fight unreliable sourcing stewed in my mind for a while, until I finally decided to take action in August 2018. I contacted JLaTondre, who runs the bot, and together we began laying down the first bricks of The SourceWatch. The bot would look for the various |journal= parameters of citation templates and cross-check them against Beall's List, a list maintained by librarian Jeffrey Beall to identify predatory journals and publishers until it was taken down in 2017. Beall's List is not perfect by any means, especially if you want a list that only identifies journals that are definitely predatory, rather than journals that range from questionable to definitely predatory, but it was a good start. Since there are other efforts beyond Beall's List to identify unreliable sources in general, I expanded The SourceWatch to draw from a variety of additional sources, including circular references to Wikipedia, deprecated or generally unreliable sources, journals lying about being included in the Directory of Open Access Journals, Quackwatch's list of non-recommended periodicals, self-published sources and vanity publications, and sources from notoriously unreliable fields (which are broadly speaking the subcategories of Category:Pseudo-scholarship and a few others). While journals from Cabell's blacklist could not be included as of writing due to the exorbitant paywall, they might get included in the future.

Two main ways of using The SourceWatch exist:

  1. Browsing WP:SOURCEWATCH directly. If 5 or fewer articles cite a specific publication, the links to these articles will be given. If more than 5 articles cite it, you will have to search Wikipedia to find where it is cited. This is useful to find articles which need to be updated with reliable sources, or where unreliable sources need to be removed.
  2. Using Special:WhatLinksHere on an article and looking for links from Wikipedia:WikiProject Academic Journals/Journals cited by Wikipedia/Questionable1 (or .../Questionable2, .../Questionable3, ...). This won't directly tell you which potentially unreliable publication is cited, but it will let you know that at least one is. This is useful when you edit an article and want to make sure you are not citing bad sources. However, this method only works if 5 or fewer articles cite a specific publication.
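The 5-article rule behind both methods can be sketched in Python as follows. This is only an illustration under stated assumptions: the names LINK_THRESHOLD, report_entry and has_backlink, the output format, and the sample data are all hypothetical, and the real compilation pages are laid out differently.

```python
LINK_THRESHOLD = 5  # from the article: links are listed for 5 or fewer citing articles


def report_entry(publication, citing_articles):
    """Format one (hypothetical) report line for a flagged publication."""
    count = len(citing_articles)
    if count <= LINK_THRESHOLD:
        links = ", ".join(f"[[{title}]]" for title in sorted(citing_articles))
        return f"{publication} ({count}): {links}"
    return f"{publication} ({count}): search Wikipedia to find the citing articles"


def has_backlink(article, citations):
    """True if the report page would link to `article` (method 2 would work)."""
    return any(
        article in articles and len(articles) <= LINK_THRESHOLD
        for articles in citations.values()
    )


# Made-up data: one rarely cited publication, one widely cited publication.
citations = {
    "Journal A": {"Article X"},
    "Journal B": {f"Article {i}" for i in range(6)},
}

for publication, articles in citations.items():
    print(report_entry(publication, articles))

print(has_backlink("Article X", citations))  # cited publication is under the threshold
print(has_backlink("Article 0", citations))  # cited publication is over the threshold
```

The second lookup comes back negative even though the article cites a flagged publication, which is why Special:WhatLinksHere cannot be relied on for widely cited sources.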

For example, as of writing, the article on Heinrich Albert cites Deutsche Allgemeine Zeitung, a German newspaper published from 1861 to 1945, which is categorized in Category:Propaganda > Category:Nazi propaganda > Category:Nazi newspapers. This does not mean that citing Deutsche Allgemeine Zeitung is necessarily inappropriate – the newspaper did not exclusively publish Nazi propaganda over the 84 years of its existence – but it is good to verify that we are not citing Nazi propaganda inappropriately. This can be found either by browsing WP:SOURCEWATCH, which features Deutsche Allgemeine Zeitung under the 'Propaganda' category, or through Special:WhatLinksHere/Heinrich Albert, which shows a link from Wikipedia:WikiProject Academic Journals/Journals cited by Wikipedia/Questionable1.

A figure from the famous "Get me off Your Fucking Mailing List" paper by David Mazières and Eddie Kohler, accepted in the International Journal of Advanced Computer Technology.[1] The journal's 'review' process deemed the paper "excellent". Figure 2 in the paper shows even more rigorous data on why Mazières and Kohler should be taken off the aforementioned mailing list.

Of course, due to the inherently subjective nature of what constitutes an unreliable source, The SourceWatch includes sources that range from questionable to definitely unreliable, but it also has a few false positives. For the questionable we have, for example, journals and publishers which may merely engage in questionable practices such as sending spam emails to researchers, but which nonetheless remain committed to scientific and academic standards. For the definitely unreliable, we have journals that literally accept anything, even SCIgen papers, if you pay them. For false positives, we have hijacked journals, which are fraudulent publications designed to have identical or similar names to established publications.[d] Other false positives can include members of categories such as Category:Paranormal magazines, which may set out to debunk hoaxes and nonsensical claims, rather than perpetuate them. Yet another cause of false positives is that the algorithm used to find those unreliable sources is not perfect. It is designed to find typos and similar names (Journal of Science vs Journal of Sciences), but will sometimes pick up journals that are obviously (to humans) unrelated (African Journal of ... vs American Journal of ...). However, false positives can be manually identified, and the compilation will be updated accordingly in future bot runs. And lastly, The SourceWatch is heavily based on third party lists and will to an extent reflect the opinion of those lists' compilers, which could be inaccurate or outdated in certain cases.
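The trade-off in similar-name matching can be illustrated with a short Python sketch. The bot's real matching algorithm is not described here, so this swaps in the standard library's difflib.SequenceMatcher as a stand-in; the 0.9 threshold, the function names, and the "Journal of Examples" titles are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive similarity ratio between two names, from 0.0 to 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def flag_similar(cited, watchlist, threshold=0.9):
    """Return watchlist names whose similarity to `cited` meets the threshold."""
    scored = ((name, similarity(cited, name)) for name in watchlist)
    return [(name, score) for name, score in scored if score >= threshold]


# Hypothetical watchlist entries, not taken from any real list.
watchlist = ["Journal of Sciences", "American Journal of Examples"]

# A likely typo or variant: flagged, as intended.
print(flag_similar("Journal of Science", watchlist))

# A different title that merely looks similar: also flagged (a false positive).
print(flag_similar("African Journal of Examples", watchlist))

# A clearly unrelated name: not flagged.
print(flag_similar("Nature", watchlist))
```

Raising the threshold would drop the "African" versus "American" match, but it would also start missing genuine typos, which is why a manually maintained exclusion list remains useful.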

I want to emphasize here just how much work JLaTondre has done on this and JCW over the nearly 10 years of the compilation. The original JCW compilation and The SourceWatch may be my ideas, but JLaTondre is the one responsible for the heavy lifting and making them a reality since 2011.[e] I must also acknowledge the contributions of several people: Ronhjones for their help managing the configuration pages,[f] Tokenzero for their help with the creation of several redirects useful to The SourceWatch,[g] as well as the many people at Village Pump (technical) who have helped over the years with various matters, Galobtter in particular. Hundreds of citations were cleaned up using The SourceWatch during development, but it was only known to a handful of people due to its unpolished state. The compilation was at times plagued with a staggering number of false positives and a poor presentation structure. Now, after several iterations, The SourceWatch is something that should be usable by the community at large. While there likely is still room for improvements and debates on what should or should not be listed, one no longer needs to be familiar with the intricate workings of the bot to make sense of The SourceWatch lists, or spend months playing Whac-A-Mole against false positives.

The SourceWatch does not definitively answer whether a source is unreliable. Even if a source is unreliable, The SourceWatch does not definitively answer whether it is appropriate to cite it either. However, The SourceWatch is a good starting point to find unreliable sources, at least those which make use of citation templates. Once they are found, the community can then critically evaluate whether or not they should be cited, leading to a better, more reliable, Wikipedia. Whether a source should be cited can be discussed at the reliable sources noticeboard, or alternatively at a relevant WikiProject's talk page, such as WikiProject Medicine for medically dubious sources, or WikiProject Physics for sources claiming to have proven aether theories.

Suggestions on how to improve The Wikipedia SourceWatch can be made at WT:SOURCEWATCH. Particularly welcomed would be suggestions for additional sources that The SourceWatch could draw from, like lists of journals lying about being indexed by reputable databases. Other efforts to identify and prevent unreliable sourcing can be found in the "other efforts" section of the WP:JCW navbox.

Notes and references

Notes
  1. ^ Renamed The Wikipedia CiteWatch or The CiteWatch in May 2019, per RFC.
  2. ^ As of writing. If you are reading this at a later date, Nature may be reported at a different location.
  3. ^ Non-templated citations like
    • Maddox, J.; Randi, J.; Stewart, W. W. (1988). "'High-dilution' experiments a delusion". ''Nature''. '''334''' (6180): 287–290. {{doi|10.1038/334287a0}}.
    are completely ignored by the bot.
  4. ^ For example, the perfectly respectable journal Wulfenia's web presence has been hijacked (with the fake websites www.wulfeniajournal.at / www.wulfeniajournal.com / www.multidisciplinarywulfenia.org), while the real website is hosted by the Regional Museum of Carinthia. As of writing, the bot will report Wulfenia, out of concern it may be a citation to one of the fraudulent websites, even though in all likelihood those citations will be to the real website. This behaviour may change in the future.
  5. ^ From 2009 to 2011, ThaddeusB coded WikiStatsBOT to take care of JCW.
  6. ^ Specifically, Ronhjones coded RonBot (Task #10), which sorts and organizes WP:SOURCEWATCH/SETUP (upon which The SourceWatch is based) and WP:JCW/EXCLUDE (which removes false positives).
  7. ^ Specifically, Tokenzero coded TokenzeroBot (Tasks #5 and #6 especially), which creates redirects of the type Predatory Journal → Predatory Publisher, including the ISO 4 abbreviations of such journals. It also puts appropriate disambiguation notes in articles, when relevant.
References
  1. ^ Beall, J. (20 November 2014). "Bogus journal accepts profanity-laced anti-spam paper". Scholarly Open Access. Archived from the original on 2014-11-22.
  2. ^ Wales, Jimmy (23 March 2014). "Jimmy Wales, Founder of Wikipedia: Create and enforce new policies that allow for true scientific discourse about holistic approaches to healing. > Jimmy Wales's response". Change.org. Retrieved 18 February 2019.