As the scale of scientific research grows, the hard problem of research assessment keeps getting harder, yet it is work that has to be done, because governments, research institutions, funding agencies and individual scholars all need it. In the past, assessment relied almost entirely on peers, since there were no other indicators worth trusting; that has changed, and assessment now leans more and more on quantitative metrics. Metrics may look more objective than peer review, but there is no absolutely reliable quantitative indicator of scholarship: citation counts and the like mostly reflect how hot a topic is, not how good the research is.
When a disease has no specific cure, many drugs end up on offer; likewise the quantitative indicators for research assessment keep multiplying, fundamentally because no single one can do the whole job of evaluation. If one drug does not work, you resort to a combination prescription. These indicators, however, rarely apply universally; they were designed with healthy and positive intentions, yet they are routinely misused. Take the journal impact factor: it is an important indicator for judging academic journals, but many academic institutions, Chinese ones in particular, treat the impact factor as a proxy for the quality of individual papers.
That is plainly absurd. Which journal a paper ends up in involves a good deal of chance. Journals do differ in quality and are roughly comparable at a coarse level, and the number of papers published in top journals is an internationally recognized reference point for scholarly standing: if you can publish, on merit, a dozen or more research articles in one field in CNS (Cell, Nature, Science), your academic level speaks for itself. But a journal's impact factor is certainly not an accurate measure of scholarly quality; the impact factor of the journal a paper appears in can serve only as one important reference. A paper's own citation record is relatively more accurate, though even that is no absolute measure of research quality.
A recent article in Nature takes up this question again and reviews the rough history of quantitative research assessment. Before 2000, only the CD-ROM edition of the Science Citation Index (SCI), produced by the Institute for Scientific Information in the United States, was used by a handful of specialists for bibliometric analysis. In 2002 Thomson Reuters launched the web version (Web of Science), making the tool far more convenient to use. Other companies then built similar evaluation platforms of their own: Elsevier released Scopus in 2004, and the beta version of Google Scholar also went live in 2004.
All of these are evaluation tools built on citations as the basic indicator; what differs is the scope of the literature they cover. SCI only counts citations of SCI-indexed papers by other SCI-indexed papers; if an article is cited by literature outside the SCI, the system simply does not see it. Scopus covers a wider range, but is still limited to its own index. Google Scholar, by contrast, imposes no restriction: if a citation exists anywhere, it is counted.
In terms of coverage, Google Scholar is the best and SCI the worst; in accuracy, Google Scholar is the worst and SCI the best; in timeliness, Google Scholar is the best and SCI the worst. There are also web-based tools for comparing the research output and impact of institutions, such as InCites (built on SCI data) and SciVal (built on Scopus), as well as software for analysing individual citation profiles from Google Scholar data, such as Publish or Perish, released in 2007.
In 2005, Jorge Hirsch, a physicist at the University of California, San Diego, proposed the h-index (also written h index or h factor), an indicator of an individual's scholarly impact computed from the ranked citation counts of all of his or her published papers. The h stands for "high citations": a researcher's h-index is the largest number h such that h of his or her papers have each been cited at least h times. The h-index reflects individual scholarly achievement reasonably accurately; the higher a person's h-index, the greater the impact of his or her papers.
For example, an h-index of 20 means that, among the papers this person has published, 20 have each been cited at least 20 times. Determining someone's h-index is easy: go to the Web of Science site (other databases work too, though they give different values), retrieve all of the person's indexed papers, sort them by citation count in descending order, and walk down the list until a paper's rank exceeds its citation count; that rank minus one is the h-index. The journal impact factor, by comparison, began attracting attention from 1995 onward.
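To make that procedure concrete, here is a minimal sketch in Python (not part of the original article); the citation counts are invented purely for illustration.

```python
def h_index(citations):
    """Compute the h-index from a list of per-paper citation counts.

    Sort the counts in descending order and walk down the list until a
    paper's rank exceeds its citation count; the last rank at which
    rank <= citations is the h-index.
    """
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h

# Hypothetical citation counts for one researcher, for illustration only.
print(h_index([48, 33, 30, 22, 20, 20, 19, 8, 5, 1]))  # -> 8: eight papers cited at least 8 times each
```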
One could say that the h-index measures an individual's scholarly influence, while the impact factor measures a journal's. Later came assessment based on social platforms: F1000Prime in 2002, Mendeley in 2008 and Altmetric.com in 2011 (Altmetric is supported by Macmillan Science and Education, the parent company of the Nature group). Bibliometricians, social scientists and research administrators have taken note of how widely quantitative indicators of research are now being misused.
For example, university rankings such as the Shanghai Ranking and the Times Higher Education ranking rest on inaccurate data and misused indicators. Some recruiters ask applicants for their h-index; many universities set thresholds for the h-index and for the number of high-impact-journal papers when filling positions. Many scientists likewise display their h-index and their count of high-impact-journal papers prominently on their CVs, a practice most prevalent in biomedicine.
Supervisors push their PhD students to publish in high-impact journals, because that is the best calling card for making their way in the academic world later on. Some universities in Scandinavia and China allocate research funding or bonuses on the basis of a single number: the impact factor of the journal a paper appears in. In many settings peer review still plays an important role, but the abuse of quantitative indicators has become pervasive and pernicious.
The article closes with the Leiden Manifesto's ten principles for research metrics; to guard against misuse, the original text is reproduced below: We therefore present the Leiden Manifesto, named after the conference at which it crystallized (see http://sti2014.cwts.nl). Its ten principles are not news to scientometricians, although none of us would be able to recite them in their entirety because codification has been lacking until now. Luminaries in the field, such as Eugene Garfield (founder of the ISI), are on record stating some of these principles 3, 4. But they are not in the room when evaluators report back to university administrators who are not expert in the relevant methodology. Scientists searching for literature with which to contest an evaluation find the material scattered in what are, to them, obscure journals to which they lack access.
Ten principles
1) Quantitative evaluation should support qualitative, expert assessment. Quantitative metrics can challenge bias tendencies in peer review and facilitate deliberation. This should strengthen peer review, because making judgements about colleagues is difficult without a range of relevant information. However, assessors must not be tempted to cede decision-making to the numbers. Indicators must not substitute for informed judgement. Everyone retains responsibility for their assessments.
2) Measure performance against the research missions of the institution, group or researcher. Programme goals should be stated at the start, and the indicators used to evaluate performance should relate clearly to those goals. The choice of indicators, and the ways in which they are used, should take into account the wider socio-economic and cultural contexts. Scientists have diverse research missions. Research that advances the frontiers of academic knowledge differs from research that is focused on delivering solutions to societal problems. Review may be based on merits relevant to policy, industry or the public rather than on academic ideas of excellence. No single evaluation model applies to all contexts.
3) Protect excellence in locally relevant research. In many parts of the world, research excellence is equated with English-language publication. Spanish law, for example, states the desirability of Spanish scholars publishing in high-impact journals. The impact factor is calculated for journals indexed in the US-based and still mostly English-language Web of Science. These biases are particularly problematic in the social sciences and humanities, in which research is more regionally and nationally engaged. Many other fields have a national or regional dimension — for instance, HIV epidemiology in sub-Saharan Africa.
This pluralism and societal relevance tends to be suppressed to create papers of interest to the gatekeepers of high impact: English-language journals. The Spanish sociologists that are highly cited in the Web of Science have worked on abstract models or study US data. Lost is the specificity of sociologists in high-impact Spanish-language papers: topics such as local labour law, family health care for the elderly or immigrant employment 5. Metrics built on high-quality non-English literature would serve to identify and reward excellence in locally relevant research.
4) Keep data collection and analytical processes open, transparent and simple. The construction of the databases required for evaluation should follow clearly stated rules, set before the research has been completed. This was common practice among the academic and commercial groups that built bibliometric evaluation methodology over several decades. Those groups referenced protocols published in the peer-reviewed literature. This transparency enabled scrutiny. For example, in 2010, public debate on the technical properties of an important indicator used by one of our groups (the Centre for Science and Technology Studies at Leiden University in the Netherlands) led to a revision in the calculation of this indicator 6. Recent commercial entrants should be held to the same standards; no one should accept a black-box evaluation machine.
Simplicity is a virtue in an indicator because it enhances transparency. But simplistic metrics can distort the record (see principle 7). Evaluators must strive for balance — simple indicators true to the complexity of the research process.
5) Allow those evaluated to verify data and analysis. To ensure data quality, all researchers included in bibliometric studies should be able to check that their outputs have been correctly identified. Everyone directing and managing evaluation processes should assure data accuracy, through self-verification or third-party audit. Universities could implement this in their research information systems and it should be a guiding principle in the selection of providers of these systems. Accurate, high-quality data take time and money to collate and process. Budget for it.
6) Account for variation by field in publication and citation practices. Best practice is to select a suite of possible indicators and allow fields to choose among them. A few years ago, a European group of historians received a relatively low rating in a national peer-review assessment because they wrote books rather than articles in journals indexed by the Web of Science. The historians had the misfortune to be part of a psychology department. Historians and social scientists require books and national-language literature to be included in their publication counts; computer scientists require conference papers be counted.
Citation rates vary by field: top-ranked journals in mathematics have impact factors of around 3; top-ranked journals in cell biology have impact factors of about 30. Normalized indicators are required, and the most robust normalization method is based on percentiles: each paper is weighted on the basis of the percentile to which it belongs in the citation distribution of its field (the top 1%, 10% or 20%, for example). A single highly cited publication slightly improves the position of a university in a ranking that is based on percentile indicators, but may propel the university from the middle to the top of a ranking built on citation averages 7.
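As an illustration of the percentile idea only (the manifesto states the principle, not an implementation), the following Python sketch assigns a paper to a percentile class within an assumed field-wide citation distribution; the function name, the class thresholds and the data are all hypothetical.

```python
def percentile_class(citations, field_distribution, thresholds=(0.01, 0.10, 0.20)):
    """Return the highest percentile class (e.g. 'top 1%') that a paper's
    citation count reaches within its field's citation distribution."""
    ranked = sorted(field_distribution, reverse=True)
    n = len(ranked)
    for frac in thresholds:
        # Citation count needed to sit within the top `frac` of the field.
        cutoff_rank = max(1, int(n * frac))
        cutoff = ranked[cutoff_rank - 1]
        if citations >= cutoff:
            return f"top {int(frac * 100)}%"
    return "below top 20%"

# Invented field distribution; real distributions differ widely between,
# say, cell biology and mathematics, which is the point of normalizing.
field = [120, 95, 60, 40, 30, 25, 20, 15, 12, 10, 8, 6, 5, 4, 3, 2, 2, 1, 1, 0]
print(percentile_class(45, field))  # -> 'top 20%' under these made-up numbers
```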
7) Base assessment of individual researchers on a qualitative judgement of their portfolio. The older you are, the higher your h-index, even in the absence of new papers. The h-index varies by field: life scientists top out at 200; physicists at 100 and social scientists at 20–30 (ref. 8). It is database dependent: there are researchers in computer science who have an h-index of around 10 in the Web of Science but of 20–30 in Google Scholar 9. Reading and judging a researcher's work is much more appropriate than relying on one number. Even when comparing large numbers of researchers, an approach that considers more information about an individual's expertise, experience, activities and influence is best.
8) Avoid misplaced concreteness and false precision. Science and technology indicators are prone to conceptual ambiguity and uncertainty and require strong assumptions that are not universally accepted. The meaning of citation counts, for example, has long been debated. Thus, best practice uses multiple indicators to provide a more robust and pluralistic picture. If uncertainty and error can be quantified, for instance using error bars, this information should accompany published indicator values. If this is not possible, indicator producers should at least avoid false precision. For example, the journal impact factor is published to three decimal places to avoid ties. However, given the conceptual ambiguity and random variability of citation counts, it makes no sense to distinguish between journals on the basis of very small impact factor differences. Avoid false precision: only one decimal is warranted.
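One hedged way to make such uncertainty visible, offered here only as an illustration of the error-bar suggestion above, is to bootstrap a confidence interval around a journal-style mean citation rate; the citation counts below are invented.

```python
import random

def bootstrap_mean_ci(values, n_resamples=10000, alpha=0.05, seed=0):
    """Bootstrap a (1 - alpha) confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_resamples):
        sample = [rng.choice(values) for _ in values]  # resample with replacement
        means.append(sum(sample) / len(sample))
    means.sort()
    lo = means[int(n_resamples * (alpha / 2))]
    hi = means[int(n_resamples * (1 - alpha / 2)) - 1]
    return sum(values) / len(values), (lo, hi)

# Invented per-article citation counts for a small hypothetical journal.
citations = [0, 0, 1, 1, 2, 2, 3, 4, 5, 6, 8, 12, 15, 30, 75]
mean, (lo, hi) = bootstrap_mean_ci(citations)
print(f"mean citations per paper: {mean:.1f} (95% CI {lo:.1f}-{hi:.1f})")
# The interval is far wider than 0.001, so reporting three decimals is false precision.
```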
9) Recognize the systemic effects of assessment and indicators. Indicators change the system through the incentives they establish. These effects should be anticipated. This means that a suite of indicators is always preferable — a single one will invite gaming and goal displacement (in which the measurement becomes the goal). For example, in the 1990s, Australia funded university research using a formula based largely on the number of papers published by an institute. Universities could calculate the 'value' of a paper in a refereed journal; in 2000, it was Aus$800 (around US$480 in 2000) in research funding. Predictably, the number of papers published by Australian researchers went up, but they were in less-cited journals, suggesting that article quality fell 10.
10) Scrutinize indicators regularly and update them. Research missions and the goals of assessment shift and the research system itself co-evolves. Once-useful metrics become inadequate; new ones emerge. Indicator systems have to be reviewed and perhaps modified. Realizing the effects of its simplistic formula, Australia in 2010 introduced its more complex Excellence in Research for Australia initiative, which emphasizes quality.