The percentiles are updated annually using ORCID_Public_Data_File_YYYY snapshots.
Some statistics are given below:
- The number of ORCID users has doubled in the last two years (1.23 million users have at least one work in PUBMED)
Is this good or not? To answer this question, we need to compare those scores to some data set. In fCite, the ORCID data are used for this purpose.
As mentioned in the FAQ, the problem of identifying the author is not trivial, and it is actually very hard to automate if we want to be very accurate (e.g., people change surnames, sometimes use initials only, or use special characters not present in the English alphabet). The data from the ORCID database make the problem easier to solve, but some care is still needed (we have already seen that people have begun to game the ORCID system by adding publications not belonging to them, e.g., John Smith's records "enriched" by Jane Smith's records abbreviated by the initial only). Nevertheless, parsing out the author from the publication author list is a doable task because ORCID provides the name and surname of the user. In the simplest scenario, having "name+surname" allows us to calculate the so-called Levenshtein distance between two strings (the minimum number of single-character edits, i.e., insertions, deletions or substitutions, required to change one string into the other). This allows us to identify the potential position of a given author on the authorship list (the algorithm, of course, is not perfect, and when there is some ambiguity, e.g., two authors with perfect matches are identified, for instance two John Smiths in the list, the record is not taken into consideration).
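The matching step described above can be illustrated with a minimal sketch. It assumes a plain dynamic-programming Levenshtein distance and case-insensitive comparison; the function names and the `max_distance` parameter are hypothetical and are not taken from the fCite code.

```python
from typing import List, Optional


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or
    substitutions needed to turn string a into string b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


def author_position(orcid_name: str, authors: List[str],
                    max_distance: int = 0) -> Optional[int]:
    """Return the 0-based position of the best-matching author.

    Returns None when nobody is within max_distance (hypothetical
    parameter) or when the best match is ambiguous, e.g. two
    identical "John Smith" entries in the same author list.
    """
    if not authors:
        return None
    target = orcid_name.casefold()
    distances = [levenshtein(target, author.casefold()) for author in authors]
    best = min(distances)
    if best > max_distance or distances.count(best) > 1:
        return None
    return distances.index(best)
```

With the default `max_distance=0`, only a unique perfect match is accepted, which mirrors the rule that ambiguous records are dropped.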
If none of the authors match the ORCID names and surnames, then a slightly more complicated procedure is used:
- a set of possible names and surnames is generated from ORCID's 'name'+'surname', i.e., [name+' '+ surname, surname+' '+name, name'+' '+surname, surname+' '+name, name'+' X '+surname, surname+' X'+name] (remark: all names and surnames are treated as case-insensitive because people frequently mix/overuse capital letters)
- for a given set, a Jaro–Winkler distance is calculated for all authors and then normalized, and the author with the highest score (which must be above 0.65) is chosen
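The fallback procedure can be sketched as follows. This is an illustration under stated assumptions, not the fCite implementation: the score is the standard Jaro–Winkler similarity (prefix scale 0.1, prefix length capped at 4), the variant set is a simplified guess (full name and first-initial forms in both orders), and the normalization step mentioned above is omitted because its details are not given. All function names are hypothetical.

```python
from typing import List, Optional


def jaro(s1: str, s2: str) -> float:
    """Standard Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    used1 = [False] * len(s1)
    used2 = [False] * len(s2)
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not used2[j] and s2[j] == ch:
                used1[i] = used2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    seq1 = [c for c, used in zip(s1, used1) if used]
    seq2 = [c for c, used in zip(s2, used2) if used]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len(s1) + matches / len(s2)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, scale: float = 0.1) -> float:
    """Jaro similarity boosted for a shared prefix of up to 4 characters."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * scale * (1.0 - base)


def name_variants(name: str, surname: str) -> List[str]:
    """Assumed variant set: full and initial-only forms, both orders."""
    name, surname = name.casefold(), surname.casefold()
    initial = name[:1]
    return [f"{name} {surname}", f"{surname} {name}",
            f"{initial} {surname}", f"{surname} {initial}"]


def best_author(name: str, surname: str, authors: List[str],
                threshold: float = 0.65) -> Optional[int]:
    """0-based index of the highest-scoring author above the threshold."""
    variants = name_variants(name, surname)
    scores = [max(jaro_winkler(v, author.casefold()) for v in variants)
              for author in authors]
    if not scores:
        return None
    best = max(range(len(scores)), key=scores.__getitem__)
    return best if scores[best] > threshold else None
```

Taking the maximum over all variants lets an entry such as "Smith J" match an ORCID record for John Smith even though the full-name comparison alone would score lower.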
As a result, the portfolios used to calculate the fractional versions of metrics are usually shorter than the original portfolios, because items for which the author position could not be identified with high probability are left out. Consequently, the percentiles presented in fCite for fractional metrics can be considered overly optimistic. For instance, the 95th percentile could actually be the 90th percentile. However, better data do not currently exist, and those estimates will be more accurate than comparing such scores by eye. Moreover, the percentile thresholds are updated yearly when new ORCID data appear.
Below, you can find detailed data for given percentiles depending on the score and the type. Two conditions apply: the score must be > 0 (this is an important remark because ~1/4 of the records have RCR, citations, etc. equal to 0, which means that they did not show, or have not yet had time to show, any importance), and the ORCID portfolio must contain at least one item assigned to the author with a >0.65 Jaro–Winkler distance.
Returning to John Smith, 100 citations and an RCR of 5 give him:
Research only 50.2 63.8
All 48.5 62.7
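A percentile lookup of this kind can be sketched as below. The exact percentile definition used by fCite is not stated here, so this assumes the common "share of reference scores less than or equal to the given score" definition, restricted to scores above zero as required above. The function name and the sample data are hypothetical.

```python
from bisect import bisect_right
from typing import Iterable


def percentile_rank(score: float, reference_scores: Iterable[float]) -> float:
    """Percentage of positive reference scores that are <= score."""
    positive = sorted(s for s in reference_scores if s > 0)  # drop zero scores
    if not positive:
        raise ValueError("no positive reference scores to compare against")
    return 100.0 * bisect_right(positive, score) / len(positive)


# Hypothetical reference set: two zero scores are ignored.
reference = [0, 0, 1, 2, 5, 8, 10]
rank = percentile_rank(5, reference)  # 3 of the 5 positive scores are <= 5
```

Excluding zeros before ranking matters: with roughly a quarter of records at zero, leaving them in would shift every positive score to a noticeably higher percentile.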
Now, let us divide the publications in the ORCID portfolio into four categories:
| | Single | First | Middle | Last |
|---|---|---|---|---|
| Count | 128455 ± 1207 | 1427848 ± 3213 | 3964752 ± 11425 | 1487607 ± 6908 |
| % | 1.833 ± 0.016 | 20.373 ± 0.044 | 56.569 ± 0.059 | 21.225 ± 0.059 |
Authorship patterns for 572,910 ORCID users having at least one publication above the cut-off (7,008,012 unique publications in total).
The mean and standard deviation were calculated by bootstrapping the data 1000 times, and the Jaro–Winkler distance >0.65 was used as a cut-off.
- In most research publications, the researcher is a middle author (57% of cases)
- Roughly every fifth publication is a first author contribution, and another fifth is a last author contribution (~20% and ~21%, respectively)
- Single author research publications are extremely rare (fewer than 2% of cases)
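The bootstrap estimate mentioned in the table caption can be sketched as follows. This is a minimal illustration that assumes publications (rather than users) are resampled with replacement; the source does not say which unit was resampled. The function name, the seed, and the toy data are hypothetical.

```python
import random
from statistics import mean, stdev
from typing import Sequence, Tuple


def bootstrap_share(labels: Sequence[str], category: str,
                    n_resamples: int = 1000, seed: int = 0) -> Tuple[float, float]:
    """Bootstrap mean and standard deviation of the percentage of
    items whose authorship label equals `category`."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(labels)
    shares = []
    for _ in range(n_resamples):
        sample = rng.choices(labels, k=n)  # draw n items with replacement
        shares.append(100.0 * sample.count(category) / n)
    return mean(shares), stdev(shares)


# Toy portfolio data loosely following the proportions in the table.
toy_labels = ["middle"] * 57 + ["first"] * 20 + ["last"] * 21 + ["single"] * 2
middle_mean, middle_sd = bootstrap_share(toy_labels, "middle")
```

The standard deviation shrinks as the number of resampled items grows, which is why the spreads in the table, computed on millions of publications, are so small.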
The above table was calculated cumulatively for all publications, but to analyse individual researchers, we should bin the data per portfolio. To do so, we analysed 2471 ORCID users having 30 papers (otherwise small portfolios having just a few items would introduce considerable noise into the model).
| | Single | First | Middle | Last |
|---|---|---|---|---|
| % | 1.54 ± 5.03 | 21.07 ± 15.49 | 58.00 ± 20.70 | 19.40 ± 18.57 |
- The means are very similar in both tables
- There is substantial variance in what can be considered normal (on the other hand, this type of information can be used to judge statistically whether a given portfolio is enriched in or depleted of given types of publications, e.g., a portfolio having 40% of first author papers is quite unexpected)
Now let us check how this changes with the number of items in the portfolio. It is expected that smaller portfolios (most likely belonging to younger researchers at the beginning of their careers, e.g., PhD students and postdocs) will have more first author publications than those of people with dozens of publications (principal investigators). The raw data for the plots below are available here.
Research only articles
- As hypothesized above, we clearly see that the more publications there are in the portfolio, the fewer first author publications there are in the portfolio (and the opposite is true for the last author publications). This simply reflects that at some point in time, a successful scientist becomes the head of the lab/group leader/principal investigator, and then she/he ends up in the last position on the author list as a corresponding author. *
- The percentage of single and middle author publications is fairly stable across the lifetime of the researcher (~2% and ~60%).
- The standard deviation for first, last and middle author publications can be as high as ~20% which means that there is considerable variance, but still as mentioned above, a portfolio having 40% of first author papers is quite unexpected.
- The "tipping point" at which the proportions of first and last author publications become similar lies at ~27-31 papers. If we consider that last author papers are a sign of creating a new laboratory or beginning to become an independent researcher, on average ~30 item portfolios already have six first and six last author publications.
- Single author papers are very rare (additionally, it is more likely that single author papers will be non-research articles, for instance, an editorial, comment, or review, rather than a research paper).
- On average, there are more middle author papers in the research only fraction, which again is expected because research papers have more authors on average; (un)surprisingly, it seems that writing non-research items requires fewer authors.
* This trend is expected and actually confirms two things: a) most of the life science (PubMed) publications use some kind of FLAE model; b) the Jaro–Winkler distance threshold we used to surmise the authorship position is reasonable. If conditions a) and b) were not both met, we would not be able to observe such a swap between first and last author position vs the portfolio size.
Number of research vs. non-research publications in PUBMED since 1995
Research vs. non-research items in the period 1995-2018 (based on 17,787,016 PMIDs) (the raw format)
- The number of publications in PUBMED grows every year (it has doubled in the last 25 years)
- Most of the publications are "research" items, and they constitute ~80% of portfolios
Number of authors vs. research_non-research items
It is also very interesting to investigate the number of authors for individual papers. This may differ across fields of science (e.g., in mathematics, publications usually tend to have fewer authors than in medicine); nevertheless, it is crucial to analyse such patterns. We can try to answer a number of questions. For instance, how many authors does the average paper have? What fraction of papers have one, two, or three authors? Is there any relationship between the number of authors and the type of publication (research vs. non-research)?
Average number of authors (whole PUBMED, 17 M items, raw format)
- The average number of authors increased over time from 3.5 in 1995 to 5.9 in 2018
- The average number of authors for research papers is consistently larger than for non-research papers (by approx. two authors)
- The average non-research paper in the 1990s had 2 authors, while now it has 4 authors per paper
- The average research paper in the 1990s had 4 authors, while now it has over 6 authors per paper
Number of authors vs research_non-research items (whole PUBMED, 17 M items, 1995-2018 raw format)
- Most research papers have fewer than 10 authors (usually 3-5 authors), with a long tail of papers authored by >15 people
- Over half (51.4%) of single author papers are non-research items
- The fewer authors a paper has, the more likely it is to be a non-research paper
The last observation is quite unexpected; thus, it is worth checking this relationship more closely. From one of the previous plots, we learned that the average fraction of non-research papers is ~20% (18.9% over all years, to be exact). Let us normalize the data based on the number of authors.
The fraction of non-research papers vs. number of authors (the raw format)
- Most non-research papers (editorials, reviews, commentaries, etc.) are written by 1-4 people
- A single author paper is approximately three times more likely to be a non-research work than the average (51.4% vs 18.9%)
- A ten-author paper is more than four times less likely to be a non-research item than the average (4.4% vs 18.9%)
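The ratios quoted in the bullets above follow directly from the non-research fractions (51.4% for single author papers, 4.4% for ten-author papers, and the 18.9% overall baseline). The short sketch below only redoes that arithmetic; the variable names are illustrative.

```python
# Non-research fractions (in %) taken from the text above.
BASELINE_NON_RESEARCH = 18.9
SINGLE_AUTHOR_NON_RESEARCH = 51.4
TEN_AUTHOR_NON_RESEARCH = 4.4


def ratio(a: float, b: float) -> float:
    """How many times larger a is than b."""
    return a / b


# Single author papers: enrichment in non-research items.
single_enrichment = round(ratio(SINGLE_AUTHOR_NON_RESEARCH, BASELINE_NON_RESEARCH), 2)

# Ten-author papers: depletion in non-research items.
ten_author_depletion = round(ratio(BASELINE_NON_RESEARCH, TEN_AUTHOR_NON_RESEARCH), 2)

# The same ten-author papers expressed as research items
# (95.6% vs 81.1%), which gives a much smaller ratio.
ten_author_research_gain = round(
    ratio(100 - TEN_AUTHOR_NON_RESEARCH, 100 - BASELINE_NON_RESEARCH), 2)

print(single_enrichment)         # 2.72
print(ten_author_depletion)      # 4.3
print(ten_author_research_gain)  # 1.18
```

The last figure shows why the four-fold difference is best read as a drop in the non-research share: measured on the research share, the same papers differ from the average by a factor of only about 1.18.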
Let us now check the trends over time:
Number of authors vs research_non-research items over time
- Over time, the single/fewer author papers decline relative to multiple author papers
- In 1995, the mode was a single author publication; in 2004, it was a three author publication; in 2007, it was a four author publication
- While papers with more than 15 authors were almost unheard of in 1995, they have become increasingly common, and in 2018 they represented 2.7% of all items (for >10 author papers, the statistics are 0.3% in 1995 vs. 9.8% in 2018)
Note that the fraction of single author publications in PUBMED overall differs from that in ORCID portfolios. This can be explained by the fact that for some people or in some fields (e.g., mathematics), it is common to publish alone; this changes the pattern when you compare statistics between the portfolio and the global value, but the trend is the same: over time, a single item has more and more authors, and fewer and fewer publications are single authored.
Fractional metrics (FLAE, FLAE2, FLAE3, EC)
The weights for the first, the middle and the last author up to ten authors for the FLAE, FLAE2, FLAE3, EC models.
Fractional metrics vs total metrics (RCR or Citations) with respect to portfolio size
- The FLAE model assigns the greatest importance to the first author and then to the last author
- The EC model penalises the last and first authors
- The FLAE3 model has weightings between those of the FLAE and FLAE2 models
The correlations of fractional models
- All models are highly correlated and produce similar results
- There is an almost linear correlation between portfolio size and the scores
- On average, every 10 papers contribute an RCR of ~2.5
- On average, every 10 papers receive ~35-40 citations
The lower triangular portion of the matrices (green) corresponds to the ORCID portfolios with 2-50 items (394,189 portfolios), and the upper triangular portion of the matrices corresponds to all ORCID portfolios with at least a single item (600,755 portfolios)
- All models are correlated
- The fractional models are extremely positively correlated with each other (>0.9)
- The fractional models are moderately positively correlated with global metrics (RCR, citations) at ~0.6-0.8
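Each cell of such a matrix is a correlation coefficient between two score columns computed over the same set of portfolios. The sketch below assumes the Pearson coefficient, which is an assumption since the text does not name the coefficient used; the function name and the sample values are hypothetical.

```python
from math import sqrt
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return covariance / sqrt(var_x * var_y)


# Hypothetical per-portfolio scores: total RCR and one fractional variant.
total_rcr = [2.0, 5.0, 9.0, 14.0, 20.0]
fractional_rcr = [0.6, 1.1, 2.4, 2.9, 5.2]
r = pearson(total_rcr, fractional_rcr)
```

Computing the coefficient separately for the two portfolio subsets (2-50 items and all portfolios) would give the two triangles of one matrix.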
QUESTION: Since the models are so well correlated, why should we bother considering more than one in the first place? Can we not just convert one model into another?
ANSWER: No, you cannot, especially if you are in the business of looking for outliers. When you analyse one particular portfolio, you can use the averages only as a baseline (no matter how good the correlations are, unless the correlation is 1.0). What you are actually looking for is what is odd (e.g., ultra high FLAERCR in comparison to the portfolio size, the difference between FLAERCR and ECRCR to highlight the importance of first author papers, etc.).
Look at the spreads:
- The spread among portfolios increases as portfolios become larger
- Regardless of size, the spread is always significant
- The averages are placed closer to the upper bounds, as the lower bounds are limited by zero
Now, let us examine the ratio of averages:
- The ratio of any fractional model and the total score is larger for small portfolios (more first author papers)
- For portfolios with up to 10 items it is approximately 20-25%
- At approximately 40-60 items, all fractional models start to produce virtually identical results
- The ratio for RCR scores is usually lower than for citations
Next, let us take a quick look at the spreads:
- There is a massive spread among portfolios
- As expected, the more items there are, the smaller the spread
- Regardless of size, the spread is always at least 20%
- On average, the ratio should be approximately 18-20%, which is consistent with the average number of authors on the average paper
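The link between the 18-20% ratio and the average number of authors can be checked with a small sketch. It assumes the simplest case, in which credit for a paper is split equally among its n authors so that each author receives 1/n. The function names and the sample author counts are hypothetical.

```python
from typing import Sequence


def equal_share_percent(n_authors: float) -> float:
    """Share of one paper (in %) per author under an equal split."""
    return 100.0 / n_authors


def mean_equal_share_percent(author_counts: Sequence[int]) -> float:
    """Average per-paper share (in %) across a set of papers."""
    return sum(equal_share_percent(n) for n in author_counts) / len(author_counts)


# A ratio of 18-20% corresponds to papers with about 5 to 5.5 authors.
share_for_5 = equal_share_percent(5)      # 20.0
share_for_5_5 = equal_share_percent(5.5)  # about 18.2

# Hypothetical portfolio: the average of 1/n is not the same as
# 1/(average n), so a few short author lists raise the ratio noticeably.
counts = [1, 5, 9]
average_share = mean_equal_share_percent(counts)
share_at_mean_count = equal_share_percent(sum(counts) / len(counts))
```

In this toy portfolio, the average share is well above the share computed from the mean author count. This is one way small portfolios containing a few short author lists can show higher ratios than the overall 18-20%.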