Thursday, December 08, 2016

ArXiv.org to be modernized for $445,000

Live, hot, via Bill Zajc: at 5 pm CERN winter time – 11 am Boston winter time – Sam Ting will report "unexpected results" from the five years of the Alpha Magnetic Spectrometer. See an abstract of the CERN colloquium, video page, a separate "camera" video window, previous AMS blog posts. I am a bit excited but be careful: Ting expected to make a big discovery so "unexpected results" may mean that he didn't discover anything interesting. ;-) Ting speaks in Chinglish which I understand well, probably because it's a dialect of Czenglish.
At the top of arXiv.org, the main scientific e-archive of preprints serving primarily high energy physics and other fields, you may read some occasional news.

For example, starting in 2017, the daily deadline for submissions will change to 2 pm, local New York time (either standard or daylight-saving, whichever is valid at the moment). So when you're competing for the first or last position among the papers in the listings, don't forget about this change.

Also, the Alfred P. Sloan Foundation is paying some $445,000 for software work that should modernize arXiv.org over the following three years. After the upgrade, arXiv should become arXiv-NG – note that ST-NG stands for "Star Trek – Next Generation".




In the early 1990s, Paul Ginsparg's arXiv was a useful technological advance. In the 1980s, secretaries would often type papers (drafts) for the physicists, using old typewriters etc. Suddenly, \(\rm\TeX\) appeared and physicists could do it themselves, with prettier results. The paper journals were still important around 1990. But physicists still wanted to see their colleagues' papers before they were sent to journals. So they were sending papers to each other by e-mail. Cliques and mailing lists of such physicists developed.

See a New York Times article about some background on the arXiv. The article started with two rather fundamental words LOL.




This process was automated and collectivized by Paul Ginsparg who created a central machinery to collect and resend the papers in \(\rm\TeX\). It was done using the state-of-the-art sharing technologies of the time. So at first there was just an FTP server. Then it became accessible via Gopher. And people could receive the daily list of abstracts via e-mail mailing lists. I don't know how many people remember the pre-web Internet of the early 1990s but I surely do. I was accessing Ginsparg's archive by FTP clients (haven't used those for many years) as well as Gopher. And when the web appeared, I was still using Lynx – a text-based web browser – rather frequently.

At any rate, things switched to the web with graphics-enhanced browsers in the mid 1990s. I think that if a younger semi-informed person is asked about the oldest browsers, he could say "Mozilla" or "Firefox". Well, that's close but not quite. Both of these names are rather recent. The oldest modern web browsers were related but called NCSA Mosaic and Netscape Navigator. Netscape Navigator is basically a direct predecessor of Mozilla Firefox – it's funny how the names were changing so that you can't see the relationship anymore.

The arXiv.org (previously named xxx.lanl.gov by Ginsparg because he wanted the airport security to think that physicists watch porn all the time, even at the airports; but he claims that "xxx" was just an innocent parody of "www") was serving well and is still serving well. But you can't overlook that it hasn't been modernized for 25 years – if you overlook some minor changes.

The grant from the Alfred P. Sloan Foundation should modernize the "search function", we're told. I am afraid that I would argue that $445,000 for a new search engine for arXiv.org is a theft. A much more ambitious modernization should be planned, realized, and paid for.

After all, the arXiv's own "search function" is so bad that I am – and many people certainly are – using very different ways to search for the papers. For a full text search, it's probably optimal to use Google. You may restrict Google searches to arXiv.org. For example, this is the query that searches for "screwing string theory", the unscrewed name of what is known as "matrix string theory". Google, the search engine, is placing the relevant hits at the top and the calculation of the relevance is really the "top advance" that helped to turn Google into one of the top 3 companies in the world by market capitalization.
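Restricting a Google search to arXiv.org boils down to adding Google's `site:` operator to the query. Here is a minimal sketch of how such a search URL could be built. The function name `arxiv_google_url` and the choice to wrap the phrase in quotes for an exact match are my assumptions for illustration, not necessarily the form of the query linked above.

```python
from urllib.parse import urlencode


def arxiv_google_url(phrase: str, exact: bool = True) -> str:
    """Build a Google search URL restricted to arXiv.org (hypothetical helper)."""
    # Quoting the phrase asks Google for an exact match; this is an assumption.
    terms = f'"{phrase}"' if exact else phrase
    # The site: operator limits the hits to pages hosted on arxiv.org.
    query = f"site:arxiv.org {terms}"
    # urlencode percent-encodes the colon and quotes and turns spaces into '+'.
    return "https://www.google.com/search?" + urlencode({"q": query})


if __name__ == "__main__":
    print(arxiv_google_url("screwing string theory"))
```

Pasting the printed URL into a browser should run the restricted search. Dropping the quotes (`exact=False`) lets Google match the words separately, which usually returns more but less focused hits.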

Sometimes, you want to search for science papers, whether they are on the arXiv or not. I think that Google Scholar has become optimal for all these things. Try screwing string theory from that perspective. It naturally lists the papers by relevance – which is highly correlated with the citation count. Cool, for the first time, I noticed a paper that has actually used my term "screwing procedure". Or did I forget? At any rate, good job, Okuyama and Sugawara. ;-)

And then there's INSPIREhep.net, previously known as SPIRES. I think that it doesn't search through the body of papers. Update: Oops, fulltext search is possible in INSPIRE, an example, thanks to INSPIRE twitter and mmanu_F for the fix.

I find it obvious that the search functions should be rather Google-like. But I do think that a modernization of the archive system may want to go well beyond a better search function. For example, a personalization system that knows how to find the relevant papers for a particular user should be developed, along with some undistracting methods to communicate with the authors (Facebook-like or less social methods) – perhaps with tools to filter people who are not eligible, methods to peacefully block others from communication, ratings, expert-based recommended hyperlinks, conversion of all \(\rm \TeX\) papers to HTML with MathJax, and many other things.

Skillful programmers who are up-to-date could do lots of truly wonderful things.

One must realize that the number of papers hosted by the arXiv is just slightly over one million. If the average paper had a megabyte (I don't know the value exactly), the whole arXiv would take about a terabyte and you could still copy it to your laptop's hard disk. Well, Joanna Karczmarek was once distributing a 10-GB file with all the HEP preprints and that was a decade ago plus something. There's a lot of room for "redundant" things surrounding each paper. The capacity of disks and RAM chips has grown much more quickly than the scientists' ability to produce papers, of course.
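The back-of-the-envelope estimate above can be checked in a few lines. This is only a sketch: the paper count is rounded to one million, the one-megabyte average is the guess from the text rather than a measured value, and the 2 TB laptop disk is my own assumption for the comparison.

```python
PAPERS = 1_000_000       # "slightly over one million", rounded down
AVG_PAPER_MB = 1.0       # guessed average size per paper, not a measured value
LAPTOP_DISK_TB = 2.0     # assumed laptop disk capacity, for illustration only


def archive_size_tb(papers: int, avg_mb: float) -> float:
    """Total size in decimal terabytes (1 TB = 1,000,000 MB)."""
    return papers * avg_mb / 1_000_000


total_tb = archive_size_tb(PAPERS, AVG_PAPER_MB)
fits = total_tb <= LAPTOP_DISK_TB

if __name__ == "__main__":
    print(f"Estimated size: {total_tb:.1f} TB, fits on the laptop: {fits}")
```

Even if the true average were a few times larger than a megabyte, the total would stay within reach of a single cheap external disk.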

Maybe some skillful students should play with their mini-model of the arXiv. And if the upgrade to arXiv-NG turns out to be disappointing, they should propose a viable alternative to the community. But don't get excessively excited. Various modified arXivs – e.g. those with an extra discussion thread under each abstract – have been unsuccessful.
