2017-04-26: Discovering Scholars Everywhere They Tread

Though scholars write articles and papers, they also post a lot of content on the web. Datasets, blog posts (like this one), presentations, and more are posted by scholars as part of scholarly communications. What if we could aggregate the content by scholar, instead of by web site?

Why would we want to do this? We can create stories, or collections of a scholar's work in an interface, much like Storify. We can also index this information and create a search engine that allows a user to search by scholar and find all of their work, not just their published papers, as is offered by Scopus or Web of Science, but their web-based content as well. Finally we can archive their work before the ocean of link rot washes it away.

To accomplish our goal, two main questions must be answered: (1) For a given scholar, how do we create a global scholar profile describing the scholar and constructed from multiple sources? (2) How do we locate the scholar's work on the web and use this global scholarly profile to confirm that we have found their work?

Throughout this post I attempt to determine what resources could be used by a hypothetical automated system to build our global scholar profile and then use it to discover user information on scholarly portals. I also review some scholarly portals to determine what resources they provide that can be used with the global scholar profile. Note: our hypothetical system is currently just attempting to find the websites to which scholars post their content; discovering and processing the content itself is a separate issue.

Building a global scholar profile

Abdel-Hafez and Xu provide "A Survey of User Modeling in Social Media Websites". In that paper, they note that "modeling users will have different methods between different websites". They discuss the work that has been done on constructing a user model from different social media sites, using a rather broad definition of social media that includes blogs and collaborative portals like wikis. They also discuss the problems associated with building a user profile from social media, which inspires the term global scholar profile in this post.

They also provide an overview of the "cold start problem", where insufficient starting information is available to build a useful user profile. Existing solutions to the cold start problem in recommender systems, such as those by Lika, Kolomvatsos, and Hadjiefthymiades, rely on the use of demographic data to create user profiles, which will not be useful for identifying scholars. Instead, we can use some existing sources containing information about scholars.

The EgoSystem project, by Powell, Shankar, Rodriguez, and Van de Sompel, concerned itself with building a global scholarly profile from several sources of scholarly information. It accepts a scholar's name, the university where they earned their PhD, their fields of study, their current affiliation, their current title, and some keywords noting their field of work. Using this information, the system starts with a web search using the Yahoo BOSS search API (now defunct) with these input terms and the names of portals, such as LinkedIn, Twitter, and Slideshare. After visiting each page in the search results, the system awards points to a page for each string of data that matches. If the points reach a certain threshold, then the page is considered to be a good match for the scholar and additional data is then acquired via a site API -- or scraped from the web page, if necessary -- and added to the system's knowledge of the scholar for future iterations. This scoring system was inspired by Northern and Nelson's work on disambiguating university students' social media profiles. EgoSystem's data is stored in a graph database for future retrieval and updating, much like the semantic network profiles discussed by Gauch, Speretta, Chandramouli, and Micarelli.
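A minimal sketch of this kind of point-based matching may make the idea concrete. It is not EgoSystem's actual implementation: the ScholarProfile class, the weights, and the threshold are all hypothetical values chosen for illustration, and a real system would need fuzzier matching than plain substring tests.

```python
from dataclasses import dataclass, field


@dataclass
class ScholarProfile:
    """Hypothetical container for what we already know about a scholar."""
    name: str
    affiliations: list = field(default_factory=list)
    keywords: list = field(default_factory=list)


def score_page(profile, page_text, weights=None):
    """Award points for each known string that appears in the page text."""
    # Illustrative weights only; the post does not give EgoSystem's real values.
    weights = weights or {"name": 3, "affiliation": 2, "keyword": 1}
    text = page_text.lower()
    score = 0
    if profile.name.lower() in text:
        score += weights["name"]
    score += sum(weights["affiliation"] for a in profile.affiliations if a.lower() in text)
    score += sum(weights["keyword"] for k in profile.keywords if k.lower() in text)
    return score


def is_match(profile, page_text, threshold=5):
    """Treat the page as the scholar's once the score reaches the threshold."""
    return score_page(profile, page_text) >= threshold


# Made-up scholar and page text, used only to exercise the functions.
profile = ScholarProfile(
    name="Jane Q. Scholar",
    affiliations=["Example University"],
    keywords=["web archiving", "digital libraries"],
)
page = "Jane Q. Scholar, Example University. Interests: web archiving."
print(score_page(profile, page), is_match(profile, page))
```

Pages that clear the threshold would then feed new strings back into the profile, which is what lets later iterations match portals that the initial input could not.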

Kramer and Boseman created the Innovations in Scholarly Communication project. As part of that project, they developed a list of 400+ Tools and Innovations in Scholarly Communication. Many of the tools on this list are scholarly portals, places where scholars post content.

Our hypothetical system must first build a global scholar profile that can be tested against content from various scholarly portals. To do so, our automated system needs data about a scholar. Many services exist which index and analyze scholars' published works from journals and conference proceedings. All of this can provide information to be used for disambiguation.

If we have access to all of this information, then we should be able to use EgoSystem's scoring method of disambiguation against scholarly portals. What if we do not yet have this information? Given just a name and an affiliation, from what sources can we construct a global scholar profile?

In the table below, I review several sources of information about scholars, based on their published works. For each web service, the table lists its name, the data it provides that is useful for identifying a scholar, and the access restrictions I found. I did not sign up for any authentication keys, so the fields listed as useful for scholar identification come from each service's documentation rather than from live queries. I also only included services that allow one to query by author name.

Service	Data Useful for Scholar Identification	Access Restrictions
arXiv API	Authors and Co-authors; Terms from titles; Terms from abstracts; Terms from documents; Affiliations; Keywords	None
Clarivate's Web of Science API	Authors and Co-authors; Terms from titles; Terms from abstracts; Terms from documents; Affiliations; Keywords	Institution Must Be Licensed; Additional Restrictions on Data Usage
CrossRef REST API	Authors and Co-authors; Terms from titles; Affiliations; Keywords	None
Elsevier's Scopus API	Authors and Co-authors; Terms from titles; Terms from abstracts; Terms from documents; Affiliations; Keywords	Institution Must Be Licensed; Additional Restrictions on Data Usage
Europe PMC database	Authors and Co-authors; Terms from titles; Terms from abstracts; Terms from documents; Affiliations; Keywords	None
IEEE Xplore Search Gateway	Authors and Co-authors; Terms from titles; Terms from abstracts; Affiliations; Keywords	None
Microsoft Academic Knowledge API	Authors and Co-authors; Terms from titles; Terms from abstracts; Journal/Proceedings Information; Affiliations; Keywords	Free for 10,000 queries/month, otherwise $0.25 per 1,000 calls
Nature.com OpenSearch API	Authors and Co-authors; Terms from titles; Links to landing pages	Non-Commercial Use Only; All downloaded content must be deleted within 24 hour period; Application requires a "Powered by nature.com" logo; Requires signing up for authentication key
OCLC WorldCat Identities API	Authors; Terms from titles	Non-commercial use only
ORCID API	ORCID; Other Identifiers; Authors and Co-authors; Terms from titles; Journal/Proceedings Information; Links to landing pages; Employment; Education; Links to additional websites; Keywords; Biography	None
PLOS API	Authors and Co-authors; Terms from titles; Terms from abstracts; Terms from documents; Affiliations; Keywords	Rate limited to 10 requests per minute; Data must be attributed to PLOS; Requires signing up for authentication key
Springer API Service	Terms from Titles; Journal/Proceedings Information; Keywords; Links to landing pages	Requires signing up for authentication key

Some of these services are not free. Microsoft Academic Search API, Elsevier's Scopus, and Web of Science all provide information about scholars and their works, but with limitations and often for a fee. Microsoft Academic Search API has become Microsoft Academic Knowledge API and now limits the user to 10,000 calls per month unless they pay. Scopus API is free of charge, but "full API access is only granted to clients that run within the networks of organizations" with a Scopus subscription. Clarivate's Web of Science API provides access with similar restrictions, "using your institution's subscription entitlements".

There are also restrictions on how a system is permitted to use the data from Web of Science, including which fields can be displayed to the public. Scopus has similar restrictions on text and data mining, which may affect our system's ability to use these sources at all. Furthermore, the Nature.com OpenSearch API requires that any data acquired is refreshed or deleted within a 24-hour period, also making it unlikely to be useful to our system because the data cannot be retained.

Some organizations, such as PubMed Central, offer an OAI-PMH interface that can be used to harvest metadata. Our system can harvest this metadata and provide its own searches. Similarly, other organizations, such as the Hathi Trust Digital Library, offer downloadable datasets of their holdings. Data from API queries is more desirable because it will be more current than data obtained via datasets.
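As a rough sketch of what harvesting and then searching locally could look like, the code below builds an OAI-PMH ListRecords request and filters a harvested response by creator name. The base URL and the sample response are invented for illustration; only the ListRecords verb, the oai_dc metadata prefix, and the XML namespaces come from the OAI-PMH and Dublin Core standards. A real harvester would also need to follow resumption tokens, which this sketch omits.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build an OAI-PMH ListRecords request URL."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)


def records_by_creator(xml_text, name):
    """Return titles of harvested records whose dc:creator contains the name."""
    root = ET.fromstring(xml_text)
    titles = []
    for record in root.iter("{%s}record" % NS["oai"]):
        creators = [c.text or "" for c in record.iter("{%s}creator" % NS["dc"])]
        if any(name.lower() in c.lower() for c in creators):
            title = record.find(".//dc:title", NS)
            titles.append(title.text if title is not None else "")
    return titles


# Invented response in the shape of an oai_dc ListRecords reply.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><metadata>
      <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                 xmlns:dc="http://purl.org/dc/elements/1.1/">
        <dc:title>Sample dataset</dc:title>
        <dc:creator>Scholar, Jane Q.</dc:creator>
      </oai_dc:dc>
    </metadata></record>
    <record><metadata>
      <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                 xmlns:dc="http://purl.org/dc/elements/1.1/">
        <dc:title>Other paper</dc:title>
        <dc:creator>Someone, Else</dc:creator>
      </oai_dc:dc>
    </metadata></record>
  </ListRecords>
</OAI-PMH>"""

# repository.example.org is a placeholder, not a real endpoint.
print(list_records_url("https://repository.example.org/oai"))
print(records_by_creator(SAMPLE, "Scholar, Jane"))
```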

Not all of these sources are equally reliable for discovering information about scholars. For example, a recent study by Klein and Van de Sompel indicates that, in spite of the information scholars can provide about themselves on ORCID, many do not fill in data that would be useful for identification.

Because the global scholar profile is supposed to be the known good information for future disambiguation, the data gathered for the global scholar profile at this stage may need to be reviewed by a human before we trust it. For example, the screenshot below is from Scopus, and shows multiple entries for Herbert Van de Sompel which refer to the same person.

Scopus has multiple entries for "Herbert Van de Sompel".

Discovering Where Scholars Post Their Work

Once we have a global scholar profile for a scholar, we can search for their content on known scholarly portals. Several methods exist to discover hints as to which scholarly portals contain a scholar's content.

Homepages

If we know a scholar's homepage, it might be another potential source of links to additional content produced by that scholar. I decided to see if scholars acted this way. In August-September of 2016, I used the Microsoft Academic API to find the homepages of the top 99 researchers from 13 different knowledge domains. These 1287 scholarly records broke down as shown in the table below, leaving 723 homepages with a 200 status. I downloaded each of those 723 homepages and extracted its links.

Total Records	1287
Records without a homepage	133
Homepages Resulting in soft-404s	369
Homepages with connection errors	61
Homepages with too many redirects	1
Homepages with a 200 status	723
Homepages containing one or more URIs from the list of scholarly tools	204

Each link was then compared with the domain name of the tools listed in Kramer and Boseman's 400+ Tools and innovations. Out of 723 homepages 204 (28.2%) contained 1 or more URIs matching a tool from that list. This does indicate that homepages could be used as a source of additional sites that may contain the work of the scholar in question.
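The comparison described above can be sketched as follows. This is not the script I used for the study; the tool domain set is a small illustrative subset, the homepage markup is made up, and the matching rule (exact host or any subdomain of a tool's domain) is one reasonable assumption about what "matching a tool" means.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag in an HTML document."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the homepage URI.
                self.links.append(urljoin(self.base_url, href))


def matching_tools(html, base_url, tool_domains):
    """Return the tool domains that at least one link on the page points to."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    found = set()
    for link in parser.links:
        host = (urlparse(link).hostname or "").lower()
        for domain in tool_domains:
            if host == domain or host.endswith("." + domain):
                found.add(domain)
    return found


TOOL_DOMAINS = {"figshare.com", "github.com", "slideshare.net"}  # illustrative subset
html = """<a href="https://github.com/jqscholar">code</a>
<a href="/cv.pdf">CV</a>
<a href="http://www.slideshare.net/jqscholar">slides</a>"""
print(sorted(matching_tools(html, "http://example.edu/~jqscholar/", TOOL_DOMAINS)))
```

A homepage counts toward the 204 in the table when this kind of check returns a non-empty set.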

Now that Microsoft Academic API has changed its terms, alternatives to finding homepages will be useful. Fang, Si, and Mathur tested several methods of detecting faculty homepages in web search engine results. Their study provides some direction on locating a scholar's work on the web. They used the Yahoo BOSS API to acquire search results. These search results were then evaluated for accuracy using site specific heuristics, logistic regression, SVM, and a joint prediction model. They discovered that the joint prediction model outperformed the other methods.

Social Media Profiles

In addition to scholarly databases, social media profiles may offer additional sources for us to find information about scholars. The social graph in services like Twitter and Facebook provides additional dimensions that can be analyzed.

For example, if we know an institution's Twitter account, how likely is it that a scholar follows this account? If we cannot find a scholar's Twitter account using their institution's Twitter account, can we discover them using link prediction techniques like Schall's triadic closeness?

In addition, there is ample work in discovering researchers on Twitter. For example, Hadgu and Jäschke used Twitter to determine the relationships between computer scientists, reviewing several machine learning algorithms to discover demographic information, topics, and the most influential computer scientists. Instead of looking at the institution's Twitter account as a base for finding computer scientists, they used the Twitter accounts of scientific conferences. Perhaps our hypothetical system can use conference information from a scholar's publication list in this way.

It is also possible that a scholar's social media posts contain links to websites where they post their data. We can use their social media feeds to discover links to scholarly portals and then disambiguate them.

Querying Portals Directly

On top of hints, we can query the portals directly using their native capabilities. Unfortunately, the same capabilities for finding scholars are not available at all portals. To discover these capabilities, I started with Kramer and Boseman's 400+ Tools and Innovations in Scholarly Communication. I sorted the list by number of Twitter followers as a proxy for popularity. I then filtered the list for those tools categorized as Publication, Outreach, or Assessment. Finally, I selected the first 36 non-journal portals for which I could find scholarly output hosted on the portal. I then reviewed the different ways of discovering scholars on these sites.

The table below contains the list of portals used in my review. In order to describe the nature of each site, I have classified them according to the categories used in Kaplan and Haenlein's "Users of the world, unite! The challenges and opportunities of Social Media". The categories used in the table below are:

Social Networking applies to portals that allow users to create connections to other users, usually via a "friend" network or by "following". Examples: MethodSpace, Twitter
Blogs encompasses everything from blog posts to magazine articles to forums. Examples: HASTAC, PubPeer, The Conversation
Content Communities involves portals where users share media, such as datasets, videos, and documents, including preprints. Examples: Figshare, BioRxiv, Slideshare, JoVe
Collaborative Works is reserved for portals where users collaboratively change a single product, like a wiki page. Examples: Wikipedia

Portal	Kaplan and Haenlein Social Media Classification
Academic Room	Blogs, Content Communities
AskforEvidence	Blogs
Benchfly	Content Communities
BioRxiv	Content Communities
BitBucket	Content Communities
Dataverse	Content Communities
Dryad	Content Communities
ExternalDiffusion	Blogs
Figshare	Content Communities
GitHub	Content Communities, Social Networking
GitLab.com	Content Communities
Global biodiversity information facility Data	Content Communities
HASTAC	Blogs
Hypotheses	Blogs
JoVe	Content Communities
JSTOR daily	Blogs
Kaggle Datasets	Content Communities
Methodspace	Social Networking, Blogs
Nautilus	Blogs
Omeka.net	Content Communities
Open Science Framework	Content Communities
PubMed Commons	Blogs
PubPeer	Blogs
ScienceBlogs	Blogs
Scientopia	Blogs
SciLogs	Blogs
Silk	Content Communities
SlideShare	Content Communities
SocialScienceSpace	Blogs
SSRN	Blogs
Story Collider	Content Communities
The Conversation	Blogs
The Open Notebook	Blogs
United Academics	Blogs
Wikipedia (& Wikimedia Commons)	Collaborative Works
Zenodo	Content Communities

Local Portal Search Engines

I wanted to know if I could find a scholar, by name, in this set of portals using local portal search engines. If such services are present on each portal, then our automated system could submit a scholar's name to the search engine and then scrape the results.

I reviewed whether or not the portal contained profile pages for its users. Profile pages are special web pages that contain user information that can be used to identify the scholar. Contained within a profile page might be the additional information necessary to identify that it belongs to the scholar we are interested in. This is important, because a profile page provides a single resource where we might be able to verify that the user has an account on the portal. Without it, our system would need to go through the actual contributions to each portal.

For our 36 portals, 24 contained profile pages. This indicates that 24 portals associate some concept of identity with their content. With the exception of Academic Room, profiles in the portals also provide links to the scholar's contributions to the portal.

This screenshot shows a common case of a search engine that provides profiles in its search results. I have outlined one of the links to a profile in a red box and shown a separate screenshot of the linked profile.

Next, I reviewed each portal to discover if its local search engine, if present, provided profiles as search results. For 13 portals, the local search engine provided profile pages in their search results. This means that I was able to type a scholar's name into the portal's search bar and find their profile page directly linked from the result. In this case, an automated system would only need to scrape the search results pages to find the profile pages. Once the profile pages are acquired, the system can then compare them against what we know about the scholar to determine if the scholar has an identity on that portal. In some cases, a scraper can use pattern matching to eliminate the non-profile URIs from the list of results.
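A minimal sketch of that pattern matching step follows. The portal hostname and the URI pattern are hypothetical; each real portal would need its own pattern, worked out by inspecting its profile URIs.

```python
import re

# Hypothetical pattern for a made-up portal; not taken from any real site.
PROFILE_PATTERNS = {
    "portal.example.org": re.compile(
        r"^https?://portal\.example\.org/(authors|profile)/[^/?#]+/?$"
    ),
}


def profile_uris(portal, result_uris):
    """Keep only the search-result URIs that look like profile pages."""
    pattern = PROFILE_PATTERNS[portal]
    return [uri for uri in result_uris if pattern.match(uri)]


# Made-up URIs as they might be scraped from a results page.
scraped = [
    "https://portal.example.org/articles/some-dataset/12345",
    "https://portal.example.org/authors/Jane_Scholar",
    "https://portal.example.org/search?q=Jane+Scholar",
    "https://portal.example.org/profile/jqscholar/",
]
print(profile_uris("portal.example.org", scraped))
```

Only the surviving URIs would then be fetched and compared against the global scholar profile.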

An example of a search engine providing profiles in its results is shown above with Figshare. In this case, searching for "Ana Maria Aguilera-Luque" on Figshare leads to a list of landing pages for uploaded content. Content on Figshare is associated with a user, and a clickable link to that user's profile shows up in the search results under the name of the uploaded content.

This screenshot shows an example of a search engine that does not provide profiles in its search results, even though the portal has profiles. The screenshots are of the search results, following the link to the document, and then following the link from the document to the profile page. Each followed link is outlined in a red box.

Unfortunately, this is not the case for all results. For 4 portals, the profile page is only available if one clicks on a search result link, and then clicks on the profile link from that search result. This increases the complexity of our automated system because now it must crawl through more pages before finding a candidate set of profiles to review.

The figure above shows an example of this case, where searching for "Chiara Civardi" on the magazine web site UA Magazine leads one to a list of articles. Each article contains a link to the profile of its author, thus allowing one to reach the scholar's profile.

This screenshot shows an example of a site that does not provide user profiles at all, but does provide search results if a scholar's name shows up in a document.

For 9 portals, the search results are the only source of information we have for a given scholar on the portal. Because the search results may be based on the search terms in the scholar's name, our automated system must crawl through some subset, possibly all, of the results to determine if the scholar has content on the given portal.

The figure above shows a search for "Heather Cucolo" on the audio site "The Story Collider" which leads a user to the list of documents containing that string. Our automated system would need to review the content of the linked pages to determine if the Heather Cucolo we were searching for had content posted on this site.

And for 10 portals, the local search engine was not successful or did not exist. In these cases I had to resort to using a web search engine -- I used Google -- to find a profile page or content belonging to the scholar. I did so using the site search operator and the name of the scholar.

The table below shows the results of my attempt to manually find a scholar's work on each of the 36 portals.

Portal	Profiles Exist?	How did I find portal content based on actual scholar's name?	How did I get from local search results to profile page?
Academic Room	Yes	Web Search
AskforEvidence	No	Local Search	No profile, only search results
Benchfly	Yes	Web Search
BioRxiv	No	Local Search	No profile, only search results
BitBucket	Yes	Web Search
Dataverse	No	Local Search	No profile, only search results
Dryad	No	Local Search	No profile, only search results
ExternalDiffusion	No	Local Search	No profile, only search results
Figshare	Yes	Local Search	Profile pages in results
GitHub	Yes	Local Search w/ Special Settings	Profile pages in results, if correct search used
GitLab.com	Yes	Web Search
Global biodiversity information facility Data	Yes	Local Search	Click on result, Profile linked from result page
HASTAC	Yes	Local Search	Profile pages in results
Hypotheses	Yes	Web Search
JoVe	Yes	Local Search	Profile pages in results
JSTOR daily	Yes	Local Search	Click on result, Profile linked from result page
Kaggle Datasets	Yes	Local Search	Profile pages in results
Methodspace	Yes	Local Search	Profile pages in results
Nautilus 3 sentence science	No	Local Search	No profile, only search results
Omeka.net	No	Web Search
Open Science Framework	Yes	Local Search	Profile pages in results
PubMed Commons	No	Web Search
PubPeer	No	Local Search	No profile, only search results
ScienceBlogs	Yes	Local Search	Profile pages in results
Scientopia	Yes	Local Search	Profile pages in results
SciLogs	Yes	Web Search
Silk	No	Web Search
SlideShare	Yes	Local Search	Click on result, Profile linked from result page
SocialScienceSpace	Yes	Local Search	Profile pages in results
SSRN	Yes	Local Search	Profile pages in results
Story Collider	No	Local Search	No profile, only search results
The Conversation	Yes	Local Search	Profile pages in results
The Open Notebook	Yes	Web Search
United Academics	Yes	Local Search	Click on result, Profile linked from result page
Wikipedia (& Wikimedia Commons)	Yes	Local Search w/ Special Settings	Profile pages in results, if correct search used
Zenodo	No	Local Search	No profile, only search results

Portal Web APIs

The result pages of local search engines must be scraped. A web API might provide structured data that can be used to effectively find the work of the scholar.

To search for web APIs for each portal, I used the following method:

Look for the terms "developers", "API", "FAQ" on the main page of each portal. If present, follow those links to determine if the resulting resource contained further information on an API.
Use the local search engine to search for these terms.
Use Google search with the following queries:

site:<hostname> rest api
site:<hostname> soap api
site:<hostname> api
site:<hostname> developer
<hostname> api
<hostname> developer
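The query list above is easy to generate per portal. The function name is my own, and example.org stands in for a portal's hostname.

```python
def api_discovery_queries(hostname):
    """Return the six web search queries used to look for a portal's API."""
    site_terms = ["rest api", "soap api", "api", "developer"]
    open_terms = ["api", "developer"]
    queries = [f"site:{hostname} {term}" for term in site_terms]
    queries += [f"{hostname} {term}" for term in open_terms]
    return queries


for query in api_discovery_queries("example.org"):
    print(query)
```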

Using this method, I could only find evidence of web APIs for 14 of the 36 portals. PubPeer's FAQ states that they have an API, but they request that API users contact them for more information, and I could not find their documentation online. I included PubPeer in this count, but was unable to review its documentation.

By reviewing the public API documentation, I was able to confirm that a search for scholars by name on 5 of the portals allowed one to match names to strings in multiple API fields. For example, the Dataverse API allows one to search for a string in multiple fields. The example response in the documentation is for the search term "finch", which does return a result containing an author name of "Finch, Fiona".

Like most software, some of these APIs were continuing to add functionality. For example, the current version of Zenodo's REST API allows users to deposit data and metadata. The beta version of this API provides the ability to "search published records", but this functionality is not yet documented. This functionality is expected to be available "in the autumn". Zenodo also provides an OAI-PMH interface, so a system could conceivably harvest metadata about all Zenodo records and perform its own searches for scholars.

Other APIs did not provide the ability to search for users based on identity. Much like its local search engine, BitBucket's API requires that one know the id of the user before querying, which does not help us find scholars on their site. Omeka.net has an API, but Omeka.net contains many sites running the Omeka software. The users of these sites do not necessarily enable their API. Regardless, Omeka's API documentation states that "users cannot be browsed". I was uncertain if this applied to search queries as well, but found no evidence in the documentation that they supported search of users, even as keywords.

Below are the results of my review of all 36 portals. It is possible that some of the portals marked "No" actually contain an API, but I was unable to find its documentation or evidence of it using the method above.

Portal	Evidence of API Found
Academic Room	No
AskforEvidence	No
Benchfly	No
BioRxiv	No
BitBucket	Yes
Dataverse	Yes
Dryad	Yes
ExternalDiffusion	No
Figshare	Yes
GitHub	Yes
GitLab.com	Yes
Global biodiversity information facility Data	Yes
HASTAC	No
Hypotheses	No
JoVe	No
JSTOR daily	No
Kaggle Datasets	No
Methodspace	No
Nautilus 3 sentence science	No
Omeka.net	Yes
Open Science Framework	Yes
PubMed Commons	No
PubPeer	Yes
ScienceBlogs	No
Scientopia	No
SciLogs	No
Silk	Yes
SlideShare	Yes
SocialScienceSpace	No
SSRN	No
Story Collider	No
The Conversation	No
The Open Notebook	No
United Academics	No
Wikipedia (& Wikimedia Commons)	Yes
Zenodo	Yes

Web Search Engines

If local portal search engines and web APIs are ineffective, we can use web search engines, much like EgoSystem and Yi Fang's work. As noted above, I did need to use web search engines to find profiles for some users when the local portal search engine was either unsuccessful or nonexistent. Depending on the effectiveness of these site-specific services, web search engines may also be useful in lieu of the local search engine or API.

The table below shows four popular search engines, what data is available via their API, and what restrictions any system will encounter with each. As noted before, the Yahoo BOSS API no longer exists, but is included because Yahoo! is a well known search engine. DuckDuckGo's Instant Answers API does not provide full search results due to digital rights issues. It focuses on topics, categories, and disambiguation, so "most deep queries (non topic names) will be blank". This leaves Bing and Google as the offerings that may help us, but they have restrictions on the number of times they can be accessed before limiting occurs.

Search Engine	Data available via API	Restrictions
Bing	Links to Search Results; Date Published	Free for 1K calls per month up to 3 months
DuckDuckGo	Topic Summaries; Links to Some Search Results; No full search results	Rate limited
Google	Search Results	100 queries per day for free; $5 / 1000 queries up to 10K queries per day
Yahoo!	Search Results	Defunct as of March 31, 2016
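To get a feel for what these limits mean in practice, here is a back-of-the-envelope sketch using only the Google figures from the table. It assumes one query per scholar and portal pair, that the free daily queries are used first, and that paid queries are billed pro rata rather than in blocks of 1000; none of these assumptions comes from Google's documentation.

```python
import math

GOOGLE_FREE_PER_DAY = 100      # from the table above
GOOGLE_PRICE_PER_1000 = 5.00   # US dollars, from the table above
GOOGLE_MAX_PER_DAY = 10_000    # from the table above


def days_on_free_tier(scholars, portals, queries_per_pair=1):
    """Days needed to check every scholar on every portal without paying."""
    total = scholars * portals * queries_per_pair
    return math.ceil(total / GOOGLE_FREE_PER_DAY)


def paid_cost_one_day(queries):
    """Cost of running all queries in one day, assuming pro rata billing."""
    if queries > GOOGLE_MAX_PER_DAY:
        raise ValueError("exceeds the daily query cap")
    billable = max(0, queries - GOOGLE_FREE_PER_DAY)
    return billable * GOOGLE_PRICE_PER_1000 / 1000


# 100 scholars across the 36 portals reviewed here is 3600 queries.
print(days_on_free_tier(100, 36))
print(paid_cost_one_day(100 * 36))
```

Even a modest batch of scholars quickly exceeds the free allowance, which is one reason to prefer local search engines or APIs where they work.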

Queries would likely be of a form like that used with EgoSystem, e.g., "LinkedIn+Marko+Rodriguez".

Because web search engines can return a large number of results, our hypothetical system would need to have limits on the number of results that it reviews. It would also need to determine the best queries to use for generating results for a given portal.
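A small sketch of how such queries might be built and how a result limit might be enforced. The function names and the default limit of 20 are my own choices, and the site operator form mirrors the manual Google searches I described earlier rather than anything EgoSystem documents.

```python
from itertools import islice
from urllib.parse import quote_plus


def portal_query(portal_name, scholar_name):
    """URL-encode a query of the form used with EgoSystem."""
    return quote_plus(f"{portal_name} {scholar_name}")


def site_query(hostname, scholar_name):
    """URL-encode a query that restricts results to one portal's hostname."""
    return quote_plus(f'site:{hostname} "{scholar_name}"')


def first_n(results, limit=20):
    """Review at most `limit` results, however many the engine returns."""
    return list(islice(results, limit))


print(portal_query("LinkedIn", "Marko Rodriguez"))
print(site_query("figshare.com", "Jane Scholar"))
print(len(first_n(iter(range(1000)))))
```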

Crawling the Portal and Building Our Own Search Engine

If using web search is cost prohibitive or ineffective, we can potentially crawl the sites ourselves and produce our own search engine.

I evaluated each portal to determine if the website served a robots.txt file from its root directory in compliance with the Robots Exclusion Protocol. Using this file, the portal indicates to a search engine which URI paths it does not wish to have crawled using the keyword "disallow". Because the disallow applies only to certain paths or even certain crawlers, it may not apply to our hypothetical system. I discovered that 29 out of 36 portals have a robots.txt.

Portals may also have a sitemap exposing information about which URIs are available to the crawler. A link to a sitemap can be stored in the robots.txt. Sitemaps are also located at different paths on the portal. For example, http://www.example.com/path1/sitemap.xml is a sitemap that applies to the path /path1/ and will not contain information for URIs containing the string http://www.example.com/path2. I only examined if sitemaps were listed in the robots.txt or existed at the root directory for each portal. I discovered that 11 portals listed a sitemap in their robots.txt and 12 portals had a sitemap.xml or sitemap.xml.gz in their root directory.
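Python's standard library can perform both checks. The sketch below parses a made-up robots.txt, asks whether a hypothetical crawler named "scholar-finder" may fetch two URIs, and lists any sitemaps the file declares. It assumes Python 3.8 or later, where RobotFileParser.site_maps() is available.

```python
from urllib.robotparser import RobotFileParser


def inspect_robots(robots_txt, user_agent, uris):
    """Report which URIs the agent may crawl and which sitemaps are listed."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    allowed = {uri: parser.can_fetch(user_agent, uri) for uri in uris}
    sitemaps = parser.site_maps() or []  # site_maps() returns None if absent
    return allowed, sitemaps


# Made-up robots.txt for illustration.
ROBOTS = """User-agent: *
Disallow: /private/
Sitemap: http://www.example.com/sitemap.xml
"""

allowed, sitemaps = inspect_robots(
    ROBOTS,
    "scholar-finder",  # hypothetical crawler name
    [
        "http://www.example.com/profiles/jqscholar",
        "http://www.example.com/private/drafts",
    ],
)
print(allowed)
print(sitemaps)
```

Checking for a sitemap.xml at the root directory would still require a separate HTTP request, since that location is a convention rather than something robots.txt records.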

The results of my review of these portals are shown below.

Portal	Robots.txt present	Sitemap in robots.txt	Sitemap in root level directory
Academic Room	Yes		Yes
AskforEvidence	Yes
Benchfly	Yes	Yes	Yes
BioRxiv	Yes	Yes	Yes
BitBucket	Yes
Dataverse
Dryad	Yes	Yes
ExternalDiffusion
Figshare
GitHub	Yes
GitLab.com	Yes
Global biodiversity information facility Data	Yes
HASTAC	Yes		Yes
Hypotheses	Yes		Yes
JoVe	Yes	Yes
JSTOR daily			Yes
Kaggle Datasets
Methodspace	Yes	Yes	Yes
Nautilus 3 sentence science	Yes	Yes	Yes
Omeka.net	Yes
Open Science Framework	Yes
PubMed Commons	Yes	Yes
PubPeer	Yes
ScienceBlogs	Yes		Yes
Scientopia	Yes	Yes	Yes
SciLogs	Yes
Silk	Yes
SlideShare	Yes	Yes
SocialScienceSpace	Yes
SSRN	Yes
Story Collider	Yes	Yes	Yes
The Conversation	Yes	Yes	Yes
The Open Notebook	Yes
United Academics
Wikipedia (& Wikimedia Commons)	Yes
Zenodo

It is likely that portals with such technology in place will already be well indexed by search engines.

Next Steps

In searching for information sources to feed our hypothetical system, I discovered some sources of information that can be used as a scholarly footprint. I conducted an evaluation of the documentation for these systems, with an eye on what information they provide, but a more extensive evaluation of many of these systems is needed. Other portals, such as ResearchGate and Academia.edu, were not evaluated, but may be useful data sources as well. How often do scholars put useful data in their Twitter, Facebook, or other social media profiles? Also, what can be done to remove human review from the process of generating and verifying a global scholar profile?

Some portals have multiple options when it comes to determining if a scholar has posted work there. Many have local portal search engines that we can use, but I anecdotally noticed that some local search engines are more precise than others when it comes to their results. Within the context of finding the work of a given scholar, a review of the precision and recall of the search engines on these portals might help determine if a web search engine is a better choice than the local search engine for a given portal.

Open access journals, such as PLOS, require that authors post their data online, in sites such as Figshare and Open Science Framework. If we know that a scholar published in an open access journal, can we search one of their journal articles for links to these datasets and hence find the scholar's profile pages on sites such as Figshare and Open Science Framework?

Much like the APIs used for the global scholar profile, the APIs for each portal will need to be evaluated for precision and usefulness to our system. Some provide the ability to search for users, but others only provide the ability to find the work of a user if the scholar's user ID is already known.

Preliminary work using web search engines shows promise, but may also require a study to determine how to most effectively build queries that discover scholars on given portals. Such a study would also need to determine the ideal number of search engine results to review before our system stops trying to find the scholar on a portal using this method.

I evaluated 36 portals to determine if they provide robots.txt files and sitemaps to help search engines crawl them. If a portal uses these items, does it rank well for our search engine queries when trying to find a scholar by name and portal? How many of the portals lacking these items rank poorly in web search engines?
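A check like this can be automated with Python's standard library. The sketch below parses the text of a robots.txt file and reports any sitemaps it declares and whether the site root may be crawled. The robots.txt content and the portal.example domain are invented for the example; a real run would first download each portal's file. It uses RobotFileParser.site_maps(), which requires Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser


def inspect_robots(robots_txt, base_url, user_agent="*"):
    """Summarize what a robots.txt file says about crawling a portal."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {
        # site_maps() returns None when no Sitemap lines are present
        "sitemaps": parser.site_maps() or [],
        "root_allowed": parser.can_fetch(user_agent, base_url + "/"),
        "private_allowed": parser.can_fetch(user_agent, base_url + "/private/x"),
    }


# Invented robots.txt content for a hypothetical portal
sample_robots = """\
User-agent: *
Disallow: /private/
Sitemap: https://portal.example/sitemap.xml
"""

report = inspect_robots(sample_robots, "https://portal.example")
```

A portal may also publish a sitemap without listing it in robots.txt, so an empty list here does not prove that no sitemap exists.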

Ahmed, Low, Aly, and Josifovski studied the dynamic nature of user profiles, modeling how user interests change over time. Gueye, Abdessalem, and Naacke attempted to account for these changes when building recommendation systems. With this evidence that user information changes, how often does information about a scholar change? Scholars publish new work and upload new data. In some cases, such as Figshare, a scholar may post a dataset and never return, but other sites, like United Academics, may feature frequent updates. How often should our hypothetical system refresh its information about scholars?

Our case of disambiguating scholars is a subset of the larger problem of entity resolution: do two records refer to the same item? Lise Getoor and Ashwin Machanavajjhala provide an excellent tutorial on the larger problem of entity resolution. They note that the problem has become even more important to solve now that the web provides information from multiple heterogeneous sources. Their summary mentions that different matching techniques work better for comparing certain fields; for example, similarity measures like Jaccard and Cosine similarity work well for text and keywords, but not necessarily for names, where Jaro-Winkler performs better. In addition, they cover the use of machine learning and crowdsourcing as ways to augment the simple matching of field contents to one another. Which parts of the global scholar profile are useful for disambiguation/entity resolution? In addition to the scholar, will other entities in the profile need to be resolved as well? What matching techniques are most accurate for each part of the global scholar profile?
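To make the difference between these measures concrete, here is a minimal sketch of Jaccard similarity over keyword sets and Jaro-Winkler similarity over name strings. These are textbook formulations written for illustration, not code from the tutorial. The prefix scale of 0.1 and prefix length of 4 are the commonly used defaults, and the sample keywords are placeholders.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections, treated as sets."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def jaro(s1, s2):
    """Jaro similarity of two strings, in the range 0.0 to 1.0."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if they are equal and within this distance
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(c1 != c2 for c1, c2 in zip(seq1, seq2)) // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1, s2, prefix_scale=0.1):
    """Jaro similarity boosted for a shared prefix of up to 4 characters."""
    base = jaro(s1, s2)
    prefix = 0
    for c1, c2 in zip(s1[:4], s2[:4]):
        if c1 != c2:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1 - base)


# Placeholder profile fields for illustration
keyword_score = jaccard(
    {"web", "archiving", "digital", "libraries"},
    {"digital", "libraries", "preservation"},
)
name_score = jaro_winkler("MARTHA", "MARHTA")
```

Jaccard ignores character order entirely, which suits bags of keywords, while Jaro-Winkler rewards shared prefixes and tolerates small transpositions, which suits typos and spelling variants in names.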

Buccafurri, Lax, Nocera, and Ursino attempted to solve the problem of connecting users across social networks, a concept they referred to as "social internetworking". Recognizing that the detection of the same account on different networks is related to the concept of link prediction, they offer an algorithm that takes into account the similarity between user names and the similarity of common neighbors. How many scholarly portals can make use of social graph information?
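The sketch below is not the authors' algorithm; it only illustrates the general idea of combining a user name similarity with a neighbor overlap into one score. The equal weighting, the use of difflib's SequenceMatcher as the string measure, and the account data are all assumptions of mine. It also assumes the neighbors on both networks have already been mapped to shared identifiers, which is itself part of the problem.

```python
from difflib import SequenceMatcher


def username_similarity(a, b):
    """Case-insensitive string similarity between two user names (0.0 to 1.0)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def neighbor_overlap(neighbors_a, neighbors_b):
    """Jaccard overlap between two accounts' sets of neighbors."""
    a, b = set(neighbors_a), set(neighbors_b)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def same_person_score(account_a, account_b, name_weight=0.5):
    """Weighted mix of name similarity and neighbor overlap (hypothetical weighting)."""
    name_sim = username_similarity(account_a["username"], account_b["username"])
    graph_sim = neighbor_overlap(account_a["neighbors"], account_b["neighbors"])
    return name_weight * name_sim + (1 - name_weight) * graph_sim


# Invented accounts on two different portals
portal_one = {"username": "jsmith", "neighbors": {"alice", "bob", "carol"}}
portal_two = {"username": "j.smith", "neighbors": {"alice", "bob", "dave"}}
unrelated = {"username": "kwong", "neighbors": {"erin"}}

likely_score = same_person_score(portal_one, portal_two)
unlikely_score = same_person_score(portal_one, unrelated)
```

A portal that exposes no follower or co-author relationships would leave the second term empty, so the score would rest on the user name alone.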

Lops, Gemmis, Semeraro, Musto, Narducci, and Bux build profiles for use in recommender systems with specific focus on solving two problems. The first is polysemy, where a term can have multiple meanings. The second is synonymy, where many terms have the same meaning. Their work focused on associating tags with users and constructing recommendations based on the terms encountered. For our system, polysemy will need to be investigated because the same term will mean different things in different disciplines (e.g., port means something different to a computer hardware engineer than to a network engineer). Our system may become even more confused if presented with terms from interdisciplinary scholars. However, unlike recommender systems, the system will use more than just terms for disambiguation, relying on other, less ambiguous data like affiliations and email addresses. Are polysemy and synonymy, then, issues that our system needs to resolve? If so, for which parts of the global scholar profile (i.e., fields) do they need to be resolved?

For the fields we have chosen to identify scholars, which matching techniques are most accurate? As noted before, some algorithms work better for names and userids than for keywords. What algorithms, including machine learning, network analysis, and probabilistic soft logic might best match some of our fields? Do they vary between scholarly portals?

Not all scholars may want to have their online work discovered in this way. What techniques can be employed to allow some scholars to opt out?

Summary

Searching for scholars is slightly easier than searching for other online identities because, by their very nature, scholars produce content that can be useful for disambiguation. In this article, I reviewed sources of data that can be used by a hypothetical automated system seeking to identify the websites to which scholars have posted content. I provided a listing of different services that might help one build a global scholar profile that can be further used to disambiguate scholars online.

In order to discover where scholars post items online, I looked at scholar homepages, social media profiles, and the services offered by the portals themselves. After sampling homepages from the Microsoft Academic Search API, I found that 28.2% of the homepages contained links to websites on Kramer and Bosman's list of 400+ Tools and Innovations.

I reviewed the capabilities of 38 scholarly portals, discovering that, in some cases, local portal search engines can be used to locate content for a scholar. Any automated tool would need to scrape these results, and getting to a scholar's information on the site falls into one of three patterns. To find an alternative to scraping, I also discovered APIs for a number of portals, but could only determine whether scholars can be searched for on 5 of them.

To augment or replace the searching capabilities of scholarly portals, I examined the capabilities of search engine APIs and discovered the cost associated with each. As a few existing research projects looking for scholarly information (e.g., EgoSystem) make use of search engine APIs, I wanted to show that this was still a viable option.

So, sources do exist for building the global scholar profile, and methods exist at known scholarly portals to find the works of scholars at each portal. Evaluating solutions for disambiguation will be the next key step to finding scholars' footprints in the portals in which they tread.

--Shawn
