I began the conference at the SAVE-SD workshop, which focuses on the semantics, analytics, and visualization of scholarly data. The workshop accepted 6 full research papers, 2 position papers, and 2 poster papers, a relatively high acceptance rate. It was kicked off by Alejandra Gonzalez-Beltran and Francesco Osborne, who encouraged the use of Research Articles in Simplified HTML.
Alex Wade gave us an introduction to the Microsoft Academic Service (MAS) and a sneak peek at the new features offered by Microsoft Academic, such as the Microsoft Academic Graph. They are in the process of adding semantic, rather than keyword, search with the intention of understanding academic users' intent when searching for papers. They have opened up their dataset to the community and provide APIs for future community research projects.
Angelo Salatino presented "Detection of Embryonic Research Topics by Analysing Semantic Topic Networks". The study investigated the discovery of "embryonic" (i.e. emerging) topics by testing for more than 2000 topics in more than 3 million publications. The goal is to determine if we can recognize trends in research while they are happening, rather than years later. They were able to show the features of embryonic topics, and their next step is to automate their detection.
Bahar Sateli presented "Semantic User Profiles: Learning Scholars’ Competences by Analyzing their Publications". The goal of this study is to mitigate the information overload associated with semantic publishing. They found that it is feasible to semantically model a user's writing history. By modeling the user, better search ranking of document results can be provided for academic researchers. It can also be used to allow researchers to find others with similar interests for the purposes of collaboration.
Francesco Ronzano presented "Knowledge Extraction and Modeling from Scientific Publications" where they propose a platform to turn data from scientific publications into RDF datasets, using the Dr. Inventor Text Mining Framework Java library. They also generate several example interactive web visualizations of the data. In the future, they seek to improve the Text Mining Framework.
Joakim Philipson presented "Citation functions for knowledge export - a question of relevance, or, can CiTO do the trick?". He explored the use of the CiTO ontology in order to understand knowledge export - "the transfer of knowledge from one discipline to another as documented by cross-disciplinary citations". Unfortunately, he found that CiTO is not specific enough to capture all of the information needed to understand this.
Sahar Vahdati presented "Semantic Publishing Challenge: Bootstrapping a Value Chain for Scientific Data". The study discussed "the use of Semantic Web technologies to make scholarly publications and data easier to discover, browse, and interact with". Its goal is to use many different sources to produce linked open datasets about scholarly publications with the intent of improving scholarly communication, especially in the areas of searching and collaboration. Their next step is to start building services on the data they have produced.
Vidas Daudaravicious presented "A framework for keyphrase extraction from scientific journals". His framework is able to use keyphrases to define topics that can differentiate journals. Using these keyphrases, one can improve search results by comparing journals to queries, allowing users to find articles of a similar nature. It also has the benefit of noting trends in research, such as when journal topics shift. Researchers can also use the framework to identify the best journals for paper submission.
Ujwal Gadiraju presented "Analysing Structured Scholarly Data Embedded in Web Pages". They analyzed the use of microdata, microformats, and RDF used as bibliographic metadata embedded in scholarly documents with the intent of building knowledge graphs. They found that the distribution of data across providers, domains, and topics was uneven, with few providers actually providing any embedded data. They also found that Computer Science and Life Science documents were more apt to contain this metadata than other disciplines, but also admitted that their Common Crawl dataset may have been skewed in this direction. In the future, they are planning a targeted crawl with further analysis.
Kata Gábor demonstrated "A Typology of Semantic Relations Dedicated to Scientific Literature Analysis". Her poster shows a model for extracting facts about the state of the art for a particular research field using semantic relations derived from pattern mining and natural language processing techniques.
Erwin Marsi discussed his poster, "Text mining of related events from natural science literature". His study had the goal of producing aggregate facts on the concepts from articles within a corpus. For example, it aggregates the fact that there is an increase in algae based on the text from many papers with research results finding an increase in algae. The idea is to find trends in research papers through natural language processing.
In closing, the SAVE-SD 2016 workshop mentioned that selected papers could be resubmitted to PeerJ.
The morning opened with a Keynote by Wolfgang Nejdl of the Alexandria Project. Wolfgang Nejdl discussed the work at L3S and how they were trying to consider all aspects of the web, from the technical to its effects on community and society. He discussed how social media has become a powerful force, but tweets and posts link to items that can disappear, losing the context of the original post. This reminded me of some other work I had seen in the past. He mentioned how important it was to archive these items.
He also discussed project BUDDAH, including the problem of ranking temporal search results. He demonstrated an alternative way of visualizing temporal search results using the HistDiv project, a visualization for understanding the changing nature of a topic. In this case, we see how search results for the term Rudolph Giuliani change with time: as the person's career (and career aspirations) changed, so did the content of the archived pages about him. He closed by discussing the use of curated archive collections from Archive-It in the collaborative search and sharing platform ArchiveWeb, which allows one to find archive collections pertinent to a search query.
Matthias Steinbauer presented "DynamoGraph: A Distributed System for Large-scale, Temporal Graph Processing, its Implementation and First Observations". DynamoGraph is a distributed system that also allows one to query and calculate metrics on temporal graphs.
Both researchers used the following lunch to discuss temporal graphs at length. I wondered if one could model TimeMaps in this way and use these tools to discover interesting connections between archived web pages.
Mohsen Shahriari discussed "Predictive Analysis of Temporal and Overlapping Community Structures in Social Media". He went into detail on the evolution of communities, represented by graphs, detailing how they can grow, shrink, merge, split, or dissolve entirely. Using datasets from Facebook, DBLP citations, and Enron emails, his experiments showed that smaller communities have a higher chance of survival, and his model had a high success rate in predicting whether a community would survive.
Aécio Santos presented "A First Study on Temporal Dynamics on the Web". He used topical web page classifiers in a focused crawling experiment to analyze how often web pages about certain topics changed. Pages from his two topics, ebola and movies, changed at different rates. Pages on ebola were more volatile, losing and gaining links, mostly due to changing news stories on the topic, whereas movies pages were more stable, with authors only augmenting their contents. He did find that, in spite of this volatility, pages tended to stay on topic over time. The goal is to ensure that crawlers are informed by differences in topics and adjust their strategies accordingly.

Jannik Strötgen presented "Temponym Tagging: Temporal Scopes for Textual Phrases". He discussed the discovery and use of temponyms to understand the temporal nature of text. Using temponyms, machines can determine the time period that a text covers. He explained the issues with finding exact temporal intervals or times for web page topics, seeing as many pages are vague. His temponym project, HeidelTime, has been tested on the WikiWars corpus and the YAGO semantic web system. He also presented further information on this topic later at WWW 2016.
We then shifted into using temporal analysis for security. Staffan Truvé from Recorded Future presented "Temporal Analytics for Predictive Cyber Threat Intelligence". His company specializes in using social media and other web sources to detect potential protests, uprisings, and cyberattacks. He indicated that protests and hacktivism are often talked about online before they happen, allowing authorities time to respond.
We show on the poster that, because many use browser bookmarks or citation managers that store these locating URIs, there must be an easy way to help tools find the DOI. Our suggestion is to store this DOI in the Link header for easy access by these tools.
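As a sketch of the suggestion (the DOI and relation type here are illustrative, not a format the poster prescribes), a publisher's article landing page could expose its DOI in an HTTP response header like:

```http
HTTP/1.1 200 OK
Content-Type: text/html
Link: <https://doi.org/10.9999/example.123>; rel="identifier"
```

A bookmarking tool or citation manager fetching the landing page could then record the persistent DOI rather than the locating URI.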
In closing, Omar Alonso from Microsoft presented "Time to ship: some examples from the real-world". He highlighted some of the ways in which the carousel at the top of Bing is populated, using topic virality on social media as one of many inputs. He talked about the concept of social signatures, derived from all of the social media posts referring to the same link. Using this text, they are able to further determine the aboutness of a given link, helping further with search results. He switched to other topics that help with search, such as connecting place and time. Producing search results for points of interest (POIs) in a given location is in effect an attempt to match people looking for things to do (queries) with social media posts, checkins, and reviews for a given POI. He concluded by saying that there is much work to be done, such as allowing POI results for a given time period, e.g., "things to do in Montréal at night".
Sir Tim Berners-Lee
Sir Tim Berners-Lee spoke about the importance of decentralizing the web and ensuring that users own their own data, about web security, about work to standardize and improve the ease of payments on the web, and finally about the Internet of Things (IoT).
Mentioning the efforts of projects like Solid, he highlighted the need to ensure that users retain their data to ensure their privacy. The idea is that a user can tell the service where to store their data and then they have ownership and responsibility over that data.
He mentioned that, in the past, the Internet had to be deployed by sending tapes through the mail, but now we are heading to a point where the web platform, because it allows you to deploy a full computing platform very quickly, may become the rollout platform of the future. Because of this ability, security is becoming more and more important, and he wants to focus on a standard for security that uses the browser, rather than external systems, as the central point for asking a user for their credentials, thereby helping guard against trojans and malicious web sites. He said that the move from HTTP to HTTPS has been harder than expected, considering many HTTPS pages are "mixed", containing references to HTTP URIs. This results in three different worlds: HTTP pages, HTTPS pages, and pages using upgrade-insecure-requests, which are still mixed, but in a way that is endorsed by the author.
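The upgrade-insecure-requests mechanism referred to here is a standard Content Security Policy directive. A page opts in by sending a response header like the following, which instructs the browser to fetch the page's HTTP subresources over HTTPS instead:

```http
Content-Security-Policy: upgrade-insecure-requests
```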
Next, he spoke about making web payments standardized, comparing it to authentication. There are a wide variety of different solutions for web payments and there needs to be a standard interface. There is also an increasing call to allow customers to pay smaller amounts than before, which many current systems do not handle. Of course, customers will need to know when they are being phished, hence the security implications of a standardized system.
Finally, he covered the Internet of Things (IoT), indicating there are connections to data ownership, privacy, and security.
In the following Q&A session, I asked Sir Tim Berners-Lee about the steps toward browser adoption for technologies such as Memento. He said the first step is to discuss them at conferences like WWW, then engage in working groups, workshops, and other venues. He noted that one also needs to define the users for such new technologies so they can help with the engagement.
Later, during the student Q&A session the following day, Morgannis Graham from McGill University asked Sir Tim Berners-Lee about his thoughts on the role of web archives. He replied that "personally, I am a pack rat and am always concerned about losing things". He highlighted that while the general web users are thinking of the present, it is the role of libraries and universities to think about the future, hence their role in archiving the web. He stated that universities and libraries should work more closely together in archiving the web so that if one university falls, others exist having the archives of the one that was lost. He also stated that we all have a role in ensuring that legislation exists to protect archiving efforts. Finally, he tied his answer back to one of his current projects: what happens to your data when the site you have given it to goes out of business.
Lady Martha Lane-Fox
Wednesday evening ended with an inspiring talk from Lady Martha Lane-Fox. She works for the UK in a variety of roles advancing the use of technology in society. She stated that the country that can (1) improve gender balance in tech, (2) improve the technical skills of the populace, and (3) improve the ability to use tech in the public sector will be the most competitive.
She went further in explaining how the current gender balance is very depressing, noting that in spite of the freedom offered by technology, old hierarchies and structures have been re-established. She indicated that there are studies showing that companies with more diverse boards are more successful, and how we need to tackle this problem, not only from a technical, but also a social perspective.
She discussed the challenges of bringing technology to everyday lives and applauded South Korea's success while highlighting the challenges still present in the UK. She relayed stories of encounters with the citizenry, some of whom were reluctant to embrace the web, but after doing so felt they had more freedom and capability in their lives than ever before. She praised the UK for putting coding on the school curriculum and looking toward the needs of future generations.
She then talked about re-imagining public services entirely through the use of technology. The idea is to make government agencies digital by default in an effort to save money and provide more capability. She highlighted a project where a UK hospital once had 700 administrators and 17 nurses and, through adopting technology, was able to take the same money and hire 700 nurses to work with 17 administrators, thus providing better service to patients.
She closed by discussing her program DotEveryone, a new organization promoting the promise of the Internet in the UK for everyone and by everyone. Her goal is for the UK to be the most connected, most digitally literate, and most gender-equal nation on earth. In a larger sense, she wants to kick off a race among countries to use technology to create the best countries for their citizens.
Mary Ellen Zurko
Wednesday morning started with a keynote by Mary Ellen Zurko, from Cisco. She discussed security on the web. Her first lesson: "The future will be different; so will the attacks and attackers, but only if you are wildly successful". Her point was that the success of the web has made it a target. She then covered the history of basic authentication, S-HTTP, and finally SSL/TLS in HTTPS.
She then discussed the social side of security, indicating that users are often confused about how to respond to web browser warnings about security. There is a 90% ignore rate on such warnings, and 60% of those are related to certificates. She highlighted how difficult it is for users to know whether or not a domain is legitimate and whether the certificate shown is valid. She also highlighted that most users, even expert users, do not fully understand the permissions they are granting, due to the cryptic and sometimes misleading descriptions given to them, mentioning that only 17% of Android users actually pay attention to permissions during installation and only 3% are able to answer questions on what the security permissions mean.
Reiterating the results of a study by Google, she stated that 70% of users clicked through malware warnings in Chrome, while Firefox users heeded such warnings more often. The Google study found that the Firefox warnings provided a better user experience, and thus users were more apt to pay attention and understand them. Following this study, Google changed its warnings in Chrome.
She said that the open web is an equal opportunity environment for both attackers and defenders, detailing how fraudulent tech support scams are quite lucrative. This was discovered in recent work by Cisco, "Reverse Social Engineering Social Tech Support Scammers", where Cisco engineers actively bluffed tech support scammers in order to gather information on their whereabouts and identities.
Of note, she also mentioned that there is a largely unexploited partnership between web science and security.
On Friday morning, Peter Norvig gave an engaging talk on the state of the Semantic Web. He mentioned that his job is to bring information retrieval and distributed systems together. He went through a history of information retrieval, discussing WAIS and the World Wide Web, as well as Archie. Before Google, several efforts were already trying to tame the nascent web.
After Google, the Semantic Web was developed as a way to extract information from the many pages that existed. He talked about how Tim Berners-Lee was a proponent, whereas Cory Doctorow highlighted that there was nothing but obstacles in its path. Peter said that Cory had several reasons for why it would fail, but the main ones were that (1) people lie, (2) people are lazy, and (3) people are stupid, indicating that the information gathered from such a system would consist of intentional misinformation, incomplete information, or misinformation due to incompetence.
Peter then highlighted several instances where this came about. Initially, excellent expressiveness was produced by highly trained logicians, giving us DAML, OWL, RDFa, FOAF, etc. Unfortunately, they found a 40% page error rate in practice, indicating that Cory was correct on all 3 fronts. Peter's conclusion was that highly trained logicians did not seem to solve the identified problems.
Peter then posited "what about a highly trained webmaster?". In 2010, search companies promoted the creation of schema.org with the idea of keeping it simple. The search engines promised that if a site were marked up, then they would show it immediately in search results. This gave users an incentive to mark up their pages and now has resulted in technologies that can better present things like hotel reservations and product information. This led most to conclude that schema.org was an unexpected success.
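As a rough illustration of the kind of markup schema.org enables (the property values here are invented for the example; the vocabulary terms are real schema.org types), a webmaster might annotate a product page with a JSON-LD snippet like:

```json
{
  "@context": "https://schema.org",
  "@type": "Product",
  "name": "Example Widget",
  "offers": {
    "@type": "Offer",
    "price": "19.99",
    "priceCurrency": "USD"
  }
}
```

Search engines that understand this vocabulary can then render the price directly in search results, which is the incentive Peter described.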
Peter closed by saying that obstacles still remain, seeing as most of the data comes from web site owners, still leading to misinformation in some cases. He talked about the need to connect different sources together so that one can, for example, not only find a book on Amazon, but also a listing of the author's interests on Facebook. He hopes that neural networks can be combined with semantic and syntactic approaches to solve some of these large connection problems.
Tzviya Siegman, from John Wiley & Sons Publishing, presented "Scholarly Publishing in a Connected World". She discussed how publications of the past were immutable, and publishers did little with content once something was published. She confessed that, in a world where machines are readers too, publications are a bit behind the times. She further said that we still have an obsession with pages, citing them, marking them, and so on, when in reality the web is not bound by pages. She wants to standardize on a small set of RDFa vocabularies that would enable gathering content by topic, whether the documents published are articles, data, or electronic notebooks. She closed by talking about how Wiley is trying to extract metadata from its own corpus to provide additional data for scholars.
Hugh McGuire presented "Opening the book: What the Web can teach books, and what books can teach the Web". He talked about how books seem to hold a special power and value, specific to the boundedness of a book. The web, by contrast, is unbounded; even a single web site is unknowable, with no sense of a beginning or an end. On the web, however, anyone can publish documents and data to a global audience without any required permission. He talked about how books are singularly important nodes of knowledge, with the ebook business having the opposite motive of the web, making ebooks a kind of restricted, broken version of the web. He wants to be able to combine the two. For example, a system could provide location-aware annotations of an ebook while also sharing those annotations freely, essentially making ebooks smarter and more open.
Ivan Herman revealed Portable Web Publications (PWP), which has serious implications for archiving. The goal is to allow people to download web publications like they do ebooks, PDFs, or other portable articles. There is a need to do so because connectivity is not yet ubiquitous. With the power of the web, one can also embed interactivity into the downloaded document. Of course, there are also additional considerations, like the form factor of the reading device and the needs of the reader. The concept is more than just creating an ebook with interactive components or a web page that can be saved offline. He highlighted the work of publishers in terms of ergonomics and aesthetics, stating that web designers for such portable publications should learn from this work. Portable Web Publications would not be suitable for social web sites, web mail, or anything that depends on real-time data. PWP requires 3 layers of addressing: (1) locating the PWP itself, (2) locating a resource within a PWP, and (3) locating a target within such a resource. In practice, locators depend on the state of the resource, creating a bit of a mess. His group is currently focusing on a manifest specification to solve these issues.
Of course, I was here to present a poster, "Persistent URIs Must Be Used to be Persistent", developed by Herbert Van de Sompel, Martin Klein, and me, which highlights important consequences of the use of persistent URIs such as DOIs.
Thanks to everyone at #www2016 who came by to discuss and experience our poster: https://t.co/XkFueTMqZZ pic.twitter.com/59LgAHSdZB— Shawn M. Jones (@shawnmjones) April 13, 2016
In looking at the data from "Scholarly Context Not Found: One in Five Articles Suffers from Reference Rot", we reviewed 1.6 million web references from 1.8 million articles and discovered 3 things:
- the use of web references in scholarly articles is increasing
- authors frequently use publisher web pages (locating URIs) rather than DOIs (persistent URIs) when creating references
- as the title indicates, one in five articles suffers from reference rot
I appreciated the visit from Sarven Capadisli and Amy Guy, who work on Solid. Many others came by to see our work, like Takeru Yokoi, Hideaki Takeda, Lee Giles, and Pieter Colpaert. Most appreciated the idea, describing it as "simple", with some asking "why don't we have this already?".
WWW Conference Presentations
Even though I attended many additional presentations, I will only detail a few of interest.
As a person who has difficulty with SPARQL, I appreciated the efforts of Gonzalo Diaz and his co-authors in "Reverse Engineering SPARQL Queries". Their goal was to reverse engineer SPARQL queries with the intent of producing better examples for new users, seeing as new users have a hard time with the precise syntax and semantics of the language. Given a database and answers, they wanted to reverse engineer the queries that produced those answers. Unfortunately, they discovered that verifying whether a reverse-engineered SPARQL query is the canonical query for a given database and answer is an NP-complete (intractable) problem. They were, however, able to apply heuristics to a specific subset of queries to solve this problem in polynomial time.
Fernando Suarez presented "Foundations of JSON Schema". He mentioned that JSON is very popular because it is flexible, but there is no way to describe what kind of JSON response a client should expect from a web service. He discussed a proposal from the Internet Engineering Task Force (IETF) to develop JSON Schema, a set of restrictions that documents must satisfy. He said the specification is in its fourth draft, but is still ambiguous. Even online validators disagree on some content, meaning that we need clear semantics for validation, and he proposes a formal grammar. His contribution is an analysis showing that the validation problem is PTIME-complete, but that determining if a document has an equivalent JSON schema is PSPACE-hard, even for very simple schemas. In the future, he intends to work further on integrity constraints for JSON documents and more use cases for JSON Schema.
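As a minimal example of the kind of restrictions a JSON schema expresses (the field names here are invented for illustration), the following schema accepts objects that have a string `title` and, optionally, an integer `year`:

```json
{
  "type": "object",
  "properties": {
    "title": { "type": "string" },
    "year": { "type": "integer" }
  },
  "required": ["title"]
}
```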
David Garcia presented "The QWERTY Effect on the Web: How Typing Shapes the Meaning of Words in Online Human Communications". He highlighted the hypothesis that words typed with more letters from the right side of the keyboard are perceived as more positive than those with more letters from the left. He tested this hypothesis on product ratings from different datasets and found that 9 out of 11 datasets show a significant QWERTY effect, independent of the number of views or comments on an item. He did mention that he needs to repeat the study with different languages and keyboard layouts. He closed by saying that there is no evidence yet that we can predict meanings or change evaluations based on this knowledge.
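A minimal sketch of the quantity behind this hypothesis, assuming a standard QWERTY split of letters between the two hands (the paper's exact operationalization may differ):

```python
# Hand assignment on a standard QWERTY layout (an assumption of this sketch).
RIGHT = set("yuiophjklnm")
LEFT = set("qwertasdfgzxcvb")

def right_side_advantage(word: str) -> float:
    """Fraction of right-hand letters minus fraction of left-hand letters.

    Ranges from -1.0 (all left-hand letters) to 1.0 (all right-hand letters).
    """
    letters = [c for c in word.lower() if c in RIGHT or c in LEFT]
    if not letters:
        return 0.0
    r = sum(c in RIGHT for c in letters)
    # r/n - (n - r)/n simplifies to (2r - n)/n
    return (2 * r - len(letters)) / len(letters)

print(right_side_advantage("pool"))   # all right-hand letters -> 1.0
print(right_side_advantage("water"))  # all left-hand letters -> -1.0
```

Under the hypothesis, items whose names score higher on this measure would tend to receive more positive ratings.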
Justin Cheng presented "Do Cascades Recur?", where he analyzed the rise and fall of memes throughout social media. Prior work shows that cascades (bursts of meme sharing) rise, then fall, but in reality there are many rises and falls over time. He studied these different peaks and tried to determine how and why these cascades recur. Seeing as these bursts are separated among different network communities, cascades recur when people connect communities and reshare something. It turns out that a meme with high virality has less chance of recurring, but one with medium virality may recur months or perhaps years later. He would like to repeat his study with networks other than Facebook and develop improved models of recurrence based on other data.
Pramod Bhatotia presented "IncApprox: The Marriage of Incremental and Approximate Computing". He discussed how data analytic systems transform raw data into useful information, but they need to strike a balance between low latency and high throughput. There are two computing paradigms that try to strike this balance: (1) incremental computation and (2) approximate computing. Incremental computation is motivated by the fact that we often recompute the output after small changes in the input and can reuse memoized parts of the computation that are unaffected by the changed input. Approximate computing is motivated by the fact that an approximate answer is often good enough: we take the entire input dataset, but compute on only parts of the input and then produce approximate output in a low-latency manner. His contribution is the combination of these two approaches.
Jessica Su presented "The Effect of Recommendations on Network Structure". She worked with Twitter on the rollout of a recommendation system that suggests new people to follow. They restricted the experiment to two weeks to avoid any noise from outside the rollout. They found that there is an effect; people's followers did increase after the rollout. They also confirmed that the "rich get richer", with those who already had many followers gaining more followers and those with few still gaining some followers. She also mentioned that people did not appear to be making friends, only following others.
Samuel Way presented "Gender, Productivity, and Prestige in Computer Science Faculty Hiring Networks". This study tried to investigate why women are not participating in computer science. He mentioned that there are conflicting results: universities have a 2-to-1 preference for female faculty applicants, but at the same time there is a bias favoring male students. They developed a framework for modeling faculty hiring networks using a combination of CVs, social media profiles, and other sources on a subset of people currently going through the tenure process. The model shows that gender bias is not uniformly, systematically affecting all hires in the same way and that the top institutions fight over a small group of people. Women are a limited resource in this market, and some institutions are better at competing for them. The result is that accounting for gender does not help predict faculty placement, leading them to conclude that the effects of gender are accounted for by other factors, such as publishing or postdoctoral training rates, or the fact that some institutions appear to be better at hiring women than others. The model predicts that men and women will be hired at equal rates in Computer Science by the 2070s.
Of course, I did not merely enjoy the presentations and posters. At the Monday night SAVE-SD dinner, the Thursday night Gala, and lunch each day, I took the opportunity to acquaint myself with many experts in the field. Representatives from Google, Yahoo!, and Microsoft were also there looking to discuss data sharing, collaboration, and employment opportunities.
I always had lunch company thanks to the efforts of Erik Wilde, Michael Nolting, Roland Gülle, Eike Von Seggern, Francesco Osborne, Bahar Sateli, Angelo Salatino, Marc Spaniol, Jannik Strötgen, Erdal Kuzey, Matthias Steinbauer, Julia Stoyanovich, Jan Jones, and more.
Furthermore, the Gala introduced me to other attendees, like Chris LaRoche, Marc-Olivier Lamothe, Ashutosh Dhekne, Mensah Alkebu-Lan, Salman Hooshmand, Li'ang Yin, Alex Jeongwoo Oh, Graham Klyne, and Lukas Eberhard. Takeru Yokoi introduced me to Keiko Yokoi from the University of Tokyo, who was familiar with many aspects of digital libraries and quite interested in Memento. I also had a fascinating discussion about Memento and the Semantic Web with Michel Gagnon and Ian Horrocks, who suggested I read "Introduction to Description Logic" to understand more of the concepts behind the semantic web and artificial intelligence.
As my first academic conference, the WWW 2016 conference was an excellent experience, bringing me in touch with paragons on the forefront of web research. I now have a much better understanding of where we are in the many aspects of the web and scholarly communications.
Even as we left the conference and said our goodbyes, I knew that many of us had been encouraged to create a more open, secure, available, and decentralized web.
Au revoir Montréal pic.twitter.com/nSIpC2UvUT— Shawn M. Jones (@shawnmjones) April 16, 2016