Tuesday, October 27, 2015

2015-10-21: Grace Hopper Celebration of Women in Computing (GHC) 2015

On October 13-17, the atmosphere at the George R. Brown Convention Center in Houston, Texas was electric, with 12,000 women in tech from around the world attending the Grace Hopper Celebration of Women in Computing (GHC), the world's largest gathering for women in computing. GHC is presented by the Anita Borg Institute (ABI) for Women and Technology, which was founded by Dr. Anita Borg and Dr. Telle Whitney in 1994 to bring together the research and career interests of women in computing and to encourage the participation of women in computing. GHC has grown remarkably, from 500 women in technology in 1994 to 12,000 this year.

I was humbled to receive a scholarship from the ABI to attend GHC 2015. I was also thrilled to have attended GHC twice before: GHC 2013 in Minnesota and GHC 2014 in Phoenix. This year, I represented the Computer Science department at Old Dominion University; the ArabWIC organization, as a member of the leadership committee and as a mentor in the academic mentoring sessions; and the ABI, for which I volunteered to blog and take notes at GHC. You can visit the Grace Hopper Celebration 2015 wiki page to read the session notes.

The conference was filled with an exciting lineup of inspiring speakers, panels, sessions, and workshops. There were multiple technical tracks: career, emerging tech, general sessions, open source, organizational transformation, and technology (e.g., data science, artificial intelligence, HCI, security, software engineering). Conference presenters represented many different fields, such as academia, industry, and government. The non-profit organization "Computing Research Association Committee on Women in Computing (CRA-W)" also offered sessions targeted towards academics and business. I had a chance to attend the Graduate Cohort Workshop in 2013, which was held in Boston, MA, and wrote a blog post about it.

The first day was kicked off by the amazing and inspiring Telle Whitney, the president and CEO of the ABI, welcoming the audience. Whitney gave the audience a piece of advice: "talk to almost anyone you pass by in the conference and introduce yourself. It is your time to learn, to join new communities, to reach out to people, and offer advice. It is our time to lead". She introduced the featured keynote speakers of the three days of the conference: Susan Wojcicki (the CEO of YouTube), Megan Smith (the first female CTO of the United States), Sheryl Sandberg (the COO of Facebook), Manuela M. Veloso (Professor in the Computer Science Department at Carnegie Mellon University), Clara Shih (CEO and Founder of Hearsay Social), and Hilary Mason (the Founder of Fast Forward Labs). At the end, Whitney introduced Alex Wolf, the President of the Association for Computing Machinery (ACM) and a professor in the Department of Computing at Imperial College London, UK, for opening remarks.

As the day progressed, the Open Source Day sessions and presentations were taking place. Open Source Day: Code-a-thon for Humanity gives women from around the world the chance to learn how to contribute to the open source community, regardless of their skill or experience level, by developing a variety of humanitarian projects. The Open Source Day 2015 page contains more details about the projects.

The Wednesday Keynote by Hilary Mason: "This is the best room in the world!!" is how Mason started her keynote, which was about machine intelligence research. Mason introduced herself as a data scientist, CEO, and software engineer, and followed up with "I look like all of those". She talked about the importance of data and noted that data products are everywhere. She gave examples of apps that use machine intelligence research: Foursquare, an app from a New York City company that collects data and uses it to recommend places to go around a user's current location, and Dark Sky, an app that predicts when it will rain or snow. Dark Sky was built on top of government weather data. It may not be interesting for a Californian, but it is interesting for the rest of us :-).

Mason talked about how she became passionate about data science. She defined data scientist as a professional role that combines multiple capabilities: math, statistics, the coding ability to build infrastructure, and communication and domain knowledge, everything they need to know to go talk to someone who has a problem. A data scientist works on analytical problems. She said technology is changing rapidly, and people's adoption of technology is growing faster. One of the interesting parts of her talk was about predicting the future. She said, "predicting the future is hard", then showed a picture of how people in the past imagined the future.

In the second part of the talk, she talked about her company, Fast Forward Labs, which started in 2014 to introduce a new method for applied research. They focus on innovation opportunities through data and algorithms. Fast Forward Labs sits in the middle of three communities: established companies, startups, and academic research. What makes a machine intelligence technology interesting?
  1. A theoretical breakthrough 
  2. A change of economics 
  3. A capability becomes a commodity (ex: Hadoop)
  4. a) New data is available (ex: Wikipedia) b) Data is made useful
Mason ended her talk by thanking everyone who helped her, then she gave the audience a piece of advice: "If you are at the beginning of your career and you are thinking of where you might end up, you need to know that my first GHC was in 2002, and I was a shy, quiet student who mostly sat in the back of every talk and was too shy to ask a question. But it is amazing to be in this room today with so many people who have affected my career".

At the end of the keynote, the 2015 Technical Leadership ABIE Award was given to Lydia E. Kavraki, the Noah Harding Professor of Computer Science and professor of Bioengineering at Rice University.

After the keynote, I attended the "CRA-W Early Career: The Tenure Process" session by Julia Hirschberg from Columbia University and Joan Francioni from Winona State University. The session covered the tenure process, i.e., research, teaching, service, expectations of the department, annual reviews, letter writers, and the typical process. The speakers gave advice and tips on understanding the requirements and expectations of your institution, such as: have an overall teaching plan and goals, and do not be too hard or too easy. They also gave tips regarding collaboration: a successful collaboration is a multiplier, letting you achieve more than you can on your own, while an unsuccessful collaboration can be a negative multiplier that wastes time, is stressful, and creates hard feelings.

The panel of Global Women Technical Leaders Program 
Next, I attended the "Global Women Technical Leaders Program: After the Grace Hopper Celebration: Building and Sustaining Community" panel in the career track. The panelists were Josephine Ndambuki of Safaricom Ltd, Rosario Robinson of the Anita Borg Institute, Alaa Fatayer of Jawwal, and Sana Odeh of NYU and ArabWIC, and the panel was moderated by Arezoo Miot of TechWomen. Sana introduced the panel and thanked ABI and the panelists for their support and for growing the women in tech communities. Rosario talked about her journey. She said she was the only woman in mathematics. The panel discussed the essentials of building a community to support women in technology. Alaa talked about her experience, from starting with TechWomen to building a community in Palestine. Some of the questions addressed were: How and where do you start in creating a community? What programs are out there to support technical women? How do you overcome obstacles in creating local communities? How can we develop allies in our communities? At the end, Rosario gave the following advice: "be clear about what you are and what you do".

A panel by directors from Apple in the scholars lunch 
The scholars lunch: Suzanne Mathew, an Assistant Professor of Computer Science at the United States Military Academy, introduced the panel of three amazing ladies from Apple. The panelists were Esther Hare, a Director in the Worldwide Developer Marketing team, and Maryam Najafi, a Director of UX; the panel was moderated by Karen Sipprell, the VP Marcom at Portal Software. The scholars lunch was sponsored by Apple. Scholars gathered for discussions at tables, and each table had one woman who has a role at Apple. There were 500 scholars at GHC15, selected from 2,000 applicants, as Professor Nancy Amato from Texas A&M University announced. The three senior ladies from Apple each shared many interesting experiences. Here is some advice from the panelists:
  • Be around as much as you can, the more you get around the more opportunities you will find. 
  • Find your passion, so you can solve problems. 
  • Go out and solve problems that freak you out. 
The lunch ended with the fun part, seven Apple watches for seven lucky women who found animal stickers on the bottom of their chairs!

After lunch I had to work on some stuff for the ABI blogging and social media activities. I also communicated with many amazing women during the conference.

The Wednesday Afternoon Plenary: We had three TED style talks on "Transforming the Culture of Tech" by Clara Shih, the CEO and Founder of Hearsay Social, Blake Irving, the CEO and Board Director of GoDaddy, and the amazing Megan Smith, the Chief Technology Officer (CTO) of the United States of America.

The afternoon plenary speakers
Clara Shih mentioned that she attended GHC for the first time in 2004, when there were 800 attendees. She told the audience about her journey over the past decade, from student to software engineer, then project manager, to being CEO and Founder of Hearsay Social. Shih shared with the audience the lessons she learned through her journey: 1) Listen carefully. 2) Be ok with being different. 3) Cherish relationships above all and help other women. 4) There is no failure, only learning. 5) The future is on us, because if not us, then who? If not now, when? "If people just sat back 11 years ago, GHC would not be 12,000 today!" Every time we decide to lift a woman up, we lift all women up.

Blake Irving talked about how he has worked to close the gender gap at the company since he took over as CEO two years ago, and mentioned the solid progress in the ratio of women at GoDaddy. Since last year's GHC, GoDaddy has more than doubled the number of women interns and graduate hires. Blake talked about pay equality and showed many graphs based on GoDaddy's data. "If you are a leader of a tech company, be vulnerable again and again. Do not hide your problems. Go public with your diversity statistics, publish your salary. Seek change from the top and bottom. Do the research, find your issues. Surround yourself with people that will challenge you," Blake said, "bad things live in the dark, bad things die in the light."

Megan Smith with the President's tech team showing the Declaration of Sentiments
"It is great to be back to my people!" is how the amazing Megan Smith started her talk. Before covering the highlights of Megan Smith's talk, I would like to highlight the amazing job she did during the conference encouraging and inspiring the attendees by talking to them herself. This lovely, inspiring woman passed by the community booths at the career fair and allowed people to talk to her personally and take pictures with her. She was also creative in showing some of the current federal tech projects and bringing many ladies in tech from the President's team. At the beginning, Smith talked about her new role as CTO of the USA, in which she serves as assistant to the President by advising him and his team on how technology policy, data, and innovation can advance our future as a nation. She described the people in the federal government as so passionate, mission driven, and extraordinary.

GHC archive that was found in the previous Thanksgiving 
Smith mentioned that they found GHC archives over the previous Thanksgiving. She talked about many projects they are working on, such as Innovation Nation, Active STEM Learning, and the Police Data Initiative. She described the President as "an incredible leader, so smart, so technical, science tech president, and he opens the doors for us to innovate". Smith introduced many amazing young ladies from the President's tech team, who talked about their different roles in serving the nation.

At the end, Smith talked about the Declaration of Sentiments, a document signed by 100 people in 1848 (68 women and 32 men) at the first women's rights convention to be organized by women, in Seneca Falls, New York. The document is missing, and they are looking for it with many archivists using the hashtag #FindTheSentiments.

There was a short discussion at the end with the three speakers about why change is hard and what strategies are working for them.

In the meantime, the career fair and the community fair were under way. At the career fair, many famous companies, such as Google, Thomson Reuters, Facebook, Microsoft, and IBM, were there to hire as many talented women in tech as they could. The community fair is a dedicated space within the Expo for attendees to interact with GHC communities, such as BlackWIC and ArabWIC. The ABI booth was at the center of the community fair, where I met the amazing Telle Whitney and talked to her many times. The career fair was the place for anyone who wanted to apply for job opportunities at all levels across industry and academia. Each company at the career fair had many representatives to discuss the different opportunities they have for women. A few men also attended the conference. The companies were very creative in advertising themselves.

Megan Smith at ArabWIC booth in the community fair 

The amazing Megan Smith passed by the ABI community booths and stopped by the ArabWIC booth. We had a great chance to talk to her personally and take a close look at the Declaration of Sentiments. She left us with encouragement and inspiration for leading communities and attracting more women to tech!

At the end of the first day, I attended the ArabWIC reception, which was sponsored by the Qatar Computing Research Institute (QCRI). We had many new Arab ladies in computing and non-Arab women as well. We exchanged our bios and how each one of us is contributing to serve the women in technology.
The Thursday Keynote had two speakers: Susan Wojcicki, the CEO of YouTube, and Hadi Partovi, the CEO and Cofounder of Code.org. "I’m feeling that I’m really the talking guy in the room," Hadi Partovi said at the beginning of his talk. He shared with the audience the personal story that changed his life: his dad brought home a computer that did not have any games on it, along with a book for Hadi to learn from so he could write his own games. He talked about Hour of Code, a non-profit bootstrapped project that started in 2013 to expand access to computer science in schools. Code.org has support from Democrats, Republicans, and many celebrities (e.g., President Obama, Bill Gates, Mark Zuckerberg). Code.org has trained 15,000 teachers to teach computer science this year, reaching 600K students (43% female).

The Hour of Code
Partovi insisted that his main goal for Code.org is not teaching kids how to code; it is teaching kids computer science. He claimed that CS education is recovering after many years of decline, but that there is still a problem in CS. He also mentioned that about 9 out of 10 parents want their children to learn CS. I have already started Code.org with my 7-year-old, and he was so excited to write his first code :-). Partovi claimed that the gender gap starts in K-12: "Almost 70% of the high school kids do not have access to the computer science field. When kids go to school, every kid learns about how electricity works or the basic math equations. In the 21st century, it is equally foundational to learn how algorithms work or how the internet works". Partovi continued, "the school system can evolve to teach kids the computer science field. Over 70 schools have embraced CS, including NY, Houston, Chicago, etc". Regarding diversity, Partovi asked if we can change the stereotype without changing the facts on the ground. He commented that the way to change the stereotype is the Hour of Code, which now has 300 partners from 196 countries and 150,000 teachers. At the end, Partovi asked the audience to help get more volunteers. To encourage people to get involved, Microsoft and Amazon will give away gift cards to any teacher who organizes an Hour of Code.

After Partovi's talk, the 2015 Grace Hopper Celebration Change Agent ABIE Award winners, Maria Celeste Medina from Argentina and Mai Abualkas Temraz from Palestine, were announced. The award winners gave short, inspiring talks about how they started and their journeys leading women in technology.
Susan Wojcicki described the conference as a lifeline where women come together, learn, feel supported, be computer scientists, and be ourselves. She started her speech with a story about her daughter, who told her she hated computers, although she had been going to Google since she was born. Susan talked about the serious impact of leaving girls out of the conversation when it comes to technology. "Girls think that technology is insular and anti-social. By 2020, jobs in computer science are expected to grow nearly two times faster than the national average, totaling nearly 5 million jobs. Technology is revolutionizing almost every part of our lives. Every car today has more computing technology than Apollo 11 that first landed on the moon. Yet, today women hold only 26% of all tech jobs. The fact that women represent a small portion of the tech work force is not just a wake up call, it is a 'Sputnik' moment. It risks future competitiveness," Susan said. "If women don't participate in tech, with its massive prominence in our lives and society, we risk losing many of the economic, political and social gains we have made over decades." Susan continued that female representation in tech is a problem and it is getting worse. The representation of women in tech was better in the 80s. Susan Wojcicki shared an exclusive teaser of the Codegirl movie, directed by Lesley Chilcott, the Oscar-winning film producer.

She talked about the balance between family and work. She had her baby 5 months after she joined Google. The constraints of family (for example, how tough it is for kids to be the last ones picked up from day care) pushed her to develop a work style that focuses on efficiency, productivity, and prioritization, and to do it all during office hours. She mentioned a Harvard study showing that employees who take breaks from work have a higher level of focus than those who do not. Furthermore, employees who feel encouraged by their bosses to take breaks are 100% more loyal to their employers.

Susan Wojcicki was the first person to take maternity leave at Google, and she is the only person to have taken five maternity leaves at Google. Interestingly, each leave enriched her life, left her with peace of mind, and gave her a chance to reflect on her career. Generous maternity leave increases retention. When women are given a short maternity leave and are under the pressure of having a call, they quit. When Google increased its paid family leave from 12 to 18 weeks, the rate at which mothers quit fell by 50%. 88% of women in the USA are not given family leave. Susan said, "men don't get asked how they balance it all". Susan's daughter now loves computer science. Susan enrolled her in a computer camp for girls, and afterward her daughter sketched a computer watch that holds her friends' contacts and info, before Samsung and Apple came up with their watches.

At the end, Susan insisted that we have to make it our personal responsibility to show the next generation of girls that they belong to the world of computer science.

Advice from Susan:
  • We need to give everyone a chance to understand computer science. 
  • Make computer science available to everyone in the USA by making it mandatory. 
  • Focus on working smart. Work smart, work hard. Do a great job, but then GO home.
  • Keep asking, look out for yourself, be an advocate and do not feel guilty about it! 
  • For tech companies, you need to help employees to find balance between work and family.
  • Tech companies need to pay generous maternity leave. 
  • A step back helps sometimes.
  • If you work for a company and you feel you cannot work a balanced day and the maternity leave is bad, I recommend that you leave and search for a supportive company and by the way, we are hiring! 

The Thursday Afternoon Plenary: The Thursday Afternoon Plenary was a conversation between Sheryl Sandberg, Facebook COO and author of the best-selling book Lean In, and Nora Denzel, Board Director of Ericsson, AMD, and Outerwall (makers of Redbox, Coinstar and ecoATM), about "What it means to be an effective leader and why it is so important to have women at the table to create technology". Sheryl shared her story about being a keynote speaker at GHC. The conversation covered gender diversity in technology and the pay gap. Sandberg encouraged the audience to negotiate for equal pay. She talked about the Lean In book and Lean In circles and how important mentoring is. She advised the audience to join Lean In circles. Sandberg said, "Starting a Lean In circle is a great leadership opportunity". To read more about the conversation, here is a nice article:
Sandberg: Tech offers the best jobs, needs more women voices, and women need to stick with it

I attended the "Change Agent and Social Impact Awards" session by the ABI award winners: Michal Segalov of Mind the Gap, Maria Celeste Medina of Ada IT, Daniel Raijman of Mind the Gap, and Mai Abualkas Temraz of Gaza Sky Geeks.

The moderator had a conversation with the ABI award winners to draw out their stories, asking about the challenges they faced, the turning points in their lives, and what continues to motivate them to make a difference.
Daniel said they started Mind The Gap 8 years ago to expose many girls to computer science. Mind The Gap has since expanded globally, with more than 10,000 participants to date.

Michal said that they cared the most about making Mind The Gap scalable. Mind The Gap lets people choose how to give or volunteer. For example, some people can teach tech classes, others can give talks, etc. They had about 100 volunteers, and each volunteer gives only one hour of their time per month, which makes it easy for people and encourages them to volunteer. Michal's advice was to be open to changing things, yourself, and your passion.

María mentioned that her mom encouraged and supported her the most. In one year, María has worked with the Programá Tu Futuro team and has introduced more than 6,000 people to coding: kids, teenagers, adults, and senior citizens (of which 30% are women). She said that there are also a lot of studies on how to empower women.

Mai from Gaza was talking over Skype because she could not attend for political reasons. Mai was asked for some fun facts, but she said that she was not in a good mood because she could not make it to the conference, which made it hard to mention fun facts. In 2014, she became a TechWomen Emerging Leader. She encouraged everyone to help and support them and to keep inviting them, so that maybe one day they will be able to attend. Mai said they face a lot of challenges in Gaza, but she likes to call them opportunities to learn and to become more powerful in solving the problems they face. At the end, Mai said, "I’m kept motivated by events like this where I’m exposed to the global women’s tech community. My goal here is to bring back as much of your energy as I can to Gaza. You can come mentor in Gaza." She mentioned many examples of people who went to Gaza before for mentoring: Angie Chang, the founder of Women 2.0, Dave McClure, the Founder of 500 Startups, and many others. "Don’t worry, it’s safe," Mai said, "or you can mentor women in Gaza remotely." Mai is a member of ArabWIC as well.

Speed mentoring sessions took place at the lunch tables on Thursday and Friday. I joined mentoring discussions around academic careers. It was useful to hear from many senior women in academia about their career journeys and also to hear some questions about applying for academic positions.

At the career fair, I was lucky to meet Sinead Borgersen, a Principal HR Business Partner at CA Technologies and Dr. Michele Weigle's friend. We had a quick discussion about careers at CA Technologies and how they would fit with my interests. Sinead is an amazing lady who is full of enthusiasm.

The Friday Keynote: Friday morning started with a cool technical keynote on "Robotics as a Part of Society" by Manuela Veloso, Herbert A. Simon University Professor, Computer Science Department, Carnegie Mellon University. Manuela has become well-known in the AI community for being the guiding force behind robot soccer. In her keynote, Manuela highlighted different perspectives on robots in a collaborative network of robots and humans. Manuela talked about CoBots, the robots she and her students created to help them with simple tasks in their offices and labs. These robots can use the internet or send emails to ask for help. She showed that autonomous robots learn from interacting with humans. "Technology is about diversity," Manuela said. "You don’t have to do everything, but some do things that others can’t."

At the end of the keynote, there were announcements about GHC 2016, which will take place in Houston, Texas. The general program co-chairs for GHC 2016 will be Kaoutar El Maghraoui, from IBM Research and ArabWIC, and Maria Gini, from the University of Minnesota. I spent most of my time on Friday at the career fair, then I attended the mentoring session at the ArabWIC lunch table and met many women in computing from different fields.

The Friday Afternoon Plenary: The day wrapped up with an afternoon plenary session focused on the importance of diversity in technology by Janet George, Chief Data Scientist for Big Data/Data Science and Cognitive Computing at SanDisk, Isis Anchalee, Platform Engineer at OneLogin, and Miral Kotb, Director, Producer, Choreographer and Playwright for iLuminate.

I couldn’t attend the afternoon plenary, but I heard from many friends about iLuminate, a wearable lighting system that enables novel dance performances, which the audience was treated to at the end of the conference. For more about the afternoon plenary, here are nice wrap ups for the three talks:
GHC 2015 ended with busting a move on the dance floor in a night to remember at Minute Maid Park. There were many photo booths, t-shirts, glow sticks, and desserts. It is a Grace Hopper Celebration, after all!

It was fascinating to be at GHC 2015, to hear from the most talented and inspiring women in technology and get advice from them. I also had the best time with many awesome ladies and came back with many friends who support each other. I was glad to be involved in many activities this year for the ABI community and ArabWIC.


Friday, October 23, 2015

2015-12-22: 60% of Web Annotations are Orphaned or in Danger of Being Orphaned

Figure 1. An Annotation is defined by OAC as a set of connected resources
In our TPDL paper, we studied 6281 highlighted text annotations (out of 7744 annotations) available in the Hypothes.is annotation system in January 2015. The main goal was to investigate the prevalence of orphaned annotations, where neither a live Web page nor an archived copy of the web page contains the text that had previously been annotated.

Recently, we applied the same analysis as in our TPDL paper to a larger number of annotations.  Figure 2 illustrates that the number of annotations in Hypothes.is has been increasing since July 2013. Our TPDL paper focused on the 7744 annotations available in January 2015.  Our updated paper (available at arXiv.org) analyzed the 20,133 highlighted text annotations (out of 33,946 total annotations) available in August 2015.  In this post, I will focus on reporting results of our arXiv paper.
Figure 2. January 2015 - dataset used in TPDL paper
August 2015 - dataset used in arXiv version  

Based on my experience in analyzing web annotations in Hypothes.is, I have seen annotations created just for the purpose of testing the system to see how it works (e.g. some annotations contain the tag "test" in Hypothes.is). Although some annotations can be considered as not beneficial, the majority of annotations are valuable to the community in different aspects. For example, 9 out of the 10 most annotated websites in Hypothes.is are related to education, academic research, or publishing.

The Hypothes.is annotation system offers free accounts allowing users to annotate the Web by, for example, creating tags/notes to highlighted text or to a web page as a whole. Hypothes.is supports collaborative work by letting users reply to each other's comments as shown in Figure 3.

Figure 3. Annotating the Web Using Hypothes.is Annotation System

It is known that web pages are not fixed resources, and they might be changed or become unavailable at any time. These changes in webpages can affect the associated annotations. Figure 4 shows the target URI http://climatefeedback.org/ as it appeared in December 2014. The highlighted text “Scientific feedback for Climate Change information online” in the webpage was annotated with “After reading about your project at MIT news, I visited your page and ...”. In August 2015, this annotation can no longer be attached to the target web page because the highlighted text no longer appears on the page, as shown in Figure 5. Although the live Web version of http://climatefeedback.org/ has changed and the annotation was in danger of being orphaned, the original version that was annotated has been archived and is available at the Internet Archive. The annotation could be re-attached to this archived resource, or memento.

Figure 4. http://climatefeedback.org/ in December 2014 
Figure 5. http://climatefeedback.org/ in August 2015
Because web pages are changing, the status of annotations is also affected. We can classify web annotations into 4 categories based on the attachment to their target live web pages and to mementos:
  • Safe - The annotation can be attached to the target live web page and also to at least one memento. 
  • In Danger - The annotation can be attached to the target live web page but it is not attached to any mementos. In this case, if the live web page is changed such that the associated annotations become unattached, then these annotations, unfortunately, would become orphaned.
  • Re-attached - The annotation is no longer attached to the live web page but, fortunately, it can be reattached to at least one memento from public web archives. 
  • Orphaned - The annotation is neither attached to the live web page nor any mementos.

Safe and re-attached annotations can be recovered with web archives, so they are in a better situation than the other two categories. We want to make annotations that belong to the second category (in danger) safe or re-attached by archiving their target web pages. Unfortunately, we can do nothing about annotations that belong to the orphaned category. They are lost.
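
The four categories above reduce to two yes/no questions: does the annotation attach to the live page, and does it attach to at least one memento? Here is a minimal sketch of that decision table. The names `Status` and `classify` are hypothetical, chosen for illustration, and are not taken from the code used in the paper.

```python
from enum import Enum


class Status(Enum):
    SAFE = "safe"
    IN_DANGER = "in danger"
    REATTACHED = "re-attached"
    ORPHANED = "orphaned"


def classify(attached_to_live: bool, attached_to_memento: bool) -> Status:
    """Map the two attachment checks onto the four categories.

    attached_to_live: the highlighted text is still found on the live page.
    attached_to_memento: the text is found in at least one archived copy.
    """
    if attached_to_live and attached_to_memento:
        return Status.SAFE
    if attached_to_live:
        return Status.IN_DANGER
    if attached_to_memento:
        return Status.REATTACHED
    return Status.ORPHANED


# The climatefeedback.org example: text gone from the live page,
# but still present in an archived copy.
print(classify(attached_to_live=False, attached_to_memento=True).value)
```

How the two booleans are computed (fetching the page and searching for the highlighted text) is outside the scope of this sketch.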

We used the LANL Memento Aggregator to look for archived copies of web pages (mementos) in the public archives. To be more specific, we were looking for the closest mementos to annotations' creation date. In the example shown in Figure 4, we would need to find the closest mementos captured immediately before and after the annotation creation date (e.g., December 3, 2014 at 12:47 AM for the web page http://climatefeedback.org).

Figure 6(a) shows an example where mementos are available before and after the annotation creation date. In this example, only M1 and M3 will be tested to see if the associated annotations can be re-attached to these mementos. Figure 6(b) shows mementos that are only available before the annotation creation date while Figure 6(c) shows mementos that are only available after the annotation date. Finally, Figure 6(d) shows annotations that have no existing mementos for their target web pages in the web archives.
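
The selection step can be sketched in Python as follows. This is an illustration under stated assumptions, not the procedure's actual code: it does not query the LANL Memento Aggregator, it assumes the capture datetimes for a target page have already been retrieved into a list, and it treats a capture taken exactly at the creation time as "before". The function name `closest_mementos` is hypothetical.

```python
from bisect import bisect_right
from datetime import datetime
from typing import List, Optional, Tuple


def closest_mementos(
    memento_datetimes: List[datetime], created: datetime
) -> Tuple[Optional[datetime], Optional[datetime]]:
    """Return (closest capture at or before `created`, closest capture after it).

    Either element is None when no capture exists on that side.
    """
    captures = sorted(memento_datetimes)
    # Index of the first capture strictly after the creation date.
    split = bisect_right(captures, created)
    before = captures[split - 1] if split > 0 else None
    after = captures[split] if split < len(captures) else None
    return before, after


# Annotation creation date from the example; the capture dates are made up.
created = datetime(2014, 12, 3, 0, 47)
captures = [datetime(2014, 11, 1), datetime(2014, 12, 20), datetime(2015, 1, 5)]
before, after = closest_mementos(captures, created)
```

The four cases in Figure 6 correspond to the four possible shapes of the result: both values set (a), only `before` set (b), only `after` set (c), and both `None` (d).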

Figure 6. Discovering Mementos for Annotations' Target Web Pages
After discovering the mementos closest to each annotation's creation date and checking whether annotations are still attached to their live web pages and to mementos, we reached the conclusion illustrated in Figure 7. It shows that 19% of annotations are orphaned while 41% are in danger of being orphaned. The remaining 40% of annotations are in an acceptable situation: 37% are considered safe, while only 3% can be re-attached using archives. The results also indicate that if mementos are available for an annotation's target web page, there is a high chance that the annotation can be re-attached. In addition, a copy of the same memento can be available in different web archives.

Figure 7. The Status of Current Hypothes.is Annotations

As we can see, with 60% of annotations orphaned or in danger of being orphaned, archiving web pages at the time of annotation is important for avoiding orphaned annotations.
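
As a quick check of the arithmetic, the shares reported in Figure 7 can be grouped into the two headline numbers. The snippet below uses only the percentages stated above; the variable names are illustrative.

```python
# Share of annotations per category, in percent, as reported in Figure 7.
shares = {"safe": 37, "re-attached": 3, "in danger": 41, "orphaned": 19}

# Annotations that web archives can currently recover.
recoverable = shares["safe"] + shares["re-attached"]
# Annotations already lost or at risk of being lost.
at_risk = shares["in danger"] + shares["orphaned"]

# The four categories cover every annotation in the sample.
assert sum(shares.values()) == 100

print(f"recoverable via archives: {recoverable}%")
print(f"orphaned or in danger: {at_risk}%")
```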

-- Mohamed Aturban

Wednesday, October 7, 2015

2015-10-07: IMLS and NSF fund web archive research for WS-DL

In the spring and summer of 2015, the Web Science and Digital Libraries (WS-DL) group received a total of $950k in funding from the IMLS and the NSF to study various aspects of web archiving.  Although these awards were previously announced on Twitter (IMLS: 2015-03-31 & NSF: 2015-08-25), here we provide greater context for how they support our vision for the future of web archiving*.

Our IMLS proposal is titled "Combining Social Media Storytelling With Web Archives" and a PDF of the full proposal is available directly from the IMLS.  This proposal is joint with our partners at Archive-It and is informed by our experiences in several areas, such as:
Our most illuminating insight (somewhat obvious in retrospect) is to not try to include all of the collection's holdings in its summarization, but to only surface the exemplary components sufficient to distinguish one collection from the next.  One example we frequently use is "how do we distinguish the many `human rights' collections available in Archive-It?"  They all have different perspectives, but they can be difficult to navigate for those without detailed knowledge of the seed URIs and the collection development policy. 

The IMLS proposal will investigate two main thrusts:
  1. Selecting a small number (e.g., 20) of exemplary pages from a collection (often 100s of archived copies of 1000s of web pages) and loading them in an existing tool such as Storify as a summarization interface (instead of custom & unfamiliar interfaces).  Yasmin AlNoamany has some exciting preliminary work in this area; for example see her TPDL 2015 paper examining what makes a "good" story on Storify, and her presentation "Using Web Archives to Enrich the Live Web Experience Through Storytelling".
  2. Using existing stories to generate seed URIs for collections.  One problem for human-generated web archive collections is that they depend on the domain knowledge of curators.   For example, the image above shows two Storify stories about early riots in Kiev (aka Kyiv) which predated much of the exposure in Western media and then the subsequent escalation of the crisis.  The collection at Archive-It was not begun until the annexation of the Crimea was imminent, possibly missing the URIs that document the early stages of this developing story.  Our idea is to mine social media, especially stories, for semi-automated, early creation of web archive collections. 
The NSF proposal is titled "Increasing the Value of Existing Web Archives" and represents a shift in how we think about web archiving.  One point we've made for a while now (for example, see our 2014 presentation "Assessing the Quality of Web Archives") is that we must shift our current focus from simply piling up bits in the archive to more nuanced questions: how to make the archives more immediately useful (as opposed to just insurance for future loss) and how to assess & meaningfully convey the quality of the archived page.  This proposal will have three main research thrusts:
  1. Inspired by Martin Klein's PhD research and Hugo Huurdeman et al.'s "Finding Pages on the Unarchived Web" from JCDL 2014, we would like to see archives provide recommendations of related pages in the archive, as well as suggested "replacements" for pages that are not archived.  Web archives now just return a "yes" (200) or "no" (404) when you query for a URI -- they should be able to provide more detailed answers based on their holdings.
  2. We'd like to further investigate the various issues of how well a page is archived.  We have some preliminary work from Justin Brunelle for automatically assessing the impact of missing embedded resources (typically stylesheets and images), as well as from Scott Ainsworth on detecting temporal violations -- combinations of HTML and images that never occurred on the live web (see "Only One Out of Five Archived Web Pages Existed as Presented" from HT 2015).  
  3. Related to #2, we need to find a better way to visualize the temporal & archival makeup of replayed pages.  For example, the LANL Time Travel service does a nice job of showing the various archives that contribute resources to a reconstruction, but questions remain about scale as well as describing temporal violations and their likely semantic impact.  Similarly, we'd like to investigate how to convey the request environment that generated the representation you're viewing now (see our 2013 D-Lib paper "A Method for Identifying Personalized Representations in Web Archives" for preliminary ideas on linking various geoip, mobile vs. desktop, and other related representations). 
We have been very fortunate with respect to funding in 2015 and we look forward to continued progress on the research thrusts outlined above.  We'd like to thank everyone who made these awards possible.  We welcome any feedback on, or interest in, these (and other) projects as we progress.  Watch this blog and @WebSciDL for continued research updates.


* = See also our 2014 award for $324k from the NEH for the study of personal web archiving and our 2014 award for $49k from the IIPC for profiling web archives for a more complete picture of our research vision for web archives.