2018-10-09 Statusmeeting
Agenda for the joint NetarchiveSuite tele-conference 2018-10-09, 13:00-14:00.
Participants
- BNF: Sara, Géraldine
- ONB: Andreas
- KB/DK - Copenhagen: Tue, Stephen, Anders
- KB/DK - Aarhus: Colin
- BNE: Mar
- KB/Sweden: Bengt
Update on NAS latest tests and developments
Preparation of 2019 NAS workshop
Status of the production sites
Netarkivet
Our main focus is on the following:
- Solving problems with the broad crawl: We started step 1 of our third broad crawl for 2018 on 25 August. We had lots of hanging jobs. The developers helped us solve part of the problem, but some jobs are still hanging.
- Testing Wayback access with different browsers: we found that browsers differ in how much content they replay from the Wayback archive. We tested IE, Chrome and Edge. The result is that Edge replays content (e.g. images) best.
- Event harvest with BCWeb: we finished the event crawl for the official commemoration day for Danish soldiers who had been deployed in war or conflict zones. Only some documentation remains to be done. We did this crawl in collaboration with archivists from the National Archive. The fact that the hard-coded schedules do not match the figures in the configurations on NAS domain pages caused some confusion. This was our first collection collaboration using BCWeb (besides the pilot project).
- We have implemented Umbra in our test environment and are looking forward to the results.
- Special crawl for the manhunt by Danish police: the police closed off almost every part of Denmark on 28 September (no ferries, no flights abroad). We primarily crawled foreign articles on the event.
- Kim Larsen, a Danish rock musician known by almost every Dane, died last Saturday, 29 September. We added a hashtag to our Twitter crawls; other content on his death is captured by our selective news media crawls.
BnF
- After the two workshops on crawling YouTube (covered in our June update), we were able to launch a production crawl in July using the process previously outlined. This first crawl lasted 20 days. The curators selected 42 channels and we crawled all the videos from these channels: 28 063 videos, with the exception of 10 videos that had been removed and one video excluded because of our filters. The crawl represents 1.8 TB and more than 3 000 hours of video. A second crawl is planned for November.
- We have also finished work on giving access to these videos, as well as those crawled during the elections last year. To replay videos within YouTube pages, we built on the system already used for Dailymotion. A specific rule is applied to pages for which videos have been collected, allowing us to replace the YouTube player with another called FLV Player, which is present in our archives. We use the metadata collected during the crawl to establish the link between the web page and the correct video file. As the page listing all the videos on a channel is not fully collected by Heritrix, we created pages within our access application with the full list of videos collected for each channel, and inserted a button within the YouTube page to link to this list. Finally, we created a "guided tour", similar to that which already exists for news sites, with a list of all the YouTube channels collected. This is also based on the metadata, with additional description added by curators.
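The link between an archived YouTube page and its video file, and the per-channel list pages, can be illustrated with a minimal sketch. It is not BnF's implementation: the record fields (page_url, channel, video_file), the sample values and the function names are all hypothetical, chosen only to show a lookup driven by crawl metadata.

```python
from typing import Dict, List, Optional

# Hypothetical crawl metadata: one record per collected video.
# Field names and values are assumptions made for illustration only.
CRAWL_METADATA: List[Dict[str, str]] = [
    {"page_url": "https://www.youtube.com/watch?v=EXAMPLE01",
     "channel": "example-channel-a", "video_file": "EXAMPLE01.mp4"},
    {"page_url": "https://www.youtube.com/watch?v=EXAMPLE02",
     "channel": "example-channel-a", "video_file": "EXAMPLE02.mp4"},
    {"page_url": "https://www.youtube.com/watch?v=EXAMPLE03",
     "channel": "example-channel-b", "video_file": "EXAMPLE03.mp4"},
]


def build_page_index(records: List[Dict[str, str]]) -> Dict[str, Dict[str, str]]:
    """Index the metadata records by the URL of the archived web page."""
    return {record["page_url"]: record for record in records}


def video_for_page(index: Dict[str, Dict[str, str]], page_url: str) -> Optional[str]:
    """Return the video file to load in the replacement player, or None
    when no video was collected for this page."""
    record = index.get(page_url)
    return record["video_file"] if record else None


def videos_for_channel(records: List[Dict[str, str]], channel: str) -> List[str]:
    """List every collected video of a channel, as a generated list page would."""
    return sorted(r["video_file"] for r in records if r["channel"] == channel)


index = build_page_index(CRAWL_METADATA)
print(video_for_page(index, "https://www.youtube.com/watch?v=EXAMPLE02"))
print(videos_for_channel(CRAWL_METADATA, "example-channel-a"))
```

In this sketch a page with no matching record returns None, which would correspond to leaving the page without the replacement player.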
ONB
- We are still running our domain crawl for this year. We are in the middle of the first stage (10 MB).
- We are also running test crawls of new seeds for the next run of our half-yearly Woman/Gender Crawl, which starts next month.
BNE
KB-Sweden
Next meetings
- November 6th
- December 4th
- January 8th 2019