We're preparing our first broad crawl for 2023. For this purpose we're writing a Python program that automates the creation of new harvest passes from a short YAML config file containing values for maxBytes, maxObjects, maxSeconds and ordertemplate per harvest pass. For example:

auto:
  P1:
    comment: this is an automatically created harvest pass
    objects: 3
    bytes: 1000
    seconds: 3600
    autostart: true
    previous: false
    template:
      name: broad_harvest_type_1
      placeholder_namespace: KB.
      placeholders:
        MAX_OBJECT_SIZE_BYTES: 400000000
        EXTRACT_JAVASCRIPT: false
  P2:
    previous: true
    objects: ...
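
As a rough illustration of how such a config could be turned into harvest pass definitions, here is a minimal Python sketch. It is not our actual program: the `HarvestPass` class, the `build_passes` and `render_value` helpers, and the idea that `placeholder_namespace` is prepended to each placeholder key are all assumptions made for the example. To keep it runnable with the standard library only, the `CONFIG` dict stands in for what a YAML loader such as PyYAML's `yaml.safe_load` would return for the P1 entry.

```python
from dataclasses import dataclass, field
from typing import Any

# Stand-in for the parsed YAML config (P1 only); in practice this dict
# would come from a YAML loader rather than being written by hand.
CONFIG: dict[str, Any] = {
    "auto": {
        "P1": {
            "objects": 3,
            "bytes": 1000,
            "seconds": 3600,
            "autostart": True,
            "template": {
                "name": "broad_harvest_type_1",
                "placeholder_namespace": "KB.",
                "placeholders": {
                    "MAX_OBJECT_SIZE_BYTES": 400000000,
                    "EXTRACT_JAVASCRIPT": False,
                },
            },
        },
    }
}


@dataclass
class HarvestPass:
    """Hypothetical in-memory representation of one harvest pass."""
    name: str
    max_objects: int
    max_bytes: int
    max_seconds: int
    autostart: bool = False
    template_name: str = ""
    placeholders: dict[str, str] = field(default_factory=dict)


def render_value(value: Any) -> str:
    """Render a config value as text; booleans become lowercase true/false."""
    if isinstance(value, bool):
        return "true" if value else "false"
    return str(value)


def build_passes(config: dict[str, Any]) -> list[HarvestPass]:
    """Create one HarvestPass per entry under the top-level 'auto' key."""
    passes = []
    for name, spec in config["auto"].items():
        template = spec.get("template", {})
        namespace = template.get("placeholder_namespace", "")
        # Assumption: the namespace is prepended to every placeholder key.
        placeholders = {
            f"{namespace}{key}": render_value(value)
            for key, value in template.get("placeholders", {}).items()
        }
        passes.append(
            HarvestPass(
                name=name,
                max_objects=int(spec["objects"]),
                max_bytes=int(spec["bytes"]),
                max_seconds=int(spec["seconds"]),
                autostart=bool(spec.get("autostart", False)),
                template_name=template.get("name", ""),
                placeholders=placeholders,
            )
        )
    return passes


if __name__ == "__main__":
    for harvest_pass in build_passes(CONFIG):
        print(harvest_pass)
```

Keeping the limits as typed fields means a missing or non-numeric value fails early, when the config is read, rather than when a harvest pass is started.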

We have ended a number of older selective harvests that were started in connection with earlier general elections in Sweden, among them a couple of unsuccessful attempts to harvest Twitter, Facebook and Instagram.

We have added selective harvests for local authorities and regions and will soon add government agencies. These harvests are being introduced as part of our work with the e-legal collections, where our other methods of collecting material (RSS-based or OAI-PMH partial harvesting, FTP, web uploading) have been less successful.

Next meetings

  • April 11th
  • May 9th
  • June 6th
  • July 4th
  • September 5th
  • October 3rd
  • November 7th
  • December 5th
  • January 9th 2024
