RSS Harvests

From NAS 5.2 onwards, RSS feeds can be harvested using the Crawler RSS module (https://github.com/Landsbokasafn/crawlrss) developed by Kristinn Sigurðsson at the National Library of Iceland. To use the module, the feeds to be harvested must be configured in a special crawler-bean template, because at present the seeds of an RSS harvest cannot be defined directly through the NAS GUI. A sample template suitable for use with NAS can be downloaded from https://raw.githubusercontent.com/netarchivesuite/crawlrss/master/src/main/conf/jobs/CrawlRSS-Sample-Profile/netarkivet-crawlrss.dr.dk.cxml .

The template can be customised by replacing this section:

            <list>
                <bean class="is.landsbokasafn.crawler.rss.RssFeed">
                    <property name="uri" value="http://www.dr.dk/nyheder/service/feeds/indland" />  <!--RSS url -->
                    <property name="impliedPages">
                        <list>
                            <value>https://www.dr.dk/nyheder/</value>
                            <value>http://www.dr.dk/nyheder/allenyheder/indland</value> <!-- Landing Page -->
                        </list>
                    </property>
                </bean>
                <bean class="is.landsbokasafn.crawler.rss.RssFeed">
                    <property name="uri" value="http://www.dr.dk/nyheder/service/feeds/udland" />  <!--RSS url -->
                    <property name="impliedPages">
                        <list>
                            <value>http://www.dr.dk/nyheder/allenyheder/udland</value>
                        </list>
                    </property>
                </bean>
                <bean class="is.landsbokasafn.crawler.rss.RssFeed">
                    <property name="uri" value="http://www.dr.dk/nyheder/service/feeds/penge" /> <!--RSS url -->
                    <property name="impliedPages">
                        <list>
                            <value>http://www.dr.dk/nyheder/allenyheder/penge</value>
                        </list>
                    </property>
                </bean>
            </list>

with your own list of feeds to be harvested. Each RSS feed URI has an associated list of implied pages. These can be ordinary HTML landing pages associated with the feed. Harvesting them together with the RSS feed helps ensure a consistent browsing experience in the harvested data.
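
As an illustration only, a replacement list containing a single feed might look like the sketch below. The bean class and the "uri" and "impliedPages" property names are taken from the sample above; the example.org addresses are hypothetical placeholders and must be replaced with a real feed and its landing pages.

            <list>
                <!-- Hypothetical feed: replace the example.org URLs with your own -->
                <bean class="is.landsbokasafn.crawler.rss.RssFeed">
                    <property name="uri" value="http://www.example.org/feeds/news.rss" />  <!-- RSS URL (placeholder) -->
                    <property name="impliedPages">
                        <list>
                            <value>http://www.example.org/news/</value> <!-- Landing page (placeholder) -->
                        </list>
                    </property>
                </bean>
            </list>

Further feeds can be added by repeating the bean element inside the outer list, one bean per feed, as in the sample template.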

To use the RSS template, first define, for any domain, a configuration with an empty seed list. Strictly speaking, a seed list cannot be completely empty, but it can consist solely of a single comment character, "#". Then simply define a harvest configuration using the crawlrss template together with the empty seed list.