IAradio4

To create a script that scrapes audio files and metadata from BBC Radio 4 schedules and uploads them to the Internet Archive, you need to perform web scraping and then upload the collected data. Below is an example of how you can achieve this using Python:

Step 1: Set Up Required Libraries

Install the necessary libraries:

pip install requests beautifulsoup4 lxml

Step 2: Scrape Audio Files and Metadata

First, create a script to scrape the audio URLs and metadata from the BBC Radio 4 schedules.

import requests
from bs4 import BeautifulSoup
import json

# URL of the BBC Radio 4 schedule
schedule_url = 'https://www.bbc.co.uk/sounds/schedules/bbc_radio_fourfm'

def scrape_bbc_radio_schedule(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'lxml')

    listings = []

    # Scrape schedule details
    for item in soup.find_all('div', class_='programme'):
        title = item.find('h3').text.strip()
        time = item.find('span', class_='broadcast__time').text.strip()
        description = item.find('p', class_='programme__synopsis').text.strip() if item.find('p', class_='programme__synopsis') else 'No description available'
        audio_url = item.find('a', class_='programme__link')['href']
        
        listings.append({
            'title': title,
            'time': time,
            'description': description,
            'audio_url': audio_url
        })

    return listings

listings = scrape_bbc_radio_schedule(schedule_url)
print(json.dumps(listings, indent=4))

Step 3: Download Audio Files

Modify the script to download the audio files from the retrieved URLs.

import os

def download_audio(listings):
    for listing in listings:
        audio_response = requests.get(listing['audio_url'])
        filename = f"{listing['title']}.mp3"

        with open(filename, 'wb') as f:
            f.write(audio_response.content)
        
        listing['filename'] = filename

download_audio(listings)

Step 4: Upload to Internet Archive

Use the Internet Archive API to upload the audio files and metadata.

def upload_to_internet_archive(listing, username, password):
    metadata = {
        'collection': 'opensource_audio',
        'title': listing['title'],
        'description': listing['description'],
        'creator': 'BBC Radio 4',
        'subject': 'Radio',
        'licenseurl': 'http://creativecommons.org/licenses/by/4.0/'
    }

    session = requests.Session()
    session.auth = (username, password)

    files = {
        'file': open(listing['filename'], 'rb')
    }
    response = session.put(f'https://s3.us.archive.org/{listing["title"]}/{listing["filename"]}', files=files)
    print(response.status_code, response.text)

    metadata_url = f'https://archive.org/metadata/{listing["title"]}'
    response = session.post(metadata_url, data=metadata)
    print(response.status_code, response.text)

username = 'YOUR_ARCHIVE_USERNAME'
password = 'YOUR_ARCHIVE_PASSWORD'

for listing in listings:
    upload_to_internet_archive(listing, username, password)

Summary

This script scrapes audio files and metadata from BBC Radio 4 schedules, downloads the audio files, and uploads them to the Internet Archive. Replace the placeholders with your actual Internet Archive credentials. You may need to adjust the scraping logic based on the structure of the BBC website.

Referenences

API reference

FILES, FORMATS, AND DERIVATIVES – FILE DEFINITIONS

Example of radio page listings with transcript

Comments

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.