Judging by the Dataset Viewer's GitHub code, it does not appear to actively convert audio for preview purposes, so adding a lightweight preview version of the audio data would be a good idea.
For example, with image datasets, you often see a setup where high-quality main data is accompanied by a JPEG for previewing. I think the same approach could work for audio datasets as well…
Just in case, pinging @lhoestq.
Audio sample rates supported by the Hugging Face Dataset Viewer?
Short answer
I do not see a documented Hugging Face Dataset Viewer whitelist that says something like:
“The Dataset Preview audio player supports only 8 kHz, 16 kHz, 44.1 kHz, 48 kHz, etc.”
The public Hugging Face docs say that audio datasets can get a Dataset Viewer when they use a supported repository structure and supported file formats. The Hugging Face Hub audio dataset guide explicitly lists AIFF, FLAC, MP3, OGG, and WAV as supported audio formats.
However, that is format-level support, not a guarantee that every possible WAV profile will play in the browser preview player.
For my dataset, the safest interpretation is:
65,536 Hz / 24-bit WAV is valid research audio, but it is not a safe browser-preview target. The sample_rate / sampling_rate metadata is useful for programmatic loading and resampling, but it should not be treated as a setting that forces the Hub Dataset Preview player to transcode or accept the original uploaded WAV files.
So, yes: a separate browser-preview copy at 44.1 kHz / 16-bit or 48 kHz / 16-bit is the practical workaround.
I would keep the original 65,536 Hz / 24-bit WAV files as the canonical corpus, and add a clearly labeled preview config or preview split such as browser_preview_48k16.
Why this is not just “WAV supported or unsupported”
There are several layers involved:
| Layer | What it answers | Relevance here |
|---|---|---|
| Hub repository storage | Can Hugging Face host the files? | Yes. The files can be stored on the Hub. |
| Dataset structure recognition | Can Hugging Face infer splits, metadata, and audio columns? | Mostly yes, if the repository follows the expected audio dataset layout. |
| Dataset Viewer backend | Can the Viewer generate precomputed rows, Parquet exports, and preview assets? | This can fail independently of ordinary file hosting. |
| Browser audio playback | Can the user's browser play the served audio file? | This is the risky part for 65,536 Hz / 24-bit WAV. |
| Python datasets loading | Can users load and resample the audio in code? | Usually a separate and more controllable path. |
The Dataset Viewer backend docs describe the Viewer as a backend/API layer that precomputes data and auto-converts Hub datasets to Parquet. The Dataset Viewer Quickstart also exposes separate endpoints for checking validity, splits, first rows, row slices, Parquet files, size, and statistics.
That matters because a dataset can be valid and loadable while the preview UI still fails.
In other words:
A working Hugging Face dataset does not necessarily imply a working browser audio player for every source file.
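If you want to probe those layers programmatically rather than through the UI, a minimal sketch against the documented datasets-server endpoints (the dataset name is this corpus; the rest is plain `requests` usage) could look like this:

```python
import requests

BASE = "/static-proxy?url=https%3A%2F%2Fdatasets-server.huggingface.co"
DATASET = "cjweaver/ARU_speech_corpus"

# Each endpoint answers a different layer from the table above:
# is-valid -> overall viewer validity, splits -> structure recognition,
# parquet -> auto-converted Parquet availability.
for endpoint in ("is-valid", "splits", "parquet"):
    resp = requests.get(f"{BASE}/{endpoint}", params={"dataset": DATASET}, timeout=30)
    print(endpoint, resp.status_code)
    print(resp.json())
```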
What is special about this dataset?
The dataset page for cjweaver/ARU_speech_corpus is already recognized by Hugging Face as an audio/text dataset in soundfolder format. It shows a default subset with 8.64k rows, train/validation/test splits, and auto-converted Parquet. It also currently shows the Dataset Viewer failing for the selected split with UnexpectedApiError.
That suggests the issue is probably not basic dataset discovery. Hugging Face is not simply ignoring the repository. It sees it as an audio dataset.
The dataset card says the corpus contains:
- 8,640 utterances
- 12 native British English speakers
- 720 IEEE sentences per speaker
- single-channel recordings
- controlled anechoic recording conditions
- 65,536 Hz sampling rate
- 24-bit depth
The card also notes that the high sample rate makes the corpus useful for wideband and super-wideband speech processing research, while also recommending downsampling for many applications that do not need that bandwidth.
So I would not describe the source audio as “wrong.” It is a legitimate archival/research format. It is just not a conservative browser-preview format.
What the Dataset Viewer backend code suggests
The public huggingface/dataset-viewer repository clarifies the backend path, but with one important limitation: the repository README says it is the backend that provides the Dataset Viewer with precomputed data through an API, and that the frontend viewer component is not part of that repository.
So the repo can clarify things like:
- audio row post-processing;
- audio asset generation;
- supported backend audio extensions;
- whether there is an obvious sample-rate whitelist;
- whether .wav is treated specially.
It cannot fully reveal:
- the closed Hub frontend audio component;
- browser-specific decoding behavior;
- how every browser handles an unusual WAV profile.
Still, the backend code is very useful.
1. Backend support appears extension/MIME-based, not sample-rate-whitelist-based
In asset.py, the backend defines supported audio extensions and MIME types like this:
SUPPORTED_AUDIO_EXTENSION_TO_MEDIA_TYPE = {
    ".wav": "audio/wav",
    ".mp3": "audio/mpeg",
    ".opus": "audio/ogg",
    ".flac": "audio/x-flac",
}
I do not see a visible backend-side list like:
SUPPORTED_SAMPLE_RATES = [8000, 16000, 44100, 48000]
or:
MAX_AUDIO_SAMPLE_RATE = 48000
The visible backend check is about extension and media type, not “this WAV must be 16 kHz / 44.1 kHz / 48 kHz.”
That supports this interpretation:
Hugging Face’s backend likely recognizes .wav as a supported audio extension, but that does not mean every WAV profile is browser-playable.
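As a purely conceptual illustration of that point (not the viewer's actual code), an extension-based lookup like the one quoted above ignores sample rate and bit depth entirely:

```python
from pathlib import Path

# Mirrors the mapping quoted above; the real backend may differ in detail.
SUPPORTED_AUDIO_EXTENSION_TO_MEDIA_TYPE = {
    ".wav": "audio/wav",
    ".mp3": "audio/mpeg",
    ".opus": "audio/ogg",
    ".flac": "audio/x-flac",
}

def media_type_for(path: str):
    # Only the file extension is consulted; sample rate and bit depth
    # never enter this decision.
    return SUPPORTED_AUDIO_EXTENSION_TO_MEDIA_TYPE.get(Path(path).suffix.lower())

print(media_type_for("ID01_list01_sent01.wav"))  # "audio/wav", even for 65,536 Hz / 24-bit
```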
2. Existing .wav files may be passed through unchanged
In features.py, the audio handling code checks whether the audio value has a path, whether there are no embedded bytes, whether the file extension is supported, and whether the path starts with hf://datasets/....
If so, it returns an AudioSource using the resolved Hub URL and the MIME type for that extension.
Conceptually, for this dataset, that can look like:
65,536 Hz / 24-bit .wav in dataset repo
↓
Dataset Viewer backend sees ".wav"
↓
".wav" is a supported audio extension
↓
backend returns URL + type "audio/wav"
↓
Hub frontend / browser tries to play the original WAV
That is the key point.
If the backend is returning a URL to the original .wav, then adding a sample_rate metadata field will not necessarily change what the browser receives.
3. Conversion exists, but likely not for already-supported .wav files
The backend has a create_audio_file path in asset.py.
The relevant behavior is roughly:
- if the source extension matches the target extension, it writes the bytes directly;
- if the source extension differs, it converts using pydub.AudioSegment.from_file(...) and exports to the target format;
- the code comment notes that this conversion may spawn FFmpeg.
That means conversion exists, but the subtle issue is:
If the source file is already .wav, the backend may treat it as supported and may not normalize it to 44.1/16 or 48/16.
So the fact that the files are WAV may cause the backend to pass them through, rather than convert them to a browser-safe derivative.
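For reference, the kind of conversion the backend comment describes (pydub decoding, possibly spawning FFmpeg) looks roughly like this. It is a sketch of the general mechanism, not the viewer's actual conversion code, and the 48 kHz / 16-bit target is my choice here, not something the backend enforces:

```python
from pydub import AudioSegment

# Decode the source (pydub may call out to FFmpeg for this),
# then normalize to a conservative browser target.
segment = AudioSegment.from_file("input.wav")
segment = (
    segment.set_channels(1)         # mono
           .set_frame_rate(48_000)  # 48 kHz
           .set_sample_width(2)     # 16-bit samples
)
segment.export("output_preview_48k16.wav", format="wav")
```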
4. sampling_rate exists, but not as a preview-player target setting
The backend also has an AudioDecoder path in features.py. In that path, it extracts an audio array and _sampling_rate, then writes a WAV using soundfile.write(..., _sampling_rate, format="wav").
That is not the same thing as:
“Please make all preview audio 44.1 kHz or 48 kHz.”
It simply means that when the backend has decoded audio-array data, it can write a WAV using the sampling rate associated with that decoded object.
This matches the Hugging Face datasets audio loading docs, which describe audio decoding and access through the Audio feature.
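As a contrast, this is roughly what "write a WAV at the decoded sampling rate" looks like with soundfile; the array and rate below are made-up placeholders, and nothing in this path forces a browser-safe 44.1/48 kHz target:

```python
import numpy as np
import soundfile as sf

# Hypothetical decoded example: the array keeps whatever rate it was decoded at.
sampling_rate = 65_536
audio_array = np.zeros(sampling_rate, dtype=np.float32)  # one second of silence

# The WAV is written at the decoded rate, not resampled to 44.1/48 kHz.
sf.write("row_preview.wav", audio_array, sampling_rate, format="WAV")
```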
What sample_rate / sampling_rate metadata actually does
This is the point most likely to cause confusion.
The datasets audio processing docs show that an audio column can be resampled programmatically with:
from datasets import load_dataset, Audio
dataset = load_dataset("PolyAI/minds14", "en-US", split="train")
dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))
The docs say audio files are decoded and resampled on the fly the next time an example is accessed.
The Hugging Face Audio Course preprocessing page makes the same point: cast_column(..., Audio(sampling_rate=...)) does not change the audio in place; it tells datasets to resample examples on the fly when they are loaded.
So:
| Mechanism | What it does |
|---|---|
| Audio(sampling_rate=16000) | Resamples decoded examples when accessed through datasets |
| dataset_info / feature metadata | Describes the dataset schema/features |
| Uploaded .wav file | Remains the actual stored file |
| Dataset Preview player | Receives an audio URL/source and depends on backend/frontend/browser handling |
In plain language:
sampling_rate is useful for Python users. It is not a guaranteed setting for forcing the Hub web player to transcode the uploaded WAV before playback.
This is valid for Python users:
from datasets import load_dataset, Audio

# Load the canonical corpus, then ask datasets to resample on access.
ds = load_dataset("cjweaver/ARU_speech_corpus", split="train")
ds_16k = ds.cast_column("audio", Audio(sampling_rate=16_000))
example = ds_16k[0]["audio"]  # decoded and resampled to 16 kHz on the fly
But I would not assume that this happens automatically in the Hub UI:
uploaded 65,536 Hz / 24-bit WAV
↓
Hub silently creates browser-safe 44.1 kHz / 16-bit WAV
↓
Dataset Preview player uses the converted file
That would be a preview transcoding feature. I do not see documentation that says the Dataset Preview player does that for this kind of WAV edge case.
Browser playback is its own compatibility layer
The final playback layer is the browser.
MDN’s <audio> element documentation describes browser audio playback as source-based: the browser receives one or more sources and attempts to play a suitable one.
MDN’s audio codec guide frames web audio as a codec/container compatibility problem, not merely “the file has an audio extension.”
That matters because this file is not just “a WAV.” It is:
- a WAV container
- PCM audio
- 65,536 Hz sample rate
- 24-bit depth
- single channel
A browser may handle conventional 44.1 kHz or 48 kHz PCM WAV more reliably than an unusual 65,536 Hz / 24-bit WAV. The Dataset Viewer backend may consider .wav supported, but the browser still has to decode the actual stream.
My diagnosis for this exact case
I would rank the likely explanations like this.
Most likely
The Dataset Viewer backend recognizes the dataset and treats .wav as supported, but the original 65,536 Hz / 24-bit WAV reaches the preview/player path unchanged. The frontend/browser or the viewer asset path then fails on this unusual WAV profile.
This explanation fits:
- the dataset being recognized as soundfolder;
- the presence of Parquet conversion;
- the backend code supporting .wav by extension/MIME type;
- the absence of an obvious backend sample-rate whitelist;
- the unusual nature of 65,536 Hz / 24-bit audio for browser playback.
Also possible
One or more individual WAV files may be malformed, inconsistent, or difficult to decode. A single problematic file can sometimes break first-row generation or row post-processing.
Also possible
There may be a metadata/path/layout issue, especially if metadata.csv paths do not exactly match what AudioFolder expects.
The Hugging Face audio dataset docs say metadata.csv must contain a file_name column that links audio files to metadata, and relative paths must be full relative paths when files are not next to the metadata file.
Less likely
A hidden documented rule such as:
Dataset Viewer WAV sample_rate must be <= 48,000 Hz
I have not found such a public rule in the docs or in the visible backend code.
Should the original files be replaced?
No.
I would not replace the canonical corpus with downsampled 44.1/16 or 48/16 files.
The original format is part of the dataset’s value. The dataset card describes anechoic recordings, professional measurement equipment, 65,536 Hz sampling, 24-bit depth, careful filtering, active-speech-level normalization, and use cases such as speech intelligibility, signal processing, ASR, speech quality assessment, and acoustic research.
That original format is not a mistake.
Instead, I would treat this as two parallel purposes:
| Purpose | Best format |
|---|---|
| Canonical research corpus | 65,536 Hz / 24-bit WAV |
| Programmatic ASR/model use | Load original and resample with Audio(sampling_rate=...) |
| Hub browser preview | 44.1 kHz or 48 kHz / 16-bit PCM WAV |
| Human quick inspection | Browser-preview config |
This gives different users the right version without weakening the original dataset.
Should the workaround be a samples subdirectory?
The idea is right, but I would avoid the name samples.
A folder named samples is ambiguous. It can mean:
- example files;
- a reduced training subset;
- a split;
- a second config;
- a browser preview;
- or a separate derivative dataset.
Use a name that encodes the purpose and format:
browser_preview_48k16/
or:
browser_preview_44k16/
I would use:
browser_preview_48k16/
because 48 kHz / 16-bit PCM WAV is a conservative media/web-style preview target. The proposed 44.1 kHz / 16-bit version is also reasonable.
44.1 kHz / 16-bit vs 48 kHz / 16-bit
Both are reasonable.
| Preview format | My view |
|---|---|
| 48 kHz / 16-bit PCM WAV | My preferred browser-preview choice. Common in media/video/web workflows; comfortably preserves speech-band content. |
| 44.1 kHz / 16-bit PCM WAV | Also good. Familiar "CD-quality" convention; likely much safer than 65,536/24. |
| 16 kHz / 16-bit WAV | Good for many ASR workflows, but too model-specific as the only general preview. |
| MP3 | Very browser-friendly and small, but lossy and less appropriate as a research-facing preview. |
| FLAC | Lossless and compact, but I would test browser/viewer behavior before relying on it. |
Because the source card says the signal was low-pass filtered above 9 kHz, both 44.1 kHz and 48 kHz are more than enough for listening preview. The main goal is not preserving the original acquisition rate in the preview; the main goal is providing a stable browser-playable representation while leaving the original intact.
Recommended repository structure
A clear structure would be:
README.md
train/
    metadata.csv
    *.wav              # canonical 65,536 Hz / 24-bit
validation/
    metadata.csv
    *.wav              # canonical 65,536 Hz / 24-bit
test/
    metadata.csv
    *.wav              # canonical 65,536 Hz / 24-bit
browser_preview_48k16/
    metadata.csv
    *.wav              # selected 48 kHz / 16-bit browser-preview files
Then document the preview directory clearly.
A conceptual YAML configuration could look like:
---
configs:
  - config_name: default
    data_files:
      - split: train
        path: train/**
      - split: validation
        path: validation/**
      - split: test
        path: test/**
  - config_name: browser_preview_48k16
    data_files:
      - split: preview
        path: browser_preview_48k16/**
---
I would test this on a small branch or a tiny duplicate dataset first. The exact YAML may need adjustment depending on how AudioFolder infers the layout, but the design principle is strong:
Separate canonical source audio from browser-preview derivative audio.
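Once a test branch or duplicate repo carries that YAML, a quick sanity check that the proposed config resolves could look like this (the config and split names are the proposed ones, so this only works after they exist):

```python
from datasets import load_dataset

# Assumes the browser_preview_48k16 config and preview split proposed above.
preview = load_dataset(
    "cjweaver/ARU_speech_corpus",
    "browser_preview_48k16",
    split="preview",
)
print(preview)
print(preview[0]["audio"]["sampling_rate"])  # expect 48000 for the preview files
```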
Recommended preview metadata
For preview files, I would include metadata that links each preview file back to the canonical source file.
Example:
file_name,source_file_name,speaker_id,sex,age,accent,list_number,sentence_number,text,preview_sample_rate,preview_bit_depth,preview_purpose
ID01_list01_sent01_preview_48k16.wav,train/ID01_ARU_Fs=65536Hz_Standard speech - List 1 - Sentence 1 - Version 1_0.wav,01,M,47,Avon,1,1,"<transcription>",48000,16,browser preview only
The important required-style column is file_name, because the Hugging Face audio dataset docs use that column to link metadata rows to audio files.
The source_file_name and preview-related columns are for provenance and clarity.
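A small sketch of how such a preview metadata.csv could be generated from the canonical train metadata; the column names follow the example row above and are illustrative, not required by the Hub:

```python
import csv
from pathlib import Path

PREVIEW_DIR = Path("browser_preview_48k16")
PREVIEW_DIR.mkdir(exist_ok=True)

rows = []
with open("train/metadata.csv", newline="", encoding="utf-8") as f:
    for source in csv.DictReader(f):
        preview_name = Path(source["file_name"]).stem + "_preview_48k16.wav"
        rows.append({
            **source,
            "file_name": preview_name,                            # required link column
            "source_file_name": f"train/{source['file_name']}",   # provenance
            "preview_sample_rate": 48000,
            "preview_bit_depth": 16,
            "preview_purpose": "browser preview only",
        })

with open(PREVIEW_DIR / "metadata.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
```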
How many preview files?
I would start with a representative subset, not a full duplicate of all 8,640 files.
A good first pass:
12 speakers × 10 utterances = 120 preview files
Make sure the preview subset covers:
- all 12 speakers;
- both sexes;
- varied ages and accents;
- examples from train, validation, and test;
- varied sentence/list numbers;
- transcripts;
- source-file provenance.
That is enough for a useful Hub preview while keeping the derivative set clearly secondary.
If it works and row-by-row preview for every example becomes important, the preview config can be expanded later.
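To pick that subset in a speaker-stratified way rather than with the simple head -n 120 shown in the batch FFmpeg example below, a sketch like this could be used (it assumes a speaker_id column as in the preview metadata example above):

```python
import csv
from collections import defaultdict

UTTERANCES_PER_SPEAKER = 10  # 12 speakers x 10 = 120 preview files

by_speaker = defaultdict(list)
with open("train/metadata.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        by_speaker[row["speaker_id"]].append(row["file_name"])

# Take the first N utterances per speaker, sorted for reproducibility.
preview_files = [
    name
    for speaker, names in sorted(by_speaker.items())
    for name in sorted(names)[:UTTERANCES_PER_SPEAKER]
]
print(len(preview_files))
```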
Conversion commands
For 48 kHz / 16-bit mono WAV:
mkdir -p browser_preview_48k16
ffmpeg -y -i "input.wav" \
-ac 1 \
-ar 48000 \
-sample_fmt s16 \
"browser_preview_48k16/output_preview_48k16.wav"
For 44.1 kHz / 16-bit mono WAV:
mkdir -p browser_preview_44k16
ffmpeg -y -i "input.wav" \
-ac 1 \
-ar 44100 \
-sample_fmt s16 \
"browser_preview_44k16/output_preview_44k16.wav"
Batch example:
mkdir -p browser_preview_48k16
find train validation test -name "*.wav" | head -n 120 | while read -r f; do
base="$(basename "${f%.wav}")"
ffmpeg -y -i "$f" \
-ac 1 \
-ar 48000 \
-sample_fmt s16 \
"browser_preview_48k16/${base}_preview_48k16.wav"
done
Because the original source is 24-bit and high-rate, I would use FFmpeg or another well-tested resampler rather than a quick-and-dirty script that just drops samples. The Hugging Face Audio Course preprocessing page notes that resampling is not an in-place rewrite in datasets; for actual derivative files, do the conversion explicitly and document it.
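If a Python path is preferred over the FFmpeg commands above, a hedged equivalent with librosa and soundfile (assuming both are installed) would be:

```python
from pathlib import Path

import librosa
import soundfile as sf

TARGET_SR = 48_000
OUT_DIR = Path("browser_preview_48k16")
OUT_DIR.mkdir(exist_ok=True)

# librosa resamples on load with a high-quality resampler and folds to mono.
audio, sr = librosa.load("input.wav", sr=TARGET_SR, mono=True)

# Write 16-bit PCM explicitly rather than relying on a default subtype.
sf.write(OUT_DIR / "output_preview_48k16.wav", audio, TARGET_SR, subtype="PCM_16")
```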
Diagnostic checks I would run
1. Dataset Viewer API checks
Use the Dataset Viewer API to locate the failure layer.
The Dataset Viewer API docs and /rows docs are useful here. The /rows docs say image and audio samples are represented by URLs, and those assets are cached temporarily.
Run:
curl "/static-proxy?url=https%3A%2F%2Fdatasets-server.huggingface.co%2Fsplits%3Fdataset%3Dcjweaver%2FARU_speech_corpus"
curl "/static-proxy?url=https%3A%2F%2Fdatasets-server.huggingface.co%2Fparquet%3Fdataset%3Dcjweaver%2FARU_speech_corpus"
curl "/static-proxy?url=https%3A%2F%2Fdatasets-server.huggingface.co%2Ffirst-rows%3Fdataset%3Dcjweaver%2FARU_speech_corpus%26amp%3Bconfig%3Ddefault%26amp%3Bsplit%3Dtrain"
curl "/static-proxy?url=https%3A%2F%2Fdatasets-server.huggingface.co%2Frows%3Fdataset%3Dcjweaver%2FARU_speech_corpus%26amp%3Bconfig%3Ddefault%26amp%3Bsplit%3Dtrain%26amp%3Boffset%3D0%26amp%3Blength%3D10"
Try multiple offsets:
curl "/static-proxy?url=https%3A%2F%2Fdatasets-server.huggingface.co%2Frows%3Fdataset%3Dcjweaver%2FARU_speech_corpus%26amp%3Bconfig%3Ddefault%26amp%3Bsplit%3Dtrain%26amp%3Boffset%3D100%26amp%3Blength%3D10"
curl "/static-proxy?url=https%3A%2F%2Fdatasets-server.huggingface.co%2Frows%3Fdataset%3Dcjweaver%2FARU_speech_corpus%26amp%3Bconfig%3Ddefault%26amp%3Bsplit%3Dtrain%26amp%3Boffset%3D1000%26amp%3Blength%3D10"
Interpretation:
| API result | Meaning |
|---|---|
| /splits works | Dataset config/split structure is recognized. |
| /parquet works | Auto-converted Parquet exists. |
| /first-rows fails | First-row post-processing or asset generation may be failing. |
| /rows fails at specific offsets | Specific files may be problematic. |
| /rows returns audio URLs, but browser does not play them | Browser/player compatibility is likely. |
| 48/16 preview config works | Source WAV profile was likely the practical trigger. |
2. Local WAV validation
Run:
ffprobe -hide_banner -show_format -show_streams "example.wav"
Look for:
codec_name=pcm_s24le
sample_rate=65536
bits_per_sample=24
channels=1
Batch check:
find train validation test -name "*.wav" -print0 |
while IFS= read -r -d '' f; do
ffprobe -v error -select_streams a:0 \
-show_entries stream=codec_name,sample_rate,bits_per_sample,channels,duration \
-of default=nw=1:nk=0 "$f" >/dev/null || echo "BAD: $f"
done
This does not prove browser compatibility, but it helps rule out corrupt files, inconsistent headers, unexpected channel counts, or nonstandard PCM variants.
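A soundfile-based cross-check of the same properties (libsndfile rather than FFmpeg, so it is a useful second opinion) could look like this; the expected profile values come from the dataset card:

```python
from pathlib import Path

import soundfile as sf

for split in ("train", "validation", "test"):
    for path in sorted(Path(split).rglob("*.wav")):
        try:
            info = sf.info(path)
        except RuntimeError as err:
            # Unreadable or corrupt file.
            print(f"BAD: {path} ({err})")
            continue
        if info.samplerate != 65_536 or info.channels != 1 or info.subtype != "PCM_24":
            print(f"UNEXPECTED PROFILE: {path} -> {info.samplerate} Hz, "
                  f"{info.channels} ch, {info.subtype}")
```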
3. A/B preview test
Create two tiny configs or a test dataset:
test_original/
    metadata.csv
    *.wav              # 65,536 Hz / 24-bit
test_preview_48k16/
    metadata.csv
    *.wav              # 48,000 Hz / 16-bit
Then compare.
| Result | Interpretation |
|---|---|
| Original fails, 48/16 works | Strong evidence that the original WAV profile is the issue. |
| Both fail | More likely a metadata/layout/viewer issue. |
| Both work in API but original fails in browser | Browser playback issue. |
| Preview works everywhere | Add the preview config to the real dataset. |
Dataset-card wording I would use
## Audio format and browser preview
The canonical ARU Speech Corpus audio is stored as mono 65,536 Hz / 24-bit PCM WAV, matching the original measurement-oriented acquisition format.
This sample rate and bit depth are useful for archival and acoustic research, but may not be reliably playable in browser-based dataset preview players.
For convenience, this repository includes a separate `browser_preview_48k16` configuration containing selected utterances converted to 48 kHz / 16-bit PCM WAV. These files are intended only for quick listening and inspection in the Hub UI.
For training, evaluation, or acoustic analysis, use the canonical `default` configuration and resample explicitly for your target workflow.
Programmatic loading example:
from datasets import load_dataset, Audio
# Canonical source audio
ds = load_dataset("cjweaver/ARU_speech_corpus", split="train")
# Example: resample on access for a 16 kHz ASR model
ds_16k = ds.cast_column("audio", Audio(sampling_rate=16_000))
This mirrors the official datasets path for audio resampling rather than implying that the Hub web player will perform the conversion.
What I would ask Hugging Face
A precise support/discussion question would be:
The dataset uses canonical 65,536 Hz / 24-bit PCM WAV files. The public Dataset Viewer backend appears to treat .wav as a supported extension and may return an AudioSource pointing at the original dataset file with MIME type audio/wav, rather than resampling/transcoding it. Is the Dataset Preview player expected to support 65,536 Hz / 24-bit WAV directly, or should datasets with nonstandard WAV profiles provide a separate browser-preview config at 44.1/16 or 48/16?
That question is better than only asking “what sample rates are supported?” because it reflects the actual backend behavior visible in the code.
Final recommendation
For this dataset, I would do this:
- Keep the original 65,536 Hz / 24-bit WAV files as the canonical default dataset.
- Do not rely on dataset_info.sample_rate or Audio(sampling_rate=...) to fix the Hub player.
- Add a separate preview config named browser_preview_48k16 or browser_preview_44k16.
- Use 48 kHz / 16-bit PCM WAV for the preview files if choosing one default.
- Start with a representative subset, around 120 files, covering all speakers and splits.
- Include metadata linking every preview file back to the canonical source file.
- Use the Dataset Viewer API to confirm whether the current error is row generation, asset generation, or browser playback.
- If 48/16 preview files still fail, investigate metadata/config/layout rather than sample rate.
Short summary
- Hugging Face documents audio dataset support for WAV, but not a precise Dataset Viewer sample-rate whitelist.
- The public Dataset Viewer backend code appears to support audio by extension/MIME type, not by a visible sample-rate list.
- Existing .wav files may be passed through to the frontend/browser unchanged.
- sampling_rate metadata helps Python loading/resampling; it should not be assumed to transcode Hub preview audio.
- The original 65,536 Hz / 24-bit WAV files are valuable as canonical research audio.
- A separate browser_preview_48k16 or browser_preview_44k16 config is the clean, low-risk workaround.