Up until now the standard has contained no specific guidance on the maximum size of an XML file in IATI format. Guidance has recommended a best practice of publishers segmenting their file by recipient country or region to ensure a manageable file size.
This guidance needs revision for two reasons:
- The first file of over 100 MB has been published. The IATI Registry (as well as other data consuming systems) failed to process it correctly.
- Country segmentation has never been mandatory and the only accurate way to gather all data about a single country or region is through the IATI Datastore or other similar repositories.
Here is the advice I have received from our developer Ben Webb:
- We suggest that publishers segment only in order to ensure no one file is bigger than 40 MB.
- There is no definitive answer to when, or whether, an XML file is too large. However, since we would prefer not to rewrite our current software, and we want to make the data easier for other software authors to work with, I suggest we impose a limit appropriate for reading whole files into memory.
- 50 MB was the limit adopted by the Registry, and all files other than the most recent arrival are smaller than this, so I suggest we use it to inform our limit.
- I suggest we use 40 MB as the limit for publishers, and 50 MB as the largest file that processing systems are required to handle. This provides some tolerance, so that if a publisher's files are accidentally slightly too large, systems should not break.
- Viewing the larger files in browsers will still be tricky. However, I think this is okay, as we can point people towards proper XML tools (e.g. BaseX), or the growing number of IATI visualisation tools.
- To be useful, this limit must be an actual requirement, so I suggest adding it to the 2.01 process. However, we should also adopt it as guidance as soon as it is agreed.
- We suggest abandoning country-level segmentation as best practice, as it doesn't necessarily make sense, and the Datastore now provides this view of the data.
- We should not actively discourage other methods of segmentation (e.g. country level, or putting archived/closed data into a different file) if that's easier for a publisher's publishing process.
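
To illustrate how the two proposed thresholds relate, here is a minimal Python sketch of a size check a publisher or consuming tool might run. It is not part of any existing IATI tool: the function names and the returned labels are hypothetical, and it assumes 1 MB means 1,000,000 bytes, which the proposal does not specify.

```python
import os

# Assumption: 1 MB is taken as 1,000,000 bytes; the proposal does not say
# whether decimal or binary megabytes are meant.
MB = 1_000_000
PUBLISHER_LIMIT = 40 * MB  # proposed limit for publishers
CONSUMER_LIMIT = 50 * MB   # proposed largest file that systems must handle


def classify_size(size_bytes: int) -> str:
    """Classify a file size against the two proposed limits (hypothetical labels)."""
    if size_bytes <= PUBLISHER_LIMIT:
        return "ok"
    if size_bytes <= CONSUMER_LIMIT:
        # Over the publisher limit, but consuming systems should still cope.
        return "over publisher limit, within tolerance"
    return "too large"


def classify_file(path: str) -> str:
    """Classify a file on disk by its size in bytes."""
    return classify_size(os.path.getsize(path))
```

Under these assumptions, a 45 MB file would be flagged for the publisher to segment further, while consuming systems would still be expected to process it.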