
ScholarsArchive@OSU User Guide

This guide helps ScholarsArchive users deposit and manage their content.

Preferred File Formats

To maximize your ability to share, preserve, and re-use digital files, consider carefully which format you store them in. Choosing a well-supported format reduces the risk that your data become inaccessible when a proprietary format is no longer supported or available.

Formats more likely to remain accessible in the future are:

  • Non-proprietary

  • Based on open, documented standards

  • In common use by the research community

  • Encoded with standard character encodings (ASCII, UTF-8)

  • Uncompressed (desirable, space permitting)
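One way to act on the character-encoding point is to check a text file before deposit and re-encode it if needed. The following is a minimal Python sketch, not an official ScholarsArchive@OSU tool. The function names are hypothetical, and the default source encoding (ISO 8859-1) is only an assumption that you should confirm for your own files, since a wrong guess silently produces wrong characters.

```python
from pathlib import Path


def is_utf8(path):
    """Return True if the file's bytes decode cleanly as UTF-8.

    Pure ASCII files also pass, because ASCII is a subset of UTF-8.
    """
    try:
        Path(path).read_bytes().decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def convert_to_utf8(src, dst, source_encoding="iso-8859-1"):
    """Re-encode a text file as UTF-8 and write it to dst.

    source_encoding is an assumption: verify it for each file before
    relying on the converted copy.
    """
    text = Path(src).read_bytes().decode(source_encoding)
    Path(dst).write_bytes(text.encode("utf-8"))
```

For example, `is_utf8("notes.txt")` reports whether a (hypothetical) file `notes.txt` is already valid UTF-8, and `convert_to_utf8("notes.txt", "notes_utf8.txt")` writes a UTF-8 copy while leaving the original untouched.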

Use the tables below to find a recommended format for preserving and sharing your digital files over the long term. The tables indicate the level of confidence that each file format will remain accessible over time, based on these characteristics. Formats that are not rated highest confidence may still have features that are desirable for near-term use. In some instances, it may be appropriate to submit a proprietary or other lower-confidence format alongside a highest-confidence format to facilitate near-term use (for instance, an Excel/XLSX file for fully featured near-term use, submitted with a CSV file for long-term preservation).
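As an illustration of the CSV half of that pairing, the sketch below writes tabular data to a UTF-8 CSV file using only Python's standard `csv` module. It assumes the rows are already in memory; reading them out of an XLSX workbook would need a third-party library and is not shown. The function name, column names, and values are hypothetical.

```python
import csv


def write_csv(path, header, rows):
    """Write a header line and data rows to a UTF-8 encoded CSV file."""
    # newline="" lets the csv module control line endings itself,
    # which is what the csv documentation recommends.
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)


# Hypothetical usage:
# write_csv(
#     "measurements.csv",
#     ["site", "date", "temp_c"],
#     [["Corvallis, OR", "2020-01-15", 7.2], ["Newport", "2020-01-15", 9.1]],
# )
```

Values that contain commas, such as the site name in the commented example, are quoted automatically by the writer, so the file can still be parsed unambiguously later.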

Most content deposited to ScholarsArchive@OSU is textual in nature: theses and dissertations, research articles, presentations, technical reports, conference proceedings, posters, etc. The PDF file format is required for this content. PDF/A-1 (ISO 19005-1) with fonts embedded is preferred (.pdf). PDF without embedded fonts is also acceptable but not recommended. To save a Microsoft Word document as a PDF with fonts embedded, follow these instructions: https://www.bc.edu/content/dam/files/libraries/pdf/embed-fonts.pdf.

For other content types, such as quantitative and statistical data, spreadsheets, databases, graphics, audio, and video (among others), use the tables below to find a recommended format for preserving and sharing your digital files in ScholarsArchive@OSU over the long term. Please note that, per the ScholarsArchive@OSU Preservation Policy, repository staff commit to performing format migration for any files submitted in recommended Highest Confidence formats, should those formats become obsolete. For all other file formats, only bit-level preservation is guaranteed.

 

Document Formats

Word Processing
  • Highest Confidence: PDF/A-1 (ISO 19005-1) (.pdf); PDF/UA (ISO 14289-1) (.pdf)
  • Medium Confidence: Portable Document Format / PDF (All other types) (.pdf); OpenDocument Text (Open Office) (.sxw, .odt); MS Word 2007+ (OOXML) (.docx); Rich Text Format (.rtf)
  • Lowest Confidence: Microsoft Word (.doc); Google Docs (.gdoc); WordPerfect (.wpd)

Plain Text
  • Highest Confidence: Plain text (US-ASCII, UTF-8, UTF-16 with BOM) (.txt)
  • Medium Confidence: Plain text (ISO 8859-x) (.txt)
  • Lowest Confidence: (none listed)

Structured Text
  • Highest Confidence: SGML – with included DTD (.sgm, .sgml); XML – with included schema (.xml); XSL (.xsl)
  • Medium Confidence: HTML (.htm, .html); Cascading Style Sheets (.css); LaTeX with referenced files (.latex, .tex); Markdown (.md)
  • Lowest Confidence: (none listed)

Presentations
  • Highest Confidence: PDF/A-1 (ISO 19005-1) (.pdf)
  • Medium Confidence: Portable Document Format / PDF (.pdf); OpenDocument Presentation (Open Office) (.sxi, .odp); MS PowerPoint 2007+ (OOXML) (.pptx, .ppsx)
  • Lowest Confidence: Microsoft PowerPoint (.ppt, .pps); MS PowerPoint 2007+ with macros enabled (.pptm)

Scanned Documents
  • Highest Confidence: PDF/A-1 (ISO 19005-1) (.pdf); TIFF – uncompressed (.tif, .tiff); JPEG2000 – lossless compression (.jp2)
  • Medium Confidence: Portable Document Format / PDF (.pdf)
  • Lowest Confidence: (none listed)

eBooks
  • Highest Confidence: Open eBook File (.epub)
  • Medium Confidence: Portable Document Format / PDF (.pdf)
  • Lowest Confidence: (none listed)

Structured Data Formats

Tabular Data
  • Highest Confidence: Comma-Separated Values (.csv); Tab-Separated Values (.tsv); Delimited Text (.txt)
  • Medium Confidence: OpenDocument Spreadsheet (Open Office) (.sxc, .ods); MS Excel 2007+ (OOXML) (.xlsx)
  • Lowest Confidence: Microsoft Excel (.xls); MS Excel 2007+ with macros enabled (.xlsm)

Databases
  • Highest Confidence: SQLite (.sqlite, various); Software Independent Archiving of Relational Databases (SIARD) (.siard)
  • Medium Confidence: dBASE / DBF (.dbf)
  • Lowest Confidence: (none listed)

Statistical Data
  • Highest Confidence: Comma-Separated Values (.csv); Delimited Text (.txt); HDF5 (.hdf)
  • Medium Confidence: R (.R, .rdata); SPSS (.sav, .sps, .spv, .spo); SAS (.sas, .sas7bdat); HDF4 (.hdf)
  • Lowest Confidence: Other proprietary formats

Geospatial Data
  • Highest Confidence: GeoTIFF (.tif, .tiff); GeoJSON (.json, .geojson); netCDF (.nc)
  • Medium Confidence: ESRI Shapefile, with component files (.shp, .shx, .dbf); ESRI Geodatabase (.gdb); ESRI Export Format (.e00); Geography Markup Language (.gml); Keyhole Markup Language (.kml, .kmz)
  • Lowest Confidence: Other ESRI files; Other proprietary formats

Metadata and Markup
  • Highest Confidence: XML – with included schema (.xml); JSON – with included metadata (.json); Data Documentation Initiative (.ddi)
  • Medium Confidence: (none listed)
  • Lowest Confidence: (none listed)

Audio-Visual Material Formats

Images: Raster Graphics
  • Highest Confidence: TIFF – uncompressed or CCITT 4 compressed (.tif, .tiff); JPEG2000 – lossless compression (.jp2); PNG – 24-bit true color (.png)
  • Medium Confidence: TIFF – compressed (.tif, .tiff); JPEG2000 – lossy compression (.jp2); PNG – 8-bit indexed (.png); JPEG (.jpg, .jpeg); GIF (.gif); BMP (.bmp); DNG Digital Negative (.dng)
  • Lowest Confidence: Photoshop (.psd); MrSID (.sid); Proprietary RAW files (various)

Images: Vector Graphics
  • Highest Confidence: Scalable Vector Graphics (.svg); PDF/A-1 (ISO 19005-1) (.pdf)
  • Medium Confidence: Computer Graphics Metafile (.cgm)
  • Lowest Confidence: Encapsulated Postscript (.eps); Macromedia Flash (.swf)

Audio
  • Highest Confidence: Broadcast WAVE (BWAV) – LPCM codec (.bwf); WAVE – PCM, LPCM codec (.wav); AIFF – PCM, LPCM codec (.aif, .aiff)
  • Medium Confidence: MPEG Audio Layer III (.mp3); Advanced Audio Coding (.mp4, .aac); Apple Lossless Audio Codec / ALAC (.m4a); Free Lossless Audio Codec (.flac); Standard MIDI (.mid); SUN Audio – uncompressed (.au, .snd)
  • Lowest Confidence: WAVE – compressed (.wav); AIFC – compressed AIFF (.aifc); RealAudio (.rm, .ra); Windows Media Audio (.wma); Ogg Vorbis (.ogg)

Video
  • Highest Confidence: FFV1 / Matroska (.mkv); AVI – uncompressed (.avi); QuickTime – uncompressed, motion JPEG (.mov); MXF – uncompressed (.mxf); MPEG-4 – H.264 (.mp4, .m4v); SubRip (subtitle file) (.srt)
  • Medium Confidence: MPEG-2 (.mp2, .mpg, .vob); MPEG-1 (.mp1, .mpg); Ogg Theora (.ogg, .ogv); Apple ProRes (.mov); Motion JPEG2000 (.jp2)
  • Lowest Confidence: Windows Media Video (.wmv); RealVideo (.rm, .rv)

3D/CAD Formats

3D/CAD
  • Highest Confidence: Industry Foundation Classes (.ifc); Standard for the Exchange of Product Model Data (.step, .stp, .p21); Initial Graphics Exchange Specification (.igs)
  • Medium Confidence: Autodesk’s Drawing Interchange File Format / Data eXchange Format (.dxf); AutoCAD (.dwg); Extensible 3D (.x3D); Universal 3D (.u3D); Portable Document Format / Engineering or PDF3D (.pdf)
  • Lowest Confidence: Other proprietary formats

Container Formats

Containers
  • Highest Confidence: ZIP – uncompressed (.zip); TAR Tape Archive (.tar)
  • Medium Confidence: ZIP – compressed (.zip); GNU Zip / GZip – compressed (.gz); GZip compressed tarballs (.tar.gz)
  • Lowest Confidence: (none listed)

BitTorrent files (.torrent) are required for depositing files larger than 5 GB. See the BitTorrent Guide for more information.

Remember that container files are only as good as their contents! The files inside the ZIP or TAR wrapper should still adhere as closely as possible to the best practices and recommendations elsewhere in this guide.
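Because this guide rates uncompressed ZIP above compressed ZIP, it can help to build the archive explicitly in "stored" mode rather than rely on a tool's default. The sketch below uses Python's standard `zipfile` module for that; the function names and file names are hypothetical, and this is one possible approach rather than a required workflow.

```python
import zipfile
from pathlib import Path


def make_uncompressed_zip(zip_path, files):
    """Bundle files into a ZIP archive without compressing them."""
    # ZIP_STORED writes each member as-is, with no compression applied.
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_STORED) as zf:
        for f in files:
            f = Path(f)
            # arcname=f.name stores only the file name, not its full path,
            # so files with the same name in different folders would collide.
            zf.write(f, arcname=f.name)


def all_stored(zip_path):
    """Return True if every member of the archive is stored uncompressed."""
    with zipfile.ZipFile(zip_path) as zf:
        return all(
            info.compress_type == zipfile.ZIP_STORED for info in zf.infolist()
        )
```

For example, `make_uncompressed_zip("deposit.zip", ["data.csv", "README.txt"])` would bundle two files, and `all_stored("deposit.zip")` can be used to check an existing archive before deposit.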

Email Formats

Email
  • Highest Confidence: MBOX Email Format (.mbox); EML Internet Message Format (.eml, .mht, .mhtml)
  • Medium Confidence: MSG Microsoft Outlook Item Message Format (.msg); PST Microsoft Personal Folders Format (.pst)
  • Lowest Confidence: (none listed)

Software/Computer Code Formats

Software or Computer Code
  • Highest Confidence: (none listed)
  • Medium Confidence: Computer program source code (various)
  • Lowest Confidence: Compiled or executable files (various)

Resources