Organizations have an increasing need to preserve data from websites and social media platforms due to a growing body of regulatory requirements, legislation such as Dodd-Frank and the Freedom of Information Act, and general e-discovery readiness.
While it may be desirable to treat websites like other file types for the purposes of archiving, there are critical differences inherent in the dynamic nature and unique architecture of the Internet that necessitate additional steps to ensure a complete and accurate archive.
The first requirement is preserving source files in their original, unaltered format. The second is the ability to search, review, produce, and generally utilize the content of an archived website. These two requirements are common to any archived data repository, but the unique architecture of the web introduces a third.
To archive a web page in its entirety on a given day means including content from related pages as well as third-party servers, such as video providers like YouTube or social media streams from Twitter. The Cloud Preservation platform offers a thorough approach that results in a viable strategy for tackling all three core preservation requirements.
Original, unaltered source files
First, source files must be preserved in their original, unaltered format. This means that it’s necessary to save all original files that were used to create the site or that were included on the site. For example, Cloud Preservation captures the native HTML file (including source code, developer comments, etc.) representing the code that was running the website on any particular day as well as any native file types such as a video file or perhaps a PDF document or Excel worksheet if these were available for download from the site.
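The idea of byte-for-byte preservation can be sketched as follows. This is an illustrative sketch only; the function name, record fields, and storage layout are hypothetical and not Cloud Preservation's actual API or format:

```python
import hashlib
import time

def preserve_original(raw_bytes: bytes, url: str) -> dict:
    """Store the exact bytes served for a URL, with a hash and capture time.

    Hypothetical record layout for illustration; a real archive would also
    need durable storage, HTTP headers, and tamper-evident timestamping.
    """
    return {
        "url": url,
        "captured_at": time.time(),                       # capture timestamp
        "sha256": hashlib.sha256(raw_bytes).hexdigest(),  # integrity fingerprint
        "content": raw_bytes,                             # unaltered source bytes
    }

record = preserve_original(b"<html><!-- dev comment --></html>", "https://example.com/")
# The stored bytes, including developer comments, are byte-for-byte
# identical to what the server sent.
assert record["content"] == b"<html><!-- dev comment --></html>"
```

Because the content is stored untouched, the hash can later prove the archived copy matches what was originally captured.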
Additionally, hyperlinks, which are core to the architecture of the Internet, must be modified to point to their respective target pages in the archive. While some website archival strategies attempt to build and store a browsable version of a website, they in fact have to modify the core source files, changing links and attempting to make static versions of dynamic website components. This does not result in an archive of the original, unaltered source files.
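A minimal sketch of why link rewriting conflicts with source preservation. The regex approach and the archive path scheme here are invented for illustration, not how any particular tool works:

```python
import re

def rewrite_links(html: str, archive_prefix: str) -> str:
    """Rewrite absolute hrefs to point into a local archive.

    This is the kind of modification a "browsable archive" strategy makes;
    note that the output no longer matches the original source.
    """
    return re.sub(
        r'href="(https?://[^"]+)"',
        lambda m: f'href="{archive_prefix}{m.group(1)}"',
        html,
    )

original = '<a href="https://example.com/page">link</a>'
rewritten = rewrite_links(original, "/archive/2012-01-01/")
assert rewritten != original  # the "archived" copy is no longer the source file
```

Even this small change means a hash of the archived copy will never match a hash of the page as it was served, undermining any claim of an unaltered archive.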
Review, export, and production
Organizations have numerous and diverse needs for historical copies of websites, largely centered on legal and regulatory compliance. For these purposes, website data often needs to be searched and produced, printed, or exported in a usable format. While Cloud Preservation stores all original source files and makes them easily available for export, it also creates an image rendering (similar to a screen capture) with a forensically sound timestamp for each page of a website and inserts all text from the page into a powerful search engine.
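The full-text side of this can be illustrated with a toy inverted index mapping words to captured pages. The page identifiers and indexing scheme are invented for this sketch; a production search engine handles stemming, ranking, and scale far beyond this:

```python
from collections import defaultdict

# Toy inverted index: word -> set of captured-page identifiers.
index = defaultdict(set)

def index_page(page_id: str, text: str) -> None:
    """Add every word of a page's extracted text to the index."""
    for word in text.lower().split():
        index[word].add(page_id)

def search(term: str) -> set:
    """Return the identifiers of all captured pages containing the term."""
    return index.get(term.lower(), set())

# Hypothetical captures of two pages from a single crawl date.
index_page("home-2012-01-05", "Welcome to our loan rates page")
index_page("about-2012-01-05", "About our company history")
assert search("loan") == {"home-2012-01-05"}
```

A reviewer can then search across every capture date and jump straight to the matching page image, rather than browsing archived snapshots one by one.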
Third party data
Unlike most documents, websites are aggregations of content provided by multiple live sources on the web. For example, a web page might show a video that is hosted on YouTube or Vimeo, or perhaps it shows recent posts from a blog or recent updates from a Twitter feed. For a financial organization, a website might display real-time loan rates or stock quotes, all coming from a third-party server. Nearly all websites include hyperlinks to other sites, which must also be archived.
In these cases, the data is not actually included in the underlying HTML source files but is brought in, via a technology called AJAX, directly to the browser from the third party. As a result, there is potentially critical data that an organization would be unable to reproduce or render at a later date if needed. From a compliance standpoint, this is akin to saving an email without the attachment.
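To see why this matters, consider what the raw HTML for such a page actually contains. The endpoint URL and post text below are invented for illustration:

```python
# Raw HTML as served by the web server: the feed container is empty, and a
# script fetches the actual posts from a third-party API in the visitor's
# browser after the page loads.
raw_html = """
<div id="tweet-feed"></div>
<script>
  // hypothetical third-party endpoint
  fetch("https://api.example-social.com/feed")
    .then(r => r.json())
    .then(posts => render(posts));
</script>
"""

# The text a visitor actually saw never appears in the source file, so an
# archive that captures only the HTML would silently lose it.
assert "Our Q3 results are in!" not in raw_html
```

The source file faithfully records the empty container and the script, but not the posts the script fetched, which is exactly the email-without-attachment problem.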
Cloud Preservation resolves this problem by capturing an image of how the page rendered in a browser at the time of archival, along with the full content and text of the page after rendering. For good measure, a forensically sound timestamp is included on that rendering. When you search for web pages in Cloud Preservation, the third-party content will be included. Additionally, all linked pages and documents, including external sites and files such as PDFs or Office documents, are captured, providing a comprehensive view that includes not only what was on the website on a given day but also a complete picture of the third-party resources that were utilized. These related links can be accessed under a tab on the image preview of any given page. Clicking a link takes the user to the archived version of the external data as it appeared that day, allowing the user to essentially “recreate” the navigation experience.
A Complete Approach to Preservation
The unique architecture and connectedness of the web mean that if you want to browse a website exactly as it was at some point in the past, you need more than an archival tool; you need a time machine. Short of that, Cloud Preservation takes a complete approach that not only preserves original, unaltered source files but also preserves the entire visual experience along with all text and related content for each page.