Machine learning gives companies a way to extract more knowledge and insight from their data. The main hurdles to integrating it into data applications are the skills gap involved in implementing it and the performance impact on applications and databases.

This is where storage volume snapshots become useful for offline machine learning. Zero-footprint snapshots offer many advantages for machine learning, particularly when you consider the structure, operational processes, and overall health of your data ecosystem.

Benefits of Using Volume Snapshots for Machine Learning

The main benefit of using volume snapshots is the reduced strain on your operational databases. Although many machine learning tools are promoted as resource-efficient, machine learning workloads often cause performance problems. This is particularly critical for latency-sensitive applications, which may struggle to access the production database while machine learning processing is running.

Numerous other advantages arise from incorporating snapshots into machine learning workflows:

  1. Data Consistency: Snapshots capture a specific moment’s state within a database, ensuring the consistency of data used for training and testing machine learning models. This consistency leads to reproducible results.
  2. Minimal Production Impact: Machine learning tasks, particularly those involving data extraction and transformation, can demand substantial resources. Operating on a snapshot rather than the live database reduces the strain on the production environment’s performance. The snapshots are thin-provisioned, so they provide a full working copy without the storage footprint of a physical clone of the environment.
  3. Secure Testing Ground: Utilizing snapshots permits experimentation without the risk of modifying or compromising the primary database. This proves especially valuable for data engineers and scientists who need to adapt data structures for improved machine learning compatibility.
  4. Temporal Analysis: Multiple snapshots, captured at different points in time, enable temporal analysis. For example, models can be trained on data from various time periods, facilitating performance comparison or application in time series forecasting.
  5. Swift Recovery: If an issue arises during the machine learning preprocessing phase, such as accidental data deletion or transformation, recovery is as simple as reverting to the original snapshot or refreshing from a new volume snapshot.
  6. Enhanced Collaboration: Snapshots serve multiple teams concurrently. For instance, the data engineering team can engage in data preprocessing using one snapshot, while the data science team employs another snapshot for model development.
  7. Cost Efficiency: Volume snapshots in Silk are thin-provisioned, requiring no additional space outside of writes subsequently created as part of the development or testing.
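
The storage argument behind thin provisioning can be made concrete with a little arithmetic. The sketch below compares full physical clones with thin-provisioned snapshots. The database size, number of copies, and write volume are hypothetical figures chosen only for illustration, and the model assumes that a snapshot consumes space only for blocks written after it is taken.

```python
def full_clone_storage_gb(db_size_gb: float, copies: int) -> float:
    """Storage needed when every copy is a full physical clone."""
    return db_size_gb * copies


def thin_snapshot_storage_gb(new_writes_gb_per_copy: float, copies: int) -> float:
    """Storage needed when copies are thin-provisioned snapshots.

    Assumption: only blocks written after the snapshot is taken use space.
    """
    return new_writes_gb_per_copy * copies


if __name__ == "__main__":
    # Hypothetical figures, not measurements from any real system.
    db_size_gb = 2000.0
    copies = 10
    new_writes_gb_per_copy = 50.0

    clones = full_clone_storage_gb(db_size_gb, copies)
    snaps = thin_snapshot_storage_gb(new_writes_gb_per_copy, copies)
    print(f"Full clones:    {clones:,.0f} GB")
    print(f"Thin snapshots: {snaps:,.0f} GB")
```

Under these assumed numbers the clones need storage proportional to the full database size, while the snapshots need storage proportional only to what each user writes.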

Creating and Using a Volume Snapshot with Silk

Silk provides dynamic read/write volume snapshots of workloads, including relational systems like Oracle and SQL Server, within the Silk virtualized storage environment. Combined with Silk’s high I/O storage performance, these thin-provisioned, zero-footprint snapshots can be created in seconds, which speeds up machine learning workflows. The snapshots serve many purposes, including development, testing, reporting, and, notably, machine learning.

Creating a Silk volume snapshot starts at the command line, in a terminal or SSH prompt. Consider SQL Server on a Windows host: start the Microsoft DiskShadow utility by running ‘diskshadow’, and make sure the SQL Writer service is running. Once the Silk installation has been validated, Silk is available as a provider within the DiskShadow utility.

As an example, take a SQL Server database whose data files are on volume G:\ and whose log files are on H:\, with the snapshot metadata written to E:\diskshadow. The following commands, run inside DiskShadow, create the volume snapshot:

reset

set context PERSISTENT

set option TRANSPORTABLE

set metadata E:\diskshadow\SqlShadow1.cab

set verbose on

begin backup

add volume G: alias data

add volume H: alias log

create

end backup

This script would create both an un-exposed snapshot and a view to the volume by performing the following:

  1. reset: resets the session to begin the process.
  2. set context PERSISTENT: the shadow copy remains after the command exits and persists even if the system reboots.
  3. set option TRANSPORTABLE: the snapshot can be moved to another VM host.
  4. set metadata: names the .cab metadata file that describes the snapshot and is loaded later to restore or import it.
  5. set verbose on: detailed output is displayed, which makes it easier to log errors.
  6. begin backup starts the backup session, the add volume commands add G:\ (data files) and H:\ (logs) to the snapshot under the aliases data and log, create takes the snapshot, and end backup closes the session.

We’ve now created one snapshot that can then be mounted and used for a restoration in our next step. Keep in mind, all of this can be scripted for ease of use and automation.
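
As one way to script this step, the sketch below renders the same DiskShadow commands from a mapping of volumes to aliases. The function name and the output file name in the usage note are hypothetical; the commands it emits are the ones shown in the script above.

```python
def build_create_script(volumes: dict, metadata_path: str) -> str:
    """Render a DiskShadow script that snapshots the given volumes.

    volumes maps a drive letter (for example "G:") to the alias that
    later scripts use to refer to that volume's shadow copy.
    """
    lines = [
        "reset",
        "set context PERSISTENT",
        "set option TRANSPORTABLE",
        f"set metadata {metadata_path}",
        "set verbose on",
        "begin backup",
    ]
    # One "add volume" line per volume, in the order given.
    lines += [f"add volume {drive} alias {alias}" for drive, alias in volumes.items()]
    lines += ["create", "end backup"]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    script = build_create_script(
        {"G:": "data", "H:": "log"},
        r"E:\diskshadow\SqlShadow1.cab",
    )
    print(script, end="")
```

The rendered text can be saved to a file and passed to DiskShadow in script mode, for example `diskshadow -s create_snapshot.txt` from an elevated prompt, where the file name is an assumption.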

load metadata E:\diskshadow\SqlShadow1.cab

begin restore

add shadow %data% G:

add shadow %log% H:

resync novolcheck

end restore

Upon successful completion, the resynchronization operation displays the following message:

“The resynchronization operation has been successfully executed.”

The provided script will accomplish the following tasks:

– Load metadata from the file generated in the initial script.

– Initiate the restoration process.

– Incorporate the shadow and the volume using the alias specified in the metadata.

– Restore the snapshot and subsequently conclude the session.

Now, let’s consider a scenario where we intend to import a VSS (Volume Shadow Copy Service) snapshot from the Silk Data Pod (SDP) to another connected host. This action aims to enable the utilization of the snapshot with SQL Server for machine learning purposes by a different team.

Achieving this goal involves executing the subsequent commands through the command line:

load metadata E:\diskshadow\SqlShadow1.cab

import

expose %data% P:

expose %log% O:

BREAK READWRITE NOREVERTID %VSS_SHADOW_SET%

The snapshot has been successfully transferred to the new virtual machine (VM), and the drive letters have been configured to enable access to the snapshot’s database from the new host, which is connected to the SDP (Silk Data Pod). This process can be repeated for any number of VMs linked to the SDP, offering multiple users the ability to utilize the snapshot for different purposes. Each user can maintain their own read/write copy without affecting the production database.

Machine Learning with Volume Snapshots

In our current scenario, we can now create numerous snapshots of the SQL Server databases residing on the SDP volume. We can produce hundreds of these snapshots with minimal impact on storage and cloud expenses. To illustrate, consider a machine learning project undertaken by a group of 100 data scientists, each requiring an individual copy of the dataset. First, we create a snapshot, and then we attach the volume to 100 virtual machines connected to the SDP. This gives each data scientist a personal read/write version of the database for their own machine learning models.
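
A minimal sketch of that fan-out, assuming the same import steps shown earlier are repeated on every host: the code below renders one DiskShadow import script per VM. The host names and function names are hypothetical, and copying each script to its host and running it there is left out.

```python
def build_import_script(metadata_path: str, exposures: dict) -> str:
    """Render the DiskShadow commands that import and expose a snapshot.

    exposures maps a snapshot alias (for example "data") to the drive
    letter it should be exposed as on the target host.
    """
    lines = [f"load metadata {metadata_path}", "import"]
    lines += [f"expose %{alias}% {drive}" for alias, drive in exposures.items()]
    lines.append("BREAK READWRITE NOREVERTID %VSS_SHADOW_SET%")
    return "\n".join(lines) + "\n"


def scripts_for_hosts(hosts: list, metadata_path: str, exposures: dict) -> dict:
    """Return a mapping of host name to its import script."""
    return {host: build_import_script(metadata_path, exposures) for host in hosts}


if __name__ == "__main__":
    # Hypothetical host names for the 100 data science VMs.
    hosts = [f"ml-vm-{n:03d}" for n in range(1, 101)]
    scripts = scripts_for_hosts(
        hosts,
        r"E:\diskshadow\SqlShadow1.cab",
        {"data": "P:", "log": "O:"},
    )
    print(f"Generated {len(scripts)} scripts")
    print(scripts["ml-vm-001"], end="")
```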

The procedure is easily automated through scripting and takes just a few seconds, after which all data scientists can connect to the databases through the targeted SQL Server. They then have access to SQL Server Machine Learning Services (or a comparable tool) and can log into the database as if working with a complete recovery copy. Using Python or R through the extensibility framework, with features such as “sp_execute_external_script,” they can train their machine learning models against the databases on the volume snapshots. Notably, the only additional storage consumed is associated with the machine learning models being developed. This stands in contrast to more conventional methods, which require storage for every read/write copy of the original SQL Server database.
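
To show the shape of such a call, the sketch below builds the T-SQL text for “sp_execute_external_script” from a Python snippet and an input query. The helper name, the table dbo.Orders, and the one-line script body are hypothetical; the helper only renders the statement and does not connect to SQL Server. Inside the external script, the query result arrives as the data frame InputDataSet, and whatever is assigned to OutputDataSet is returned to the caller.

```python
def build_external_script_call(python_body: str, input_query: str) -> str:
    """Render an EXEC sp_execute_external_script statement as text."""

    def quote(text: str) -> str:
        # T-SQL escapes a single quote inside a string literal by doubling it.
        return "N'" + text.replace("'", "''") + "'"

    return (
        "EXEC sp_execute_external_script\n"
        "    @language = N'Python',\n"
        f"    @script = {quote(python_body)},\n"
        f"    @input_data_1 = {quote(input_query)};\n"
    )


if __name__ == "__main__":
    # Hypothetical table and a trivial script body standing in for model training.
    sql = build_external_script_call(
        "OutputDataSet = InputDataSet.head(10)",
        "SELECT * FROM dbo.Orders WHERE Region = 'West'",
    )
    print(sql, end="")
```

Because each data scientist runs this against a private snapshot copy, a heavy input query or a script that writes results back does not touch the production database.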

Volume Snapshot Clean Up

Once all machine learning exercises are complete, the snapshot can just as easily be removed by simply deleting the view:

delete shadows exposed P:

delete shadows exposed O:

In this use case, we created numerous snapshots and exposed them to all the project’s VMs for machine learning work. When the project is complete, we can clean them all up just as easily with the following command:

delete shadows all

Provide Data Copies to ML Without Impact

In conclusion, the fusion of machine learning with SQL Server and the utilization of volume storage snapshots marks a pivotal juncture in the evolution of data management and analytics. This synergy empowers organizations to glean invaluable insights from their data, while simultaneously ensuring data integrity and efficiency. By harnessing the capabilities of SQL Server’s robust data processing and management functionalities, combined with the agility and efficiency of volume storage snapshots, businesses can accelerate decision-making processes, enhance resource utilization, and streamline their operations.

Machine learning algorithms integrated within SQL Server open the door to predictive analysis, pattern recognition, and anomaly detection, enabling organizations to uncover hidden trends and gain a competitive edge. Volume storage snapshots strengthen data protection and disaster recovery, enabling rapid and reliable data restoration without jeopardizing performance in the production system. Together, the two not only bolster data security and compliance efforts but also pave the way for real-time decision-making and strategic planning.