Linux+, Certification for Data Scientists

In the world of data science, Linux is often the tool of choice. Unlike Mac or Windows, Linux offers a sandbox of sorts as an operating system, one that coders and scientists alike enjoy. Linux is also used extensively in penetration testing. The reason Linux is so often preferred over other operating systems is that it is open source: the source code is freely available to any user, and because of that it can be changed to suit a user's needs. Unlike Mac or Windows, where settings and preferences are limited to what each company decides a user can alter, Linux allows its users to design their platform from the bottom up. Indeed, this is why Linux comes in so many ‘flavors’, or distributions, like Debian, Fedora, or Pop!_OS. Each distribution has its own settings, preferences, and even aesthetics, reflecting the tastes of different tech communities.

Making full use of Linux, however, is a much more involved task than learning a Mac or Windows computer. This is because Linux is, fundamentally, a command-line operating system. The greatest use of the operating system is achieved not through a graphical user interface (think point-and-click with a mouse) but through textual commands, like ‘cd’ to change directory, ‘ls’ to list files, or ‘sudo’ to run a command with administrator privileges. As such, Linux users need to know the terms, commands, and concepts required to get the most out of their operating system.

Fortunately, there are certifications that can be pursued to build this knowledge. For Linux+, these are the LPI 101 and LPI 102 exams. Exam 101 covers core concepts in system architecture; Linux installation and package management; GNU and Unix commands; and devices, filesystems, and the filesystem hierarchy. Exam 102 covers shell scripting; data management; user interfaces and desktops; administrative tasks; networking; and security. The best way for anyone new to Linux to learn each of these areas is through virtualization.

Virtualization is the process of running one operating system on top of another. Its purpose here is to create a working environment where one can learn about an operating system without worrying about accidentally breaking the host machine. Two common tools for this are VirtualBox, provided by Oracle, and VMware.

Linux has two distributions any data scientist should know to be competitive in the field of Big Data: Debian and Fedora.
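One practical difference between the two is package management: Debian uses apt, while Fedora uses dnf. As a minimal sketch, the Python below reads the standard /etc/os-release file to tell which distribution is running. The helper names (parse_os_release, package_manager_for) and the small lookup table are illustrative assumptions, not part of any library.

```python
# Sketch: identify the running distribution from /etc/os-release.
# The function names and the PACKAGE_MANAGERS table are illustrative.

PACKAGE_MANAGERS = {"debian": "apt", "fedora": "dnf"}


def parse_os_release(text):
    """Parse the KEY=value lines of an os-release file into a dict."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, comments, and anything that is not KEY=value.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Values may be wrapped in double or single quotes.
        fields[key] = value.strip().strip('"').strip("'")
    return fields


def package_manager_for(fields):
    """Return the package manager for a parsed os-release dict."""
    distro_id = fields.get("ID", "").lower()
    return PACKAGE_MANAGERS.get(distro_id, "unknown")


if __name__ == "__main__":
    try:
        with open("/etc/os-release") as handle:
            info = parse_os_release(handle.read())
    except FileNotFoundError:
        info = {}  # not a Linux system, or the file is missing
    print(info.get("ID", "unknown"), package_manager_for(info))
```

Run inside a Debian or Fedora virtual machine, this prints the distribution ID alongside its package manager; any other distribution is reported as "unknown" because the table only covers the two discussed here.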

To begin working with virtualization, it is necessary to download a virtualization tool, say VirtualBox, and a live image of the operating system to be installed. After starting VirtualBox, we go into the virtual machine's settings, select Live CD, and link our downloaded Debian image, an .iso file, to the virtual machine. The Debian installer then appears in a separate window, a virtual screen, and walks us through installing Debian for a new user. Once this is complete, we are ready to begin.

Fedora can be installed the same way: an .iso image is attached to a virtual machine and then run.
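The same setup can also be scripted with VirtualBox's command-line tool, VBoxManage, instead of clicking through the settings. The sketch below only builds and prints the commands; it does not run them. The VM name, the .iso path, the memory and disk sizes, and the function name vbox_commands are all assumptions for illustration, and exact VBoxManage options can vary between VirtualBox versions, so check them against your installed version first.

```python
import shlex

# Sketch: build the VBoxManage commands for a new VM booted from an .iso.
# Names, sizes, and paths are placeholders; nothing is executed here.


def vbox_commands(vm_name, iso_path, os_type="Debian_64",
                  memory_mb=2048, disk_mb=20000):
    """Return a list of VBoxManage commands, each as a list of arguments."""
    disk_path = f"{vm_name}.vdi"  # assumed location for the virtual disk
    return [
        # Create and register an empty virtual machine.
        ["VBoxManage", "createvm", "--name", vm_name,
         "--ostype", os_type, "--register"],
        # Give the VM some memory.
        ["VBoxManage", "modifyvm", vm_name, "--memory", str(memory_mb)],
        # Create a virtual hard disk to install onto (size in MB).
        ["VBoxManage", "createmedium", "disk",
         "--filename", disk_path, "--size", str(disk_mb)],
        # Add a storage controller, then attach the disk and the .iso.
        ["VBoxManage", "storagectl", vm_name, "--name", "SATA", "--add", "sata"],
        ["VBoxManage", "storageattach", vm_name, "--storagectl", "SATA",
         "--port", "0", "--device", "0", "--type", "hdd",
         "--medium", disk_path],
        ["VBoxManage", "storageattach", vm_name, "--storagectl", "SATA",
         "--port", "1", "--device", "0", "--type", "dvddrive",
         "--medium", iso_path],
        # Boot the VM; the installer on the .iso should start.
        ["VBoxManage", "startvm", vm_name],
    ]


if __name__ == "__main__":
    for command in vbox_commands("debian-test", "debian.iso"):
        print(shlex.join(command))
```

For Fedora, the same function could be called with os_type="Fedora_64" and the path to a Fedora image. Printing the commands rather than running them makes it possible to review each step before touching the host machine.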

If we were to install a different distribution, however, these are the things one should ultimately consider:

The launcher: In addition to the text and GUI install options in the boot menu, desktop flavors include a launcher on the desktop that can be used to start the installation while running a live image. The desktop itself also comes in flavors: for example, GNOME, KDE, LXDE, or Xfce.
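Once a live image is running, it can be handy to confirm which desktop flavor it uses. As a rough sketch, most desktop sessions set the XDG_CURRENT_DESKTOP environment variable, which the Python below inspects. The function name detect_desktop and the short list of known names are assumptions for illustration, and a session that does not set the variable is simply reported as unknown.

```python
import os

# Sketch: report the desktop environment of the current session.
# Only the four desktops mentioned above are recognised.
KNOWN_DESKTOPS = ("GNOME", "KDE", "LXDE", "XFCE")


def detect_desktop(environ):
    """Return the desktop name found in XDG_CURRENT_DESKTOP, or 'unknown'."""
    raw = environ.get("XDG_CURRENT_DESKTOP", "")
    # The variable may hold several colon-separated entries, e.g. "ubuntu:GNOME".
    for entry in raw.split(":"):
        name = entry.strip().upper()
        if name in KNOWN_DESKTOPS:
            return name
    return "unknown"


if __name__ == "__main__":
    print(detect_desktop(os.environ))
```

Passing the environment in as an argument, rather than reading os.environ inside the function, keeps the check easy to try with made-up values.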

More on these to come!