File Operations in Perl¶
File Functions¶
Opening Files¶
You can open files with the open function:
open(my $fh, '<', 'file.txt') or die $!;
- open: Built-in Perl function to open a file.
- my $fh: Defines $fh as a file handle, which works like a pointer to the opened file.
    - The my keyword makes it lexically scoped (only available in that block).
- '<': Open in read-only mode. The available modes are:
    - <: Read
    - >: Write (truncate/overwrite)
    - +<: Read and write
    - >>: Append
    - -|: Open a filehandle to read from an input stream (for use with shell commands)
    - |-: Open a filehandle to write to an output stream (for use with shell commands)
- 'file.txt': The file to open.
- or die $!;: If open fails, this terminates the script and prints the system error message from $!.
Closing Files¶
Close files with the close function, passing the file handle as an argument (e.g., close($fh);).
If you forget to close, Perl closes any open handles when the script ends, and a lexical file handle is closed when it goes out of scope. But in long-running scripts, handles that stay open can cause file descriptor leaks.
Reading from Files (Diamond Operator)¶
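When you have opened a file and assigned a file handle, you can read from the file
using the angle-bracket readline operator (<$fh>), often loosely called the diamond operator.
You can use while (<$fh>) to loop over the lines of the file. In that form, each line is stored in the default variable $_.
If you want, you can also save the line into a named variable to use in the while loop, as in while (my $line = <$fh>).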
File operation workflow:
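A minimal sketch of the open, read, close workflow described above. The temporary file and the names `$path` and `@lines` are assumptions made for illustration, so that the example runs without relying on an existing file.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical setup: create a small throwaway file to read back.
my ($tmp, $path) = tempfile(UNLINK => 1);
print {$tmp} "alpha\nbeta\ngamma\n";
close($tmp) or die "Can't close $path: $!";

# 1. Open the file for reading.
open(my $fh, '<', $path) or die "Can't open $path: $!";

# 2. Read it line by line, stripping each trailing newline.
my @lines;
while (my $line = <$fh>) {
    chomp($line);
    push @lines, $line;
}

# 3. Close the handle once done.
close($fh) or die "Can't close $path: $!";

print "Read ", scalar(@lines), " lines\n";
```

Checking the return value of close is optional for read-only handles, but it costs little and matters more once you start writing.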
Slurping a Whole File¶
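To read the contents of a file into a single scalar variable, undefine
the $/ variable (input record separator) before reading from the file handle.
Use local inside a block so that $/ is restored afterwards and later reads still work line by line:
open(my $fh, '<', 'file.txt') or die "Couldn't open file: $!";
my $contents = do {
    local $/;    # $/ is undef inside this block only
    <$fh>;
};
close $fh;
print $contents;
In bash it would be something like contents=$(cat file.txt).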
Writing to Files¶
Writing to files requires opening the file first, then you can print into it.
To truncate (overwrite) a file, open with the > mode.
- print $fh "...";: The print function can take a filehandle as its first argument (with no comma after it).
    - The output of the print statement is directed to that filehandle.
To append to the file, open with the >> mode.
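As a rough illustration of the two modes, the sketch below writes a file with `>` and then adds to it with `>>`. The scratch directory and the `example.log` filename are hypothetical, chosen only so the example doesn't touch real files.

```perl
use strict;
use warnings;
use File::Spec;
use File::Temp qw(tempdir);

# Hypothetical scratch location, removed when the script exits.
my $dir  = tempdir(CLEANUP => 1);
my $path = File::Spec->catfile($dir, 'example.log');

# '>' creates the file, or truncates it if it already exists.
open(my $out, '>', $path) or die "Can't write $path: $!";
print {$out} "first\n";
close($out) or die "Can't close $path: $!";

# '>>' keeps the existing contents and adds to the end.
open(my $app, '>>', $path) or die "Can't append to $path: $!";
print {$app} "second\n";
close($app) or die "Can't close $path: $!";

# Read the file back to confirm both lines are there.
open(my $in, '<', $path) or die "Can't read $path: $!";
my @lines = <$in>;
close($in);
print @lines;
```

The `print {$out} ...` form wraps the filehandle in braces, which is just a more explicit spelling of `print $out ...`.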
In bash, this is equivalent to the > and >> redirections, e.g. echo "..." >> file.txt.
Reading Filenames from a Directory¶
Perl has built-in directory functions to help with this kind of thing:
- opendir
- readdir
- closedir
For example, opening the notes directory and reading the filenames inside:
my $dir_name = '/home/kolkhis/notes';
opendir(my $dir, $dir_name) or die "Can't open dir: $dir_name: $!";
my @files = readdir($dir);
closedir($dir);
foreach my $f (@files) {
print "File: $f\n";
}
- opendir() opens a directory and stores a directory handle in $dir.
- readdir() in list context reads all entries in the directory into an array.
- closedir() closes the directory handle.
Excluding . and ..¶
- Use a grep to exclude the . (current) and .. (parent) directories.
my $dir_name = '/home/kolkhis/notes';
opendir(my $dir, $dir_name) or die "Can't open dir: $dir_name: $!";
my @files = grep { $_ ne '.' && $_ ne '..' } readdir($dir);
closedir($dir);
foreach my $f (@files) {
    print "File: $f\n";
}
Using grep { ... } readdir($dir) filters out unwanted entries.
The block uses the ne (string "not equal") operator for its comparisons.
- Just add a -f check in the grep to only get regular files.
my $dir_name = '/home/kolkhis/notes';
opendir(my $dir, $dir_name) or die "Can't open dir: $dir_name: $!";
my @files = grep { $_ ne '.' && $_ ne '..' && -f $_ } readdir($dir);
closedir($dir);
foreach my $f (@files) {
    print "File: $f\n";
}
This has some limitations though. readdir() returns bare filenames without the directory part, so -f $_ is resolved relative to the current working directory. If the script isn't being run from inside $dir_name, the test checks the wrong paths and the results will be wrong or empty.
If we just used "$dir_name/$_", the / separator would be hard-coded, which is not guaranteed to be cross-platform.
We can fix that with File::Spec (or Path::Tiny).
We can fix that with File::Spec (or Path::Tiny).
Getting Regular Files Only with File::Spec¶
File::Spec is a core Perl module for building file paths portably.
use File::Spec;
my $dir_name = '/home/kolkhis/notes';
opendir(my $dir, $dir_name) or die "Can't open dir: $dir_name: $!";
my @files = grep {
$_ ne '.' && $_ ne '..' && -f File::Spec->catfile($dir_name, $_)
} readdir($dir);
closedir($dir);
foreach my $f (@files) {
print "File: $f\n";
}
- File::Spec->catfile($dir_name, $_):
    - Safely joins the directory path and filename regardless of OS.
    - So instead of "$dir_name/$_", we use catfile.
- -f: Tests if the path is a regular file. Skips directories, pipes, sockets, etc.
    - Just like bash, -f follows symlinks, so a symlink that points to a regular file also passes.
Getting Regular Files Only with Path::Tiny¶
File::Spec is great, but Path::Tiny is more "ergonomic."
Path::Tiny is a CPAN module. It's not in the Perl core, but it's common in
modern Perl codebases.
Using this module abstracts away all of the lower level things like opendir and
closedir.
use Path::Tiny;
my $dir = path("/home/kolkhis/notes"); # The `path` function from `Path::Tiny`
my @files = $dir->children(qr/\.md$/); # All the .md files
foreach my $file (@files) {
print "File: $file\n";
}
- The path() function returns a Path::Tiny object that represents the directory.
- This object has a children() method, which returns the entries inside the directory (. and .. are left out automatically).
- qr/: The qr means "quote regex." It compiles the regex into an object.
    - You can pass that object around or use it like a regular regex. It's roughly like re.compile(r'...') in Python.
- Inside children(qr/\.md$/);, it's filtering for filenames with the .md file extension.
- This is more platform agnostic. It handles the platform differences (slashes, symlinks, etc.) more cleanly.
File Test Operations (File Conditionals)¶
Perl and Bash have a lot in common. This is true for file tests as well.
Perl supports a lot of conditionals that Bash has for checking the state of files
(e.g., -f $FILE).
A list of Perl file test operators (and their bash equivalents):
- -f: Is a regular file ([[ -f "$file" ]])
- -d: Is a directory ([[ -d "$dir" ]])
- -e: Exists ([[ -e "$path" ]])
- -s: Size > 0 ([[ -s "$file" ]])
- -r, -w, -x: Readable / writable / executable (same as Bash)
- -z: Size is zero ([[ ! -s "$file" ]]; Bash's own -z tests for an empty string, not an empty file)
- -l: Is a symlink ([[ -L "$file" ]])
An example:
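One possible way to put a few of these operators together is sketched below. The `describe` helper and the scratch files are hypothetical, created only so the tests have something to inspect.

```perl
use strict;
use warnings;
use File::Spec;
use File::Temp qw(tempdir);

# Hypothetical scratch directory with one empty and one non-empty file.
my $dir   = tempdir(CLEANUP => 1);
my $empty = File::Spec->catfile($dir, 'empty.txt');
my $full  = File::Spec->catfile($dir, 'full.txt');

open(my $fh, '>', $empty) or die "Can't create $empty: $!";
close($fh);
open($fh, '>', $full) or die "Can't create $full: $!";
print {$fh} "some data\n";
close($fh);

# Describe a path using a handful of file test operators.
sub describe {
    my ($path) = @_;
    return 'missing'      unless -e $path;
    return 'directory'    if -d $path;
    return 'empty file'   if -f $path && -z $path;
    return 'regular file' if -f $path;
    return 'something else';
}

print "$_: ", describe($_), "\n" for ($dir, $empty, $full);
```

As in Bash, each operator takes the path as its operand and returns a true or false value, so they combine naturally with if, unless, and &&.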
Recursive Directory Traversal (like find)¶
You can use File::Find to emulate the behavior of the find command:
use File::Find;
find(sub{
return unless -f; # only regular files
print "Found file: $File::Find::name\n";
}, '/some/directory');
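The File::Find::find function takes a subroutine reference (a Perl function) as its first argument, followed by one or more directories to search.
- TODO: Document how to emulate -exec. File::Find has no -exec option of its own; the usual approach is to run the command (e.g., with system()) inside the subroutine.
The bash equivalent is find /some/directory -type f.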
Modifying File Permissions (chmod, chown, utime)¶
Perl has built-in functions that mirror the shell commands for these operations: chmod, chown, and utime (which sets timestamps, like touch):
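A short sketch of the three built-ins, run against a throwaway file. The file, the 0640 mode, and the "yesterday" timestamp are arbitrary choices for illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical throwaway file to operate on.
my ($fh, $path) = tempfile(UNLINK => 1);
close($fh);

# chmod takes a numeric mode (note the leading 0 for octal) and a list of files.
# Like chown and utime, it returns the number of files successfully changed.
chmod(0640, $path) == 1 or die "chmod failed on $path: $!";

# chown takes a numeric UID and GID; -1 means "leave unchanged".
# $> is the effective UID of this process, so nothing really changes here.
chown($>, -1, $path) == 1 or die "chown failed on $path: $!";

# utime takes the access time, the modification time, then the files.
my $yesterday = time() - 24 * 60 * 60;
utime($yesterday, $yesterday, $path) == 1 or die "utime failed on $path: $!";

# stat returns the mode in slot 2 and the mtime in slot 9.
my ($mode, $mtime) = (stat($path))[2, 9];
printf "mode: %04o\n", $mode & 07777;
```

Changing a file's owner to a different user normally requires elevated privileges, which is why the sketch only passes the current UID.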
Though Perl, like Bash, allows you to leave out parentheses for function calls, it's usually better to use them (avoids ambiguity).
In bash, the equivalents are the chmod, chown, and touch commands.
Temporary Files¶
You can create temporary files with File::Temp.
use File::Temp qw/tempfile/; # qw = a list of quoted words. same as ('tempfile')
my ($fh, $filename) = tempfile();
print $fh "Hello!\n";
close $fh;
print "File written to $filename\n";
- use File::Temp qw/tempfile/;: The qw stands for "quote words." It's a Perl shortcut for a list of space-separated strings.
    - So this is the same as ('tempfile').
    - File::Temp is being passed a list of symbols to import. So using qw/tempfile/; only imports the tempfile() function from the module.
File Locks (flock)¶
For safe concurrent writing, you can implement file locks with flock().
File locking prevents multiple processes from writing to the same file at the same
time (i.e., avoiding race conditions or corruption).
- It's like a mutex but for files.
- If one process locks a file, others have to wait (or fail) if they try to lock it too.
- flock($fh, 2): Lock the file with an exclusive lock (write lock).
    - The 2 means "Exclusive Lock". There are others:
| Number | Constant | Meaning |
|---|---|---|
| 1 | LOCK_SH | Shared lock (read) |
| 2 | LOCK_EX | Exclusive lock (write) |
| 4 | LOCK_NB | Non-blocking (combined with LOCK_SH or LOCK_EX using a bitwise OR) |
| 8 | LOCK_UN | Unlock |
If you'd rather use the constants, you can do that -- it's probably more readable.
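A sketch of the constant-based form, assuming the constants are imported from the core Fcntl module with the `:flock` tag. The temporary file stands in for a shared log and is hypothetical.

```perl
use strict;
use warnings;
use Fcntl qw(:flock);            # imports LOCK_SH, LOCK_EX, LOCK_NB, LOCK_UN
use File::Temp qw(tempfile);

# Hypothetical stand-in for a shared log file.
my ($tmp, $log_path) = tempfile(UNLINK => 1);
close($tmp);

open(my $log, '>>', $log_path) or die "Can't open $log_path: $!";

# Block until we hold the exclusive lock (same as flock($log, 2)).
flock($log, LOCK_EX) or die "Can't lock $log_path: $!";
print {$log} "pid ", $$, " wrote this while holding the lock\n";

# Closing the handle flushes the buffer and releases the lock.
close($log) or die "Can't close $log_path: $!";

# Non-blocking variant: fail right away instead of waiting for the lock.
open(my $probe, '>>', $log_path) or die "Can't open $log_path: $!";
my $got_lock = flock($probe, LOCK_EX | LOCK_NB);
my $status   = $got_lock ? 'lock acquired' : "lock is busy: $!";
print "$status\n";
close($probe);
```

Releasing the lock by closing the handle, rather than calling flock($log, LOCK_UN) first, avoids unlocking before buffered output has been written.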
In bash, this would be done with the flock command, e.g. flock -x log.lock -c '...'. This says:
- Lock the file log.lock exclusively (-x).
- Then run the command (-c '...').
- While that command runs, the lock is held.
- Other processes trying to flock log.lock will wait.
This prevents two scripts from writing to shared.log at the same time.
Combining Bash and Perl to Read Filenames¶
Also see: operators.
If using File::Find is too cumbersome, you can use qx (or backticks) to
spawn a subshell and execute a shell command (find).
my @names;
@names = qx(find /home/kolkhis/notes -name '*.md');
chomp(@names); # get rid of the newlines at the end
foreach my $n (@names) {
-f $n && print "Regular file: $n\n";
}
If you skip chomp(@names);, the filenames will still contain their trailing newlines, and -f will
not work on them.
Reading from a Piped Input Stream¶
You can utilize the open function with the -| operator to indicate a piped input
stream.
# Single-string form: the command line is parsed by the shell.
# open(my $fh, '-|', 'find /home/kolkhis/notes -name "*.md"') or die $!;
# Multiple-argument (list) form: no shell involved (this is safer, and faster).
open(my $fh, '-|', 'find', '/home/kolkhis/notes', '-name', '*.md') or die $!;
while (my $filename = <$fh> ) {
chomp($filename); # Get rid of the newline
print "File: $filename\n";
}
close $fh;
-| opens a pipe for reading the output of a command.
This runs the find command in a subprocess and produces a filehandle to read from
its stdout.
You can also utilize open to write output to pipes.
Which one should I use?
Both forms end up running the command in a child process through Perl's internal exec() function.
But, exec() behaves differently when it's passed a single string vs. a list of items.
- When exec() is passed only a single string and that string contains shell metacharacters, it hands the string to the system shell (/bin/sh -c on Unix-like systems, not $SHELL) to parse and execute.
- When exec() is passed a list, it behaves like C's execvp() and skips the shell.
So, these two open commands behave a bit differently:
- The first open only has one single string command, so it relies on the shell to parse it.
    - Spawns a shell (like sh -c 'find ...').
    - This version is vulnerable to quoting bugs or shell injection if any part of the command comes from user input.
- The second open has each argument passed as a separate list item, so Perl can bypass the shell and directly invoke the command with execvp()-like behavior.
    - Perl uses fork() to create a child process.
    - Then it uses Perl's builtin exec() function, which wraps C's execvp().
    - Since it's not a string, it does not invoke a shell.
Writing to a Piped Output Stream¶
You can also utilize open with the |- operator to specify that you want to write
to a pipe.
Like above, it's safer to pass a list instead of a single string to open when using pipes.
For example, opening a pipe to the command tee output.log gives you a filehandle that pipes out to tee, which writes everything it receives to output.log.
You can also specify tee -a to append to the file instead:
# Single-string form (may go through the shell):
# open(my $fh, '|-', 'tee -a output.log') or die $!;
# List form, to avoid using the shell:
open(my $fh, '|-', 'tee', '-a', 'output.log') or die $!;
print $fh "Hello again!\n";
close $fh;
Just like with reading from a pipe (-|), the form you use matters:
- If you pass a single string:
    - Perl may invoke the shell (sh -c "..."), which happens whenever the string contains shell metacharacters
    - This is less safe and can cause quoting issues or shell injection
- If you pass a list of arguments:
    - Perl directly invokes the command using execvp() (no shell)
    - Safer and more predictable
Using open() to Fork and Exec a Command¶
Like I pointed out earlier when reading from and writing to pipes, formatting your
command as an array is safer.
Passing the command as a list tells Perl:
- Do not spawn a subshell.
- Fork the current Perl process.
- Use exec() to run this command directly with those exact arguments.
This works more like how execvp() works in C.
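To make the "exact arguments" point concrete, here is a small sketch that passes an argument full of shell metacharacters through the list form. The child command is the running perl itself (`$^X`), chosen as an assumption so the example doesn't depend on any external program, and `$tricky` is a made-up value.

```perl
use strict;
use warnings;

# A hypothetical argument full of shell metacharacters.
my $tricky = q{notes; echo "pwned" $(id)};

# List form: no shell is involved, so $tricky reaches the child unchanged.
# The child is a one-line perl program that prints its first argument.
open(my $fh, '-|', $^X, '-e', 'print "$ARGV[0]\n"', $tricky)
    or die "Can't fork: $!";
my $echoed = <$fh>;
close($fh) or die "Child exited with status $?";
chomp($echoed);

my $verdict = $echoed eq $tricky ? 'arrived intact' : 'was altered';
print "Argument $verdict\n";
```

Had the same text been interpolated into a single command string, the shell would have treated the `;` and `$(...)` as syntax rather than data.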
This is safer than using shell pipelines (e.g., spawning subshells with qx(ls -alh /tmp) or `ls -alh /tmp`):
- Those versions pass the command string through the shell (/bin/sh rather than $SHELL) whenever it contains shell metacharacters, which is basically equivalent to sh -c '...'.
- They involve parsing the command string, so quoting and escaping can be tricky.
With the list form:
- No shell parsing means no shell injection risk.
- No need to escape quotes, backslashes, or whitespace in paths.
- No reliance on the shell.
So if you're using the list form of open to run the command, it bypasses the shell (avoids quoting
issues), and you pass arguments directly to the command.
Behind the scenes, this is what's happening:
- '-|': Tells Perl to create a child process (a fork).
- The child process then does an exec of the given command.
- The parent gets a filehandle to read from the child's STDOUT.
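The same steps can be spelled out by hand with pipe, fork, and exec. This is only a rough sketch of the idea, not how Perl implements the pipe open internally, and the child command (the running perl printing two lines) is an assumption chosen for portability.

```perl
use strict;
use warnings;

# Hypothetical child command: the running perl ($^X) printing two lines.
my @cmd = ($^X, '-e', 'print "one\ntwo\n"');

# Create a pipe, then split into parent and child.
pipe(my $reader, my $writer) or die "pipe failed: $!";
my $pid = fork();
die "fork failed: $!" unless defined $pid;

if ($pid == 0) {
    # Child: point STDOUT at the write end of the pipe, then become the command.
    close($reader);
    open(STDOUT, '>&', $writer) or die "Can't redirect STDOUT: $!";
    exec { $cmd[0] } @cmd or die "exec failed: $!";
}

# Parent: read what the child wrote to its STDOUT.
close($writer);
my @output = <$reader>;
close($reader);
waitpid($pid, 0);

print "child said: $_" for @output;
```

The `exec { $cmd[0] } @cmd` spelling forces the list form, so the shell is never consulted even if the list happens to contain a single element.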
Using select to Redirect Output¶
The select() function in Perl allows you to switch the "currently selected" filehandle.
The currently selected filehandle determines where your output goes when print, write, or say is called without an explicit filehandle.
If an argument is given, and it's a filehandle, it sets the selected filehandle
to the one given and returns the previously selected one. This redirects any print/write/say calls that don't name a filehandle to that
filehandle until you switch back.
Without arguments, select() returns the currently selected filehandle.
The currently selected filehandle is STDOUT by default (a built-in filehandle).
open(my $fh, '>>', 'file.txt') or die $!;
select($fh);
print "Hey\n"; # goes into file.txt
select(STDOUT); # switch back to stdout
print "Hey\n"; # goes to stdout
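Because select() returns the previously selected filehandle, you can keep track of it by saving it to a variable and restoring it later.
For example, to print to standard error:
my $old_fh = select(STDERR);    # $old_fh now holds the previous handle (STDOUT)
print "Hello stderr.\n";
select($old_fh);                # restore the previous handle
print "Hello stdout.\n";
print STDERR "Hello again, stderr.\n";    # or name the handle explicitly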
Some output-related variables are tied to the filehandle chosen with select().
This includes the autoflush variable ($|) and the format variables (such as $~ and $^).
The input and output record separators ($/ and $\ respectively) are global and apply to every filehandle, regardless of select().
When setting $| to autoflush (flush the buffer after every write), it applies to the
filehandle that is currently selected.
open(my $log, '>', 'output.log') or die $!;
select($log);
$| = 1; # Autoflush now applies to $log
$\ = "\n"; # Output record separator (global: automatically adds a newline to every print)
print "Logging started."; # Goes to output.log
select(STDOUT); # Restore default output location
Reading from STDIN¶
You can read from standard input (STDIN) using a while loop with the readline
operator (<STDIN>).
STDIN is a built-in filehandle, so it works with the angle brackets like any other filehandle.
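A minimal sketch of such a loop, wrapped in a hypothetical `number_lines` helper so the same code works on any filehandle. The `-t STDIN` guard is an added assumption: it skips reading when the script is run interactively with nothing piped in.

```perl
use strict;
use warnings;

# Number each line read from a filehandle (STDIN in normal use).
sub number_lines {
    my ($in) = @_;
    my @numbered;
    while (my $line = <$in>) {
        chomp($line);
        # $. holds the current line number of the last-read handle.
        push @numbered, sprintf('%d: %s', $., $line);
    }
    return @numbered;
}

# Only read STDIN when something is piped or redirected in, so an
# interactive run doesn't sit waiting for input.
if (-t STDIN) {
    print "Usage: some_command | $0\n";
} else {
    print "$_\n" for number_lines(\*STDIN);
}
```

Passing `\*STDIN` hands the built-in handle to the subroutine as a reference, which is what lets the helper treat it like a lexical filehandle.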
In-Place Editing (Emulating -i Behavior)¶
The edit-in-place behavior that is enabled via the -i option (man perlrun)
can also be enabled by setting the $^I variable (man perlvar).
This method only works on the ARGV file handle.
So, you must either manually insert the file into ARGV, or pass the file(s)
as command-line arguments to the perl script itself.
#!/usr/bin/env perl
use strict;
use warnings;
for my $file (<*.md>) {
chomp($file);
our $^I = '.bak';
our @ARGV = ($file);
while(<>) {
s/old/new/g;
print;
}
}
Note that we're using our instead of my for the variables. This is because
@ARGV and $^I are global variables; my would create a separate lexical variable for the current scope only,
which won't be recognized by the diamond operator.
Using local would also work for this operation. This method is probably
better, since the original values of @ARGV and $^I are restored when the enclosing block exits.
#!/usr/bin/env perl
use strict;
use warnings;
for my $file (<*.md>) {
chomp($file);
local $^I = '.bak';
local @ARGV = ($file);
while(<>) {
s/old/new/g;
print;
}
}
Using File::Find¶
Perl core comes with a File::Find module that is meant to be fully portable
across operating systems and shells.
This is preferred for getting a list of filenames over globbing (via glob()
or <*>), since globbing is less portable.
For example, if we wanted to find all the markdown files in the current directory and all its subdirectories and save them into an array:
#!/usr/bin/perl
use strict;
use warnings;
use File::Find;
my @markdown_files;
find(
sub { push @markdown_files, $File::Find::name if /\.md$/ },
'./'
);
print "Markdown files in the current directory:\n";
print "$_\n" for @markdown_files;
The find subroutine recursively searches through the directory given and calls
the anonymous subroutine once for every file and directory it encounters.
It takes a subroutine as the first argument, the "wanted" function, which defines what happens with each entry. The name of each file (and directory) is made available to this subroutine in the default variable $_.
So, within that subroutine we define what we want to do with the files that we find.
This anonymous subroutine adds the file to the @markdown_files array if its name ends in .md.
Note
Using $File::Find::name produces the path to the file, starting from the
directory that was passed to find. If we only needed the basenames of the
files, we could use $_ (the default variable).
The $_ variable stores only the name of the file without any directory
path.
We can also define our own subroutine beforehand to use as the "wanted function."
sub addfile {
push (@markdown_files, $File::Find::name) if m/\.md$/ && -f;
}
File::Find::find(\&addfile, './');
- Note that we're passing a reference to the subroutine.
This is functionally equivalent to doing something like this with a glob: