File Operations in Perl¶
File Functions¶
Opening Files¶
You can open files with the open function. A call such as open(my $fh, '<', 'file.txt') or die $!; breaks down as follows:
- open: Builtin Perl function to open a file.
- my $fh: Defines $fh as a file handle. Like a pointer to the opened file.
    - The my keyword makes it lexically scoped (only available in that block).
- '<': Open in read-only mode. The available modes are:
    - <: Read
    - >: Write (truncate/overwrite)
    - +<: Read and write
    - >>: Append
    - -|: Open a filehandle to read from an input stream (for use with shell commands)
    - |-: Open a filehandle to write to an output stream (for use with shell commands)
- 'file.txt': The file to open.
- or die $!;: If open fails, this terminates the script and prints the system error message from $!.
Closing Files¶
Close files with the close function, passing the file handle as an argument, e.g. close($fh);.
If you forget to close a file, Perl closes it automatically when the script ends (or, for a lexical handle, when the handle goes out of scope). In long-running scripts, however, handles that stay open can leak file descriptors.
Reading from Files (Diamond Operator)¶
When you have opened a file and assigned a file handle, you can read from the file using the diamond operator (<>).
You can use while (<$fh>) to loop over the lines of the file. Each line is then stored in the default variable $_.
If you want, you can also save the line into a variable of your own, e.g. while (my $line = <$fh>), and use that inside the loop instead of $_.
File operation workflow:
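A minimal sketch of the open, read, close workflow described above. The temp file and its three lines are made up for illustration (File::Temp is only used so the sketch runs on its own); the loop is the part that matters.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Throwaway file with three lines (removed when the script exits).
my ($tmp, $path) = tempfile(UNLINK => 1);
print $tmp "alpha\nbeta\ngamma\n";
close($tmp) or die "Can't close $path: $!";

# 1. Open the file for reading.
open(my $fh, '<', $path) or die "Can't open $path: $!";

# 2. Read it line by line.
my @lines;
while (my $line = <$fh>) {
    chomp($line);               # strip the trailing newline
    push @lines, $line;
}

# 3. Close the handle once done.
close($fh) or die "Can't close $path: $!";

print scalar(@lines), " lines read: @lines\n";
```

Naming the loop variable ($line) is optional; with a bare while (<$fh>) the same loop would work on $_.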
Slurping a Whole File¶
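To read the contents of a file into a single scalar variable, undefine the $/ variable (the input record separator) before reading from the file handle. Using local inside a block keeps the change from leaking into the rest of the program:
open(my $fh, '<', 'file.txt') or die "Couldn't open file: $!";
my $contents = do { local $/; <$fh> };  # $/ is undef only inside the do block
close $fh;
print $contents;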
In bash it would be something like contents=$(cat file.txt).
Writing to Files¶
Writing to files requires opening the file first; then you can print into it.
To truncate (overwrite) a file, open it with the > mode.
- print $fh "...";: The print function can take a filehandle as its first argument (with no comma after it).
    - The output of the print statement will be directed to that filehandle.
To append to the file, open it with the >> mode.
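Here is a small sketch of both modes, assuming a throwaway file from File::Temp (the file and its contents are illustrative only). It truncates with >, appends with >>, then reads the file back to show the result.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Throwaway file (removed when the script exits).
my ($tmp, $path) = tempfile(UNLINK => 1);
print $tmp "old contents\n";
close($tmp) or die "Can't close $path: $!";

# '>' truncates the file before writing, so "old contents" is lost.
open(my $fh, '>', $path) or die "Can't open $path for writing: $!";
print $fh "first line\n";       # no comma between the filehandle and the text
close($fh) or die "Can't close $path: $!";

# '>>' keeps what is already there and writes at the end.
open($fh, '>>', $path) or die "Can't open $path for appending: $!";
print $fh "second line\n";
close($fh) or die "Can't close $path: $!";

# Read the file back to check the result.
open($fh, '<', $path) or die "Can't open $path for reading: $!";
my @written = <$fh>;
close($fh) or die "Can't close $path: $!";
print @written;
```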
In bash, these two modes are equivalent to the > and >> redirections, e.g. echo "..." > file.txt and echo "..." >> file.txt.
Reading Filenames from a Directory¶
Perl has dedicated directory functions to help with this kind of thing:
- opendir
- closedir
- readdir
For example, opening the notes directory and reading the filenames inside:
my $dir_name = '/home/kolkhis/notes';
opendir(my $dir, $dir_name) or die "Can't open dir: $dir_name: $!";
my @files = readdir($dir);
closedir($dir);
foreach my $f (@files) {
print "File: $f\n";
}
- opendir() opens a directory and stores a directory handle in $dir.
- readdir() (in list context) reads all filenames in the directory into an array.
- closedir() closes the directory handle.
Excluding . and ..¶
- Use a grep to exclude the . (current) and .. (parent) directories.
my $dir_name = '/home/kolkhis/notes';
opendir(my $dir, $dir_name) or die "Can't open dir: $dir_name: $!";
my @files = grep { $_ ne '.' && $_ ne '..' } readdir($dir);
closedir($dir);
foreach my $f (@files) {
    print "File: $f\n";
}
Using grep { ... } readdir($dir) filters out unwanted entries. The block uses the ne (string "not equal") operator for its conditions.
- Just add a -f check in the grep to only get regular files.
my $dir_name = '/home/kolkhis/notes';
opendir(my $dir, $dir_name) or die "Can't open dir: $dir_name: $!";
my @files = grep { $_ ne '.' && $_ ne '..' && -f $_ } readdir($dir);
closedir($dir);
foreach my $f (@files) {
    print "File: $f\n";
}
This has some limitations though. readdir() returns bare filenames without the directory, so -f $_ is resolved relative to the current working directory. If the script isn't being run from $dir_name, the test fails and the list may come back empty.
If we just used "$dir_name/$_", the / separator would be hard-coded, which is not guaranteed to be cross-platform (e.g., on Windows).
We can fix that with File::Spec (or Path::Tiny).
Getting Regular Files Only with File::Spec¶
File::Spec is a core Perl module for building file paths portably.
use File::Spec;
my $dir_name = '/home/kolkhis/notes';
opendir(my $dir, $dir_name) or die "Can't open dir: $dir_name: $!";
my @files = grep {
    $_ ne '.' && $_ ne '..' && -f File::Spec->catfile($dir_name, $_)
} readdir($dir);
closedir($dir);
foreach my $f (@files) {
    print "File: $f\n";
}
- File::Spec->catfile($dir_name, $_): Safely joins the directory path and filename regardless of OS.
    - So instead of "$dir_name/$_", we use catfile.
- -f: Tests if the path is a regular file. Skips directories, pipes, sockets, etc. A symlink that points to a regular file still passes, because -f follows symlinks.
    - Just like bash.
Getting Regular Files Only with Path::Tiny¶
File::Spec is great, but Path::Tiny is more "ergonomic."
Path::Tiny is a CPAN module. It's not in the Perl core, so it has to be installed separately, but it's common in modern Perl codebases.
Using this module abstracts away all of the lower-level things like opendir and closedir.
use Path::Tiny;
my $dir = path("/home/kolkhis/notes"); # The `path` function from `Path::Tiny`
my @files = $dir->children(qr/\.md$/); # All the .md files
foreach my $file (@files) {
print "File: $file\n";
}
- The path() function returns a Path::Tiny object (stored in $dir here).
- This object has a children() method, which returns the entries of the directory as Path::Tiny objects.
- qr/: The qr means "quote regex." It compiles a regex into an object.
    - You can pass that object around or use it like a regular regex. It's roughly comparable to re.compile(r'...') in Python.
- Inside children(qr/\.md$/);, it's filtering for filenames with the .md file extension.
- This is more platform agnostic. It handles the platform differences (slashes, symlinks, etc.) more cleanly.
File Test Operations (File Conditionals)¶
Perl and Bash have a lot in common. This is true for file tests as well.
Perl supports many of the conditionals that Bash has for checking the state of files (e.g., -f $FILE).
A list of Perl file test operators (and their bash equivalents):
- -f: Is a regular file. Bash: [[ -f "$file" ]]
- -d: Is a directory. Bash: [[ -d "$dir" ]]
- -e: Exists. Bash: [[ -e "$path" ]]
- -s: Size > 0. Bash: [[ -s "$file" ]]
- -r, -w, -x: Readable / writable / executable. Same as Bash.
- -z: Size is zero. Bash: [[ ! -s "$file" ]] (Bash's own -z tests for an empty string, not an empty file.)
- -l: Is a symlink. Bash: [[ -L "$file" ]]
An example:
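A small sketch of the operators from the list above, run against a throwaway directory. The describe() helper and the file names are hypothetical, made up for this illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Build a throwaway directory with one empty and one non-empty file.
my $dir   = tempdir(CLEANUP => 1);
my $empty = "$dir/empty.txt";
my $full  = "$dir/full.txt";

open(my $fh, '>', $empty) or die "Can't create $empty: $!";
close($fh) or die "Can't close $empty: $!";
open($fh, '>', $full) or die "Can't create $full: $!";
print $fh "some data\n";
close($fh) or die "Can't close $full: $!";

# Describe a path using the file test operators.
sub describe {
    my ($path) = @_;
    return 'missing'        unless -e $path;
    return 'directory'      if -d $path;
    return 'empty file'     if -f $path && -z $path;
    return 'non-empty file' if -f $path && -s $path;
    return 'other';
}

print "$_: ", describe($_), "\n" for $dir, $empty, $full, "$dir/nope";
```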
Recursive Directory Traversal (like find)¶
You can use File::Find to emulate the behavior of the find command:
use File::Find;
find(sub{
return unless -f; # only regular files
print "Found file: $File::Find::name\n";
}, '/some/directory');
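- The File::Find::find function takes a reference to a subroutine (a Perl function) as its first argument and calls it for every file and directory it visits.
- TODO: File::Find can also be used to emulate find's -exec functionality.
The bash equivalent is roughly find /some/directory -type f.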
Modifying File Permissions (chmod, chown, utime)¶
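Perl has builtin functions for these operations: chmod and chown work like the shell commands of the same name, and utime sets a file's access and modification times.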
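A minimal sketch of the three builtins, assuming a Unix-like system and a throwaway file from File::Temp. The mode and timestamp values are arbitrary examples, and chown simply reuses the file's current owner so the sketch runs without root.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Throwaway file to operate on (removed when the script exits).
my ($fh, $file) = tempfile(UNLINK => 1);
close($fh) or die "Can't close $file: $!";

# chmod takes a numeric mode (note the leading 0 for octal) and a list of
# files, and returns how many files were changed.
chmod(0640, $file) == 1 or die "chmod failed: $!";

# chown takes a numeric uid and gid. Reusing the file's current owner and
# group keeps this runnable without root; real code would look up the ids.
my ($uid, $gid) = (stat($file))[4, 5];
chown($uid, $gid, $file) == 1 or die "chown failed: $!";

# utime sets the access and modification times (in epoch seconds).
my $one_hour_ago = time() - 3600;
utime($one_hour_ago, $one_hour_ago, $file) == 1 or die "utime failed: $!";

my ($mode, $mtime) = (stat($file))[2, 9];
printf "mode: %04o, mtime: %s\n", $mode & 07777, scalar localtime($mtime);
```

All three functions return the number of files successfully changed, which is why the sketch compares the result with 1.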
Though Perl allows you to leave out the parentheses on function calls, much like commands in Bash, it's usually better to use them (it avoids ambiguity).
In bash, these operations are done with the chmod, chown, and touch commands.
Temporary Files¶
You can create temporary files with File::Temp.
use File::Temp qw/tempfile/; # qw = a list of quoted words. same as ('tempfile')
my ($fh, $filename) = tempfile();
print $fh "Hello!\n";
close $fh;
print "File written to $filename\n";
- use File::Temp qw/tempfile/;: The qw stands for "quote words." It's a Perl shortcut for a list of whitespace-separated strings.
    - So this is the same as ('tempfile').
- The list after File::Temp names the symbols to import. So using qw/tempfile/; only imports the tempfile() function from the module.
File Locks (flock)¶
For safe concurrent writing, you can implement file locks with flock().
File locking prevents multiple processes from writing to the same file at the same time (i.e., avoiding race conditions or corruption).
- It's like a mutex but for files.
- If one process locks a file, others have to wait (or fail) if they try to lock it too.
- flock($fh, 2): Lock the file with an exclusive lock (write lock).
    - The 2 means "Exclusive Lock". There are others:
Number | Constant | Meaning |
---|---|---|
1 | LOCK_SH | Shared Lock (read) |
2 | LOCK_EX | Exclusive Lock (write) |
4 | LOCK_NB | Non-blocking (combined with one of the locks above) |
8 | LOCK_UN | Unlock |
If you'd rather use the constants, you can import them from the Fcntl module with use Fcntl qw(:flock);. That is more readable, and more portable than the raw numbers.
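A minimal sketch using the Fcntl constants instead of raw numbers. The log file here is a throwaway temp file and the log lines are invented for illustration.

```perl
use strict;
use warnings;
use Fcntl qw(:flock);           # imports LOCK_SH, LOCK_EX, LOCK_NB, LOCK_UN
use File::Temp qw(tempfile);

# Throwaway file standing in for a shared log (removed at exit).
my ($tmp, $log) = tempfile(UNLINK => 1);
close($tmp) or die "Can't close $log: $!";

open(my $fh, '>>', $log) or die "Can't open $log: $!";

# Block until we hold an exclusive (write) lock on the file.
flock($fh, LOCK_EX) or die "Can't lock $log: $!";
print $fh "entry from PID $$\n";

# Closing the handle flushes the buffer and releases the lock.
close($fh) or die "Can't close $log: $!";

# Non-blocking variant: give up immediately instead of waiting.
open($fh, '>>', $log) or die "Can't open $log: $!";
if (flock($fh, LOCK_EX | LOCK_NB)) {
    print $fh "second entry from PID $$\n";
} else {
    warn "$log is locked by another process, skipping\n";
}
close($fh) or die "Can't close $log: $!";
```

Keep in mind that flock() locks are advisory: they only hold off other processes that also call flock() on the same file.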
In bash, the flock command does the same job, e.g. flock -x log.lock -c '...'.
This says:
- Lock the file log.lock exclusively (-x)
- Then run the command (-c '...')
- While that command runs, the lock is held
- Other processes trying to flock log.lock will wait
This prevents two scripts from writing to shared.log at the same time.
Combining Bash and Perl to Read Filenames¶
Also see: operators.
If using File::Find is too cumbersome, you can use qx (or backticks) to spawn a subshell, execute a shell command (find), and capture its output.
my @names;
@names = qx(find /home/kolkhis/notes -name '*.md');
chomp(@names); # get rid of the newlines at the end
foreach my $n (@names) {
-f $n && print "Regular file: $n\n";
}
If you skip chomp(@names);, the filenames will still contain trailing newlines, and -f will not work on them.
Reading from a Piped Input Stream¶
You can use the open function with the -| mode to read from a piped input stream.
# Single-string form: the command line is parsed by the shell
# open(my $fh, '-|', 'find /home/kolkhis/notes -name "*.md"') or die $!;
# List (multiple argument) form: no shell involved, which is safer
open(my $fh, '-|', 'find', '/home/kolkhis/notes', '-name', '*.md') or die $!;
while (my $filename = <$fh> ) {
chomp($filename); # Get rid of the newline
print "File: $filename\n";
}
close $fh;
-| opens a pipe for reading the output of a command.
This runs the find command in a subprocess and produces a filehandle to read from its stdout.
You can also use open to write output to pipes.
Which one should I use?
Both forms use Perl's internal exec() function to execute the command.
But exec() behaves differently when it's passed a single string vs. a list of items.
- When exec() is passed only a single string as the command, and that string contains shell metacharacters, it uses the shell (/bin/sh -c on Unix-like systems, not $SHELL) to execute it. A single string without metacharacters is split into words and run directly.
- When exec() is passed a list, it behaves like C's execvp() function and skips the shell.
So, these two open commands behave a bit differently:
- The first open only has one single string command, so it relies on the shell to parse it.
    - Spawns a shell (like sh -c 'find ...').
    - This version is vulnerable to quoting bugs or shell injection if any part of the command comes from user input.
- The second open has each argument passed as a separate list item, so Perl can bypass the shell and directly invoke the command with execvp()-like behavior.
    - Perl uses fork() to create a child process.
    - Then it uses Perl's builtin exec() function, which wraps C's execvp().
    - Since it's not a string, it does not invoke a shell.
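The difference between the two forms can be seen in a small sketch. It assumes an echo command is available on the system, and the "untrusted" value is a made-up stand-in for user input.

```perl
use strict;
use warnings;

# Pretend this value came from user input (hypothetical example string).
my $untrusted = 'notes; echo INJECTED';

# Single-string form: the shell parses the string, so the ';' starts a
# second command.
open(my $sh, '-|', "echo $untrusted") or die "Can't run command: $!";
my @via_shell = <$sh>;
close($sh);

# List form: the value reaches echo as one literal argument.
open(my $direct, '-|', 'echo', $untrusted) or die "Can't run command: $!";
my @via_list = <$direct>;
close($direct);

print "shell form printed ", scalar(@via_shell), " line(s)\n";
print "list form printed ",  scalar(@via_list),  " line(s)\n";
```

In the string form the shell runs two echo commands; in the list form the semicolon is just part of the text that echo prints.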
Writing to a Piped Output Stream¶
You can also use open with the |- mode to specify that you want to write to a pipe.
Like above, it's safer to pass a list instead of a single string to open when using pipes. For example, open(my $fh, '|-', 'tee', 'output.log') opens a filehandle that pipes out to the command tee output.log.
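A minimal sketch of writing to tee through a pipe, assuming tee is installed. The log lives in a throwaway directory so the sketch does not touch real files.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Throwaway directory for the log file (removed at exit).
my $dir = tempdir(CLEANUP => 1);
my $log = "$dir/output.log";

# List form: Perl runs tee directly, without a shell.
open(my $fh, '|-', 'tee', $log) or die "Can't start tee: $!";
print $fh "Hello!\n";           # tee copies this to its stdout and to $log

# Closing a piped handle waits for the child; $? then holds its exit status.
close($fh) or die "tee failed: " . ($! || "exit status " . ($? >> 8));
```

Because close() waits for the child process, the log file is complete by the time close() returns.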
You can also specify tee -a to append to the file instead:
# Single-string form:
# open(my $fh, '|-', 'tee -a output.log') or die $!;
# List form, to avoid using the shell:
open(my $fh, '|-', 'tee', '-a', 'output.log') or die $!;
print $fh "Hello again!\n";
close $fh;
Just like with reading from a pipe (-|), the form you use matters:
- If you pass a single string:
    - Perl may invoke the shell (sh -c "tee -a output.log") to parse the command
    - This is less safe and can cause quoting issues or shell injection
- If you pass a list of arguments:
    - Perl directly invokes the command using execvp() (no shell)
    - Safer and more predictable
Using open() to Fork and Exec a Command¶
As pointed out earlier for reading from and writing to pipes, passing your command as a list is safer. A call such as open(my $fh, '-|', 'ls', '-alh', '/tmp') tells Perl:
- Do not spawn a subshell.
- Fork the current Perl process.
- Use exec() to run this command directly with those exact arguments.
This works more like how execvp() works in C.
This is safer than handing a command string to the shell (e.g., spawning subshells with qx(ls -alh /tmp) or `ls -alh /tmp`):
- Those versions can use the shell (/bin/sh, not $SHELL) to run the command, and are basically equivalent to sh -c 'ls -alh /tmp'.
- They involve parsing the command string, so quoting and escaping can be tricky.
With the list form:
- No shell parsing means no shell injection risk.
- No need to escape quotes, backslashes, or whitespace in paths.
- No reliance on a shell.
So if you're using the list form of open to run the command, it bypasses the shell (avoids quoting issues), and you pass arguments directly to the command.
Behind the scenes, this is roughly what happens when the mode is '-|':
- '-|' tells Perl to create a child process (a fork).
- The child process then does an exec of the given command.
- The parent gets a filehandle to read from the child's STDOUT.
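The steps above can be spelled out by hand with pipe(), fork(), and exec(). This is only a sketch of the idea, not Perl's actual implementation, and the echo command is an arbitrary example.

```perl
use strict;
use warnings;

# Roughly what open($fh, '-|', @cmd) does, written out manually.
my @cmd = ('echo', 'hello from the child');

pipe(my $reader, my $writer) or die "pipe failed: $!";

my $pid = fork();
die "fork failed: $!" unless defined $pid;

if ($pid == 0) {
    # Child: send STDOUT into the pipe, then replace this process with @cmd.
    close($reader);
    open(STDOUT, '>&', $writer) or die "Can't redirect STDOUT: $!";
    exec { $cmd[0] } @cmd or die "exec failed: $!";
}

# Parent: read whatever the child writes to its STDOUT.
close($writer);
my @output = <$reader>;
close($reader);
waitpid($pid, 0);               # reap the child; its exit status is in $?

print "Child said: $_" for @output;
```

The exec { $cmd[0] } @cmd form is the "indirect object" syntax, which guarantees the list form is used even if @cmd has only one element.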
Using select to Redirect Output¶
The select() function in Perl allows you to switch the "currently selected" filehandle.
The currently selected filehandle determines where output goes when print, write, or say is called without an explicit filehandle.
If an argument is given, and it's a filehandle, it will set the 'selected' filehandle to the one given and return the previously selected one. This will redirect any such print/write/say calls to that filehandle until you switch back.
Without arguments, select() returns the currently selected filehandle.
The currently selected filehandle is STDOUT by default (a builtin filehandle).
open(my $fh, '>>', 'file.txt') or die $!;
select($fh);
print "Hey\n"; # goes into file.txt
select(STDOUT); # switch back to stdout
print "Hey\n"; # goes to stdout
You can keep track of your old filehandle before switching by saving it to a variable:
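A minimal sketch of that pattern, with a throwaway temp file standing in for a real output file:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Throwaway file standing in for a real output file (removed at exit).
my ($fh, $path) = tempfile(UNLINK => 1);

# select() returns the handle that was selected before the switch.
my $old_fh = select($fh);
print "Hey\n";                  # goes into the temp file

select($old_fh);                # switch back to whatever was selected before
print "Back on the original handle\n";

close($fh) or die "Can't close $path: $!";
```

Restoring $old_fh rather than hard-coding STDOUT keeps the code correct even when the caller had already selected something else.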
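To print to standard error:
my $old_fh = select(STDERR);  # select() returns the previously selected handle (STDOUT here)
print "Hello stderr.\n";      # goes to STDERR
select($old_fh);              # restore the previous handle
print "Hello stdout.\n";      # goes to STDOUT
print STDERR "Hello again, stderr.\n";  # an explicit filehandle needs no select()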
Some output-related special variables belong to the currently selected filehandle, so they are also affected by select().
The most common one is the autoflush variable ($|); the format variables ($~, $^, $=, $-, $%) behave the same way. The input/output record separators ($/ and $\ respectively) are global and apply to every filehandle.
When setting $| to autoflush (flush the buffer after every write), it will honor the filehandle that is currently selected.
open(my $log, '>', 'output.log') or die $!;
select($log);
$| = 1;     # Autoflush now applies to $log
$\ = "\n";  # Output record separator (global: print now adds a newline on every filehandle)
print "Logging started.";   # Goes to output.log
select(STDOUT);             # Restore default output location
Reading from STDIN¶
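You can read from standard input (STDIN) using a while loop with the diamond operator around the filehandle (<STDIN>).
STDIN is a builtin filehandle, so it works with the diamond operator just like a handle from open. An empty diamond (<>) also falls back to STDIN when no filenames were passed on the command line.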
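A minimal sketch of the loop. So that it runs on its own, STDIN is reopened onto an in-memory string first; that part is only a stand-in for piped input and would be left out of a real script.

```perl
use strict;
use warnings;

# Stand-in for piped input so the sketch is self-contained. In a real script
# you would drop these three lines and run e.g.:
#   printf 'first\nsecond\n' | perl script.pl
my $fake_input = "first\nsecond\n";
close(STDIN);
open(STDIN, '<', \$fake_input) or die "Can't reopen STDIN: $!";

# Read STDIN line by line until end of input.
my $count = 0;
while (my $line = <STDIN>) {
    chomp($line);               # strip the trailing newline
    $count++;
    print "Line $count: $line\n";
}
```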
In-Place Editing (Emulating -i Behavior)¶
The edit-in-place behavior that is enabled via the -i option (man perlrun) can also be enabled by setting the $^I variable (man perlvar).
This method only works on the ARGV file handle.
So, you must either manually insert the file into @ARGV, or pass the file(s) as command-line arguments to the perl script itself.
#!/usr/bin/env perl
use strict;
use warnings;
for my $file (<*.md>) {
chomp($file);
our $^I = '.bak';
our @ARGV = ($file);
while(<>) {
s/old/new/g;
print;
}
}
Note that we're using our instead of my for the variables. This is because they're global, and my creates a lexical variable for the current scope only, which won't be recognized by the diamond operator.
Using local would also work for this operation. This method is probably better, since the original values of $^I and @ARGV are restored when the enclosing block exits.
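#!/usr/bin/env perl
use strict;
use warnings;
for my $file (<*.md>) {
    local $^I = '.bak';     # keep a backup copy with a .bak suffix
    local @ARGV = ($file);  # restored when this loop iteration ends
    while (<>) {
        s/old/new/g;
        print;              # printed lines are written back into the file
    }
}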
Using File::Find¶
Perl core comes with a File::Find module that is meant to be fully portable across operating systems and shells.
This is preferred for getting a list of filenames over globbing (via glob() or <*>), since globbing is less portable.
For example, if we wanted to find all the markdown files in the current directory and all its subdirectories and save them into an array:
#!/usr/bin/perl
use strict;
use warnings;
use File::Find;
my @markdown_files;
find(
sub { push @markdown_files, $File::Find::name if /\.md$/ },
'./'
);
print "Markdown files in the current directory:\n";
print "$_\n" for @markdown_files;
The find subroutine recursively searches through the directory given, loops over the filenames, and passes them through the anonymous subroutine given.
It takes in a subroutine reference as the first argument, the "wanted" function, which defines what happens with each entry. Each filename (and directory name) is made available to this subroutine in the default variable $_.
So, within that subroutine we define what we want to do with the files that we find.
This anonymous subroutine adds the file to the @markdown_files array if its name matches /\.md$/.
Note: Using $File::Find::name produces the complete path to the file, starting from the directory that was passed to find. If we only needed the basenames of the files, we could use $_ (the default variable).
The $_ variable stores only the name of the file without any directory path.
We can also define our own subroutine beforehand to use as the "wanted function."
sub addfile {
push (@markdown_files, $File::Find::name) if m/\.md$/ && -f;
}
File::Find::find(\&addfile, './');
- Note that we're passing a reference to the subroutine.
For a single directory (without the recursion), this is roughly equivalent to using a glob such as glob('*.md').
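A sketch of the glob version, run against a throwaway directory tree whose file names are invented for illustration. It also shows the main difference from File::Find: glob('*.md') only looks at the current directory.

```perl
use strict;
use warnings;
use Cwd qw(getcwd);
use File::Temp qw(tempdir);

# Throwaway directory tree: a.md, b.txt and sub/c.md.
my $dir = tempdir(CLEANUP => 1);
mkdir "$dir/sub" or die "mkdir failed: $!";
for my $name ('a.md', 'b.txt', 'sub/c.md') {
    open(my $fh, '>', "$dir/$name") or die "Can't create $name: $!";
    close($fh) or die "Can't close $name: $!";
}

my $orig_dir = getcwd();
chdir $dir or die "Can't chdir to $dir: $!";

# glob() expands the pattern in the current directory only; unlike
# File::Find, it does not descend into subdirectories.
my @markdown_files = grep { -f $_ } glob('*.md');

print "Markdown files in the current directory:\n";
print "$_\n" for @markdown_files;

chdir $orig_dir or die "Can't chdir back to $orig_dir: $!";
```

Here sub/c.md is not picked up, whereas the File::Find version would have found it.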