123096 Kumar 3
111111 Kiran 4
323456 AAAA 4
If the user has given input as 123096, the script should remove the
entire line (the one containing 123096). How can I do this?
-Swaroop
I assume that you look for the first token in the line which is
delimited by whitespace. In this case I interpret the input line as a
list, hence comparing the first list element with the pattern. I do this
for all lines in the inputfile and copy them to another outputfile.
set ifp [open {C:\InputFile.txt} r]
set ofp [open {C:\OutputFile.txt} w]
set pattern 123096
while {[gets $ifp line] >= 0} {
if {[lindex $line 0] == $pattern} {puts $ofp $line}
}
close $ifp
close $ofp
exit
Regards - Leo
Hi,
If I am right, by doing it as above, a duplicate file will be
created. To avoid this, do I need to move the output file to the input
file after the script? Moreover, I guess I should use " if {[lindex $line
0] != $pattern} {puts $ofp $line} " to skip the matching line. [I have
replaced == with !=]
-Swaroop
Yes. You can do this from the tcl script itself by:
file rename $output_filename $input_filename
> Moreover, i guess i should use " if {[lindex $line
> 0] != $pattern} {puts $ofp $line} " to skip the matching line. [I have
> replaced == with !=]
In this case I guess it's safe to use lindex directly. And I admit
that I often write code that uses lindex directly on input data. But
you should be aware that lindex is sensitive to unbalanced ", { and }.
By sensitive I mean that your program will abort immediately when
lindex throws an error (unless you [catch] it, of course).
If you can't control the input data format then I'd suggest:
if {[lindex [split $line] 0] != $pattern} {...
or
if {[lindex [regexp -inline {^\d+} $line] 0] != $pattern} {...
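Put together, a version using the safer split-based token extraction might look like this (a sketch, reusing the hypothetical file names and pattern from earlier in the thread):

```tcl
# Sketch: skip the matching line without relying on the line being a
# well-formed Tcl list; split never errors on unbalanced quotes/braces.
set ifp [open {C:\InputFile.txt} r]
set ofp [open {C:\OutputFile.txt} w]
set pattern 123096
while {[gets $ifp line] >= 0} {
    # lindex on the split result is safe for arbitrary input data
    if {[lindex [split $line] 0] != $pattern} {
        puts $ofp $line
    }
}
close $ifp
close $ofp
```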
If the files are sooo large that this is a concern, then the
processing is probably already so slow, that the whole task is
next to infeasible, anyway. :-)
You could also edit in place:
either you just overwrite the portion with dummy chars, e.g. spaces,
or you shift the whole block of data that follows.
The former is a bit easier, but you need to be extra careful with
positioning for the overwrite (seek, tell), and with determining the
number of spaces to write (this depends on both the encoding of the
input file and the length of the currently matched line!)
The latter requires opening the file with "r+", and once the matching
line is found, repeated seek-read-tell-seek-puts-tell. The encoding
*might* be irrelevant there (but no guarantees).
Then your data file exists in only one instance - but you must have
enough memory to hold all the data...
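The overwrite-with-spaces variant could be sketched like this (hedged: it assumes a single-byte encoding and lf translation, so that character offsets equal byte offsets; the file name and pattern are hypothetical):

```tcl
# Sketch: blank out the matching line in place with spaces.
set fp [open "datafile" r+]
fconfigure $fp -translation lf -encoding iso8859-1
while {1} {
    set start [tell $fp]              ;# remember where this line begins
    if {[gets $fp line] < 0} break
    if {[lindex [split $line] 0] == "123096"} {
        seek $fp $start
        puts -nonewline $fp [string repeat " " [string length $line]]
        break                         ;# the trailing newline is left intact
    }
}
close $fp
```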
Okay, the first thing to realize is that flat files, at least under
Linux, Unix, and Windows, have no special access routines. This means
that one has to read in the entire file, then write out the parts of
the file that you want to keep.
Given that there are no silver bullets, there are several ways you
could go at this task:
1. read the entire file into memory, then write everything out to a
new, temporary file, then rename the original, rename the temporary,
and delete the original. This technique keeps the original around
until the last moment, so that, in case of a power failure or some
other problem, you still have the original data available. You are,
however, left with a brief moment (truly less than a second, assuming
decent access to your files), where there is no file by the original
filename present. This would be a problem if the file is critical
(say, a password file, etc.)
2. Read the file a line at a time, writing out a line at a time.
Again, you have to deal with the "write to a temporary file" issues,
but if the original file is very large, then you don't take up as much
memory.
3. Open the original file read, read through to the point where you
want to delete, save the offset from the beginning, read the next line
then open the file a second time, in read/write mode, seek to the
saved offset, and write out the next record, and continue reading from
the first descriptor and writing to the second. WARNING! If you
experience a power outage, program crash, user interference, network
loss, etc. you would end up with an incomplete file. However, the file
does remain in place at all times.
4. you could read in the file, write it out to a database (one record
per line), delete the record required, then read back through the
database, writing out to the original file. Again, you remove the
temporary file, but you again could experience a truncated original
file in the case of a power outage, program crash, etc.
Basically, there is no _safe_ way to do this and ensure that what you
want to do gets done completely in the case of extreme problems. I'd
go with version 1 above, typically.
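Option 1 might be sketched in Tcl roughly as follows (file names and pattern are hypothetical; error handling omitted):

```tcl
# Sketch of option 1: read everything, rewrite to a temp file, and keep
# the original around until the very last moment.
set ifp [open "datafile" r]
set data [read $ifp]
close $ifp

set ofp [open "datafile.tmp" w]
foreach line [split [string trimright $data \n] \n] {
    if {[lindex [split $line] 0] != "123096"} {
        puts $ofp $line
    }
}
close $ofp

file rename "datafile" "datafile.bak"   ;# original still recoverable here
file rename "datafile.tmp" "datafile"
file delete "datafile.bak"
```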
mv datafile t
gawk '$1!="123096"' t > datafile
Which still has the problem of leaving the system without the file for
a period of time. P.S. that can be done on Windows as well - take a
look at any of the Windows unix-utility suites like UWIN, Cygwin,
Microsoft's Interix/SFU software for Windows XP and Windows Server,
the MKS Toolkit, and quite a number of other alternatives. One of these
days, maybe I'll get around to gathering information about all of
these into a page on the wiki...
There aren't many operating systems out there which allow you to just
go into a plain text file to delete lines.
If this is not a one time affair, but something that you need to do
frequently, you might want to consider changing over to use a database
that permits trivial row deletion (which is, I'd guess, most of
them ;-)
Larry W. Virden wrote :
> .....
> 3. Open the original file read, read through to the point where you
> want to delete, save the offset from the beginning, read the next line
> then open the file a second time, in read/write mode, seek to the
> saved offset, and write out the next record, and continue reading from
> the first descriptor and writing to the second. WARNING! If you
> experience a power outage, program crash, user interference, network
> loss, etc. you would end up with an incomplete file. However, the file
> does remain in place at all times.
However, once the copy in place is completed, the file would have to be
truncated at the current write offset. While there are many common
situations where one needs to truncate a file at an arbitrary position
(see the ftruncate() POSIX/SysV function), and this operation is
supported by most modern operating systems and filesystems, it is
unfortunately still impossible with the current stable Tcl release
(8.4), so this approach is not applicable. TIP #208 introduces a new
"chan" command, available in Tcl 8.5, and more specifically the
"chan truncate channelId ?length?" subcommand.
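With chan truncate available, the shift-in-place approach described earlier in the thread could be sketched like this (assumes Tcl 8.5, lf translation so offsets are predictable, and a hypothetical file name and pattern):

```tcl
# Sketch: remove the matching line in place by copying each following
# line up over it, then truncating the leftover tail.
set fp [open "datafile" r+]
fconfigure $fp -translation lf
set writePos -1
while {1} {
    set lineStart [tell $fp]
    if {[gets $fp line] < 0} break
    if {$writePos < 0} {
        if {[lindex [split $line] 0] == "123096"} {
            set writePos $lineStart    ;# start overwriting from here
        }
    } else {
        set readPos [tell $fp]
        seek $fp $writePos
        puts $fp $line
        set writePos [tell $fp]
        seek $fp $readPos
    }
}
if {$writePos >= 0} {
    chan truncate $fp $writePos        ;# cut off the duplicated tail
}
close $fp
```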
<OT>
Working with very large files (say several gigabytes) was probably not
very frequent in 2002 (when Tcl 8.4.0 was released). But with storage
getting less and less expensive, and with most filesystems and OSes
supporting large files, it is quite natural that TIP #206 (later merged
into TIP #208) was proposed some time later (proposed June 2004,
accepted November 2004).
My feeling is that this is just one example, among many others, of Tcl
gradually getting out of sync with some developers' needs. Among all
the goodies that are part of Tcl 8.5 (see http://wiki.tcl.tk/10630),
many offer solutions to immediate problems developers are faced with.
Having them still unavailable in the stable (and most widely used)
branch, 3 to 5 years later, doesn't help give Tcl a dynamic "brand
image".
I have no judgment about the 8.5 roadmap, I understand the core
developers are already doing impressive work on it, and I'm not
advocating here for a quick release. I'm only concerned about the
opportunity to introduce new features in Tcl more often than once every
5 years or so, so that a larger community can view Tcl as an agile
language, backed by a dynamic community, and offering practical
solutions to their needs.
No doubt some features introduced in 8.5 need to wait for a major
release, because they break compatibility, need long validation, or
because they imply refactoring of Tcl core code. But others could be
very easily backported to (or even just put in) 8.4. I'm thinking of a
Tcl/Tk based on 8.4 for stability, with e.g. new commands and
subcommands like chan, dict, lassign, lrepeat, string reverse, encoding
dirs, binary with new formats, maybe Xft support, etc...
Maybe this possibility has been already discussed among TCT members?
</OT>
Eric
-----
Eric Hassold
Evolane - http://www.evolane.com/
Well.. the only dangerous part is:
mv datafile t
On most modern filesystems this is fairly atomic and safe, just like a
database, since it only involves changing the file's name. On a
journaled filesystem, if this operation happens to fail then on next
powerup the file name will be restored to its original name.
So if you're worried about this then don't use a filesystem like
FAT32. Instead use NTFS or ext3 or HFS+ (and remember to turn on
journaling for HFS+).
This is way offtopic, but since we're talking about UNIX commands this
one will edit the file in-place:
pat=123096
ex file <<END
/${pat}/d
x
END
> Well.. the only dangerous part is:
>
> mv datafile t
Perhaps my comment wasn't clear - my concern wasn't so much the safety
of the rename, but the fact that datafile's contents are temporarily
unavailable using this technique. Think of a password file - with this
approach, until the gawk finishes, there would be no password file on
the system... not a good state for your machine to be in. Now, the
shell will immediately create an _empty_ datafile with the next line.
That still isn't the way you'd want your password file. There are other
types of files that are similarly problematic.
My point is that for some files, this is not an issue. If the file is
purely used by one human reader, and that human reader is the one
working on the file, then the temporary missing file is no big deal.
If, on the other hand, one is working in a production environment
(which is my typical mode of thinking and working), and the presence
of the file, with contents, is critical to keeping the data center
running, then you don't want a solution that removes that file, even
for a bit. Instead, you have to think of alternatives - like the
database approach.
This can be avoided by simply changing the order of operations done,
i.e. read the contents of the original file and write the new contents
to a temporary file. Then move the temporary file as the new original.
Using awk as an example:
awk '$1 != "123096"' datafile > tmpfile
mv tmpfile datafile
Now the datafile always exists, and because moving a file is an atomic
operation on most operating systems/filesystems, the datafile is always
in a consistent state (unless your script botched the file, of course).
This same technique of course works when using tcl also, and I'd also
suggest using it in this case:
set ifp [open "datafile" r]
set ofp [open "tmpfile" w]
puts -nonewline "Pattern? "
flush stdout
set pattern [string trim [gets stdin]]
while {[gets $ifp line] >= 0} {
if {[lindex $line 0] != $pattern} {puts $ofp $line}
}
close $ifp
close $ofp
file rename "tmpfile" "datafile"
exit
This can be cured by simply rearranging the operations in the original
script:
awk '$1!="123096"' datafile > tmpfile
mv tmpfile datafile
Now the datafile always exists and contains consistent data.
The same thing can of course be accomplished using tcl, too:
open("file", O_RDONLY) = 3
open(".file.swp", O_RDONLY) = -1 ENOENT (No such file or directory)
open(".file.swp", O_RDWR|O_CREAT|O_EXCL, 0600) = 4
open(".file.swpx", O_RDONLY) = -1 ENOENT (No such file or directory)
open(".file.swpx", O_RDWR|O_CREAT|O_EXCL, 0600) = 5
close(5) = 0
close(4) = 0
open(".file.swp", O_RDWR|O_CREAT|O_EXCL, 0600) = 4
close(3) = 0
open("file", O_RDONLY) = 3
close(3) = 0
close(4) = 0
uwe
> This can be avoided by simply changing the order of operations done,
> i.e. read the contents of the original file and write the new contents
> to a temporary file. Then move the temporary file as the new original.
Yes! That meets the objections that might arise.
Until two people do that at once.
I never understood why everyone adopted the most primitive useless file
system organization available as the "standard".
--
Darren New / San Diego, CA, USA (PST)
His kernel fu is strong.
He studied at the Shao Linux Temple.
Well, true - there is that possibility. One would need to establish
some sort of locking mechanism.
>
> I never understood why everyone adopted the most primitive useless file
> system organization available as the "standard".
I suspect the simplicity scores points for less to maintain, less to
go wrong in the filesystem code itself, etc.
Nope, that never happens. Rename is the only "file destroying" part.
mv datafile t
# at this point datafile still exists, but renamed to t
gawk '$1!="123096"' t > datafile
# at this point the *old* datafile is still not destroyed,
# since we haven't deleted t yet. It would be wise at this point
# to check the validity of the new datafile before deleting t.
# If something went wrong we can simply restore the
# datafile by renaming t.
or, what is more typically found in scripts (and suggested by Atte
Kojo):
gawk '$1!="123096"' datafile > temp
# nice to check if temp is valid at this point
mv temp datafile
# if the move operation above failed, datafile wouldn't be destroyed!
But note that either way, at no point is the original file destroyed
accidentally during power failure. At least not on journaled
filesystems. Also note that:
gawk '$1!="123096"' datafile > temp
cp temp datafile
rm datafile
is not equivalent and is not safe. Rename, don't copy.
Ah yes, for that you'd need file locks. Unfortunately, on unix, almost
all locking mechanisms end up being merely advisory, not compulsory.
Any program with access to that file is free to ignore your lock. I'm
sure there's a page on the wiki discussing this.
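A typical advisory-lock idiom in Tcl relies on exclusive file creation; the .lock suffix here is just a hypothetical convention, and every cooperating process must honour the same one:

```tcl
# Sketch: advisory locking via an exclusive-create lock file.
# The open fails (and we catch it) if the lock file already exists.
proc acquireLock {path} {
    if {[catch {open $path.lock {WRONLY CREAT EXCL}} fp]} {
        return 0    ;# another process holds the lock
    }
    close $fp
    return 1
}

proc releaseLock {path} {
    file delete $path.lock
}

if {[acquireLock datafile]} {
    # ... rewrite datafile safely here ...
    releaseLock datafile
}
```

Being advisory, this protects only against processes that use the same acquire/release discipline; nothing stops an unrelated program from writing to the file directly.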
> I never understood why everyone adopted the most primitive useless file
> system organization available as the "standard".
Well.. when Palm tried to improve it everyone cried out and complained
that they wanted a "real" filesystem (whatever that means).
I do sometimes wish that more OSes support compulsory file locks. But
then I'll end up periodically facing "undeletable file syndrome" when
programs crash like on Windows (Interestingly I've seen malware
treating this as a *feature* preventing antivirus from deleting them).
I suspect that it is because the pre-Unix systems, with a zillion
different types of files with different types of indices and locks and
access restrictions (not to mention all sorts of things that Unix
caused to be treated like files but were not files on pre-Unix
systems), were, correctly, considered hopelessly baroque and
obstructive. The few useful features of the Dark Ages of file systems
got thrown out with the bathwater.
Having used such things long ago, let me observe that there was a
tremendous lot of bathwater that we were well rid of.
Donal.
And let me note that while it's true there was a lot of junk on some
systems, there was a lot of good too.
I think we're starting to get to the point where what we want to do is
again running up against what the machines are able to do. Systems back
then had mechanisms for (say) seeking to meaningfully-identifiable
places in a file, adding/deleting/modifying records in the middle, and
so on, because you couldn't afford to duplicate a multi-megabyte file
just because you added something in the middle. Nowadays we're getting
files too large to reasonably fit on one disk, and we have no way of
editing them. We're reinventing the wheel here.
If UNIX had just a couple more operations, like "insert some bytes" and
"delete some bytes", which would be pretty easy to add[1] by simply
keeping a "bytes used in this block" kind of counter for each block, you
could eliminate a whole class of problems.
Of course, what people are doing is writing their own file systems and
publishing them as services rather than OS APIs. I refer here to things
like Google's file system, Amazon's S3, database servers, and so on. So
maybe what we really need is portable IPC that doesn't suck, instead.
[1] Of course, all the infrastructure like locking and caching and stuff
would have to be updated, but it's conceptually simple.
I think Larry's point is that if another process now tries to access
this file using the name "datafile" it won't find it. Reversing the
operations as Atte Kojo suggests solves this.
-- Neil
Those are things that it would be nice to have. But it's amazing how
much can be done without them, and they would certainly make
implementing a filesystem much more difficult...
Donal.
I can't imagine why they would make implementing the file system more
difficult, other than perhaps lseek(), which is a silly interface to
start with for a variety of reasons.
Remember that the point of using a flat file system wasn't that it was
better for programmers, but that it was easier to implement in the
kernel and matched the semantics of mapping a file into memory space a
la Multics.
Of course, a mechanism for having UNIX-style files and more powerful
files that nevertheless could be read compatibly would be best. To have
to know which kind of file your source code is in so you can compile it,
for example, is one of the bad features that some of the old file
systems indeed had.
Inserting a block should be fairly easy, I admit, but inserting a byte
at a random location? That's a whole 'nother kettle of fish.
Donal.
No, that was not what he was worried about:
On May 15, 7:50 pm, "Larry W. Virden" <lvir...@gmail.com> wrote:
> but you again could experience a truncated original
> file in the case of a power outage, program crash, etc.
>
> Basically, there is no _safe_ way to do this and ensure that what you
> want to do gets done completely in the case of extreme problems. I'd
> go with version 1 above, typically.
Larry still had the notion that there is no safe way to modify a file
in the event of a random power outage. Which isn't true anymore for
modern journaled filesystems. I was just pointing out that Richard's
solution doesn't lose data in case of failure (the fact that the data
is in a file with a different name is a different issue).
And if you're keeping track of the number of used bytes in each block,
that's pretty much the same operation. If there's room in the block,
insert the byte. If not, insert a new block in between and insert the
byte into it. That's why I said it just makes byte-offset seeking
harder, because there's no fixed relationship between bytes and blocks.
On most implementations of filesystems, it's not the same. Blocks are
essentially a linked-list-like structure, sometimes implemented as a
table of link pointers (like FAT32), sometimes as an actual linked list.
Inserting a random block in the middle of a sequence of blocks is very
fast. Simply point the current table entry to the location of the
inserted block and point the inserted block's table entry to the next
block in the list.
Bytes on the other hand don't have a table (or tables) to keep track
of their order. Assuming a block of less than 65k and a table per
block keeping track of bytes the same way we keep track of blocks,
you would need two bytes of overhead per byte of data. This is
obviously a huge waste of space; basically two thirds of your disk
becomes unusable.
> If there's room in the block,
> insert the byte. If not, insert a new block in between and insert the
> byte into it. That's why I said it just makes byte-offset seeking
> harder, because there's no fixed relationship between bytes and blocks.
Ah this is much more sensible. Instead of treating bytes like blocks
have an automatic function at the OS level to do insert-and-move for
you. Kind of like a memmove() function for disk I/O. However, such an
algorithm can easily be implemented by the user. And unix has a long
tradition of letting such things be solved by the user (see:
http://www.jwz.org/doc/worse-is-better.html for example). Doing it at
the OS level of course has the advantage of being able to prevent
simultaneous edits from happening.
I think that UNIX filesystem are simple (primitive if you want)
because UNIX itself is simple (primitive). It's just like keeping all
configuration information in ASCII files (or using ex ;-).
The system provides a very simple filesystem and it's up to the
programmer to implement a more advanced (complicated) mechanism on top
of it in the rare cases it's needed. Take syslog for example: file
access is serialized through a global daemon that is the only
process writing to the log files. Really simple and elegant if you ask
me :). If you want something more complicated than that, then you
should be using a database, which has its own methods for locking,
caching and stuff.
It would take a lot of legwork to convince me to use a filesystem with
almost database-like functionality just because a few programs might
need the features ;-)
I think you are contradicted by the quoted history of this thread (above):
"Which still has the problem of leaving the system without the file for
a period of time."
and by a previous message, where Larry says:
"Think password file - with this
approach, until the gawk finishes, there would be no password file on
the system... not a good state to have your machine."
-- Neil
> I think you are contradicted by the quoted history of this thread (above):
Now, now, no need to argue over what I thought or meant. I'm sitting
right here - feel free to ask me, in private or public, what I meant.
I'll try, yet again, to explain.
There are several states that critical files (like passwords or other
types of access control or resource listing files) can be in.
1. Fully existent. Think a quiet period when no changes are occurring
to the list of users and passwords.
2. Not present - in some of the previous discussions, this would be
the case if one did the move of password to some temporary name.
3. Incomplete - again, this would be the case from the earlier
examples, where one moved the file and then rewrote it.
4. Inconsistent - this state applies to the last example, where one
creates a temporary file then does a move of the temporary file to the
authority file. So, how can there be inconsistent state here? Let's
play "imagine this"...
Application 1 opens a file and starts reading through, looking for
information.
Application 2 creates a temporary file, consisting of new,
replacement, or remaining records after a deletion.
Application 1 continues reading
Application 2 completes the creation of the replacement file and does
the move.
Application 1 is reading a file that doesn't exist any longer,
essentially. I don't believe that it is going to see the new file - it
didn't open it. So it is only going to see what was in the original
file, which must be in cache or something.
This is where it would really be useful to have fully functional file
locking, so that at the time of opening a file, one opens it with
locking that says "someone is reading this file" or "someone is
writing this file". If someone is writing the file at the time someone
wants to read it, then likely one would wait. If someone is reading
the file at the time someone else is reading it, no problem, let it
happen. If someone is reading the file when someone wants to move the
temporary file into place, then I guess one would wait, or, perhaps,
some sort of "override lock" mechanism might be put into place that
would signal the reader "hey, something has changed - you've lost your
lock, you need to start over somehow ".
J Average Developer is going to look at this thread and say "boy, that
old guy is certainly paranoid". And I say "yup, young whippersnapper.
After programming for developers for 30 years ... and in particular
doing maintenance fixes for most of that time on the same code base,
you'd find you became paranoid as well."
And sometimes as a b-tree, and ... yes.
> Inserting a random block in the middle of a sequence of blocks is very
> fast. Simply point the current table entry to the location of the
> inserted block and point the inserted block's table entry to the next
> block in the list.
Right. You now have an empty block, ready for up to 510 bytes of data. ;-)
>> If there's room in the block,
>> insert the byte. If not, insert a new block in between and insert the
>> byte into it. That's why I said it just makes byte-offset seeking
>> harder, because there's no fixed relationship between bytes and blocks.
>
> Ah this is much more sensible. Instead of treating bytes like blocks
> have an automatic function at the OS level to do insert-and-move for
> you.
Well, sure. If the block is in memory, inserting something in the middle
is quick compared to writing it back out again.
> Kind of like a memmove() function for disk I/O. However, such an
> algorithm can easily be implemented by the user.
Sure. Now go compile it.
If it's in the OS, then when someone opens the file for sequential
reading (i.e., the only mode available in UNIX), then the blank parts of
the blocks are skipped.
I used a system that had three kinds of files:
Random were like partitions - consecutive cylinders of a fixed size, but
allocated inside another partition. Uually used for things like database
managers.
Consecutive - a file full of records. You could read forwards and
backwards, delete and insert records, etc. Seeking required basically
reading through the file to find the record you wanted.
Keyed - A btree structure pointing to records. This is what (for example)
an editable file was. You could insert and delete records and it would
update the b-tree and so on. The same routines were used internally to
point directories at files.
But if you opened a keyed file in consecutive mode, it would read back
records based on their key order. So you could edit a file, delete and
insert lines, then hand it to the compiler without disturbing the keys.
It wasn't perfect (e.g., you couldn't use consecutive-insert on a keyed
file, since you would have to specify the key), but it worked pretty well.
Nowadays, you might want multiple keys, or files that span disks or
servers, or etc all "built in". You've lost the "everything is an array
of bytes" already, in that remote file systems don't really work that
way any more (see google file system, amazon S3, etc).
> And unix has a long
> tradition of letting such things be solved by the user (see:
> http://www.jwz.org/doc/worse-is-better.html for example). Doing it at
> the OS level of course has the advantage of being able to prevent
> simultaneous edits from happening.
It also prevents every program in existence from having to be rewritten
to know about every file format. See BSD readdir(): remember when, to
get a directory listing, you just used open(".", O_RDONLY)? Here,
clearly, worse is not better.
And sure, you could also do RAID1 in software at the user level without
involving the kernel. You could also do all the permission handling
stuff, encrypting/decrypting on the fly, etc at the application level.
We also don't do that. :-)
I dispute the "worse-is-better" philosophy, myself.
I've spent a fair amount of time thinking about this, and come to the
conclusion that a *lot* of benefit would come from being able to insert
and delete bytes in the middle of a file transparently. It's pretty much
the minimum you need to make all the user-level stuff not have to
reinvent the file system to accomplish a lot of stuff.
Traditional Unix filesystems (i.e. anything local with inodes) do this
for you. What happens is that when you open the file, you increment the
(internal) reference count on the inode so that when the file is
deleted, it doesn't *actually* get deleted until the last process with
it open closes that file handle. This is *very* nice indeed, and is a
major factor behind the way that Unix systems don't need to be rebooted
very often, even when carrying out fairly significant surgery on
applications and libraries. (IIRC, by convention sending SIGUSR1 to
services gets them to drop open handles and reopen everything as well as
rereading their config files.)
Windows instead goes for an approach using locking, with the side effect
that systems are far more likely to need a reboot after a library
update. (There's even a special call to arrange for a file to be deleted
on next reboot, precisely to work around the over-zealous locking...)
Donal.
Agreed. I'd be happier if Linux had non-suckful IPC too. :-)
But the fact that if you want to (say) edit the syslog file you have to
turn off syslog (so it's not writing to the file while you edit it),
that tells me you're missing something in the kernel. That you have to
actually have the web server (for example) start using a new log file so
you can rotate the old log records to a different partition says there's
something wrong there, to me. That you need lock files so sendmail
doesn't clobber something while you're reading your mail with a MUA says
there's something missing there.
When you actually say "what are all the work-arounds I use to account
for the crummy file system", you begin to realize the work-arounds are
so common you don't even notice them any more.
> It would take a lot of legwork to convince me to use a filesystem with
> almost database-like functionality just because a few programs might
> need the features ;-)
Have you ever used one?
I'm not being snide here. I'm just pointing out that I bet a bunch of
people who never used a hierarchical file system would say the same
thing about directories.
Once you get used to being able to seek in files based on
contextually-relevant information, and being able to edit a large file
without having to rescrub the whole thing every time, you realize how
much you're missing.
I.e., the reason you only have a few programs that take advantage of
such functionality is that such functionality is so difficult to take
advantage of. And there are a ton of programs that could certainly use
such functionality, but instead just rewrite the entire file.
No. If nobody is writing the file, or waiting to write the file, no
problem. Otherwise, you get what Linux does (did?), which is
write-starvation.
> J Average Developer is going to look at this thread and say "boy, that
> old guy is certainly paranoid".
Nah. Just worried about reliability. It's amazing how many corner cases
just plain aren't handled right in lots of code.
The only files that you can't open in a way that allows deleting them
while they're in use are running executables. If you have a data file, a
log file, a config file, etc., just open it with delete-while-open
permissions turned on, and you can delete it while it's open, with the
same semantics UNIX uses.
It's just not the default in most implementations of stdio, apparently,
for some reason.
>Agreed. I'd be happier if Linux had non-suckful IPC too. :-)
Whatever that means. Want to bet that if you provide a high-level
definition of what you want then loads of people will argue about it?
>But the fact that if you want to (say) edit the syslog file you have to
>turn off syslog (so it's not writing to the file while you edit it),
>that tells me you're missing something in the kernel.
If you need to edit it, it's not a log file. The mechanism is fine for
what it needs to do. If you're talking about keeping summary instead
of detail, see below.
> That you have to
>actually have the web server (for example) start using a new log file so
>you can rotate the old log records to a different partition says there's
>something wrong there, to me.
So you want to allow infinite log files? Anything that writes a log
should have a way of splitting "old" from "current" - then you can do
what you like with the "old" - summarize, or even edit it!
>That you need lock files so sendmail
>doesn't clobber something while you're reading your mail with a MUA says
>there's something missing there.
Missing? If two processes need to change the same thing you need a
lock (type unspecified). Don't mistake an imperfect implementation for
evidence either way - whatever mechanisms are provided can be misused.
>When you actually say "what are all the work-arounds I use to account
>for the crummy file system", you begin to realize the work-arounds are
>so common you don't even notice them any more.
If you think "work-around" you will do the wrong thing. Unix/Linux is
a low-level platform. If you need something that is not there you have
to find it or create it. Most people don't look very far, and whether
they have looked properly or not those who create something don't
usually make much effort to make it separable so that it could be
re-used.
Editing includes rotating the log file, removing old entries, etc.
> So you want to allow infinite log files?
Sure, why not? NTFS does, as well as anything else where you can delete
from the front of the file without messing up the back of the file. Look
up, for example, how the USN journal is implemented. Each entry has a
unique serial number, which is the lseek() offset in the journal file
where the entry was written. Entries get deleted off the front, but that
doesn't change where the later entries are - it just reclaims the space
for entries that everyone has presumably seen.
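The offset-as-serial-number scheme can be modelled in a few lines of Tcl. This is a toy sketch, not the real USN journal API; the proc names and array layout are invented for illustration:

```tcl
# Toy model of an offset-keyed journal: each record's id is the byte
# offset where it was appended, so trimming the front never renumbers
# the records that survive.
proc journal_append {jvar entry} {
    upvar $jvar j
    set id $j(next)
    set j(rec,$id) $entry
    incr j(next) [string length $entry]
    return $id
}
proc journal_trim {jvar upto} {
    upvar $jvar j
    foreach key [array names j rec,*] {
        if {[lindex [split $key ,] 1] < $upto} {
            ;# reclaim the space everyone has presumably seen
            unset j($key)
        }
    }
}
set j(next) 0
set a [journal_append j "first entry"]
set b [journal_append j "second entry"]
journal_trim j $b
# the second record is still addressable by its original offset $b
```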
> Anything that writes a log
> should have a way of splitting "old" from "current"
And that's my point. Every program that writes a log file under UNIX
needs a way to split the old from the current. If *every* program has to
implement a mechanism for doing something with a file, wouldn't it make
sense to put that in the file system?
As it is, it's fairly difficult to split a log file at a precise
boundary, such as exactly at midnight, unless it's built into the
program that happens to be writing the log file, and you know in advance
that's what you want to do. (BTDT.)
Contrast with a file system that actually stores records you can delete.
Simply open the log file, copy the records you want to "rotate", and
delete them off the front of the log file as you go. You can even delete
"normal" records and keep the "abnormal" records in the log file, for
example. I've never seen anyone write a "SQL Table Rotate"
program. You just select into another table, delete from the
original, and you rotate exactly the records you want to rotate.
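Lacking such a file system, the select-and-delete pattern can at least be expressed over in-memory records; a minimal Tcl sketch (proc name and sample records are illustrative):

```tcl
# Sketch of "rotate exactly the records you want": move records
# matching a predicate from the live list to an archive list,
# mirroring SELECT INTO ... followed by DELETE FROM on a table.
proc rotate_records {liveVar archiveVar predicate} {
    upvar $liveVar live $archiveVar archive
    set keep {}
    foreach rec $live {
        if {[apply $predicate $rec]} {
            lappend archive $rec   ;# "select into" the archive
        } else {
            lappend keep $rec
        }
    }
    set live $keep                 ;# "delete from" the original
}
set live {{info boot} {error disk} {info login}}
set archive {}
rotate_records live archive {rec {expr {[lindex $rec 0] eq "info"}}}
# live keeps only the "error" record; archive holds the two "info" ones
```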
> Missing? If two processes need to change the same thing you need a
> lock (type unspecified).
Yes. And that lock (or contention resolution) should be in the file
system, rather than as a user-level work-around for a lack of
functionality. I shouldn't have to lock the entire file to delete one
message out of it. I shouldn't need 2G of disk space to delete data from
a file holding 1G of data. Try working with a 40-gig tar file some time,
trying to keep it synced to a directory tree somewhere, just for
example. It takes way longer than syncing two file systems, because you
have to frob 80 gig of data for every operation. As data gets bigger,
this will get more and more painful over time, to the point where folks
with big data are already reimplementing the file system in various ways
to use better semantics.
If I could delete individual records from a file, sendmail could just
append messages as new records, and my MUA could delete anything already
there, and neither of us would have to explicitly lock the file. Of
course, there would be locking of some sort inside the code implementing
the file system, but that's where it belongs, so everyone can use it
consistently.
Note that you can say "well, implement that sort of thing yourself", but
then all your tools like grep and CRM114 and all that have to know your
privileged file format.
> If you think "work-around" you will do the wrong thing. Unix/Linux is
> a low-level platform.
That's what I'm saying, yes. The things I mention are work-arounds for
the fact that Linux is a low-level platform. You may not view them that
way, but I've used enough FSes to view them that way.
I program mostly in Tcl, too, because C is a low-level platform.
Anything I want to do in Tcl, I can obviously do in C also. That doesn't
mean that C is as good as Tcl.
Very well, implement that sort of thing *in the filesystem* yourself
(together with syscalls to access it of course) and then adapt the
tools to take advantage. Take care to get it right without massive
impact on existing code, of course.
It's really very simple: if you're moaning but not working on (or
paying for someone else to work on) at least an experimental
demonstration of how such things are fixed, you're part of the problem-
space and not the solution-space. This applies here, and to anything
else technical too. (Also in the non-tech arena, for that matter).
Donal.
I'm not moaning. I'm having an intellectual discussion. Surely you can
tell the difference. I'm simply pointing out the inherent problems with
having an overly-simplified file system that most people don't seem to
even realize is *overly* simplified, thinking that it's a *good* and
*intentional* thing that it is so primitive. (Heck, I'm not even
flaming. :-)
I apologise that I'm having it in the wrong forum. Given the ease of
ignoring a thread, I don't think that's too problematic.
> but not working on (or
> paying for someone else to work on) at least an experimental
> demonstration of how such things are fixed,
They were fixed many decades ago, in mainframe operating systems, PR1ME
OS, and so on. In systems where I/O is the bottleneck instead of
computation, you get much more powerful file systems. Google, Amazon,
and Oracle are all improving file systems as well, altho they have to do
it layering on top of an existing primitive file system and using a
completely different API to get to it. Even NTFS beats the pants off of
most UNIX file systems, in many ways.
> you're part of the problem- space and not the solution-space.
Nonsense. This means one is not allowed to talk about possible
improvements in a system unless one is willing to code those
improvements yourself. Debugging consists of three phases:
1) Recognise the bug
2) Determine what causes it
3) Determine how to fix it
You're suggesting that anyone who does (1) without doing (2) or (3)
should just STFU. I'll respectfully disagree with that.
I'm saying that the computing world has been round this mill more than
enough times for (1) and (2) to be old hat. Time to get on with (3). :-P
Donal.
It's been around (3) often enough that people shouldn't be arguing over
(1) or (2) anymore either. ;-)
Well, I suspect that you're largely out on your own here. Modern
filesystems represent a balance between speed, stability and complexity,
and they tend to skew the balance towards speed and (especially)
stability. Now, if we were to add sub-block byte-range splices as
primitive operations, we have to work out how to do the operations
quickly while still supporting other ops.
One way might be to keep a linked list (or, more likely, B-tree) of
sections of contiguous bytes. That would allow inserts and deletes to be
really quick most of the time, and seeking wouldn't be made much slower
than it is right now. But if you had a flat (single section) file that
you then inserted a single byte into somewhere near the beginning,
before subsequently mmap()ing the file, you'd have a horrible mess of
non-aligned data manipulations to do in the memory manager...
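The list-of-contiguous-sections idea is easy to prototype outside the kernel. Here is a piece-list sketch in Tcl, assuming in-memory chunks, where an insert splits one chunk instead of shifting everything after it (a deliberate simplification of the B-tree version):

```tcl
# Minimal piece-list: the "file" body is a list of contiguous chunks.
# Inserting text at byte position pos splits at most one chunk, so
# the data after the insertion point is never copied.
proc pieces_insert {piecesVar pos text} {
    upvar $piecesVar pieces
    set out {}
    set seen 0
    set done 0
    foreach chunk $pieces {
        set len [string length $chunk]
        if {!$done && $pos <= $seen + $len} {
            set cut [expr {$pos - $seen}]
            lappend out [string range $chunk 0 [expr {$cut - 1}]] \
                        $text \
                        [string range $chunk $cut end]
            set done 1
        } else {
            lappend out $chunk
        }
        incr seen $len
    }
    if {!$done} {lappend out $text}
    set pieces $out
}
set pieces [list "hello world"]
pieces_insert pieces 5 ","
# joining the chunks yields "hello, world"; only one chunk was split
```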
Insert/delete of units of a page or cluster, I could see being made
efficient and have as reasonable things to work with. But byte-level
stuff would suck, and would be next-to-impossible to make not-suck. (I
remember using record oriented FSes back as an undergraduate, and they
were horrible for text; there was a special command to get rid of the
wasted bytes and that had to be run regularly.)
If we look at locking, the situation is different. Firstly, each of the
different styles of locking is not necessarily superior. The styles used
by Windows are more rigorously enforced, but this has the downside of
causing more trouble when something is holding a lock a bit too tightly
(alas, too common). The styles used by Unix (being advisory and
mandatory) are lighter-weight and less likely to get in people's way,
but can go wrong. Nobody has good locking on networked filesystems of
any kind; this is an inherent problem of distributed systems, and can't
be resolved (or not without causing worse trouble of one kind or
another, either lock failures or deadlocks).
The current FS APIs represent a large-scale consensus on what is really
needed. You want to change it? You've got a real battle on your hands
(nobody wants an unreliable FS, nobody wants a desperately slow FS, many
don't know or care that they don't have all conceivable operations) as
you're seeking to move away from a (pretty good) local optimum.
Donal (hmm, this message is way too long and off-topic).
Sure. I was discussing the APIs, not the file systems themselves.
Putting more complex semantics on top of something like a UNIX file
store without breaking the semantics of the APIs already there is
admittedly daunting.
> The current FS APIs represent a large-scale consensus on what is really
> needed.
I disagree. Note that every big system tends to use something other than
the native FS as the backing store (Google FS, the stuff they left out
of Longhorn, SQL servers, Mac OSX FS layers, etc), and many popular file
systems implement stuff like name/value tags on files (Linux), multiple
streams (NTFS), and "object" style files (Mac HFS). I believe this trend
will continue in the future as I/O starts becoming the bottleneck of
general computation again. Even Linux tends to start using directories
full of files instead of individual files, even for a bunch of
associated files, when it would be too tedious or inefficient to manage
space in a flat file (such as MySQL's use of separate files for database
tables, various configuration trees, Beagle caches, etc.)
I'm simply predicting this stuff will start moving more and more into
"mainstream" file systems instead of additional layers, and it'll happen
faster on OSes that are more API-oriented than format-oriented and OSes
that allow for greater experimentation with file system changes
implemented by users.
When I expect to use this kind of requirement more than once, I try
to make a convenient API, e.g., in this case, treat the file as a
list:
# the api
proc with_file_as_list {fname listVar body} {
    upvar $listVar lines
    set fid [open $fname r]
    set lines [split [read $fid] \n]
    close $fid
    set copy $lines

    uplevel 1 $body

    if {$lines ne $copy} {
        set fid [open $fname w]
        puts $fid [join $lines \n]
        close $fid
    }
}
# use some list manipulation package
package require struct::list
with_file_as_list lines.txt lines {
    set lines [struct::list filter $lines \
        {apply {line {expr {[lindex $line 0] != 123096}}}}]
}
Alas, the environment I use most typically doesn't use local disk
much. Instead, NFS is king, with farms of file servers and, these
days, specialized network storage devices. And I still don't trust NFS
locking. I still remember the days when apps would crash after having
created an NFS lock, leaving the app in a state where one had to
reboot to fix things. And of course, in a server environment, you
don't want to have to reboot the server to fix such things...
Of course, part of the problem is the design of these files. Why are
they writing to one log file, then rotating that file to a new name
while creating a new one? Why not just open a log file a day? Or one a
week/month/year (whatever time span is used for splitting)? If the
application designer simply took a parameter indicating the frequency
of creating a new log file (NN minutes/hours/days/months/years), then
nothing extra would be necessary - the application would just close
the old, open the new, and move on.
The point I'm trying to make is that many of the problems that people
attempt to work around stem from the "simplicity" design strategy. So
instead of someone writing a library for managing log files, they use
a simple "open file/write to file" strategy and allow the policy
management to fall to the administrator. And I'd suspect that in this
case, more time/energy/money is spent on the admin juggling log files
than would be spent on writing and debugging a library which used some
sort of config parameters to do the juggling, then integrating calls
to the library into each application generating logs that need
juggling.
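The one-log-file-per-time-span idea takes only a few lines of Tcl anyway; a hedged sketch, where the proc name and filename pattern are made up for illustration:

```tcl
# Compute the target filename from the clock on every write, and
# reopen only when it changes (i.e., when the day rolls over).
proc log_write {msg} {
    global logFile logChan
    set want "app-[clock format [clock seconds] -format %Y-%m-%d].log"
    if {![info exists logFile] || $logFile ne $want} {
        if {[info exists logChan]} {close $logChan}
        set logChan [open $want a]
        set logFile $want
    }
    puts $logChan $msg
    flush $logChan
}
log_write "server started"
```

A parameter for the frequency (NN minutes/hours/days) would just change the `clock format` pattern used to build the name.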
> # the api
> proc with_file_as_list {fname listVar body} {
> upvar $listVar lines
> set fid [open $fname r]
> set lines [split [read $fid] \n]
> close $fid
> set copy $lines
>
> uplevel 1 $body
>
> if {$lines ne $copy} {
> set fid [open $fname w]
> puts $fid [join $lines \n]
> close $fid
> }
>
> }
>
> # use some list manipulation package
> package require struct::list
>
> with_file_as_list lines.txt lines {
> set lines [struct::list filter $lines {apply {line {expr {[lindex
> $line 0] != 123096}}}}]
>
> }
And don't forget to write out that file afterwards. And of course,
hopefully this is a single machine/single user/single program type
system, because otherwise, you need to add locking, at the very least.
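One way to bolt that missing locking onto the with_file_as_list proc quoted above is a lock file taken with O_EXCL. A sketch only: it is advisory, and a crashed holder leaves a stale lock behind that needs manual cleanup.

```tcl
# Serialise the read-modify-write by creating fname.lock exclusively;
# whoever succeeds at the CREAT EXCL open holds the lock.
proc with_lock {fname body} {
    set lock $fname.lock
    while {[catch {open $lock {WRONLY CREAT EXCL}} fid]} {
        after 50   ;# someone else holds the lock; retry shortly
    }
    close $fid
    try {
        uplevel 1 $body
    } finally {
        file delete $lock
    }
}
```

Usage would wrap the whole block: `with_lock lines.txt { with_file_as_list lines.txt lines {...} }`.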
Or even better than integrating calls to the library: using pipes.
This way, the app doesn't have to be modified. It just writes to
stderr (for example), and an external process takes care of slicing
that to one file per day.
I've been doing this for years for tens of daemons, it just works.
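Such an external slicer needs very little code; a sketch of the core of a tclsh filter, assuming one file per day and an invented filename prefix:

```tcl
# Read log lines from a channel and append each one to a per-day
# file; returns the last filename used. In a real filter script you'd
# call [slice stdin daemon] and run it as:
#   mydaemon 2>&1 | tclsh slicer.tcl
proc slice {in prefix} {
    set chan ""
    set name ""
    while {[gets $in line] >= 0} {
        set want "$prefix-[clock format [clock seconds] -format %Y-%m-%d].log"
        ;# date rolled over (or first line): switch output files
        if {$name ne $want} {
            if {$chan ne ""} {close $chan}
            set chan [open $want a]
            set name $want
        }
        puts $chan $line
    }
    if {$chan ne ""} {close $chan}
    return $name
}
```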
Sorry Darren, afraid you'll have to submit this to
comp.os.darrenos :-}
-Alex
While that's OK for lightly-loaded stuff, it's terrible for anything
that needs real performance and resilience. But doing better is
non-trivial. (We use a SAN for some things, and local disks with careful
backup strategies for others. User filestore is usually on AFS, but I
keep mine local so that I can continue to work without any network at
all; a feature that's frequently very useful.)
> And I still don't trust NFS locking.
Wise man.
Donal.
All true, but the file *is* written at the end of the block. That's
what's nice about this proc. You treat the file as a list, and if any
change occurs to the list, the file is updated.
This kind of "with" proc is very powerful and I use it quite a
bit. I think this idiom originated in Lisp, and it has since spread
to other dynamic languages, such as Ruby and Python.
Oh, and btw....
wiki.tcl.tk/17618
BTDT