Parsing a text file with multiple separators

wolfp...@yahoo.com

Apr 7, 2008, 9:11:52 PM

to scr...@perl.org

Hi all,

I am writing a Perl script to parse a file. The data in the file is
separated by spaces/tabs. However, certain fields may be empty or
consist of multiple words, in which case they are double-quoted, and
this makes it difficult for me to do a split.

Example of data:
"" "This is 2nd field" 3 4
1 2 "" 4
1 2 "The field may consist of (meta) characters" ""

This is what I am doing:
$line=~s/""/__EMPTY__/g;        # mark empty fields first
while ($line=~/(".*?")/) {      # loops until every double-quoted string is replaced
    $tmp1=$1;
    $tmp2=$1;
    $tmp1=~s/"//g;
    $tmp1=~s/ /__SPACE__/g;
    $tmp2=~s/([()])/\\$1/g;     # escape the metacharacters in $tmp2
    $line=~s/$tmp2/$tmp1/;
}
@tmp=split /\s+/, $line;
foreach $i (0..$#tmp) {
    $tmp[$i]=~s/__SPACE__/ /g;
    $tmp[$i]=~s/__EMPTY__//g;
    # Store data
}

Substitute "" with __EMPTY__.
While the line matches ".*?" (non-greedy match), remember the quoted
string that matched.
Assign it to $tmp1 and $tmp2. Remove " from $tmp1 and replace ' ' with
__SPACE__.
Escape the metacharacters in $tmp2, i.e. (meta) becomes \(meta\).
Substitute $tmp2 with $tmp1 (non-global).
Do a split /\s+/.
Replace __EMPTY__ with the empty string.
Replace __SPACE__ with " ".

Does anyone have a neater and more efficient way, either with split or
a regexp?

Thanks
Shu Teng

Brad Baxter

Apr 8, 2008, 8:39:20 AM

to wolfp...@yahoo.com, scr...@perl.org

If I were you, I'd use Text::ParseWords::parse_line()
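
For reference, a minimal sketch of how parse_line() could be applied to the sample data. The helper name split_fields and the choice of '\s+' as the delimiter are assumptions made for illustration; the second argument (keep) is set to 0 so that the surrounding quotes are removed.

```perl
use strict;
use warnings;
use Text::ParseWords qw(parse_line);

# split_fields is a hypothetical helper name, not part of the module.
sub split_fields {
    my ($line) = @_;
    chomp $line;    # a trailing newline would otherwise count as a delimiter
    # The delimiter is a regex; keep = 0 strips the surrounding quotes.
    return parse_line('\s+', 0, $line);
}

# Sample records from the question, one record per line.
my @records = (
    qq{""\t"This is 2nd field"\t3\t4},
    qq{1 2 "" 4},
    qq{1 2 "The field may consist of (meta) characters" ""},
);

for my $record (@records) {
    my @fields = split_fields($record);
    print scalar(@fields), ": ", join('|', @fields), "\n";
}
```

Empty quoted fields come back as empty strings, so no placeholder substitution is needed. Note that parse_line() also treats single quotes and backslashes specially, which may or may not suit the real data.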

Otavio

Apr 8, 2008, 9:08:44 AM

to scr...@perl.org

Either use the module mentioned or try a multi-stage split. It's
uglier, but it is a way to get the work done.

First I'd split the data by ("\s+\""), then by ("\"\s+"), then I'd deal
with the tabs....

Just my two cents. ;-)

On Apr 8, 09:39, b...@mail.libs.uga.edu (Brad Baxter) wrote:
> If I were you, I'd use Text::ParseWords::parse_line()
>

Johan Vromans

Apr 8, 2008, 6:02:43 PM

to wolfp...@yahoo.com, scr...@perl.org

wolfp...@yahoo.com writes:

> I am writing a Perl script to parse a file. The data in the file is
> separated by spaces/tabs. However, certain fields may be empty or
> consist of multiple words, in which case they are double-quoted, and
> this makes it difficult for me to do a split.
>
> Example of data:
> "" "This is 2nd field" 3 4
> 1 2 "" 4
> 1 2 "The field may consist of (meta) characters" ""

I think Text::CSV (Text::CSV_XS) can handle this. Just set the
separator to Tab.

-- Johan

wolfp...@yahoo.com

Apr 8, 2008, 8:28:36 PM

to scr...@perl.org

Yes Brad,

I have tried Text::ParseWords and it is exactly what I was
looking for.

Thanks


John W. Krahn

Apr 9, 2008, 1:07:50 AM

to scr...@perl.org

wolfp...@yahoo.com wrote:
> Hi all,

Hello,

> I am writing a Perl script to parse a file. The data in the file is
> separated by spaces/tabs. However, certain fields may be empty or
> consist of multiple words, in which case they are double-quoted, and
> this makes it difficult for me to do a split.
>
> Example of data:
> "" "This is 2nd field" 3 4
> 1 2 "" 4
> 1 2 "The field may consist of (meta) characters" ""

$ echo '"" "This is 2nd field" 3 4
1 2 "" 4
1 2 "The field may consist of (meta) characters" ""' | \
perl -lne'
my @x = /"[^"]*"|\S+/g;
print "Number of fields: " . @x . " ", map " >$_<", @x;
'
Number of fields: 4 >""< >"This is 2nd field"< >3< >4<
Number of fields: 4 >1< >2< >""< >4<
Number of fields: 4 >1< >2< >"The field may consist of (meta) characters"< >""<
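
The match above keeps the surrounding quotes on each field. If the bare values are wanted instead, one possible follow-up (a sketch only; fields_of is a hypothetical helper name, and it assumes quoted fields never contain an embedded double quote) is to strip one pair of quotes after matching:

```perl
use strict;
use warnings;

# fields_of is a hypothetical helper name used only for this sketch.
sub fields_of {
    my ($line) = @_;
    # Same pattern as in the one-liner: a quoted run without embedded
    # quotes, or a run of non-whitespace characters.
    my @raw = $line =~ /"[^"]*"|\S+/g;
    # Drop one pair of surrounding quotes; "" becomes the empty string.
    return map { /^"(.*)"\z/s ? $1 : $_ } @raw;
}

my @fields = fields_of('1 2 "The field may consist of (meta) characters" ""');
print scalar(@fields), " fields\n";
print " >$_<\n" for @fields;
```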

John
--
Perl isn't a toolbox, but a small machine shop where you
can special-order certain sorts of tools at low cost and
in short order. -- Larry Wall