Regex section from multiline


Paolo Aciano

Mar 27, 2014, 6:29:36 AM
to re...@googlegroups.com
Hello everyone,

This is my first post here, and I am hoping someone can help, because I have been struggling with this for two days now. I am trying to achieve this in Python with re.MULTILINE.

I have a file with the following structure:

-------------------------------------------------------
filename: pictures_1.zip
owner: john
datecreated: Mon 2014-02-24 15:16:34 +0200
info:
  Files added to AB-12345
  Hash: AB-12345
  Salt: sugar
    -------------------------------------------------------
    filename: HPM_4217.jpg
    owner: john
    datecreated: Mon 2014-02-24 14:23:49 +0200
    info:
      List of things
    -------------------------------------------------------
    filename: UIH_9754.jpg
    owner: john
    datecreated: Mon 2014-02-24 12:33:15 +0200
    info:
      Coffee-break
-------------------------------------------------------
filename: pictures_2.zip
owner: john
datecreated: Mon 2014-02-24 15:16:34 +0200
info:
  Files added to CD-78954
  Hash: CD-78954
  Salt: skyfall
    -------------------------------------------------------
    filename: PIC_789.jpg
    owner: john
    datecreated: Mon 2014-02-24 14:23:49 +0200
    info:
      Transformer
    -------------------------------------------------------
    filename: PIC_789.jpg
    owner: john
    datecreated: Mon 2014-02-24 12:33:15 +0200
    info:
      Fiji Island
-------------------------------------------------------
filename: pictures_3.zip
owner: john
datecreated: Mon 2014-02-24 15:16:34 +0200
info:
  Files added to EF-45654
  Hash: EF-45645
  Salt: jigsaw
    -------------------------------------------------------
    filename: IMG_704.jpg
    owner: john
    datecreated: Mon 2014-02-24 14:23:49 +0200
    info:
      Vermount Mountains
    -------------------------------------------------------
    filename: IMG_9741.jpg
    owner: john
    datecreated: Mon 2014-02-24 12:33:15 +0200
    info:
      New York
-------------------------------------------------------

I am trying to search for and retrieve a whole section based on the Hash value it contains.

Example:
I would like to build a regex containing the hash term "CD-78954" and get back the one or more sections of the file that match it.

This is what I would like to receive:
-------------------------------------------------------
filename: pictures_2.zip
owner: john
datecreated: Mon 2014-02-24 15:16:34 +0200
info:
  Files added to CD-78954
  Hash: CD-78954
  Salt: skyfall
    -------------------------------------------------------
    filename: PIC_789.jpg
    owner: john
    datecreated: Mon 2014-02-24 14:23:49 +0200
    info:
      Transformer
    -------------------------------------------------------
    filename: PIC_789.jpg
    owner: john
    datecreated: Mon 2014-02-24 12:33:15 +0200
    info:
      Fiji Island
-------------------------------------------------------

Thank you in advance.

Paolo

Prashant Patole

Mar 27, 2014, 2:28:41 PM
to re...@googlegroups.com

I have come across such situations many times during my career, and have often had to extract relational data from plain text files.

The best solution I have found is to convert such files to XML and then apply XML queries to get the relational data back.

 

You can follow these rules of thumb:

1.       Apply regular expressions in multiple steps.

2.       The more steps you use, the simpler each regular expression becomes.

 

Here I will use only two steps (which suits my own comfort level).

(You could do it in a single step too, but I really think making a regex that complex is never a good idea.)

 

Step one

Find string:

filename:(?<filename>.+)\s*owner:(?<owner>.+)\s*datecreated:(?<datecreated>.+)\s*info:\s+(?<info>.+)\s*Hash:\s+(?<Hash>.+)\s*Salt:\s+(?<Salt>.+)\s*(?<data>(?:\s{4}.*)+)

 

Replace string:

<FileHash>

     <Hash>${Hash}</Hash>

     <FileName>${filename}</FileName>

     <Owner>${owner}</Owner>

     <DateCreated>${datecreated}</DateCreated>

     <Info>${info}</Info>

     <Salt>${Salt}</Salt>

     ${data}

</FileHash>
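
For Python readers: the find string above uses .NET syntax. A rough translation for Python's re module might look like the sketch below, assuming Unix line endings and the field order shown in the sample file. Python spells named groups (?P<name>...) and refers to them in the replacement as \g<name>. The names SAMPLE, STEP_ONE and STEP_ONE_TEMPLATE are only illustrative, and the [ \t]* after each key is a small addition (not in the original pattern) that keeps the leading space out of the captured value.

```python
import re

# Shortened copy of one section from the original post.
SAMPLE = """\
-------------------------------------------------------
filename: pictures_2.zip
owner: john
datecreated: Mon 2014-02-24 15:16:34 +0200
info:
  Files added to CD-78954
  Hash: CD-78954
  Salt: skyfall
    -------------------------------------------------------
    filename: PIC_789.jpg
    owner: john
    datecreated: Mon 2014-02-24 14:23:49 +0200
    info:
      Transformer
-------------------------------------------------------
"""

# Step one: match a top-level entry (the one that has Hash and Salt)
# together with the indented lines that follow it.
STEP_ONE = re.compile(
    r"filename:[ \t]*(?P<filename>.+)\s*"
    r"owner:[ \t]*(?P<owner>.+)\s*"
    r"datecreated:[ \t]*(?P<datecreated>.+)\s*"
    r"info:\s+(?P<info>.+)\s*"
    r"Hash:\s+(?P<Hash>.+)\s*"
    r"Salt:\s+(?P<Salt>.+)\s*"
    r"(?P<data>(?:\s{4}.*)+)"
)

# Raw string so that \g<name> reaches re.sub unchanged.
STEP_ONE_TEMPLATE = r"""<FileHash>
     <Hash>\g<Hash></Hash>
     <FileName>\g<filename></FileName>
     <Owner>\g<owner></Owner>
     <DateCreated>\g<datecreated></DateCreated>
     <Info>\g<info></Info>
     <Salt>\g<Salt></Salt>
     \g<data>
</FileHash>"""

result = STEP_ONE.sub(STEP_ONE_TEMPLATE, SAMPLE)
print(result)
```

Note that the captured values are pasted into the template as they are, so a value containing < or & would need escaping before the result can be treated as XML.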

 

Your result after step 1.

-------------------------------------------------------

<FileHash>

      <Hash>AB-12345

</Hash>

      <FileName> pictures_1.zip

</FileName>

      <Owner> john

</Owner>

      <DateCreated> Mon 2014-02-24 15:16:34 +0200

</DateCreated>

      <Info>Files added to AB-12345

</Info>

      <Salt>sugar

</Salt>

          -------------------------------------------------------

    filename: HPM_4217.jpg

    owner: john

    datecreated: Mon 2014-02-24 14:23:49 +0200

    info:

      List of things

    -------------------------------------------------------

    filename: UIH_9754.jpg

    owner: john

    datecreated: Mon 2014-02-24 12:33:15 +0200

    info:

      Coffee-break

</FileHash>

 

-------------------------------------------------------

<FileHash>

      <Hash>CD-78954

</Hash>

      <FileName> pictures_2.zip

</FileName>

      <Owner> john

</Owner>

      <DateCreated> Mon 2014-02-24 15:16:34 +0200

</DateCreated>

      <Info>Files added to CD-78954

</Info>

      <Salt>skyfall

</Salt>

          -------------------------------------------------------

    filename: PIC_789.jpg

    owner: john

    datecreated: Mon 2014-02-24 14:23:49 +0200

    info:

      Transformer

    -------------------------------------------------------

    filename: PIC_789.jpg

    owner: john

    datecreated: Mon 2014-02-24 12:33:15 +0200

    info:

      Fiji Island

</FileHash>

 

-------------------------------------------------------

<FileHash>

      <Hash>EF-45645

</Hash>

      <FileName> pictures_3.zip

</FileName>

      <Owner> john

</Owner>

      <DateCreated> Mon 2014-02-24 15:16:34 +0200

</DateCreated>

      <Info>Files added to EF-45654

</Info>

      <Salt>jigsaw

</Salt>

          -------------------------------------------------------

    filename: IMG_704.jpg

    owner: john

    datecreated: Mon 2014-02-24 14:23:49 +0200

    info:

      Vermount Mountains

    -------------------------------------------------------

    filename: IMG_9741.jpg

    owner: john

    datecreated: Mon 2014-02-24 12:33:15 +0200

    info:

      New York

</FileHash>

 

-------------------------------------------------------

 

 

Step two

Find string

filename:(?<filename>.+)\s*owner:(?<owner>.+)\s*datecreated:(?<datecreated>.+)\s*info:\s+(?<info>.+)

 

Replace String

<FileInfo>

     <FileName>${filename}</FileName>

     <Owner>${owner}</Owner>

     <DateCreated>${datecreated}</DateCreated>

     <Info>${info}</Info>

</FileInfo>
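
The second step could be translated to Python in the same way. This is only a sketch under the same assumptions as before (Unix line endings, fields in the order shown); STEP_ONE_OUTPUT is a hand-trimmed stand-in for the text produced by step one, and all names are illustrative.

```python
import re

# Stand-in for the text after step one: the top-level entry is already
# XML, the nested entries are still plain "key: value" lines.
STEP_ONE_OUTPUT = """\
<FileHash>
     <Hash>CD-78954</Hash>
     <FileName>pictures_2.zip</FileName>
         -------------------------------------------------------
    filename: PIC_789.jpg
    owner: john
    datecreated: Mon 2014-02-24 14:23:49 +0200
    info:
      Transformer
    -------------------------------------------------------
    filename: PIC_789.jpg
    owner: john
    datecreated: Mon 2014-02-24 12:33:15 +0200
    info:
      Fiji Island
</FileHash>
"""

# Step two: the remaining "filename:" lines all belong to nested entries,
# because matching is case-sensitive and <FileName> does not match.
STEP_TWO = re.compile(
    r"filename:[ \t]*(?P<filename>.+)\s*"
    r"owner:[ \t]*(?P<owner>.+)\s*"
    r"datecreated:[ \t]*(?P<datecreated>.+)\s*"
    r"info:\s+(?P<info>.+)"
)

STEP_TWO_TEMPLATE = r"""<FileInfo>
     <FileName>\g<filename></FileName>
     <Owner>\g<owner></Owner>
     <DateCreated>\g<datecreated></DateCreated>
     <Info>\g<info></Info>
</FileInfo>"""

result = STEP_TWO.sub(STEP_TWO_TEMPLATE, STEP_ONE_OUTPUT)
print(result)
```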

 

Result after step two

-------------------------------------------------------

<FileHash>

      <Hash>AB-12345

</Hash>

      <FileName> pictures_1.zip

</FileName>

      <Owner> john

</Owner>

      <DateCreated> Mon 2014-02-24 15:16:34 +0200

</DateCreated>

      <Info>Files added to AB-12345

</Info>

      <Salt>sugar

</Salt>

          -------------------------------------------------------

    <FileInfo>   

      <FileName> HPM_4217.jpg

</FileName>

      <Owner> john

</Owner>

      <DateCreated> Mon 2014-02-24 14:23:49 +0200

</DateCreated>

      <Info>List of things

</Info>

</FileInfo>

 

    -------------------------------------------------------

    <FileInfo>   

      <FileName> UIH_9754.jpg

</FileName>

      <Owner> john

</Owner>

      <DateCreated> Mon 2014-02-24 12:33:15 +0200

</DateCreated>

      <Info>Coffee-break

</Info>

</FileInfo>

 

</FileHash>

 

-------------------------------------------------------

<FileHash>

      <Hash>CD-78954

</Hash>

      <FileName> pictures_2.zip

</FileName>

      <Owner> john

</Owner>

      <DateCreated> Mon 2014-02-24 15:16:34 +0200

</DateCreated>

      <Info>Files added to CD-78954

</Info>

      <Salt>skyfall

</Salt>

          -------------------------------------------------------

    <FileInfo>   

      <FileName> PIC_789.jpg

</FileName>

      <Owner> john

</Owner>

      <DateCreated> Mon 2014-02-24 14:23:49 +0200

</DateCreated>

      <Info>Transformer

</Info>

</FileInfo>

 

    -------------------------------------------------------

    <FileInfo>   

      <FileName> PIC_789.jpg

</FileName>

      <Owner> john

</Owner>

      <DateCreated> Mon 2014-02-24 12:33:15 +0200

</DateCreated>

      <Info>Fiji Island

</Info>

</FileInfo>

 

</FileHash>

 

-------------------------------------------------------

<FileHash>

      <Hash>EF-45645

</Hash>

      <FileName> pictures_3.zip

</FileName>

      <Owner> john

</Owner>

      <DateCreated> Mon 2014-02-24 15:16:34 +0200

</DateCreated>

      <Info>Files added to EF-45654

</Info>

      <Salt>jigsaw

</Salt>

          -------------------------------------------------------

    <FileInfo>   

      <FileName> IMG_704.jpg

</FileName>

      <Owner> john

</Owner>

      <DateCreated> Mon 2014-02-24 14:23:49 +0200

</DateCreated>

      <Info>Vermount Mountains

</Info>

</FileInfo>

 

    -------------------------------------------------------

    <FileInfo>   

      <FileName> IMG_9741.jpg

</FileName>

      <Owner> john

</Owner>

      <DateCreated> Mon 2014-02-24 12:33:15 +0200

</DateCreated>

      <Info>New York

</Info>

</FileInfo>

 

</FileHash>

 

-------------------------------------------------------
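
Once the text is in this shape, the "XML query" part could be done in Python with the standard library's xml.etree.ElementTree. The sketch below assumes the separator lines have been removed and the values trimmed; the wrapping <Files> root element and the helper name find_by_hash are my own additions, needed only because an XML document must have a single root.

```python
import xml.etree.ElementTree as ET

# Cleaned-up, shortened version of the converted text (no separator lines).
CONVERTED = """\
<FileHash>
  <Hash>AB-12345</Hash>
  <FileName>pictures_1.zip</FileName>
  <FileInfo>
    <FileName>HPM_4217.jpg</FileName>
    <Info>List of things</Info>
  </FileInfo>
</FileHash>
<FileHash>
  <Hash>CD-78954</Hash>
  <FileName>pictures_2.zip</FileName>
  <FileInfo>
    <FileName>PIC_789.jpg</FileName>
    <Info>Transformer</Info>
  </FileInfo>
  <FileInfo>
    <FileName>PIC_789.jpg</FileName>
    <Info>Fiji Island</Info>
  </FileInfo>
</FileHash>
"""


def find_by_hash(xml_fragments, wanted):
    """Return every <FileHash> element whose <Hash> text equals `wanted`."""
    # Wrap the fragments in one root element so the parser accepts them.
    root = ET.fromstring("<Files>" + xml_fragments + "</Files>")
    return [
        entry
        for entry in root.findall("FileHash")
        if (entry.findtext("Hash") or "").strip() == wanted
    ]


matches = find_by_hash(CONVERTED, "CD-78954")
for entry in matches:
    print(entry.findtext("FileName"))
    for info in entry.findall("FileInfo"):
        print("   ", info.findtext("FileName"), "-", info.findtext("Info"))
```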


prashant


Super Simple Software

Software Development, Internet Marketing, SEO and Academic Projects





Paolo Aciano

Mar 28, 2014, 5:05:42 AM
to re...@googlegroups.com
Dear Prashant,

First of all, thank you for such a quick response; it is just what I was hoping for. This seems really interesting, and I hope someone else will offer other ideas too, so that I have more to work with. Thank you again; I hope this is helpful for someone else as well.

Paolo

On Thursday, March 27, 2014 19:28:41 UTC+1, prash wrote:

Prashant Patole

Mar 31, 2014, 12:34:10 AM
to Paolo Aciano, re...@googlegroups.com

Everyone is new at some point, so there is nothing to worry about. :-)

I work entirely in .NET and don't even know the 'p' of Python.

I think this page will help you with Python regex usage and syntax: http://docs.python.org/2/howto/regex.html
Look for the 'Search and Replace' section.

What you are looking for, getting a list, is quite easy in .NET. It simply needs the regex library's Matches() method, which returns a collection of all matches found in the given string. You need to look for its equivalent in Python.


Sorry, I was away on a weekend vacation, so I could not reply earlier.

Supersimplesoft.com


On Mar 30, 2014 6:40 PM, "Paolo Aciano" <paolo....@gmail.com> wrote:

Also, I was wondering whether it would be possible to just split the file on ^-+ at the beginning of a new line and put the pieces into a list instead. I know it is not a good approach for large files, but for this small example it could be just fine. How would you split it like that with a regex and put the result into a list?
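
One possible way to do that split in Python is sketched below; it is not the only way, and the helper names split_sections and sections_with_hash are made up for illustration. It relies on re.MULTILINE making ^ match at the start of every line, so only the separators that start in column 0 split the text, while the indented separators of the nested entries stay inside their parent section.

```python
import re

# Shortened copy of the sample file from the original post.
SAMPLE = """\
-------------------------------------------------------
filename: pictures_1.zip
owner: john
info:
  Files added to AB-12345
  Hash: AB-12345
  Salt: sugar
    -------------------------------------------------------
    filename: HPM_4217.jpg
    info:
      List of things
-------------------------------------------------------
filename: pictures_2.zip
owner: john
info:
  Files added to CD-78954
  Hash: CD-78954
  Salt: skyfall
    -------------------------------------------------------
    filename: PIC_789.jpg
    info:
      Transformer
-------------------------------------------------------
"""


def split_sections(text):
    """Split on separator lines that start at the beginning of a line."""
    parts = re.split(r"^-+[ \t]*\n", text, flags=re.MULTILINE)
    # The pieces before the first and after the last separator are empty.
    return [part for part in parts if part.strip()]


def sections_with_hash(text, wanted):
    """Return the top-level sections that contain a 'Hash: <wanted>' line."""
    hash_line = re.compile(
        r"^[ \t]*Hash:[ \t]*" + re.escape(wanted) + r"[ \t]*$",
        re.MULTILINE,
    )
    return [s for s in split_sections(text) if hash_line.search(s)]


for section in sections_with_hash(SAMPLE, "CD-78954"):
    print(section)
```

The whole file is read into memory here, which matches the "small example" case described above.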

On Mar 30, 2014 14:41, "Paolo Aciano" <paolo....@gmail.com> wrote:

Hi Prashant,

Could you also show how to do the find and replace in Python? At first I thought I could do it from the description you gave me, but I am struggling with it. As you already guessed, I am new to Python. I have been searching on Google, but I must be doing something wrong. I have already been playing with the re module, re.compile and re.sub, but perhaps I am looking in the wrong place?

Paolo

On Mar 27, 2014 19:28, "Prashant Patole" <prashan...@gmail.com> wrote:


--
Prashant. 9423968815. SSS.Sent from Gmail Mobile

Prashant Patole

Mar 31, 2014, 1:52:26 AM
to Paolo Aciano, re...@googlegroups.com
Also try this page.
I think it is re.findall in Python.
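
For reference, re.findall does return a list of all matches, so it is a fair Python counterpart to the .NET call mentioned earlier; re.finditer is the variant that yields match objects. A minimal sketch, with a shortened sample and an illustrative pattern that collects every Hash value:

```python
import re

# Shortened sample with two top-level entries.
SAMPLE = """\
filename: pictures_1.zip
info:
  Files added to AB-12345
  Hash: AB-12345
  Salt: sugar
filename: pictures_2.zip
info:
  Files added to CD-78954
  Hash: CD-78954
  Salt: skyfall
"""

HASH_LINE = r"^[ \t]*Hash:[ \t]*(\S+)"

# With exactly one capturing group, findall returns a list of that group.
hashes = re.findall(HASH_LINE, SAMPLE, flags=re.MULTILINE)
print(hashes)

# finditer gives match objects, e.g. when the position is also needed.
for match in re.finditer(HASH_LINE, SAMPLE, flags=re.MULTILINE):
    print(match.group(1), "found at offset", match.start(1))
```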

