Inconsistent handling of regular expressions

Łukasz Dywicki

Sep 19, 2023, 9:18:13 AM
to Copybara OSS
Hello copybara fellows,
I have come back to my Copybara experiments because I want to use it to update some Maven pom.xml files. These are plain XML files with basic placeholders written in ${var} syntax.
I have been stuck for a day with Copybara/Starlark trying to remove a section of XML that contains ${var}. I also keep running into fairly odd issues with regex_groups when I use the core.replace() transformation.

Subject snippet:
    </dependency>
    <dependency>
      <groupId>org.openhab.core.bundles</groupId>
      <artifactId>org.openhab.core.ephemeris</artifactId>
      <version>${project.version}</version>
    </dependency>
    <dependency>

The transformation that keeps giving me trouble:

re_groups2={'ws': '\\s*', 'any': '.*', 'ws2': '\\s*'};
dependencyStatement = '<dependency>${ws}<groupId>org.openhab.core.bundles</groupId>${ws}<artifactId>' + module + '</artifactId>${ws}<version>${any}</version>${ws2}</dependency>${ws2}';
transformations = [
core.replace(dependencyStatement, '', paths=glob(['**/pom.xml']), regex_groups=re_groups2, multiline=True, repeated_groups=True)
]

As you can see, this is a basic regular expression that should not cause any trouble. However, if I use ${ws} after ${any}, the entire expression does not match. Is there any particular reason why a regex group cannot be repeated one after another?

Best,
Łukasz
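
One thing that may be worth checking here, going by how the repeated_groups option is described in the Copybara reference: a group that appears more than once is expected to match the same text at every occurrence. The sketch below is only a toy model in Python, not Copybara's implementation. It uses a backreference to stand in for that rule, and the shortened XML and the variable names are made up for illustration.

```python
import re

# Toy model only: Copybara builds its regex internally. A Python
# backreference stands in for "a repeated group must match the same text".
WS = r"(?P<ws>\s*)"      # first use of the hypothetical group "ws"
SAME_WS = r"(?P=ws)"     # second use: must equal what "ws" matched

pattern = re.compile(
    re.escape("<dependency>") + WS
    + re.escape("<groupId>g</groupId>") + SAME_WS
    + re.escape("</dependency>")
)

# Same whitespace at both positions.
same_indent = "<dependency>\n  <groupId>g</groupId>\n  </dependency>"
# Different indentation at the two positions, as in a typical pom.xml.
mixed_indent = "<dependency>\n      <groupId>g</groupId>\n    </dependency>"

matches_same = pattern.search(same_indent) is not None
matches_mixed = pattern.search(mixed_indent) is not None
print(matches_same, matches_mixed)
```

If Copybara behaves like this model, reusing one whitespace group at positions with different indentation would make the whole expression fail, which is presumably why a separate ws2 group helps. Whether that is the cause of the failure described above is not confirmed.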

Łukasz Dywicki

Sep 26, 2023, 5:32:54 AM
to Copybara OSS
Does anyone follow this list? I've run into another regex issue, and the lack of clear information on how these are handled is a real pain. I want to replace an XML chunk (everything between two tags), yet Copybara simply doesn't match it. Guessing the right regex without proper tooling is a waste of everybody's time.

Example of a trivial regex which doesn't work as expected:
  core.replace(
    "(?s)<distributionManagement>${any}</distributionManagement>",
    """<distributionManagement><site /></distributionManagement>""",
    {'any': '.*'},
    multiline=True,
    paths=glob(["pom.xml"])
  )

Is there a way to trace what regex is generated by replace transformation from its inputs?

Best,
Łukasz

Mikel Alcon

Sep 26, 2023, 10:35:43 AM
to Copybara OSS
Something like this should work [1]. The main issue is that you have to escape the '$' in ${project.version}, otherwise copybara confuses it with a regex_group var.





[1]
#!/bin/bash

d="$(mktemp -d)"

echo "$d"

cd "$d"
cat > copy.bara.sky  <<'EOF'
core.workflow(
    name = "default",
    origin = folder.origin(),
    destination = folder.destination(),
    authoring = authoring.overwrite("Foo <f...@example.com>"),
    transformations = [
       core.replace(before = "<dependency>${ws}<groupId>org.openhab.core.bundles</groupId>${ws}<artifactId>org.openhab.core.ephemeris</artifactId>${ws}<version>$${${id}}</version>${ws2}</dependency>",
                    after = "REMOVED",
                    multiline = True,
                    repeated_groups = True,
                    regex_groups = {
                          "ws": "(\\n| |\\t)*",
                          "ws2": "(\\n| |\\t)*",
                          "id": "[^}]+"
                    })
    ],
)
EOF

mkdir in
mkdir out

cat > in/file.xml <<'EOF'
    <dependency>
      <groupId>org.openhab.core.OTHER</groupId>

      <artifactId>org.openhab.core.ephemeris</artifactId>
      <version>${project.version}</version>
    </dependency>
    <dependency>
      <groupId>org.openhab.core.bundles</groupId>
      <artifactId>org.openhab.core.ephemeris</artifactId>
      <version>${project.version}</version>
    </dependency>
EOF

copybara copy.bara.sky default in --folder-dir out

cat out/*
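
To make the escaping rule easier to reason about, here is a rough Python model of how a before template might be turned into a regex. It is a sketch under assumptions, not Copybara's code: it assumes literal text is quoted rather than interpreted as regex, that ${name} is replaced by the group's pattern, and that $$ stands for a literal dollar sign. The function name expand_template is hypothetical.

```python
import re

def expand_template(template, groups):
    """Toy expansion: quote literals, substitute ${name}, treat $$ as '$'."""
    out = []
    i = 0
    while i < len(template):
        if template.startswith("$$", i):
            # Escaped dollar: emit a literal '$'.
            out.append(re.escape("$"))
            i += 2
        elif template.startswith("${", i):
            # Group reference: the name runs up to the next '}'.
            end = template.index("}", i)
            name = template[i + 2:end]
            out.append("(" + groups[name] + ")")
            i = end + 1
        else:
            # Anything else is matched literally, never as regex syntax.
            out.append(re.escape(template[i]))
            i += 1
    return "".join(out)

# The escaped form from the script above: $$ then a literal {...} around ${id}.
version_regex = expand_template(
    "<version>$${${id}}</version>", {"id": "[^}]+"}
)
print(version_regex)

# Under this model, a leading "(?s)" is literal text, not a regex flag.
flag_regex = expand_template("(?s)<a>${any}</a>", {"any": ".*"})
print(flag_regex)
```

Printing the expanded string is a cheap way to see what a template would mean under these assumptions. In this model the unescaped form `<version>${${id}}</version>` fails with a KeyError, because the text after the first `${` is read as a group name, which mirrors the confusion described above.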



Łukasz Dywicki

Oct 2, 2023, 8:28:48 AM
to Copybara OSS
Hi,
Thank you for the answer, this makes sense. However, it is still a bit of a mystery why $ is not captured by the (.*) pattern. I encountered a similar problem with blanks, and also with other characters such as {} and <>.
For example, this regex works perfectly fine: https://regex101.com/r/OOA3hS/1, but it does not work in Copybara.

Is there any reason why the above doesn't fly?

Cheers,
Łukasz