stitching questions


Ezra Kissel

Jun 13, 2014, 1:36:36 PM
to geni-...@googlegroups.com
Stitching has been working pretty well for me, but I've run into a few
things:

1) Stitching to/from Utah seems to prefer the protogeni ION interface in
Atlanta. For example, stitching the IG Utah and Illinois racks goes
through the Atlanta POP:

<hop id="1"><link
id="urn:publicid:IDN+instageni.illinois.edu+interface+procurve2:2.4:ion.chic.et-10_0_0">
<hop id="2"><link
id="urn:publicid:IDN+ion.internet2.edu+interface+rtr.chic:et-10/0/0:illinois-ig">
<hop id="3"><link
id="urn:publicid:IDN+ion.internet2.edu+interface+rtr.atla:ge-10/3/2:protogeni">
<hop id="4"><link
id="urn:publicid:IDN+emulab.net+interface+procurve-pgeni-atla:3.21">
<hop id="5"><link id="urn:publicid:IDN+emulab.net+interface+procurveA:3.19">
<hop id="6"><link
id="urn:publicid:IDN+utah.geniracks.net+interface+procurve2:1.19">

This results in an RTT of about 136ms versus 36ms via the control
interfaces, which IMO is a little excessive.

Is there a reason the SCS can't do Chicago to Kansas City for hop 3,
i.e.
urn:ogf:network:domain=ion.internet2.edu:node=rtr.kans:port=ge-10/2/9:link=protogeni
instead?

This might come up between other racks but I haven't had a chance to
test more than a few pairings thus far. Illinois to BBN is reasonable
(ION CHIC<->NEWY), for instance, 36ms circuit versus 32ms control.

2) The GENI stitching example
(http://groups.geni.net/geni/wiki/GENIExperimenter/ExperimentExample-stitching)
states that the default bandwidth allocation is "100 MB" but stitcher.py
sets its default to "20 MB." This caught me by surprise and I had to
recreate my stitched links when I realized my traffic was policed at
20Mb/s. Those values should probably read "Mb" and not "MB" as the
circuit capacities are in bits.

3) Sometimes the circuit simply stops passing traffic. For instance,
the Utah - Illinois example (created last night) worked last night but
today I can't ping across the link. I checked the ION, Utah, and
Illinois AMs and the resources are all renewed and active. I'm not sure
what to do about this beyond deleting and trying again. Any
suggestions? I can send along the stitching manifests if anyone has
cycles to take a look.

4) How much effort would be involved in updating the stitching
advertisement with the current *available* vlan ranges? The racks know
which vlans are in use, could they not simply update the ad rspec so the
SCS and users know whether or not a stitch is even possible by doing a
listresources? I know POA commands update the advertisements (e.g.,
shared_vlan) almost immediately to list the new resources, couldn't
something similar be done for vlan ranges? The biggest problem right
now is that you try to stitch to a rack with limited vlan range (e.g. IG
Stanford right now) where all the VLANs are in-use and you wait 40-50
minutes while stitcher.py tries random vlans in the range only to find
out it will never succeed.
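
To see at a glance which Internet2 POPs a stitched path crosses (point 1), here is a rough Python sketch. It assumes ION interface URNs follow the `rtr.<pop>:<port>:<link>` pattern seen in the hop list above. The function name `ion_pops` is made up for illustration and is not part of gcf.

```python
import re

# Matches the router name in ION interface URNs such as
# urn:publicid:IDN+ion.internet2.edu+interface+rtr.atla:ge-10/3/2:protogeni
# (pattern inferred from the hop list above; an assumption, not a spec).
ION_POP = re.compile(r"\+ion\.internet2\.edu\+interface\+rtr\.([a-z]+):")


def ion_pops(link_urns):
    """Return the ION POP names, in path order, found in a list of hop link URNs."""
    pops = []
    for urn in link_urns:
        match = ION_POP.search(urn)
        if match:
            pops.append(match.group(1).upper())
    return pops


hops = [
    "urn:publicid:IDN+instageni.illinois.edu+interface+procurve2:2.4:ion.chic.et-10_0_0",
    "urn:publicid:IDN+ion.internet2.edu+interface+rtr.chic:et-10/0/0:illinois-ig",
    "urn:publicid:IDN+ion.internet2.edu+interface+rtr.atla:ge-10/3/2:protogeni",
    "urn:publicid:IDN+emulab.net+interface+procurve-pgeni-atla:3.21",
    "urn:publicid:IDN+emulab.net+interface+procurveA:3.19",
    "urn:publicid:IDN+utah.geniracks.net+interface+procurve2:1.19",
]
print(ion_pops(hops))  # ['CHIC', 'ATLA']
```

Run against a stitcher path, this makes a detour like CHIC -> ATLA on the way to Utah easy to spot.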

thanks,
- ezra

Jonathon Duerig

Jun 13, 2014, 1:42:17 PM
to geni-...@googlegroups.com
On Fri, 13 Jun 2014, Ezra Kissel wrote:

> 1) Stitching to/from Utah seems to prefer the protogeni ION interface in
> Atlanta. For example, stitching the IG Utah and Illinois racks goes through
> the Atlanta POP:

I believe that the workaround for this is to pass in excluded
stitchpoints when talking to the SCS.

> 4) How much effort would be involved in updating the stitching advertisement
> with the current *available* vlan ranges? The racks know which vlans are in
> use, could they not simply update the ad rspec so the SCS and users know
> whether or not a stitch is even possible by doing a listresources? I know
> POA commands update the advertisements (e.g., shared_vlan) almost immediately
> to list the new resources, couldn't something similar be done for vlan
> ranges? The biggest problem right now is that you try to stitch to a rack
> with limited vlan range (e.g. IG Stanford right now) where all the VLANs are
> in-use and you wait 40-50 minutes while stitcher.py tries random vlans in the
> range only to find out it will never succeed.

This is already implemented. When you ask for an advertisement and pass
the available flag, it will return the vlans that are currently available
rather than the whole range.

Luisa Nevers

Jun 13, 2014, 1:51:07 PM
to geni-...@googlegroups.com
On 6/13/14, 1:36 PM, Ezra Kissel wrote:
> 2) The GENI stitching example
> (http://groups.geni.net/geni/wiki/GENIExperimenter/ExperimentExample-stitching)
> states that the default bandwidth allocation is "100 MB" but
> stitcher.py sets its default to "20 MB." This caught me by surprise
> and I had to recreate my stitched links when I realized my traffic was
> policed at 20Mb/s. Those values should probably read "Mb" and not
> "MB" as the circuit capacities are in bits.
Hi Ezra,

I am leaving the rest of the questions for others.

I have updated the two instances to use "Mb".

The stitcher default was modified to 20 Mb capacity because there has
been very high demand for limited resources. Many stitching requests are
from experimenters trying to see how stitching works. We chose the lower
bandwidth allocation so that experimenters must explicitly choose to
increase the bandwidth for their experiments.

Luisa

Ezra Kissel

Jun 13, 2014, 2:00:49 PM
to geni-...@googlegroups.com
On 6/13/2014 1:42 PM, Jonathon Duerig wrote:
> On Fri, 13 Jun 2014, Ezra Kissel wrote:
>
>> 1) Stitching to/from Utah seems to prefer the protogeni ION interface in
>> Atlanta. For example, stitching the IG Utah and Illinois racks goes through
>> the Atlanta POP:
>
> I believe that the workaround for this is to pass in excluded
> stitchpoints when talking to the SCS.
>

I saw those options in the stitcher usage. I'll try excluding the
Atlanta hop and see what happens. I guess the bigger question is
figuring out the default path finding in SCS...


>> 4) How much effort would be involved in updating the stitching advertisement
>> with the current *available* vlan ranges? The racks know which vlans are in
>> use, could they not simply update the ad rspec so the SCS and users know
>> whether or not a stitch is even possible by doing a listresources? I know
>> POA commands update the advertisements (e.g., shared_vlan) almost immediately
>> to list the new resources, couldn't something similar be done for vlan
>> ranges? The biggest problem right now is that you try to stitch to a rack
>> with limited vlan range (e.g. IG Stanford right now) where all the VLANs are
>> in-use and you wait 40-50 minutes while stitcher.py tries random vlans in the
>> range only to find out it will never succeed.
>
> This is already implemented. When you ask for an advertisement and pass
> the available flag, it will return the vlans that are currently available
> rather than the whole range.
>

That's good to know, although I'm not sure how to pass that flag. I
don't see how with Omni (--arbitrary-option ?), in the emulab protogeni
scripts maybe?

I imagine the SCS could use this feature instead of randomly trying
within the full range.

Xi Yang

Jun 13, 2014, 2:07:26 PM
to geni-...@googlegroups.com
Ezra,

The SCS will try to return the least-cost path, where the cost is the sum of the trafficEngineeringMetric values defined for the stitching links. It is not necessarily the path with the fewest hops.

In the exclusion routing profile, you can exclude a link, port, node, or even an entire aggregate by providing the corresponding URNs.
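
The difference between least cost and least hop can be sketched with a small Dijkstra search in Python. This is only an illustration of summing per-link metrics, not the actual SCS code. The node names and metric values are hypothetical.

```python
import heapq


def least_cost_path(graph, src, dst):
    """Dijkstra over a dict {node: {neighbor: metric}}; returns (cost, path) or None."""
    queue = [(0, src, [src])]
    visited = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == dst:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for neighbor, metric in graph.get(node, {}).items():
            if neighbor not in visited:
                heapq.heappush(queue, (cost + metric, neighbor, path + [neighbor]))
    return None


# Hypothetical topology and metrics, chosen only to illustrate the point:
# the direct A-B link has one hop but a higher metric than the detour.
graph = {
    "A": {"B": 50, "C": 10},
    "C": {"D": 10},
    "D": {"B": 10},
}
print(least_cost_path(graph, "A", "B"))  # (30, ['A', 'C', 'D', 'B'])
```

Here the three-hop route wins because its summed metric (30) is lower than the single-hop link (50).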

—Xi
> --
> GENI Users is a community supported mailing list, so please help by responding to questions you know the answer to.
>
> If this is your first time posting a question to this list, please review http://groups.geni.net/geni/wiki/GENIExperimenter/CommunityMailingList
> --- You received this message because you are subscribed to the Google Groups "GENI Users" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to geni-users+...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.

Ezra Kissel

Jun 13, 2014, 2:47:29 PM
to geni-...@googlegroups.com
Can you tell me how the SCS determines the sum of those metric values?
The ION topology has each port metric set to "10" and the AMs advertise
the same value. We can take this off-list, I'm just curious about the
path finding details.


On 6/13/2014 2:07 PM, Xi Yang wrote:
> Ezra,
>
> The SCS will try to return the least-cost path, where the cost is the sum of the trafficEngineeringMetric values defined for the stitching links. It is not necessarily the path with the fewest hops.
>
> In the exclusion routing profile, you can exclude a link, port, node, or even an entire aggregate by providing the corresponding URNs.
>
> --Xi
>
> On Jun 13, 2014, at 2:00 PM, Ezra Kissel <ezki...@indiana.edu> wrote:
>

Sarah Edwards

Jun 13, 2014, 3:03:23 PM
to geni-...@googlegroups.com, Sarah Edwards
> That's good to know, although I'm not sure how to pass that flag. I don't see how with Omni (--arbitrary-option ?), in the emulab protogeni scripts maybe?


To answer the question you asked:

--arbitrary-option is a testing thing (of use to very, very few people, all of them developers).

You can use the --available flag in omni like this:
omni listresources -a gpo-ig -o --available

When I do the above command, I see things like this in the advertisement:
<vlanRangeAvailability>3706-3712,3714-3715,3717-3721,3723-3724,3726,3728-3731,3746,3748-3749</vlanRangeAvailability>

That said, I'm not sure that information gets you where you want to go at this time.

So....

Are you running the latest and greatest gcf 2.6 that came out three days ago? There were some improvements to deal with picking VLANs in the latest release of stitcher. Aaron (who is out today) told me that it vastly reduces the number of retries.

From the release notes:
"Where possible, request that the AM pick the VLAN tag, by requesting 'any'. Therefore, many fewer VLAN unavailable errors. This only works at AMs that are VLAN producers or per the SCS do not depend on other AMs. This does not currently work at ExoGENI or GRAM-based AMs. (#576,#604)"


*******************************************************************************
Sarah Edwards
GENI Project Office

BBN Technologies
Cambridge, MA
phone: (617) 873-2329
email: sedw...@bbn.com





Ezra Kissel

Jun 13, 2014, 3:09:57 PM
to geni-...@googlegroups.com
On 6/13/2014 3:04 PM, Sarah Edwards wrote:
>> That's good to know, although I'm not sure how to pass that flag. I don't see how with Omni (--arbitrary-option ?), in the emulab protogeni scripts maybe?
>
>
> To answer the question you asked:
>
> --arbitrary-option is a testing thing (of use to very, very few people, all of them developers).
>
> You can use the --available flag in omni like this:
> omni listresources -a gpo-ig -o --available
>

Ah, thank you, and sorry, I completely missed it under "Basic and Most
Used Options" in the usage. :/

> When I do the above command, I see things like this in the advertisement:
> <vlanRangeAvailability>3706-3712,3714-3715,3717-3721,3723-3724,3726,3728-3731,3746,3748-3749</vlanRangeAvailability>
>
> That said, I'm not sure that information gets you where you want to go at this time.
>

This will be very helpful when I do manual stitching to external
(non-GENI) testbeds and need to find an available VLAN ahead of time.
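
For picking a VLAN by hand, the `vlanRangeAvailability` string quoted above can be expanded into individual tags. The Python sketch below assumes the value is a comma-separated list of single tags and `low-high` ranges, as in the example. `parse_vlan_ranges` is a made-up helper, not part of omni or gcf.

```python
def parse_vlan_ranges(text):
    """Expand a string like '3706-3712,3726' into a sorted list of VLAN IDs."""
    vlans = set()
    for part in text.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-", 1)
            vlans.update(range(int(low), int(high) + 1))
        else:
            vlans.add(int(part))
    return sorted(vlans)


# Value copied from the listresources output quoted above.
available = parse_vlan_ranges(
    "3706-3712,3714-3715,3717-3721,3723-3724,3726,3728-3731,3746,3748-3749"
)
print(len(available))      # 24
print(3713 in available)   # False
print(3726 in available)   # True
```

Intersecting two such lists, e.g. `sorted(set(a) & set(b))`, gives the tags that are free at both ends of a manual stitch.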


> So....
>
> Are you running the latest and greatest gcf 2.6 that came out three days ago? There were some improvements to deal with picking VLANs in the latest release of stitcher. Aaron (who is out today) told me that it vastly reduces the number of retries.
>
> From the release notes:
> "Where possible, request that the AM pick the VLAN tag, by requesting 'any'. Therefore, many fewer VLAN unavailable errors. This only works at AMs that are VLAN producers or per the SCS do not depend on other AMs. This does not currently work at ExoGENI or GRAM-based AMs. (#576,#604)"
>

I just updated now thinking the same thing. I originally didn't want to
deal with an upgrade while in the middle of everything else but it does
sound like this will make my life a lot easier.

- ezra

Ezra Kissel

Jun 13, 2014, 3:27:48 PM
to geni-...@googlegroups.com
On 6/13/2014 1:51 PM, Luisa Nevers wrote:
> On 6/13/14, 1:36 PM, Ezra Kissel wrote:
>> 2) The GENI stitching example
>> (http://groups.geni.net/geni/wiki/GENIExperimenter/ExperimentExample-stitching)
>> states that the default bandwidth allocation is "100 MB" but
>> stitcher.py sets its default to "20 MB." This caught me by surprise
>> and I had to recreate my stitched links when I realized my traffic was
>> policed at 20Mb/s. Those values should probably read "Mb" and not
>> "MB" as the circuit capacities are in bits.
> Hi Ezra,
>
> I am leaving the rest of the questions for others.
>
> I have updated the two instances to use "Mb".
>
> The stitcher default was modified to 20 Mb capacity because there has
> been very high demand for limited resources. Many stitching requests are
> from experimenters trying to see how stitching works. We chose the lower
> bandwidth allocation so that experimenters must explicitly choose to
> increase the bandwidth for their experiments.
>

Thank you for updating the page. The stitcher.py usage does mention the
default is 20 Mbps so things are consistent again!
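
To make the Mb versus MB distinction concrete, here is the arithmetic as a minimal Python sketch. It uses only the fact that one byte is eight bits; the function name is hypothetical.

```python
def megabytes_to_megabits(megabytes_per_s):
    """Convert a rate in MB/s to Mb/s (1 byte = 8 bits)."""
    return megabytes_per_s * 8


# A reader who took "20 MB" literally would expect 160 Mb/s,
# eight times the 20 Mb/s the circuit is actually policed at.
print(megabytes_to_megabits(20))   # 160
print(megabytes_to_megabits(100))  # 800
```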

Ezra Kissel

Jun 13, 2014, 3:54:14 PM
to geni-...@googlegroups.com
On 6/13/2014 1:36 PM, Kissel, Ezra D wrote:

> 3) Sometimes the circuit simply stops passing traffic. For instance,
> the Utah - Illinois example (created last night) worked last night but
> today I can't ping across the link. I checked the ION, Utah, and
> Illinois AMs and the resources are all renewed and active. I'm not sure
> what to do about this beyond deleting and trying again. Any
> suggestions? I can send along the stitching manifests if anyone has
> cycles to take a look.
>

In this case, I can only blame myself. I see now that I forgot to renew
at pg-utah in addition to ig-utah. I went back and saw that I got email
for both, only renewed one.

I realize there are benefits to each aggregate having their own policy,
but I'll make one last appeal to have default sliver expiration dates be
the same across a slice. It really is difficult to keep track when you
have slivers all over the place, making sure they work as you expect,
and then sorting out which ones need to be renewed before others.

It does look like the new stitcher version in gcf-2.6 makes it much
clearer where it allocates slivers, which I think will be a big help.

Nicholas Bastin

Jun 13, 2014, 4:07:54 PM
to geni-...@googlegroups.com
On Fri, Jun 13, 2014 at 3:53 PM, Ezra Kissel <ezki...@indiana.edu> wrote:
> I realize there are benefits to each aggregate having their own policy, but I'll make one last appeal to have default sliver expiration dates be the same across a slice. It really is difficult to keep track when you have slivers all over the place, making sure they work as you expect, and then sorting out which ones need to be renewed before others.

The only way this will ever happen is if you enforce it yourself - if you set all of your slivers to expire at the time of the earliest one. (Of course, barring out-of-band information, you can't actually do this - we need to at least advertise what the maximum sliver reservation time is for each resource.) For example, for certain long-lived slivers I always just renew them for 5 days everywhere, even though I know that everywhere except Utah I could renew them for much longer.
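
The "renew everything to the most restrictive limit" approach can be sketched in a few lines of Python. The per-aggregate limits below are hypothetical placeholders, apart from the 5-day figure mentioned for Utah. As noted, the real values are not advertised and have to be learned out of band.

```python
from datetime import datetime, timedelta, timezone


def common_expiration(now, max_renewal_days):
    """Return the latest expiration that every aggregate would accept.

    max_renewal_days maps an aggregate nickname to the longest renewal
    (in days) that aggregate allows; the most restrictive one wins.
    """
    return now + timedelta(days=min(max_renewal_days.values()))


# Hypothetical limits, for illustration only.
limits = {"pg-utah": 5, "ig-utah": 5, "ig-illinois": 14, "ion": 30}
now = datetime(2014, 6, 13, 16, 0, tzinfo=timezone.utc)
print(common_expiration(now, limits).isoformat())  # 2014-06-18T16:00:00+00:00
```

Renewing every sliver in the slice to that single timestamp keeps them expiring together, at the cost of renewing more often than some aggregates would require.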

There will always be resources in GENI which effectively are not constrained and thus don't have a functional maximum expiration date (dataplane flowspace, for example), while other resources (raw PCs, for example) will be highly constrained and have high contention and thus low maximum reservation times. Further, it's a federation, so each resource is controlled by the local administrators and some will set policies based on their local usage as well. Now, part of the current "problem" is, I believe, that the PG-side max renewal time is per-AM, not per-resource, which means that the maximum renewal time is set based on the scarce resource (raw PCs) and not necessarily the resource you are using (VMs). That being said, even if that problem were addressed, you'd still likely have varying policies enforced by the AMs you are using across your slice. The most useful feature here would be for each resource to advertise its max reservation time so you could coordinate them yourself.

--
Nick

Ezra Kissel

Jun 13, 2014, 4:16:11 PM
to geni-...@googlegroups.com
On 6/13/2014 4:07 PM, Nicholas Bastin wrote:
> On Fri, Jun 13, 2014 at 3:53 PM, Ezra Kissel <ezki...@indiana.edu
> <mailto:ezki...@indiana.edu>> wrote:
>
> I realize there are benefits to each aggregate having their own
> policy, but I'll make one last appeal to have default sliver
> expiration dates be the same across a slice. It really is difficult
> to keep track when you have slivers all over the place, making sure
> they work as you expect, and then sorting out which ones need to be
> renewed before others.
>
>
> The only way this will ever happen is if you enforce it yourself - if
> you set all of your slivers to expire at the time of the earliest one.
> (Of course, barring out-of-band information, you can't actually do
> this - we need to at least advertise what the maximum sliver reservation
> time is for each resource). For example, for certain long-lived slivers
> I always just renew them for 5 days everywhere, even though I know that
> everywhere except Utah I could renew them for much longer.
>
> There will always be resources in GENI which effectively are not
> constrained and thus don't have a functional maximum expiration date
> (dataplane flowspace, for example), while other resources (raw PCs, for
> example) will be highly constrained and have high contention and thus
> low maximum reservation times. Further, it's a federation so each
> resource is controlled by the local administrators and some will set
> policies based on their local usage as well. Now, part of the current
> "problem" is, I believe, that the PG-side max renewal time is per-AM, not
> per-resource, which means that the maximum renewal time is set based on
> the scarce resource (raw PCs) and not necessarily the resource you are
> using (VMs). That being said, even if that problem were addressed,
> you'd still likely have varying policies enforced by the AMs you are
> using across your slice. The most useful feature here would be for each
> resource to advertise its max reservation time so you could coordinate
> them yourself.
>
> --

Yeah, all good points. I know there are user tools that make this
easier to manage, too. When you are building things piecemeal with
less-supported features, that's where things get a little hairy, but I
agree that the onus is on the user to manage their resources. A "max
reservable time" would be useful.

...and sorry if I overuse the lists to chronicle my love/hate
relationship with the GENI CF, but hopefully these points are useful to
others as well. ;)

Ezra Kissel

Jun 13, 2014, 5:37:28 PM
to geni-...@googlegroups.com
I tried a few more things. Excluding ATLA, I get CHIC->HOUS. When I
exclude ATLA+HOUS, I get CHIC->KANS. When I exclude ATLA+HOUS+KANS, I
get CHIC->LOSA. When I exclude ATLA+HOUS+KANS+LOSA, I get CHIC->SALT.
This last one seemingly makes the most sense.

Here are the results of each request (stitching ig-illinois and ig-utah):

CHIC->ATLA (default) -- Success, 70-136ms RTT

CHIC->HOUS -- Stitcher completes, nodes are up. Can't ping. Tried two
times, fresh slices each time.

CHIC->KANS -- The createsliver at pg-utah tells me there's no capacity
to map the edge nodes. Tried down to 5 Mbps capacity.

CHIC->LOSA -- Success, 60ms RTT.

CHIC->SALT -- Success, 35ms RTT. Happy days are here! :)

You'll notice that each attempt selects a next hop name in alphabetical
order. Correct me if I'm wrong, but I think what's happening is that
protogeni is in a unique position of having presence at each Internet2
POP, so the SCS sees ATLA, HOUS, KANS, etc. as valid next hops to get to
ig-utah from I2 CHIC. Xi mentioned to me that if several shortest paths
have the same cost, the SCS takes the first one in the list, which
apparently is ordered alphabetically.
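
The behaviour described here can be reproduced with a toy tie-break in Python. This is a guess at the effect, not the SCS implementation. The equal costs and the alphabetical tie-break are assumptions taken from the observations in this thread.

```python
def pick_next_hop(candidates, excluded=()):
    """Pick the cheapest candidate; break cost ties by name (alphabetical)."""
    remaining = {pop: cost for pop, cost in candidates.items() if pop not in excluded}
    if not remaining:
        return None
    return min(remaining, key=lambda pop: (remaining[pop], pop))


# Equal costs are an assumption, matching the "every metric is 10" observation.
candidates = {"SALT": 10, "LOSA": 10, "KANS": 10, "HOUS": 10, "ATLA": 10}

# Exclude each chosen hop in turn, as in the tests above.
excluded = []
order = []
while True:
    choice = pick_next_hop(candidates, excluded)
    if choice is None:
        break
    order.append(choice)
    excluded.append(choice)
print(order)  # ['ATLA', 'HOUS', 'KANS', 'LOSA', 'SALT']
```

With all costs tied, the name alone decides the result, which is why SALT only shows up after the four others are excluded.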

IMHO, if the path cost from CHIC to all of those other options is the
same, then I feel that the link metrics are off, or maybe there needs to
be a secondary metric in play.

Anyway, I can get what I want with the above method. It just takes a
little more effort than I anticipated.

Here's the command I used for that last test:

stitcher.py -r idms createsliver idms-ui-2 \
--excludehop="urn:publicid:IDN+ion.internet2.edu+interface+rtr.atla:ge-10/3/2:protogeni" \
--excludehop="urn:publicid:IDN+ion.internet2.edu+interface+rtr.hous:ge-1/2/4:protogeni" \
--excludehop="urn:publicid:IDN+ion.internet2.edu+interface+rtr.kans:ge-10/2/9:protogeni" \
--excludehop="urn:publicid:IDN+ion.internet2.edu+interface+rtr.losa:ge-10/3/0:protogeni" \
../../stitch-idms-ig-utah-ig-ill.xml
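
When the exclusion list changes between attempts, it can be easier to generate that command than to edit it by hand. The Python sketch below mirrors only the options used in the command above (`-r`, `createsliver`, `--excludehop`). The helper `build_stitcher_command` and its parameter names are hypothetical, and reading `-r` as the project name is an assumption.

```python
import shlex


def build_stitcher_command(project, slice_name, rspec_path, exclude_urns):
    """Assemble a stitcher.py createsliver argv with one --excludehop per URN."""
    argv = ["stitcher.py", "-r", project, "createsliver", slice_name]
    argv += ["--excludehop=" + urn for urn in exclude_urns]
    argv.append(rspec_path)
    return argv


# URNs copied from the command above.
exclude = [
    "urn:publicid:IDN+ion.internet2.edu+interface+rtr.atla:ge-10/3/2:protogeni",
    "urn:publicid:IDN+ion.internet2.edu+interface+rtr.hous:ge-1/2/4:protogeni",
    "urn:publicid:IDN+ion.internet2.edu+interface+rtr.kans:ge-10/2/9:protogeni",
    "urn:publicid:IDN+ion.internet2.edu+interface+rtr.losa:ge-10/3/0:protogeni",
]
argv = build_stitcher_command(
    "idms", "idms-ui-2", "../../stitch-idms-ig-utah-ig-ill.xml", exclude
)
# Print a shell-safe command line; pass argv to subprocess.run to execute it.
print(" ".join(shlex.quote(arg) for arg in argv))
```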

thanks,
- ezra


On 6/13/2014 2:47 PM, Kissel, Ezra D wrote:
> Can you tell me how the SCS determines the sum of those metric values?
> The ION topology has each port metric set to "10" and the AMs advertise
> the same value. We can take this off-list, I'm just curious about the
> path finding details.
>
>
> On 6/13/2014 2:07 PM, Xi Yang wrote:
>> Ezra,
>>
>> The SCS will try to return the least-cost path, where the cost is the sum of the trafficEngineeringMetric values defined for the stitching links. It is not necessarily the path with the fewest hops.
>>
>> In the exclusion routing profile, you can exclude a link, port, node, or even an entire aggregate by providing the corresponding URNs.
>>
>> --Xi
>>

Ezra Kissel

Jun 13, 2014, 5:59:39 PM
to geni-...@googlegroups.com
btw, the new 2.6 stitcher is a huge improvement. I never would have
been able to try all those tests in that short amount of time with the
previous version. The combined manifest, improved logging, and pretty
XML printing are a big win for debugging. Thanks to Aaron and whoever
else worked on the new features.

Xi Yang

Jun 13, 2014, 6:10:44 PM
to geni-...@googlegroups.com
It appears that stitcher also has an --includehop option, which should map onto the SCS hop-inclusion routing profile.
You may want to try that to see if it works better for you than --excludehop.

—Xi

Sarah Edwards

Jun 13, 2014, 6:23:50 PM
to geni-...@googlegroups.com, Sarah Edwards
On Jun 13, 2014, at 4:15 PM, Ezra Kissel <ezki...@indiana.edu> wrote:

> ...and sorry if I over use the lists to chronicle my love/hate relationship with the GENI CF, but hopefully these points are useful to others as well. ;)

I for one enjoy the analysis.

I'm glad things are working and that the new stitcher is such a big help.