[DiffusionDrive][nuscenes] [motion_plan_head] V13MotionPlanningHead

ad_official · February 6, 2025

diffusion planning

forward_test

1. Initial Feature Extraction and Top‑K Selection

1.1. Detection Feature Extraction

Function and purpose:
- Extract the detected instance features, anchor embeddings, classification scores, and box predictions from det_output.
- Keep only the num_det (e.g. 50) highest-confidence detections, which reduces noise and computation in the later stages.
Input and output spec:
- instance_feature: original shape → [B, 900, 256]
- anchor_embed: [B, 900, 256]
- det_classification: [B, 900, 10] (after sigmoid)
- det_anchors: [B, 900, 11]
- Top‑K Selection → after selection
  - instance_feature_selected: [B, 50, 256]
  - anchor_embed_selected: [B, 50, 256]
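
The selection above can be sketched as follows. This is a minimal illustration, not the repository's code: it assumes the confidence of an instance is its maximum class score, and the helper name select_topk is hypothetical.

```python
import torch

def select_topk(features, anchor_embed, cls_scores, k):
    """Keep the k instances with the highest per-instance confidence."""
    # Assumption: an instance's confidence is its best class score.
    confidence = cls_scores.max(dim=-1).values                  # [B, N]
    _, idx = confidence.topk(k, dim=1)                          # [B, k]
    gather_idx = idx.unsqueeze(-1).expand(-1, -1, features.shape[-1])
    return features.gather(1, gather_idx), anchor_embed.gather(1, gather_idx), idx

B, N, C = 2, 900, 256
instance_feature = torch.randn(B, N, C)
anchor_embed = torch.randn(B, N, C)
det_classification = torch.rand(B, N, 10)   # stands in for sigmoid outputs

feat_sel, embed_sel, idx = select_topk(
    instance_feature, anchor_embed, det_classification, k=50
)
print(feat_sel.shape, embed_sel.shape)
```

The same helper would apply to the map branch with k = num_map.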

1.2. Mapping Feature Extraction

Function and purpose:
- Extract the map (static map) instance features and anchor information from map_output, and keep the top num_map (e.g. 10) elements.
Input and output spec:
- map_instance_feature: [B, 100, 256]
- map_anchor_embed: [B, 100, 256]
- map_classification: [B, 100, 3] (after sigmoid)
- map_anchors: [B, 100, 40]
- Top‑K Selection → after selection
  - map_instance_feature_selected: [B, 10, 256]
  - map_anchor_embed_selected: [B, 10, 256]

2. Obtaining Ego and Temporal Features

Function and purpose:
- Call instance_queue.get(...) to obtain temporal features and anchors that carry the history of the ego vehicle and the surrounding objects.
- This makes the ego information and the past-frame information available to the later graph operations.
Input/output spec:
- The return value is a tuple of:
  - ego_feature: [B, 1, 256]
  - ego_anchor: [B, 1, 11]
  - temp_instance_feature: [B, 901, 4, 256]
  - temp_anchor: [B, 901, 4, 11]
  - temp_mask: [B, 901, 4]
- ego_anchor_embed is then computed as anchor_encoder(ego_anchor) → [B, 1, 256]
- temp_anchor_embed is anchor_encoder(temp_anchor) followed by a flatten → [B*901, 4, 256]
- temp_instance_feature and temp_mask are flattened the same way, giving [B*901, 4, 256] and [B*901, 4].
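
As a quick shape check, the flattening described above merges the batch and instance axes so that each instance becomes its own sequence over the queued frames. The tensors below are random placeholders, and temp_anchor_embed stands for the output of anchor_encoder(temp_anchor).

```python
import torch

B, N, T, C = 1, 901, 4, 256
temp_instance_feature = torch.randn(B, N, T, C)
temp_anchor_embed = torch.randn(B, N, T, C)          # placeholder for anchor_encoder(temp_anchor)
temp_mask = torch.zeros(B, N, T, dtype=torch.bool)

# Merge batch and instance axes: one length-T sequence per instance.
temp_instance_feature_flat = temp_instance_feature.flatten(0, 1)   # [B*901, 4, 256]
temp_anchor_embed_flat = temp_anchor_embed.flatten(0, 1)           # [B*901, 4, 256]
temp_mask_flat = temp_mask.flatten(0, 1)                           # [B*901, 4]

print(temp_instance_feature_flat.shape, temp_mask_flat.shape)
```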

3. Mode Anchor and Query Initialization

3.1. Motion Anchor Generation

Function and purpose:
- get_motion_anchor builds the initial motion-prediction anchors for each detected instance from the detection results.
- Expected shape: [B, 900, 6, 12, 2] (900 instances, 6 modes, 12 timesteps, 2D coordinates).

3.2. Plan Anchor Generation

Function and purpose:
- The pre-computed K-means plan anchors (self.plan_anchor) are tiled along the batch dimension.
- Resulting shape: [B, cmd_mode, modal_mode, ego_fut_ts, 2] (e.g. B=1, cmd_mode usually 1 or 3, modal_mode=6, ego_fut_ts=6).

3.3. Mode Query Generation

Motion Mode Query:
- Input: motion_anchor[..., -1, :] → [B, 900, 6, 2] (the last-timestep coordinates of each instance and mode).
- Process: apply gen_sineembed_for_position to get [B, 900, 6, 256], then motion_anchor_encoder → final shape [B, 900, 6, 256].
  - motion_mode_query
Plan Mode Query:
- Input: plan_anchor[..., -1, :] → (e.g. [B, cmd_mode, 6, 2]).
- Process: gen_sineembed_for_position followed by plan_anchor_encoder → [B, cmd_mode, 6, 256]
- Post-processing: flatten, then unsqueeze, to get the final shape [B, 1, cmd_mode * 6, 256] (e.g. [1,1,6,256] or [1,1,18,256]; the actual value depends on cmd_mode).
  - plan_mode_query
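
The query construction can be sketched as below. The sine_embed function is a generic sinusoidal embedding written for this note, standing in for gen_sineembed_for_position, and the two encoders are plain MLP stand-ins; the real layer layout and embedding constants are assumptions.

```python
import math
import torch
from torch import nn

def sine_embed(pos, hidden_dim=256, temperature=10000.0):
    """Sinusoidal embedding of (x, y) points: [..., 2] -> [..., hidden_dim]."""
    half = hidden_dim // 2                                   # channels per coordinate
    dim_t = torch.arange(half, dtype=torch.float32)
    dim_t = temperature ** (2 * torch.div(dim_t, 2, rounding_mode="floor") / half)
    scaled = pos.unsqueeze(-1) * (2 * math.pi) / dim_t       # [..., 2, half]
    emb = torch.stack((scaled[..., 0::2].sin(), scaled[..., 1::2].cos()), dim=-1)
    return emb.flatten(-3)                                   # [..., hidden_dim]

def make_encoder(dim=256):
    # Stand-in for motion_anchor_encoder / plan_anchor_encoder (assumed MLP).
    return nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.LayerNorm(dim), nn.Linear(dim, dim))

motion_anchor = torch.randn(1, 900, 6, 12, 2)
plan_anchor = torch.randn(1, 3, 6, 6, 2)                     # cmd_mode = 3

# Embed only the final waypoint of every anchor trajectory.
motion_mode_query = make_encoder()(sine_embed(motion_anchor[..., -1, :]))   # [1, 900, 6, 256]
plan_mode_query = make_encoder()(sine_embed(plan_anchor[..., -1, :]))       # [1, 3, 6, 256]
plan_mode_query = plan_mode_query.flatten(1, 2).unsqueeze(1)                # [1, 1, 18, 256]

print(motion_mode_query.shape, plan_mode_query.shape)
```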

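4. Feature Concatenation

Purpose:
- Merge the selected detection instances with the ego feature, so that all instances (detections + ego) share one combined set of features and anchor embeddings.
Spec:
- instance_feature_selected: the existing [B, 50, 256] is concatenated with the ego feature [B, 1, 256] → [B, 51, 256].
- anchor_embed_selected: [B, 51, 256].
- The ego information is also appended to the original instance_feature and anchor_embed, giving [B, 901, 256] each.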
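
A minimal sketch of this concatenation, using random placeholder tensors. The only point is that the ego slot is appended last along the instance axis, which is what later slicing with [:, :num_anchor] and [:, num_anchor:] relies on.

```python
import torch

B, C = 1, 256
instance_feature = torch.randn(B, 900, C)
instance_feature_selected = torch.randn(B, 50, C)
ego_feature = torch.randn(B, 1, C)

# Append the ego slot as the last instance in both the full and the selected set.
instance_feature = torch.cat([instance_feature, ego_feature], dim=1)                    # [B, 901, 256]
instance_feature_selected = torch.cat([instance_feature_selected, ego_feature], dim=1)  # [B, 51, 256]

print(instance_feature.shape, instance_feature_selected.shape)
```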

5. Feature Update through the Interact Layers (processing interact_operation_order)

            interact_operation_order=(
                [
                    "temp_gnn",
                    "gnn",
                    "norm",
                    "cross_gnn",
                    "norm",
                    "ffn",                    
                    "norm",
                ] * 3 +
                [
                    "refine",
                ]
            ),
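
To make the configuration concrete, the sketch below expands the list and walks it with a simple dispatch loop. The handlers are placeholders that only record which operation ran; the real head would call the corresponding layer at each position.

```python
interact_operation_order = (
    ["temp_gnn", "gnn", "norm", "cross_gnn", "norm", "ffn", "norm"] * 3
    + ["refine"]
)

def run_ops(order, handlers, state):
    """Apply the handler registered for each op name, in order."""
    for i, op in enumerate(order):
        state = handlers[op](i, state)
    return state

# Placeholder handlers: each one appends its own name to a trace list.
handlers = {op: (lambda i, s, op=op: s + [op]) for op in set(interact_operation_order)}
trace = run_ops(interact_operation_order, handlers, [])

print(len(trace), trace[-1])   # 22 refine
```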

Purpose and flow:

The features and anchor embeddings of the detected and ego instances are refined through temporal and spatial interaction,
and the map information is fused in through cross-attention.

5.1. temp_gnn step

Role:
- Each instance attends to its own past information
  - attention between instance_feature and temp_instance_feature
  - attention between anchor_embed and temp_anchor_embed
Input:
- Query: instance_feature.flatten(0,1).unsqueeze(1) → shape: [B*(900+1), 1, 256].
- Key/Value: temp_instance_feature → [B*901, 4, 256].
- Query Pos: anchor_embed.flatten(0,1).unsqueeze(1) → [B*901, 1, 256].
- Key Pos: temp_anchor_embed → [B*901, 4, 256].
- Key Padding Mask: temp_mask → [B*901, 4].
Output:
- Updated instance features → reshaped to [B, 901, 256].
  - instance_feature
  - anchor_embed
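
The temporal attention can be sketched with torch.nn.MultiheadAttention. Several details here are assumptions rather than facts about the repository: positional embeddings are added to the query and key but not the value, a residual connection is used, and True in the mask marks frames to ignore (the PyTorch convention).

```python
import torch
from torch import nn

B, N, T, C = 1, 901, 4, 256
attn = nn.MultiheadAttention(embed_dim=C, num_heads=8, batch_first=True)

instance_feature = torch.randn(B, N, C)
anchor_embed = torch.randn(B, N, C)
temp_instance_feature = torch.randn(B * N, T, C)
temp_anchor_embed = torch.randn(B * N, T, C)
temp_mask = torch.zeros(B * N, T, dtype=torch.bool)
temp_mask[:, :2] = True                       # pretend the two oldest frames are padding

query = instance_feature.flatten(0, 1).unsqueeze(1)       # [B*901, 1, 256]
query_pos = anchor_embed.flatten(0, 1).unsqueeze(1)       # [B*901, 1, 256]

out, _ = attn(
    query + query_pos,                                    # query with positional term
    temp_instance_feature + temp_anchor_embed,            # key with positional term
    temp_instance_feature,                                # value
    key_padding_mask=temp_mask,
)
updated = (query + out).reshape(B, N, C)                  # residual, back to [B, 901, 256]
print(updated.shape)
```

Each instance is treated as its own batch element, so it only ever attends to its own history.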

5.2. gnn step

Role:
- Learns spatial relations between instances through attention: every instance is a query, and the selected detection (and ego) features (instance_feature_selected) with their anchor embeddings (anchor_embed_selected) provide the keys and values.
Input:
- Query: instance_feature → [B, 901, 256].
- Key/Value: instance_feature_selected → [B, 51, 256].
- Query Pos: anchor_embed → [B, 901, 256].
- Key Pos: anchor_embed_selected → [B, 51, 256].
Output:
- Updated instance features → [B, 901, 256].
  - instance_feature
  - anchor_embed

5.3. norm / ffn step

Role:
- Stabilizes the features with Layer Normalization, or applies a non-linear transform with a Feed-Forward Network.
Input/output:
- Shape is preserved: [B, 901, 256].

5.4. cross_gnn step

Role:
- Combines the map information (the selected map features and anchor embeddings) with the instance features, so that the surrounding static environment is reflected in them.
Input:
- Query: instance_feature → [B, 901, 256].
- Key/Value: map_instance_feature_selected → [B, 10, 256].
- Query Pos: anchor_embed → [B, 901, 256].
- Key Pos: map_anchor_embed_selected → [B, 10, 256].
Output:
- Updated instance features → [B, 901, 256].
  - instance_feature
  - anchor_embed

5.5. refine step

Purpose
- Takes the Motion Query ([B, 900, 6, 256]) and the Plan Query ([B, 1, cmd_mode × 6, 256]) and refines them into concrete predictions.
- The Motion Query carries future-trajectory information for each detected object (instance). Two branches are applied to it in this step:
  - Motion Classification: the existence or confidence (probability) of a future trajectory, per object and mode
  - Motion Regression: the 2D coordinates (or additional state) at each future timestep, per object and mode
- The Plan Query carries driving-plan information for the ego vehicle,
- but in this module the plan classification and regression branches are commented out, and a Planning Status is produced instead
  - Planning Status: a vector summarizing the current state or driving situation of the ego vehicle
  - It can be used by later stages (e.g. the diffusion-based refinement).
Input
- Motion Query:
  - Computation: motion_mode_query + (instance_feature + anchor_embed)[:, :num_anchor].unsqueeze(2)
  - Inputs:
    - motion_mode_query: [B, 900, 6, 256]
    - (instance_feature + anchor_embed)[:, :num_anchor]: [B, 900, 256] → unsqueeze(2) → [B, 900, 1, 256]
  - Final input: Motion Query → [B, 900, 6, 256]
- Plan Query:
  - Computation: plan_mode_query + (instance_feature + anchor_embed)[:, num_anchor:].unsqueeze(2)
  - Inputs:
    - plan_mode_query: [B, 1, cmd_mode × 6, 256] (e.g. [B, 1, 18, 256] when cmd_mode=3)
    - (instance_feature + anchor_embed)[:, num_anchor:]: [B, 1, 256] → unsqueeze(2) → [B, 1, 1, 256]
  - Final input: Plan Query → [B, 1, cmd_mode × 6, 256]
- Additional inputs
  - ego_feature: [B, 1, 256]
    - a feature built from the front camera image
  - ego_anchor_embed: [B, 1, 256]
    - the anchor embedding of the ego vehicle
  - metas (optional):
    - a dictionary of auxiliary information such as the driving situation
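
The two query computations rely on broadcasting over the mode axis. The sketch below uses random placeholders and assumes cmd_mode = 3 with 6 plan modes per command, so the plan query has 18 modes.

```python
import torch

B, num_anchor, fut_mode, C = 1, 900, 6, 256
cmd_mode, ego_fut_mode = 3, 6

instance_feature = torch.randn(B, num_anchor + 1, C)   # ego occupies the last slot
anchor_embed = torch.randn(B, num_anchor + 1, C)
motion_mode_query = torch.randn(B, num_anchor, fut_mode, C)
plan_mode_query = torch.randn(B, 1, cmd_mode * ego_fut_mode, C)

fused = instance_feature + anchor_embed                # [B, 901, 256]

# unsqueeze(2) adds a mode axis of size 1 that broadcasts over all modes.
motion_query = motion_mode_query + fused[:, :num_anchor].unsqueeze(2)   # [B, 900, 6, 256]
plan_query = plan_mode_query + fused[:, num_anchor:].unsqueeze(2)       # [B, 1, 18, 256]

print(motion_query.shape, plan_query.shape)
```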

(A) Motion Branch

motion_cls_branch:
- Structure:
  - linear_relu_ln block (input dim 256 → intermediate dim; final output dim 256)
  - Linear layer: 256 → 1
- Input: Motion Query, shape [B, 900, 6, 256]
- Output:
  - Raw classification logits, shape [B, 900, 6, 1]
  - then squeeze(-1) → [B, 900, 6]
- Purpose: produce a confidence or classification score for the motion prediction of each instance and mode
motion_reg_branch:
- Structure:
  - Linear (256 → 256), ReLU, Linear (256 → 256), ReLU, Linear (256 → fut_ts × 2)
- Input: Motion Query, shape [B, 900, 6, 256]
- Output:
  - Raw regression output, shape [B, 900, 6, fut_ts × 2]
  - reshaped → [B, 900, fut_mode, fut_ts, 2]
  - Example: with fut_ts = 12 and fut_mode = 6, the final motion_reg has shape [B, 900, 6, 12, 2]
- Purpose: for each instance, predict the 2D coordinates (or additional state) at 12 timesteps for each of the 6 modes
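
A shape-level sketch of the two branches follows. The classification branch uses a single Linear/ReLU/LayerNorm group as a simplified stand-in for the linear_relu_ln helper, whose exact depth is not shown here; the regression branch follows the layer list above.

```python
import torch
from torch import nn

C, fut_mode, fut_ts = 256, 6, 12

motion_cls_branch = nn.Sequential(
    nn.Linear(C, C), nn.ReLU(), nn.LayerNorm(C),   # simplified stand-in for linear_relu_ln
    nn.Linear(C, 1),
)
motion_reg_branch = nn.Sequential(
    nn.Linear(C, C), nn.ReLU(),
    nn.Linear(C, C), nn.ReLU(),
    nn.Linear(C, fut_ts * 2),
)

motion_query = torch.randn(1, 900, fut_mode, C)
motion_cls = motion_cls_branch(motion_query).squeeze(-1)                           # [1, 900, 6]
motion_reg = motion_reg_branch(motion_query).reshape(1, 900, fut_mode, fut_ts, 2)  # [1, 900, 6, 12, 2]

print(motion_cls.shape, motion_reg.shape)
```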

(B) Plan Branch

In the code, plan_cls and plan_reg are commented out and plan_status_branch is used instead.
plan_status_branch:
- Structure:
  - Linear (256 → 256), ReLU, Linear (256 → 256), ReLU, Linear (256 → 10)
- Input:
  - ego_feature + ego_anchor_embed, carrying the ego-related information
  - Shape: [B, 1, 256]
- Output:
  - Planning Status, shape: [B, 1, 10]
- Purpose:
  - produce a state vector describing the ego vehicle's state or driving situation (e.g. a 10-dimensional vector)

(C) Final Output

Return values:
- The module returns the following five outputs.
  - motion_cls: [B, 900, 6]
  - motion_reg: [B, 900, fut_mode, fut_ts, 2] (e.g. [B, 900, 6, 12, 2])
  - plan_cls: None (commented out in the code)
  - plan_reg: None (commented out in the code)
  - planning_status: [B, 1, 10]

6. Diffusion Stage (Post-processing and Refinement)

            diff_operation_order=(
                [
                    "traj_pooler",
                    "self_attn",
                    "norm",
                    # "modulation",
                    "agent_cross_gnn",
                    "norm",
                    "anchor_cross_gnn",
                    "norm",
                    # "modulation",
                    "ffn",                    
                    "norm",
                    "modulation",
                    "diff_refine",
                ] * 2
            ),

6.1. Diffusion Setup and Plan Query Reshaping

Purpose:
- Starting from the plan query obtained in the refine step (the plan_query derived from plan_mode_query), further refine the planned trajectory with a diffusion scheduler.
Process:
- plan_query (shape: [B, 1, 18, 256]) is squeezed to [B, 18, 256] and then reshaped to [B, 3, ego_fut_mode, 256] (e.g. [1, 3, 6, 256]).
- The command (cmd) is selected from the meta information (metas['gt_ego_fut_cmd']); the selected plan query, cmd_plan_nav_query, has shape [B, ego_fut_mode, 256] (e.g. [1,6,256]).

6.2. Plan Anchor Processing

Purpose:
- Select the part of plan_anchor that corresponds to the command, and take the difference between consecutive anchor points to obtain the per-step displacement used as the target planned trajectory (tgt_cmd_plan_anchor).
Input/output spec:
- plan_anchor → original shape [B, cmd_mode, modal_mode, ego_fut_ts, 2]
- After selection → cmd_plan_anchor: [B, ego_fut_mode, 6, 2]
- After differencing → tgt_cmd_plan_anchor: [B, ego_fut_mode, 6-1, 2] (e.g. [1,6,5,2])
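
The selection and differencing can be sketched as below, following the shapes listed in this subsection. The one-hot layout of gt_ego_fut_cmd and the example values are assumptions made for illustration.

```python
import torch

B, cmd_mode, modal_mode, ego_fut_ts = 2, 3, 6, 6

# Placeholder anchors: cumulative sums make them look like waypoint trajectories.
plan_anchor = torch.randn(cmd_mode, modal_mode, ego_fut_ts, 2).cumsum(dim=2)
plan_anchor = plan_anchor.unsqueeze(0).repeat(B, 1, 1, 1, 1)       # tile over the batch

# Assumption: the command is given as a one-hot vector per sample.
gt_ego_fut_cmd = torch.tensor([[0.0, 0.0, 1.0], [1.0, 0.0, 0.0]])
cmd = gt_ego_fut_cmd.argmax(dim=-1)                                # [B]

cmd_plan_anchor = plan_anchor[torch.arange(B), cmd]                # [B, 6, 6, 2]
# Difference between consecutive waypoints: per-step displacement.
tgt_cmd_plan_anchor = cmd_plan_anchor[:, :, 1:] - cmd_plan_anchor[:, :, :-1]   # [B, 6, 5, 2]

print(cmd_plan_anchor.shape, tgt_cmd_plan_anchor.shape)
```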

6.3. Normalization and Adding Diffusion Noise

Process:
- tgt_cmd_plan_anchor is normalized with normalize_ego_fut_trajs and reshaped to [B * ego_fut_mode, ego_fut_ts, 2] (e.g. [6,6,2]).
- Random noise is added at sampled timesteps through the diffusion scheduler, producing the noisy trajectory points.
Input/output spec:
- Input: [6,6,2] → same shape after the noise is added.
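
The noising step follows the standard forward-diffusion formula that DDPM/DDIM-style schedulers implement: x_t = sqrt(ᾱ_t)·x_0 + sqrt(1 − ᾱ_t)·ε. The sketch below writes it out by hand; the linear beta schedule, the 1000 training steps, and the timestep value are assumptions, not the head's actual scheduler settings.

```python
import torch

num_train_timesteps = 1000                                  # assumed
betas = torch.linspace(1e-4, 0.02, num_train_timesteps)     # assumed linear schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def add_noise(x0, noise, timesteps):
    """x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * noise, per sample."""
    a = alphas_cumprod[timesteps].sqrt().view(-1, 1, 1)
    s = (1.0 - alphas_cumprod[timesteps]).sqrt().view(-1, 1, 1)
    return a * x0 + s * noise

x0 = torch.randn(6, 6, 2)                                   # normalized trajectory points
noise = torch.randn_like(x0)
timesteps = torch.full((6,), 8, dtype=torch.long)           # hypothetical timestep

noisy_traj_points = add_noise(x0, noise, timesteps)
print(noisy_traj_points.shape)
```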

6.4. Trajectory Feature Embedding

Process:
- The noised trajectory (before and after denoising) is embedded with a sine positional embedding (gen_sineembed_for_position), and plan_pos_encoder is then applied to produce the final traj_feature.
Input/output spec:
- Input noisy_traj_points: [B*ego_fut_mode, ego_fut_ts, 2] (e.g. [6,6,2])
- Sine embedding → [6,6,hidden_dim] (e.g. hidden_dim=128)
- After flatten → [6, 6*hidden_dim] (e.g. [6,768])
- After plan_pos_encoder → traj_feature: [6, embed_dims] (e.g. [6,256])
- Reshape → [B, ego_fut_mode, embed_dims] (e.g. [1,6,256])

6.5. Time Embedding

Process:
- time_mlp is applied to each diffusion timestep to obtain the time embedding, with shape [B, ego_fut_mode, embed_dims] (e.g. [1,6,256]).

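6.6. Diffusion Refinement Loop (processing diff_operation_order)

Purpose:
- Apply the diffusion refinement to the plan query over several diffusion timesteps (roll_timesteps).
Modules and flow:
- Loop over each timestep k (e.g. two steps)
- At every timestep, the traj_feature built from the initially noised trajectory goes through the following operations in order:
  - "traj_pooler": pools trajectory-related information into a summary.
    - Input: traj_feature and the current noisy trajectory (diff_plan_reg)
    - Output: updated traj_feature, shape preserved [B*ego_fut_mode, embed_dims].
  - "self_attn": learns relations between the trajectory candidates through self-attention.
    - Input/output: [B*ego_fut_mode, embed_dims].
  - "modulation": combines the feature with the time embedding to adjust the feature distribution.
    - Input: traj_feature, time_embed → output of the same shape.
  - "agent_cross_gnn": cross-attention with the selected instances (instance_feature_selected and anchor_embed_selected).
    - Input: traj_feature and the instance context; output: refined traj_feature.
  - "map_cross_gnn": cross-attention with the map information (this op does not appear in the diff_operation_order listed above).
    - Input: traj_feature, map_instance_feature_selected, map_anchor_embed_selected.
  - "anchor_cross_gnn": cross-attention with the plan query (cmd_plan_nav_query), bringing in the anchor-based information.
    - Input: traj_feature, cmd_plan_nav_query; output: refined traj_feature.
  - "norm"/"ffn": normalization or a feed-forward network applied to the feature.
  - "diff_refine": the diffusion refinement module takes the refined traj_feature and outputs diff_plan_reg (the predicted trajectory regression) and diff_plan_cls (the predicted classification).
    - Output spec:
      - diff_plan_reg: usually [B*ego_fut_mode, ego_fut_ts, 2] (e.g. [6,6,2])
      - diff_plan_cls: classification result, with a shape such as [B*ego_fut_mode, num_cls_plan].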

6.7. Inverse Diffusion Step

Process:
- At each roll timestep, the step method of diffusion_scheduler is called; using the model output (x_start) it removes noise and recovers the predicted trajectory.
- diff_plan_reg is updated with the result.
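
When the model predicts the clean sample (x_start), a deterministic DDIM update (eta = 0) first recovers the implied noise and then re-noises the prediction to the previous timestep. The sketch below is a hand-written version under an assumed linear beta schedule, not the scheduler's actual implementation; ddim_step is a hypothetical helper name.

```python
import torch

num_train_timesteps = 1000                                  # assumed
betas = torch.linspace(1e-4, 0.02, num_train_timesteps)     # assumed linear schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def ddim_step(x_t, x0_pred, t, t_prev):
    """Deterministic DDIM update from timestep t to t_prev, given a predicted clean sample."""
    a_t = alphas_cumprod[t]
    a_prev = alphas_cumprod[t_prev] if t_prev >= 0 else torch.tensor(1.0)
    eps = (x_t - a_t.sqrt() * x0_pred) / (1.0 - a_t).sqrt()      # noise implied by x0_pred
    return a_prev.sqrt() * x0_pred + (1.0 - a_prev).sqrt() * eps

x0 = torch.randn(6, 6, 2)
eps_true = torch.randn_like(x0)
t = 10                                                      # hypothetical roll timestep
x_t = alphas_cumprod[t].sqrt() * x0 + (1.0 - alphas_cumprod[t]).sqrt() * eps_true

# With a perfect x_start prediction, the final step returns the clean trajectory.
x_prev = ddim_step(x_t, x0, t, t_prev=-1)
print(torch.allclose(x_prev, x0, atol=1e-4))
```

In the real loop, x0_pred would be the diff_plan_reg produced by "diff_refine" at that timestep.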

6.8. Diffusion Output Aggregation

Process:
- The final diff_plan_reg is reshaped to [B, 1, ego_fut_mode, ego_fut_ts, 2] and repeated to expand the mode dimension (e.g. final shape → [B, 1, 3*ego_fut_mode, ego_fut_ts, 2]).
- diff_plan_cls is appended to a list as well.
Output:
- The planning_output dictionary is updated with "diffusion_prediction", "diffusion_classification", "tgt_cmd_plan_anchor", and so on.
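
The reshape-and-repeat above can be checked with a few lines. The tensor is a random placeholder; the repeat factor 3 matches the cmd_mode of the example.

```python
import torch

B, ego_fut_mode, ego_fut_ts = 1, 6, 6
diff_plan_reg = torch.randn(B * ego_fut_mode, ego_fut_ts, 2)

# [B*6, 6, 2] -> [B, 1, 6, 6, 2] -> tile the mode axis three times -> [B, 1, 18, 6, 2]
diffusion_prediction = (
    diff_plan_reg.reshape(B, 1, ego_fut_mode, ego_fut_ts, 2).repeat(1, 1, 3, 1, 1)
)
print(diffusion_prediction.shape)
```

Because repeat tiles the data, the three blocks of 6 modes are identical copies; this only makes the shape match the [B, 1, cmd_mode×modal_mode, ...] layout of the other planning outputs.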

7. Final Output and Return

Motion Output:
- A dictionary with
  - "classification": list of motion classification results from the interact stage (each item has shape e.g. [B, 900, 6])
  - "prediction": list of motion regression results (e.g. [B, 900, 6, fut_ts, 2])
  - "period" and "anchor_queue": metadata from the instance queue.
Planning Output:
- A dictionary with
  - "classification": list of plan classification results from the interact stage
  - "prediction": list of plan regression results (including the refine-step and diffusion-refined results)
  - "status": plan status information
  - In addition, the diffusion refinement results are stored under "diffusion_prediction" and "diffusion_classification".
Return:
- forward_test finally returns (motion_output, planning_output).

Summary

The initial stage extracts the detection (det) and map features and anchors, and selects the high-confidence instances with Top‑K.
Ego and temporal information is fetched from instance_queue to build features that include the ego vehicle and past frames.
The mode anchors and queries are initialized, creating the initial motion and plan queries for each instance (detections and ego).
The interact layers (temp_gnn, gnn, norm/ffn, cross_gnn, refine) update the instance features and produce motion_query and plan_query, from which the initial predictions are made.
The diffusion stage refines the plan further through the diffusion scheduler, starting from the plan query, to produce the final planned trajectory.
Finally, the motion and planning predictions are packed into dictionaries and returned.

The input and output tensors at each stage follow the specs given above (e.g. [B, 900, 256], [B,900,6,256], [B, 1,6,256], [B, ego_fut_mode, ego_fut_ts, 2]), so that the modules can fuse spatial, temporal, and multi-modal information into the final driving plan.

In short, forward_test runs the whole pipeline, combining detection and map information, temporal history, ego information, and the predefined anchors and queries to produce the final Motion and Planning predictions.

forward

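Things to figure out

TODO: find out how instance_feature and anchor_embed are produced
- for both det_output and map_output
Understand the logic of the first structure
- Purpose: predict the future positions of the ego vehicle and the surrounding obstacles
- TODO: trace which network each input/output goes into and comes out of
  - Purpose:
    - Does this structure really have an advantage?
    - Which inputs are really necessary?
Understand the logic of the second (diffusion) structure
- TODO: trace which network each input/output goes into and comes out of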

input

""" det_output : Dict
instance_feature: (1, 900, 256)
anchor_embed: (1, 900, 256)

len(det_output['classification']): 6
    det_output['classification'][-1].shape: torch.Size([1, 900, 10])
len(det_output['prediction']): 6
    det_output['prediction'][-1].shape: torch.Size([1, 900, 11])
len(det_output['quality']): 6
    det_output['quality'][-1].shape: torch.Size([1, 900, 2])
"""
---------------------------
""" map_output: Dict
instance_feature: (1, 100, 256)
anchor_embed: (1, 100, 256)
len(map_output['classification']): 6
map_output['classification'][-1].shape: torch.Size([1, 100, 3])
len(map_output['prediction']): 6
map_output['prediction'][-1].shape: torch.Size([1, 100, 40])
"""
---------------------------
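""" feature_maps : List[Tensor]
[0]: (1, 89760, 256)
    - The feature maps from all scales (levels) and cameras, flattened per spatial location (pixel or patch) and concatenated into one large tensor
    - 89760 is the total number of pixels (or patches) over all cameras and all scales, and 256 is the number of feature channels at each location
    - In other words, the camera and scale dimensions are folded into a single axis so that the features at every spatial location can be processed at once, as a "column" (feature column) representation
[1]: (6, 4, 2)
    - The original spatial size (height, width) for each camera (6) and each scale (level, 4)
    - 6 is the number of cameras, 4 is the number of scales (levels) per camera, and 2 is (height, width)
    - This is the reference used to restore the flattened col_feats to their original spatial layout
[2]: (6, 4)
    - For each camera and scale, the index in the flattened col_feats at which that scale's features start
    - In (6, 4), 6 is the number of cameras and 4 the number of scales per camera;
        with these values the flattened feature maps can be split or rebuilt per scale
"""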
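
The relationship between the three tensors can be sketched as follows. The per-level sizes are an assumption (a 256x704 input at strides 4, 8, 16, 32), chosen because they reproduce the total of 89760; the camera-major ordering of the flattened axis is also assumed, and get_level is a hypothetical helper.

```python
import torch

num_cams, C = 6, 256
# Assumed per-level (H, W): 256x704 input at strides 4, 8, 16, 32.
level_hw = torch.tensor([[64, 176], [32, 88], [16, 44], [8, 22]])
spatial_shape = level_hw.unsqueeze(0).repeat(num_cams, 1, 1)       # (6, 4, 2)

sizes = (spatial_shape[..., 0] * spatial_shape[..., 1]).flatten()  # tokens per (camera, level)
scale_start_index = (torch.cumsum(sizes, 0) - sizes).reshape(num_cams, -1)   # (6, 4)
total = int(sizes.sum())

col_feats = torch.randn(1, total, C)                               # (1, 89760, 256)

def get_level(cam, lvl):
    """Recover one camera/level feature map from the flattened column features."""
    h, w = spatial_shape[cam, lvl].tolist()
    start = int(scale_start_index[cam, lvl])
    return col_feats[:, start:start + h * w].reshape(1, h, w, C)

print(total, get_level(0, 0).shape)
```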

output

""" motion_output : Dict
"classification": len = 1
    (1, 900, fut_mode=6)
"prediction": len = 1
    (1, 900, fut_mode=6, fut_ts=12, 2)
"period": (1, 900)
"anchor_queue": len = 4
    (1, 900, 11)
"""
-------------------------
""" planning_output : Dict
classification: len 1
    (1, 1, cmd_mode(3)*modal_mode(6)=18)
prediction: len 1
    (1, 1, cmd_mode(3)*modal_mode(6)=18, ego_fut_ts=6, 2)
status: len 1
    (1, 1, 10)
anchor_queue: len 4
    (1, 1, 11)
period: ( 1, 11)
"""

Logic

decode

Things to figure out

Figure out the purpose of the decode network

input

""" det_output : Dict
len(det_output['classification']): 6
    det_output['classification'][-1].shape: torch.Size([1, 900, 10])
len(det_output['prediction']): 6
    det_output['prediction'][-1].shape: torch.Size([1, 900, 11])
len(det_output['quality']): 6
    det_output['quality'][-1].shape: torch.Size([1, 900, 2])
"""
------------------------------------------------------
""" motion_output : Dict
"classification": len = 1
    (1, 900, fut_mode=6)
"prediction": len = 1
    (1, 900, fut_mode=6, fut_ts=12, 2)
"period": (1, 900)
"anchor_queue": len = 4
    (1, 900, 11)
"""
------------------------------------------------------
""" planning_output : Dict
classification: len 1
    (1, 1, cmd_mode(3)*modal_mode(6)=18)
prediction: len 1
    (1, 1, cmd_mode(3)*modal_mode(6)=18, ego_fut_ts=6, 2)
status: len 1
    (1, 1, 10)
anchor_queue: len 4
    (1, 1, 11)
period: ( 1, 11)
"""

output

""" motion_result: List, len = 1 (probably one entry per batch element)
dict
    trajs_3d: (300, fut_mode=6, fut_ts=12, 2)
    trajs_score: (300, fut_mode=6)
    anchor_queue: (300, max_length(=4), 10)
    period: (300)
"""
--------------------------------------
""" planning result: len = 1 (probably one entry per batch element)
dict
    planning_score: (cmd_mode=3, modal_mode=6)
    planning: (cmd_mode=3, modal_mode=6, ego_fut_ts=6, 2)
    final_planning: (ego_fut_ts=6, 2)
    ego_period: (1)
    ego_anchor_queue: (1, max_length=4, 10)
"""

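Logic

3. Final Summary

V13MotionPlanningHead is an end-to-end module of the autonomous driving stack that takes the object detection and map results (det_output, map_output) and predicts the future motion of surrounding objects and the ego path.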

Inputs and feature integration:
- det_output:
  - "instance_feature"
    - [1, 900, 256]
  - "anchor_embed"
    - [1, 900, 256]
  - "classification"
    - len: 6
    - [-1].shape: [1, 900, 10]
  - "prediction"
    - len: 6
    - [-1].shape: [1, 900, 11]
      - 11 values such as x, y, z
      - a position prediction, not a trajectory prediction
- map_output:
  - instance_feature
    - [1, 100, 256]
  - anchor_embed
    - [1, 100, 256]
  - classification
    - len: 6
    - [-1].shape: [1, 100, 3]
      - lane divider, boundary, pedestrian crossing
  - prediction
    - len: 6
    - [-1].shape: [1, 100, 40]
      - 20 points with (x, y) coordinates
- instance_queue
  - provides past temporal information (ego_feature, ego_anchor, etc.) for temporal continuity
    - ego_feature: [1, 1, 256]
    - ego_anchor: [1, 1, 11]
    - temp_instance_feature: [1, 901, 1, 256]
    - temp_anchor: [1, 901, 1, 11]
    - temp_mask: [1, 901, 1]
    - (the third dimension is the number of queued frames; it is 1 in this dump and 4 in the forward_test walkthrough above)
- feature_maps and metas:
  - feature_maps: List[Tensor] with shape [B, N, C, H, W]
  - carry the multi-scale image feature maps and the camera/projection meta information
Query generation and interaction:
- motion_mode_query and plan_mode_query are generated from motion_anchor and plan_anchor respectively
- The Interact Operation Order (interact_layers) refines the candidate features, and refine_layer (V11MotionPlanningRefinementModule) then outputs
  - motion_cls: [B, 900, fut_mode=6]
  - motion_reg: [B, 900, fut_mode=6, fut_ts=12, 2]
  - plan_status: ego state information
- The diffusion process (diff_layers, modulation_layer, traj_pooler_layer) is then applied to refine the ego planning prediction further

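Term	Meaning	Example
fut_mode	Number of modes considered when predicting the future trajectory of other objects (e.g. surrounding vehicles, pedestrians)	When a surrounding vehicle's future path is predicted in several ways (sharp change, gentle change, etc.), the number of such candidates (e.g. 6)
ego_fut_mode	Number of modes considered when predicting the future trajectory of the ego vehicle	When the ego vehicle's future path is predicted in several ways (e.g. different driving strategies), the number of such candidates (e.g. 6)
cmd_mode	Kind of high-level driving command the ego vehicle will carry out (a classification of which action to take)	For example "go straight", "turn left", "turn right"; each command is one cmd_mode (e.g. 3 in total)
modal_mode	The detailed path candidates (execution strategies) that can be predicted within each high-level command (cmd_mode)	For example, within a "turn left" command, a sharp left turn, a gentle left turn, and so on (e.g. 6 candidates)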

Final output and post-processing:
- motion_output:
  - "classification": [B, 900, fut_mode=6]
  - "prediction": [B, 900, fut_mode=6, fut_ts=12, 2]
  - plus the period and anchor_queue updated by instance_queue
- planning_output:
  - "classification": [B, 1, cmd_mode×modal_mode] → later reshaped and corrected by diffusion into the final ego planning classification
  - "prediction": [B, 1, cmd_mode×modal_mode, ego_fut_ts=6, 2] → refined by diffusion into [B, 1, 3×ego_fut_mode, ego_fut_ts, 2]
  - "status": planning status information, along with ego_period and ego_anchor_queue (from instance_queue)
Losses and decoding:
- motion_loss and plan_loss are computed with FocalLoss, L1Loss, and similar losses,
- and motion_decoder and planning_decoder post-process the predictions into the final returned results.

In this way, V13MotionPlanningHead is the core module that integrates detection, map, and temporal information to predict future motion and the ego plan.

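1. Overall Role of V13MotionPlanningHead

V13MotionPlanningHead builds on the outputs of the object detection (det_output) and map (map_output) modules. It is
- the final motion planning module, which predicts future motion (of other objects) and the ego plan (path prediction) at the same time.
  Its main roles are as follows.
Candidate feature integration and refinement:
- Takes the candidate instance features (instance_feature) and anchor embeddings (anchor_embed) from detection (det_output)
  - and combines them with temporal information (past information from the ego/instance queue)
- Through several Transformer/Graph stages (the configured interact_operation_order)
  - it performs interaction between candidates, normalization, and cross-modal interaction (cross_gnn, etc.)
    - to produce query vectors suited to motion and planning (motion_query, plan_query)
Query generation and anchor selection:
- Loads the pre-computed motion anchors (motion_anchor) and plan anchors (plan_anchor), and from them generates
  - motion_mode_query: the query used for object motion prediction
  - plan_mode_query: the query used for ego planning (path prediction)
- plan_anchor is loaded from an external file and tiled per batch, expanding its shape to [B, cmd_mode, modal_mode, ego_fut_ts, 2] (e.g. [B, 3, 6, 6, 2]).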

Motion and planning prediction:
- refine_layer (here V11MotionPlanningRefinementModule) takes the refined motion_query and plan_query and outputs
  - motion_cls: a classification score per fut_mode (e.g. 6) for each candidate
  - motion_reg: a motion regression for each candidate, made of 2D (x, y) coordinates over fut_ts (e.g. 12) timesteps
  - plan_status: the ego planning status, computed from the combination of ego_feature and ego_anchor_embed
- Then the diffusion modules defined in diff_operation_order (e.g. traj_pooler, self_attn, modulation, diff_refine) are applied to
  - progressively refine the noisy ego future trajectory (noisy_traj_points)
  - and correct the final plan (prediction and classification)
Post-processing and final output:
- Finally, motion_decoder (SparseBox3DMotionDecoder) and planning_decoder (HierarchicalPlanningDecoder) decode the motion and planning results respectively
- The output consists of two dictionaries, motion_output and planning_output.

2. Main Components and Their Roles (with I/O Specs)

A. Instance Queue (instance_queue)

Role:
- Stores the instance features (instance_feature) and anchors from the detection module in time order to provide temporal consistency
- From it, the most recent frame's information is extracted as the ego feature and used as temporal context for ego planning.
Input:
- instance_feature from det_output: [B, 900, embed_dims] (e.g. [B, 900, 256])
- predicted anchors from det_output: [B, 900, D] (later passed through anchor_encoder to [B, 900, embed_dims])
Output:
- ego_feature: usually [B, 1, embed_dims]
- ego_anchor: [B, 1, D] (or [B, 1, embed_dims] after embedding)
- Temporal information (temp_instance_feature, temp_anchor, temp_mask): shapes depend on the batch size B and on queue_length

B. Motion & Plan Anchors and Their Encoders

motion_anchor:
- Role: pre-clustered motion candidates (file "data/kmeans/kmeansmotion{fut_mode}.npy")
- Output: the loaded value is generally [M_motion, D_motion] (M_motion: number of candidate modes)
- Processing: in get_motion_anchor(), the anchors are indexed by the classification result from detection, selected per batch, and transformed to the lidar frame with _agent2lidar().
plan_anchor:
- Role: ego planning candidates (file "data/kmeans/kmeansplan{ego_fut_mode}.npy")
- Output: shape e.g. [cmd_mode, modal_mode, ego_fut_ts, 2] (e.g. [3, 6, 6, 2])
- Processing: expanded along the batch dimension with tile().
motion_anchor_encoder:
- Role: applies a sinusoidal positional embedding and a linear transform to motion_anchor, turning it into high-dimensional query vectors
- Input: the last-timestep point of each motion_anchor candidate
- Output: motion_mode_query, [B, num_anchor, fut_mode, embed_dims] (e.g. [B, 900, 6, 256])
plan_anchor_encoder & plan_pos_encoder:
- Role: embed the positional information generated from plan_anchor to produce the ego planning query (plan_mode_query).
- Input: plan_anchor, then the positional information produced by gen_sineembed_for_position()
- Output: plan_mode_query, [B, cmd_mode, modal_mode, embed_dims], then flattened and unsqueezed to [B, 1, cmd_mode×modal_mode, embed_dims] (used as the final query)

C. Interact Operation Order (interact_layers)

Role:
- The configured interact_operation_order repeats ["temp_gnn", "gnn", "norm", "cross_gnn", "norm", "ffn", "norm"] three times and then appends "refine".
- Each module refines the candidate features (instance_feature, anchor_embed) through attention-based operations (MultiheadAttention, FlashAttention, etc.), LayerNorm, or a feed-forward network (AsymmetricFFN).
Input:
- instance_feature: [B, 901, embed_dims] (900 detections + ego)
- anchor_embed: [B, 901, embed_dims]
- (when needed) temporal information: temp_instance_feature, temp_anchor_embed (in flattened form, [B×901, queue_length, embed_dims])
Output:
- Updated instance_feature and anchor_embed; the shape stays [B, 901, embed_dims] up to the refine step.

D. Refine Layer (refine_layer → V11MotionPlanningRefinementModule)

Role:
- In the final "refine" step of interact_layers, takes motion_query and plan_query and produces the predictions.
- Specifically:
  - motion_cls: a classification score per fut_mode (e.g. 6) for each candidate
  - motion_reg: a motion regression over fut_ts (e.g. 12) timesteps for each candidate, with shape [B, num_anchor, fut_mode, fut_ts, 2]
  - plan_status: ego planning status computed from the sum of ego_feature and ego_anchor_embed (output shape [B, num_ego, 10])
Input:
- motion_query: [B, num_anchor, fut_mode, embed_dims]
- plan_query: [B, 1, cmd_mode×modal_mode, embed_dims]
- ego_feature + ego_anchor_embed: [B, num_ego, embed_dims] (num_ego is the number of ego candidates, usually 1)
Output:
- motion_cls: [B, num_anchor, fut_mode]
- motion_reg: [B, num_anchor, fut_mode, fut_ts, 2]
- plan_cls, plan_reg: returned as None by this module (see 5.5 (C)); they are produced in the diffusion process instead, where the final ego plan is shaped to [B, 1, 3×ego_fut_mode, ego_fut_ts, 2]
- plan_status: status vector, [B, 1, 10]

E. Diffusion Operation Order (diff_layers)

Role:
- Following diff_operation_order, applies the diffusion-based refinement to the ego planning query (plan_query).
- The process uses DDIMScheduler to refine the noisy ego future trajectory (odo_info_fut):
  - a time embedding (time_mlp) conditions the features on the noise level,
  - and the model predictions (traj_feature, plan_reg, etc.) are updated step by step over several diffusion steps.
Input:
- Initial traj_feature: [B, ego_fut_mode, feature_dim] (output of the previous module, e.g. [B, 6, 256])
- Timestep information (timesteps): [B] → repeat/interleave → [B×ego_fut_mode]
- global_cond (optional): an additional condition vector
Output:
- Final diffused plan_reg: the final ego path regression, shape → [B, 1, 3×ego_fut_mode, ego_fut_ts, 2]
- Final diffused plan_cls: the ego planning classification score, shape → [B, 1, ?]

F. Modulation Layer (modulation_layer → V1ModulationLayer)

Role:
- Combines the time embedding (time_embed) and, optionally, a global condition, computes scale and shift values from them, and applies them to traj_feature by multiplication and addition.
- This is the feature modulation of the diffusion process: it lets the features be adjusted according to the timestep or a global condition.
Input:
- traj_feature: [B, ..., embed_dims] (e.g. [B, ego_fut_mode, 256])
- time_embed: [B, embed_dims] or [B, 1, embed_dims]
- global_cond: (optional) [B, embed_dims] or [B, 1, embed_dims]
Output:
- modulated traj_feature: same shape as the input (e.g. [B, ego_fut_mode, embed_dims])
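
A scale-and-shift modulation of this kind can be sketched as below. This is not V1ModulationLayer itself: the class name, the SiLU activation, the single Linear layer, and the (1 + scale) form are all assumptions chosen to illustrate the idea, and global_cond is omitted.

```python
import torch
from torch import nn

class ScaleShiftModulation(nn.Module):
    """Hypothetical sketch: predict scale and shift from the time embedding."""

    def __init__(self, embed_dims=256):
        super().__init__()
        self.to_scale_shift = nn.Sequential(
            nn.SiLU(), nn.Linear(embed_dims, embed_dims * 2)
        )

    def forward(self, traj_feature, time_embed):
        # time_embed [B, 1, C] broadcasts over the mode axis of traj_feature [B, M, C].
        scale, shift = self.to_scale_shift(time_embed).chunk(2, dim=-1)
        return traj_feature * (1 + scale) + shift

layer = ScaleShiftModulation(256)
traj_feature = torch.randn(1, 6, 256)
time_embed = torch.randn(1, 1, 256)
modulated = layer(traj_feature, time_embed)
print(modulated.shape)
```

With the (1 + scale) form, a layer whose weights are zero leaves the features unchanged, which is a common reason for choosing it.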

G. Trajectory Pooler (traj_pooler_layer → V1TrajPooler)

Role:
- During ego planning, extracts keypoints from the predicted trajectory (e.g. the noisy trajectory), pools the features at those locations from the feature maps, and produces the final trajectory feature.
- Internally it uses TrajSparsePoint3DKeyPointsGenerator to generate several keypoints per trajectory and applies the deformable aggregation (DAF) operation to aggregate the features.
Input:
- usually instance_feature and the noisy trajectory points (e.g. [B, ego_fut_ts, 2])
- meta information (projection matrix, image size, etc.)
- feature_maps: multi-scale feature maps, each of shape [B, N, C, H, W]
Output:
- pooled trajectory feature: [B, 1, embed_dims] (e.g. [B, 1, 256])

H. Losses, Samplers, and Decoders

Motion Sampler (MotionTarget):
- Input:
  - motion_reg prediction: [B, num_anchor, fut_mode, fut_ts, 2]
  - GT agent future trajectories, masks
- Output:
  - cls_target: [B, num_anchor]
  - reg_target: [B, num_anchor, fut_ts×?]
  - reg_weight, num_pos (scalar or per-instance weights)
Planning Sampler (V1PlanningTarget):
- Input:
  - planning cls_pred: [B, 1, ego_fut_mode] (after reshape)
  - planning reg_pred: [B, 1, ego_fut_mode, ego_fut_ts, 2]
  - GT ego future trajectories, masks
- Output:
  - cls_target, reg_target, reg_weight
Motion Losses:
- motion_loss_cls (FocalLoss) and motion_loss_reg (L1Loss), computed on the motion predictions
Plan Losses:
- plan_loss_cls, plan_loss_reg, plan_loss_status, computed on the planning predictions
Motion Decoder (SparseBox3DMotionDecoder):
- Role:
  - post-processes the classification and regression results in motion_output to recover the final 3D trajectory (or box) information
- Input:
  - cls_scores: [B, num_anchor, fut_mode]
  - box_preds: [B, num_anchor, D_motion] (raw regression values)
  - additional information from motion_output (anchor_queue, period)
- Output:
  - List[dict] with keys "trajs_3d", "trajs_score", "anchor_queue", "period"
Planning Decoder (HierarchicalPlanningDecoder):
- Role:
  - refines the predictions in planning_output hierarchically to produce the final ego future path
- Input:
  - planning_output["classification"]: [B, 1, cmd_mode×modal_mode]
  - planning_output["prediction"]: [B, 1, cmd_mode×modal_mode, ego_fut_ts, 2]
- Output:
  - final planning result: List[dict] with keys "planning_score", "planning", "final_planning", "ego_period", "ego_anchor_queue"
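
One simple reading of the hierarchical selection is: first pick the block of modes that belongs to the driving command, then pick the best-scoring mode inside it. The sketch below only illustrates that idea with the shapes from the decode output; the real HierarchicalPlanningDecoder may rescore or filter candidates differently, and the command index is hypothetical.

```python
import torch

cmd_mode, modal_mode, ego_fut_ts = 3, 6, 6
planning_score = torch.rand(cmd_mode, modal_mode)
planning = torch.randn(cmd_mode, modal_mode, ego_fut_ts, 2)

cmd = 2                                          # hypothetical command index
best_mode = planning_score[cmd].argmax()         # best mode within that command
final_planning = planning[cmd, best_mode]        # (ego_fut_ts, 2)

print(final_planning.shape)
```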
