Add Video Question Answering model JSFusion #61

Open · yunseong wants to merge 3 commits into master
Conversation

@yunseong (Contributor) commented Jul 18, 2019

This PR adds a video QA model named JSFusion (Yu et al. ECCV 2018). Most of the code has been taken from @jsjason's implementation (https://github.com/jsjason/jsfusion-pytorch).

Changes specifically made for our system:

  • All components implement the RnB interface (e.g., VideoPathIterator, Runner)
  • Move input creation (masks, sentences, sentence_masks) from the loader to the steps where the inputs are actually used
  • Place all tensors (model parameters) on their corresponding GPUs; most parameters are defined as class variables (not sure that's the correct term), and torch.nn.Module.register_buffer is used where that is not possible (see the sketch below)
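
For reference, a minimal sketch of the register_buffer pattern mentioned in the last item (the module and tensor names here are hypothetical, not the actual RnB classes):

import torch
import torch.nn as nn

class StepModule(nn.Module):  # hypothetical module, for illustration only
    def __init__(self):
        super().__init__()
        # A buffer is part of the module's state: it moves with the module on
        # .to(device)/.cuda() and is saved in state_dict, but is not trainable.
        self.register_buffer('frame_indices', torch.arange(40))

module = StepModule().to('cuda:0')  # frame_indices now lives on cuda:0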

How to run

$ pip install hickle    # Q: how do I add this entry to spec-file.txt?
$ export LSMDC_PATH=/cmsdata/hdd0/cmslab/LSMDC # in elsa-12
$ CUDA_VISIBLE_DEVICES=0 python benchmark.py -c config/jsfusion-whole.json -v 10

Although there is still a lot of room to optimize the code, I would like to parallelize the review process and take this chance to ask some questions of my own.

runner.py (outdated; resolved)
(frames,), filename = tensors  # tensors is a ((frames,), filename) pair from the previous step
resnet_output = self.pool(frames)
resnet_output = resnet_output.view(resnet_output.shape[0], resnet_output.shape[1])  # flatten to (num_frames, channels)
# TODO handle the case when resnet_output.shape[0] < num_frames (fill zeros)
@yunseong (Contributor, Author) commented:
This comment refers to the branch for computing the mask in https://github.com/jsjason/jsfusion-pytorch/blob/master/infer.py#L177, which I think can be simplified.
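
As an aside, one possible way to handle the TODO above is to zero-pad along the frame dimension (a sketch; self.num_frames is assumed to hold the expected frame count):

import torch.nn.functional as F

if resnet_output.shape[0] < self.num_frames:
    # Fill the missing frame rows with zeros: pad dim 0 at the bottom, leave dim 1 alone.
    missing = self.num_frames - resnet_output.shape[0]
    resnet_output = F.pad(resnet_output, (0, 0, 0, missing))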

@jsjason self-requested a review on Jul 18, 2019 at 09:35
# frames: (40, 3, 224, 224)

filename = os.path.basename(file_path)
out = (frames, filename)
Contributor commented:

I think this should be ((frames,), filename). The first item needs to be a tuple of tensors, not a single tensor.
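
A sketch of the suggested change, mirroring the snippet above:

filename = os.path.basename(file_path)
out = ((frames,), filename)  # wrap frames in a one-element tuple so the runner's
                             # unpacking `(frames,), filename = tensors` works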


conv3 = self.conv3(bn2)
relu3 = self.relu3(conv3)
bn3 = self.bn3(relu3)
@jsjason (Contributor) commented Jul 18, 2019:

I've seen a lot of open-source PyTorch code that does something like the following.

x = self.conv3(x)
x = self.relu3(x)
x = self.bn3(x)

I'm not sure what the real intention of this is, but I'm assuming it's for reducing the memory footprint? If our implementation takes up a lot of memory, then I guess we could change our code to use this pattern, too. I'm not sure how this affects inference time.
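
For reference, a self-contained sketch of the rebinding pattern (hypothetical block; the memory effect only applies when intermediates aren't kept for backward, e.g., under torch.no_grad()):

import torch
import torch.nn as nn

class Block(nn.Module):  # hypothetical block, for illustration only
    def __init__(self, channels=64):
        super().__init__()
        self.conv3 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu3 = nn.ReLU()
        self.bn3 = nn.BatchNorm2d(channels)

    def forward(self, x):
        # Rebinding x drops the only Python reference to each intermediate
        # tensor, so during inference it can be freed right away instead of
        # all intermediates staying alive until the end of forward().
        x = self.conv3(x)
        x = self.relu3(x)
        x = self.bn3(x)
        return x

with torch.no_grad():
    out = Block().eval()(torch.randn(1, 64, 56, 56))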


# odd column indices, excluding the last column
cut_mask_indices = [i for i in range(cut_mask.shape[1]) if i % 2 == 1 and i < cut_mask.shape[1] - 1]
cut_mask_indices = torch.tensor(cut_mask_indices)
cut_mask_indices = cut_mask_indices.to(device=self.device, non_blocking=True)
Contributor commented:

If this part takes long, then perhaps we could pre-allocate a tensor for every possible mask length and simply fetch the correct cut_mask_indices tensor at runtime. The default max length of 40 isn't THAT long, so this shouldn't take up a lot of memory.
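
A sketch of that pre-allocation idea (assuming this runs in the step's __init__, that self.device is already set, and that MAX_LEN = 40 is the maximum mask length):

# Hypothetical pre-computation, e.g. in __init__:
MAX_LEN = 40
self.cut_mask_indices_cache = {
    length: torch.tensor(
        [i for i in range(length) if i % 2 == 1 and i < length - 1],
        dtype=torch.long,
        device=self.device,
    )
    for length in range(1, MAX_LEN + 1)
}

# At runtime, just look up the pre-built tensor:
cut_mask_indices = self.cut_mask_indices_cache[cut_mask.shape[1]]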


# sentences: np.ndarray shape=(Bx5xL) dtype=int32
# sentence_masks: np.ndarray shape=(Bx5xL) dtype=float32
sentences, sentence_masks = self.parse_sentences(self.word2idx, mc_path, self.num_frames)
Contributor commented:

If this segment turns out to be a bottleneck, I guess we could construct a very complex pipeline in which two steps (e.g., ResNetFeatureExtractor and SentenceParser) can be run concurrently and their results are aggregated in the final step MCModel... maybe next year?
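
A rough sketch of that concurrency idea, using concurrent.futures with hypothetical step callables (run_resnet_feature_extractor, run_sentence_parser, run_mc_model are placeholders, not actual RnB APIs):

from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor(max_workers=2) as pool:
    # The two independent steps run concurrently...
    feature_future = pool.submit(run_resnet_feature_extractor, frames)
    sentence_future = pool.submit(run_sentence_parser, word2idx, mc_path, num_frames)
    # ...and the final step aggregates both results.
    sentences, sentence_masks = sentence_future.result()
    output = run_mc_model(feature_future.result(), sentences, sentence_masks)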
